rivero@sol.unizar.es
Abstract. We suggest a server modification to catch and serve
backlink information. The obtained data is included in the HTML sent out,
inserting either LINK or A elements. We place the emphasis on the use of
the former.
We found that the weaknesses of the "citation index" approach, namely its dependence on spider efficiency, on the scale of the web, and on specific citation servers, are reason enough to rule out its massive use. So the referer log appears as the only other possibility. On average, it is faster than spider-based methods, as it records a link the first time any user clicks on it, and it does not depend on external sites.
Current implementations of this approach seem to be based on
analysis of the accumulated log file. This implies an unneeded
delay between link detection and effective inclusion in the
served information. Here, we want to propose an on-the-fly method to
log and serve referer data. This method can be added
to existing servers by redirecting requests
for .html files to a CGI script.
In the future, it could be implemented as a native option of the server.
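With CERN httpd, for instance, such a redirection is declared with Exec rules in the configuration file; a single line of the kind

Exec /*.html /home/http/cgi-bin/PutBacks

suffices (our full setup is given in the appendix).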
To check this approach, we made a toy script running over CERN httpd (see appendix). We have found it useful, and practically transparent. This paper is in some sense an account of our experience when testing the script.
For each qualifying request, the content of the CGI environment variable HTTP_REFERER is stored.
In our toy model, we opt for storing requests of pathname/filename.html
in files pathname/filename.html.backs, so the file owner
can edit them. An independent path for .backs files could
be selected, to minimize intrusion in users' directories.
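In terms of the variables of the script given in the appendix, the storage file is simply ${B_DIR}/${SCRIPT_NAME}.backs; so, for instance, a hit on a hypothetical page /alejo/papers.html records its referer in /alejo/papers.html.backs.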
Of course, before storage some checks must be done, at least to exclude local self-references. The on-the-fly method limits the set of checks we can do, but in practice this limitation has few consequences. Search engines and duplicates are the main points to take into account when checking.
Fortunately, search engines are very localized in the web and they follow a well-defined pattern. Matching URLs against a small "exclusions file" lets us rule them out almost completely.
Detection of new engines is of course a problem. Some automation could be provided: on the internal side, by taking note of hits to the robots.txt file; on the external side, by reading popular lists of search engines. But there are always engines outside the mainstream rules, and human intervention is needed to be aware of them.
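As a minimal sketch, the exclusions file is just a list of substrings to match against the referer URL (these hostnames are plausible examples, not our actual file), and the check reduces to a single grep, as in the appendix script:

altavista.digital.com
lycos
webcrawler

if echo "$HTTP_REFERER" | grep -F -q -i -f ${EXCLUDE_FILE}; then
    : # the referer comes from a known engine: do not record it
fi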
A further refinement is to retrieve the TITLE, and eventually other LINK
data, from the referring pages. The second could be obtained on the fly,
as it can be done with an HTTP HEAD request. The first implies a long
parse of the body part, so it seems impractical.
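As an illustration (only a sketch, with hypothetical URLs, and assuming the referring server emits the Link response header proposed in the HTTP drafts), a single HEAD exchange would suffice:

HEAD /citing-page.html HTTP/1.0

HTTP/1.0 200 OK
Content-Type: text/html
Link: <http://other.site/related.html>; rel="Next"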
Regrettably, current HTML authors do not abstract the A HREFs in the
body by duplicating them as LINK HREFs in the head part.
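The desired practice would look as follows (a sketch, with a hypothetical URL):

<HEAD>
<TITLE>Some page</TITLE>
<LINK HREF="http://other.site/cited.html">
</HEAD>
<BODY>
... as discussed in <A HREF="http://other.site/cited.html">another page</A> ...
</BODY>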
Other applications, such as map makers or spiders, could benefit from
locating all this information with only a HEAD request.
It would be nice if HTML editors were built to encourage this practice,
but perhaps it would be better to give the server the possibility of
making this abstract automatically.
Regrettably, the HTML specification forgets to say which is the default
RELationship value when the REL/REV attribute is not present. And
the information we want to provide is just the REVerse of the default one.
As the obvious markup LINK REV="" is not
easy to understand, we are experimentally serving the documents with the
value "X-Default" for that attribute. So each reference is included as:
<LINK REV="X-Default" HREF=referer.url>
If a title for the link is available (stored by hand, or obtained with an HTTP request) it can be included using the NAME attribute. We could give NAME "by default" the same value as the HREF attribute --in fact, this could be the obvious default, better than an empty string-- but we do not know if it is really needed.
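For instance, keeping the notation of the example above, a backlink with a stored title would be served as:

<LINK REV="X-Default" HREF=referer.url NAME="Title of the referring page">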
We expect LINK to be implemented by browsers at least in a minimal form, by extending the usual Go menu. As we see it, similar RELationships would appear grouped as a step of a hierarchical menu. Other presentations could happen, of course.
The information has the same format as in the previous case; the only difference comes because now we must make a visible menu, so we put the backlink URL, or the title if available:
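Following the output of our toy script (see appendix), each backlink becomes an item of a MENU list:

<menu>
<li><A HREF=referer.url REV="X-Default">referer.url</A>
</menu>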
As said above, our test implementation simply redirects requests for .html files to a small CGI script.
We thought of inserting data only if documents are perceived as good HTML. In practice, we check for isolated </BODY> and </HEAD> tags in a line. This buggy implementation has a nice feature: users who do not want backlinks in their pages can simply add a white space somewhere in the line, and our server then skips the insertion. (Note that, insertion being a server feature, no reference to it would appear in the .html files. Robust servers would check some "options file" for each user.)
In a very short period of time, the main entry pages automatically got every link we were aware of, and some others unexpected to us. Thus the method seems to work well.
We found hits from some search engines we were not aware of. By monitoring this kind of hit during the first two or three days, we built a short exclusion file that was very successful at rejecting them. It seems that almost everyone uses a very common set of engines, so fewer than fifteen lines were needed in the file.
Actually, no parsing is done. The main source of duplicates is the
%7E escape code, due to pointers to personal home pages: for example,
http://some.site/~user/ and http://some.site/%7Euser/ would be
registered as two different backlinks.
The service is completely transparent; no delay is detected by users, even in local access. It is worth noting that pages with no backlinks are sent unmodified, so owners only see the body alteration when some citation is registered. This is appreciated positively by our page owners, as it implies that their pages are getting some external attention.
We feel that an important obstacle for backlinks to become common is the
lack of clients supporting the LINK element. Support
for the REVerse of the default relation should be implemented, as it is
the straightforward generalization of the Back button. We want to
encourage developers to include some menu bar letting the user choose
among multiple backs.
It is to be noted that some extensions to browsers already offer
multiple "forwards", by collecting all the A elements in a
separate window.
A further advantage of REV appears when working on search robots. The spider can then choose whether or not to go up the link, and the analytical engine can calibrate the character of the link when answering a sophisticated search request.
It would be possible to calibrate the "strength" of each link by
counting the number of hits coming from each referring page. This data
could be useful to sort the menus in the client browser or, if 2-D
or 3-D mapping is implemented in the client, to give a different
strength and color to each link. Regrettably, HTML 3.2 LINK
has no STRENGTH attribute to send out this information.
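As a sketch of how such strengths could be computed, suppose (hypothetically) that every qualifying hit were appended to a .hits file instead of being deduplicated as in our script; the counting is then a one-liner:

# count hits per referring page, most cited first
sort ${B_DIR}/${SCRIPT_NAME}.hits | uniq -c | sort -rn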
Finally, let us remark that though this method is more efficient than citation servers, its locality implies a weakness, as it can be enabled or disabled at the webmaster's will. Thus an average websurfer will find that only some pages are able to backlink directly. We feel that local backlink services must be encouraged, so users can avoid relying on overcrowded search engines.
For our test, we added the following lines to the CERN httpd configuration file:

Exec /*.html /home/http/cgi-bin/PutBacks
Exec /*.htm  /home/http/cgi-bin/PutBacks
Exec /       /home/http/cgi-bin/PutBacks

thus mapping HTML file requests towards the CGI script. The script is the following:
#!/bin/sh
##########################################################################
# Name: PutBacks
# Description: CGI to detect and insert citations to web pages
# Author: Alejandro Rivero (rivero@sol.unizar.es)
# Date: November 1996
##########################################################################
# Usage:
# With CERN httpd (3.0 or near) add to the config file some lines
# indicating the directories to monitor for backlinks
#   Exec /wwwlab/experiences/*.html /home/http/cgi-bin/PutBacks
#   Exec /alejo/*htm /home/http/cgi-bin/PutBacks
# Path comes from start of data tree, as always.
# Warnings:
# - Provided as proof of concept only. A lot of work must be made to make
#   the script fool-proof.
# - Use it at your OWN risk.
##########################################################################
# variables for testing only.
#SCRIPT_NAME="wwwlab/links.html"
#HTTP_REFERER="http://some.site/nose.html"
##########################################################################
#
# We are sending out html files only.
#
echo 'Content-Type: text/html'
echo ''
#
# base directory: root of our data tree
#
B_DIR=/home/http/htdocs
#
# Exclude file: list of nodes to exclude. Must have at least one name
#
EXCLUDE_FILE=/home/http/conf/hosts.noindex
#
# We first test for "default" http path specification, then look
# for Welcome, welcome, or index files, in this order
#
if test -d ${B_DIR}/${SCRIPT_NAME}; then
  if test -f ${B_DIR}/${SCRIPT_NAME}/Welcome.html; then
    SCRIPT_NAME=${SCRIPT_NAME}/Welcome.html
  elif test -f ${B_DIR}/${SCRIPT_NAME}/welcome.html; then
    SCRIPT_NAME=${SCRIPT_NAME}/welcome.html
  elif test -f ${B_DIR}/${SCRIPT_NAME}/index.html; then
    SCRIPT_NAME=${SCRIPT_NAME}/index.html
  else
    echo "<h1>Error</h1> Can not process directory"
    exit 0
  fi
fi
#
# And of course we check if the file exists... if not, we have a special log
# for this kind of thing
if ! (test -f ${B_DIR}/${SCRIPT_NAME}); then
  echo "<h1>Error</h1> File does not exist."
  echo -n `date` >> /home/http/logs/lost_files.log
  echo -n " " ${B_DIR}/${SCRIPT_NAME} >> /home/http/logs/lost_files.log
  echo " " ${HTTP_REFERER} >> /home/http/logs/lost_files.log
  exit 0
fi
#
# Now we really begin the game
#
# First, we test for "filename.backs", the file containing detected backlinks
if ! (test -f ${B_DIR}/${SCRIPT_NAME}.backs); then
  cat ${B_DIR}/$SCRIPT_NAME
else
  # Ok, there are backlinks for this file. We are going to include them
  cat ${B_DIR}/${SCRIPT_NAME} | (
    while read linea; do
      # if the html has a well separated HEAD tag, we add LINK info there
      # if (`echo $linea | grep -i -q "^</HEAD>"`); then
      # This "if" changed for speed...
      if test "$linea" = "</HEAD>" -o "$linea" = "</head>"; then
        for bklink in `cat ${B_DIR}/${SCRIPT_NAME}.backs`
        do
          echo -n "<LINK HREF=\""
          echo -n $bklink
          echo "\" REV=\"X-Default\" > "
          #
          # Note for developers:
          # If your browser can not cope with a REV flag without parameters,
          # please suggest me the adequate one
          #
        done
      fi
      # and if the html has a well separated BODY tag, we put some
      # visible backlinks there, for compatibility with old browsers :-)
      # if `echo $linea | grep -i -q "^</BODY>"`; then
      # again, this "if" changed to optimize output speed
      if test "$linea" = "</BODY>" -o "$linea" = "</body>"; then
        echo "<p><hr><i>"
        echo "Note that we have"
        echo "detected some pages probably pointing to this one:<menu>"
        # would UL COMPACT be better?
        for bklink in `cat ${B_DIR}/${SCRIPT_NAME}.backs`
        do
          echo -n "<li><A HREF=\""
          echo -n $bklink
          echo -n "\" REV=\"X-Default\">"
          echo -n $bklink
          echo "</A>"
        done
        echo "</menu>"
        echo "you could be interested in checking them."
        echo "</i>"
      fi
      echo $linea
    done
  )
fi
# Now we check if the current access can qualify as a backlink, then
# register it in the .backs file
if ! (test -z "${HTTP_REFERER}"); then
  # mini-bug: EXCLUDE_FILE must have at least one line
  if ! (echo "$HTTP_REFERER" | grep -F -q -i -f ${EXCLUDE_FILE}); then
    # Of course, we create the file when it is needed
    if ! (test -f ${B_DIR}/${SCRIPT_NAME}.backs); then
      touch ${B_DIR}/${SCRIPT_NAME}.backs
    fi
    # and we try to avoid reiteration...
    if ! (grep -q -i "$HTTP_REFERER" ${B_DIR}/${SCRIPT_NAME}.backs); then
      echo "$HTTP_REFERER" >> ${B_DIR}/${SCRIPT_NAME}.backs
    fi
  fi
fi
#
exit 0
#