rivero@sol.unizar.es
Abstract. We suggest a server modification to catch and serve
backlink information. The obtained data is included in the HTML sent out,
inserting either LINK or A elements. We place the emphasis on the use of
the former.
We found that the weaknesses of the "citation index" approach, namely its dependence on spider efficiency, on the scale of the web, and on specific citation servers, are reason enough to rule out its massive use. So the referer log appears as the only other possibility. On average, it is faster than spider-based methods, as it records a link the first time any user clicks on it, and it does not depend on external sites.
Current implementations of this approach seem to be based on
analysis of the accumulated log file. This implies an unneeded
delay between link detection and effective inclusion in the
served information. Here, we want to propose an on-the-fly method to
log and serve referer data. This method can be added
to existing servers by redirecting requests
for .html files to a CGI script.
In the future, it could be implemented as a native option of the server.
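With CERN httpd, for instance, such a redirection is declared with Exec rules in the configuration file; a single line of the kind

Exec /*.html /home/http/cgi-bin/PutBacks

suffices (our full setup is given in the appendix).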
To check this approach, we made a toy script running over CERN httpd (see appendix). We have found it useful, and practically transparent. This paper is in some sense an account of our experience when testing the script.
For each qualifying request, the content of the CGI environment variable HTTP_REFERER is stored.
In our toy model, we opt for storing requests of pathname/filename.html
in files pathname/filename.html.backs, so the file owner
can edit them. An independent path for .backs files could
be selected, to minimize intrusion in users' directories.
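In terms of the variables of the script given in the appendix, the storage file is simply ${B_DIR}/${SCRIPT_NAME}.backs; so, for instance, a hit on a hypothetical page /alejo/papers.html records its referer in /alejo/papers.html.backs.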
Of course, before storage some checks must be done, at least to exclude local self-references. The on-the-fly method limits the set of checks we can do, but in practice this limitation has few consequences. Search engines and duplicates are the main points to take into account when checking.
Fortunately, search engines are very localized in the web and they follow a well-defined pattern. Matching URLs against a small "exclusions file" lets us rule them out almost completely.
Detection of new engines is of course a problem. Some automation could be provided: on the internal side, by taking note of hits to the robots.txt file; on the external side, by reading popular lists of search engines. But there are always engines outside the mainstream rules, and human intervention is needed to be aware of them.
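As a minimal sketch, the exclusions file is just a list of substrings to match against the referer URL (these hostnames are plausible examples, not our actual file), and the check reduces to a single grep, as in the appendix script:

altavista.digital.com
lycos
webcrawler

if echo "$HTTP_REFERER" | grep -F -q -i -f ${EXCLUDE_FILE}; then
    : # the referer comes from a known engine: do not record it
fi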
A further refinement is to retrieve the TITLE, and eventually other LINK
data, from the referring pages. The second could be obtained on the fly,
as it can be done with an HTTP HEAD request. The first implies a long
parse of the body part, so it seems impractical.
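As an illustration (only a sketch, with hypothetical URLs, and assuming the referring server emits the Link response header proposed in the HTTP drafts), a single HEAD exchange would suffice:

HEAD /citing-page.html HTTP/1.0

HTTP/1.0 200 OK
Content-Type: text/html
Link: <http://other.site/related.html>; rel="Next"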
Regrettably, current HTML authors do not abstract the A HREFs in the
body by duplicating them as LINK HREFs in the head part.
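The desired practice would look as follows (a sketch, with a hypothetical URL):

<HEAD>
<TITLE>Some page</TITLE>
<LINK HREF="http://other.site/cited.html">
</HEAD>
<BODY>
... as discussed in <A HREF="http://other.site/cited.html">another page</A> ...
</BODY>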
Other applications, such as map makers or spiders, could benefit from
locating all this information with only a HEAD request.
It would be nice if HTML editors were built to encourage this practice,
but perhaps it would be better to give the server the possibility of
making this abstract automatically.
Regrettably, the HTML specification forgets to say which is the default
RELationship value when the REL/REV attribute is not present. And
the information we want to provide is just the REVerse of the default one.
As the obvious markup LINK REV="" is not
easy to understand, we are experimentally serving the documents with the
value "X-Default" for that attribute. So each reference is included as:
<LINK REV="X-Default" HREF=referer.url>
If a title for the link is available (stored by hand, or obtained with an HTTP request) it can be included using the NAME attribute. We could give NAME "by default" the same value as the HREF attribute --in fact, this could be the obvious default, better than an empty string-- but we do not know if it is really needed.
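For instance, keeping the notation of the example above, a backlink with a stored title would be served as:

<LINK REV="X-Default" HREF=referer.url NAME="Title of the referring page">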
We expect LINK to be implemented by browsers at least in a minimal form, by extending the usual Go menu. As we see it, similar RELationships would appear grouped as a step of a hierarchical menu. Other presentations could happen, of course.
The information has the same format as in the previous case; the only difference comes because now we must make a visible menu, so we put the backlink URL, or the title if available:
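Following the output of our toy script (see appendix), each backlink becomes an item of a MENU list:

<menu>
<li><A HREF=referer.url REV="X-Default">referer.url</A>
</menu>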
As said above, our test implementation simply redirects requests for .html files to a small CGI script.
We thought of inserting data only if documents are perceived as good HTML. In practice, we check for isolated </BODY> and </HEAD> tags in a line. This buggy implementation has a nice feature: users who do not want backlinks in their pages can simply add a white space somewhere in the line, and our server then skips the insertion. (Note that, insertion being a server feature, no reference to it would appear in the .html files. Robust servers would check some "options file" for each user.)
In a very short period of time, the main entry pages automatically got every link we were aware of, and some others unexpected to us. Thus the method seems to work well.
We found hits from some search engines we were not aware of. By monitoring this kind of hit during the first two or three days, we built a short exclusion file that was very successful at rejecting them. It seems that almost everyone uses a very common set of engines, so fewer than fifteen lines were needed in the file.
Actually, no parsing is done. The main source of duplicates is the
%7E escape code, due to pointers to personal home pages: for example,
http://some.site/~user/ and http://some.site/%7Euser/ would be
registered as two different backlinks.
The service is completely transparent; no delay is detected by users, even in local access. It is worth noting that pages with no backlinks are sent unmodified, so owners only see the body alteration when some citation is registered. This is appreciated positively by our page owners, as it implies that their pages are getting some external attention.
We feel that an important obstacle for backlinks to become common is the
lack of clients supporting the LINK element. Support
for the REVerse of the default relation should be implemented, as it is
the straightforward generalization of the Back button. We want to
encourage developers to include some menu bar letting the user choose
among multiple backs.
It is to be noted that some extensions to browsers already offer
multiple "forwards", by collecting all the A elements in a
separate window.
A further advantage of REV appears when working on search robots. The spider can then choose whether or not to go up the link, and the analytical engine can calibrate the character of the link when answering a sophisticated search request.
It would be possible to calibrate the "strength" of each link by
counting the number of hits coming from each referring page. This data
could be useful to sort the menus in the client browser or, if 2-D
or 3-D mapping is implemented in the client, to give a different
strength and color to each link. Regrettably, HTML 3.2 LINK
has no STRENGTH attribute to send out this information.
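As a sketch of how such strengths could be computed, suppose (hypothetically) that every qualifying hit were appended to a .hits file instead of being deduplicated as in our script; the counting is then a one-liner:

# count hits per referring page, most cited first
sort ${B_DIR}/${SCRIPT_NAME}.hits | uniq -c | sort -rn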
Finally, let us remark that though this method is more efficient than citation servers, its locality implies a weakness, as it can be enabled or disabled at the webmaster's will. Thus an average websurfer will find that only some pages are able to backlink directly. We feel that local backlink services must be encouraged, so users can avoid relying on overcrowded search engines.
For our test, we added the following lines to the CERN httpd configuration file:

Exec /*.html /home/http/cgi-bin/PutBacks
Exec /*.htm  /home/http/cgi-bin/PutBacks
Exec /       /home/http/cgi-bin/PutBacks

thus mapping HTML file requests towards the CGI script. The script is the following:
#!/bin/sh
##########################################################################
# Name: PutBacks
# Description: CGI to detect and insert citations to web pages
# Author: Alejandro Rivero (rivero@sol.unizar.es)
# Date: November 1996
##########################################################################
# Usage:
# With CERN httpd (3.0 or near) add to the config file some lines
# indicating the directories to monitor for backlinks
#   Exec /wwwlab/experiences/*.html /home/http/cgi-bin/PutBacks
#   Exec /alejo/*htm /home/http/cgi-bin/PutBacks
# Path comes from start of data tree, as always.
# Warnings:
# - Provided as proof of concept only. A lot of work must be made to make
#   the script fool-proof.
# - Use it at your OWN risk.
##########################################################################
# variables for testing only.
#SCRIPT_NAME="wwwlab/links.html"
#HTTP_REFERER="http://some.site/nose.html"
##########################################################################
#
# We are sending out html files only.
#
echo 'Content-Type: text/html'
echo ''
#
# base directory: root of our data tree
#
B_DIR=/home/http/htdocs
#
# Exclude file: list of nodes to exclude. Must have at least one name
#
EXCLUDE_FILE=/home/http/conf/hosts.noindex
#
# We first test for "default" http path specification, then look
# for Welcome, welcome, or index files, in this order
#
if test -d ${B_DIR}/${SCRIPT_NAME}; then
  if test -f ${B_DIR}/${SCRIPT_NAME}/Welcome.html; then
    SCRIPT_NAME=${SCRIPT_NAME}/Welcome.html
  elif test -f ${B_DIR}/${SCRIPT_NAME}/welcome.html; then
    SCRIPT_NAME=${SCRIPT_NAME}/welcome.html
  elif test -f ${B_DIR}/${SCRIPT_NAME}/index.html; then
    SCRIPT_NAME=${SCRIPT_NAME}/index.html
  else
    echo "<h1>Error</h1> Can not process directory"
    exit 0
  fi
fi
#
# And of course we check if the file exists... if not, we have a special log
# for this kind of thing
if ! (test -f ${B_DIR}/${SCRIPT_NAME}); then
  echo "<h1>Error</h1> File does not exist."
  echo -n `date` >> /home/http/logs/lost_files.log
  echo -n " " ${B_DIR}/${SCRIPT_NAME} >> /home/http/logs/lost_files.log
  echo " " ${HTTP_REFERER} >> /home/http/logs/lost_files.log
  exit 0
fi
#
# Now we really begin the game
#
# First, we test for "filename.backs", the file containing detected backlinks
if ! (test -f ${B_DIR}/${SCRIPT_NAME}.backs); then
  cat ${B_DIR}/$SCRIPT_NAME
else
  # Ok, there are backlinks for this file. We are going to include them
  cat ${B_DIR}/${SCRIPT_NAME} | (
    while read linea; do
      # if the html has a well separated HEAD tag, we add LINK info there
      # if (`echo $linea | grep -i -q "^</HEAD>"`); then
      # This "if" changed for speed...
      if test "$linea" = "</HEAD>" -o "$linea" = "</head>"; then
        for bklink in `cat ${B_DIR}/${SCRIPT_NAME}.backs`
        do
          echo -n "<LINK HREF=\""
          echo -n $bklink
          echo "\" REV=\"X-Default\" > "
          #
          # Note for developers:
          # If your browser can not cope with a REV flag without parameters,
          # please suggest me the adequate one
          #
        done
      fi
      # and if the html has a well separated BODY tag, we put some
      # visible backlinks there, for compatibility with old browsers :-)
      # if `echo $linea | grep -i -q "^</BODY>"`; then
      # again, this "if" changed to optimize output speed
      if test "$linea" = "</BODY>" -o "$linea" = "</body>"; then
        echo "<p><hr><i>"
        echo "Note that we have"
        echo "detected some pages probably pointing to this one:<menu>"
        # would UL COMPACT be better?
        for bklink in `cat ${B_DIR}/${SCRIPT_NAME}.backs`
        do
          echo -n "<li><A HREF=\""
          echo -n $bklink
          echo -n "\" REV=\"X-Default\">"
          echo -n $bklink
          echo "</A>"
        done
        echo "</menu>"
        echo "you could be interested in checking them."
        echo "</i>"
      fi
      echo $linea
    done
  )
fi
# Now we check if the current access can qualify as a backlink, then
# register it in the .backs file
if ! (test -z "${HTTP_REFERER}"); then
  # mini-bug: EXCLUDE_FILE must have at least one line
  if ! (echo "$HTTP_REFERER" | grep -F -q -i -f ${EXCLUDE_FILE}); then
    # Of course, we create the file when it is needed
    if ! (test -f ${B_DIR}/${SCRIPT_NAME}.backs); then
      touch ${B_DIR}/${SCRIPT_NAME}.backs
    fi
    # and we try to avoid reiteration...
    if ! (grep -q -i "$HTTP_REFERER" ${B_DIR}/${SCRIPT_NAME}.backs); then
      echo "$HTTP_REFERER" >> ${B_DIR}/${SCRIPT_NAME}.backs
    fi
  fi
fi
#
exit 0
#