Wednesday, May 28, 2014

Contrivance without confluence

The previous contrivance (without conflation) showed how to count only user-viewed visits to a web page, depending on the user-agent (i.e. browser) to load an image each time the page is viewed. This is good because that contrivance no longer counts requests for the page from robots and web crawler software, as it did in its first incarnation.

However, much information is lost as the count is maintained, because a mere tally is a stark abstraction of the visit information. Left out are the date of the visit, the user's IP address, the kind of browser used, the referring page, and so on.

In this contrivance, we'll show one way of capturing the date information, so that the counts do not all run together into a single number. It is as if each day's tally is a separate river flowing into the month, which in turn flows into the year*.

The idea is to create a folder for each year, month, and day on which a visit occurs, and to keep a separate tally for each date.

Here is a CGI script to accomplish the task.

#!/bin/bash
echo "Content-type: image/gif"
echo
DATE=`date +%Y/%m/%d`
FOLDER_Y=../../counters/`echo $DATE | cut -d / -f 1`
FOLDER_M=../../counters/`echo $DATE | cut -d / -f 1-2`
FOLDER_D=../../counters/`echo $DATE | cut -d / -f 1-3`
if [ ! -d "$FOLDER_Y" ]; then
  mkdir $FOLDER_Y
fi
if [ ! -d "$FOLDER_M" ]; then
  mkdir $FOLDER_M
fi
if [ ! -d "$FOLDER_D" ]; then
  mkdir $FOLDER_D
fi
echo -n 1 >>$FOLDER_D/tallies
cat 1x1.gif

As before, line 1 lets the web server know that a bash shell should interpret this script, line 2 informs the web server (which lets the browser know) that an image will be returned, and line 3 signals the end of HTTP headers.

Line 4 obtains the system date from the machine running the web server, in the format YYYY/MM/DD. This is convenient, as the separating slash also has meaning for the file system, indicating nested folders. Note that this machine might be in a different time zone than the web master, and very often than the visitors. This scheme abstracts out the time of the visit, along with all of the other information available to the script.
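
To see how the slash-separated date maps onto nested folders, here is a minimal sketch that uses a fixed sample date instead of the system clock. The helper name folder_for and the sample date are assumptions made for illustration; they are not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: print the counter folder for the first N fields
# (1 = year, 2 = year/month, 3 = year/month/day) of a YYYY/MM/DD date.
folder_for() {
  local date="$1" depth="$2"
  echo "../../counters/$(echo "$date" | cut -d / -f "1-$depth")"
}

SAMPLE=2014/05/28            # fixed sample date, for illustration only
folder_for "$SAMPLE" 1       # ../../counters/2014
folder_for "$SAMPLE" 2       # ../../counters/2014/05
folder_for "$SAMPLE" 3       # ../../counters/2014/05/28
```

The same cut invocation appears three times in the script, once for each depth.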

Lines 5 and 8-10 create the folder for the year, if it doesn't already exist.

Lines 6 and 11-13 create the folder for the month, if necessary.

Lines 7 and 14-16 create the folder for the day, if necessary.

Line 17 tallies the count.

Line 18 as before provides the actual image data to be returned to the web server, which will pass it along to the user's browser.

To use this script, you would upload it to the cgi-bin folder of the public html folder of your provider's web server machine. Supposing you named it tallywithdate.cgi, any request from an image tag in one of your web pages for
[your domain]/cgi-bin/tallywithdate.cgi
would result in a view of that page receiving one tally or count.

Now, the question you are going to ask is, "Why do it this way, rather than using a relational database?" Very good question. In my case, it was to avoid setting up all of the machinery required. All that I needed was the date information, and the publisher of the website was content to have the date be relative to the U.S. central time zone (where the web server machine is located).

Another reason for using this technique is that, for this web site, I did not need PHP, because all of the web pages are generated off-line, and uploaded periodically, as a collection of hundreds of HTML pages. So, I took it as a challenge to implement the few pieces that required some server-side logic using only the Bash shell.

A word about disk space requirements. On the machine that hosts the web site in question, the file system uses blocks of 4K (4096 bytes). So one of these will be required for each year, month, day, and tallies file. Since the tallies file uses base one, it will require an additional 4K block when the tally exceeds 4096, and so on. I don't know how this would compare to a relational database solution. There, each count event would require at least 10 or 12 bytes for the record to contain the count and the system date. The exact comparison will be left as an exercise for the reader.
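
As a rough starting point for that exercise, the sketch below estimates how many 4K blocks a base-one tallies file would occupy for a given number of views. It assumes one byte per view and whole-block allocation, and it ignores directory and inode overhead; the helper name tally_blocks is made up for this example.

```bash
#!/bin/bash
BLOCK=4096   # block size in bytes, as stated for the host in question

# Hypothetical helper: blocks needed by a base-one tallies file holding
# the given number of views (one byte per view), rounded up.
tally_blocks() {
  local views="$1"
  echo $(( (views + BLOCK - 1) / BLOCK ))
}

tally_blocks 1       # 1
tally_blocks 4096    # 1
tally_blocks 4097    # 2
```

On a real file system the figures may differ, so treat this only as a back-of-the-envelope estimate.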

Another useful piece of information that is available to the script is the referring page. That information consists of the entire URL of the page containing the tally (image) request. That could be quite large. As an alternative, the pages of the web site in question are partitioned into equivalence classes, and each class of page is assigned a simple identifier. Then the following script (replacing only the penultimate line of the first script) records the tallies in a separate file for each class of web page.

#!/bin/bash
echo "Content-type: image/gif"
echo
DATE=`date +%Y/%m/%d`
FOLDER_Y=../../counters/`echo $DATE | cut -d / -f 1`
FOLDER_M=../../counters/`echo $DATE | cut -d / -f 1-2`
FOLDER_D=../../counters/`echo $DATE | cut -d / -f 1-3`
if [ ! -d "$FOLDER_Y" ]; then
  mkdir $FOLDER_Y
fi
if [ ! -d "$FOLDER_M" ]; then
  mkdir $FOLDER_M
fi
if [ ! -d "$FOLDER_D" ]; then
  mkdir $FOLDER_D
fi
TALLIES=`echo "$QUERY_STRING" | grep -o '[a-zA-Z][0-9a-zA-Z]*' | head -1`
if [ -n "$TALLIES" ]; then
  echo -n 1 >>$FOLDER_D/$TALLIES
fi
cat 1x1.gif

There you have it. Instead of a tally going into a file named tallies, it will go into a file whose name is provided as the query string of the image URL. For example, a request from one of your (fabulous) web pages that looked like this
[your domain]/cgi-bin/tallywithdate.cgi?fabulous
would result in a tally to the file named fabulous in the folder for the current (server) date.

The query string (portion of the image URL following the question mark) is sanitized (line 17) by accepting only the first occurrence of an identifier of letters and digits, starting with a letter. If such an identifier is present in the query string, then it is used as a file name (lines 18-20) to collect the tallies.
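
The sanitizing step can be tried on its own, outside the web server. The following is a minimal sketch, assuming a helper named sanitize_class (a name invented here) that uses head -1 to keep only the first match that grep -o reports.

```bash
#!/bin/bash
# Hypothetical helper mirroring the sanitizing step: print the first run of
# letters and digits that starts with a letter, or nothing if there is none.
sanitize_class() {
  echo "$1" | grep -o '[a-zA-Z][0-9a-zA-Z]*' | head -1
}

sanitize_class 'fabulous'           # fabulous
sanitize_class '../../etc/passwd'   # etc
sanitize_class '123abc'             # abc
sanitize_class '42'                 # (prints nothing)
```

Because slashes and dots never survive, the resulting name cannot point outside the folder for the current date.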

Scripts to display the count of page views for any given date, or date range, and for any class of web page are all left as exercises for the reader.
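
As one possible starting point for the first of those exercises, here is a sketch that reports the count for a single date and class. Since each view appends exactly one byte, the count is simply the size of the file. The helper name count_views and its three arguments are assumptions for illustration, and the example runs against a throwaway folder rather than real counters.

```bash
#!/bin/bash
# Hypothetical helper: print the number of views recorded for one class of
# page on one date. Arguments: counters folder, date as YYYY/MM/DD, class name.
count_views() {
  local file="$1/$2/$3"
  if [ -f "$file" ]; then
    # Each view appended one byte, so the size in bytes is the tally.
    wc -c < "$file" | tr -d ' '
  else
    echo 0
  fi
}

# Example against a temporary folder, so no real counters are touched.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/2014/05/28"
printf 1 >> "$DEMO/2014/05/28/fabulous"   # same effect as echo -n 1
printf 1 >> "$DEMO/2014/05/28/fabulous"
printf 1 >> "$DEMO/2014/05/28/fabulous"
count_views "$DEMO" 2014/05/28 fabulous   # 3
count_views "$DEMO" 2014/05/29 fabulous   # 0
rm -r "$DEMO"
```

A date range could be handled by calling such a helper once per day and summing the results.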

* So, with this iteration of the contrivance, we leave behind the confluence of all the counts into a single number. Hence the contrived title of this post: this is a contrivance without confluence.

[Added May 31] If you would like to limit the class names to those in a particular list of identifiers, place those identifiers, one per line, in a file named, say, eclassids, and add this line of code just before the last if of the script.

TALLIES=`grep "^$TALLIES$" ../../counters/eclassids | head -1`
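
To see what that line does, the sketch below wraps the same grep in a helper and runs it against a temporary list. The helper name allowed_class and the sample identifiers are made up for this example.

```bash
#!/bin/bash
# Hypothetical helper: print the class name only if it appears as a whole
# line in the list file; print nothing otherwise.
allowed_class() {
  local name="$1" list="$2"
  grep "^$name$" "$list" | head -1
}

LIST=$(mktemp)
printf '%s\n' fabulous ordinary > "$LIST"
allowed_class fabulous "$LIST"    # fabulous
allowed_class fab "$LIST"         # (prints nothing)
rm "$LIST"
```

An identifier that is not in the list leaves TALLIES empty, so the final if skips the tally.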

[Added Jan 10, 2021] Note that lines 8-16 could be replaced by this single line of code:

mkdir -p $FOLDER_D

The -p flag will cause the mkdir command to create all the directories which don't already exist without giving any error messages. This would include the parent directories for a new month and year.
