


– 24 –


Tracking Page Success


So, your site is up and running—now you're probably wondering if anyone is visiting. There are a few ways to find out: by using a counter, by accessing your virtual host's online statistics system for your site (if you have one), by using a program on your server for this purpose (again, if you have one), or by accessing your raw access logs.

Counters


We warned you earlier in this book about the use of counters. In all but very rare instances, we advise against them. However, we know some of you will either wish to disregard this advice, or will have clients who insist you do so, which is why we cover them here.

A Simple Counter


A CGI that creates a simple text counter is provided on your CD-ROM. This counter is very simple to set up. Here are the steps:

  1. Copy the CGI script (counter.pl) into a working directory.
  2. Follow the instructions in the readme file contained in the same directory.
  3. Add the code for the counter into your HTML file. It should look something like this (with your own URL of course):

     <!--#exec cgi="http://www.hampton.org/scripts/textcounter/counter.pl"-->

  4. Upload your CGI script and your HTML file (see Figure 24.1) to the server.
  5. Test, test, test. If you have any trouble, refer to Matt's Web site at http://www.worldwidemart.com/scripts/textcounter.shtml

Figure 24.1. A simple text counter.
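
For readers curious about what a text counter does behind the scenes, here is a minimal sketch in Python. It is not the counter.pl script from the CD-ROM; the function name increment_counter and the count.txt file are our own assumptions, made up for illustration.

```python
import os
import tempfile


def increment_counter(path):
    """Read the stored count, add one, write it back, and return the new value."""
    try:
        with open(path) as f:
            count = int(f.read().strip() or 0)
    except FileNotFoundError:
        count = 0  # first visit: no count file exists yet
    count += 1
    with open(path, "w") as f:
        f.write(str(count))
    return count


# Hypothetical count file, kept in a temporary directory for this demo.
count_file = os.path.join(tempfile.mkdtemp(), "count.txt")
for _ in range(3):
    visits = increment_counter(count_file)

# A CGI script prints a header, a blank line, and then the text to embed.
print("Content-type: text/html\n")
print(f"You are visitor number {visits}.")
```

A script used on a busy site would also need to lock the count file so that two simultaneous requests do not overwrite each other; the sketch leaves that out to stay short.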

If you are interested in creating a more complex, odometer-style graphical counter (see Figure 24.2), you can refer to Matt's graphical counter at

http://www.worldwidemart.com/scripts/counter.shtml

Figure 24.2. A more complex graphical counter.

There are also many other references online for counter creation. The only times we have used counters are when clients have demanded them, and we're usually successful at talking clients out of this. Again, you wouldn't answer your phones by telling people that they're the twenty-five thousandth caller.


Note

A counter might be a good alternative if you have a client who calls you daily to check on their hit count. You may even wish to have one page being counted, and another containing the display. This way you can have an easy reference for your client without having a counter display on the home page.


Using Access Logs for Site Analysis


Using access logs to track your site is by far the best option. Access logs not only tell you how many times your site has been accessed, they contain other valuable information as well.

What Are Access Logs?


When a viewer visits your site and requests a document, the Web server writes a record of that request into a log. Most servers support the CLF (Common Log Format). This format was derived from an old version of the NCSA Web server to keep simple request information; it was not designed for in-depth site analysis. Each CLF entry contains only the viewer's hostname, two identity fields that are usually empty (shown as hyphens), the time and date of the request, the HTTP request line sent to your server (containing the URL), a return code for the request, and the number of bytes returned.

An access log consists of lines of text that look like this:

lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] "GET /graphics/banner.gif HTTP/1.0" 200 3021

This shows an access from host lucifer.me.wig.org on August 7, 1996 at 3:31 PM Pacific Daylight Time, requesting the document /graphics/banner.gif using the usual GET method. The server returned an "OK" status code of 200 and transferred 3021 bytes of data.
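
To make the field layout concrete, here is a small Python sketch that splits one Common Log Format line into named parts. The pattern and the function name parse_clf are our own assumptions, not part of any server or log tool, and real logs may contain lines that this simple pattern does not handle.

```python
import re

# One Common Log Format line: host, ident, user, [time], "request", status, bytes.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request line holds the method, the path, and the protocol version.
    method, _, rest = entry["request"].partition(" ")
    path, _, protocol = rest.partition(" ")
    entry.update(method=method, path=path, protocol=protocol)
    entry["status"] = int(entry["status"])
    # A hyphen in the size field means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] '
          '"GET /graphics/banner.gif HTTP/1.0" 200 3021')
entry = parse_clf(sample)
print(entry["host"], entry["method"], entry["path"], entry["status"], entry["size"])
```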


Note

The -0700 following the date tells you that the local time zone is seven hours behind Greenwich Mean Time, in this case Pacific Daylight Time (Pacific Standard Time would appear as -0800).
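
If you need to compare entries from servers in different time zones, the offset lets you convert each timestamp to Greenwich time. The following is a brief sketch using Python's standard datetime module; it assumes the default English month abbreviations.

```python
from datetime import datetime, timezone

stamp = "07/Aug/1996:15:31:32 -0700"

# %z reads the -0700 offset, so the result knows its own time zone.
local = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
utc = local.astimezone(timezone.utc)

print(utc.strftime("%Y-%m-%d %H:%M:%S"))  # 1996-08-07 22:31:32
```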

For most commercial purposes, you will want some additional information, which is why many servers also support the new Extended Common Log File (also from NCSA), and/or their own additional extended log formats. Extended log formats record additional useful information. Most often these extended log formats include information like the name of the viewer's browser and a field indicating which Web page referred the viewer to the site (where they came from).

Interpreting Log Files


Access logs are very complex; although you can read through them, they might make little sense unless you use a product developed to translate these logs into another format.

Figure 24.3. A "raw" access log as viewed through Notepad.

Products for this purpose can directly extract or interpret your log file data into different categories. Commonly available categories include hit counts for each file, breakdowns of requests based on hostname (or IP address), server performance statistics, where the viewer came from (the referring URL), and often an analysis of the correlations between these categories.
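
The simplest of these categories, hit counts per file and per host, can be sketched in a few lines of Python. Apart from the first entry, the log lines below are invented for illustration, and the pattern is a simplified assumption rather than the method any particular product uses.

```python
import re
from collections import Counter

# Captures host, requested path, status code, and byte count.
LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) (\d+|-)')

# Sample entries; all but the first are invented.
LOG = '''\
lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] "GET /graphics/banner.gif HTTP/1.0" 200 3021
lucifer.me.wig.org - - [07/Aug/1996:15:31:30 -0700] "GET /index.html HTTP/1.0" 200 4110
hr25.snm.org - - [07/Aug/1996:15:40:02 -0700] "GET /index.html HTTP/1.0" 200 4110
hr25.snm.org - - [07/Aug/1996:15:40:05 -0700] "GET /missing.html HTTP/1.0" 404 -
'''

hits_per_file = Counter()
hits_per_host = Counter()
for line in LOG.splitlines():
    m = LINE.match(line)
    if m:
        host, path, status, size = m.groups()
        hits_per_file[path] += 1
        hits_per_host[host] += 1

print("Most requested file:", hits_per_file.most_common(1))
print("Distinct hosts:", len(hits_per_host))
```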

On Your Own Server

If you are using your own server, there is most likely a special program already residing on your system for viewing your site statistics. Check your server's manual for information on setting this up. If you have no luck there, try your server manufacturer's Web site.

If your server did not include an application for this purpose, there are add-on software solutions available to you. Commercial software tools like Marketwave Hitlist (see Figure 24.4), net.Analysis, and WebReporter are available on various platforms, allowing on-demand analysis of your Web site by generating detailed reports. Some of these products even enable you to "zoom in" on information in order to generate even more detailed reports. A listing of these types of applications is maintained by Yahoo at

http://www.yahoo.com/Business_and_Economy/Companies/Computers/Software/Internet/World_Wide_Web/Log_Analysis_Tools/

Figure 24.4. The Marketwave Hitlist report dialog box.

Through Your Virtual Host

If you are using a virtual host that offers online site statistics, this will be your best option for seeing who is accessing your pages. Simply e-mail your host, asking where your statistics are located. Most ISPs that offer hosting services will have anticipated your needs and developed user-friendly access log reports, as shown in Figure 24.5.

Figure 24.5. A sample of an online site statistic report via a virtual host.

Dealing with Raw Access Logs

If you're not fortunate enough to be using a host that offers online site statistics, it will be necessary to gain access to your "raw" access logs. Your host will know where these are located.

Unless you're excited about dealing with those monstrous raw logs, your next step will be to translate these into a more usable format. There are many public-domain and commercial log-analysis tools that enable you to output these logs into HTML and various other document formats (even comma-delimited text formats are available, enabling easy importation to a database). A listing of some of these is available through Yahoo at




http://www.yahoo.com/Business_and_Economy/Companies/Computers/Software/Internet/World_Wide_Web/Log_Analysis_Tools/

or at Stroud's shareware archive at




http://cws.wilmington.net/stat.html#access

Figure 24.6. WebTrends, a program that translates "raw" access logs.
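
To give a sense of what the comma-delimited output mentioned above looks like, here is a short Python sketch that converts Common Log Format lines into comma-delimited text. The column names and the function name clf_to_csv are our own assumptions; commercial tools will use their own layouts.

```python
import csv
import io
import re

# Captures host, timestamp, request line, status code, and byte count.
LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)')


def clf_to_csv(lines):
    """Convert Common Log Format lines to comma-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["host", "time", "request", "status", "bytes"])
    for line in lines:
        m = LINE.match(line)
        if m:  # lines that do not look like CLF are skipped
            writer.writerow(m.groups())
    return out.getvalue()


sample = ['lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] '
          '"GET /graphics/banner.gif HTTP/1.0" 200 3021']
result = clf_to_csv(sample)
print(result, end="")
```

The resulting text can be saved to a file and imported into most spreadsheet and database programs.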

A new option is using an online Java-enabled system, like Bazaar Analyzer. These systems can

Although Bazaar Analyzer (shown in Figure 24.7) is currently only available for Solaris systems, you can use it online through their server at http://www.bazaarsuite.com/webdemo/index.htm to check a site on any server. Although we have experienced some "bugs" when using this system, it seems like a good option if you're really under a time crunch.

Figure 24.7. Part of a report we created online using Bazaar Analyzer.

Should I Put My Site Statistics Online?


Well, do they contain information you wish to advertise? In most cases there is no reason to put your site statistics online. An exception to this would be if you have a widely used site and plan to sell advertising on it, or if for some other reason you wish to publicize the success of your site (if it were a site for an association, for instance, and you wanted the members to know how active their association is).

Site Statistics Services


Using an off-site analysis service provided by companies like NetCount (http://www.netcount.com/, see Figure 24.8) and I/Pro (http://www.ipro.com/prod.html) can save you time and system resources. Using such a service eliminates the task of having to maintain a database on site with the associated large disk space consumption, but of course there's no such thing as a free lunch. These services generally charge from $95 to $600 a month. A "free" service such as this is provided by the Internet Audit Bureau (http://www.internet-audit.com), but they require that you include their logo on your Web site.

Figure 24.8. NetCount, a site statistics service.

What Does This Really Mean?


Now that you have your access logs in a nice format, you would probably like to know what it all means. Well, access logs track hits (in addition to other information), not individuals. What is the difference between a hit and an individual? A hit is any request to the server for a file (including HTML files, graphics, sound files, and so on). For example, an HTML page with four graphics counts as five hits, even though only one person viewed that one page. An individual is the actual person viewing your pages, the person receiving your communication.
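
The arithmetic is easy to check with a few lines of Python. The file names below are invented, and the rule for what counts as a "page" (a path ending in .html, .htm, or a slash) is our own simplifying assumption.

```python
# Requests made when one viewer loads a page with four graphics (invented names).
requests = [
    "/index.html",
    "/graphics/banner.gif",
    "/graphics/logo.gif",
    "/graphics/photo1.jpg",
    "/graphics/photo2.jpg",
]

PAGE_SUFFIXES = (".html", ".htm", "/")  # assumption: what we treat as a page

hits = len(requests)
page_views = sum(1 for path in requests if path.endswith(PAGE_SUFFIXES))

print(hits, "hits,", page_views, "page view")  # 5 hits, 1 page view
```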

So, how can you tell how many actual people are viewing your pages? One way is to approximate the number of unique hosts that are accessing your Web documents. This will give you a pretty inaccurate picture though. You see, server logs use IP addresses (210.86.5.21) or hostnames (ha.net) to describe where viewers are coming from. This refers to the computer or dial-up account the viewer is using. When viewing your logs, looking at the IP addresses and hostnames may seem a good indicator of how many people are visiting your site, who your viewers are, and where they are located. Unfortunately, there are many distortions that can occur in an analysis of this information:

  1. It assumes that each IP address or hostname is unique to one person. We're not sure what goes on with you and your computer, but we know of many computer labs, coffee shops, and offices that have many more than one person using a single computer.
  2. If you see a hostname of ucsc.edu (University of California at Santa Cruz), you can probably assume that viewer is from Santa Cruz, California. But what if they come from intel.com? Intel has offices all over the globe, many of which use the intel.com domain. Also consider the millions of viewers who use national service providers like America Online. They can be located all over the United States, but your log states them all as coming from aol.com.
  3. Proxy servers pose yet another dilemma. A primary function of the proxy server is to act as a liaison from within the security firewall to the Internet. A request from within the firewall goes first through the proxy server, which in turn makes the request to the Internet server. Server logs will normally list all of these requests under the proxy server's domain name (or IP address), and not that of the original host.
  4. Dynamic IP addressing (a means of sharing a limited pool of IP addresses among a much larger number of users) is yet another stumbling block. This usually affects users of dial-up accounts. It means that one day a viewer came from 210.86.5.21 and the next day from 210.96.5.56. You could easily make the mistake of thinking this is a new visitor to your site, when in actuality the viewer was visiting just yesterday.

So, taking all this into account, what is the scientific method you should use to assess your access logs? Educated guessing is really the only option. When analyzing one of our own sites, we guesstimate the number of times our site is being accessed rather than the number of people accessing it. We call these visitors session users rather than new visitors.

We do this by counting the hits on our home page as new session users, but if we see the same address access the home page in a short amount of time, we figure it is the same viewer returning to the home page. If an IP address or domain name shows up out of the blue on another page of our site, it is counted as a new session user as well. When a line of request from a certain address stops for a period of time (say 15 minutes), we assume the viewer has left our site.
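
The counting rule just described can be sketched in Python as follows. This is only one rough reading of our guesswork, not a precise method: the function name count_sessions and the sample requests are invented, and the 15-minute gap is simply the figure suggested above.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=15)  # the idle period suggested in the text


def count_sessions(requests, gap=SESSION_GAP):
    """Count session users from (host, timestamp) pairs sorted by time.

    A host starts a new session the first time it appears, or when more
    than `gap` has passed since its previous request.
    """
    last_seen = {}
    sessions = 0
    for host, when in requests:
        previous = last_seen.get(host)
        if previous is None or when - previous > gap:
            sessions += 1
        last_seen[host] = when
    return sessions


def t(clock):
    """Demo helper: build a datetime on 7 Aug 1996 from an 'HH:MM' string."""
    return datetime.strptime("1996-08-07 " + clock, "%Y-%m-%d %H:%M")


# Invented requests, already in time order.
demo = [
    ("lucifer.me.wig.org", t("15:31")),
    ("lucifer.me.wig.org", t("15:33")),  # same session
    ("hr25.snm.org", t("15:40")),        # new host, new session
    ("lucifer.me.wig.org", t("16:10")),  # back after 37 minutes, new session
]
print(count_sessions(demo))  # 3
```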

If your site has a huge number of hits (lucky you), you will need to analyze more details, such as the browser or referrer information, to get a more accurate picture of when one viewer leaves and another with the same address enters. If you see one address on your site for a very long time (requesting new pages over an extended period), this may indicate more than one viewer from the same address. All this will give you a very rough picture of how many times your site is being accessed (of course, this number may be underestimated due to caching).

Now that you have a rough picture of how many times your site is being accessed, you can move on to the really important stuff: how your viewers are using your site. You can see that there is really no foolproof way to track individual viewers as they visit your site, but you can see what individual viewers are doing when they get there.

As a new IP address (or hostname) appears in your log, you count it as a new session user. When the requests appear within a short span of time, this indicates the session user is navigating your site. You can then analyze the choices that user made. This is called session analysis and is one of the very best ways to improve your site.
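
One simple form of session analysis is to count which page viewers move to from each page. The Python sketch below assumes that the requests have already been filtered down to pages only and that each host represents a single session; the hosts' page requests are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented page requests as (host, path), in time order.
page_requests = [
    ("lucifer.me.wig.org", "/index.html"),
    ("lucifer.me.wig.org", "/products.html"),
    ("hr25.snm.org", "/index.html"),
    ("lucifer.me.wig.org", "/order.html"),
    ("hr25.snm.org", "/products.html"),
]

# Collect the pages each session user visited, in order.
paths = defaultdict(list)
for host, path in page_requests:
    paths[host].append(path)

# Count each step from one page to the next.
transitions = Counter()
for visited in paths.values():
    for here, there in zip(visited, visited[1:]):
        transitions[(here, there)] += 1

for (here, there), count in transitions.most_common():
    print(f"{here} -> {there}: {count}")
```

A step that many viewers take suggests a link that is working well; a page that viewers rarely move on from may need a clearer path forward.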

Using Logs to Improve Your Site


Your logs are full of valuable information. The trick is asking the right questions.

Access Logs


To use access logs to improve your site, ask yourself the following, based on your site statistics and session analysis:


Error Logs


Your site's error logs also contain valuable information. The error log, as its name implies, logs your Web site's problems: requests for documents that don't actually exist, attempts to access protected documents, and internal errors from CGI scripts and the server itself. There are two types of entries in the error log: the warning and error messages generated by the Web server itself, and the error messages generated by any running CGI scripts. A typical error log consists of text like this (see Figure 24.9):




[Tue Aug 6 23:22:08 1996] httpd: connection timed out for hr25.snm.org
[Tue Aug 6 22:02:03 1996] httpd: malformed header from script
[Tue Aug 6 22:10:11 1996] killing CGI process 638

Figure 24.9. A "raw" error log viewed through WordPad.

The messages that begin with the label httpd are generated by the HTTP server itself, and follow a standard format. The other is a warning message generated by a CGI script (such as killing CGI process 638).

Make it a point to check out your raw error logs from time to time (especially if you or your viewers are experiencing problems). Most software programs that translate these for you do not give as much detail as the raw logs themselves, so familiarizing yourself with them is a very good idea. Things to watch for include the following:

Connection timed out

This occurs when the browser breaks the connection before receiving the entire requested document. This usually means your viewers are leaving (pressing the Stop button) before your page is loaded. If so, maybe the file takes too long to download, and you should cut down its size. This message could also indicate that your server is prematurely timing out connections to users on slow machines. If so, and if you are running your own server, consider increasing the time-out values in the server's configuration files.

Client denied by server configuration

This usually means one of two things:

  1. When access to a directory is restricted to only certain IP addresses, a user other than those allowed tried to gain access.
  2. Someone attempted to gain unauthorized access to the system.

File does not exist, or no multi in this directory

This indicates that someone attempted to access a nonexistent URL. Tip: If this warning occurs frequently, you probably have a broken link.

Malformed header from script

This is warning you that a CGI script is producing poor output and the server can't interpret it (usually caused by a bad script). Normally, this is followed or preceded by a CGI error message from the script itself.

Password mismatch

This is telling you that a user typed an incorrect password when attempting to access a protected document. Tip: A long series of these may indicate an attempt to gain unauthorized access to your protected documents; it might be a good idea to see who was on your system at that time if this is a concern.
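
If your error log is long, a short script can tally how often each of these messages appears. The following Python sketch uses the three sample entries shown earlier in this section; the watch list and the matching rule are our own assumptions, and the exact wording of messages varies from server to server.

```python
import re
from collections import Counter

# A bracketed timestamp followed by the message text.
ENTRY = re.compile(r'^\s*\[(?P<when>[^\]]+)\]\s+(?P<message>.*)$')

# The three sample entries shown earlier in this section.
ERROR_LOG = '''\
[Tue Aug 6 23:22:08 1996] httpd: connection timed out for hr25.snm.org
[Tue Aug 6 22:02:03 1996] httpd: malformed header from script
[Tue Aug 6 22:10:11 1996] killing CGI process 638
'''

# Phrases to watch for; matching ignores upper and lower case.
WATCH = ["connection timed out", "client denied", "file does not exist",
         "malformed header", "password mismatch"]

counts = Counter()
for line in ERROR_LOG.splitlines():
    m = ENTRY.match(line)
    if not m:
        continue
    message = m.group("message").lower()
    # Use the first watched phrase found, or "other" if none applies.
    label = next((w for w in WATCH if w in message), "other")
    counts[label] += 1

for label, n in counts.most_common():
    print(f"{n:3d}  {label}")
```

A sudden jump in any one count, such as a long run of password mismatches, is a sign that the raw entries deserve a closer look.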

Quick and Dirty Guide: Get a Counter up in 20 Minutes


If you're interested in setting up a counter, but don't have time to deal with a CGI script, there is a very simple solution: Use a Web-counter service. One that we have used from time to time is Web-Counter. The process is really simple—here goes.

  1. Go to http://www.digits.com and read the policies page.
  2. Decide whether you can go with the free service or should upgrade to the commercial service (if you have over 100 hits per day).
  3. Fill out the online form (http://www.digits.com/create.html for the free service, see Figure 24.10) and click the Create Counter button, which will send you to the response page.

    Figure 24.10. Completing the Web-Counter's online form.

  4. Cut and paste the code they suggest into your HTML document (see Figure 24.11).
  5. Upload your new HTML document.

    Figure 24.11. Copying the suggested code for use in your HTML document.

  6. Test it out, and you're done (see Figure 24.12). We completed this in under 15 minutes.

    Figure 24.12. Testing the counter.

Summary


In this chapter, we've addressed the inexact science of studying access logs. We've also gone against our own opinions and have offered two different ways to include counters on your pages, in case you find it absolutely necessary.

As you should now understand, there is no way to know exactly who is accessing your pages (unless you require everyone to log in). You can, however, gain some valuable insight into the usage of your site by reviewing and compiling server statistics.

In the next chapter, we move on to ways of maintaining your system and working with multiple authors.
