So, your site is up and running, and now you're probably wondering if anyone is visiting. There are a few ways to find out: by using a counter, by accessing your virtual host's online statistics system for your site (if you have one), by using a
program on your server for this purpose (again, if you have one), or by accessing your raw access logs.
We warned you earlier in this book about the use of counters. In all but very rare instances, we advise against them. However, we know some of you will either wish to disregard this advice, or will have clients who insist you do so, which is why we
cover them here.
A CGI that creates a simple text counter is provided on your CD-ROM. This counter is very simple to set up. Here are the steps:
Figure 24.1. A simple text counter.
If you are interested in creating a more complex, odometer style graphical counter (see Figure 24.2), you can refer to Matt's graphical counter at
http://www.worldwidemart.com/scripts/counter.shtml.
Figure 24.2. A more complex graphical counter.
There are also many other references online for counter creation. The only times we have used counters are when clients have demanded them, and we're usually successful at talking clients out of this. Again, you wouldn't answer your phones by telling
people that they're the twenty-five thousandth caller.
Using access logs to track your site is by far the best option. Access logs not only tell you how many times your site has been accessed, they contain other valuable information as well.
When a viewer visits your site and requests a document, a record of that request is written into a log by the Web server. Most servers support the CLF (Common Log Format). This format was derived from an old version of the NCSA Web server in order to
keep simple request information; it was not designed for in-depth site analysis. The CLF contains only the viewer's hostname, the HTTP request sent to your server (containing the URL), the time and date of the request, a return code for the request, and the
number of bytes returned.
An access log consists of lines of text that look like this:
lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] "GET /graphics/banner.gif HTTP/1.0" 200 3021
This shows an access from host lucifer.me.wig.org on August 7, 1996 at 3:31 PM Pacific Daylight Time (the -0700 offset) requesting the document /graphics/banner.gif using the usual GET method. The server returned an "OK" status code of 200, then transferred 3021 bytes of data.
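To make the field layout concrete, here is a minimal Python sketch that splits one CLF line into its parts. It is only an illustration: the pattern and the function name parse_clf_line are our own, not part of any server or log tool, and the date parsing assumes English month abbreviations. The sample line is the one from the text, with the request field enclosed in double quotes.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one CLF line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field normally holds "METHOD URL PROTOCOL".
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else ""
    entry["url"] = parts[1] if len(parts) > 1 else ""
    # Assumes English month abbreviations such as "Aug".
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] '
          '"GET /graphics/banner.gif HTTP/1.0" 200 3021')
entry = parse_clf_line(sample)
print(entry["host"], entry["url"], entry["status"], entry["size"])
```

Lines that do not fit the pattern come back as None, so a caller can skip or count them rather than stop.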
For most commercial purposes, you will want some additional information, which is why many servers also support the new Extended Common Log File (also from NCSA), and/or their own additional extended log formats. Extended log formats record additional
useful information. Most often these extended log formats include information like the name of the viewer's browser and a field indicating which Web page referred the viewer to the site (where they came from).
Access logs are very complex; although you can read through them, they might make little sense unless you use a product developed to translate these logs into another format.
Figure 24.3. A "raw" access log as viewed through Notepad.
Products for this purpose can directly extract or interpret your log file data into different categories. Commonly available categories include hit counts for each file, breakdowns of requests based on hostname (or IP address), server performance
statistics, where the viewer came from (the referring URL), and often an analysis of the correlations between these categories.
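The simplest of these categories, hit counts per file and per host, can be sketched in a few lines of Python. This is a rough illustration, not how any particular product works. The log lines are made-up samples, and the names LOG_LINES, LINE_RE, and summarize are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample entries in Common Log Format (not real traffic).
LOG_LINES = [
    'ha.net - - [07/Aug/1996:15:31:30 -0700] "GET /index.html HTTP/1.0" 200 4120',
    'ha.net - - [07/Aug/1996:15:31:32 -0700] "GET /graphics/banner.gif HTTP/1.0" 200 3021',
    '210.86.5.21 - - [07/Aug/1996:15:40:02 -0700] "GET /index.html HTTP/1.0" 200 4120',
    '210.86.5.21 - - [07/Aug/1996:15:40:09 -0700] "GET /missing.html HTTP/1.0" 404 0',
]

# Captures host, requested URL, status code, and byte count.
LINE_RE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) (\d+|-)')

def summarize(lines):
    """Tally hits per URL, hits per host, and hits per status code."""
    by_url, by_host, by_status = Counter(), Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        host, url, status, _size = match.groups()
        by_url[url] += 1
        by_host[host] += 1
        by_status[status] += 1
    return by_url, by_host, by_status

by_url, by_host, by_status = summarize(LOG_LINES)
for url, hits in by_url.most_common():
    print(f"{hits:5d}  {url}")
```

The number of distinct keys in by_host is the count of unique hosts, which is a different figure from the number of people visiting.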
If you are using your own server, there is most likely a special program already residing on your system for viewing your site statistics. Check your server's manual for information on setting this up. If you have no luck there, try your server
manufacturer's Web site.
If your server did not include an application for this purpose, there are add-on software solutions available to you. Commercial software tools like Marketwave Hitlist (see Figure 24.4), net.Analysis, and WebReporter are available on various platforms,
allowing on-demand analysis of your Web site by generating detailed reports. Some of these products even enable you to "zoom in" on information in order to generate even more detailed reports. A listing of these types of applications is
maintained by Yahoo at
http://www.yahoo.com/Business_and_Economy/Companies/Computers/Software/Internet/World_Wide_Web/Log_Analysis_Tools/
Figure 24.4. The Marketwave Hitlist report dialog box.
If you are using a virtual host that offers online site statistics, this will be your best option for seeing who is accessing your pages. Simply e-mail your host, asking where your statistics are located. Most ISPs that offer hosting services will have
anticipated your needs and developed user-friendly access log reports, as shown in Figure 24.5.
Figure 24.5. A sample of an online site statistic report via a virtual host.
If you're not fortunate enough to be using a host that offers online site statistics, it will be necessary to gain access to your "raw" access logs. Your host will know where these are located.
Unless you're excited about dealing with those monstrous raw logs, your next step will be to translate these into a more usable format. There are many public-domain and commercial log-analysis tools that enable you to output these logs into HTML and
various other document formats (even comma-delimited text formats are available, enabling easy importation to a database). A listing of some of these is available through Yahoo at
http://www.yahoo.com/Business_and_Economy/Companies/Computers/Software/Internet/World_Wide_Web/Log_Analysis_Tools/
or at Stroud's shareware archive at
http://cws.wilmington.net/stat.html#access
Figure 24.6. WebTrends, a program that translates "raw" access logs.
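If you only need the comma-delimited output mentioned above, for loading into a database or spreadsheet, a short Python script can do the job. The following is a minimal sketch, assuming your log is in Common Log Format. The function name clf_to_csv and the column names are our own choices, not a standard.

```python
import csv
import io
import re

# Captures host, timestamp, request line, status code, and byte count.
LINE_RE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)'
)

def clf_to_csv(lines, out):
    """Write one CSV row per parsable log line; return the number of rows written."""
    writer = csv.writer(out)
    writer.writerow(["host", "time", "request", "status", "bytes"])
    rows = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            writer.writerow(match.groups())
            rows += 1
    return rows

sample = ['lucifer.me.wig.org - - [07/Aug/1996:15:31:32 -0700] '
          '"GET /graphics/banner.gif HTTP/1.0" 200 3021']
buffer = io.StringIO()
count = clf_to_csv(sample, buffer)
print(buffer.getvalue())
```

To convert a real log, pass an open log file as lines and an open output file as out in place of the in-memory buffer used here.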
A new option is using an online Java-enabled system, like Bazaar Analyzer. These systems can
Although Bazaar Analyzer (shown in Figure 24.7) is currently only available for Solaris systems, you can use it online through their server at http://www.bazaarsuite.com/webdemo/index.htm to check a site on any server. Although we have experienced some
"bugs" when using this system, it seems like a good option if you're really under a time crunch.
Figure 24.7. Part of a report we created online using Bazaar Analyzer.
Well, do they contain information you wish to advertise? In most cases there is no reason to put your site statistics online. An exception to this would be if you have a widely used site and plan to sell advertising on it, or if for some other reason
you wish to publicize the success of your site (if it were a site for an association, for instance, and you wanted the members to know how active their association is).
Using an off-site analysis service provided by companies like NetCount (http://www.netcount.com/, see Figure 24.8) and I/Pro (http://www.ipro.com/prod.html) can save you
time and system resources. Using such a service eliminates the task of having to maintain a database on site with the associated large disk space consumption, but of course there's no such thing as a free lunch. These services generally charge from $95 to
$600 a month. A "free" service such as this is provided by the Internet Audit Bureau (http://www.internet-audit.com), but they require that you include their logo on your Web site.
Figure 24.8. NetCount, a site statistics service.
Now that you have your access logs in a nice format, you would probably like to know what it all means. Well, access logs track hits (in addition to other information), not individuals. What is the difference between a hit and an individual? A
hit is any request made to the server (for HTML files, graphics, sound files, and so on). For example, an HTML page with four graphics counts as five hits, even though only one person viewed that one page. An individual is a single
user, the person actually receiving your communication.
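The five-hits-for-one-page example can be checked with a short Python sketch. It separates page requests from all requests by file extension. Treating only .html and .htm files (and directory URLs) as pages is our assumption, and the URLs are invented for illustration.

```python
# Assumption: only these extensions (and directory URLs) count as "pages".
PAGE_EXTENSIONS = (".html", ".htm")

def count_hits_and_pages(urls):
    """Return (total hits, page requests) for a list of requested URLs."""
    hits = len(urls)
    pages = sum(
        1 for url in urls
        if url.endswith(PAGE_EXTENSIONS) or url.endswith("/")
    )
    return hits, pages

# One HTML page plus its four graphics, as in the example in the text.
requests = [
    "/index.html",
    "/graphics/a.gif",
    "/graphics/b.gif",
    "/graphics/c.gif",
    "/graphics/d.gif",
]
hits, pages = count_hits_and_pages(requests)
print(hits, pages)  # 5 hits, 1 page
```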
So, how can you tell how many actual people are viewing your pages? One way is to count the unique hosts that are accessing your Web documents. This will give you only a rough and often inaccurate picture, though. Server logs use IP addresses
(such as 210.86.5.21) or hostnames (such as ha.net) to describe where viewers are coming from. This refers to the computer or dial-up account the viewer is using. When viewing your logs, looking at the IP addresses and hostnames may seem a good indicator of how many
people are visiting your site, who your viewers are, and where they are located. Unfortunately, there are many distortions that can occur in an analysis of this information:
So, taking all this into account, what is the scientific method you should use to assess your access logs? Educated guessing is really the only option. When analyzing one of our own sites, we guesstimate the number of times our site is being accessed
rather than the number of people accessing it. We call these visitors session users rather than new visitors.
We do this by counting the hits on our home page as new session users, but if we see the same address access the home page in a short amount of time, we figure it is the same viewer returning to the home page. If an IP address or domain name shows up
out of the blue on another page of our site, it is counted as a new session user as well. When the run of requests from a certain address stops for a period of time (say 15 minutes), we assume the viewer has left our site.
If your site has a huge number of hits (lucky you), you will need to analyze more details, such as the browser or referrer information, to get a more accurate picture of when one viewer leaves and another with the same address enters. If you see one
address on your site for a very long time (requesting new pages over an extended period), this may indicate more than one viewer from the same address. All this will give you a very rough picture of how many times your site is being accessed (of course,
this number may be underestimated due to caching).
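The guesswork described above can be roughed out in code. The Python sketch below is a simplification of our method, not a full implementation. It starts a new session whenever an address appears for the first time or returns after more than 15 minutes of silence, and it ignores the home-page, browser, and referrer clues. The name count_sessions and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=15)  # the "say 15 minutes" gap from the text

def count_sessions(requests):
    """Count session users from (address, datetime) pairs given in any order."""
    last_seen = {}  # address -> time of its most recent request
    sessions = 0
    for address, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(address)
        # New address, or the same address returning after a long gap.
        if previous is None or when - previous > SESSION_TIMEOUT:
            sessions += 1
        last_seen[address] = when
    return sessions

t0 = datetime(1996, 8, 7, 15, 0, 0)
requests = [
    ("ha.net", t0),
    ("ha.net", t0 + timedelta(minutes=3)),
    ("210.86.5.21", t0 + timedelta(minutes=5)),
    ("ha.net", t0 + timedelta(minutes=40)),  # same address, 37 minutes later
]
print(count_sessions(requests))  # 3 session users
```

Changing SESSION_TIMEOUT shows how sensitive the total is to the gap you pick, which is one reason to treat the number as an estimate.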
Now that you have a rough picture of how many times your site is being accessed, you can move on to the really important stuff: how your viewers are using your site. You can see that there is really no foolproof way to track individual viewers as they
visit your site, but you can see what individual viewers are doing when they get there.
As a new IP address (or hostname) appears in your log, you count it as a new session user. When the requests appear within a short span of time, this indicates the session user is navigating your site. You can then analyze the choices that user made.
This is called session analysis and is one of the very best ways to improve your site.
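As a sketch of what session analysis might look like in practice, the following Python groups each address's requests into sessions and keeps the pages in the order they were requested. It assumes the same 15-minute gap rule we use for counting session users. The function name session_paths and the sample requests are made up for illustration.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=15)  # assumed gap that ends a session

def session_paths(requests):
    """Group (address, datetime, url) tuples into a list of (address, [urls])."""
    sessions = []
    current = {}  # address -> (time of last request, index into sessions)
    for address, when, url in sorted(requests, key=lambda r: r[1]):
        state = current.get(address)
        if state is None or when - state[0] > SESSION_TIMEOUT:
            sessions.append((address, []))  # start a new session
            index = len(sessions) - 1
        else:
            index = state[1]  # continue the open session
        sessions[index][1].append(url)
        current[address] = (when, index)
    return sessions

t0 = datetime(1996, 8, 7, 15, 0, 0)
requests = [
    ("ha.net", t0, "/index.html"),
    ("ha.net", t0 + timedelta(minutes=1), "/products.html"),
    ("ha.net", t0 + timedelta(minutes=4), "/order.html"),
    ("ha.net", t0 + timedelta(minutes=60), "/index.html"),
]
for address, pages in session_paths(requests):
    print(address, " -> ".join(pages))
```

Reading the paths side by side shows which pages lead viewers deeper into the site and which are the last page before they leave.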
Your logs are full of valuable information. The trick is asking the right questions.
To use access logs to improve your site, ask yourself the following, based on your site statistics and session analysis:
Your site's error logs also contain valuable information. The error log, as its name implies, logs your Web site's problems: requests for documents that don't actually exist, attempts to access protected documents, and internal errors from CGI scripts
and the server itself. There are two types of entries in the error log: the warning and error messages generated by the Web server itself, and the error messages generated by any running CGI scripts. A typical error log consists of text like this (see
Figure 24.9):
[Tue Aug 6 23:22:08 1996] httpd: connection timed out for hr25.snm.org
[Tue Aug 6 22:02:03 1996] httpd: malformed header from script
[Tue Aug 6 22:10:11 1996] killing CGI process 638
Figure 24.9. A "raw" error log viewed through WordPad.
The messages that begin with the label httpd are generated by the HTTP server itself and follow a standard format. The remaining message (killing CGI process 638) is a warning related to a CGI script.
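The split between httpd messages and everything else is easy to automate. Here is a minimal Python sketch that separates the timestamp, the optional httpd label, and the message text, using the three sample entries from the text. The pattern and the name parse_error_line are our own, and real error logs may vary from server to server.

```python
import re

# [timestamp] optional "httpd: " label, then the message text
ERROR_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] (?:(?P<source>httpd): )?(?P<message>.*)'
)

def parse_error_line(line):
    """Return a dict with time, source, and message, or None if no match."""
    match = ERROR_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Lines without the httpd label are grouped as "other" (CGI output, for example).
    entry["source"] = entry["source"] or "other"
    return entry

lines = [
    "[Tue Aug 6 23:22:08 1996] httpd: connection timed out for hr25.snm.org",
    "[Tue Aug 6 22:02:03 1996] httpd: malformed header from script",
    "[Tue Aug 6 22:10:11 1996] killing CGI process 638",
]
parsed = [parse_error_line(line) for line in lines]
for entry in parsed:
    print(entry["source"], "|", entry["message"])
```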
Make it a point to check out your raw error logs from time to time (especially if you or your viewers are experiencing problems). Most software programs that translate these for you do not give as much detail as the raw logs themselves, so
familiarizing yourself with them is a very good idea. Things to watch for include the following:
Connection timed out
This occurs when the browser breaks the connection before receiving the entire requested document. This usually means your viewers are leaving (pressing the Stop button) before your page is loaded. If so, maybe the file takes too long to download, and
you should cut down its size. This message could also indicate that your server is prematurely timing out connections to users on slow machines. If so, and if you are running your own server, consider increasing the time-out values in the server's
configuration files.
Client denied by server configuration
This usually means one of two things:
File does not exist, or no multi in this directory
This indicates that someone attempted to access a nonexistent URL. Tip: If this warning occurs frequently, you probably have a broken link.
Malformed header from script
This is warning you that a CGI script is producing poor output and the server can't interpret it (usually caused by a bad script). Normally, this is followed or preceded by a CGI error message from the script itself.
Password mismatch
This is telling you that a user typed an incorrect password when attempting to access a protected document. Tip: A long series of these may indicate an attempt to gain unauthorized access to your protected documents; it might be a good idea to see who
was on your system at that time if this is a concern.
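If your error log is long, a quick tally of these messages can show you where to look first. The Python sketch below counts how often each watched phrase appears. The phrase list comes from the messages discussed above, but the exact wording in your server's log may differ. The last three sample messages are hypothetical, and the name tally_errors is our own.

```python
from collections import Counter

# Phrases to watch for, taken from the messages discussed in the text.
WATCHED = [
    "connection timed out",
    "client denied by server configuration",
    "file does not exist",
    "malformed header from script",
    "password mismatch",
]

def tally_errors(messages):
    """Count messages by the first watched phrase they contain."""
    counts = Counter()
    for message in messages:
        lowered = message.lower()
        for phrase in WATCHED:
            if phrase in lowered:
                counts[phrase] += 1
                break
        else:
            counts["other"] += 1  # no watched phrase matched
    return counts

messages = [
    "httpd: connection timed out for hr25.snm.org",
    "httpd: malformed header from script",
    "killing CGI process 638",
    "httpd: password mismatch",    # hypothetical wording
    "httpd: password mismatch",    # hypothetical wording
    "httpd: File does not exist",  # hypothetical wording
]
counts = tally_errors(messages)
for phrase, total in counts.most_common():
    print(f"{total:3d}  {phrase}")
```

A high count for "file does not exist" points to a broken link, and a run of "password mismatch" entries is worth a closer look at the raw log.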
Quick and Dirty Guide: Get a Counter up in 20 Minutes
If you're interested in setting up a counter, but don't have time to deal with a CGI script, there is a very simple solution: Use a Web-counter service. One that we have used from time to time is Web-Counter. The process is really simple. Here
goes.
Figure 24.11. Copying the suggested code for use in your HTML document.
Figure 24.12. Testing the counter.
In this chapter, we've addressed the inexact science of studying access logs. We've also gone against our own opinions and have offered two different ways to include counters on your pages, in case you find it absolutely necessary.
As you should now understand, there is no way to know exactly who is accessing your pages (unless you require everyone to log in). You can, however, gain some valuable insight into the usage of your site by reviewing and compiling server statistics.
In the next chapter, we move on to ways of maintaining your system and working with multiple authors.