What the results mean

To generate reports for web sites, Orion Group uses a log analysis tool called Analog. The reports that Analog generates, and how best to interpret them, are discussed below.

The discussion is divided into three sections.
  • How the web works. This section discusses what happens when somebody connects to your web site, and what you can and can't find out about them. If you think that you can get statistics on how many people have visited your web site (or want to know why you can't), then this section is for you.
  • Analog's reports. This section gives a summary of analog's reports, what they contain, and which commands influence each one.
  • Analog's definitions. This section gives precise details on all of analog's terminology, exactly what is counted in each report, and so on.

How the web works

This section is about what happens when somebody connects to your web site, and what statistics you can and can't calculate. There is a lot of confusion about this. It's not helped by statistics programs which claim to calculate things which cannot really be calculated, only estimated. The simple fact is that certain data which we would like to know, and which we expect to know, are simply not available. And the estimates used by other programs are not just a bit off, but can be very, very wrong. For example (you'll see why below), if your home page has 10 graphics on it, and an AOL user visits it, most programs will count that as 11 different visitors!

This section is fairly long, but it's worth reading carefully. If you understand the basics of how the web works, you will understand what your web statistics are really telling you.


1. The basic model. Let's suppose I visit your web site. I follow a link from somewhere else to your front page, read some pages, and then follow one of your links out of your site.

So, what do you know about it? First, I make one request for your front page. You know the date and time of the request and which page I asked for (of course), and the internet address of my computer (my host). I also usually tell you which page referred me to your site, and the make and model of my browser. I do not tell you my username or my email address.

Next, I look at the page (or rather my browser does) to see if it's got any graphics on it. If so, and if I've got image loading turned on in my browser, I make a separate connection to retrieve each of these graphics. I never log into your site: I just make a sequence of requests, one for each new file I want to download. The referring page for each of these graphics is your front page. Maybe there are 10 graphics on your front page. Then so far I've made 11 requests to your server.

After that, I go and visit some of your other pages, making a new request for each page and graphic that I want. Finally, I follow a link out of your site. You never know about that at all. I just connect to the next site without telling you.
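To make the list of known facts concrete, here is a minimal Python sketch that pulls them out of a single logfile line. It assumes the server writes the widely used Combined Log Format; the sample line, the host name and the `parse_line` helper are invented for illustration and are not taken from analog or from any real logfile.

```python
import re
from datetime import datetime

# Combined Log Format (assumed here):
#   host ident user [time] "request" status bytes "referrer" "browser"
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one request as a dict, or None if the line is corrupt."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields

# A made-up request for a front page, referred from another site.
sample = ('host1.example.com - - [16/Sep/1998:09:52:00 +0000] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.org/links.html" "Mozilla/4.0"')
record = parse_line(sample)
```

Note what is present (host, time, file, referrer, browser) and what is not: the user field is just "-" because nobody logged in, and nothing in the line says who the reader is or where they went afterwards.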


2. Caches. It's not always quite as simple as that. One major problem is caching. There are two major types of caching. First, my browser automatically caches files when I download them. This means that if I visit them again, the next day say, I don't need to download the whole page again. Depending on the settings on my browser, I might check with you that the page hasn't changed: in that case, you do know about it, and analog will count it as a new request for the page. But I might set my browser not to check with you: then I will read the page again without you ever knowing about it.

The other sort of cache is on a larger scale. Almost all ISPs now have their own cache. This means that if I try to look at one of your pages and anyone else from the same ISP has looked at that page recently, the cache will have saved it, and will give it out to me without ever telling you about it. (This applies whatever my browser settings.) So hundreds of people could read your pages, even though you'd only sent each one out once.


3. What you can know. The only things you can know for certain are the number of requests made to your server, when they were made, which files were asked for, and which host asked you for them.

You can also know what people told you their browsers were, and what the referring pages were. You should be aware, though, that many browsers lie deliberately about what sort of browser they are, or even let users configure the browser name. Also, a few browsers send incorrect referrers, telling you the last page that the user was on even if they weren't referred by that page. And some people use "anonymizers" which deliberately send false browsers and referrers.


4. What you can't know.
  1. You can't tell the identity of your readers. Unless you explicitly require users to provide a password, you don't know who connected or what their email addresses are.
  2. You can't tell how many visitors you've had. You can guess by looking at the number of distinct hosts that have requested things from you. Indeed this is what many programs mean when they report "visitors". But this is not always a good estimate for three reasons. First, if users get your pages from a local cache server, you will never know about it. Secondly, sometimes many users appear to connect from the same host: either users from the same company or ISP, or users using the same cache server. Finally, sometimes one user appears to connect from many different hosts. AOL now allocates users a different hostname for every request. So if your home page has 10 graphics on, and an AOL user visits it, most programs will count that as 11 different visitors!
  3. You can't tell how many visits you've had. Many programs, under pressure from advertisers' organisations, define a "visit" (or "session") as a sequence of requests from the same host until there is a half-hour gap. This is an unsound method for several reasons. First, it assumes that each host corresponds to a separate person and vice versa. This is simply not true in the real world, as discussed in the last paragraph. Secondly, it assumes that there is never a half-hour gap in a genuine visit. This is also untrue. I quite often follow a link out of a site, then step back in my browser and continue with the first site from where I left off. Should it really matter whether I do this 29 or 31 minutes later? Finally, to make the computation tractable, such programs also need to assume that your logfile is in chronological order: it isn't always, and analog will produce the same results however you jumble the lines up.
  4. Cookies don't solve these problems. Some sites try to count their visitors by using cookies. This reduces the errors. But it can't solve the problem unless you refuse to let people read your pages who can't or won't take a cookie. And you still have to assume that your visitors will use the same cookie for their next request.
  5. You can't follow a person's path through your site. Even if you assume that each person corresponds one-to-one to a host, you don't know their path through your site. It's very common for people to go back to pages they've downloaded before. You never know about these subsequent visits to that page, because their browser has cached them. So you can't track their path through your site accurately.
  6. You often can't tell where they entered your site, or where they found out about you from. If they are using a cache server, they will often be able to retrieve your home page from their cache, but not all of the subsequent pages they want to read. Then the first page you know about them requesting will be one in the middle of their true visit.
  7. You can't tell how they left your site, or where they went next. They never tell you about their connection to another site, so there's no way for you to know about it.
  8. You can't tell how long people spent reading each page. Once again, you can't tell which pages they are reading between successive requests for pages. They might be reading some pages they downloaded earlier. They might have followed a link out of your site, and then come back later. They might have interrupted their reading for a quick game of Minesweeper. You just don't know.
  9. You can't tell how long people spent on your site. Apart from the problems in the previous point, there is one other complete show-stopper. Programs which report the time on the site count the time between the first and the last request. But they don't count the time spent on the final page, and this is often the majority of the whole visit.
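The weakness of the host-based estimates in points 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the counting rules described above, not code from any particular statistics program; the host names, timestamps and function names are invented.

```python
from datetime import datetime, timedelta

def count_distinct_hosts(requests):
    """The usual 'visitors' estimate: the number of different hosts seen."""
    return len({host for host, _ in requests})

def count_visits(requests, gap=timedelta(minutes=30)):
    """The usual 'visits' estimate: a new visit starts whenever a host has
    been silent for at least `gap` (or has not been seen before)."""
    last_seen = {}
    visits = 0
    # Sorting is needed because this method depends on chronological order.
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous >= gap:
            visits += 1
        last_seen[host] = when
    return visits

start = datetime(1998, 9, 16, 9, 52)

# One person loads a page and its 10 graphics, but every request
# arrives from a different (hypothetical) proxy host.
one_proxied_user = [(f"cache-{i}.proxy.example.com", start + timedelta(seconds=i))
                    for i in range(11)]

# One person on one host who comes back after 29 or after 31 minutes.
back_after_29 = [("host.example.com", start),
                 ("host.example.com", start + timedelta(minutes=29))]
back_after_31 = [("host.example.com", start),
                 ("host.example.com", start + timedelta(minutes=31))]

visitors = count_distinct_hosts(one_proxied_user)   # one person, counted as 11
visits_29 = count_visits(back_after_29)             # counted as 1 visit
visits_31 = count_visits(back_after_31)             # counted as 2 visits
```

The same reader produces different answers depending on details that have nothing to do with what they actually did.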

5. Real data. Of course, the important question is how much difference these theoretical difficulties make. In a recent paper (World Wide Web, 2, 29-45 (1999)), Peter Pirolli and James Pitkow of Xerox Palo Alto Research Center examined this question using a ten-day-long logfile from the xerox.com web site. One of their most striking conclusions is that different commonly-used methods can give very different results. For example, when trying to measure the median length of a visit, they got results from 137 seconds to 629 seconds, depending on exactly what you count as a new visitor or a new visit. As they were looking at a fixed logfile, they didn't consider the effect of server configuration changes such as refusing caching, which would change the results still more.


6. Conclusion. The bottom line is that HTTP is a stateless protocol. That means that people don't log in and retrieve several documents: they make a separate connection for each file they want. And a lot of the time they don't even behave as if they were logged into one site. The world is a lot messier than this naïve view implies. That's why analog reports requests, i.e. what is going on at your server, which you know, rather than guessing what the users are doing.

Defenders of counting visits etc. claim that these are just small approximations. I disagree. For example, almost everyone is now accessing the web through a cache. If the proportion of requests retrieved from the cache is 50% (a not unrealistic figure) then half of the users' requests aren't being seen by the servers.

Other defenders of these methods claim that they're still useful because they measure something which you can use to compare sites. But this assumes that the approximations involved are comparable for different sites, and there's no reason to suppose that this is true. Pirolli & Pitkow's results show that the figures you get depend very much on how you count them, as well as on your server configuration. And even once you've agreed on methodology, different users on different sites have different patterns of behaviour, which affect the approximations in different ways: for example, Pirolli & Pitkow found different characteristics of weekday and weekend users at their site.

I've presented a somewhat negative view here, emphasising what you can't find out. Web statistics are still informative: it's just important not to slip from "this page has received 30,000 requests" to "30,000 people have read this page." In some sense these problems are not really new to the web -- they are present just as much in print media too. For example, you only know how many magazines you've sold, not how many people have read them. In print media we have learnt to live with these issues, using the data which are available, and it would be better if we did on the web too, rather than making up spurious numbers.


7. Acknowledgements and further reading. Many other people have made these points too. While originally writing this section, I benefited from three earlier expositions: Interpreting WWW Statistics by Doug Linder; Getting Real about Usage Statistics by Tim Stehle; and Making Sense of Web Usage Statistics by Dana Noonan. (The last two don't seem to be available on the web any more.)

Another, extremely well-written document on these ideas is Measuring Web Site Usage: Log File Analysis by Susan Haigh and Janette Megarity. Being on a Canadian government site, it's available in both English and French. Or for an even more negative point of view, you could read Why Web Usage Statistics are (Worse Than) Meaningless by Jeff Goldberg.


Analog's reports

This section summarises all of analog's reports, and the main commands which control them. For exact details on what is counted in each report, see the section on Analog's definitions.

Top lines


Program started at Thu-24-Sep-1998 13:48.
Analysed requests from Wed-16-Sep-1998 09:52 to Mon-21-Sep-1998 02:04 (4.7 days).
The top two lines of the output tell you when the program was run, and which dates it includes data from. (The second line includes all requests, even failures, whereas most reports only include successful requests.)

General Summary


(Figures in parentheses refer to the 7 days to 24-Sep-1998 13:48).
Successful requests: 79,646 (48,947)
Average successful requests per day: 17,036 (6,992)
Successful requests for pages: 31,138 (18,689)
Average successful requests for pages per day: 6,660 (2,669)
Failed requests: 9,008 (6,378)
Redirected requests: 344 (235)
Distinct files requested: 8,180 (2,884)
Distinct hosts served: 6,640 (4,991)
Corrupt logfile lines: 2
Data transferred: 976.92 Mbytes (627.06 Mbytes)
Average data transferred per day: 208.96 Mbytes (89.58 Mbytes)
The General Summary contains some overall statistics about the data being analysed: the most important being the number of requests (the total number of files downloaded, including graphics); the number of requests for pages (just counting the various pages on your site); the number of distinct hosts (the number of different computers requests have come from); and the amount of data transferred in bytes. For exactly what the various lines mean, see the section on Analog's definitions. Bear in mind that one user can generate many requests by viewing lots of different pages or images, or by viewing the same page many times.

The figures in parentheses represent the seven days given at the top of this report: the seven days before the program was run.
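The per-day averages in the sample above can be checked by hand. The sketch below divides each total by the length of the period it covers. Two things are assumptions: the period is taken from the minute-resolution dates in the top lines (the real logfile has more precise times), and the result is truncated to a whole number, which is inferred from the sample figures rather than from analog's source.

```python
from datetime import datetime

# Dates from the top lines of the sample report.
first_request = datetime(1998, 9, 16, 9, 52)
last_request = datetime(1998, 9, 21, 2, 4)
days_covered = (last_request - first_request).total_seconds() / 86400  # about 4.7

def per_day(total, days):
    # Truncation (rather than rounding) is an assumption based on the sample.
    return int(total / days)

requests_per_day = per_day(79646, days_covered)   # whole period
pages_per_day = per_day(31138, days_covered)
recent_requests_per_day = per_day(48947, 7)       # figures in parentheses
recent_pages_per_day = per_day(18689, 7)
```

Under these assumptions the four results agree with the averages printed in the sample General Summary.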

You can't find out the number of visitors or visits you've had, and don't believe any program which tells you that you can. See the section on How the web works for a discussion of this.

Time reports


Each unit (+) represents 800 requests for pages, or part thereof.
week beg.: #reqs: pages: 
---------: -----: -----:
13/Sep/98: 69614: 25277: ++++++++++++++++++++++++++++++++
20/Sep/98: 10032: 5861: ++++++++
Busiest week: week beginning 13/Sep/98 (25,277 requests for pages).
These reports tell you how many requests there were in each time period. They also tell you which was the busiest time period.

The timezone is whatever your server records time in -- usually your server's local time, or sometimes GMT.
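The bar lengths follow directly from the "or part thereof" wording: each bar is the number of requests for pages divided by the unit size, rounded up. A minimal Python sketch of that arithmetic (the `bar` function is an illustrative name, not part of analog):

```python
import math

def bar(pages, unit):
    """One '+' per `unit` requests for pages, or part thereof."""
    return "+" * math.ceil(pages / unit)

week_13_sep = bar(25277, 800)   # 31.6 units, so 32 plus signs
week_20_sep = bar(5861, 800)    # 7.3 units, so 8 plus signs
```

This means a very short bar can still stand for anything from one request up to a full unit, and an empty bar means no requests for pages at all.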

Time summaries


Each unit (+) represents 150 requests for pages, or part thereof.
day: #reqs: pages: 
---: -----: -----:
Sun: 2031: 1193: ++++++++
Mon: 8001: 4668: ++++++++++++++++++++++++++++++++
Tue: 0: 0:
Wed: 13934: 5915: ++++++++++++++++++++++++++++++++++++++++
[etc.]

These reports tell you the total number of requests in each day or hour of the week, or in each period of the day, summed over all the weeks or days in the report. (These are totals: they are not averages, nor are they the figures for just the last week or last day.)

Other reports


Listing the first 5 files by the number of requests, sorted by the number of requests.
#reqs: %bytes:       last date: file
-----: ------: ---------------: ----
4123: 2.29%: 21/Sep/98 01:57: /~sret1/analog/
3064: 0.15%: 21/Sep/98 01:54: /~sret1/analog/analogo.gif
1737: 0.01%: 21/Sep/98 01:53: /~sret1/images/bar1.gif
1692: 0.01%: 21/Sep/98 01:53: /~sret1/images/bar16.gif
1685: 0.01%: 21/Sep/98 01:53: /~sret1/images/bar8.gif
67345: 97.54%: 21/Sep/98 02:04: [not listed: 8,175 files]

The rest of the reports are all quite similar. Here is a list of them. If you're unfamiliar with some of the terms, see the section on Analog's definitions.
  • The Host Report lists all computers which downloaded files from you.
  • The Domain Report lists which countries those computers came from. (If you only get "unresolved numerical addresses", see the FAQ.)
  • The Organisation Report attempts to list the organisations (companies, institutions, ISPs etc.) which the computer was registered under.
  • The Host Redirection Report and Host Failure Report list all computers which encountered redirections or errors.
  • The Request Report (the example above) lists which files were downloaded.
  • The Directory Report lists which directories those files came from.
  • The File Type Report lists the file types (actually, extensions) of those files.
  • The File Size Report breaks them down by size.
  • The Processing Time Report shows the time taken to serve each file.
  • The Redirection Report lists the filenames which resulted in redirections: mainly directories without the final slash, and "click-thru" links.
  • The Failure Report lists the filenames which caused errors.
  • The Referrer Report lists which pages linked to your files (and also pages which included your images).
  • The Referring Site Report lists the servers those referrers were on.
  • The Search Query Report and the Search Word Report list which search terms people used to find your site.
  • The Internal Search Query Report and Internal Search Word Report list the search terms people used on scripts within your site.
  • The Redirected Referrer Report lists the referrers which led to redirections.
  • The Failed Referrer Report is essentially a broken link report.
  • The Browser Report lists the detailed versions of browsers used, and the Browser Summary collects them by vendor. You should be aware that browsers can lie about what sort of browser they are.
  • The Operating System Report lists the operating systems of the visitors whose browser types you know (as far as possible: it's not always possible to distinguish accurately between different Windows versions, for example, because the same browser can run on more than one Windows version).
  • The Virtual Host Report lists the activity of your various virtual domains.
  • The Virtual Host Redirection Report and Virtual Host Failure Report give the number of redirections and errors on each of those domains.
  • The User Report lists your visitors if your server requires authentication, or perhaps the visitors' cookies.
  • The User Redirection Report and User Failure Report list the users who encountered redirections or errors.
  • The Status Code Report lists the number of each HTTP status code that you had.
Usually you can only get some of these reports, depending on what information is recorded in your logfile.

Most of these reports have a hierarchical structure, like this example for the Domain Report:


Listing the first 5 domains by the number of requests, sorted by the number of requests.
no.: #reqs: %bytes: domain
---: -----: ------: ------
1: 13243: 16.23%: .com (Commercial)
: 1262: 1.26%: aol.com
2: 11783: 25.64%: .jp (Japan)
: 9592: 22.19%: ad.jp
: 1043: 1.97%: co.jp
3: 10073: 11.62%: .net (Network)
: 1926: 1.71%: uu.net
4: 9657: 13.31%: [unresolved numerical addresses]
5: 7388: 8.04%: .uk (United Kingdom)
: 5792: 5.74%: ac.uk
: 1510: 1.99%: co.uk
: 18502: 25.16%: [not listed: 82 domains]

Notice that the lower levels are always listed with their parents, so they break up the sort order. Also, they don't count towards the total number of items listed, so there are only 5 domains listed in the example above, as you can see in the first column.

Bottom lines


This analysis was produced by analog 5.24.
Running time: 8 seconds.

At the end of the output you can see which version of analog produced the report, and how long it took.

Analog's definitions

This section describes how analog defines its terms, and exactly what is counted in each category. It gets a bit technical at times -- if you're just trying to understand the output, I recommend you read the section on Analog's reports first.

We start with some basic definitions. The host is the computer which has asked you for a file (often called the "client"). The file might be a page (i.e., an HTML document) or it might be something else, such as an image. By default filenames ending in (case insensitive) .html, .htm, or / count as pages.
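The default page rule is simple enough to write down directly. This Python sketch only mirrors the default described above; `is_page` is an illustrative name, and analog itself can be configured to treat other files as pages.

```python
PAGE_SUFFIXES = (".html", ".htm", "/")

def is_page(filename):
    """Default rule: names ending in .html, .htm or / (any case) are pages."""
    return filename.lower().endswith(PAGE_SUFFIXES)

front_page = is_page("/~sret1/analog/")            # True: ends in /
shouting = is_page("/INDEX.HTML")                  # True: case is ignored
graphic = is_page("/~sret1/images/bar1.gif")       # False: an image
```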

The total requests counts all the files which have been requested, including pages, graphics, etc. (Some people call this the number of hits, but that word is also used in other ways by other people, so I avoid it). The requests for pages obviously only counts pages. One user can generate many requests by requesting lots of different files, or the same file many times.

The referrer for a request is the place that the user (or his computer) heard about your file from. If he followed a link to reach a page, it will be the previous page. In the case of a graphic on a page, the referrer will be the page containing the graphic.

Analog's kilobytes are 1024 bytes.


Analog recognises four categories of request, based on the HTTP status code of the request. You can see the total number of requests for each status code, and what the codes mean, in the Status Code Report. (Or see the HTTP spec for a detailed description.)

First, successful requests are those with HTTP status codes in the 200's (where the document was returned) or with code 304 (where the document was requested but was not needed because it had not been recently modified and the user could use a cached copy). Successful requests for pages refers to those lines on which the file requested was named and was a page.

Redirected requests are those with other codes in the 300's, indicating that the user was directed to a different file instead. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). The other common cause of redirected requests is their use as "click-thru" advertising banners.

Failed requests are those with codes in the 400's (error in request) or 500's (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.

Finally, requests returning informational status codes are those with status codes in the 100's. These are very rare at the moment.
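The four categories can be summarised as a small function. This is a sketch of the rules as stated in the paragraphs above, written in Python for illustration; the function name and the "unknown" fallback for codes outside 100-599 are assumptions.

```python
def classify(status):
    """Map an HTTP status code to one of the four categories of request."""
    if 200 <= status <= 299 or status == 304:
        return "successful"      # document returned, or cached copy still valid
    if 300 <= status <= 399:
        return "redirected"      # any other code in the 300's
    if 400 <= status <= 599:
        return "failed"          # error in request, or server error
    if 100 <= status <= 199:
        return "informational"
    return "unknown"             # assumption: not covered by the definitions

examples = {code: classify(code) for code in (200, 301, 304, 404, 500)}
```

Note that 304 is tested before the general 300's rule, because it is the one code in that range which counts as a success.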


Most reports only include successful requests in calculating the number of requests, requests for pages, bytes, and last date: unless, of course, the report is a redirection or failure report. There is a further restriction on the time reports, the Status Code Report, the Processing Time Report, the File Size Report, and the bytes lines in the General Summary: the logfile line must also contain the name of the file requested, and the filename must be being counted. This is necessary to stop double counting if the server uses separate logs.

The "not listed" line at the bottom of each of the non-time reports represents those items which were not listed because they were below the floor for the report.

The figures in parentheses in the General Summary are for the last seven days: either the seven days before the TO time, or if no TO time is given, the seven days before the time of the program start. (It would be nicer to use the seven days before the last time in the logfile, but we don't know when this is until we've read the whole logfile, and by then it's too late.) The figures for the last seven days are not included if all, or none, of the requests fall in the last seven days.
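The rule for the figures in parentheses can be sketched as follows. The function name is invented, and whether the ends of the seven-day window are inclusive is an assumption; the paragraph above does not say.

```python
from datetime import datetime, timedelta

def last_seven_days_count(request_times, end):
    """Count requests in the seven days before `end` (the TO time, or the
    program start time). Return None when the figure would be omitted,
    i.e. when all or none of the requests fall inside the window."""
    window_start = end - timedelta(days=7)
    recent = [t for t in request_times if window_start <= t <= end]
    if len(recent) == 0 or len(recent) == len(request_times):
        return None
    return len(recent)

program_start = datetime(1998, 9, 24, 13, 48)
times = [
    datetime(1998, 9, 16, 9, 52),    # more than seven days before the run
    datetime(1998, 9, 18, 12, 0),    # inside the window
    datetime(1998, 9, 21, 2, 4),     # inside the window
]
recent_count = last_seven_days_count(times, program_start)
```

Because the window ends when the program runs, not at the last line of the logfile, a logfile that stops a few days before the run has only part of the window filled.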


Frequently asked questions

List of Questions

  1. Understanding the Output
    See also What the results mean.
    1. How do I find out the number of hits from your data?
    2. Why are there so many referrers from my own site?
    3. The analysis covers exactly a week, but the figures for the last seven days don't agree with the totals.
    4. I only have 240 requests in total. Why does analog think there are 840 requests per week?
    5. The pie charts don't agree with the figures in the tables.
    6. Why doesn't analog agree with the counter on my page?
    7. Why doesn't analog agree with grepping the logfile?
    8. Why doesn't analog agree with my other logfile analysis program?
    9. Why do I only get "unresolved numerical addresses" in the Domain Report?
    10. Why are directories listed in the Request Report?
    11. When someone reads one of my PDF files, it scores dozens of hits.
    12. Kilobytes should be 1000 bytes, not 1024 bytes.
    13. The Organisation Report doesn't identify organisations correctly.
    14. "Organization" isn't spelled correctly.

Understanding the Output

Most of the questions in this category are answered in the section on What the results mean, which I really recommend you read if you want to understand what analog is telling you.
  1. How do I find out the number of hits from your data?
    I don't use the word hits, because people use it in different ways, so it's misleading. I use requests for the number of transfers of any type of file (text, graphics, ...), and page requests for the number of transfers of HTML pages. See the section on Analog's definitions for more information.
  2. Why are there so many referrers from my own site?
    These come from all the internal links on your site, and all the graphics on your pages. See the section on How the web works for more information.
  3. The analysis covers exactly a week, but the figures for the last seven days don't agree with the totals.
    The figures in parentheses are for the seven days before the time the program was run. They are never for the seven days before the end of the logfile.
  4. I only have 240 requests in total. Why does analog think there are 840 requests per week?
    If you have 240 requests in two days, that's a rate of 840 requests per week. Just like if you drove 28 miles in 20 minutes, you'd have driven at 84 miles per hour.
  5. The pie charts don't agree with the figures in the tables.
    Possibly you are looking at out-of-date images. Make sure to reload the images as well as the text.
  6. Why doesn't analog agree with the counter on my page?
    There are lots of possible reasons. Do they both start from the same date? Are you just looking at requests for that one page with analog, not for all your other pages and graphics? Also, analog will record all requests to that page; if your counter is a graphic, it will only measure requests from people on graphical browsers that reached that place on the page.
  7. Why doesn't analog agree with grepping the logfile?
    Have you understood what analog includes in its counts? In particular, most reports only list "successful" requests (HTTP status codes 200-299 and 304). A naïve grep would count failures too.
  8. Why doesn't analog agree with my other logfile analysis program?
    Small differences can be put down to different parsing. But if you are seeing large differences, you have to understand what analog counts, and what the other program counts. For example, some programs count HTTP status codes 301 & 302 as successes, whereas I think that to do so gives extremely misleading results.
  9. Why do I only get "unresolved numerical addresses" in the Domain Report?
    Your server only records the numerical IP address of the hosts that contact you, not their names.
  10. Why are directories listed in the Request Report?
    They are not directories, they are pages with the same name as the directory. For example, I have both a directory called /analog/ and a page called /analog/ (which happens to be the same as /analog/index.html).
  11. When someone reads one of my PDF files, it scores dozens of hits.
    PDF files are often downloaded and read one page at a time, and each page will then count as a separate request. Although this is not ideal, it's much less clear what to do about it. Analog has no way of knowing how many pages constituted a single download in the reader's mind. As usual, we can only reliably report how many requests there were at the server, not guess what users did with the file later.
  12. Kilobytes should be 1000 bytes, not 1024 bytes.
    Personally I think that 1024 bytes is a kilobyte.
  13. The Organisation Report doesn't identify organisations correctly.
    I admit the identifications aren't perfect, but this is because in domains in which organisations aren't all at the same level in the domain hierarchy, there is no way to identify them perfectly without long lists.
  14. "Organization" isn't spelled correctly.
    Yes it is. If you want American spellings, you have to specify it.
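The arithmetic behind question 4 is just a rate scaled to a standard period. A minimal Python sketch, using the numbers from that answer (the function names are illustrative):

```python
def requests_per_week(total_requests, days_covered):
    """Scale a request count over `days_covered` days to a weekly rate."""
    return total_requests / days_covered * 7

def miles_per_hour(miles, minutes):
    """The driving analogy from the same answer."""
    return miles * 60 / minutes

weekly_rate = requests_per_week(240, 2)   # 240 requests in two days
speed = miles_per_hour(28, 20)            # 28 miles in 20 minutes
```

A rate says nothing about how long the measurement lasted, which is why a short logfile can show a weekly figure larger than its total.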

 

Orion Group Software Engineers, Inc.   •   5770 Nimtz Parkway   •   South Bend, IN 46628   •   574-233-3401   •   sales@ogse.com