Access statistics
Web usage statistics, such as those produced by programs like
analog, cannot
be used to make strong inferences about the number of people who have
read a website or webpage. Although those who compile these statistics
usually try to make this clear, people still insist on misusing them to
make overly strong inferences. Attaching meaning to meaningless numbers
is worse than not having the numbers at all. When you lack information,
it is best to know that you lack the information. Web statistics may
give the user a false sense of knowledge which can be worse than being
knowingly ignorant.
A useful analogy is with putting up advertising posters. You will
never really know how many people have noticed them or read them.
It is not enough to say that the statistics should be taken with a
grain of salt; they should be taken with a salt lick. If you want to
understand why no inference about the number of people reading your
pages can be made from web statistics, read on. Otherwise, you may
wish simply to trust that statement, or to skip to
the section on Quick Questions and Answers.
Web stats are useful for web administrators to get a sense of the actual
load on the server. This is useful for diagnostics and planning, and for
detecting unusual behaviour that may require planning action. The goal
of the administrator is to keep the server running smoothly under
expected loads, while improving the speed and reliability of obtaining
documents from the site. The best way to achieve this is to have
browsers retrieve documents from places closer to where they will be
used (and even from memory) rather than from the disk on the
server. The server can record an access only when the file is
actually retrieved from the server itself.
Let's take a fictitious example of what might happen when someone in
Nome, Alaska, say at Nome Community College (this would be a polytechnic
in the UK), wants to read
Cranfield's Prospectus.
The user would somehow select the URL with his/her browser, which would
then try the following.
- Browser Cache
- The particular instance of the browser will look in its own
memory (or in what it may have saved on its local disk).
If it finds the page corresponding to the sought for URL there it
will not go any further, and our site will never know that the
request was made.
- Local site cache
- If the page was not in the browser cache, the browser may look
to its site cache. That is, if someone at the user's same site
recently retrieved the page, it may be available to the user there.
If it finds the page corresponding to the sought for URL there it
will not go any further, and our site will never know that the
request was made.
- Local regional cache
- The site cache may be configured to look in a local regional
cache, say at the University of Alaska, Nome campus which might
provide a caching service for smaller sites around Nome.
If it finds the page corresponding to the sought for URL there it
will not go any further, and our site will never know that the
request was made.
- Large regional cache
- The local regional cache may be configured to look in a large
regional cache, say in Fairbanks Alaska, which might provide caching
for sites in Alaska that use it.
If it finds the page corresponding to the sought for URL there it
will not go any further, and our site will never know that the
request was made.
- The accelerator
- An accelerator is an outgoing cache for a site. When a document
is requested from the site, the accelerator checks whether it has the
document stored (it stores documents in ways that are much faster to
find and retrieve than the server's files in the directory structure)
and, if so, serves that copy up.
While it would be possible to have the accelerator keep a record
of which files it served up and to whom, this would defeat the
purpose, because it would require a disk operation to make that
record.
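The chain of caches described above can be sketched in a few lines of Python. This is only an illustrative model, not how any real browser or proxy is implemented: the class names, the two-level chain and the path "/prospectus.html" are all assumptions made for the example. It shows that the server's log records only the requests that no cache could answer.

```python
class OriginServer:
    """The web server itself; only requests that reach it are logged."""

    def __init__(self):
        self.access_log = []

    def get(self, url):
        self.access_log.append(url)
        return f"contents of {url}"


class Cache:
    """A cache layer that asks the next layer up the chain on a miss."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = parent  # next cache up the chain, or the origin server
        self.store = {}       # url -> saved copy of the document

    def get(self, url):
        if url in self.store:             # hit: the request stops here
            return self.store[url]
        document = self.parent.get(url)   # miss: pass the request upwards
        self.store[url] = document
        return document


origin = OriginServer()
regional = Cache("large regional cache", origin)
site = Cache("site cache", regional)
browser_a = Cache("browser cache of reader A", site)
browser_b = Cache("browser cache of reader B", site)

url = "/prospectus.html"  # hypothetical path
for _ in range(5):
    browser_a.get(url)    # reader A opens the page five times
browser_b.get(url)        # reader B, at the same site, opens it once

print(len(origin.access_log))  # 1
```

In this model two readers view the page six times between them, yet the server logs a single access.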
Now that you have an idea of what caching is, you are in a better
position to understand why it is impossible to make any inference
about numbers of people reading your pages from web statistics. But
there is more to come, described in the section on
multiple hits per user. What is necessary to understand about
caching is that some users may go through a long and efficient cache
chain (as described in the example) and other users may not. Much of
this depends on how their site is set up or how they set things up
themselves.
Imagine (in the extreme case) a user who is doing no caching
whatsoever. If that user comes across the
Cranfield Home Page 20
times while browsing around the Cranfield pages, that will count as
20 hits. Remember, the statistics are about accesses, not about
people.
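The difference between accesses and people can be made concrete with a small sketch. The log below is invented for illustration (the host names and the simplified one-pair-per-request format are assumptions, not the format of any real server log): one uncached user requests the home page 20 times, and a proxy, which may stand in front of any number of readers, requests it once.

```python
from collections import Counter

# Hypothetical, simplified log: one (host, path) pair per request.
access_log = [("pc1.example.org", "/index.html")] * 20 + [
    ("proxy.example.net", "/index.html"),
]

hits = Counter(path for _host, path in access_log)
hosts = {host for host, _path in access_log}

print(hits["/index.html"])  # 21
print(len(hosts))           # 2
```

Neither number is a count of readers: the 21 hits overstate the single uncached user, and the 2 hosts say nothing about how many people sit behind the proxy.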
When comparing hits for different directories, it is important to
note how documents are structured. If you have one directory with a
single document and another
directory with the same amount of real content broken into twenty
smaller documents, you will find far more hits in that second
directory.
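As a rough worked example, assume (purely for illustration) that 50 readers each read all of the content in both directories and that no caching happens anywhere. The directory names and the reader count are made up.

```python
readers = 50  # hypothetical: each reads all of the content, with no caching

# Number of documents the same content is split into, per directory.
documents_in_directory = {"single/": 1, "split/": 20}

hits = {
    directory: readers * documents
    for directory, documents in documents_in_directory.items()
}

print(hits)  # {'single/': 50, 'split/': 1000}
```

The same readership produces twenty times as many hits for the split directory, so the hit counts of the two directories cannot be compared directly.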
Almost everything listed here is either mentioned above or can be
inferred from the explanations above.
A quick list of the questions is provided here.
Not really. The number of individuals and sites using caches is
rising all the time, as is the amount of disk space and memory used
for caching. When the Cranfield Accelerator goes live (early
November, 1995), there should be an actual drop in our server stats
despite an increase in accesses due to the increased speed and
reliability of the server. Caching has been on the rise for more
than a year now. Even so, loads on systems (including ours) have
gone up dramatically.
Unfortunately, not even this is possible. Suppose, for example, that
Japan has a very high level of regional and national caching while
Singapore does not (the example is fictitious). Under these
circumstances, web statistics might show more accesses from
Singapore than from Japan even if more people in Japan read our
pages.
A clear example of this is the number of accesses from "numerical
domains" that have recently started to top various lists. These are
accesses from sites that don't have proper reverse DNS listings.
Such sites are probably misconfigured single-user machines, where
either the particular machine is misconfigured or the
organisation it belongs to has not straightened out its machine
names properly. It is reasonable to assume that those running such
misconfigured systems are also far less likely to have configured
their proxies correctly, so far less caching will be seen from those
sites.
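The fictitious Japan and Singapore comparison can be put into numbers. Every figure below is invented for the sketch, and the function name `logged_hits` is hypothetical; the only point is that a region with more readers can still show fewer logged accesses if more of its requests are answered by caches.

```python
def logged_hits(page_views, cache_hit_percent):
    """Requests that reach our server and are therefore logged."""
    return page_views * (100 - cache_hit_percent) // 100


# Hypothetical figures: heavy caching in Japan, very little in Singapore.
japan = logged_hits(page_views=1000, cache_hit_percent=90)
singapore = logged_hits(page_views=300, cache_hit_percent=10)

print(japan, singapore)  # 100 270
```

Here Japan has more than three times the page views, yet the statistics would show Singapore well ahead.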
Not really. The more popular pages will be cached more, meaning that
real differences between page hits will be dramatically distorted.
It is probably safe to say that if one page shows more hits than
another, there really were more accesses to that page, but there
are circumstances under which even that weak inference won't be
true.
Not really. This is because any such multiplier would have to
differ from page to page and
differ from access region to access region.
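To see why no single multiplier can work, consider a deliberately simple model in which a fraction h of all requests for a page from a region is answered by caches, so that each logged hit stands for 1 / (1 - h) requests. Both the model and the values of h below are assumptions made for illustration; in practice these fractions are unknown, which is exactly the problem.

```python
from fractions import Fraction

# Hypothetical fraction of requests answered by caches, per (page, region).
cache_hit_rate = {
    ("home page", "region A"): Fraction(9, 10),
    ("home page", "region B"): Fraction(1, 2),
    ("obscure page", "region A"): Fraction(1, 10),
}

# If a fraction h of requests never reaches the server, each logged hit
# stands for 1 / (1 - h) requests under this simple model.
multiplier = {key: 1 / (1 - h) for key, h in cache_hit_rate.items()}

for key, m in multiplier.items():
    print(key, float(m))
```

Under these made-up rates the correction factor ranges from about 1.1 to 10, so a multiplier that is right for one page and region is badly wrong for another.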
Yes you can. There are several ways to do so, and there are some
circumstances for which it is even legitimate, but to do so merely
to get better stats is seriously misguided. This is for two reasons:
- You will make your page (much) harder for people to get to
and add to network traffic unnecessarily.
- If someone fails to reach your page at our site, they may
give up on the site altogether. Thus hard-to-reach pages
(unless there is a clear reason for them being so) are
unfair to other providers at the site.
Quite embarrassingly, many of the pages on this site
don't normally cache properly. This is because I had some technical
difficulties with my configuration of server-side includes and the
so-called "XBitHack". I've fixed that now, but I still have to fix
dozens of documents to use things properly.
You may have noticed some pages with web counters. There are
basically two ways to put them in your page: the wrong way and the
very wrong way. The wrong way merely doesn't work and will not be
more useful than normal statistics. The very wrong way is
counterproductive because it subverts the caching
mechanism, which is not a good idea just to get statistics.
Please note that even if you think that statistics can be made
useful, counters on individual pages are displayed to the reader,
who isn't in the position to make the various adjustments needed to
get some sense of true readership.
Yes and no, but mostly no. There are two reasons for "mostly no".
One is simply that there are too many small caches out there which
may have cached our stuff (including the browser software internal
cache). Clearly not all of these are going to send us records on a
regular basis which we would then have to incorporate into all of
the other records to process statistics.
The other reason for "mostly no" is that even the large caches
are willing to send only a byte count. That is, one major UK cache
is considering sending out, on a monthly basis, the number of bytes of data
it served up in our name.
We must remember that the caches are doing us a favour by making
our pages much easier to reach. We cannot ask them to take on a task
that would degrade the service or place an additional
administrative, disk, memory and CPU load on them. Without caching,
the web would have collapsed long ago.
Yes and no. If by minimum you mean "at least one" then yes. If you
have 400 hits from Japan then you can conclude that during that
period you had at least one reader from Japan. You cannot
infer that there were at least 400 readers, because the same reader
may hit a page many times in a short period of
time.
So, the only certain inference that can be made is that there was
at least one reader from a particular domain, or of a particular page.
One way is to set up
Mail Reply Forms
in your pages like the one at the end of this
document. Of course many more people will read your pages than
will complete the form, but the form can be used to judge serious
interest. Most people will, however, not fill out a form unless they
think they will get some sort of useful response, even if they read
the document seriously. (Did you fill out the form for this
document?)
Setting up these forms is not as difficult to do as it first
appears, and courses are offered on it by the computing centre
staff.
They are useful for system administrators to judge the actual load
on the server. The section on what stats are good
for contains more information.
Popular demand. It is not the computer centre's job to deny users
some service just because we know the request to be misguided.
Attempts to eliminate these statistics from the system met with
complaints. However, no great effort will be put into maintaining
statistics or access to them either. It is hoped that this document
will make it easier for the computer centre to withdraw statistics
altogether, except for what is required for system maintenance.
No. But you may have noticed that many of the individual problems
and difficulties could be partially mitigated by collecting
and using more information (from some caches for example or times of
requests) and using that to make very rough estimates of various
correction factors. It would take serious statistical analysis of the
sort that professional market research firms may be able to
undertake, and still the estimates (and relative hits on pages or
from regions) would remain iffy. Performing complicated analyses on
dubious data only compounds the problem, and the marginal utility
would be negative (i.e., the large amount of extra effort would not be
justified by the tiny gain in meaningfulness of the statistics).