I took a political science class once called "Intelligence and National Security". We learned about the history of intelligence, the global intelligence community, and so on. I don't remember much from it; I got a C in the class. But one concept did stand out: Intelligence organizations want raw intelligence data. Raw intelligence data is more than just an intercepted message. It can contain all kinds of useful contextual information: when was the message sent? who sent the message? where were they? who received the message? where were they? what medium did it travel through? what format was it in? and so on.
These are the "sources and methods" of intelligence, and they are often more valued and more closely guarded than any message. In The Code Book, Simon Singh explains how British cryptanalysts were able to break a version of the German Enigma cypher during World War II. The German morning weather report, which was issued daily at almost exactly 6:00am, from the same location every day, was a key to cracking the cypher. Every day, among hundreds of Enigma messages that the British government could intercept but not decipher, the weather report was a constant, and they could rely on its similar content over the course of days and months. This is exactly the kind of regularity that helps cryptanalysts break a system.
But this page is about web statistics, so I suppose I should get to the point.

Server Logs: The gold mine
The best information you can have is the raw data about your web traffic. Practically every web server generates a traffic log, which is a file containing details of each request received by the server. Here's a typical line of that file, showing one request made to my Microsoft interview story:
gl-lab16.wpi.edu - - [02/Nov/2000:20:16:07 -0600] "GET /~carl/microsoft.html HTTP/1.1" 200 12173 "http://www.google.com/search?q=Microsoft+Interview+Questions&hl=en&lr=&safe=off&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"
From this information, I can tell you the following: On the evening of November 2nd, around 8:15pm, someone from Worcester Polytechnic Institute (wpi.edu) in Worcester, MA, using Windows 98 and Microsoft Internet Explorer 5.01, typed "Microsoft Interview Questions" into the Google search engine (www.google.com), and Google returned my Microsoft interview story as a match for their search. They clicked on my story at 8:16pm, resulting in the request of the file /~carl/microsoft.html documented above. The file (which is 12,173 bytes) was successfully sent to their browser (server status code 200).
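If you'd like to pull those fields apart yourself, here's a minimal sketch of parsing that log line with a regular expression. It assumes the "combined" log format shown above (host, timestamp, request, status, bytes, referrer, user agent); other servers may arrange their logs differently.

```python
import re

# Fields of the "combined" log format, in the order they appear
# in the example line above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('gl-lab16.wpi.edu - - [02/Nov/2000:20:16:07 -0600] '
        '"GET /~carl/microsoft.html HTTP/1.1" 200 12173 '
        '"http://www.google.com/search?q=Microsoft+Interview+Questions'
        '&hl=en&lr=&safe=off&start=10&sa=N" '
        '"Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"')

hit = LOG_PATTERN.match(line).groupdict()
print(hit['host'])    # gl-lab16.wpi.edu
print(hit['status'])  # 200
print(hit['bytes'])   # 12173
```

Run this over every line of the log file and you have the raw material for everything discussed below.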
If I hunt around more in the server log file and process it with a few software programs, I can trace patterns of requests and come up with a plethora of statistics. Even if I don't know who is looking at my site, I do know how to differentiate one person from another, because people come from different addresses. One request might come from gl-lab16.wpi.edu, while the next might be from aca07a49.ipt.aol.com. If I wish, I can search for all requests from aca07a49.ipt.aol.com, and I can see what path that person followed through my site. I can get a rough idea of how long they spent at my site. Did they look at the Microsoft interview, then leave the site? Or did they browse around a bit, checking out other areas of my site?
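Tracing a path through the site is mostly a matter of grouping requests by address and sorting them by time. Here's a sketch of the idea, using a few made-up requests (the hosts are from the examples above; the times and extra pages are invented for illustration):

```python
from collections import defaultdict

# Hypothetical (host, time, path) tuples, as if already parsed
# from a server log.
requests = [
    ("aca07a49.ipt.aol.com", "20:16:07", "/~carl/microsoft.html"),
    ("gl-lab16.wpi.edu",     "20:16:07", "/~carl/microsoft.html"),
    ("aca07a49.ipt.aol.com", "20:18:40", "/~carl/"),
    ("aca07a49.ipt.aol.com", "20:21:02", "/~carl/photos.html"),
]

# Group requests by host, in time order, to see each visitor's path.
paths = defaultdict(list)
for host, time, path in sorted(requests, key=lambda r: r[1]):
    paths[host].append(path)

for host, visited in paths.items():
    print(host, "->", " -> ".join(visited))
```

The AOL visitor here read the interview story, backed up to the front page, then browsed to the photos, all over about five minutes. That's the kind of rough picture a log can give you.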
With a little more work, I can find out other things. Where are people coming from: primarily the United States, or perhaps Europe? South America? Quova, Inc. is a startup company creating a database that links internet addresses to geographic locations, down to the zip code. Similar efforts have been tried (and failed) in the past, but only now is there corporate money behind them. This is yet another step toward very specific targeted marketing campaigns.
Privacy advocates hate this stuff. Web surfing should be anonymous, they say. But unfortunately it's not, and a simple web log can be very valuable for marketing strategies. Add in all the information most e-commerce sites collect about you and your buying habits, and you've got all the raw intelligence a marketing department might ever want. It's unfortunate that very few, if any, monetary transactions on the Internet are anonymous. The option simply doesn't exist, because all this new technology has forced us to abandon the anonymity of cash, at least for now.
Anyway, on to the technique.
If you can get the actual server log file for your web site each week, you're set. You can compute all kinds of statistics from the file, using one of these tools:
I wish there were more free options for Windows and Macintosh users. Analog is the only free log file analyzer I could find, as of November 2000, but maybe I'm not looking in the right places.
If you can't get the server log file, don't fret-- you've got a number of other options. You can start by asking your web provider if they generate statistics for your site. Here's an example of what these statistics may look like, if your provider offers them. Statistics pages are generated from the server log file periodically, and they may not be as useful to you as the actual log file. But they do provide a lot of information, and you don't have to process the log file yourself.

Access Counters
If you can't even get a statistics page, there are still a few options that will give you decent (though less reliable) statistics for your pages. Many companies provide free counter and statistics services. Also, some web providers offer a simple graphical counter (the speedometer-looking counters). I used to recommend that you steal a graphical counter from wherever you could find it, but that's not necessary anymore, as there are plenty of free services these days:
The real downside of all these sites is that they don't really count hits to your pages-- they count hits to the little counter graphic you place on your pages. As a result, if someone visits your site but doesn't load the counter graphic, they won't be counted. If a visitor has graphics turned off, or if they don't stay at the page long enough for their browser to load the counter graphic, you'll never know they visited. And when search engine "web crawler" programs come to index your page, they won't be recorded either, because crawlers request only HTML files, not graphics.
Another side effect of the graphical counter: it can't tell you anything about the URL a visitor came from, called the referring URL. The referring URL can show you, for example, what people searched for in order to reach your page, or which sites linked them to yours. (But you can get a good idea of that from Google's Page-Specific Search.)
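If you do have the log file, the referring URL is where the search terms hide. Here's a sketch of pulling them out of a Google referrer like the one in the log line above; the "q" parameter is where Google puts the query, though other search engines name the parameter differently.

```python
from urllib.parse import urlparse, parse_qs

# The Google referring URL from the example log line above.
referrer = ("http://www.google.com/search?q=Microsoft+Interview+Questions"
            "&hl=en&lr=&safe=off&start=10&sa=N")

# Parse the query string and pull out the "q" (query) parameter;
# parse_qs also decodes the '+' signs back into spaces.
query = parse_qs(urlparse(referrer).query).get("q", [""])[0]
print(query)  # Microsoft Interview Questions
```

Tally these up across a month of logs and you know exactly what searches are bringing people to your site.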