Web Click Stream Analysis Using Linux Clusters Understanding Web Logs

Web Click Stream Analysis using Linux Clusters

By Marty Lurie - 2004-09-07 Page: 1 2 3

Understanding Web Logs

Marty Lurie describes how to use SQL to analyze the logs of a Web server to understand how people use the site and performance issues. Using Linux clusters with a shared nothing database can improve the performance of that processing. Examples are shown using IBM-Informix XPS Developers Kit.

"...every click you make, every .gif you take, we'll be watching you..." -- Anonymous, and still in hiding.

There is a wealth of information gathered by the Web server (httpd) and stored in the log file. In this article we'll look at how to use the logs to improve the performance and usability of the web site. Unfortunately, many sites just archive or delete the logs. I'll describe how to load the logs into a database and then to query the database to see how people navigate the site and how site performance can be improved.

I chose a Linux clustered database as a proof point of Linux scalability and also to show that all the old Pentium machines in the office could be put to good use. The idea here is to make a pile of old hardware accomplish the work of a newer, faster machine. A database server that runs on a cluster addresses one aspect of clustered computing. Most of the Linux supercomputer clusters focus on numeric problems. This cluster addresses the commercial requirements of data storage and manipulation. In a cluster, all the computers work on the same query at the same time. The key question: as you add more computers, how much faster do you get the answer? Inter-node communication, data distribution, and load balancing are critical to cluster performance.

To demonstrate the load balancing issue I used the Parallel Virtual Machine (PVM) software common to Beowulf clusters. I used the IBM Informix(r) Extended Parallel Server? (XPS) database for testing the cluster database scalability.

Let's tackle the following:

Understanding the Web logs and loading them into the database
Using Structured Query Language (SQL) to answer questions about the Web site
Measuring the database cluster performance as more hardware is added

Understanding the Web logs and loading them into the database

The log files for the Apache Web server are found in the /var/log/httpd directory (all file names and locations are based on RedHat 6.2). You'll find a number of logs, the most current one being access_log. The default format of the file is:



192.168.1.142 - - [14/Nov/2000:16:27:21 -0500] "GET /time.html HTTP/1.1" 304 -

The first field is the IP address of the client. The "-" which comes next is the rfc1413 identity check, which is rarely used because of the overhead associated with using the identity check capability. Next is any active .htaccess user id authentication--in this case, there is none. Enclosed in square brackets ([ ]) is a time stamp of the access. The request method, in this case a GET, was for the url /time.html using HTTP/1.1. The 304 is the Status Code for the access as documented in the RFC for HTTP. The final "-" is the number of bytes sent -- none in this case. The 304 return code says the object has not been modified so no bytes are sent.

Please pass the cookies:

The default Web server log provides lots of information, but it can mask many individual browsers operating from within the same Internet Service Provider (ISP). Cookies are the solution.

Consider the following three lines from an access_log file:


206.175.175.226 - - [15/Nov/2000:12:23:36 -0500] "GET / HTTP/1.0" 200 1233

206.175.175.226 - - [15/Nov/2000:12:24:15 -0500] "GET / HTTP/1.0" 200 1233

206.175.175.226 - - [15/Nov/2000:12:25:27 -0500] "GET / HTTP/1.0" 200 1233

It looks like one browser, identified by a single IP address, has accessed the "/", or root index, page three times. However, if logging is enabled with the following in the Apache httpd.conf configuration file, the above hits tell a very different story


# add user tracking
CookieTracking on
CustomLog logs/clickstream "%{cookie}n %r %t"

With logging enabled, we can look in the file called "clickstream" (as configured above) for the following log information:


206.175.175.226.1017974309016881 GET / HTTP/1.0 [15/Nov/2000:12:23:36 -0500]

206.175.175.226.1028974309055577 GET / HTTP/1.0 [15/Nov/2000:12:24:15 -0500]

206.175.175.226.1017974309127275 GET / HTTP/1.0 [15/Nov/2000:12:25:27 -0500]

There were three completely different browsers accessing the site. This tracking is accomplished using a unique identifier (non-persistent cookie) for each browser. The number following the IP address is a user-tracking cookie that is sent by the server to the browser. Each time the browser makes a request, the cookie is sent back to the server by the browser.

Because the default installation does not have user tracking enabled, the examples in this article do not use CookieTracking. You can jump in with your current logs and do this analysis. You will want to turn on CookieTracking so all those clicks from "that large ISP" will become individual streams instead of one very busy user.

A note on privacy. A number of companies have been caught with their "hand in the cookie jar," so to speak. A quick Internet search yields 124,000 pages that contain the string "privacy" that also contain the string "cookies". This is a topic that has attracted some attention. Develop a privacy policy, post it to your site, and live by it.

Now that we have a handle on the log file, we need to load it into a database to run queries against the data. The log file strings are not friendly to a database high-speed loader so we will massage them into a delimited string. Dust off your favorite Practical Extraction and Report Language. The code you can download is in Perl. A quick disclaimer: my Perl code is not "good Perl style." The distinguishing features are: it works, and it has at least one comment line.

For the following input:


158.58.240.58 - - [04/Jul/2000:23:59:42 -0500] "GET / HTTP/1.1" 200 47507

203.127.32.40 - - [04/Jul/2000:23:59:25 -0500] "GET /jp/product/images/prod_top.gif HTTP/1.0" 
304 -

203.127.32.40 - - [04/Jul/2000:23:59:24 -0500] "GET /jp/product/down.html HTTP/1.0" 200 12390

211.45.44.33 - - [04/Jul/2000:23:59:19 -0500] "GET /kr/images/top_banner.gif HTTP/1.1" 200 
6719

211.45.44.33 - - [04/Jul/2000:23:59:24 -0500] "GET /kr/images/waytowin.gif HTTP/1.1" 200 
71488

211.45.44.33 - - [04/Jul/2000:23:59:25 -0500] "GET /kr/train/new.gif HTTP/1.1" 200 416

The output from the Perl program looks like this:


158.58.240.58|-|-|2000-07-04 23:59:42||/|200|47507|-|1|
203.127.32.40|-|-|2000-07-04 23:59:25|jp|/jp/product/images/prod_top.gif|304||gif|4|

203.127.32.40|-|-|2000-07-04 23:59:24|jp|/jp/product/down.html|200|12390|html|3|

211.45.44.33|-|-|2000-07-04 23:59:19|kr|/kr/images/top_banner.gif|200|6719|gif|3|

211.45.44.33|-|-|2000-07-04 23:59:24|kr|/kr/images/waytowin.gif|200|71488|gif|3|

211.45.44.33|-|-|2000-07-04 23:59:25|kr|/kr/train/new.gif|200|416|gif|3|

The last two fields, added by the Perl script are: 1) the object type, stripped from the end of the requested URL, and the 2) the link depth, which is the number of "/" characters in the url.

View Web Click Stream Analysis using Linux Clusters Discussion

Page: 1 2 3 Next Page: Loading up the Database

First published by IBM developerWorks