Web Click Stream Analysis Using Linux Clusters Loading up the Database

Web Click Stream Analysis using Linux Clusters

By Marty Lurie - 2004-09-07 Page: 1 2 3

Loading up the Database

To perform the analysis we'll use a relational database running on a cluster. Relational databases are the most prevalent database architecture. Data is presented to the user in table format. There are a number of relational databases available for Linux, including DB2(r) Universal Database and IBM-Informix XPS Developers Kit. My examples use the XPS Developers Kit. XPS has implemented its own message-passing layer rather than using the more typical Beowulf implementations (for example, PVM, MPI). Feel free to use the database of your choice.

If you want to test a cluster using the XPS Developers Kit, you can get a copy by contacting Mohan Natraj (World Wide Marketing Manager for XPS) at mogs@us.ibm.com. Mohan's phone number is (503) 525-7419.

The installation procedure is well-documented in the Installation Guide and the release notes/README file. To make life easier, download the easyinst.tar file which automates the installation process. By reading the shell script you will see the exact steps to run to do the installation. By reading the installation documentation, you'll understand why the steps are needed.

Relational databases use Structured Query Language (SQL) to work with the data. A tutorial on Informix SQL is available if you need it. We need both data definition language (DDL) to create tables and data manipulation language (DML) to query the data in the tables.

First let's create a table to store the data. The SQL to do this looks like:


-- comment lines explain the code and any unique XPSisms

-- create a database called weblog

-- the "in workslice.1" specifies the storage for the database

create database weblog in workslice.1;

-- a raw table is a non-logging table used for data warehousing

-- this is similar to a DB2 CREATE TABLE statement

-- with the option NOT LOGGED INITIALLY

create raw table webfact (
    ip char(15) ,
    foo char(1) ,
    bar char(1) ,
    timestamp datetime year to second,
    location char(16) ,
    url varchar(255) ,
    http_code char(3) ,
    bytecount int,
    objtype char(6),
    linkdepth int
)

-- the "fragment" clause specifies where to
-- put the data. In our case the "workslice"
-- is a disk set across the cluster

fragment by hash(ip) in workslice;

To run this SQL from a file:


dbaccess -e -weblogDDL.sql

Data loaders come in many different flavors. The XPS loader takes the relational table concept and maps it to Linux files and pipes. After the mapping has been defined with an external table you can use SQL to load, query, and unload data. The DDL for the external table is:


create external table e_webfact
-- the "sameas" saves re-typing the above DDL
sameas webfact
using (format "delimited",
REJECTFILE "/tmp/webfact.barf",
MAXERRORS 50000,
datafiles ("disk:1:/home/informix/work/dat/weblog.unl"));

Loading from the flat file of data defined in the "datafiles" clause is as simple as an insert-select. This is our first example of a DML statement.


-- now load
insert into webfact select e_webfact.* from e_webfact;

With data in the database we can now run queries to answer questions about the Web site.

Using SQL to answer questions about the Web site

Let's answer the following questions about the Web site with SQL:

What are the most popular objects?
Which objects consume the greatest site bandwidth?
How many errors are generated and what is their nature?
What is the distribution of activity over time?
What links are bookmarked, bypassing the home page?

This is only a sample of the questions that can be answered. The logs used for the examples are samples from www.informix.com before the latest web site revisions.

1) What are the most popular objects?

The most popular objects belong on the home page, or a link to them belongs on the home page, to make navigation and access easier. The query is an easy one:


select

    url,
    count(*) hits

from webfact
group by 1
order by 2 desc;

What is the most popular link at Informix for one sample log file? The envelope please:



url    /
hits    242
url    /informix/images/dot_clr.gif
hits    208
url    /informix/images/blackbar.gif
hits    181
url    /informix/absolut.css
hits    170

2) Which objects consume the most bandwidth?

The objects with the largest value of hits * object size consume the most site bandwidth. Why do we care? If a large 24-bit color image is eating most of the bandwidth, you might want to reduce the size or quality of the image to improve performance.

The query we use to find out which objects are consuming the most bandwidth is more complex and creates a temporary table on the fly:


-- use the maximum memory available
set pdqpriority 100;

-- if another query is updating the table don't block
set isolation to dirty read;

-- drop the temporary table if it exists
drop table foo;

select

    url,
    max( bytecount) mbytes,
    count(*) hits

from webfact
group by 1
-- create a temporary table, called foo, on the fly to store the
query output into temp foo;

select

    url,
    mbytes * hits totalbytes

from foo
order by 2 desc;

A sample from the Informix site yields:


url    /answers/english/docs/220sdk/5194.pdf
totalbytes    186935606

url    /answers/english/docs/912ius/4818.pdf
totalbytes    144750480

url    /answers/english/docs/gn7382/4372.pdf
totalbytes    106828470

url    /answers/english/docs/73ids/4354.pdf
totalbytes    56389410

3) How many errors are generated and what is their nature?

The first query below is a good quality-check of the site. The second query provides a list of urls and their errors in descending order, which gives a prioritized list of what to fix.


-- show summary of errors
select

    http_code,
    count(*) hits

from webfact
group by 1
order by 2 desc;

http_code	*(count())**
200	1478884	ok
304	452508	use cached copy
404	163149	not found
302	55476	moved temporarily
206	42576	partial content
500	7157	internal server err
401	4166	unauthorized
	123	?
403	25	forbidden
503	17	service unavailable



-- now look at the detail
select

    url,
    http_code,
    count(*) hits

from webfact
group by 1,2
order by 3 desc;

I'll spare you the output from this one. It is only interesting if it comes from your own site.

4) What is the distribution of activity over time?

The international community requires that this site is available 7x24. The distribution of hits over time illustrates the demand for high availability in the Web environment.


select

    -- get the hour of the access
    extend ( timestamp , hour to hour ) hour_x_hit,
    count(*) hits

from webfact
group by 1
order by 1 asc;

A graphical front end can make all the difference when presenting information. This graph, based on the above query, shows that the site is 7x24:

Figure 1. HTTP Hits by Time

5) What links are bookmarked, bypassing the home page?

If you build it they will come. And maybe even return for another visit. Browser bookmarks (or "favorites" if you use "that" browser) are your visitors' votes about the best content on the site. This content should be easily accessible. Putting links to the best content on the home page makes the home page more of a portal than an annoyance to avoid. Any customization or new information that is presented on the home page should be propagated to the popular pages to reach the returning bookmark audience.

The bookmark SQL is more complicated. We want the first object selected for a unique session. The UserTrack directive would be very helpful, but since this isn't available in the data we'll do the best we can. The comments in the code explain the stages of the query.


set isolation to dirty read;
set pdqpriority 90;
set explain on;

-- table to store bookmark results

--drop table bkmarks;

create raw table bkmarks (

    ip char (15),
    tstamp datetime year to second,
    linkdepth int,
    url varchar(255)

)
fragment by hash ( url ) in workslice;

select

    distinct ip,
    min( timestamp) tstamp

from webfact
-- filter for html pages, not .gif, .jpg ...
where objtype matches 'ht*'
group by 1
-- create a temporary table for the results
into temp foo;

-- now put a summary into the bookmarks table
insert into bkmarks
select

    w.ip,
    w.timestamp,
    w.linkdepth,
    w.url

-- notice that two tables are combined in the query
from webfact w,
foo f

-- the where clause allows the specification of how to
-- match-up the two tables and to do any desired filtering

where

    w.ip = f.ip
    and w.timestamp = f.tstamp
    and objtype matches 'ht*'

;

select url, count(*) bookmark_count from bkmarks
group by 1
order by 2 desc

All the analysis so far can be performed on any database. Now we'll look at the server used for these examples and how it can be clustered.

View Web Click Stream Analysis using Linux Clusters Discussion

Page: 1 2 3 Next Page: Linux Clusters

First published by IBM developerWorks