Web Click Stream Analysis Using Linux Clusters Linux Clusters

Web Click Stream Analysis using Linux Clusters

By Marty Lurie - 2004-09-07 Page: 1 2 3

Linux Clusters

Computer clusters use multiple computers to work on the same problem. There are two main reasons to cluster computers: performance and availability. Correspondingly, there are two dominant architectures for clusters: shared nothing architecture or a shared disk architecture. A shared nothing architecture is well suited to both tasks of performance and availability, a shared disk architecture lends itself more to addressing availability issues with limited scalability.

"Shared huh?" you might ask. The "sharing" refers to how the CPU and disk are allocated among the computers. The cluster of computers must define how they will work on the same task at the same time. In a shared nothing architecture, the computers all are assigned the problem to be solved and then work individually on data they control until they can return a result set to the user who launched the query. Each computer operates on the disk, memory, CPU, and data assigned to them. Conversely, a shared disk (or distributed lock manager) cluster uses a lock mechanism to arbitrate the cluster computers requests to a common disk repository of data. There is much debate about which architecture is better. I lean very strongly in favor of shared-nothing. An example of a shared-nothing supercomputer is the Search for Extraterrestrial Intelligence (SETI@home). Shared nothing databases are commercially available from IBM and Teradata. A distributed lock manager database (Cache Fusion) is available from Oracle. See Blair Adamache's article, Clustering for Scalability, for more on this topic.

Many Linux clusters are called Beowulf clusters. Rather than attempt a definition I offer you a quote from the FAQ (see http://www.beowulf.org/ for more information):

1. What's a Beowulf? [1999-05-13]

It is a kind of high-performance massively parallel computer built primarily out of commodity hardware components, running a free-software operating system like Linux or FreeBSD, and interconnected by a private high-speed network.

2. Where can I get the Beowulf software? [1999-05-13]

There is no software package called "Beowulf." There are, however, several pieces of software many people have found useful for building Beowulfs. None of them are essential. They include MPICH, LAM, PVM, the Linux kernel, and the channel-bonding patch to the Linux kernel.

Clusters can provide incredible compute power as you can see by looking at current rankings of cluster performance.

A Beowulf Example - Load Balancing Issues

Before we jump into a database cluster lets take a look at a simple two-node Beowulf cluster using Parallel Virtual Machine (PVM) with an X-windows interface (XPVM.) The two nodes are asymmetric: one is a 133 MHz Pentium, the second is a 500 MHz Pentium III. The compute task is trivial: gzcat a compressed file and direct the output to /dev/null. The display shows an interesting facet of cluster computing: there is no load balancing inherent in the cluster.

Figure 2. Cluster computing

The bar chart shows the "piecrust" computer (133 MHz) still working with an elapsed time of 5.7 seconds, while the "poobah" server (500 MHz) has already finished. If the final result is dependent on the result set from each node, the throughput is throttled by the slowest computer. The same issue arises if the amount of work assigned to a symmetrical cluster is skewed towards a subset of the nodes.

The Database Cluster

My cluster for the click stream analysis was a very low budget affair. I used all the old Pentium 233 MHz machines in the office I could scrounge, IBM-Informix XPS for Linux, and a 100 megabit ethernet switch. I purloined one 366 MHz machine, but as explained above, it was limited by the rest of the herd. The illustration shows two of the machines, the monitor, and the ethernet switch.

Figure 3. Physically clustered machines with penguins

Does it Scale?

I focused the cluster testing on performance scalability. How much additional performance is achieved by adding additional processors? The perfect case is linear (100%) scalability. There are some confounding variables. Moving from one processor to two processors introduces the communications layer between the nodes, sending the queries to the nodes (function shipping), and consolidating the result set to the requestor node. This tends to reduce scalability. This is a one-time penalty when moving from a single node to multiple nodes. Working to increase scalability is "super-linear speedup." As nodes are added more Random Access Memory (RAM) is available for in-memory data structures. RAM is so much faster than disk that 2 * n nodes might result in job times shorter than T / 2. Why? When the total data is much larger than the available RAM, the database will use disk space to store temporary results -- much like virtual memory paging in an operating system. As nodes are added to the cluster, their RAM becomes available to process the query. If enough nodes are added to allow all the data structures to reside in the RAM of all the machines, the swapping is avoided. Things speed up quite a bit when this happens.

The graph below shows run times for two, three, four, and five processors. The results are nearly linear. True 100% scalability would produce straight lines with constant slope. There is a mixed query workload and not all queries provide the same scalability. These tests were performed with a 100 megabit switch. The test case with five computers shows a slowdown on some queries. The fifth computer had no local data disk attached, so some of the SQL caused enough traffic to overload the 100 megabit switch.

Figure 4. Test cycle of coservers

It is easy to test the impact of network speed on cluster performance. I substituted a 10 megabit hub for the switch and re-ran the same suite of tests. Here we see ethernet at its worst. We have near linear slowdown instead of speed-up. The more packets flying around the wire, the more collisions and the slower we go. The Beowulf FAQ refers to a channel-bonding patch to make multiple Network Interface Cards (NICs) function as a single high-speed interface. This graph dramatically demonstrates why a high-speed connection is critical to the cluster.

Figure 5. Test cycle of 2, 3 and 4 nodes

Conclusions

Web log analysis can make significant performance and usability improvements to your Web site. Linux database clusters have tremendous potential. Please do try this at home - although your mileage may vary!

As you tune for performance (see Performance Guide), remember that there is now an additional major resource to measure and optimize: the normal CPU, disk, memory, and now interconnect. An easy way to get going on a two-node system without the expense of a 100 megabit switch (about $100) is a crossover cable (about $10). The cable will connect two machines to each other without a hub/switch.

Please let me know about your Linux database cluster experiences by contacting me at lurie@us.ibm.com (this is not a tech support email).

View Web Click Stream Analysis using Linux Clusters Discussion

Page: 1 2 3 Next Page: Understanding Web Logs

First published by IBM developerWorks