Design your data analysis to go beyond simple raw counts
Effective, multi-level analysis of Web data is a critical element for the survival of many Web-oriented businesses, and the design (and determination) of data-analysis tests is often the job of systems administrators and in-house application designers who may not have an understanding of statistics beyond tabulating raw counts. In this article, Paul Meagher delivers the skills and concepts Web developers need to be able to apply inferential statistics to their Web data streams.
Dynamic Web sites generate an enormous amount of data -- access logs, poll and survey results, customer profiles and orders, and more -- so increasingly, the job of a Web developer is not just to create the applications that generate this data, but also to develop applications and approaches that make sense of these data streams.
Often, the response of Web developers to the growing data-analytic requirements of managing their sites is inadequate. For the most part, Web developers haven't progressed much beyond reporting various descriptive statistics to characterize the data streams. An array of inferential statistical procedures (methodologies for estimating population parameters based upon sample data) could be fruitfully exploited but, at present, is not being applied.
For example, Web-access statistics (as currently compiled) are little more than frequency counts grouped in various ways. The results of polls and surveys are too often expressed in terms of simple raw counts and percentages.
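To make the contrast concrete, the following minimal Python sketch reports a poll result both as a raw percentage and as an approximate 95% confidence interval for the underlying population proportion. The poll numbers and the function name `proportion_confidence_interval` are hypothetical, chosen only for illustration, and the interval uses the simple normal approximation, which assumes a reasonably large and representative sample.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (p_hat, lower, upper) for a sample proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly
    95% confidence. This is an illustrative sketch, not a
    recommendation of one interval method over another.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n                       # descriptive statistic
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * std_err                        # inferential step
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical poll: 120 of 200 respondents answered "yes".
p_hat, lower, upper = proportion_confidence_interval(120, 200)
print(f"Raw percentage: {p_hat:.1%}")
print(f"Approximate 95% interval: {lower:.1%} to {upper:.1%}")
```

The raw percentage describes only the respondents who happened to vote; the interval is an estimate of where the proportion for the whole population of visitors plausibly lies, which is the kind of statement inferential statistics makes possible.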
Maybe developers shouldn't be expected to deal with the statistical analysis of data streams except in superficial ways. After all, there are those who devote careers to the more complex data-stream analysis; they're called statisticians and trained analysts. They can be brought in when an organization needs more than just descriptive statistics.
However, an alternative response is to acknowledge that increasing savvy with inferential statistics is becoming part of the job description for Web developers. Dynamic sites are generating more and more data, and it is arguably the responsibility of Web developers and systems administrators to find ways of turning this data into actionable knowledge.
I advocate the latter response; this article is intended to help Web developers and systems administrators learn (or activate, in the case of inert knowledge) the design and analysis skills necessary to apply inferential statistics to their Web data streams.
Relate Web data to experimental design
The application of inferential statistics to Web data streams involves more than learning the math underlying various statistical tests. Equally important is the ability to relate the data-collection process to critical distinctions in experimental design: What is my measurement scale? How representative is my sample? What's my population? What's the hypothesis I'm testing?
To apply inferential statistics to your Web data streams, you first need to think of your results as if they were generated by an experimental design, and then select an analysis procedure appropriate to that design. Even if you consider it a stretch to think of Web polls and access log data as the results of an experiment, it is critical for you to do so. Why?
- It will help you select an appropriate statistical test.
- It will help you draw the appropriate conclusions from your collected data.
One aspect of experimental design that is critical in determining the appropriate statistical test to use is the choice of measurement scale for data collection.