A data-exploration tool to address output and probability function shortcomings
Part one of this series ended by noting three elements that were lacking in the SimpleLinearRegression class. In this article, the author, Paul Meagher, addresses these shortcomings with PHP-based probability functions; demonstrates how to integrate output methods into the SimpleLinearRegression class; and creates graphical output. He then tackles these issues by building a data-exploration tool designed to plumb the depths of information contained in small- to medium-sized datasets.
In the first of this two-part series, "Simple linear regression with PHP," I explained why a math library can be useful for PHP. I also demonstrated how to develop and implement the heart of a simple linear regression algorithm using PHP as the implementation language.
The object of this article is to show you how to build a non-trivial data-exploration tool using the
SimpleLinearRegression class discussed in Part 1.
Recap: The concept
The basic goal behind simple linear regression modeling is to find the line of best fit through a two-dimensional plane of paired X and Y values (that is, your X and Y measurements). Once you find this line using the least-squared-error criterion, then you can perform various statistical tests to determine how well this line accounts for the observed variance in Y scores.
A linear equation -- y = mx + b -- has two parameters that must be estimated based on the X and Y data provided: the slope (m) and the y intercept (b). Once you have estimates of these parameters, you can enter your observed X values into the linear equation and see what predicted Y values it generates.
To estimate the m and b parameters using a least-squared-error criterion, you'll want to find estimates of m and b that minimize the difference between your observed and predicted values for all values of X. The difference between an observed and a predicted value is called error (yi - (mxi + b)) and, if you square each error score and sum these residuals, the result is a number called the squared error of prediction. Using a least-squared-error criterion to determine the line of best fit involves finding estimates of m and b that minimize the squared error of prediction.
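To make the criterion concrete, here is a minimal sketch of how the squared error of prediction could be computed for candidate values of m and b. The function name and variables here are illustrative, not taken from the SimpleLinearRegression class:

```php
<?php
// Sum of squared errors for a candidate line y = m*x + b.
// $xValues and $yValues are parallel arrays of observed pairs.
function squaredErrorOfPrediction($m, $b, $xValues, $yValues) {
    $sse = 0.0;
    foreach ($xValues as $i => $x) {
        $error = $yValues[$i] - ($m * $x + $b);  // yi - (m*xi + b)
        $sse  += $error * $error;                // square and accumulate
    }
    return $sse;
}

// A line that passes through every point yields zero squared error;
// the least-squared-error criterion seeks the m and b that make this
// sum as small as possible.
```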
The estimators, m and b, that satisfy the least-squared-error criterion can be found in two basic ways. First, you can use a numerical search procedure to propose and evaluate different values of m and b, ultimately settling on estimates producing the least squared error. The second approach is to use calculus to find the equations for estimating m and b. I will not go into the calculus involved in deriving these equations, but I do use these analytic equations in the SimpleLinearRegression class to find the least-squared estimates of m and b (see the getSlope and getYIntercept methods in that class).
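The analytic solutions themselves are standard textbook formulas: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A stand-alone sketch of those equations (not the class's actual code) might look like this:

```php
<?php
// Least-squares estimates of the slope (m) and y intercept (b)
// from parallel arrays of X and Y values.
function leastSquaresFit($xValues, $yValues) {
    $n = count($xValues);
    $meanX = array_sum($xValues) / $n;
    $meanY = array_sum($yValues) / $n;
    $sumXY = 0.0;  // sum of (x - meanX) * (y - meanY)
    $sumXX = 0.0;  // sum of (x - meanX)^2
    foreach ($xValues as $i => $x) {
        $sumXY += ($x - $meanX) * ($yValues[$i] - $meanY);
        $sumXX += ($x - $meanX) * ($x - $meanX);
    }
    $m = $sumXY / $sumXX;       // slope
    $b = $meanY - $m * $meanX;  // y intercept
    return array($m, $b);
}
```

For perfectly linear data such as (1, 2), (2, 4), (3, 6), this returns a slope of 2 and an intercept of 0.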
Even though you have equations that can be used to find the least squared estimates of m and b, it does not follow that once you plug these parameters into the linear equation that the result is a line that provides a good fit to the data. The next step in the simple linear regression procedure is to determine if the remaining squared error of prediction is acceptable or not.
You can use a statistical decision procedure to decide whether to
reject the null hypothesis that the mean of the Y values is the best
predictor of the data. This
procedure is based upon computing a T statistic and using a probability
function to find the probability of observing a value that large by
chance. As mentioned in Part 1, the SimpleLinearRegression
class generates a fairly large number of summary values, and one
important summary value is a T statistic that can be used to measure
how well a linear equation fits the data. The T statistic tends toward
a large value if the fit is good; if the T value is small, your linear
equation should be replaced by a default model that assumes the mean of
the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observed value).
To test whether the T statistic is large enough that you can reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining the T statistic by chance. If that probability is low, then you can reject the null hypothesis that the mean is the best predictor and, correspondingly, gain confidence that a simple linear model offers a good fit for the data. (For more on computing the probability of the T statistic, see Part 1.)
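One common formulation expresses this T statistic in terms of the correlation coefficient r and the sample size n; this is an illustrative formula for the same test, not necessarily the exact computation the SimpleLinearRegression class performs (Part 1 covers that):

```php
<?php
// T statistic for testing whether a linear fit beats the mean-only
// model, computed from the correlation coefficient r (with |r| < 1)
// and the number of observations n (with n > 2). It has n - 2
// degrees of freedom.
function tStatisticFromR($r, $n) {
    return $r * sqrt(($n - 2) / (1 - $r * $r));
}

// With r = 0.9 and n = 20, t is roughly 8.76 -- a large value whose
// probability of arising by chance is very small, so the mean-only
// null hypothesis would be rejected.
```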
Back to the statistical decision procedure. It tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research context, the alternative hypothesis of a linear model needs to be established by theoretical and statistical arguments.
The data-exploration tool you are building implements the statistical decision procedure for a linear model (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments necessary to establish a linear model. The data-exploration tool could be classified as a decision-support tool for knowledge workers exploring patterns in small- to medium-sized datasets.
From a learning point of view, simple linear regression modeling is worth studying because it is the gateway to understanding more advanced forms of statistical modeling. Many of the core concepts from simple linear regression, for example, establish a good foundation for understanding Multiple Regression, Factor Analysis, Time Series, and so on.
Simple linear regression is also a versatile modeling technique. It can be used to model curvilinear data by transforming the raw data, typically with logarithmic or power transformations. These transformations can linearize the data so that simple linear regression can be used to model the data. The resulting linear model would be expressed as a linear formula relating the transformed values.
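As a quick sketch of such a transformation: taking the logarithm of the Y values turns an exponential relationship y = a * e^(kx) into the linear one ln(y) = ln(a) + kx, which simple linear regression can then fit. The data below is synthetic, made up purely for illustration:

```php
<?php
// Linearizing curvilinear data y = a * exp(k * x) by taking logs:
// ln(y) = ln(a) + k * x, which is linear in x.
$xValues = array(1, 2, 3, 4);
$yValues = array();
foreach ($xValues as $x) {
    $yValues[] = 3.0 * exp(0.5 * $x);  // synthetic exponential data
}

// Transform Y; the (x, ln y) pairs now lie on a straight line
// with slope 0.5 and intercept ln(3), so a simple linear
// regression on the transformed values recovers k and ln(a).
$logY = array_map('log', $yValues);
```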