A data-exploration tool to address output and probability function shortcomings
Part one of this series ended by noting three elements that were lacking in the SimpleLinearRegression class. In this article, the author, Paul Meagher, addresses these shortcomings with PHP-based probability functions; demonstrates how to integrate output methods into the SimpleLinearRegression class; and creates graphical output. He then tackles these issues by building a data-exploration tool designed to plumb the depths of information contained in small- to medium-sized datasets.
In the first of this two-part series, "Simple linear regression with PHP," I explained why a math library can be useful for PHP. I also demonstrated how to develop and implement the heart of a simple linear regression algorithm using PHP as the implementation language.
The object of this article is to show you how to build a non-trivial data-exploration tool using the
SimpleLinearRegression class discussed in Part 1.
Recap: The concept
The basic goal behind simple linear regression modeling is to find the line of best fit through a two-dimensional plane of paired X and Y values (that is, your X and Y measurements). Once you find this line using the least-squared-error criterion, then you can perform various statistical tests to determine how well this line accounts for the observed variance in Y scores.
A linear equation -- y = mx + b -- has two parameters that must be estimated based on the X and Y data provided: the slope (m) and the y intercept (b). Once you have estimates of these parameters, you can enter your observed X values into the linear equation and see what predicted Y values it generates.
To estimate the m and b parameters using a least-squared-error criterion, you'll want to find estimates of m and b that minimize the difference between your observed and predicted values for all values of X. The difference between an observed and a predicted value is called error (yi - (mxi + b)) and, if you square each error score and sum these residuals, the result is a number called the squared error of prediction. Using a least-squared-error criterion to determine the line of best fit involves finding estimates of m and b that minimize the squared error of prediction.
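To make the criterion concrete, here is a minimal sketch of how the squared error of prediction could be computed for candidate values of m and b. The function name and variables here are illustrative, not taken from the SimpleLinearRegression class:

```php
<?php
// Sum of squared errors for a candidate line y = m*x + b.
// $xValues and $yValues are parallel arrays of observed pairs.
function squaredErrorOfPrediction($m, $b, $xValues, $yValues) {
    $sse = 0.0;
    foreach ($xValues as $i => $x) {
        $error = $yValues[$i] - ($m * $x + $b);  // yi - (m*xi + b)
        $sse  += $error * $error;                // square and accumulate
    }
    return $sse;
}

// A line that passes through every point yields zero squared error;
// the least-squared-error criterion seeks the m and b that make this
// sum as small as possible.
```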
The estimators, m and b, that satisfy the least-squared-error criterion can be found in two basic ways. First, you can use a numerical search procedure to propose and evaluate different values of m and b, ultimately settling on estimates producing the least squared error. The second approach is to use calculus to find the equations for estimating m and b. I will not go into the calculus involved in deriving these equations, but I do use these analytic equations in the SimpleLinearRegression class to find the least-squared estimates of m and b (see the getSlope and getYIntercept methods in that class).
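The analytic solutions themselves are standard textbook formulas: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A stand-alone sketch of those equations (not the class's actual code) might look like this:

```php
<?php
// Least-squares estimates of the slope (m) and y intercept (b)
// from parallel arrays of X and Y values.
function leastSquaresFit($xValues, $yValues) {
    $n = count($xValues);
    $meanX = array_sum($xValues) / $n;
    $meanY = array_sum($yValues) / $n;
    $sumXY = 0.0;  // sum of (x - meanX) * (y - meanY)
    $sumXX = 0.0;  // sum of (x - meanX)^2
    foreach ($xValues as $i => $x) {
        $sumXY += ($x - $meanX) * ($yValues[$i] - $meanY);
        $sumXX += ($x - $meanX) * ($x - $meanX);
    }
    $m = $sumXY / $sumXX;       // slope
    $b = $meanY - $m * $meanX;  // y intercept
    return array($m, $b);
}
```

For perfectly linear data such as (1, 2), (2, 4), (3, 6), this returns a slope of 2 and an intercept of 0.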
Even though you have equations that can be used to find the least squared estimates of m and b, it does not follow that once you plug these parameters into the linear equation that the result is a line that provides a good fit to the data. The next step in the simple linear regression procedure is to determine if the remaining squared error of prediction is acceptable or not.
You can use a statistical decision procedure to decide whether to
reject the null hypothesis that the mean of the Y values is the best
predictor of the data. This
procedure is based upon computing a T statistic and using a probability
function to find the probability of observing a value that large by
chance. As mentioned in Part 1, the SimpleLinearRegression
class generates a fairly large number of summary values, and one
important summary value is a T statistic that can be used to measure
how well a linear equation fits the data. The T statistic tends toward
a large value if the fit is good; if the T value is small, your linear
equation should be replaced by a default model that assumes the mean of
the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observed value).
To test whether the T statistic is large enough that you can reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining the T statistic by chance. If that probability is low, then you can reject the null hypothesis that the mean is the best predictor and, correspondingly, gain confidence that a simple linear model offers a good fit for the data. (For more on computing the probability of the T statistic, see Part 1.)
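One common formulation expresses this T statistic in terms of the correlation coefficient r and the sample size n; this is an illustrative formula for the same test, not necessarily the exact computation the SimpleLinearRegression class performs (Part 1 covers that):

```php
<?php
// T statistic for testing whether a linear fit beats the mean-only
// model, computed from the correlation coefficient r (with |r| < 1)
// and the number of observations n (with n > 2). It has n - 2
// degrees of freedom.
function tStatisticFromR($r, $n) {
    return $r * sqrt(($n - 2) / (1 - $r * $r));
}

// With r = 0.9 and n = 20, t is roughly 8.76 -- a large value whose
// probability of arising by chance is very small, so the mean-only
// null hypothesis would be rejected.
```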
Back to the statistical decision procedure. It tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research context, the alternative hypothesis of a linear model needs to be established by theoretical and statistical arguments.
The data-exploration tool you are building implements the statistical decision procedure for a linear model (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments necessary to establish a linear model. The data-exploration tool could be classified as a decision-support tool for knowledge workers exploring patterns in small- to medium-sized datasets.
From a learning point of view, simple linear regression modeling is worth studying because it is the gateway to understanding more advanced forms of statistical modeling. Many of the core concepts from simple linear regression, for example, establish a good foundation for understanding Multiple Regression, Factor Analysis, Time Series, and so on.
Simple linear regression is also a versatile modeling technique. It can be used to model curvilinear data by transforming the raw data, typically with logarithmic or power transformations. These transformations can linearize the data so that simple linear regression can be used to model the data. The resulting linear model would be expressed as a linear formula relating the transformed values.
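As a quick sketch of such a transformation: taking the logarithm of the Y values turns an exponential relationship y = a * e^(kx) into the linear one ln(y) = ln(a) + kx, which simple linear regression can then fit. The data below is synthetic, made up purely for illustration:

```php
<?php
// Linearizing curvilinear data y = a * exp(k * x) by taking logs:
// ln(y) = ln(a) + k * x, which is linear in x.
$xValues = array(1, 2, 3, 4);
$yValues = array();
foreach ($xValues as $x) {
    $yValues[] = 3.0 * exp(0.5 * $x);  // synthetic exponential data
}

// Transform Y; the (x, ln y) pairs now lie on a straight line
// with slope 0.5 and intercept ln(3), so a simple linear
// regression on the transformed values recovers k and ln(a).
$logY = array_map('log', $yValues);
```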