Fit the model to the data
The SimpleLinearRegression procedure fits a straight line to the data, where the line has the following standard form:
y = b + mx
The PHP form of this equation would look something like Listing 3:
Listing 3. PHP equation that fits the model to the data
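As a minimal sketch of the standard form in PHP, a predicted Y value is the intercept plus the slope times X. The function and variable names here are hypothetical, chosen only for illustration, and are not taken from the SimpleLinearRegression class.

```php
<?php
// Hypothetical helper: evaluates y = b + mx for a single X value.
function predictY(float $yIntercept, float $slope, float $x): float
{
    return $yIntercept + $slope * $x;
}

// Example: intercept 1.0, slope 2.0, x = 3.0 yields 1.0 + 2.0 * 3.0.
$predicted = predictY(1.0, 2.0, 3.0);
```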
The SimpleLinearRegression class uses a least-squares criterion to derive estimates of the Y Intercept and Slope parameters. These estimated parameters are used to construct a linear equation (see Listing 3) that models the relationship between the X and Y values.
Using the derived linear equation, you can then obtain predicted Y values for each X value. If the linear equation is a good fit to the data, then the observed and predicted Y values tend to agree.
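The least-squares estimates described above can be sketched with the textbook formulas: the slope is the sum of cross-products of the X and Y deviations divided by the sum of squared X deviations, and the intercept follows from the two means. The function name fitLine and its return shape are assumptions made for this sketch, not the class's actual interface, and the input arrays are assumed to be zero-indexed lists of equal length.

```php
<?php
// Hypothetical helper: least-squares estimates for y = b + mx.
// Returns [yIntercept, slope]. Names are illustrative only.
function fitLine(array $x, array $y): array
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Need two or more paired X and Y values.');
    }

    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sumXY = 0.0; // sum of cross-products of the X and Y deviations
    $sumXX = 0.0; // sum of squared X deviations
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $sumXY += $dx * ($y[$i] - $meanY);
        $sumXX += $dx * $dx;
    }
    if ($sumXX == 0.0) {
        throw new InvalidArgumentException('X values must not all be equal.');
    }

    $slope = $sumXY / $sumXX;
    $yIntercept = $meanY - $slope * $meanX;

    return [$yIntercept, $slope];
}

// Fit a line, then compute a predicted Y value for each X value.
$xValues = [1, 2, 3, 4];
$yValues = [3, 5, 7, 9];
[$b, $m] = fitLine($xValues, $yValues);
$predictedY = array_map(fn($xi) => $b + $m * $xi, $xValues);
```

With data that lie exactly on a line, as in this example, the predicted Y values match the observed ones. With real data, the differences between them (the residuals) indicate how well the line fits.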
How to determine a good fit
The SimpleLinearRegression class generates a fairly large number of summary values. One important summary value is a T statistic that can be used to measure how well a linear equation fits the data. If the fit is good, then the T statistic tends to have a large value. If the T statistic is small, the linear equation should be replaced by a model that assumes the mean of the Y values is the best predictor (the mean of a set of values is often a useful predictor of the next observed value, which makes it the default).
To test whether the T statistic is large enough to reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining the T statistic by chance. If the probability of obtaining a T statistic is low, then you can reject the null hypothesis that the mean is the best predictor and, correspondingly, gain confidence that a simple linear model offers a good fit for the data.
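The article does not show how the class computes its T statistic. A common choice, assumed for this sketch, is the estimated slope divided by its standard error, with n - 2 degrees of freedom. The function name slopeTStatistic and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: t statistic for the slope of a fitted line,
// assuming t = slope / standardError(slope) with n - 2 degrees of freedom.
function slopeTStatistic(array $x, array $y, float $yIntercept, float $slope): float
{
    $n = count($x);
    if ($n < 3 || $n !== count($y)) {
        throw new InvalidArgumentException('Need three or more paired X and Y values.');
    }

    $meanX = array_sum($x) / $n;
    $sumSquaredResiduals = 0.0; // squared gaps between observed and predicted Y
    $sumXX = 0.0;               // sum of squared X deviations
    for ($i = 0; $i < $n; $i++) {
        $residual = $y[$i] - ($yIntercept + $slope * $x[$i]);
        $sumSquaredResiduals += $residual * $residual;
        $sumXX += ($x[$i] - $meanX) ** 2;
    }
    if ($sumXX == 0.0) {
        throw new InvalidArgumentException('X values must not all be equal.');
    }

    // Residual variance uses n - 2 because two parameters were estimated.
    $standardError = sqrt(($sumSquaredResiduals / ($n - 2)) / $sumXX);
    if ($standardError == 0.0) {
        throw new RuntimeException('Perfect fit: the t statistic is undefined.');
    }

    return $slope / $standardError;
}

// Example with a slope of 0.6 and an intercept of 2.2.
$t = slopeTStatistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 2.2, 0.6);
```

A slope of zero corresponds to the model in which the mean of the Y values is the best predictor, so a T statistic far from zero is evidence against that model.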
So, how do you compute the probability of the T statistic?
Compute the T statistic probability
Because PHP lacks mathematical routines to compute the probability of a T statistic, I decided to shell out to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also wanted to raise awareness about this package because:
- R provides quite a few ideas that PHP developers might want to emulate in a PHP math library.
- With R, you can confirm that values obtained from a PHP math library agree with those obtained from a mature, freely available, open source statistical package.
The code in Listing 4 demonstrates just how easy it is to shell out to R for one value.
Listing 4. Shell out to the R statistical computing package for one value
Note that the path to the R executable is set and used in the two functions. The first function returns a probability value associated with a T statistic based upon Student's T distribution, while the second, inverse function computes the T statistic corresponding to a given alpha setting. The getStudentProb method is used to assess the fit of the linear model; the getInverseStudentProb method returns an intermediate value used to compute a confidence interval for each predicted Y value.
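The two functions described above might be sketched as follows. Only the names getStudentProb and getInverseStudentProb come from the article; the function bodies, the helper functions, the Rscript path, and the choice of a two-sided probability are assumptions made for illustration. The sketch relies on R's pt (distribution function) and qt (quantile function) for Student's T distribution and on Rscript's -e option for evaluating a single expression.

```php
<?php
// Assumed location of the Rscript executable; adjust for your system.
const R_SCRIPT_PATH = '/usr/bin/Rscript';

// Builds the R expression for the two-sided probability of a t statistic.
function studentProbExpression(float $t, int $df): string
{
    return sprintf('cat(2 * pt(-abs(%.10F), %d))', $t, $df);
}

// Builds the R expression for the t value that leaves alpha / 2 in the upper tail.
function inverseStudentProbExpression(float $alpha, int $df): string
{
    return sprintf('cat(qt(1 - %.10F / 2, %d))', $alpha, $df);
}

// Runs one R expression through Rscript and returns its numeric output.
function evaluateInR(string $expression, string $rScriptPath = R_SCRIPT_PATH): float
{
    // Quote both parts so the shell passes them through unchanged.
    $command = escapeshellarg($rScriptPath) . ' -e ' . escapeshellarg($expression);
    $output = shell_exec($command);
    if (!is_string($output) || !is_numeric(trim($output))) {
        throw new RuntimeException('R did not return a numeric value.');
    }
    return (float) trim($output);
}

// Probability of a t statistic at least this extreme (assumed two-sided).
function getStudentProb(float $t, int $df): float
{
    return evaluateInR(studentProbExpression($t, $df));
}

// Critical t value for a given alpha setting (assumed two-sided).
function getInverseStudentProb(float $alpha, int $df): float
{
    return evaluateInR(inverseStudentProbExpression($alpha, $df));
}
```

Keeping the expression builders separate from the shell call makes it possible to check the generated R code without having R installed. With R available at the assumed path, a call such as getStudentProb(2.12, 3) would return the probability for a T statistic of 2.12 with 3 degrees of freedom.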
Space constraints keep me from going into detail about all the functions in this class, so I encourage you to consult an undergraduate statistics textbook if you want to understand the terminology and steps involved in a Simple Linear Regression analysis.