Simple linear regression with PHP: Part 1
By Paul Meagher - 2004-05-12

## Fit the model to the data

The `SimpleLinearRegression` procedure is used to fit a straight line to the data in which the straight line has the following standard form:

y = b + mx

The PHP form of this equation would look something like Listing 3:

Listing 3. PHP equation that fits the model to the data
```php
$PredictedY[$i] = $YIntercept + $Slope * $X[$i];
```

The `SimpleLinearRegression` class uses a least-squares criterion for deriving estimates of what the Y Intercept and Slope parameters should be. These estimated parameters are used to construct a linear equation (see Listing 3) to model the relationship between the X and Y values.

Using the derived linear equation, you can then obtain predicted Y values for each X value. If the linear equation is a good fit to the data, then the observed and predicted Y values tend to agree.
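To make the fitting step concrete, here is a minimal sketch of a least-squares fit followed by the prediction step from Listing 3. It assumes PHP 7 or later, and the helper names `fitLine` and `predict` are hypothetical; they are not methods of the `SimpleLinearRegression` class, which may organize the same arithmetic differently.

```php
<?php
// Sketch only: fitLine() and predict() are hypothetical helpers,
// not part of the SimpleLinearRegression class.

// Estimate Slope and YIntercept by ordinary least squares.
function fitLine(array $X, array $Y): array
{
    $n = count($X);
    if ($n < 2 || $n !== count($Y)) {
        throw new InvalidArgumentException('Need two arrays of equal length (n >= 2).');
    }

    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;

    $sumXY = 0.0; // sum of cross-products of deviations from the means
    $sumXX = 0.0; // sum of squared X deviations
    for ($i = 0; $i < $n; $i++) {
        $dx = $X[$i] - $meanX;
        $sumXY += $dx * ($Y[$i] - $meanY);
        $sumXX += $dx * $dx;
    }
    if ($sumXX == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }

    $slope = $sumXY / $sumXX;
    return ['Slope' => $slope, 'YIntercept' => $meanY - $slope * $meanX];
}

// Apply the fitted equation to each X value, as in Listing 3.
function predict(array $X, float $yIntercept, float $slope): array
{
    $predictedY = [];
    foreach ($X as $i => $x) {
        $predictedY[$i] = $yIntercept + $slope * $x;
    }
    return $predictedY;
}

$X = [1, 2, 3, 4, 5];
$Y = [3, 5, 7, 9, 11]; // made-up data lying exactly on Y = 1 + 2X

$fit = fitLine($X, $Y);
$PredictedY = predict($X, $fit['YIntercept'], $fit['Slope']);
```

Because the made-up data lie exactly on a line, the predicted values here match the observed ones; with real data the two sets of values only tend to agree.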

### How to determine a good fit

The `SimpleLinearRegression` class generates a fairly large number of summary values. One important summary value is a T statistic, which measures how well a linear equation fits the data. If the fit is good, the T statistic tends to be large in absolute value. If the T statistic is small, the linear equation should be replaced by the default model, which assumes that the mean of the Y values is the best predictor (the mean of a set of values is often a useful predictor of the next observed value).

To test whether the T statistic is large enough to reject the mean of the Y values as the best predictor, you need to compute the probability of obtaining a T statistic at least that large by chance. If that probability is low, you can reject the null hypothesis that the mean is the best predictor and, correspondingly, gain confidence that a simple linear model offers a good fit for the data.
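As a rough illustration of where such a T statistic comes from, the sketch below computes the usual textbook T statistic for the slope: the estimated slope divided by its standard error, with n - 2 degrees of freedom. I am assuming this is the statistic meant here; the function name `slopeTStatistic` is hypothetical, and the data are made up.

```php
<?php
// Sketch only: slopeTStatistic() is a hypothetical helper that uses the
// textbook formula T = Slope / SE(Slope); the class may compute it differently.
function slopeTStatistic(array $X, array $Y): array
{
    $n = count($X);
    if ($n < 3 || $n !== count($Y)) {
        throw new InvalidArgumentException('Need two arrays of equal length (n >= 3).');
    }

    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;

    $sumXY = 0.0;
    $sumXX = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $X[$i] - $meanX;
        $sumXY += $dx * ($Y[$i] - $meanY);
        $sumXX += $dx * $dx;
    }
    if ($sumXX == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }

    $slope = $sumXY / $sumXX;
    $yIntercept = $meanY - $slope * $meanX;

    // Sum of squared differences between observed and predicted Y values.
    $sse = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $residual = $Y[$i] - ($yIntercept + $slope * $X[$i]);
        $sse += $residual * $residual;
    }

    $df = $n - 2; // two parameters (Slope, YIntercept) were estimated
    $stdErrSlope = sqrt(($sse / $df) / $sumXX);
    if ($stdErrSlope == 0.0) {
        throw new RuntimeException('Perfect fit: the T statistic is undefined.');
    }

    return ['T' => $slope / $stdErrSlope, 'df' => $df];
}

$stats = slopeTStatistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("T = %.4f on %d degrees of freedom\n", $stats['T'], $stats['df']);
// prints: T = 2.1213 on 3 degrees of freedom
```

The T value by itself does not say whether the fit is good enough; it still has to be converted to a probability using the degrees of freedom.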

So, how do you compute the probability of the T statistic?

### Compute the T statistic probability

Because PHP lacks mathematical routines to compute the probability of a T statistic, I decided to shell out to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also wanted to raise awareness about this package because:

1. R provides quite a few ideas that PHP developers might want to emulate in a PHP math library.
2. With R, you can confirm that values obtained from a PHP math library agree with those obtained from a mature, freely available, open source statistical package.

The code in Listing 4 demonstrates just how easy it is to shell out to R for one value.

Listing 4. Shell out to the R statistical computing package for one value
```php
    // [The start of this listing is missing from the source: the class
    // declaration and the opening of getStudentProb() are not shown. The
    // surviving text begins mid-string with: RPath --slave";]
    $result = shell_exec($cmd);
    list($LineNumber, $Probability) = explode(" ", trim($result));
    return $Probability;
  }

  function getInverseStudentProb($alpha, $df) {
    $InverseProbability = 0.0;
    $cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
    $result = shell_exec($cmd);
    list($LineNumber, $InverseProbability) = explode(" ", trim($result));
    return $InverseProbability;
  }
}
?>
```

Note that the path to the R executable is set and used in both methods. The first method, `getStudentProb`, returns the probability associated with a T statistic based on Student's t distribution, while the second, the inverse method `getInverseStudentProb`, computes the T statistic corresponding to a given alpha setting. The `getStudentProb` method is used to assess the fit of the linear model; the `getInverseStudentProb` method returns an intermediate value used to compute a confidence interval for each predicted Y value.
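If you adapt this shell-out approach, it may be worth quoting the shell arguments and validating what R prints before using it. The following is a sketch under stated assumptions, not the class's actual code: the helper names `buildRCommand`, `parseRScalar`, and `inverseStudentProb` are hypothetical, the path `/usr/bin/R` is a placeholder, and the code assumes PHP 7 or later on a Unix-like shell. It relies on R printing a single value in the form `[1] 2.228139`, which is the same shape the `explode` call in Listing 4 expects.

```php
<?php
// Sketch only: these helpers are hypothetical and are not part of the
// Distribution class shown in Listing 4.

// Build the shell command, quoting the R expression and the R path so that
// neither can be misread by the shell.
function buildRCommand(string $expression, string $rPath): string
{
    return 'echo ' . escapeshellarg($expression)
         . ' | ' . escapeshellarg($rPath) . ' --slave';
}

// R prints a single value as "[1] 2.228139": an index label, whitespace,
// then the number. Reject anything that does not have that shape.
function parseRScalar(string $output): float
{
    $parts = preg_split('/\s+/', trim($output));
    if (count($parts) !== 2 || !is_numeric($parts[1])) {
        throw new RuntimeException("Unexpected R output: '$output'");
    }
    return (float) $parts[1];
}

// Same idea as getInverseStudentProb(), with the R path passed in explicitly.
function inverseStudentProb(float $alpha, int $df, string $rPath): float
{
    $cmd = buildRCommand(sprintf('qt(%F, %d)', $alpha, $df), $rPath);
    $result = shell_exec($cmd);
    if (!is_string($result)) {
        throw new RuntimeException('R produced no output.');
    }
    return parseRScalar($result);
}

// '/usr/bin/R' is an assumed location; adjust it for your system.
$cmd = buildRCommand('qt(0.975, 10)', '/usr/bin/R');

// Parsing a sample of the kind of text R prints for a single value.
$value = parseRScalar("[1] 2.228139\n");
```

Splitting on any run of whitespace, rather than on a single space, keeps the parse from breaking if R pads its output, and the exception makes a failed call visible instead of silently returning an empty value.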

Space constraints keep me from going into detail about all the functions in this class, so I encourage you to consult an undergraduate statistics textbook if you want to understand the terminology and steps involved in a simple linear regression analysis.
