Some thoughts on probability modeling
I'd like to cover three other topics in probability modeling before I end this article. They concern the following issues:
- Is there such a thing as fitting your data too closely?
- How much can the determination of which distribution and what adjustment parameters to use be automated?
- A potential new path to developing a Probability Distributions Library implemented in PHP.
One question you might ask is why bother fitting an empirical data distribution to a theoretical probability distribution. Wouldn't it be possible to use the empirical data distribution to directly compute the probabilities of certain outcomes? After all, a relative frequency histogram can easily be converted to a probability histogram. You could also develop a program that would compute the probability of observing a male taller than 72 inches by counting the number of data points with a value greater than 72 inches and dividing this number by the total number of observations you have to work with. Wouldn't it be better to use an actual empirical probability distribution rather than a less accurate theoretical probability distribution to construct your probability models?
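The counting approach just described takes only a few lines of PHP. This is a hedged sketch; the height values are made up for illustration:

```php
<?php
// Estimate an empirical probability directly from observed data.
// $heights is a sample of male heights in inches (illustrative values).
$heights = [65.2, 70.1, 72.5, 68.9, 74.0, 71.3, 69.8, 73.2, 66.7, 70.9];

// P(height > 72) estimated as (count above threshold) / (sample size)
$threshold = 72;
$above = count(array_filter($heights, function ($h) use ($threshold) {
    return $h > $threshold;
}));
$probability = $above / count($heights);

echo $probability;  // 3 of the 10 observations exceed 72 inches, so 0.3
?>
```

No theoretical distribution enters the computation at all; the estimate is only as good as the sample it is counted from.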
In some cases, this is the correct route. Software can be written that allows you to make inferences about the probability of certain outcomes using empirical probability distributions with irregular shapes. These inferences may be more accurate than using any of the available theoretical probability distributions because it is possible that none of them fit the distribution data well.
The main problem you run into, though, if you use an empirical probability distribution, is over-fitting your data. The purpose of constructing a probability model is to generalize to new cases of your random variable. For example, if you accept that the real distribution of male height should always have a dip at 69 inches, do you think that this will be true for future cases? If one run of the random-number generator produces fewer sixes than sevens, should you slavishly predict that future runs will produce the same relative frequencies?
The argument against using empirical probability distributions is that they tend to over-fit the data and can reduce your ability to generalize to new instances of your random variable. This argument should also be kept in mind when one resorts to more advanced techniques, such as curve fitting with Fourier components, to represent the irregularly shaped probability distributions. While the curve might conform more closely to the probability distribution, is it trying to conform too closely? What happens when you take your next sample? Do you need to redraw the curve?
In univariate probability modeling (versus a curve-fitting approach), you frame your investigation in terms of having an observed distribution and wanting to use this information to estimate the simplest possible probability model for the data. The argument against curve fitting is that your models might be marginally more accurate but much more complicated than a simple univariate probability model. Also, your parameter estimates may be less robust as new information comes in -- they may not serve the purpose of generalization to new cases.
Automating the distribution choice
Another issue I offer for pondering is whether to automate the process of finding the appropriate theoretical distribution and estimating which adjustment parameters to use. Tools are available that allow you to feed in a vector of measurements representing your random variable -- these tools automatically:
- Generate parameter estimates for a variety of theoretical distributions (using the method of moments in most cases)
- Do goodness-of-fit testing to rule out certain probability distributions as candidates
- Rank the remaining theoretical distributions
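To make the first of these steps concrete, here is a hedged sketch of method-of-moments estimation for a normal distribution. The function name and sample data are my own, not part of any existing tool; the idea is simply to equate the sample mean and variance with the distribution's mu and sigma parameters:

```php
<?php
// Method-of-moments estimation for a normal distribution:
// match the sample mean and variance to the distribution's mu and sigma.
function momentsEstimateNormal(array $data)
{
    $n = count($data);
    $mean = array_sum($data) / $n;

    $sumSq = 0.0;
    foreach ($data as $x) {
        $sumSq += ($x - $mean) * ($x - $mean);
    }
    $variance = $sumSq / $n;  // second central moment

    return ['mu' => $mean, 'sigma' => sqrt($variance)];
}

$sample = [68.0, 70.0, 72.0, 69.0, 71.0];
print_r(momentsEstimateNormal($sample));  // mu => 70, sigma => ~1.414
?>
```

An automated tool would repeat this kind of moment matching for each candidate distribution before moving on to goodness-of-fit testing and ranking.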
While such tools are definitely useful, they are not a substitute for an intelligent analyst bringing experience, knowledge and theory to bear on the issue, performing exploratory data analysis, applying various goodness-of-fit tests to the data and making an all-things-considered judgment. In some cases, for theoretical or rational reasons, you might expect a random variable to be distributed in a certain manner. For example, if you mixed male and female heights together to produce a height distribution, the resulting skewed distribution might suggest a mixture of two underlying normal distributions. Also, a random variable that can be fit to a distribution using one adjustment parameter might be preferable, on the grounds of simplicity, to a theoretical distribution that requires two adjustment parameters, especially if the model using a single adjustment parameter makes empirical sense.
It would be interesting and useful to develop a PHP-based tool that would help automate the mechanical aspects of fitting data to various theoretical probability distributions. Such a tool would be designed (and presented) as a useful exploratory tool -- a decision-making aid rather than a substitute for human insight and common sense.
A new PHP PDL?
A promising development is the recent announcement of a statistics-processing extension for PHP that wraps two libraries -- DCDFLIB (a library of C routines for CDFs, inverses, and other parameters) and RANDLIB. This opens up the possibility of having a Probability Distributions Library implemented in PHP that is a wrapper around these tried and tested probability libraries. One advantage to using the API discussed in this article: It is vector-based and object-oriented, while the proposed C-based extension is not.
These new functions could be integrated into the suggested PHP-based API with relatively little effort.
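As a hedged sketch of what that integration might look like: `stats_cdf_normal()` stands in for the sort of wrapper function such an extension might expose (its exact signature is an assumption), and the `NormalDistribution` class and its `cdf()` method are hypothetical:

```php
<?php
// A vector-based, object-oriented wrapper around a C-level CDF routine.
// stats_cdf_normal() is assumed to be provided by the statistics extension;
// the NormalDistribution class shown here is hypothetical.
class NormalDistribution
{
    private $mu;
    private $sigma;

    public function __construct($mu, $sigma)
    {
        $this->mu = $mu;
        $this->sigma = $sigma;
    }

    // Accepts a vector of values and returns a vector of CDF values,
    // delegating each scalar computation to the C extension.
    public function cdf(array $values)
    {
        $result = [];
        foreach ($values as $x) {
            // which = 1 asks the routine for the CDF given x, mu, sigma
            $result[] = stats_cdf_normal($x, $this->mu, $this->sigma, 1);
        }
        return $result;
    }
}

$heights = new NormalDistribution(70, 3);
print_r($heights->cdf([66, 70, 74]));
?>
```

The class keeps the vector-based, object-oriented style of the API discussed in this article while pushing the numerical work down to the C routines.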
Another advantage of using a PHP-based API involves providing a uniform API to distribution functions that are implemented in C, PHP, or another language. If the DCDFLIB or RANDLIB libraries do not implement a particular distribution or method, you could implement the relevant distribution or method in PHP and the user would (and arguably should) be oblivious to these details. The API is of paramount importance to the user; the details of how it is implemented are of less importance.
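This separation of API from implementation can be sketched as follows. The class and the pure-PHP fallback are my own illustrations, and `stats_cdf_exponential()` (along with its parameterization) is assumed rather than guaranteed to exist:

```php
<?php
// Keeping the API uniform whether a method is backed by the C extension
// or by a pure-PHP fallback. Names here are hypothetical illustrations.
class ExponentialDistribution
{
    private $lambda;

    public function __construct($lambda)
    {
        $this->lambda = $lambda;
    }

    public function cdf(array $values)
    {
        $result = [];
        foreach ($values as $x) {
            if (function_exists('stats_cdf_exponential')) {
                // Delegate to the C extension when it is available
                // (its scale parameterization is an assumption here) ...
                $result[] = stats_cdf_exponential($x, 1 / $this->lambda, 1);
            } else {
                // ... otherwise fall back to the closed-form PHP version.
                $result[] = 1 - exp(-$this->lambda * $x);
            }
        }
        return $result;
    }
}

$dist = new ExponentialDistribution(0.5);
print_r($dist->cdf([1, 2, 4]));
?>
```

Callers see the same `cdf()` method either way; whether the computation happens in C or in PHP is an implementation detail hidden behind the API.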