Designing a PDL
I've introduced some important concepts such as frequency distribution, probability distribution, and goodness-of-fit testing. Now, I want to talk about an important tool you'll need to have at your disposal for ongoing probability-based modeling.
The Probability Distributions Library (PDL) can best be explained by a simple exercise -- constructing a feature list for a PDL. I'll begin building a PHP-based PDL, then provide some simple examples of how it can be used to construct probability models.
Before I built my own PHP-based PDL, I studied the feature set and code base of several existing PDLs. The two PDLs that most influenced my own approach belong to R and to the JSci package. Let's discuss their respective strengths and highlight the functional and source code features that I felt were important to incorporate into my own PHP-based PDL.
R probability distributions
R is part of the open source R Project and is a high-level interactive environment for performing statistical work. The API for using the probability distributions component is optimized for interactive statistical work in the sense that all commands are short (the average command length is six characters) and the naming conventions for accessing particular probability distribution functions are regular (distribution names are prefixed by d, p, q, or r to indicate what type of distribution function is being requested).
For example, to invoke the distribution functions for the Poisson distribution (which models some discrete random variables, such as a count of the number of events that occur in a certain time interval or spatial area), you would use these four commands: dpois(), ppois(), qpois(), and rpois().
Similarly, to invoke the distribution functions for the Exponential distribution (a relatively simple, commonly used distribution that models the behavior of units with a constant failure rate; more on this later in the article), you would use these four commands: dexp(), pexp(), qexp(), and rexp().
As you can see, not much typing is involved (a useful feature for interactive computing), so once you understand what the d, p, q, and r prefixes mean (which is not initially obvious), you can easily infer how to access corresponding distribution functions for other probability distributions.
The d prefix stands for Density Function and signals to R that you want the value of the distribution at a particular x value. For a discrete distribution this is the probability Prob[X = x]; for a continuous distribution it is the height of the density curve at x rather than a probability. In other words, given a contiguous range of x values, the density function will give you a corresponding range of values that could be used to graph the shape of the probability distribution for the supplied range of x values. Most textbook authors refer to this distribution function as the Probability Density Function or PDF.
The p prefix stands for Probability Function and signals to R that you want the probability that your random variable is less than or equal to some x value -- for example, Prob[X <= x]. This function is the one that users often care about the most because it is used in statistical tests to evaluate the probability of some observed outcome. Most textbook authors refer to this distribution function as the Cumulative Distribution Function or CDF.
The q prefix stands for Quantile Function and signals to R that you want the inverse of the Cumulative Distribution Function. In other words, given a probability value such as 0.05, it finds the x value such that Prob[X <= x] = 0.05. It is commonly used to find a "critical value" for your study outcome such that you will reject the null hypothesis if you observe a result greater than your critical value (for an upper-tail test at the 0.05 level, that is the x value for a probability of 0.95). Instead of calling this distribution function the Quantile Function as R does, I prefer to call it the Inverse Cumulative Distribution Function or Inverse CDF.
The r prefix stands for Random Number Generating Function and signals to R that you want it to generate a number or numbers distributed according to the specified distribution. It is very useful in simulation work.
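To make the four prefixes concrete, here is a minimal Python sketch of the d, p, q, and r functions for the exponential distribution. The function names are borrowed from the R commands purely for illustration; this is neither R's implementation nor the PHP library discussed in this article, and the rate parameter defaulting to 1 is an assumption of the sketch.

```python
import math
import random


def dexp(x, rate=1.0):
    """d: height of the exponential density curve at x."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def pexp(x, rate=1.0):
    """p: cumulative probability Prob[X <= x]."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def qexp(p, rate=1.0):
    """q: inverse CDF, the x for which Prob[X <= x] = p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p) / rate


def rexp(n, rate=1.0, rng=random):
    """r: n random draws, obtained by feeding uniform numbers to the inverse CDF."""
    return [qexp(rng.random(), rate) for _ in range(n)]


x = qexp(0.05)       # x value below which 5% of the distribution lies
print(pexp(x))       # feeding it back to the CDF recovers the probability
print(rexp(3))       # three simulated values
```

Note that the random number generator here is built on the inverse CDF, which is one reason a library benefits from keeping all four functions for a distribution side by side.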
A final noteworthy aspect of the R distribution functions is that they are vector oriented. If you supply more than one value in an argument slot, the function returns more than one value. For example, if you want the critical values from an exponential distribution corresponding to probabilities of 0.1, 0.05, and 0.01, you can simply type a command such as qexp(c(0.1, 0.05, 0.01)).
The command returns a list of three critical values, one for each probability supplied.
The vector orientation of the R distribution functions makes them convenient for both interactive and non-interactive use.
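As a rough illustration of what vector orientation means for a library author, the following Python sketch shows an inverse CDF for the exponential distribution that accepts either a single probability or a list of them. The name qexp mirrors the R command but is only a stand-in, and the default rate of 1 is an assumption of the sketch.

```python
import math


def qexp(p, rate=1.0):
    """Inverse CDF of the exponential distribution.

    Accepts one probability or a list/tuple of probabilities and
    returns a result of the same shape.
    """
    if isinstance(p, (list, tuple)):
        return [qexp(value, rate) for value in p]
    return -math.log(1.0 - p) / rate


print(qexp(0.05))               # one probability in, one value out
print(qexp([0.1, 0.05, 0.01]))  # three probabilities in, three values out
```

The same call works interactively for a single value and in batch code for a whole column of probabilities, which is the convenience the R functions provide.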
JSci probability distributions
JSci is a SourceForge project that espouses the following mission:
JSci is a set of open source Java packages. The aim is to encapsulate scientific methods/principles in the most natural way possible. As such, they should greatly aid the development of scientific-based software.
The JSci package is similar to R in that it implements a uniform set of distribution-related functions and offers a regular interface to them across all implemented probability distributions.
The JSci package does not, however, implement as many probability distributions as R; it does not include the random number generating functions for each distribution; and it is not vector oriented. From the point of view of coverage and functionality, the JSci package is not yet as extensive or powerful as the R statistical distributions library.
At the level of source code, though, the JSci package is a well-crafted object-oriented library of probability distributions. It is at the source-code level that the JSci package shines, and its architecture heavily influenced my own approach to coding a PHP-based library of probability distributions. Essentially, my design objective was to implement much of the same functionality as R's PDL, but to code it in a style more like the JSci approach.
The notable JSci source-code features I sought to emulate were the following:
- All probability distribution classes reside in the same directory.
- The probability distribution functions for a particular type of probability distribution reside in a single class file (for example, ...).
- All probability distributions extend an abstract ProbabilityDistribution.java class.
- The ProbabilityDistribution.java class defines a set of methods that all specific probability distribution types are expected to implement. The architecture is simple and provides a straightforward framework for extending the library to new probability distributions.
- The ProbabilityDistribution.java class also contains other helper methods that all specific distribution classes can use. The purity of the object is diluted a bit by adding these helper methods here -- as more methods non-specific to the Probability Distributions object are added, a separate class would likely be needed.
- The constructor method for each probability distribution allows you to set the slowly changing parameters for your probability distribution and use these instantiated parameters in subsequent calls to the class methods. You do not, for example, need to continue supplying the mean and standard deviation parameters once you have instantiated your distribution with these parameters. With R you must keep supplying these parameters in your function calls. This difference boils down to the fact that JSci is a more object-oriented implementation of a PDL while R's PDL is more function oriented (but implements a consistent method interface that all probability distribution functions are expected to adhere to).
- JSci has a more verbose, Java-like API for its PDL that is more functionally descriptive than the R API (what does runif() mean?), but it is not as well suited to interactive usage. In devising my own method name labels, I sought a functionally descriptive API that was as abbreviated as possible.
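The design features listed above can be sketched in a few lines. The following Python outline is an assumption-laden illustration, not JSci's actual code: the method names pdf, cdf, and inverse_cdf and the helper check_probability are hypothetical, chosen only to show an abstract base class, a shared helper, and a constructor that stores the distribution's parameters.

```python
import math
from abc import ABC, abstractmethod


class ProbabilityDistribution(ABC):
    """Abstract base class: names the methods every distribution must supply."""

    @abstractmethod
    def pdf(self, x):
        """Density (or probability mass) at x."""

    @abstractmethod
    def cdf(self, x):
        """Prob[X <= x]."""

    @abstractmethod
    def inverse_cdf(self, p):
        """The x for which Prob[X <= x] = p."""

    @staticmethod
    def check_probability(p):
        """Helper shared by all subclasses (hypothetical name)."""
        if not 0.0 <= p < 1.0:
            raise ValueError("p must satisfy 0 <= p < 1")


class ExponentialDistribution(ProbabilityDistribution):
    def __init__(self, rate=1.0):
        if rate <= 0:
            raise ValueError("rate must be positive")
        self.rate = rate  # set once, reused by every later method call

    def pdf(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0

    def cdf(self, x):
        return 1.0 - math.exp(-self.rate * x) if x >= 0 else 0.0

    def inverse_cdf(self, p):
        self.check_probability(p)
        return -math.log(1.0 - p) / self.rate


# The rate is supplied once in the constructor, not on every call.
failure_times = ExponentialDistribution(rate=2.0)
print(failure_times.cdf(1.0))
print(failure_times.inverse_cdf(0.5))
```

Compare this with the function-oriented style, where a call such as R's pexp() takes the rate as an argument each time. Adding a new distribution to the object-oriented version means writing one new subclass that fills in the abstract methods.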
A final consideration that inspired my decision to base the source-code architecture more on JSci than R is that the JSci package is released under the LGPL, which is more PHP-friendly than the GPL under which R is usually released.