Updating through conjugate priors
Suppose that you go live with your simple binary survey and collect the following responses.
- Four participants respond with a 1-coded answer (success events).
- Sixteen participants respond with a 0-coded answer (failure events).
People in the live survey responded with the same success proportion (k/n = 4/20 = 1/5 = .20) as people in the pilot survey (k/n = 1/5 = .20).
Look at the graph of the results expressed as a beta distribution with a=4 and b=16.
As you can see, the graph is starting to peak sharply around the parameter estimate .20 and the standard deviation of the parameter estimate is decreasing. The beta distribution reflects the fact that you have more data to learn from, so your estimate can be placed more firmly within a range of values. Interval estimates for the parameter, which Bayesian statistics calls credible intervals (the counterpart of frequentist confidence intervals), can also be computed (but I'll leave that as an exercise).
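The narrowing described above can be checked numerically from the standard closed-form mean and variance of a beta distribution. The following is a minimal sketch in Python rather than the article's own class; the helper names beta_mean and beta_sd are assumptions introduced for illustration.

```python
import math


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return math.sqrt(variance)


# (a, b) pairs taken from the text: pilot survey and live survey.
surveys = {"pilot": (1, 4), "live": (4, 16)}

for label, (a, b) in surveys.items():
    print(f"{label}: mean={beta_mean(a, b):.3f} sd={beta_sd(a, b):.3f}")
```

Both distributions have the same mean of .20, but the standard deviation for the live survey is roughly half that of the pilot survey, which is the sharper peak you see in the graph.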
I will conclude this discussion by demonstrating how easy it is to update the Bayes parameter estimate with new information by using the concept of a conjugate prior. In particular, I will look at how to combine the parameter estimate obtained from the test survey with the parameter estimate obtained from the live survey. Don't throw out the test survey data if it is representative of the population you want to draw inferences about and the test conditions remain the same.
Essentially, a conjugate prior allows you to represent the Bayes parameter estimation formula in incredibly simple terms using the beta parameters a and b:
a_posterior = a_live + a_test
b_posterior = b_live + b_test
a_posterior = 4 + 1 = 5
b_posterior = 16 + 4 = 20
Using the conjugate prior updating rule to combine the test and live survey parameter estimates, you pass a=5 and b=20 into our BetaDistribution class and plot the resulting probability distribution.
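The updating rule and the resulting density can be sketched in a few lines. This is an illustrative Python stand-in, not the article's BetaDistribution class; the function names combine_beta and beta_pdf are assumptions, and the density is evaluated with the standard beta formula through math.lgamma.

```python
import math


def combine_beta(prior, data):
    """Conjugate update: add the a and b parameters of the two summaries."""
    a_prior, b_prior = prior
    a_data, b_data = data
    return a_prior + a_data, b_prior + b_data


def beta_pdf(x, a, b):
    """Beta(a, b) density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_kernel = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
    return math.exp(log_norm + log_kernel)


test_survey = (1, 4)    # a_test, b_test from the text
live_survey = (4, 16)   # a_live, b_live from the text

a_post, b_post = combine_beta(test_survey, live_survey)
print(a_post, b_post)   # 5 20

# Evaluate the posterior density at a few points in place of a plot.
for x in (0.1, 0.2, 0.3, 0.5):
    print(f"x={x:.1f} density={beta_pdf(x, a_post, b_post):.3f}")
```

The printed densities are highest near .20 and fall away quickly toward .50, which is the shape the plotted posterior shows.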
This probability distribution represents the posterior estimate of θ. In accordance with Bayes theorem, you computed the posterior distribution for your parameter estimate P(θ | R) by combining the information about the parameter carried by your likelihood term with the information carried by your prior term.
You can summarize the test survey results through the parameters a_test=1 and b_test=4 in a beta prior distribution (P(θ) = Beta[1, 4]). You can summarize the live survey results through the parameters a_live=4 and b_live=16 in a beta likelihood distribution (that is, P(D | θ) = Beta[4, 16]).
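The reason these two summaries combine so simply can be sketched with the standard beta-binomial derivation. Here θ stands for the success probability being estimated, k=4 and n=20 are the live survey counts, and the reading of the article's Beta[4, 16] likelihood as shorthand for those counts (k successes, n − k failures) is an assumption of this sketch.

```latex
\begin{align*}
P(\theta) &\propto \theta^{a_{\mathrm{test}} - 1} (1 - \theta)^{b_{\mathrm{test}} - 1}
  && \text{(beta prior from the test survey)} \\
P(D \mid \theta) &\propto \theta^{k} (1 - \theta)^{n - k}
  && \text{(binomial likelihood of the live data)} \\
P(\theta \mid D) &\propto P(D \mid \theta)\, P(\theta)
  \propto \theta^{a_{\mathrm{test}} + k - 1} (1 - \theta)^{b_{\mathrm{test}} + (n - k) - 1} \\
\Rightarrow\quad \theta \mid D &\sim \mathrm{Beta}\bigl(a_{\mathrm{test}} + k,\; b_{\mathrm{test}} + (n - k)\bigr)
  = \mathrm{Beta}(1 + 4,\; 4 + 16) = \mathrm{Beta}(5, 20)
\end{align*}
```

Because the prior and the likelihood have the same functional form in θ, multiplying them only adds exponents, which is why the update reduces to adding success and failure counts.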
Combining these conjugate beta distributions (Beta[1, 4] + Beta[4, 16]) amounts to adding together the a and b parameters from both beta distributions. Similarly simple conjugate prior updating rules are available for Gaussian data (Normal-Wishart family of distributions) and multinomial data (Dirichlet family of distributions).
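For multinomial data, the analogous rule adds the observed category counts to the Dirichlet parameters. The sketch below assumes a three-choice survey question; the prior values and the counts are hypothetical numbers chosen only for illustration, and update_dirichlet is an assumed helper name.

```python
def update_dirichlet(prior_alphas, counts):
    """Conjugate update for multinomial data: add counts to the Dirichlet parameters."""
    if len(prior_alphas) != len(counts):
        raise ValueError("prior and counts must cover the same categories")
    return [alpha + count for alpha, count in zip(prior_alphas, counts)]


# Hypothetical example: a flat prior over three answer categories
# and made-up response counts for each category.
prior_alphas = [1, 1, 1]
counts = [3, 10, 7]

posterior_alphas = update_dirichlet(prior_alphas, counts)
total = sum(posterior_alphas)
posterior_means = [alpha / total for alpha in posterior_alphas]

print(posterior_alphas)
print([round(m, 3) for m in posterior_means])
```

With two categories this reduces to the beta rule used above, since a beta distribution is a Dirichlet distribution over two outcomes.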
The concept of conjugate priors is attractive from the point of view of implementing Bayes networks and imagining how you might propagate information from parent nodes to child nodes. If several parent nodes use the beta a and b parameters to represent information about some aspect of the world, then you may be able to propagate this information to downstream child nodes by simply summing parent node beta weights.
Another attractive feature of the conjugate prior updating rule is that it is recursive and, in the limit, can be used to tell you how to update your posterior probabilities on the basis of a single new observation (another exercise for you to think about).
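One way to see the recursive property is to apply the rule one response at a time: a 1-coded answer adds 1 to a, and a 0-coded answer adds 1 to b. The sketch below is a minimal illustration; update_one is an assumed helper name, and the order of the responses is arbitrary because only the counts matter.

```python
def update_one(a, b, response):
    """Update Beta(a, b) with a single binary response (1 = success, 0 = failure)."""
    if response == 1:
        return a + 1, b
    return a, b + 1


# Start from the test survey prior and feed in the live responses one by one.
a, b = 1, 4
live_responses = [1] * 4 + [0] * 16

for response in live_responses:
    a, b = update_one(a, b, response)

print(a, b)  # 5 20, the same result as the batch update
```

Each intermediate (a, b) pair is itself a valid posterior, so it can serve as the prior for the next observation.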
The use of conjugate priors is not, however, without its critics, who argue that the mindless use of conjugate priors abdicates a Bayesian's responsibility to use all information at his disposal to represent prior knowledge about a parameter. Just because the likelihood distribution can be represented using a beta sampling model does not mean that you also need to represent your prior knowledge with a beta distribution. Personally, I would discount these criticisms in the case of a simple binary survey because the beta distribution appears to be an appropriate way to represent the prior estimate of what value of p you might observe.
I conclude this section by comparing maximum likelihood estimators of p with Bayesian estimators of p. Both estimation techniques converge on p in the long run (they share similar asymptotic behavior). For the binary survey, the maximum likelihood estimate k/n is unbiased at every sample size, whereas a Bayesian estimate is pulled toward the prior when the sample is small. Maximum likelihood estimators are generally simpler to compute and are often preferred by statisticians when doing parameter estimation.
Bayesian estimators and MLE estimators differ in their small sample behavior as estimators. You should study convergence rates and bias measures to get a practical sense of how they might differ. Bayesian methods allow more flexibility in terms of how you might incorporate external information (through the prior probability distribution) into the parameter estimation process.
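To get a practical sense of the small-sample difference, you can compute the expected value of each estimator exactly by summing over every possible number of successes. The sketch below assumes the Bayesian estimator is the posterior mean (a + k)/(a + b + n) under the Beta[1, 4] test survey prior; the true value p = .5 and the sample sizes are hypothetical choices for illustration.

```python
from math import comb


def expected_estimates(p, n, a_prior, b_prior):
    """Exact expected values of the MLE (k/n) and the posterior mean for n binary trials."""
    mle_expectation = 0.0
    bayes_expectation = 0.0
    for k in range(n + 1):
        # Binomial probability of observing k successes in n trials.
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        mle_expectation += prob * (k / n)
        bayes_expectation += prob * (a_prior + k) / (a_prior + b_prior + n)
    return mle_expectation, bayes_expectation


true_p = 0.5  # hypothetical true proportion, far from the prior mean of .20
for n in (5, 50, 500):
    mle_exp, bayes_exp = expected_estimates(true_p, n, a_prior=1, b_prior=4)
    print(f"n={n}: E[MLE]={mle_exp:.3f} E[Bayes]={bayes_exp:.3f}")
```

The expected MLE stays at the true value for every n, while the expected Bayesian estimate starts closer to the prior mean and approaches the true value as n grows. When the prior mean happens to match the true value, the two agree in expectation.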