Implement Bayesian inference using PHP, Part 1
By Paul Meagher - 2004-04-21

Conditional probability and SQL

The conditional probability P(A | B) maps naturally onto database query operations. For example, the probability of cancer given a positive test result, P(+cancer | +test), can be obtained by issuing the following SQL query and then tallying how many rows in the result set have a cancer_status of '+cancer':

SELECT cancer_status FROM Data WHERE test_status='+test'
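The "query, then tally" step can be sketched in a few lines. The example below is a minimal illustration in Python using the standard-library sqlite3 module rather than PHP; the function name, the '-cancer' and '-test' values, and the six toy rows are assumptions made up for the example, not data from the article.

```python
import sqlite3


def p_cancer_given_positive_test(conn):
    """Estimate P(+cancer | +test) by tallying the rows the query returns."""
    rows = conn.execute(
        "SELECT cancer_status FROM Data WHERE test_status = '+test'"
    ).fetchall()
    if not rows:
        return None  # no positive tests recorded, so the ratio is undefined
    with_cancer = sum(1 for (status,) in rows if status == "+cancer")
    return with_cancer / len(rows)


# Hypothetical toy data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (cancer_status TEXT, test_status TEXT)")
conn.executemany(
    "INSERT INTO Data VALUES (?, ?)",
    [
        ("+cancer", "+test"),
        ("+cancer", "+test"),
        ("-cancer", "+test"),
        ("-cancer", "+test"),
        ("-cancer", "-test"),
        ("+cancer", "-test"),
    ],
)

print(p_cancer_given_positive_test(conn))
```

The WHERE clause plays the role of the "given" part of P(A | B): it restricts attention to the rows where B holds, and the tally measures how often A holds among them.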

If I gather information about how several boolean-valued tests co-vary with a boolean-valued diagnosis (such as cancer or no cancer), I can run slightly more complex queries to study how diagnostically useful other factors are in determining whether a patient has cancer, such as the following:

SELECT cancer_status FROM Data WHERE genetic_status='+' AND age_status='+' AND biopsy_status='+'
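Conditioning on several factors at once only changes the WHERE clause, so the same tally can be reused. The sketch below, again in Python with sqlite3, builds the clause from a dictionary of evidence. The helper name p_cancer_given, the column whitelist, the '+cancer' and '-cancer' values, and the toy rows are all assumptions for illustration.

```python
import sqlite3

# Column names cannot be bound as SQL parameters, so they are checked
# against a fixed whitelist before being placed into the query text.
EVIDENCE_COLUMNS = {"genetic_status", "age_status", "biopsy_status"}


def p_cancer_given(conn, evidence):
    """Estimate P(+cancer | evidence), where evidence maps column -> value."""
    unknown = set(evidence) - EVIDENCE_COLUMNS
    if unknown:
        raise ValueError(f"unknown evidence columns: {sorted(unknown)}")
    columns = sorted(evidence)
    sql = "SELECT cancer_status FROM Data"
    if columns:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in columns)
    rows = conn.execute(sql, [evidence[col] for col in columns]).fetchall()
    if not rows:
        return None  # no patients match the evidence
    return sum(1 for (status,) in rows if status == "+cancer") / len(rows)


# Hypothetical toy data: (cancer, genetic, age, biopsy).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data (cancer_status TEXT, genetic_status TEXT,"
    " age_status TEXT, biopsy_status TEXT)"
)
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?, ?)",
    [
        ("+cancer", "+", "+", "+"),
        ("+cancer", "+", "+", "+"),
        ("-cancer", "+", "+", "+"),
        ("-cancer", "+", "-", "+"),
        ("-cancer", "-", "+", "-"),
        ("+cancer", "-", "-", "+"),
    ],
)

all_three = {"genetic_status": "+", "age_status": "+", "biopsy_status": "+"}
print(p_cancer_given(conn, all_three))
print(p_cancer_given(conn, {"biopsy_status": "+"}))
```

Comparing the estimate for one factor with the estimate for several factors is one way to see how much diagnostic value each additional factor contributes.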

In the case of detecting e-mail spam, I might be interested in computing P(+spam | title_word='viagra' AND title_word='free'), which could be viewed as a directive to issue the following SQL query:

SELECT spam_status FROM Emails WHERE email_title LIKE '%viagra%' AND email_title LIKE '%free%'

After enumerating the number of e-mails that are spam and have "viagra" and "free" in the title (like so):

count_emails(spam_status='+spam' AND email_title LIKE '%viagra%' AND email_title LIKE '%free%')

and dividing by the overall number of e-mails with the words "viagra" and "free" in the title:

count_emails(email_title LIKE '%viagra%' AND email_title LIKE '%free%')

I might conclude that the appearance of these words in the title strongly and specifically co-varies with the message being spam (after all, 18/18 = 100 percent), and this rule might be used to filter such messages automatically.
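The two counts and their ratio can be computed directly with COUNT(*). The following is a minimal sketch in Python with sqlite3, assuming substring matching with % wildcards and a '-spam' label for non-spam; the function name and the sample titles are hypothetical.

```python
import sqlite3

# Shared condition: the title contains both words (substring match).
TITLE_CONDITION = "email_title LIKE '%viagra%' AND email_title LIKE '%free%'"


def p_spam_given_title_words(conn):
    """Estimate P(+spam | title contains 'viagra' and 'free') from counts."""
    matching = conn.execute(
        f"SELECT COUNT(*) FROM Emails WHERE {TITLE_CONDITION}"
    ).fetchone()[0]
    if matching == 0:
        return None  # no e-mail has both words, so the ratio is undefined
    spam_matching = conn.execute(
        f"SELECT COUNT(*) FROM Emails"
        f" WHERE spam_status = '+spam' AND {TITLE_CONDITION}"
    ).fetchone()[0]
    return spam_matching / matching


# Hypothetical training titles, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emails (spam_status TEXT, email_title TEXT)")
conn.executemany(
    "INSERT INTO Emails VALUES (?, ?)",
    [
        ("+spam", "Free Viagra now"),
        ("+spam", "viagra for free"),
        ("+spam", "FREE VIAGRA!!!"),
        ("-spam", "Is the free viagra offer a scam?"),
        ("-spam", "Lunch on Friday"),
        ("+spam", "Free money"),
    ],
)

print(p_spam_given_title_words(conn))
```

SQLite's LIKE is case-insensitive for ASCII letters by default, which is why "FREE VIAGRA!!!" matches here; other databases may need an explicit LOWER() call.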

In Bayesian spam filtering, you first need to train the software by telling it which e-mails are spam and which are not. One can imagine storing spam_status information with each e-mail record (for example, email_id, spam_status, email_title, and email_message) and running the previous queries and counts on this data to decide whether to forward a new e-mail to your inbox.
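As a rough illustration of that train-then-decide loop, the sketch below stores labeled e-mails in a table with the four fields mentioned above and applies the same counting rule to the title of a new e-mail. It is written in Python with sqlite3, and it is a deliberately simple rule rather than a full Bayesian filter. The function names, the 0.9 threshold, the '-spam' label, and the choice to deliver when there is no evidence are all assumptions.

```python
import re
import sqlite3


def open_training_db():
    """Create an in-memory table holding labeled training e-mails."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Emails ("
        " email_id INTEGER PRIMARY KEY,"
        " spam_status TEXT NOT NULL,"
        " email_title TEXT NOT NULL,"
        " email_message TEXT NOT NULL)"
    )
    return conn


def train(conn, spam_status, title, message):
    """Record one e-mail together with its spam label."""
    conn.execute(
        "INSERT INTO Emails (spam_status, email_title, email_message)"
        " VALUES (?, ?, ?)",
        (spam_status, title, message),
    )


def spam_fraction_for_title(conn, title):
    """Fraction of training e-mails containing all title words that are spam."""
    # Only letters and digits are kept, so no LIKE wildcard can slip in.
    words = sorted(set(re.findall(r"[a-z0-9]+", title.lower())))
    if not words:
        return None
    where = " AND ".join("email_title LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    total = conn.execute(
        f"SELECT COUNT(*) FROM Emails WHERE {where}", params
    ).fetchone()[0]
    if total == 0:
        return None  # nothing comparable in the training data
    spam = conn.execute(
        f"SELECT COUNT(*) FROM Emails WHERE spam_status = '+spam' AND {where}",
        params,
    ).fetchone()[0]
    return spam / total


def deliver_to_inbox(conn, title, threshold=0.9):
    """Deliver unless the estimated spam fraction reaches the threshold."""
    fraction = spam_fraction_for_title(conn, title)
    return fraction is None or fraction < threshold


conn = open_training_db()
train(conn, "+spam", "free viagra", "placeholder body")
train(conn, "+spam", "viagra free shipping", "placeholder body")
train(conn, "-spam", "team meeting notes", "placeholder body")

print(deliver_to_inbox(conn, "Free Viagra"))
print(deliver_to_inbox(conn, "meeting notes"))
```

Because the match is on substrings, a word such as "free" also matches "freedom"; a real filter would tokenize titles and combine per-word evidence instead of requiring every word to co-occur.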


First published by IBM developerWorks

Copyright 2004-2017 All rights reserved.
Article copyright and all rights retained by the author.