Spam filtering techniques
By David Mertz, Ph.D. - 2004-04-06

Summary and Resources

Given the testing methodology described earlier, let's look at the concrete testing results. While I do not present any quantitative data on speed, the table is arranged in order of speed, from fastest to slowest: trigrams are fast, while Pyzor (a network lookup) is slow. In evaluating techniques, as I stated, I consider false positives very bad, and false negatives only slightly bad. The quantities in each cell give the number of correctly identified messages vs. incorrectly identified messages for each technique tested against each body of e-mail, good and spam.

Table 1. Quantitative accuracy of spam filtering techniques

Technique        Good corpus                  Spam corpus
                 (correctly identified vs.    (correctly identified vs.
                 incorrectly identified)      incorrectly identified)
"The Truth"      1851 vs. 0                   1916 vs. 0
Trigram model    1849 vs. 2                   1774 vs. 142
Word model       1847 vs. 4                   1819 vs. 97
SpamAssassin     1846 vs. 5                   1558 vs. 358
Pyzor            1847 vs. 0 (4 err)           943 vs. 971 (2 err)
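
To compare the techniques on the two kinds of mistakes discussed above, the counts in Table 1 can be converted into false-positive and false-negative rates. The following is a minimal sketch rather than part of the original test tooling. The names RESULTS and error_rates are illustrative, and it assumes that the "err" messages reported for Pyzor still count toward the corpus total while being neither correct nor incorrect.

```python
# Counts copied from Table 1 as (correct, incorrect, errors) per corpus.
# Assumption: "err" messages stay in the denominator but are not counted
# as either correct or incorrect classifications.
RESULTS = {
    "Trigram model": {"good": (1849, 2, 0), "spam": (1774, 142, 0)},
    "Word model":    {"good": (1847, 4, 0), "spam": (1819, 97, 0)},
    "SpamAssassin":  {"good": (1846, 5, 0), "spam": (1558, 358, 0)},
    "Pyzor":         {"good": (1847, 0, 4), "spam": (943, 971, 2)},
}


def error_rates(row):
    """Return (false_positive_rate, false_negative_rate) as fractions."""
    good_ok, good_bad, good_err = row["good"]
    spam_ok, spam_bad, spam_err = row["spam"]
    # False positive: a good message wrongly identified as spam.
    false_positive = good_bad / (good_ok + good_bad + good_err)
    # False negative: a spam message wrongly identified as good.
    false_negative = spam_bad / (spam_ok + spam_bad + spam_err)
    return false_positive, false_negative


if __name__ == "__main__":
    for name, row in RESULTS.items():
        fp, fn = error_rates(row)
        print(f"{name:14s} false positives {fp:7.2%}  false negatives {fn:7.2%}")
```

Read this way, every technique keeps false positives well under one percent of the good corpus, while the false-negative rates differ widely, which matches the emphasis on treating false positives as the more serious failure.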

Resources

  • The TMDA home page provides more information about the Tagged Message Delivery Agent.

  • You can get more information about ChoiceMail from DigitalPortal Software.

  • Pyzor is a Python-based distributed spam catalog/filter.

  • Vipul's Razor is a very popular distributed spam catalog/filter. Razor is optionally called by a number of other filter tools, such as SpamAssassin.

  • Read Paul Graham's essay "A Plan for Spam."

  • Eric Raymond has created a fast implementation of Paul Graham's idea under the name "bogofilter." In addition to using some efficient data representation and storage strategies, bogofilter tries to be smart about identifying what makes a meaningful word.

  • My own trigram-based categorization tools are still at an early alpha or prototype level. However, you are welcome to use them as a basis for development. They are public domain, like all the tools I write for developerWorks articles.

  • Lawrence Lessig has written a number of books and articles that insightfully contrast what he metonymically calls "west-coast code" and "east-coast code," in other words, the laws passed in Washington D.C. (and elsewhere) versus the software written in Silicon Valley (and elsewhere). I've written a short review of Lessig's Code and Other Laws of Cyberspace. See Lessig's Web site for more to think about.

  • Find more Linux articles in the developerWorks Linux zone.




First published by IBM developerWorks


Article copyright and all rights retained by the author.