Spam Filtering Techniques 4. Rule-based rankings

Spam filtering techniques

By David Mertz, Ph.D. - 2004-04-06 Page: 1 2 3 4 5 6 7 8 9

4. Rule-based rankings

The most popular tool for rule-based spam filtering, by a good margin, is SpamAssassin. There are other tools, but they are not as widely used or actively maintained. SpamAssassin (and similar tools) evaluate a large number of patterns -- mostly regular expressions -- against a candidate message. Some matched patterns add to a message score, while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam; otherwise it is considered legitimate.

Some ranking rules are fairly constant over time -- forged headers and auto-executing JavaScript, for example, almost timelessly mark spam. Other rules need to be updated as the products and scams advanced by spammers evolve. Herbal Viagra and heirs of African dictators might be the rage today, but tomorrow they might be edged out by some brand new snake-oil drug or pornographic theme. As spam evolves, SpamAssassin must evolve to keep up with it.

The README for SpamAssassin makes some very strong claims:

In its most recent test, SpamAssassin differentiated between spam and non-spam mail correctly in 99.94% of cases. Since then, it's just been getting better and better!

My testing showed nowhere near this level of success. Against my corpora, SpamAssassin had about 0.3% false positives and a whopping 19% false negatives. In fairness, this only evaluated the rule-based filters, not the optional checks against distributed blacklists. Additionally, my spam corpus is not purely spam -- it also includes a large collection of what are probably virus attachments (I do not open them to check for sure, but I know they are not messages I authorized). SpamAssassin's FAQ disclaims responsibility for finding viruses; on the other hand, the below techniques do much better in finding them, so the disclaimer is not all that compelling.

SpamAssassin runs much quicker than distributed blacklists, which need to query network servers. But it also runs much slower than even non-optimized versions of the below statistical models (written in interpreted Python using naive data structures).

View Spam filtering techniques Discussion

Page: 1 2 3 4 5 6 7 8 9 Next Page: 5. Bayesian word distribution filters

First published by IBM developerWorks