3. Distributed adaptive blacklists
Spam is almost by definition delivered to a large number of recipients. And as a matter of practice, there is little if any customization of spam messages to individual recipients. Each recipient of a spam, however, in the absence of prior filtering, must press his own "Delete" button to get rid of the message. Distributed blacklist filters let one user's Delete button warn millions of other users as to the spamminess of the message.
Tools such as Razor and Pyzor (see Resources) operate around servers that store digests of known spams. When a message is received by an MTA, a distributed blacklist filter is called to determine whether it is a known spam. These tools use clever statistical techniques to create digests, so that minor or automated mutations of a spam (or simply different headers resulting from different transport routes) do not prevent recognition of the message's identity. In addition, maintainers of distributed blacklist servers frequently create "honey-pot" addresses specifically for the purpose of attracting spam (but never used for any legitimate correspondence). In my testing, I found zero false positive spam categorizations by Pyzor, and I would not expect any to occur with other similar tools, such as Razor.
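To see why small mutations do not defeat recognition, consider a digest that hashes a normalized sample of the body rather than the raw bytes. The sketch below is my own toy illustration of that general idea -- it is not Pyzor's or Razor's actual digest specification -- but it shows how stripping whitespace, discarding long (often per-recipient) tokens, and hashing only selected positions makes two superficially different copies of the same spam collide on one digest.

```python
import hashlib

def toy_spam_digest(body: str) -> str:
    """Toy fuzzy digest (NOT the real Pyzor spec): normalize the body so
    that trivial mutations -- extra whitespace, long random tokens --
    do not change the digest, then hash a sample of the result."""
    lines = []
    for line in body.splitlines():
        # Drop whitespace and very long "words", which are often
        # per-recipient junk inserted to defeat exact-match filters.
        words = [w for w in line.split() if len(w) <= 12]
        if words:
            lines.append("".join(words).lower())
    if not lines:
        return hashlib.sha1(b"").hexdigest()
    # Hash lines sampled at fixed relative positions in the body,
    # so mutations elsewhere in the message leave the digest intact.
    sample = lines[len(lines) // 5] + lines[(3 * len(lines)) // 5]
    return hashlib.sha1(sample.encode("utf-8")).hexdigest()
```

Given this digest, a client would ask the blacklist server how many times the digest has been reported; a count above some threshold marks the message as spam.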
There is some common sense to this. Even those ill-intentioned enough to taint legitimate messages would not have samples of my good messages to report to the servers -- it is generally only the spam messages that are widely distributed. It is conceivable that a widely sent but legitimate message, such as the developerWorks newsletter, could be misreported, but the maintainers of distributed blacklist servers would almost certainly detect this and quickly correct such problems.
As the summary table below shows, however, false negatives are far more common using distributed blacklists than with any of the other techniques I tested. The authors of Pyzor recommend using the tool in conjunction with other techniques rather than as a single line of defense. While this seems reasonable, it is not clear that such combined filtering will actually produce many more spam identifications than the other techniques by themselves.
In addition, since distributed blacklists require talking to a server to perform verification, Pyzor performed far more slowly against my test corpora than did any of the other techniques. For testing a trickle of messages, this is no big deal, but for a high-volume ISP it could be a problem. I also experienced a couple of network timeouts for each thousand queries, so my results have a handful of "errors" in place of "spam" or "good" identifications.
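Because every lookup crosses the network, a robust caller has to treat a timed-out query as its own outcome rather than silently defaulting to "spam" or "good". A minimal sketch of that bookkeeping, with a hypothetical `query_server` callable standing in for the real Pyzor/Razor client (assumed here to return True when the message's digest is listed):

```python
import socket

def classify_with_blacklist(message: str, query_server) -> str:
    """Classify a message via a distributed-blacklist lookup,
    recording a network failure as "error" rather than guessing.
    `query_server` is a stand-in for a real blacklist client call."""
    try:
        return "spam" if query_server(message) else "good"
    except (socket.timeout, OSError):
        # A timed-out or failed query is evidence of neither spam nor
        # legitimacy; tally it separately, as in the tests above.
        return "error"
```

Tallying "error" separately is what keeps a flaky network from inflating either the false positive or the false negative count in a test corpus.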