4. Rule-based rankings
The most popular tool for rule-based spam filtering, by a good margin, is SpamAssassin. There are other tools, but they are not as widely used or actively maintained. SpamAssassin (and similar tools) evaluate a large number of patterns -- mostly regular expressions -- against a candidate message. Some matched patterns add to a message score, while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as spam; otherwise it is considered legitimate.
The README for SpamAssassin makes some very strong claims:
In its most recent test, SpamAssassin differentiated between spam and non-spam mail correctly in 99.94% of cases. Since then, it's just been getting better and better!
My testing showed nowhere near this level of success. Against my corpora, SpamAssassin had about 0.3% false positives and a whopping 19% false negatives. In fairness, this only evaluated the rule-based filters, not the optional checks against distributed blacklists. Additionally, my spam corpus is not purely spam -- it also includes a large collection of what are probably virus attachments (I do not open them to check for sure, but I know they are not messages I authorized). SpamAssassin's FAQ disclaims responsibility for finding viruses; on the other hand, the below techniques do much better in finding them, so the disclaimer is not all that compelling.
SpamAssassin runs much quicker than distributed blacklists, which need to query network servers. But it also runs much slower than even non-optimized versions of the below statistical models (written in interpreted Python using naive data structures).