Hiding contact information
For some e-mail users, a reasonable, sufficient, and very simple approach to avoiding spam is simply to guard e-mail addresses closely. For these people, an e-mail address is something to be revealed only to selected, trusted parties. As extra precautions, an e-mail address can be chosen to avoid easily guessed names and dictionary words, and addresses can be disguised when posting to public areas. We have all seen e-mail addresses cutely encoded in forms like "<mertzHIDDEN@NOSPAM.gnosis.cx>" or "echo firstname.lastname@example.org | tr A-Za-z N-ZA-Mn-za-m".
In addition to hiding addresses, a secretive e-mailer often uses one or more of the free e-mail services for "throwaway" addresses. If you need to transact e-mail with a semi-trusted party, a temporary address can be used for a few days, then abandoned along with any spam it might thereafter accumulate. The real "confidantes only" address is kept protected.
In my informal survey of discussions of spam on Web-boards, mailing lists, the Usenet, and so on, I've found that a category of e-mail users gains sufficient protection from these basic precautions.
For me, however -- and for many other people -- these approaches are simply not possible. I have a publicly available e-mail address, and have good reasons why it needs to remain so. I do utilize a variety of addresses within the domain I control to detect the source of spam leaks; but the unfortunate truth is that most spammers get my e-mail address the same way my legitimate correspondents do: from the listing at the top of articles like this, and other public disclosures of my address.
Looking at filtering software
This article looks at filtering software from a particular perspective. I want to know how well different approaches work in correctly identifying spam as spam and desirable messages as legitimate. For purposes of answering this question, I am not particularly interested in the details of configuring filter applications to work with various Mail Transfer Agents (MTAs). There is certainly a great deal of arcana surrounding the best configuration of MTAs such as Sendmail, QMail, Procmail, Fetchmail, and others. Further, many e-mail clients have their own filtering options and plug-in APIs. Fortunately, most of the filters I look at come with pretty good documentation covering how to configure them with various MTAs.
For purposes of my testing, I developed two collections of messages: spam and legitimate. Both collections were taken from mail I actually received in the last couple of months, but I added a significant subset of messages up to several years old to broaden the test. I cannot know exactly what will be contained in next month's e-mails, but the past provides the best clue to what the future holds. That sounds cryptic, but all I mean is that I do not want to limit the patterns to a few words, phrases, regular expressions, etc. that might characterize the very latest e-mails but fail to generalize to the two types.
In addition to the collections of e-mail, I developed training message sets for those tools that "learn" about spam and non-spam messages. The training sets are both larger and partially disjoint from the testing collections. The testing collections consist of slightly fewer than 2000 spam messages, and about the same number of good messages. The training sets are about twice as large.
A general comment on testing is worth emphasizing. False negatives in spam filters just mean that some unwanted messages make it to your inbox. Not a good thing, but not horrible in itself. False positives are cases where legitimate messages are misidentified as spam. This can potentially be very bad, as some legitimate messages are important, even urgent, in nature, and even those that are merely conversational are ones we do not want to lose. Most filtering software allows you to save rejected messages in temporary folders pending review -- but if you need to review a folder full of spam, the usefulness of the software is thereby reduced.