1. Basic structured text filters
The e-mail client I use can sort incoming e-mail based on simple strings found in specific header fields, in the header in general, and/or in the body. The feature is very simple and does not even include regular expression matching. Almost all e-mail clients offer this much filtering.
Over the last few months, I have developed a fairly small number of text filters. These few simple filters correctly catch about 80% of the spam I receive. Unfortunately, they also have a relatively high false positive rate -- enough that I need to manually examine some of the spam folders from time to time. (I sort probable spam into several different folders, and I save them all to develop message corpora.) Although exact details will differ among users, a general pattern will be useful to most readers:
- Set 1: A few people or mailing lists do funny things with their headers that get them flagged on other rules. I catch something in the header (usually the From:) and whitelist it (either to INBOX or somewhere else).
- Set 2: In no particular order, I run the following spam filters:
  - Identify a specific bad sender.
  - Look for "<>" as the From: header.
  - Look for "@<" in the header (lots of spam has this for some reason).
  - Look for "Content-Type: audio". Nothing I want has this, only viruses (your mileage may vary).
  - Look for "euc-kr" or "ks_c_5601-1987" in the headers. I can't read Korean, but for some reason I get a huge volume of Korean spam (of course, for an actual Korean reader, this isn't a good rule).
- Set 3: Store messages to known legitimate addresses. I have several such rules, but they all just match a literal To: field.
- Set 4: Look for messages that have a legit address in the header, but that weren't caught by the previous To: filters. I find that when I am only in the Bcc: field, it's almost always an unsolicited mailing to a list of alphabetically sequential addresses (mertz1@..., mertz37@..., etc).
- Set 5: Anything left at this point is probably spam (it probably has forged headers to avoid identification of the sender).
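
The five rule sets above can be sketched as a single ordered cascade. What follows is a minimal illustration in Python, not the configuration of any particular mail client: the function names (`split_header`, `header_field`, `classify`), the three list parameters, and every folder name are hypothetical. In keeping with the limits described earlier, it uses only plain substring tests and no regular expressions. It also assumes unfolded header lines; a real message may wrap a header across several lines.

```python
def split_header(raw):
    """Split a raw message into (header, body) at the first blank line."""
    raw = raw.replace("\r\n", "\n")
    header, _, body = raw.partition("\n\n")
    return header, body


def header_field(header, name):
    """Return the value of the first 'name:' header line, or '' if absent.

    Assumption: header lines are not folded across multiple lines.
    """
    prefix = name.lower() + ":"
    for line in header.split("\n"):
        if line.lower().startswith(prefix):
            return line[len(prefix):].strip()
    return ""


def classify(raw, whitelist, bad_senders, my_addresses):
    """Return a (hypothetical) folder name for one raw message."""
    header, _body = split_header(raw)
    lower = header.lower()
    sender = header_field(header, "From").lower()
    to = header_field(header, "To").lower()

    # Set 1: whitelist senders whose odd headers would trip later rules.
    if any(w.lower() in sender for w in whitelist):
        return "inbox"

    # Set 2: simple spam tests on the header, in no particular order.
    if any(b.lower() in sender for b in bad_senders):
        return "spam-sender"
    if sender == "<>":
        return "spam-empty-from"
    if "@<" in header:
        return "spam-malformed"
    if "content-type: audio" in lower:
        return "spam-audio"
    if "euc-kr" in lower or "ks_c_5601-1987" in lower:
        return "spam-korean"

    # Set 3: mail addressed directly to a known legitimate address.
    if any(a.lower() in to for a in my_addresses):
        return "inbox"

    # Set 4: my address appears somewhere in the header but not in To:,
    # which suggests a Bcc'd bulk mailing.
    if any(a.lower() in lower for a in my_addresses):
        return "spam-bcc"

    # Set 5: anything left is probably spam.
    return "spam-unknown"
```

Because the first matching rule wins, the ordering carries the policy: a whitelisted sender is delivered even if a later spam test would also have matched, and a message addressed directly to a known address is still diverted if it trips one of the Set 2 tests first.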