Spell-Checking Spam and Procmail RBL (Friday, June 23, 2006)

Unfortunately, my university e-mail account is inundated with spam. Because my university's e-mail system is inherently insecure, I forward e-mail from that account to a special account on this server, so that I don't have to pass credentials to the university in the clear. For local accounts, I have a variety of measures to block spam, such as Sender Policy Framework (SPF) and Real-time Blackhole Lists (RBL). However, these do not work with forwarded accounts. SPF will not work at all, because the connecting IP address is always the university's and will never match the actual source. Similarly, the RBL isn't useful, because the IP address that it would look at is the university's and not that of the actual sender. Finally, I can't reject the spam in Exim anyway, because that would just bounce the message back to the university rather than blocking the original message.

Therefore, I have come up with a few techniques for blocking spam after it has been routed but before it has been delivered, using procmail. The first script I wrote checks the spelling of the incoming message: if too many words have g.a.r.b.a.g.e in them, r3pl4c3 letters with numbers, or are simply misspelled, the message gets flagged as spam. I've found that if 10% or more of the words are misspelled, the message is almost certainly spam, and that this sort of filter catches the vast majority of the spam that comes into my university account.
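
To give a rough idea of the spelling check, here is a minimal Python sketch. It is not my actual script: the function names, the punctuation handling, and the way the dictionary is passed in as a plain set of lowercase words are all assumptions made for illustration. Only the 10% threshold comes from the observation above.

```python
# Hypothetical sketch of a misspelling-ratio spam check.
# The word list is supplied by the caller (for example, loaded from a
# system dictionary file); nothing here is tied to a particular spell checker.

SPAM_THRESHOLD = 0.10  # the 10% figure mentioned in the post

# Punctuation stripped from the ends of each token only, so that
# "g.a.r.b.a.g.e" keeps its inner dots and fails the dictionary lookup.
EDGE_PUNCTUATION = ".,;:!?\"'()[]<>"


def misspelled_ratio(text, dictionary):
    """Return the fraction of tokens in text that are not in dictionary."""
    tokens = [t.strip(EDGE_PUNCTUATION) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    misspelled = sum(1 for t in tokens if t.lower() not in dictionary)
    return misspelled / len(tokens)


def looks_like_spam(text, dictionary, threshold=SPAM_THRESHOLD):
    """Flag the message when the misspelled fraction reaches the threshold."""
    return misspelled_ratio(text, dictionary) >= threshold
```

Obfuscated words such as "m.e.d.s" or "ch34p" simply count as misspellings here, because they never match a dictionary entry. A real filter would probably also want to skip e-mail addresses, URLs, and proper names before counting.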

For the rest of the spam, I came up with a way to check the RBL from a script. I pass the headers of the message to the script, which extracts all of the IP addresses from the message, and uses the Unix utility "host" to query the RBL zones for each IP address.
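
The RBL check can be sketched along the same lines. Again, this is an illustration under stated assumptions rather than my script: it uses Python's socket.gethostbyname for the DNS query instead of the "host" utility, the function names are made up, and the zone "rbl.example.org" is a placeholder rather than a real list. The one thing that is standard is the query format: the octets of the IP address are reversed and prepended to the RBL zone name.

```python
import re
import socket

# Matches dotted-quad candidates; the octet range is checked separately.
IPV4_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def extract_ips(headers):
    """Return the unique IPv4 addresses found in headers, in order of appearance."""
    found = []
    for match in IPV4_RE.finditer(headers):
        octets = match.groups()
        if all(int(o) <= 255 for o in octets):
            ip = ".".join(octets)
            if ip not in found:
                found.append(ip)
    return found


def rbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed octets followed by the zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone):
    """Return True if the RBL zone has an address record for this IP.

    Any lookup failure, including a plain "no such host" answer,
    is treated as "not listed".
    """
    try:
        socket.gethostbyname(rbl_query_name(ip, zone))
    except OSError:
        return False
    return True


def any_listed(headers, zones):
    """Check every address in the headers against every zone."""
    return any(is_listed(ip, zone)
               for ip in extract_ips(headers)
               for zone in zones)
```

With the placeholder zone, rbl_query_name("192.0.2.15", "rbl.example.org") gives "15.2.0.192.rbl.example.org". In practice the addresses of the forwarding hosts themselves (and any private ranges) would need to be skipped, since they show up in the Received headers of every message.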

I may eventually clean up these scripts and package them up for download. For now, if you're interested, contact me, and I'll be happy to share them with you.

