I'm collecting here a set of rules that I think would make an effective spam filter.
The (flexible) requirements for this are...
- Simple, for my own sanity. I prefer simple determinism to complex pattern matching and probability.
- Valid mail doesn't always have to be recognised as valid, but when it isn't, the owner needs to be informed.
- Server-based, as I check my mail from at least 3 different machines, and don't want to have to update client-based filters for all of these.
So far I have...
Filter out patterns that we are certain are spam:
- URLs, or parts of URLs (an ongoing process :-/ )
- Known headers, such as From: fields
- Possibly anti-spam-filtering techniques, e.g. spam has started to contain "bad" HTML tags in large quantities.
Whitelist any address/subject regexps that I know to be not-spam. This includes all friends and acquaintances, and any mailing lists I am on. Similarly, any unique e-mail addresses I give out to websites should be on this list. This whitelist should be split up into several sublists - one for each mail header/category to match.
What would be handy here is some kind of link between my address book, and the whitelist, with the latter being automatically updated by the former. However, this may assume client-side interaction, and an alternative but similar method may be needed.
- Let anything through that is "In-Reply-To" something. I note the spammers haven't started using this yet. Maybe I shouldn't list this one here ;) A more advanced version could keep some kind of table of previously-seen message IDs, and check response headers against it, though.
Update: Some spammers now include this header, but very very few - maybe one or two every fortnight.
- Anything that doesn't match any of the above gets filtered into a "pending/junk" directory (rather than being deleted) for me to check at my own leisure. This contains both spam that doesn't meet the spam-blacklist rules and personal mail that doesn't meet the whitelist rules. I'm now seriously thinking of issuing an e-mail for each of these, containing instructions for the sender to add themselves to the whitelist. This could be done via e-mail (i.e. "please forward your mail to whitelist@... to be added to my whitelist"), a web interface, or both. The ratio of spam to personal mail is high, so this may be worth it. If lots of random people were e-mailing me, this would be less desirable.
I split the idea of sending a return mail to people trapped by the filter into two mental blocks - spammers and "real" mails. I don't like the idea of giving out whitelist instructions to the former (although most spam is faked anyway), but it means I have to consider the way in which people add themselves to the list. This shouldn't be scriptable, although I doubt a spammer would bother to follow the instructions just to mail me. Even then, they'd have to do it for every mail they wanted to send me, and I could remove them from the whitelist pretty easily.
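The rules so far can be sketched as one small decision function. This is only an illustration in Python, not the filter I actually run: every pattern, address and Message-ID below is a made-up placeholder, and the In-Reply-To check uses the "more advanced" table of known message IDs rather than letting any reply through.

```python
import re
from email import message_from_string

# Placeholder rules: every pattern and address here is made up.
# Each dict is split into sublists, one per header (or "raw" for the whole mail).
BLACKLIST = {
    "raw": [re.compile(r"https?://[^\s\"'>]*cheap-pills\.example", re.I)],
    "From": [re.compile(r"@bulk-mailer\.example", re.I)],
}
WHITELIST = {
    "From": [re.compile(r"friend@example\.org", re.I)],
    "Subject": [re.compile(r"^\[some-list\]", re.I)],
}
# Hypothetical table of Message-IDs from mail I have sent myself.
SENT_IDS = {"<abc123@example.net>"}


def matches(rules, msg, raw):
    """True if any pattern in any sublist matches its header (or the raw text)."""
    for field, patterns in rules.items():
        text = raw if field == "raw" else (msg.get(field) or "")
        if any(p.search(text) for p in patterns):
            return True
    return False


def classify(raw):
    """Return 'spam', 'inbox' or 'pending' for one raw message."""
    msg = message_from_string(raw)
    if matches(BLACKLIST, msg, raw):
        return "spam"
    if matches(WHITELIST, msg, raw):
        return "inbox"
    # Let replies through, but only replies to something I actually sent.
    if (msg.get("In-Reply-To") or "").strip() in SENT_IDS:
        return "inbox"
    return "pending"  # the "pending/junk" directory, checked by hand
```

The order matters: the blacklist is checked first, so a whitelisted address cannot rescue a mail containing a known spam URL.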
Found this interesting article by Declan McCullagh on "Challenge-Response" mails, and the patenting of these systems.
2003-10-13. Slashdot article on new anti-spam techniques that spider links sent in mails. This would a) be able to use the linked-to website to work out whether the mail was spam or not, and b) possibly DDoS the spam site. The problem with b) is that it would DDoS legitimate sites, too.
a), however, is interesting, and got me thinking. Firstly, I can do a lookup on various addresses in a mail. This would solve one problem - misconfigured email accounts on the same server as mine. Currently, if another host on the same server isn't catching all of its domain's mail, then mail for graham@thatdomain will get to me. So I can do a lookup on the To: address, and if it matches my own IP, but isn't to my domain, then I can bin it.
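A rough sketch of that To: lookup, in Python. The IP address is a placeholder from the documentation range, and the function name is invented for illustration. It also simplifies things by resolving the domain's plain address record with `socket.gethostbyname`, whereas real mail routing goes via MX records, so treat it as the shape of the check rather than a finished rule.

```python
import socket
from email.utils import parseaddr

MY_DOMAINS = {"exmosis.net"}  # my own domain, from the text
MY_IP = "192.0.2.10"          # placeholder address for my server


def misdelivered(to_header, resolve=socket.gethostbyname):
    """True if the To: domain resolves to my server but is not my domain."""
    _, addr = parseaddr(to_header)
    domain = addr.rpartition("@")[2].lower()
    if not domain or domain in MY_DOMAINS:
        return False
    try:
        return resolve(domain) == MY_IP
    except OSError:
        # Lookup failed (socket.gaierror is an OSError): leave the mail
        # for the other rules rather than binning it here.
        return False
```

The `resolve` parameter exists only so the check can be tried out with a fake lookup table instead of real DNS.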
Secondly, I'm wondering if there's any kind of correlation between DNS entries - widely-used admin contacts or something, but I'm guessing, due to the nature of spammers, that there isn't. It would have to be registrar information I think, as hackers are now masking spam site DNS lookups, apparently. The problem then is the amount of info available from the registrar, which is (rightfully) decreasing, these days.
Thinking that some kind of table lookup, created from a definite blacklist, may help to filter the grey list. The problem now is this - filtering the grey list. The blacklist works, but needs constant "training", and the whitelist works on the whole (maybe 1 spam every week or so, probably using a randomAddress@myDomain, or setting an In-Reply-To: field). But 99.99% of the stuff in the grey list is spam, so I want to get more of it into the black list...
Quick update, so I don't forget. I'm now thinking of splitting the grey list into a more active approach. Anything that...
- Has content-type: text/html or multipart/alternative and...
- Contains a URL
will be sent a challenge e-mail. Everything else will be greylisted, but the greylist should be a lot more manageable. Spam will get through if it's plain text, and doesn't contain a URL, which is very little of it - little enough not to worry about.
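The two conditions are easy to test with Python's standard `email` module. A minimal sketch, assuming that "contains a URL" means an `http://`, `https://` or bare `www.` string somewhere in a text part of the body (that definition, and the names below, are my own choices for illustration):

```python
import re
from email import message_from_string

# Assumed definition of "a URL"; deliberately crude.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.I)
RICH_TYPES = {"text/html", "multipart/alternative"}


def needs_challenge(raw):
    """True if the mail is HTML or multipart/alternative AND contains a URL."""
    msg = message_from_string(raw)
    if msg.get_content_type() not in RICH_TYPES:
        return False  # plain text and everything else goes to the greylist
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable before searching.
        payload = part.get_payload(decode=True) or b""
        if URL_RE.search(payload.decode("latin-1")):
            return True
    return False
```

Only the body parts are searched, so a URL in a header such as List-Unsubscribe does not trigger a challenge on its own.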
The challenge will also allow automatic whitelisting, with a "reply to this message to whitelist yourself, and include the message (generally quoted) for it to be sent to me" kind of message. This will alert anyone that's sending me non-plaintext, URLed messages (which may include website mailing lists that I haven't explicitly whitelisted, although these tend to go to a different address).
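One way the reply-to-whitelist step might look, sketched in Python. The random token embedded in each challenge is my own assumption (the idea above only says "reply to this message"), and the in-memory dict and set stand in for whatever storage the server would really use.

```python
import secrets
from email.utils import parseaddr

pending = {}       # token -> address the challenge was sent to
whitelist = set()  # addresses that have answered a challenge


def issue_challenge(from_header):
    """Record a challenge for this sender; return the token to put in the mail."""
    token = secrets.token_hex(8)
    pending[token] = parseaddr(from_header)[1].lower()
    return token


def accept_reply(from_header, token):
    """Whitelist the sender if the reply carries a token issued to them."""
    sender = parseaddr(from_header)[1].lower()
    if pending.get(token) != sender:
        return False  # unknown or already-used token, or a different sender
    del pending[token]  # each token works once
    whitelist.add(sender)
    return True
```

When `accept_reply` returns True, the reply itself (with the original message quoted in it) would then be delivered as normal.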
Note: I also intend to make the blacklists public, so people can check what is being blacklisted. There doesn't seem to be any point in making the whitelist available.
Update 2004-01-19. I've been baffled recently by spam mails sent to a variety of seemingly-random e-mail addresses at exmosis.net. For example, "3F2E6ABB.4080309 at exmosis.net". A hint on a mailing list notes that spammers harvest mailing list archives, which include such strings as Message-IDs, and so the spam bot ends up mailing them. Luckily, the not-so-random nature of them means that they're easy to regexp out:
:0 Hh
* ^To:.*<[a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]\.[0-9][0-9][0-9][0-9][0-9][0-9][0-9]@exmosis\.
$SPAMDIR
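For trying the pattern against sample headers without touching the live filter, the same expression can be written in Python. This is only a test harness, with an invented function name: it uses the same character classes (eight letters or digits, a dot, seven digits) and `re.IGNORECASE`, because procmail conditions are case-insensitive unless the `D` flag is given.

```python
import re

# Same pattern as the recipe above, with the repeats written as {8} and {7}.
MSGID_TO = re.compile(r"^To:.*<[a-z0-9]{8}\.[0-9]{7}@exmosis\.", re.I | re.M)


def looks_like_harvested_id(headers):
    """True if a To: line addresses a Message-ID-shaped local part."""
    return MSGID_TO.search(headers) is not None
```

Note that, like the recipe, this only matches addresses written inside angle brackets.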
Nice to find out where they're from, though.
2004-01-20. Noted a regexp blacklist that might be of use.
2004-04-22. Arse. Been hit by spam sent to a bunch of random words at my domain name. Are spammers reading my page, or do they just bank on the fact that people use catch-alls at their domains? I'm going to have to re-think something or other.
(See also: The Spam Infrastructure)