Wednesday, October 04, 2006

Automatic Detection of Spam Hand Jobs

Sometimes certain anti-spam ideas just hit you upside the head when you least expect them and seem so obvious you wonder what took you so long to figure it out.

I've already blogged about the fact that I've stopped all automated spam dead in its tracks on my sites, but people posting manually can of course correct all of the errors detected and still make an unwanted garbage post.

I have an extensive junk detection filter that rejects anything with the usual suspects like viagra, cialis, gambling, poker, etc. which stops the nastiest of these posts. However, some little pain in the ass SEO aka spammer might slip thru with a hand job posting about his store in India selling magic beetle dung or something that you would never imagine putting in your junk filter in the first place.

A few days ago I decided to review the last 30 days of legitimate submissions and compare them to the few off topic hand jobs that slipped through the cracks, to see if I could come up with anything that would stop hand jobs about absolutely random and crazy things outside the realm of the typical auto-spam posts.

Then, like a lightning bolt, it suddenly hit me: with these random off topic hand spams it's not what's IN the posts, it's what's NOT in the posts that makes them easily identifiable. The concept is to scan for a list of words that SHOULD be in the post, like quotes from anything in the thread or certain keywords related to the topic, and automatically set everything that doesn't fit the usual posting patterns to MODERATE.
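Here's a rough sketch of that check in Python. It is not my actual filter: the function names, the hand-supplied set of expected words, and the rule that "quoting the thread" means repeating a run of four consecutive words from an earlier post are all assumptions made for illustration.

```python
import re

def words(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def quotes_thread(post, thread_posts, run_length=4):
    """True if the post repeats any run of `run_length` consecutive
    words from an earlier post in the thread (assumed quote rule)."""
    post_words = words(post)
    post_runs = {tuple(post_words[i:i + run_length])
                 for i in range(len(post_words) - run_length + 1)}
    for earlier in thread_posts:
        earlier_words = words(earlier)
        for i in range(len(earlier_words) - run_length + 1):
            if tuple(earlier_words[i:i + run_length]) in post_runs:
                return True
    return False

def needs_moderation(post, expected_words, thread_posts=()):
    """Send a post to moderation when it contains none of the
    expected words and quotes nothing from the thread."""
    has_expected = bool(set(words(post)) & set(expected_words))
    return not (has_expected or quotes_thread(post, thread_posts))
```

With expected words like {"spam", "bot"}, a post about a store in India selling beetle dung would come back as needing moderation, while one that mentions a bot or quotes an earlier post would go straight through.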

Basically it's a 'lack of content filtering' technique and off topic posts, like spam, stand out like a sore thumb.

Using this blog as an example for a topic, you would expect most comments to contain words like bot, spam, IP, host, crawl, firewall, htaccess, apache, etc. or a set of keywords derived from the original post title and text. If a post contains none of these words, that's a clue that it just might be SPAM or otherwise off topic and should be placed on moderation for the admin to review.
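One way the keyword list might be put together, sketched in Python. The topic words are the ones listed above; everything else (the `derive_keywords` and `is_on_topic` names, the tiny stopword list, the four-letter minimum for derived words) is an assumption for the sake of the example, not what runs on my sites.

```python
import re

# Hand-picked topic words from the paragraph above.
TOPIC_WORDS = {"bot", "spam", "ip", "host", "crawl",
               "firewall", "htaccess", "apache"}

# Tiny illustrative stopword list (assumption; a real one would be longer).
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from",
             "have", "are", "was", "you", "not", "but"}

def derive_keywords(title, body, min_length=4):
    """Pull candidate keywords out of the original post title and text."""
    tokens = re.findall(r"[a-z0-9]+", (title + " " + body).lower())
    return {t for t in tokens
            if len(t) >= min_length and t not in STOPWORDS}

def is_on_topic(comment, keywords):
    """True if the comment contains at least one expected keyword."""
    tokens = set(re.findall(r"[a-z0-9]+", comment.lower()))
    return bool(tokens & keywords)

# Combine the fixed list with words derived from the post itself.
keywords = TOPIC_WORDS | derive_keywords(
    "Automatic Detection of Spam", "lack of content filtering")
```

Anything where `is_on_topic(comment, keywords)` comes back False would get set to moderate. Note this matches whole words only, so "bots" won't match "bot"; prefix matching or stemming would be the obvious next tweak.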

Since I've started using this new 'lack of content filtering' technique it's snared the few hand submissions to my other site that were completely off topic, those that I would've deleted immediately. The beauty is I can continue to leave the posting wide open for humans, not moderate everything, with only those posts that don't match the topic getting instantly set to moderate.

I expect a few false positives, but so far 'lack of content filtering' is doing exactly what I expected it to do. Last night it automatically set a couple of crap submissions to moderate, shit like "zanaflex information" (apparently some pill I've never heard of) and "News, Stores, People, Careers at Finditt" (some wannabe search engine), while letting 20 on topic things thru without a hitch.

Another automated weapon in the war on spam!

4 comments:

Olliver said...

I like the idea behind this "negative match" filtering, but I wonder whether it would work reliably with short posts too. Additionally some freaks could mix up their spam message with relevant keywords from the article to circumvent the filter and "blast their ads to millions of blogs" ;-)

Olliver

IncrediBILL said...

It doesn't work with most short posts.

However, those are mostly noise posts like "AMEN BROTHER!" and nobody cares about those.

Sure the spammer could inject relevant keywords assuming they knew what I was looking for, which isn't the case at the moment.

For now, I'll remain MUM and happy ;)

IncrediBILL said...

I'm loving this shit...

Last night the usual Indian SEO suspects tried to post some spam about some Dallas company and it put all their crap in moderation so nobody saw it but me and I gleefully zapped that shit this morning.

BTW, here's the off topic SEO spammer's info:

IP: 203.115.81.14 14-Corpcustomer.pacenet-india.com

SEO SNAFU said...

Great post. But you failed to mention where to get the magic beetle dung at the end.