Saturday, January 07, 2006

Bot Blocking Blacklist vs Whitelist

Since launching my scraper stopper, all sorts of little bots have been caught and categorized. The sheer number of them is staggering, and it doesn't seem to stop, with new ones popping up daily.

My initial pass at blocking bots was the automated snare: stop them in real time, then let me review each catch and install a permanent IP block if I wanted, or simply add the crawler's user agent string to my blacklist and block any occurrence from any IP, or all of the above.

Then I had an epiphany: this blacklist approach is simply too much work. The number of bots out there is more than any one person can sort out and ban, even with the assistance of automation to stop them in real time.

My new approach, sweet and simple, is just the opposite: a whitelist. Now all bots are banned by default, and only the ones I deem worthy get added to the whitelist after the fact.

Compare the two approaches:

  • Previously: blacklisted bots numbered in the hundreds and growing
  • Currently: whitelisted bots number fewer than 10

So it seems to me that the best policy is to block everything, log the user agent strings requesting access, and only let in the ones you want. Do it the other way around and you'll just be spinning your wheels chasing every new scraper or beta search engine (and there are a lot of them) that hits the internet.
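
To make the policy concrete, here's a rough sketch in Python of what a default-deny check might look like, assuming the request has already been identified as coming from a crawler. The whitelist entries, function name, and log file are examples only, not my actual setup.

    # Rough sketch of the whitelist policy: every crawler is denied by
    # default, its user agent gets logged for review, and only the
    # handful of agents explicitly approved are let through.  The
    # whitelist entries and log file name are examples only.

    import logging

    logging.basicConfig(filename="blocked_agents.log", level=logging.INFO)

    # The short list of crawlers deemed worthy (example tokens).
    WHITELIST = ("Googlebot", "msnbot", "Slurp")

    def allow_crawler(user_agent):
        """Default deny: only whitelisted crawler user agents get through."""
        if any(token in user_agent for token in WHITELIST):
            return True
        # Everything else is blocked and logged for later review.
        logging.info("blocked crawler user agent: %s", user_agent)
        return False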

3 comments:

Anonymous said...

I've looked at similar issues - how are you making the initial decision of 'is this a bot?'
The number of bots that are masquerading as IE is scary; it's fairly easy to pick them out by eye, as they don't download images or CSS and - most significantly - they download the pages too darned fast!
Yes, a neural net could pick this up fairly quickly, but I'm curious as to what approach you've taken, as I haven't yet had time to really fix their bot butts :)
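
For illustration only, a toy sketch (Python, with invented names and thresholds) of the kind of check described - a client that claims to be a browser but never fetches images or CSS and pulls pages faster than a human could read them:

    # Toy sketch of the heuristic above: flag an IP that requests pages
    # too fast and never asks for images or CSS.  Thresholds and names
    # are invented for illustration.

    import time
    from collections import defaultdict

    PAGES_PER_MINUTE = 10                     # plausible human ceiling
    ASSET_EXTENSIONS = (".css", ".js", ".gif", ".jpg", ".png")

    page_hits = defaultdict(list)             # ip -> page request times
    asset_ips = set()                         # ips that fetched any asset

    def suspicious(ip, path):
        now = time.time()
        if path.lower().endswith(ASSET_EXTENSIONS):
            asset_ips.add(ip)                 # real browsers do this
            return False
        # Keep only page requests from the last minute.
        page_hits[ip] = [t for t in page_hits[ip] if now - t < 60] + [now]
        too_fast = len(page_hits[ip]) > PAGES_PER_MINUTE
        no_assets = ip not in asset_ips and len(page_hits[ip]) > 3
        return too_fast or no_assets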

IncrediBILL said...

I can't disclose all of the things I use when profiling a bot or the scrapers might wise up and make it harder to tell.

What you said about it being easy to "pick them by eye" is exactly what I've tried to translate into software by building a robot profile.

Simply put: what files do they read, what files don't they read, how many pages, how fast or slow and over what total duration, do they hit my spider trap pages, do they look at robots.txt, do they scan a special seeded page of links in sequential order, etc., etc.

I don't think I've snared any humans yet but if someone is stupid enough to read my robots.txt file all bets are off.
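
Purely for illustration (and not my actual code), here's a toy scoring sketch in Python of the kind of robot profile I'm talking about - the signals are the ones mentioned above, but the weights and threshold are made up:

    # Toy robot profile: score a visitor on spider trap hits, robots.txt
    # reads, sequential walks of a seeded link page, missing image/CSS
    # requests, and sheer speed.  Weights and threshold are invented.

    from dataclasses import dataclass

    @dataclass
    class VisitorProfile:
        pages_fetched: int = 0
        duration_seconds: float = 1.0
        fetched_assets: bool = False          # ever requested images/CSS?
        read_robots_txt: bool = False
        hit_spider_trap: bool = False
        walked_seeded_links_in_order: bool = False

    def bot_score(p):
        score = 0
        if p.hit_spider_trap:
            score += 5                        # humans never see those links
        if p.walked_seeded_links_in_order:
            score += 4
        if p.read_robots_txt:
            score += 3                        # browsers don't ask for it
        if not p.fetched_assets and p.pages_fetched > 5:
            score += 2
        if p.pages_fetched / max(p.duration_seconds, 1.0) > 1:
            score += 2                        # faster than a page a second
        return score

    def is_bot(p, threshold=5):
        return bot_score(p) >= threshold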

Anonymous said...

Ah, much the same as me, then :)
The robots.txt viewing is a challenge - 'we' are liable to look at other people's, to see what they are doing, and then may be treated as a bot :(
(Thanks for the answer, BTW :))