Monday, April 17, 2006

Blocked Spiders DO NOT Go Away

Other so-called "bot blockers" make the bold, albeit naive, claim that scrapers just go away after you deny them a few pages, which is complete and utter BULLSHIT!

Some of the scrapers being blocked on my server have been set to BANNED for months now, haven't gotten a single page of value, yet they just keep coming over and over, attempting to get pages they remember regardless of the outcome.

Most bot blockers I've reviewed just set speed traps or page limits and then throw a captcha in the scraper's face to make it go away for a brief period of time, maybe a few hours, maybe a day or two. Many scrapers will come back over and over and get another chunk of pages each time they return. The stakes are high and the scrapers want your content badly, so putting silly little bandages on your website as a short-term solution does not cure the long-term problem.

The only way to truly stop them is to profile their behavior over time, which is why my bot blocker throws all first-time suspected bot IPs into QUARANTINE. Once an IP makes it into quarantine it is immediately suspended for 24 hours and then challenged as soon as it returns to the web site after those 24 hours. This stops repeat offenders from getting any pages whatsoever when they return, and also protects against accidentally and permanently blocking a DHCP address that was used to scrape only once. After a couple of repeated scrape attempts without breaking through the challenge, which a human can easily do, the IP is escalated from quarantine to BANNED, which no longer presents challenges and just gives error messages on repeat visits.
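For anyone who wants to picture the quarantine-then-ban flow, here is a rough Python sketch. It is not my actual blocker, just an illustration under some assumptions: the class and method names (BotBlocker, flag_suspect, handle_request, challenge_result) are made up, "a couple" of failed challenges is taken to mean two, passing a challenge is assumed to clear the IP, and the bot detection and captcha themselves are left out entirely.

```python
import time
from dataclasses import dataclass

SUSPEND_SECONDS = 24 * 60 * 60  # 24-hour suspension, as described above
MAX_FAILED_CHALLENGES = 2       # assumption: "a couple" means two


@dataclass
class Record:
    """Per-IP bookkeeping (hypothetical structure)."""
    state: str = "QUARANTINE"
    quarantined_at: float = 0.0
    failed_challenges: int = 0


class BotBlocker:
    def __init__(self):
        self.records = {}  # ip -> Record

    def flag_suspect(self, ip, now=None):
        """Put a first-time suspected bot IP into quarantine."""
        now = time.time() if now is None else now
        if ip not in self.records:
            self.records[ip] = Record(quarantined_at=now)

    def handle_request(self, ip, now=None):
        """Decide what to do with a request from this IP."""
        now = time.time() if now is None else now
        rec = self.records.get(ip)
        if rec is None:
            return "ALLOW"      # never suspected
        if rec.state == "BANNED":
            return "ERROR"      # no more challenges, just errors
        if now - rec.quarantined_at < SUSPEND_SECONDS:
            return "DENY"       # still inside the 24-hour suspension
        return "CHALLENGE"      # back after 24 hours: challenge at once

    def challenge_result(self, ip, passed):
        """Record the outcome of a challenge shown to a quarantined IP."""
        rec = self.records.get(ip)
        if rec is None or rec.state == "BANNED":
            return
        if passed:
            # Assumption: a human who solves the challenge is released.
            del self.records[ip]
            return
        rec.failed_challenges += 1
        if rec.failed_challenges >= MAX_FAILED_CHALLENGES:
            rec.state = "BANNED"
```

The point of the two-step design is that a one-time offender on a recycled DHCP address only ever sees a 24-hour suspension and one solvable challenge, while an IP that keeps coming back and failing ends up with nothing but error pages.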

Not rocket science but it has a lot more finesse than some of the more simplistic methods others employ and better hardens the site against repeated attempts at scraping.
