Many weeks and log-file combings after The Great Anti-Scrape Off started, it's become quite obvious that the effort was an enormous success.
The last bit of technology, deployed a couple of nights ago to challenge robots masquerading as humans, seems to be stopping the last of them, so it would appear that my site is now reasonably safe from typical crawlers and bots.
If someone has access to 10,000 IP addresses, all bets are off, but most scraping and crawling operations, except those that appear to be hiding behind AOL, seem to have fairly limited resources.
The last couple of tricks deployed include (a rough sketch of the general idea follows the list):
- Multiple checkpoint profiling to identify bots masquerading as humans
- Randomized challenge techniques with anti-blow-thru detection so that the typical CAPTCHA-defeating techniques won't work
- Adaptive time monitoring for hit-and-run bots that seem to think they can grab small chunks at a time and come back later, under the radar, for the next chunk
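If I were sketching the general shape of multi-checkpoint profiling combined with adaptive time monitoring in Python, it might look something like the following. The signal names, weights, and thresholds here are made-up assumptions for illustration, not the actual rules running on my site:

```python
import time
from collections import defaultdict, deque

# Illustrative only: these signals, weights, and thresholds are assumptions,
# not the real rules in use.
CHECKPOINT_WEIGHTS = {
    "no_cookie_returned": 2,       # client never sends the session cookie back
    "no_js_beacon": 2,             # a JavaScript-set marker never shows up
    "uniform_timing": 3,           # requests arrive at machine-regular intervals
    "challenge_blown_through": 5,  # challenge answered faster than a human could
}
BLOCK_THRESHOLD = 6                # combined score at which deep pages are refused
DRIP_WINDOW = 7 * 24 * 3600        # look back a week for "hit and run" chunking
DRIP_PAGE_LIMIT = 200              # pages allowed in that window before flagging


class VisitorProfile:
    """Accumulates bot-like signals and request history for one visitor."""

    def __init__(self):
        self.score = 0
        self.hits = deque()        # timestamps of content-page requests

    def record_signal(self, signal):
        """Add the weight for a failed checkpoint (a bot-like signal)."""
        self.score += CHECKPOINT_WEIGHTS.get(signal, 0)

    def record_hit(self, now):
        """Track request times so slow, chunked crawls still add up."""
        self.hits.append(now)
        while self.hits and now - self.hits[0] > DRIP_WINDOW:
            self.hits.popleft()

    def is_blocked(self):
        return self.score >= BLOCK_THRESHOLD or len(self.hits) > DRIP_PAGE_LIMIT


profiles = defaultdict(VisitorProfile)


def should_serve(visitor_id, failed_signals):
    """Decide whether to serve a deep content page to this visitor."""
    profile = profiles[visitor_id]
    for signal in failed_signals:
        profile.record_signal(signal)
    profile.record_hit(time.time())
    return not profile.is_blocked()
```

The nice property of scoring across several checkpoints is that no single test has to be perfect; a bot has to look human at every step, not just one.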
There may be other things going on out there in the wonderful world of scraping, but it would take a fairly sophisticated scraper to bust through what's currently in place.
The technology seems to work fine so far with up to 30K page views a day, but it would be interesting to see how it would perform with 100K or 1M page views a day. What might be a bit challenging is that the current implementation does quite a bit of database churn, but for a medium-sized site that only means tracking a few hundred visitors at any given time, which is fairly insignificant.
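To make the churn point a little more concrete, here's a rough sketch with an assumed SQLite table (not my actual schema) showing why the tracking stays small: rows for idle visitors can be pruned aggressively, so the table only ever holds the visitors active right now.

```python
import sqlite3
import time

# Assumed table and column names, purely for illustration.
ACTIVE_WINDOW = 30 * 60  # forget visitors idle for more than 30 minutes

conn = sqlite3.connect("tracker.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS visitor_activity (
           visitor_id TEXT PRIMARY KEY,
           score      INTEGER NOT NULL DEFAULT 0,
           last_seen  REAL NOT NULL
       )"""
)


def touch(visitor_id, score_delta=0):
    """Update (or create) the visitor's row on every tracked request."""
    now = time.time()
    cur = conn.execute(
        "UPDATE visitor_activity SET score = score + ?, last_seen = ? "
        "WHERE visitor_id = ?",
        (score_delta, now, visitor_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO visitor_activity (visitor_id, score, last_seen) "
            "VALUES (?, ?, ?)",
            (visitor_id, score_delta, now),
        )
    conn.commit()


def prune():
    """Drop idle visitors so the table stays at a few hundred rows."""
    cutoff = time.time() - ACTIVE_WINDOW
    cur = conn.execute(
        "DELETE FROM visitor_activity WHERE last_seen < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount
```

In principle the table grows with concurrent visitors rather than total traffic, which is why the churn stays insignificant even as page views climb.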
In the end, my scraper stopper only protects my database of content, so any crawler can access about 10 pages without question, such as the home page, about us, contact us, etc.; beyond those it will be painfully obvious to them that they are being blocked from delving deeper into the site.
The benefit of this approach is that all of the hard rules used to block access to the full content don't break access to other technologies like the RSS feed, which appears to have all sorts of crappy homegrown readers that don't identify themselves. However, when those greedy homegrown readers try to behave like a crawler and step into the site to grab the content linked from the RSS feed, they are blocked unless expressly whitelisted.
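Put together, the gatekeeping described above amounts to something like the sketch below. The paths, whitelist entries, and user-agent strings are placeholders for illustration; the real lists obviously look different:

```python
# Placeholders for illustration; the real lists look different.
PUBLIC_PATHS = {
    "/",          # home page
    "/about",     # about us
    "/contact",   # contact us
    "/privacy",
    "/feed.xml",  # the RSS feed itself stays open to anonymous readers
}

EXPRESSLY_WHITELISTED = {
    "FriendlyBot/1.0",  # crawlers that have asked for, and been granted, access
}

BLOCKED_NOTICE = (
    "You are probably seeing this page because you appear to be an "
    "unauthorized crawler. Legitimate crawlers may contact the webmaster "
    "and petition for access."
)


def serve(path, user_agent, looks_human):
    """Serve top-level pages and the feed to anyone; gate the deep content."""
    if path in PUBLIC_PATHS:
        return render(path)
    if looks_human or user_agent in EXPRESSLY_WHITELISTED:
        return render(path)
    return BLOCKED_NOTICE


def render(path):
    """Stand-in for the real page renderer."""
    return f"<html><body>Content for {path}</body></html>"
```

The key point is that the feed URL itself sits on the open list, so the anonymous readers keep working; it's only when they follow the links out of the feed that the gate slams shut.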
The additional benefit of allowing a handful of top-level pages to be crawled is that the web site doesn't automatically drop out of view of lesser search engines or up-and-coming technology, which would happen with a harsher approach using the .htaccess file to block all access to any pages. Additionally, there is a nice message on every blocked page letting them know they're probably seeing it because they are an unauthorized crawler, and that legitimate crawlers may contact the webmaster and petition for access.
Basically, my web site has become OPT-OUT to any aggregators, crawlers, or scraping thieves, and now they will need to ask for permission to be let inside and profit from my work. Assuming it's a mutually beneficial proposition, I'm sure I'll let them crawl the site.
Now comes the million-dollar question: whether to convert this to PHP and attempt to find a market, or just keep it under wraps and much less conspicuous so the scrapers can't study what I've done and find any loopholes in the technology to exploit.
One final thought:
Could you imagine an entire internet that is OPT-OUT from crawlers?
It would be a severe challenge for the next Google to crawl the web and prove its technology!