Wednesday, December 28, 2005

High Speed Scrapers Steal More Than Pages

The common thinking about scrapers is they just steal your content and make money off your hard work. However, the damages can be worse and more immediate if they overload your web server and stop other traffic from accessing your site while they scrape. Some of them are so aggressive it's practically a DOS (denial of service) attack until they get what they want.

The problem for dynamic database-driven websites (like mine that I'm trying to protect) is that they tend to need more CPU resources than garden-variety static websites, so a high speed scraper, and sometimes even a regular search engine bot, can easily overload a server's CPU. This can quickly escalate to the point that the web server is queued up with so many page requests that it is unable to respond to new ones and appears "offline" for seconds, minutes or even hours, depending on just how aggressive a spider gets with its page requests.

Worst case, it may crash your server altogether under the strain.

The net result is that sites like mine that survive on advertising revenue, such as Google AdSense, suffer total income loss during temporary bot-induced service outages. Visitors who would normally be clicking on ads are sitting there waiting for pages that will never display. So these bots do more direct damage to your pocket than just wasting bandwidth and stealing content for their own sites: the monetary losses can be quite immediate and, left unchecked, potentially devastating.

Prior to launching my new spider-trap/scraper blocker there were days when these scrapers cut the site's income by as much as 20%-30% in a single day. This was possible because they were pummeling the server at night while nobody was watching, and significant amounts of high-income-producing traffic coming from other time zones were lost.

This may be more of an issue to some webmasters than others but website scrapers, other than legitimate search engines, just need to be put out of business as they provide zero value and do nothing but steal.

Protect your site today with this nifty PHP scraper blocker Alex Kemp has written.

Many of you might find Alex's tool very useful to integrate into your web sites and stop copyright theft and income loss today.


baraqyal said...

Doesn't your webserver have some sort of DOS protection too? Or do the requests come from multiple IP addresses? 20-30% is pretty serious, I didn't realize it was that bad.

IncrediBILL said...

I think you missed the point that the requests come so fast it's ALMOST like a DOS.

The requests don't have to come with nearly the volume of a DOS to overload a database driven web site, especially when you consider there are already a bunch of customers online as well.

Drop a couple of hundred requests in 2 seconds and it will take the server some time to work through them.

20% only happened a couple of times when some Asian offloaders hammered the server over a period of several hours.

It's not a problem anymore as a handful of requests in a couple of seconds automatically blocks them.

Problem solved.
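
A rule of that kind can be sketched roughly as follows. This is a minimal illustration in Python, not the actual blocker described here: the class name `BurstBlocker`, the threshold of 5 requests and the 2 second window are all assumptions picked to match "a handful of requests in a couple of seconds".

```python
import time
from collections import defaultdict, deque


class BurstBlocker:
    """Block a client that sends too many requests within a short window."""

    def __init__(self, max_requests=5, window_seconds=2.0):
        # Hypothetical thresholds, not the real settings of any blocker.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()              # ips that tripped the rule

    def allow(self, ip, now=None):
        """Return True if the request should be served, False if blocked."""
        if now is None:
            now = time.monotonic()
        if ip in self.blocked:
            return False
        stamps = self.recent[ip]
        stamps.append(now)
        # Forget requests that have fallen out of the time window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_requests:
            # Too many requests in the window: block from now on.
            self.blocked.add(ip)
            del self.recent[ip]
            return False
        return True
```

Each page request would call `allow(ip)` before doing any database work, so a blocked scraper costs almost no CPU. A visitor clicking a link every second or so never fills the window, while a burst of requests a fraction of a second apart trips it almost immediately.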

baraqyal said...

So if you block requests that are that quick, do you also end up blocking search engine spiders?

IncrediBILL said...

Nope, search engine spiders have a free pass built into the software.

It's actually a bit more complicated than that as I'm blocking fast scrapes, slow scrapes, etc.
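
How the free pass is implemented isn't spelled out here. One common approach, shown below as a hedged Python sketch, is to confirm a claimed spider with a reverse DNS lookup followed by a forward lookup. The function name `is_trusted_crawler`, the suffix list and the injectable lookup parameters are assumptions for illustration only.

```python
import socket

# Assumed list of crawler hostname suffixes; adjust to the engines you trust.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_trusted_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a trusted crawler host
    and that host resolves forward to the same ip."""
    # The lookups are parameters so they can be swapped out or cached.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # The forward check stops a scraper from faking its reverse DNS record.
        return ip in forward_lookup(host)
    except OSError:
        # No DNS record or lookup failure: no free pass.
        return False
```

A check like this would run before any rate rule, and its result would normally be cached per IP so the DNS lookups don't slow down every request.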

I've come up with a set of profiling rules that seem to pretty accurately detect real visitors vs. robots, but there is obviously a bit of a fuzzy area.

Can't tell all the signals I'm looking at as the scrapers could adapt.

What I can tell you is BLOGS need to be redesigned because you just can't stop someone scraping a single page once a day - piss poor design.

Maybe a white paper some day ;)

Reg Adkins said...

Hey Bill,
I've been getting some requests on scraper blockers in connection to a post I made "10 Tips for Utterly Destroying Your Blog..."

I'm referring folks to your "High Speed Scrapers" post. Are you planning an update or a follow up any time soon?

IncrediBILL said...

Reg, the whole blog is pretty much an update!

I'm just starting to sort it out with labels in the new blogger so it'll be easier to find all the scraper related posts.