Sunday, May 21, 2006

Publicly Available Website

That's the buzzword you hear most often when you confront someone crawling your site, especially a corporation: it's a "Publicly Available Website".

Well, just because something is publicly available doesn't mean you have the right to do whatever you like with it. It's publicly available for the PUBLIC, meaning visitors, to read individual pages, and it's also available to the 6 search engines that I permit to crawl my site. Other than that, just like any other publicly available business, I have the RIGHT TO RESTRICT ACCESS to anyone else I choose.

For instance, many brick and mortar businesses say "No Shoes, No Shirt, No Service".

Well, my website has similar rules: "No Humans, No Permission, No Service".

If I even get a whiff of a robot on the site, permission denied.

You corporate and private scrapers just better get over your loser mantra, because putting a website online, even on a public network, does NOT give everyone complete access to do whatever they feel like with the site. There are terms of service on that site which distinctly prohibit the use of unauthorized tools to crawl it, and if you have to ask what's authorized then you don't have permission in the first place, so go away.

The site doesn't have a "GNU Free Documentation License". Instead it has one of those funny things called a "copyright", which means I own it, not YOU. Additionally, I pay for the server, not YOU. Which means it's up to ME what is and isn't allowed, even when it's a "Publicly Available Website", NOT YOU!

Let's make it so simple even a 2 year old can understand it:

The website is MINE! MINE! MINE! ALL MINE! and NOT YOURS!

Is that language clear enough for the mental midgets scraping the web to comprehend?


Jayw said...

Hey Bill,

Quick question:

How in the world do you block scrapers coming from AOL?

Those rotating proxies are a bugger, and as much as I would like to block all of AOL, I can't because they are 20% of my overall traffic.


IncrediBILL said...

Nothing you can do but detect bad behavior in real-time and put a temporary block on the IP until the scraping stops, and then unlock the IP for the next AOLer.

Using captchas, the first human to come along on that IP will unlock it, but if it's the scraper, its behavior will get the IP blocked again.

The more an IP misbehaves, the longer each quarantine lasts, until it's perhaps locked out for a whole day.
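
For anyone who wants to try this, here is a rough Python sketch of the idea described above: rate-based detection, a temporary per-IP block, a captcha unlock, and quarantine times that grow with repeat offenses. It is only an illustration under stated assumptions, not the code running on this site. The class name `Quarantine`, the thresholds (`max_hits`, `window`, `base_block`), the doubling rule, and treating "too many requests in a short window" as the bad behavior are all hypothetical choices.

```python
import time
from dataclasses import dataclass, field


@dataclass
class IpState:
    hits: list = field(default_factory=list)  # timestamps of recent requests
    strikes: int = 0                          # how many times this IP was quarantined
    blocked_until: float = 0.0                # quarantine ends at this time


class Quarantine:
    """Hypothetical per-IP quarantine with escalating block times."""

    def __init__(self, max_hits=30, window=10.0, base_block=60.0, max_block=86400.0):
        self.max_hits = max_hits      # assumed threshold: requests allowed per window
        self.window = window          # assumed window length in seconds
        self.base_block = base_block  # first quarantine length in seconds
        self.max_block = max_block    # cap at one day
        self._state = {}

    def check(self, ip, now=None):
        """Return "allow" to serve the page or "captcha" to challenge the visitor."""
        now = time.monotonic() if now is None else now
        st = self._state.setdefault(ip, IpState())
        if now < st.blocked_until:
            return "captcha"
        # Keep only the requests that fall inside the sliding window.
        st.hits = [t for t in st.hits if now - t < self.window]
        st.hits.append(now)
        if len(st.hits) > self.max_hits:
            # Bad behavior: quarantine, doubling the time on every repeat offense.
            st.strikes += 1
            block = min(self.base_block * 2 ** (st.strikes - 1), self.max_block)
            st.blocked_until = now + block
            st.hits.clear()
            return "captcha"
        return "allow"

    def captcha_solved(self, ip):
        """A human solved the captcha: lift the block but remember past strikes."""
        st = self._state.setdefault(ip, IpState())
        st.blocked_until = 0.0
        st.hits.clear()
```

Strikes are deliberately kept after a captcha is solved, so if the scraper behind a shared proxy IP starts up again, the next quarantine is longer than the last, while the next human visitor can still unlock the IP for themselves.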