Saturday, November 18, 2006

Good Scrapers, Bad Scrapers and Tinkerers, OH MY!

Someone posting on Freedom to Tinker as Neo made a claim similar to the one in Greg Yardley's post: that my bot blocking endeavors are going to stop tinkerers and end innovation on the web. That is patently untrue.

The only thing my bot blocker is going to do is give any webmaster, even a non-technical neophyte, tools to monitor and control access to their site that are easy to both understand and administer. No more cryptic crap. The software will show them what's accessing their site so they can make informed decisions about what should or shouldn't crawl. That's what it's all about: knowledge, as knowledge is power and gives the webmaster the upper hand.

I'm not the only one blocking everything either as Brett Tabke of WebmasterWorld blocked everything from crawling for a while just to see what was bouncing off his firewall. What Brett decided to do was just require logins from people coming from bad internet neighborhoods. Since most websites don't have logins and subscriptions, my solution was to use captchas when bad behavior happens.

Yes, I'll admit I'm on a tear and block everything under the sun but I have a real purpose in my madness which is feeding bread crumbs to the rest of the creepy crawlers hitting my site so I know who they are, where they came from and where the content appears when it's indexed by search engines.

However, I don't intend to enforce my particular brand of blocking on everyone who decides to use my bot blocker, as one size doesn't fit all. The software has lots of options that the webmaster can set and, assuming the webmaster checks the control panel now and then, it shows what new things are on the web and lets the webmaster grant or deny them access.
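To make the idea concrete, here is a rough sketch of what "the webmaster decides" could look like in code. This is not the product's actual code; `BotPolicy`, `GENERIC_PREFIXES`, the action names and the `block_generic` option are all made up for illustration, and a real blocker would look at far more than the user agent string.

```python
from dataclasses import dataclass, field

# Hypothetical list of library-default user agents that identify nothing.
GENERIC_PREFIXES = ("java/", "python-urllib", "libwww-perl", "curl/", "wget/")


@dataclass
class BotPolicy:
    """Webmaster-controlled rules; every name here is illustrative only."""
    rules: dict = field(default_factory=dict)       # user agent -> "allow" | "deny"
    default_action: str = "challenge"               # what to do with unknown visitors
    block_generic: bool = True                      # option the webmaster can turn off
    pending_review: list = field(default_factory=list)

    def decide(self, user_agent: str) -> str:
        ua = user_agent.strip()
        # An explicit decision by the webmaster always wins.
        if ua in self.rules:
            return self.rules[ua]
        # Optionally refuse blank or library-default user agents outright.
        if self.block_generic and (not ua or ua.lower().startswith(GENERIC_PREFIXES)):
            return "deny"
        # Anything new is queued so it shows up in the control panel for review.
        if ua not in self.pending_review:
            self.pending_review.append(ua)
        return self.default_action
```

A webmaster who wants a softer setup could set `default_action="allow"` and `block_generic=False`; one who wants my brand of blocking could set `default_action="deny"`. The point is that the choice lives in the settings, not in the software.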

I don't foresee my bot blocker causing Neo's or Yardley's apocalyptic view of the web whatsoever but I do foresee the following changes:

  • New bots and people tinkering might just have to ask the network of bot blockers for permission first to get access, not a big deal and easily done.
  • Sloppy bots will go away or be fixed when they get stopped doing dumb things.
  • User agents will be unique per site or software, no more Java/1.5.0_03 so they can either learn how to set the UA or stay off the net.
  • Good scrapers that scrape for directories, that actually provide real links to sites, will need to identify themselves or go away.
  • Bad scrapers will be in serious jeopardy as the scraping noose closes.

Therefore, people who play by the rules, honor robots.txt, use a real user agent and supply a web page that people can review to see what they are doing and why they should be allowed to crawl will have no problem.
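For tinkerers wondering what playing by the rules looks like from the crawler side, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name, the info-page URL inside the user agent, and the sample robots.txt are all assumptions invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: a unique name plus a page explaining what it does.
USER_AGENT = "ExampleDirectoryBot/1.0 (+http://example.com/bot.html)"


def may_crawl(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the site's robots.txt allows this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt: one named bot is let in, everything else is shut out.
SAMPLE_ROBOTS = """\
User-agent: ExampleDirectoryBot
Disallow: /private/

User-agent: *
Disallow: /
"""

print(may_crawl(SAMPLE_ROBOTS, "http://example.com/articles/1"))
print(may_crawl(SAMPLE_ROBOTS, "http://example.com/private/x"))
print(may_crawl(SAMPLE_ROBOTS, "http://example.com/articles/1", "OtherBot/1.0"))
```

A real crawler would fetch the live robots.txt (for example with `set_url()` and `read()`) and send the same user agent string in its HTTP requests, so the name a webmaster sees in the logs is the name they can allow or block.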

It's just the bottom feeding scrapers and spammers that will be in serious trouble and we may see botnets emerge to do the bidding of the nastiest of the crawlers.


Too late, botnets already exist and other groups are actively fighting the botnets.

So what am I missing that bot blocking technology will cause?

Oh yes, the return of MANNERS, COURTESY and RESPECT FOR COPYRIGHT, which means asking permission, being OPT-IN, not just taking what you want regardless of the webmaster's wishes.

When you ask to crawl my site it's a business arrangement, you want to build a business and ask MY PERMISSION to be included in your business.

This is how it works in the real world.

If you want to do business with someone you have to ask first.

It would appear that many think respect and courtesy aren't part of the Internet, but the sense of entitlement to content just because it's on a PUBLIC NETWORK is flat wrong.

Walmart is technically a public place, anyone can just walk in the door, and if you walked into Walmart and did what most scrapers do on the web they would call the cops and haul your ass off to jail. Before you respond that Walmart is a private company, even the Public Library frowns on people doing what scrapers do: they have signs posted above the copying machines warning you about copyright and telling you that you can only copy small quantities for personal use.

I'm just giving webmasters the same control Walmart has:



Pretty simple.

The webmasters will be able to control their sites as much as technology allows. If we get to the point Neo suggests, where every visitor has to enter a captcha before they can access any website, I suspect we'll see legislation that makes crawling without permission an offense. The Australians are already working on legislation which is flawed, but they are heading in that direction.

I'm just making the tool, not telling people how to implement it.

The choice is up to the internet, webmasters and politicians how this all plays out, not me.


Lea de Groot said...

Thanks, Bill - now stop worrying about these annoyances and get on with getting the code up to prod quality. Beta even. ;)
(Remember, as Commander Taco says, "release often, release early" :))

Greg said...

Don't think I'm *against* the product you're building, Bill. Business plans that start by assuming they can just collect a ton of info off other websites annoy the hell out of me.

IncrediBILL said...

Greg, I didn't think you were against what I'm doing but your post about the Semantic Web being blocked by my product struck a nerve.

It occurred to me that I needed to explain a bit more as the webmaster is in total control, I'm not just launching a bot blocker with fascist rules pre-installed to work like it does on my sites.

I figured it was just time to do a little damage control as many people are getting the wrong idea of what my product will do based on my staunch anti-bot rhetoric.

You can select multiple levels of blocking so the webmaster gets complete control over the implementation.

Just thought I'd clear that up ;)

Rob Leathern said...

Greg - why are you annoyed by those business plans? They have the ability to add a lot of value... and the free market should let people free ride on others' content just as it should let website owners block that free riding. What it *should* mean over time (but this will usually take a LONG time) is that services arise that actually establish direct relationships between content partners OR they come to an agreement about how to take the blocks down, and consumers may choose the services that have gone to the extra trouble OR they may choose the quick and easy solution... that's the fun thing about the Web. Structured free-market chaotic discovery.