A web crawler called Heritrix has hit my site a few times. It appears to be written mostly by the team at Archive.org, the same team that created ia_archiver, for those of you who haven't had your coffee yet.
Yes, it supports robots.txt, but if you didn't know this damn thing existed you wouldn't bother blocking it, now would you?
People writing crawlers wonder why webmasters get pissed about having to track down and opt out of all this nuisance crawling on their websites, but I digress, that's an old rant.
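For anyone who does want to opt out, here is a minimal sketch of what a robots.txt rule for this crawler might look like, checked with Python's standard `urllib.robotparser`. The `heritrix` token is an assumption taken from the user agent string in my logs, and `example.com` and the `is_allowed` helper are made up for illustration. Note that a rule like this only helps against a crawler that actually honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the "heritrix" token is an assumption based on
# the user agent string in the logs, not a confirmed official token.
ROBOTS_TXT = """\
User-agent: heritrix
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(agent_token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits agent_token to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    # The crawler token is shut out of the whole site...
    print(is_allowed("heritrix", "http://example.com/page.html"))
    # ...while any other agent falls through to the permissive "*" entry.
    print(is_allowed("SomeOtherBot", "http://example.com/page.html"))
```

The check is done with the short crawler token rather than the full "Mozilla/5.0 (compatible; ...)" string, because `RobotFileParser` only compares against the part of the agent name before the first slash.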
The real amusement is that Heritrix claims its technology is designed to "collect the digital artifacts of our culture and preserve them for the benefit of future researchers and generations", which is a bunch of pretty language to sidestep the fact that it downloads a website without permission, especially when the webmaster probably isn't even aware of the crawler. It doesn't matter how you try to candy-coat it.
Now comes the fun part: let's see who was using it and why!
Today's attempted crawl was HUGE so it's safe to assume this thing has been on my site in the past and apparently the crawler was even banned on a previous bayarea.net IP address:
209.128.119.46 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"
Today the crawler used a different bayarea.net IP. It could be DHCP, or it could be on purpose to sidestep the previous ban. Who knows, but it only got a couple of pages before the doors were automatically slammed by the bot blocker:
209.128.119.17 - "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"
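Since the IP changed between visits while the user agent stayed the same, matching on the user agent is the more durable signal here. The following is only a rough sketch of that idea, not the actual bot blocker: the regular expression and the `parse_heritrix` and `should_block` names are hypothetical, and the pattern assumes the "heritrix/<version> +<url>" shape seen in the two log lines above.

```python
import re

# Hypothetical pattern for the "heritrix/<version> +<operator url>" portion
# of the user agent, as it appears in the two log lines above.
HERITRIX_RE = re.compile(
    r"heritrix/(?P<version>[\d.]+)\s+\+(?P<operator>[^)\s]+)",
    re.IGNORECASE,
)


def parse_heritrix(user_agent: str):
    """Return (version, operator_url) for a Heritrix user agent, else None."""
    match = HERITRIX_RE.search(user_agent)
    if match is None:
        return None
    return match.group("version"), match.group("operator")


def should_block(user_agent: str) -> bool:
    """Block on the user agent alone, so a new IP does not dodge the ban."""
    return parse_heritrix(user_agent) is not None


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"
    print(parse_heritrix(ua))
    print(should_block(ua))
```

The operator URL pulled out of the user agent is also the thread worth following to find out who is behind a given crawl. A user agent is trivially spoofed, of course, so this only catches crawlers that identify themselves honestly.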
With my curiosity in overdrive, it was time to research innovationblog.com and see why they were crawling my site. Not a clue, as there's nothing there but a "This Web Site Coming Soon" under-construction page, but the WHOIS for the site was very revealing.
Registrant:
Michael Osofsky
1758 Shoreline Blvd. Suite B
Mountain View, California 94043
United States
Registered through: GoDaddy.com
Domain Name: INNOVATIONBLOG.COM
Created on: 09-Mar-05
Expires on: 09-Mar-07
Last Updated on: 06-Mar-06
Administrative Contact:
Osofsky, Michael mosofsky@accelovation.com
1758 Shoreline Blvd. Suite B
Mountain View, California 94043
United States
(650) 968-4741 Fax --
Technical Contact:
Osofsky, Michael mosofsky@accelovation.com
1758 Shoreline Blvd. Suite B
Mountain View, California 94043
United States
(650) 968-4741 Fax --
Domain servers in listed order:
WSC1.JOMAX.NET
WSC2.JOMAX.NET
This Michael seems to be involved with a company called accelovation.com, and he seems to be big in innovation circles, having founded the MIT Innovation Club.
According to the Accelovation website:
Accelovation is the first and only Market Discovery System (MDS) that allows innovators to mine the online world for insights into unmet needs, trends, innovations and market activity.
Sound familiar?
We crawl you and use your information without permission to make a profit.
Where have we heard this before?
I'll bet they'll be surprised at my attitude about this, but they should try reading some webmaster forums to find out that what they're doing probably isn't welcome without permission, or at least without some clue posted about what the benefits are to the webmaster who allows his site to be crawled. Yada yada yada, we've been down this path a few times before and it's getting old.
Sorry, but your innovation collided with my innovation called a bot blocker.
Your crawl is denied, and thanks for playing.