I found this little Discobot from Discovery Engine trying to dance around on my server but the bot blocker bouncer at the door was already keeping him behind the velvet ropes.
Here's a sample of what I saw on my site:
208.96.54.74 "GET /robots.txt"
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"
208.96.54.68
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"
It does honor robots.txt just like they said it did, but it cached it for about 48 hours between visits.
They were nice enough to provide the range of IPs it uses:
208.96.54.67 - 208.96.54.96
Those IPs are from ServePath, which I already block.
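If you'd rather check that range yourself than take my word for it, here's a rough sketch in Python using the standard `ipaddress` module. The function name `in_discobot_range` and the `CIDR_BLOCKS` list are just names I made up for illustration; the only facts taken from above are the two endpoints of the published range.

```python
import ipaddress

# Endpoints of the range Discovery Engine published, as quoted above.
RANGE_START = ipaddress.ip_address("208.96.54.67")
RANGE_END = ipaddress.ip_address("208.96.54.96")


def in_discobot_range(addr: str) -> bool:
    """Return True if addr is an IPv4 address inside the published range (inclusive)."""
    ip = ipaddress.ip_address(addr)
    if ip.version != 4:
        # The published range is IPv4 only, so anything else is out of range.
        return False
    return RANGE_START <= ip <= RANGE_END


# The same range broken into CIDR blocks, which is the form most
# firewalls and server configs want to be fed.
CIDR_BLOCKS = list(ipaddress.summarize_address_range(RANGE_START, RANGE_END))

if __name__ == "__main__":
    for sample in ("208.96.54.74", "208.96.54.68", "208.96.54.97"):
        print(sample, in_discobot_range(sample))
    for block in CIDR_BLOCKS:
        print(block)
```

The range doesn't line up with a single CIDR boundary, which is why it gets split into several blocks instead of one.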
Between whitelisting allowed bots and blocking more data centers than I'd care to admit, this poor little Discobot didn't stand a chance to discover anything.
Call back when you're all grown up and ready to send traffic.


7 comments:
Thanks for the tip. Surprisingly, they weren't blocked with their new user agent.
You might remember their previous user agent was disco/Nutch-1.0-dev (experimental crawler; www.discoveryengine.com; disco-crawl@discoveryengine.com).
That Nutch reference caused their requests to be assigned a lower priority score in my processing engine ;-)
FWIW, DiscoveryEngine has SquirrelMail running.
I am curious, IncrediBILL, whether you have ever tried to track back the source of these scrapers. I long suspected porn, loans, gambling and prescription drugs, but was very surprised when the initial scrapers on a new site came from IPs and user agents related to shopping sites.
Dude, you're way behind in this game.
Just search my blog for "porn scraper" and see what pops up.
Hello, I'm the CEO of Discovery Engine. I wanted to say that we are not a porn scraper or spam site!
Our company was founded by computer scientists from Stanford and Google. We are building a new web-scale search engine to be launched publicly later this year.
The discobot is downloading pages to help users of our beta service find your content. It is OK if you want to block it, which is why we gave the instructions.
Bill, I am curious why you were blocking all ServePath requests by default?
BTW, we switched from Nutch to our own crawler for performance reasons. Didn't realize that would increase our "priority score." Can you say any more about this, johann?
Thanks for the feedback, everyone.
Bill, I block all data centers because of the high volume of scrapers, spammers and proxy sites hosted in said data centers.
Besides, my high volume sites are whitelisted only (except this blog) so nothing gets in the front gates if I don't want it to get in.
So riddle me this, why doesn't Discobot support full trip DNS verification like Google?
If you want a leg up on all the other startups, then your crawler should be fully verifiable via DNS and not return rubbish like this:
208.96.54.74 -> customer-reverse-entry.208.96.54.74.
Heck, if you want to discuss these issues about being more "webmaster friendly" over a beer I'm in the SF bay area! ;)
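For anyone wondering what that full trip DNS check looks like in practice, here's a minimal sketch in Python. It does a reverse lookup on the IP, checks that the hostname sits under a domain you trust, then does a forward lookup on that hostname and confirms it points back at the same IP. The function names and the `allowed_suffixes` parameter are my own invention for illustration, and `googlebot.com` in the usage line is only an example of a suffix you might trust. Discobot had no verifiable hostname at the time, so there is no real suffix to plug in for it.

```python
import socket


def reverse_lookup(ip):
    """PTR lookup: return the hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(host):
    """A-record lookup: return the list of IPv4 addresses for host."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def verify_crawler(ip, allowed_suffixes,
                   reverse=reverse_lookup, forward=forward_lookup):
    """Full trip check: IP -> hostname -> IP must land back on the same IP."""
    host = reverse(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    # The hostname must be the trusted domain itself or a subdomain of it.
    trusted = any(
        host == suffix or host.endswith("." + suffix)
        for suffix in (s.strip(".").lower() for s in allowed_suffixes)
    )
    if not trusted:
        return False
    # The forward lookup has to confirm the original address.
    return ip in forward(host)


if __name__ == "__main__":
    # Needs live DNS; the result depends on what the resolver returns.
    print(verify_crawler("66.249.66.1", ["googlebot.com"]))
```

The resolvers are passed in as parameters so the logic can be tried out with canned answers and no network. A reverse entry like the customer-reverse-entry one above fails at the suffix check before the forward lookup even happens.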
Bill, I will take you up on the beer offer. FYI here is an announcement we put out today:
http://www.discoveryengine.com/news/pr-alex.html
Bill Mydlowec, are you folks really from Google and Stanford? And did they forget to teach you the basics of user friendliness? I could not read a single word on your website. Black background with some grey scribble on it, or so it seemed to me. Someone who is already skeptical, if not downright suspicious, of new crawlers would think it was deliberate!