Friday, April 04, 2008

Discovery Engine's Discobot Discovered My Bot Blocker

I found this little Discobot from Discovery Engine trying to dance around on my server but the bot blocker bouncer at the door was already keeping him behind the velvet ropes.

Here's a sample of what I saw on my site:

208.96.54.74 "GET /robots.txt"
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"

208.96.54.68
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"

It does honor robots.txt, just like they said it would, but it cached the file for about 48 hours between visits.

They were nice enough to provide the range of IPs it uses:
208.96.54.67 - 208.96.54.96
Those IPs are from Servepath which I already block.
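
For anyone who wants to test an address against that published range, here is a rough sketch in Python. It is not the blocker running on this site; the function names `in_discobot_range` and `range_as_cidrs` are made up for illustration, and only the two endpoint addresses come from the range quoted above.

```python
import ipaddress

# Endpoints of the range Discovery Engine published for discobot.
FIRST = ipaddress.ip_address("208.96.54.67")
LAST = ipaddress.ip_address("208.96.54.96")


def in_discobot_range(ip: str) -> bool:
    """Return True if ip is an IPv4 address inside the published range."""
    addr = ipaddress.ip_address(ip)
    # Guard against comparing an IPv6 address with IPv4 endpoints.
    return addr.version == 4 and FIRST <= addr <= LAST


def range_as_cidrs() -> list:
    """Express the range as CIDR blocks, the form most firewalls expect."""
    return [str(net) for net in ipaddress.summarize_address_range(FIRST, LAST)]


if __name__ == "__main__":
    print(in_discobot_range("208.96.54.74"))
    print(range_as_cidrs())
```

Because the range does not start or end on a power-of-two boundary, it breaks into several CIDR blocks rather than one.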

Between whitelisting allowed bots and blocking more data centers than I'd care to admit, this poor little Discobot didn't stand a chance to discover anything.

Call back when you're all grown up and ready to send traffic.


12 comments:

Anonymous said...

Thanks for the tip. Surprisingly, they weren't blocked with their new user agent.

You might remember their previous user agent was disco/Nutch-1.0-dev (experimental crawler; www.discoveryengine.com; disco-crawl@discoveryengine.com).

That Nutch reference caused their requests to be assigned a lower priority score in my processing engine ;-)

FWIW, DiscoveryEngine has SquirrelMail running.

Anonymous said...

I am curious, IncrediBILL, whether you have ever tried to track back the source of these scrapers. I long suspected porn, loans, gambling and prescription drugs, but was very surprised when the initial scrapers on a new site came from IPs and user agents related to shopping sites.

IncrediBILL said...

Dude, you're way behind in this game.

Just search my blog for "porn scraper" and see what pops up.

Anonymous said...

Hello, I'm the CEO of Discovery Engine. I wanted to say that we are not a porn scraper or spam site!

Our company was founded by computer scientists from Stanford and Google. We are building a new web-scale search engine to be launched publicly later this year.

The discobot is downloading pages to help users of our beta service find your content. It is OK if you want to block it, which is why we gave the instructions.

Bill, I am curious why you were blocking all ServePath requests by default?

BTW, we switched from Nutch to our own crawler for performance reasons. Didn't realize that would increase our "priority score." Can you say any more about this, johann?

Thanks for the feedback, everyone.

IncrediBILL said...

Bill, I block all data centers because of the high volume of scrapers, spammers and proxy sites hosted in said data centers.

Besides, my high volume sites are whitelisted only (except this blog) so nothing gets in the front gates if I don't want it to get in.

So riddle me this, why doesn't Discobot support full trip DNS verification like Google?

If you want a leg up on all the other start ups then your crawler should be fully verifiable via DNS and not return rubbish like this:

208.96.54.74 -> customer-reverse-entry.208.96.54.74.

Heck, if you want to discuss these issues about being more "webmaster friendly" over a beer I'm in the SF bay area! ;)
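
A minimal sketch of the full trip DNS check described above, in Python: look up the PTR name for the visiting IP, confirm the name sits under a domain the crawler operator controls, then resolve that name forward and confirm it maps back to the same IP. The function name `verify_crawler` and its `reverse`/`forward` parameters are hypothetical; they exist only so the lookups can be swapped for stubs. The domain suffix passed in is whatever the operator documents, which is an assumption on the caller's part.

```python
import socket


def verify_crawler(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a crawler IP.

    reverse(ip) -> hostname and forward(hostname) -> list of IPs default
    to the system resolver; pass stubs to test without network access.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: reverse lookup (IP -> hostname).
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record at all

    # Step 2: the hostname must belong to one of the expected domains.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    # Step 3: forward lookup (hostname -> IPs) must include the original IP.
    try:
        return ip in forward(host)
    except OSError:
        return False
```

With the reverse entry quoted above, step 2 fails: `customer-reverse-entry.208.96.54.74` is not under any domain the crawler operator owns, so the request cannot be tied to them by DNS alone.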

Anonymous said...

Bill, I will take you up on the beer offer. FYI here is an announcement we put out today:

http://www.discoveryengine.com/news/pr-alex.html

Unknown said...

Bill Mydlowec. Are you folks really from Google and Stanford? And did they forget to teach you the basics of user friendliness? I could not read a single word on your website. Black background with some grey scribble on it, or so it seemed to me. Someone who is already skeptical, if not downright suspicious, of new crawlers would think it was deliberate!

Anonymous said...

3.5 years later, they still have the suspicious black page with gray text and are still introducing their "next generation search engine."
But are sporting a new IP: 38.101.148.126 with discobot/1.1.

Russell said...

Howdy IncrediBILL,

Have you ever had said beer? Betting not. ;-)

Have you heard anything new from these folks? They are in a new IP range and crawling my site again after a long pause.

Maybe DiscoBill will come back and give us an update?

Seems like a 3+ year closed alpha phase would be a little excessive.

Either a world-changer, a floundering zombie project or a complete con... at this point they really have to say something about it.

Thanks for the fantastic site by the way, I have just discovered you over a search on this UA, and will be reading around some more tonight.

Garza Girl said...

Has anyone re-examined Discovery Engine (or the DiscoBot)? Would love to know where they netted out.

Anonymous said...

They're still around and ignoring Disallow (but respecting crawl limits) in robots.txt. I ended up having to 403 them by user-agent in my apache config earlier today.
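
The comment above did this in Apache config, which is not shown. As a rough equivalent for a Python-served site, here is a sketch of a WSGI wrapper that returns 403 based on the user agent. The names `BLOCKED_UA`, `is_blocked` and `block_discobot` are hypothetical, and the pattern only covers the bot names reported in this thread.

```python
import re

# Matches the crawler tokens seen in this thread, e.g. "discobot/1.0".
BLOCKED_UA = re.compile(r"\b(?:discobot|discoverybot)/", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header names one of the blocked bots."""
    return bool(BLOCKED_UA.search(user_agent or ""))


def block_discobot(app):
    """Wrap a WSGI app so blocked user agents get a 403 instead of content."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

A user agent string is trivially spoofed, so this only stops a bot that identifies itself honestly.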

Glenn said...

another year, and they're still here - changed their User Agent string though - now showing up as

Mozilla/5.0 (compatible; discoverybot/2.0; +http://discoveryengine.com/discoverybot.html

Web page hasn't changed. I can find no list of employees or officers, and the address shows up as the offices of SF Heat, an ad agency. not sure what to make of all that, other than adding them to the list :)