Saturday, June 17, 2006

New Nutch Sighting at Rediff

Just to make sure our list doesn't grow stale, here's the new Nutch of the day:

203.199.83.162
pro3.rediffmailpro.com
"NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
No clue why a business email company needs a web crawler, but that's where it came from.

I noticed the Nutch developers picked up on my previous post and are discussing two things: forcing users to change the default user agent, which yet again wasn't changed in this sighting, and ways to reduce how much Nutch actually crawls individual websites.

Good luck with that effort, guys, we can use it!

2 comments:

Anonymous said...

Bill, as I mentioned elsewhere, why would you want Nutch to force users to change the default user agent? The way it is now, I only need one line to ban almost every Nutch user agent out there. If the developers change that, it'll mean lots more work for all of us. I respect you a lot, but I just don't get your logic on this one. ~g.

IncrediBILL said...

In case you didn't notice when I posted the big list of NUTCH crawlers, even the people who changed their info STILL had the word NUTCH in the user agent.

I would just like to know who is running Nutch, and why, is all.
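
For anyone wondering what the "one line" ban discussed above boils down to, here is a minimal sketch in Python. It assumes the only test is a case-insensitive match on the word "nutch" anywhere in the user agent string; the names NUTCH_PATTERN and is_nutch are made up for illustration, and a real setup would more likely put the same match in the web server config than in application code.

```python
import re

# Hypothetical pattern: match "nutch" anywhere in the user agent,
# ignoring case, so "Nutch", "NUTCH" and "NutchCVS" are all caught.
NUTCH_PATTERN = re.compile(r"nutch", re.IGNORECASE)


def is_nutch(user_agent):
    """Return True if the user agent string mentions Nutch in any casing."""
    # Treat a missing user agent (None) as an empty string.
    return bool(NUTCH_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    # The user agent from the sighting in this post.
    sighting = (
        "NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; "
        "nutch-agent@lucene.apache.org)"
    )
    print(is_nutch(sighting))
```

This only works as long as operators leave the word Nutch somewhere in the string, which is the point made in the comments: a crawler that drops the name entirely would slip past a check like this.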