Saturday, November 18, 2006

Good Scrapers, Bad Scrapers and Tinkerers, OH MY!

Someone posting on Freedom to Tinker as Neo said much the same thing as Greg Yardley's post: that my bot blocking endeavors are going to stop tinkerers and end innovation on the web, which is patently untrue.

The only thing my bot blocker is going to do is give any webmaster, even a non-technical neophyte, tools for monitoring and controlling access to their site that are easy to understand and administer. No more cryptic crap. The software will show them what's accessing their site so they can make informed decisions about what should and shouldn't crawl. That's what it's all about: knowledge, as knowledge is power and gives the webmaster the upper hand.

I'm not the only one blocking everything either as Brett Tabke of WebmasterWorld blocked everything from crawling for a while just to see what was bouncing off his firewall. What Brett decided to do was just require logins from people coming from bad internet neighborhoods. Since most websites don't have logins and subscriptions, my solution was to use captchas when bad behavior happens.

Yes, I'll admit I'm on a tear and block everything under the sun but I have a real purpose in my madness which is feeding bread crumbs to the rest of the creepy crawlers hitting my site so I know who they are, where they came from and where the content appears when it's indexed by search engines.

However, I don't intend to impose my particular brand of blocking on everyone who decides to use my bot blocker, as one size doesn't fit all. The software has lots of options that the webmaster can set and, assuming the webmaster checks the control panel now and then, shows what new things are on the web so the webmaster can grant or deny them access.

I don't foresee my bot blocker causing Neo's or Yardley's apocalyptic view of the web whatsoever but I do foresee the following changes:

  • New bots and people tinkering might just have to ask the network of bot blockers for permission first to get access; not a big deal and easily done.
  • Sloppy bots will go away or be fixed when they get stopped doing dumb things.
  • User agents will be unique per site or software. No more Java/1.5.0_03: operators can either learn how to set the UA or stay off the net.
  • Good scrapers that scrape for directories, that actually provide real links to sites, will need to identify themselves or go away.
  • Bad scrapers will be in serious jeopardy as the scraping noose closes.
Therefore, people who play by the rules, honor robots.txt, use a real user agent and supply a web page that people can review to see what they are doing and why they should be allowed to crawl will have no problem.
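
As a rough illustration of the "real user agent" rule above, here is a minimal Python sketch. It is not the bot blocker's actual logic: the function name identifies_itself and the list of library-default prefixes are assumptions made up for this example.

```python
import re

# Hypothetical list of library-default user agents; extend as needed.
GENERIC_UA_PREFIXES = ("Java/", "Python-urllib/", "libwww-perl/", "curl/", "Wget/")

# A crawler info page or a contact address somewhere in the UA string.
CONTACT_PATTERN = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+")


def identifies_itself(user_agent: str) -> bool:
    """Return True if the UA is not a library default and carries a URL or email."""
    ua = user_agent.strip()
    if not ua or ua.startswith(GENERIC_UA_PREFIXES):
        return False
    return CONTACT_PATTERN.search(ua) is not None
```

A check like this only tells you whether a crawler bothered to introduce itself; whether it gets in would still be the webmaster's call.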

It's just the bottom feeding scrapers and spammers that will be in serious trouble and we may see botnets emerge to do the bidding of the nastiest of the crawlers.

OOOPS!

Too late, botnets already exist and other groups are actively fighting the botnets.

So what am I missing that bot blocking technology will cause?

Oh yes, the return of MANNERS, COURTESY and RESPECT FOR COPYRIGHT, which means asking permission, being OPT-IN, not just taking what you want regardless of the webmaster's wishes.

When you ask to crawl my site it's a business arrangement, you want to build a business and ask MY PERMISSION to be included in your business.

This is how it works in the real world.

If you want to do business with someone you have to ask first.

It would appear that many think respect and courtesy are not part of the Internet, and the sense of entitlement to content just because it's on a PUBLIC NETWORK is flat wrong.

Walmart is technically a public place, anyone can just walk in the door, and if you walked into Walmart and did what most scrapers do on the web they would call the cops and haul your ass off to jail. Before you respond that Walmart is a private company, even the Public Library frowns on people doing what scrapers do: they have signs posted above the copying machines warning you about copyright and that you can copy only small quantities for personal use.

I'm just giving webmasters the same control Walmart has:

WE HAVE THE RIGHT TO NOT SERVE ANYONE.

NO SHIRT. NO SHOES. NO SERVICE.

Pretty simple.

The webmasters will be able to control their sites as much as technology allows. If we get to the point Neo suggests, where every visitor has to enter a captcha before they can access any website, I suspect some legislation will follow that makes crawling without permission an offense. The Australians are already working on legislation which is flawed, but they are heading in that direction.

I'm just making the tool, not telling people how to implement it.

How this all plays out is up to the internet, webmasters and politicians, not me.

Google's Anti-Phish ROCKS!

After reading all of the whiners and complainers going on and on about how anti-phish in browsers was going to give people a false sense of security I decided to put it to the test today when a phishing email landed in my Inbox.

Within minutes of the arrival of the phishing email, I enabled the Google anti-phish in Firefox 2.0 and went to the site linked in the email:

http://g-lec.com/data/cont/news/sicherung/einfach_millionaer1/wells/
The very minute the screen loaded Google popped up an alert:



Here's the page without the Google alert covering it:



I'll try the anti-phish a few more times as the opportunity arises, but this first test was impressive. As soon as I get around to installing IE 7, I'll test their anti-phish as well.

Way to go Google!

You get a nice well deserved pat on the back for this one!

eBay is Scraping?

Caught this story on WebmasterWorld about eBay scraping and sure enough found evidence of the same thing on my site.

The first IP is definitely a stealth bot. It's blocked, yet it has kept asking for pages over the last couple of months.

216.113.181.67 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461; .NET CLR 1.1.4322)"
This IP address used a banned user agent so it would never be allowed to crawl in the first place yet still asked for a couple of page names it already knew about, weird.
216.113.168.141 "Java/1.5.0_09"
Here's eBay's info so you can block whatever the hell they're doing:
OrgName: eBay, Inc
OrgID: EBAY
Address: 2145 Hamilton Ave
City: San Jose
StateProv: CA
PostalCode: 95008
Country: US

NetRange: 216.113.160.0 - 216.113.191.255
CIDR: 216.113.160.0/19
What will they do with the information they collect, sell it to the highest bidding scraper?
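
For anyone who wants to test addresses against that range in a script, here is a minimal sketch using Python's standard ipaddress module. The names EBAY_NET and in_blocked_range are made up for this example; only the CIDR block comes from the WHOIS record above.

```python
import ipaddress

# The range listed in the WHOIS record above.
EBAY_NET = ipaddress.ip_network("216.113.160.0/19")


def in_blocked_range(ip: str, network=EBAY_NET) -> bool:
    """Return True if the dotted-quad address falls inside the network."""
    return ipaddress.ip_address(ip) in network
```

Both addresses shown above, 216.113.181.67 and 216.113.168.141, fall inside this /19.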

Stealing T-H-U-N-D-E-R-S-T-O-N-E's Thunder

Here's another LayeredTech scraper busted for your amusement.

They call themselves a search engine and a web crawler, but when I can't find any information that ties the crawler back to the source without resorting to extraordinary means, such as feeding them bread crumbs to chase through the internet, I call them scrapers.

Here's the scraper or web crawler as they call it:

72.232.181.210 "Mozilla/2.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)"
Here's where the scrapings end up:
http://www.buyersindex.com/
This thing is probably the Webinator by Thunderstone Software, but it's hard to tell as the user agent has no link to any crawler information and a quick casual review of either website turned up nothing about the crawler.

It's not exactly like they're hiding or anything but it isn't completely above board either by not divulging who's crawling and why.


Will Google Really Banish Scrapers?

Many people at PubCon, including some major companies, were telling me their tales of scraper horror. All the stories were similar about being endlessly abused and they were having trouble getting the problem under control or just gave up in frustration. Several people even asked the search engines what they were going to do about scrapers in the Q&A of some PubCon sessions and got the old "we're working on it" response which I think is half-hearted.

When you consider that AdSense technology fuels most scraper sites it's obvious Google could simply look at any AdSense account serving up ads from a multitude of locations which is usually a clue there's something rotten happening. Not that everyone with AdSense on multiple domains is bad, but when you see a single AdSense account used on thousands of locations, you know there's a good chance it's all crap. However, Google probably makes way too much money from scrapers just to eliminate them altogether. What's more than likely to happen is Google might drop scrapers from the Google index but leave their AdSense accounts intact so that the revenue stream continues from these sites being found in Yahoo and MSN.

Perhaps we can hope Yahoo and MSN figure out how to detect and eliminate scrapers first and put our friends at Google between a rock and a hard spot with the dilemma of scrapers vs. AdSense revenue. Either Google would have to clean up their search results to make the users happy or leave the scrapers in to make the stockholders and bean counters happy, which could backfire either way. Needless to say, I don't see scrapers going away any time soon because the financial incentives to keep them are just too great.

Meanwhile, I recommend reporting scrapers on Google's Report a Spam Result page to see whether Google is serious about getting rid of scrapers when found.

Sunday, November 12, 2006

Billed as a RoadBlock to the Semantic Web

Got a sudden burst of traffic from Greg Yardley's site today and noticed the topic was about "The coming semantic web roadblock" which I find amusing as I loathe the onslaught of data miners that hit my site and block their asses automatically on a daily basis.

Greg raised a couple of issues that I've heard a few times from other people: that my technology will block everything and prevent new search technology from becoming established, and that it could block things that are currently providing value for your site. That's not entirely true, nor is it my intent at all.

Remember, my primary goal is to make the websites using my product OPT-IN, whitelisting the things that want access, instead of OPT-OUT, or blacklisting, which doesn't work at all.

When you first install this bot blocking tool, it's in a PREVIEW mode by default which means you can see what it would be blocking but no action is being taken. It's completely passive when it's in PREVIEW mode and doesn't even challenge possible stealth scrapers, so it may not know if they're human or not but will take a guess. That means you can observe what's going on with your website for days or even weeks and then authorize anything that's providing value before turning the product LIVE and blocking the rest.

Now the next thing that's important to know is that the product records and reports new user agents that appear, so you will see in REAL TIME when something new, never before seen, hits the site. Remember, since we're OPT-IN, we haven't decided if these new things are good or bad yet, so the first time they visit the site they'll get bounced, whether or not they honor robots.txt. The next time they visit, if the webmaster has decided to let them in, they'll be allowed to crawl without issue.
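
The PREVIEW versus LIVE behavior and the opt-in default can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the product's code: the class OptInGate, its fields and the returned strings are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OptInGate:
    """Toy opt-in gate: unknown user agents are denied unless whitelisted."""
    live: bool = False                      # False = PREVIEW: observe only
    allowed: set = field(default_factory=set)
    seen: set = field(default_factory=set)  # every UA observed so far

    def check(self, user_agent: str) -> str:
        is_new = user_agent not in self.seen
        self.seen.add(user_agent)
        if user_agent in self.allowed:
            return "allow"
        if not self.live:
            # Preview mode reports what would happen but never blocks.
            return "would-block (new)" if is_new else "would-block"
        return "block"
```

In this sketch a webmaster would watch the "would-block" reports for a while, add anything valuable to allowed, and only then set live to True.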

To summarize, whether or not the Semantic Web becomes a reality is up to each webmaster, not me, my tool or service.

I prefer to think of Web 3.0 as the Democratic Web so if the majority decides to vote the Semantic Web out, who am I to argue?

Heritrix Activity Report

Heritrix isn't being adopted at the same rapid pace as Nutch is, but it keeps showing up from more and more places.

Here's the list of sightings, but the one that gives me the biggest giggle is the first, which claims to be "google.com" that came from Mannheim University in Germany.

134.155.241.9 "Mozilla/5.0 (compatible; heritrix/1.10.0 +http://google.com)"

137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"

137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"

152.163.214.140 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"

152.163.214.141 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"

152.163.214.144 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"

193.40.192.35 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://erika.nlib.ee)"

195.39.35.118 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.researcher.cz)"

198.162.51.70 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"

207.241.233.35 "Mozilla/5.0 (compatible;archive.org_bot/heritrix-1.9.0-200608171144 +http://pandora.nla.gov.au/crawl.html)"

209.128.119.17 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"

209.128.119.46 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"

216.182.228.85 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"

217.91.71.203 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.schluetersche.de)"

24.8.197.68 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

67.162.138.161 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

71.229.152.72 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

71.56.215.150 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

72.20.99.46 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://www.accelobot.com)"

87.98.198.194 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
The other one I found amusing was the Accelobot which claims to "help automate market research" and I wonder if their research showed them I wasn't interested in their help?

Not nearly as popular as other tools, but picking up a little steam unfortunately.

We'll keep an eye on this and let people know when it hits epidemic proportions.
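
To tally sightings like the ones above by version and claimed operator, a small parser is enough. The following Python sketch assumes the user agent shapes seen in this list (heritrix/X.Y.Z or heritrix-X.Y.Z, plus a +URL); the function name parse_heritrix is hypothetical.

```python
import re

# Matches both "heritrix/1.6.0" and "heritrix-1.9.0-200608171144".
HERITRIX_RE = re.compile(r"heritrix[/-](\d+\.\d+\.\d+)")
# The "+http://..." token is whatever the operator chose to claim.
OPERATOR_RE = re.compile(r"\+(https?://[^\s)]+)")


def parse_heritrix(user_agent: str):
    """Return (version, claimed_operator_url), or None if not Heritrix."""
    version = HERITRIX_RE.search(user_agent)
    if version is None:
        return None
    operator = OPERATOR_RE.search(user_agent)
    return (version.group(1), operator.group(1) if operator else None)
```

Keep in mind the URL is self-reported, as the "google.com" crawler from Mannheim shows.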

Tracking HTTrack Website Downloader

I'm just curious why over 100 people in the last few months thought they could just download my whole website (not this blog) with HTTrack?

What were these dumb fucks going to do with it once they got it anyway?

  • Run a scraper script on the results?
  • Blatantly republish the content with their own template?
  • Run some data mining scripts on it?
  • Keep a copy just for shits and giggles?
Who the fuck knows, who the fuck cares, they aren't downloading 40K pages so they got NADA!

Here's a list of attempts to download the site:
12.218.132.246 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
151.196.39.206 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
151.44.39.130 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
151.57.203.117 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
157.150.112.6 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
160.75.107.93 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
166.102.234.113 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
168.209.97.34 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
193.194.84.227 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
193.253.222.153 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
194.138.39.53 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
194.51.93.106 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
194.57.91.165 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
195.115.20.132 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
195.229.242.53 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
195.246.48.241 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
196.1.179.77 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
196.30.245.149 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
196.31.142.11 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
200.170.96.119 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
201.0.55.48 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.147.168.130 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.58.205.163 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.65.119.252 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.83.173.59 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.90.87.7 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
203.189.231.13 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
203.87.188.194 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
206.223.8.30 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
208.102.27.19 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
208.255.142.57 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
208.255.142.57 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.117.81.45 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.129.60.250 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.200.201.102 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.200.201.214 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.200.203.48 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.251.8.5 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.81.218.82 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.93.224.35 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.136.106.252 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.216.199.2 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.228.0.86 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.23.124.2 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
216.108.210.225 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
216.76.80.93 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
220.247.221.131 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
24.205.6.210 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
61.90.220.86 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
62.210.102.125 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
62.57.32.142 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
64.222.233.72 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
68.220.248.94 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
69.22.0.123 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
69.88.8.6 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
69.88.8.6 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
70.71.114.43 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
71.227.195.118 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
71.70.233.219 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
72.255.6.100 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
74.132.128.2 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.103.33.75 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.144.203.67 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.144.234.32 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.170.26.10 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.170.39.87 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.191.116.41 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.53.155.234 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.208.36.91 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.245.178.4 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.246.203.43 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.250.148.63 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.29.232.56 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.50.176.143 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.56.85.53 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.90.175.201 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.16.147.149 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.225.167.110 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.228.167.150 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.239.139.105 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.242.65.70 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.245.61.27 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.248.45.214 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.65.0.229 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.66.135.81 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.83.202.247 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.93.27.229 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.93.27.229 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.135.199.34 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.135.224.26 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.16.51.174 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.179.163.75 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.93.133.158 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.93.133.158 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.162.79.29 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.245.166.176 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.4.209.62 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.6.122.9 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.72.193.77 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.90.2.1 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
86.195.214.61 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
86.68.132.131 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
87.218.59.4 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
87.81.178.38 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
87.89.114.228 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
88.139.139.203 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
88.73.106.129 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
89.54.130.12 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
The best part is, trying to download my site from my server gets them all automatically banned.
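
A minimal sketch of that kind of automatic ban, in Python, assuming log entries in the same IP "user agent" shape as the list above. It is not the actual blocker: the names LINE_RE, BANNED_SUBSTRINGS and ips_to_ban are made up for this example.

```python
import re

# One entry per line, in the same `IP "user agent"` shape as the list above.
LINE_RE = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3})\s+"([^"]*)"$')

# Hypothetical ban list of user-agent substrings.
BANNED_SUBSTRINGS = ("HTTrack",)


def ips_to_ban(lines):
    """Return the set of IPs whose user agent contains a banned substring."""
    banned = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(s in match.group(2) for s in BANNED_SUBSTRINGS):
            banned.add(match.group(1))
    return banned
```

Using a set means an IP that shows up more than once, like a few in the list above, is only banned once.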

Greed will get you nowhere, not on my site anyway!

Here a Nutch, There a Nutch, Everywhere a Nutch Nutch

Nutch usage seems to be breeding faster than cousins in Kentucky so I figured it was time to post a sequel to the original How Much Nutch is Too Much Nutch.

Here's a complete breakdown of every IP that I've seen using Nutch with the actual word Nutch in the user agent, for a grand total of 190 IPs crawling to date. Several of them, like Cazoodle, MQBOT and a few .EDUs, are crawling from a block of IPs, but the majority seem to be scattered all over the place.
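
One quick way to separate the crawlers working from a block of IPs from the scattered ones is to group addresses by their first three octets. The Python sketch below treats each such prefix as a /24 block, which is a simplifying assumption (real allocations can be larger or smaller), and the function names and the threshold of 3 are invented for illustration.

```python
from collections import defaultdict


def group_by_block(ips):
    """Group dotted-quad IPs by their first three octets (an assumed /24 block)."""
    blocks = defaultdict(list)
    for ip in ips:
        prefix = ip.rsplit(".", 1)[0]
        blocks[prefix].append(ip)
    return dict(blocks)


def clustered(ips, minimum=3):
    """Return only the blocks that contain at least `minimum` addresses."""
    return {p: v for p, v in group_by_block(ips).items() if len(v) >= minimum}
```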

Here's the list of all the creepy crawling Nutches:

124.32.246.36 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

124.32.246.45 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

128.208.3.173 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; raphael@unterreuth.de)

128.208.6.125 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.200 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.207 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.226 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.227 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.232 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.75 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.77 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.95.1.189 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

128.97.88.68 ilial/Nutch-0.9-dev

128.97.88.70 ilial/Nutch-0.9-dev

129.242.19.138 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

129.34.20.19 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

129.78.64.106 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.137.86 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.202 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.205 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.206 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.211 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.212 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.213 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.125.102 asked/Nutch-0.8 (web crawler; http://asked.jp; epicurus at gmail dot com)

131.112.125.103 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.125.104 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.125.106 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.16.220 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.211.84.21 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

140.247.62.79 blogsearch/Nutch-0.9-dev

140.247.62.80 blogsearch/Nutch-0.9-dev

147.202.90.2 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

159.226.5.82 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.201 ilial/Nutch-0.9-dev

164.67.195.245 ilial/Nutch-0.9-dev

164.67.195.26 ilial/Nutch-0.9-dev

164.67.195.27 ilial/Nutch-0.9-dev

164.67.195.67 ilial/Nutch-0.9-dev

164.67.195.68 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.86 ilial/Nutch-0.9-dev

166.214.93.76 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

192.17.240.19 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.20 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.41 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.43 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.44 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.46 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.47 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.48 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.52 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.56 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.57 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.58 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.60 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.71 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.74 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.76 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

193.145.45.68 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.117 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.118 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.119 HouxouCrawler/0.8-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

193.203.240.120 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.121 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.122 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.252.148.51 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.42.229.3 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

195.72.131.70 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

195.72.131.72 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

195.72.131.73 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

195.72.131.80 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

203.113.130.205 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.147.0.44 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.199.83.162 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.244.218.1 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

207.176.224.241 Nutch/Nutch-0.8.1

207.176.224.245 Nutch/Nutch-0.8.1

207.214.93.42 MyNutch/V 0.3 (JP's Nutch Test Search Engine; jpnutch at yahoo dot com)

208.64.57.65 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.174.3.130 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.196.73.193 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

210.245.31.15 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.245.31.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

211.152.34.34 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.101.97.63 test/Nutch-0.8.1 (test; www.apache.org; test@apache.org)

212.12.114.238 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.137.33.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.156.230.210 BilgiBetaBot/0.8-dev (bilgi.com (Beta) ; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.58.116.72 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.132.175.101 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.157.204.141 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.251.133.12 Misterbot-Nutch/0.7.1 (Misterbot-Nutch; http://www.misterbot.fr; nutch at misterbot.fr)

216.182.225.186 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)

216.182.236.46 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)

216.182.237.45 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)

216.93.185.12 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

217.153.59.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

217.31.51.128 Megatext/Nutch-0.8.1 (Beta; http://www.megatext.cz/; microton@microton.cz)

218.25.39.81 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

220.130.191.231 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.232 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.233 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.234 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.235 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.236 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.237 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.238 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.239 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.240 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

221.114.253.210 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.116.237.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.221.237.35 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

222.173.249.33 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.222.153.250 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.6.168.184 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)

58.186.61.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.187.12.236 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.215.74.242 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.215.75.2 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.87.139.90 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.115 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.116 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.183 Nutch-test/Nutch-0.9-dev

59.160.240.184 Nutch-test/Nutch-0.9-dev

59.160.240.185 Nutch-test/Nutch-0.9-dev

59.176.10.136 NutchCVS/0.01-beta (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

60.248.9.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

61.135.151.175 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

62.129.132.47 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.168.188.151 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.40.33.173 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.40.36.87 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

63.133.162.98 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

63.246.7.209 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.105.36.210 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.241.242.18 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.10 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.60 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.34.172.78 BurstFind Crawler 1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; crawler@burstfind.com)

64.34.180.167 Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.38.10.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.125 Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://www.krugle.com/crawler/info.html; webcrawler@krugle.com)

65.220.67.9 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

65.92.160.39 JLA/Nutch-0.8.1 (beta; http://dynamic.com/index.htm; info at test.com)

66.132.240.180 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.132.249.23 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.15.68.234 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.207.120.226 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.243.31.34 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.111.28.139 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.184.246.61 Nutch/Nutch-0.8 (Nutch Test; none; none)

67.52.101.242 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.178.171.109 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)

68.178.202.79 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)

68.205.124.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.205.127.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.97.222.117 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.248.26.83 Comrite/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.36.233.8 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.55.233.28 Argus/1.1 (Nutch; http://www.simpy.com/bot.html; feedback at simpy dot com)

70.143.79.234 JPNutchTest/Nutch-0.9-dev-JP-0.1 (JP Nutch Test; jpnutch at yahoo dot com)

70.197.81.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.56.66.216 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.90.188.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.96.99.254 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.216.0.210 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.217.33.149 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.241.153.125 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.35.163.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.0.207.162 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.2.25.66 abcxyz/Nutch-0.8 (nutchtesting; nutch; abc@xyz.com)

72.2.25.67 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

72.2.25.71 Nutch/Nutch-0.8

72.5.173.22 sdcresearchlabs-testbot/Nutch-0.9-dev (www.shopping.com/bot.html; researchbot@shopping.com)

72.51.37.148 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.84.30.230 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

75.44.225.44 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

81.173.148.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.173.155.210 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.203.142.109 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.93.168.211 TRankBot/Nutch-0.8.1 (T-Rank AS; http://www.trank.no/; robot at trank dot no)

83.246.79.28 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

84.191.111.92 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

84.231.72.32 agent/Nutch-0.8 (http://lucene.apache.org/nutch/bot.html)

84.231.74.47 nutch/Nutch-0.8.1

85.117.62.114 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

85.18.14.22 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

87.139.106.60 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

88.191.23.109 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

Wasn't that fascinating reading?

This is some crazy shit: a non-stop stream of web crawlers that's almost like a DoS attack, and I suspect it will get even worse as more people try to mine the Internet for free money.

Load up the firewall and your .htaccess filters with protection and brace for impact.
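
If you want to see how many of these things are in your own logs before you start blocking, here's a rough Python sketch. It is not my bot blocker, just an illustration. It assumes each line is an IP, a space, then the user agent, like the list above. The function names (is_nutch_agent, parse_entry, tally_products) are made up for this example. The match is case-insensitive because the name shows up as "Nutch", as "nutch", and sometimes only inside the bot.html URL.

```python
import re
from collections import Counter

# Case-insensitive: the token appears as "Nutch", "nutch", or only inside a URL.
NUTCH_PATTERN = re.compile(r"nutch", re.IGNORECASE)


def is_nutch_agent(user_agent: str) -> bool:
    """Return True if the user agent mentions Nutch anywhere."""
    return bool(NUTCH_PATTERN.search(user_agent))


def parse_entry(line: str):
    """Split an 'IP user-agent' line into (ip, agent); None for blank lines."""
    line = line.strip()
    if not line:
        return None
    ip, _, agent = line.partition(" ")
    return ip, agent


def tally_products(lines):
    """Count Nutch-based entries by the product name before the first '/'."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        _ip, agent = entry
        if is_nutch_agent(agent):
            counts[agent.split("/", 1)[0]] += 1
    return counts


# A handful of entries copied from the list above.
sample = [
    "207.176.224.241 Nutch/Nutch-0.8.1",
    "",
    "69.248.26.83 Comrite/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)",
    "",
    "220.130.191.231 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)",
    "220.130.191.232 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)",
    "84.231.74.47 nutch/Nutch-0.8.1",
]

if __name__ == "__main__":
    for product, hits in tally_products(sample).most_common():
        print(product, hits)
```

Matching on the bare word catches the renamed crawlers too (Comrite, Cazoodle and friends), which is the whole point: most of them never bothered to change the default Nutch description. The same case-insensitive substring test is what you'd want in a firewall or .htaccess rule, but check what the tally shows first so you know what you're about to turn away.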