Saturday, June 17, 2006

New Nutch Sighting at Rediff

Just to make sure our list doesn't grow stale, here's the new Nutch of the day:
"NutchCVS/0.7.2 (Nutch;;"
No clue why a business email company needs a web crawler but that's where it came from.
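
If you want to spot these in your own logs, a quick filter does the trick. Here's a rough sketch in Python; I'm assuming Apache combined log format with the user agent as the last quoted field, and the pattern is only based on the agents quoted in these posts:

    import re
    import sys

    # Matches NutchCVS/0.7.2 style agents plus renamed Nutch derivatives.
    NUTCH_RE = re.compile(r'Nutch(CVS)?/\d|\(Nutch\b', re.IGNORECASE)

    # The last double-quoted field of a "combined" log line is the user agent.
    UA_FIELD = re.compile(r'"([^"]*)"\s*$')

    for line in sys.stdin:
        m = UA_FIELD.search(line)
        if m and NUTCH_RE.search(m.group(1)):
            ip = line.split(None, 1)[0]
            print(ip + "\t" + m.group(1))

Pipe your access log through it and uniq the output to take your own nutch census.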

I noticed the Nutch developers picked up on my previous post and are discussing forcing the default user agent to be changed, which once again hasn't happened, and ways to reduce the amount of actual crawling Nutch does on individual websites.

Good luck on that effort guys, we can use it!

Thursday, June 15, 2006

Hanzo:web Social Archiving is Social Copyright Infringement

OK boys and girls, it's time to get pissed off, as any notion of copyright and control of your site content has been tossed out the window: the fine folks over at hanzo:web ARCHIVE your site content on demand!

That's right, you click on their bookmarklet and TA DA! your page gets archived WITHOUT YOUR PERMISSION on someone else's server.

Here's the most priceless quote on their site:

Only you can save the Web!
So who's going to save the web from some bullshit like this?

Did you bother asking webmasters if they want their websites saved?

I don't want to be archived, I don't need to be saved, take your archiving toys and go fuck yourselves!


I just about wrecked a keyboard while sipping soda when I ran across this:

Respect for content

All archived pages, links and sites are stored exactly as they appeared on the web. Pictures, objects, links and flash are all retained as they are, preserved as originally conceived.


Are you fucking kidding me?

Where's the respect for my fucking copyright?

You'll be archiving pages WITHOUT PERMISSION, possibly with someone's AdSense code embedded, so someone can sit on the archived copy click-frauding the account to death, or stealing content, while the webmaster can't even detect that anyone is accessing the pages via the archive.

When they "archive" your page it gets crawled by the following: "Mozilla/5.0 (compatible; heritrix/1.4.0 +"

inetnum: -
netname: hanzoweb
descr: Hanzo Archives Ltd
Now look at this shit coming from their servers:
"GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +"
"GET /robots.txt" "Python-urllib/1.16"
"GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +"
"GET /robots.txt HTTP/1.0" "Python-urllib/1.16"
"GET /" "Python-urllib/2.4"
"GET /" "Python-urllib/2.4"
So it's looking at robots.txt but what user agent are they looking for?

I dug around on their site and didn't see it, so I have no clue what the Python-urllib is looking for in robots.txt, but it really doesn't matter, because the FAQ page plainly states that they don't give a flying fuck about your robots.txt file: they'll archive the content anyway, no matter WHAT YOU SAY MR. WEBMASTER, and just mark it private:
The original crawl was subject to restrictions by robots.txt. This means that any archived content will be marked as private for browsing by the person crawling it, therefore, unless it's your own archive, you will not see this content.
Sounds to me, as a webmaster, like they're saying "FUCK YOU!".

Well, I blocked your service, so this webmaster is replying in kind: "FUCK YOU!" No trespassing allowed.
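
Blocking on user agent is trivial, by the way. My bot blocker does a lot more than this, but the core idea fits in a few lines; here's a minimal sketch as Python WSGI middleware, with the denylist substrings taken from the agents caught above:

    BLOCKED_UA_SUBSTRINGS = ("heritrix", "python-urllib")

    def ua_blocker(app):
        """Wrap a WSGI app and 403 anything matching the denylist."""
        def guarded(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"No trespassing.\n"]
            return app(environ, start_response)
        return guarded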

This is a huge problem as people will be snapping copies of anything for any reason and you, the webmaster, will have no control over what Hanzo:web stores or displays nor what these people do with your content after the fact.

BTW, before people start flaming me that I should've "contacted" them to find out what they were looking for in the robots.txt file: if they were doing it right, the path to this information would've been in the user agent string, just like every other crawler does it, or highlighted in the FAQ.

Nice idea but your draconian implementation doesn't deserve a second chance and it's blocked, out of mind, not a problem for me anymore.

FWIW, my bot blocker already stopped them from getting anything in the first place, but I'm blocking their whole range of IPs just to make sure nothing slips through the cracks, like stealth crawling, since they have already demonstrated a complete lack of respect for everyone's websites.
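
And blocking a whole allocated range instead of chasing individual addresses is just CIDR math. A sketch using Python's ipaddress module; the netblock below is a documentation placeholder, not their actual range:

    from ipaddress import ip_address, ip_network

    # Placeholder netblock: substitute the ranges from the whois record.
    BLOCKED_NETS = [ip_network("192.0.2.0/24")]

    def is_blocked(remote_addr):
        """True if the client IP falls inside any denied netblock."""
        addr = ip_address(remote_addr)
        return any(addr in net for net in BLOCKED_NETS)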

Why is MonsterCommerce mining my site?

Didn't think this was worth mentioning until it happened several days in a row.

This is what's requesting various web pages:
"Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"
So the question is what do they want?

Are they scraping my directory for potential customer leads?

Very curious indeed.

Wednesday, June 14, 2006

Green Template Must Go

OK, I can't stand it anymore, some of the colors in this blog are making me crazy, and I just tweaked a couple of fonts because I could barely read the block quotes.

Going to either tweak this template some more or ditch it for something else altogether.

Any suggestions?

Tuesday, June 13, 2006

How Much Nutch is TOO MUCH Nutch?

Not too long ago I set off a storm of comments when I called the writer of Nutch on the carpet, claiming his creation was being used excessively and was abusing my server all over the place.

I was told by the legions of nutchies out there that I sucked, was told to get off the public network, and was called everything from an idiot to a grumpy webmaster and worse. They all claimed that Nutch was a wonderful thing that made search engines that were beneficial, and that I should stop complaining, shut up and let them crawl.


Being a patient man, I sat back and waited to collect enough data to show those nutchies that the usage of nutch is growing out of control and I really don't need 100+ unique IP addresses from everywhere from Turkey to Japan crawling my goddamn website.

Theoretically, if these 100 crawlers each ask for my max of 40K+ pages, that's over 4 million pages served, assuming I let them have them in the first place, mostly for no purpose whatsoever.
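
Here's the back-of-the-envelope math, in case anyone thinks I'm exaggerating; the average page size is my assumption, purely for illustration:

    crawlers = 100            # distinct Nutch sources seen
    pages_per_crawl = 40000   # my site's page count, roughly
    avg_page_kb = 15          # assumed average page size

    total_pages = crawlers * pages_per_crawl            # 4,000,000 pages
    total_gb = total_pages * avg_page_kb / 1024.0 / 1024.0
    print("{:,} pages, roughly {:.0f} GB".format(total_pages, total_gb))
    # 4,000,000 pages, roughly 57 GB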

Here's the list of the nutch plague seen on my site recently:
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch running at UW;;
NutchCVS/0.8-dev (Nutch running at UW;;
NutchCVS/0.8-dev (Nutch running at UW;;
NutchCVS/0.8-dev (Nutch running at UW;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
Misterbot-Nutch/0.7.1 (Misterbot-Nutch;; nutch at
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.06-dev (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.05 (Nutch;;
NutchCVS/0.05 (Nutch;;
NutchCVS/0.05 (Nutch;;
BurstFind Crawler 1.0/0.7.1 (Nutch;;
Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.8-dev (Nutch;;
Krugle/Krugle,Nutch/0.8+ (Krugle web crawler;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.8-dev (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
Comrite/0.7.1 (Nutch;;
Argus/1.1 (Nutch;; feedback at simpy dot com)
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.06-dev (Nutch;;
sdcresearchlabs-testbot/0.8-dev (;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.2 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
NutchCVS/0.7.1 (Nutch;;
So we cranked all these IPs thru the reverse DNS grinder just to see who was running all this Nutch without changing the default crawler strings. Some of them failed reverse DNS, but I'm too lazy today to bother WHOIS'ing that list, so anyone who feels compelled to dig deeper should feel free to post the results as a comment.

The reverse DNS of the IPs:
name =
name =
name =
name =
name =
name =
name =
name =
name =
name =
name =
name =
name =
name =
** server can't find SERVFAIL
name =
name =
name =
name =
name =
name =
** server can't find SERVFAIL
** server can't find NXDOMAIN
** server can't find NXDOMAIN
** server can't find NXDOMAIN
** server can't find NXDOMAIN
** server can't find NXDOMAIN
** server can't find NXDOMAIN
name =
** server can't find NXDOMAIN
name =
name =
name =
name =
name =
canonical name =
name =
** server can't find NXDOMAIN
** server can't find NXDOMAIN
*** Can't find No answer
name =
** server can't find NXDOMAIN
name =
** server can't find NXDOMAIN
name =
name =
name =
name = customer-reverse-entry.
name =
name =
name =
** server can't find NXDOMAIN
name =
name =
name =
name =
name =
name =
name =
** server can't find NXDOMAIN
name = HOSTED-BY.PBTECH.COM.
name =
name =
name =
name =
name = customer-reverse-entry.
name =
name =
name =
name =
name =
name =
** server can't find NXDOMAIN
** server can't find NXDOMAIN
** server can't find NXDOMAIN
** server can't find NXDOMAIN
name =
name =
** server can't find NXDOMAIN
name =
name =
name =
name =
** server can't find SERVFAIL
** server can't find SERVFAIL
name =
name =
name =
name =
name =
name =
name =
name =
** server can't find NXDOMAIN
name =
name =
canonical name = 162.160/
162.160/ name =
name =
** server can't find NXDOMAIN
name =
name =
*** Can't find No answer
name =
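
For the curious, the "reverse DNS grinder" is nothing exotic; here's the same job in a few lines of Python, where socket.gethostbyaddr does the PTR lookup and failures come back as exceptions much like the nslookup errors above (the address shown is a placeholder):

    import socket

    def reverse_dns(ips):
        """Yield (ip, hostname or error note) for each address."""
        for ip in ips:
            try:
                host, _aliases, _addrs = socket.gethostbyaddr(ip)
                yield ip, host
            except socket.herror as e:  # NXDOMAIN and friends
                yield ip, "lookup failed: %s" % e

    for ip, result in reverse_dns(["192.0.2.1"]):
        print(ip, "=>", result)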
Looks like just about everybody's running it, from colleges to corporations to Uncle Bob crawling the web from a dial-up (talk about slow), but where is the benefit for those of us being abused with it?

I'll admit that a few of the nutches actually resulted in search engines showing up online, but who uses these search engines? Best I can tell, none of the actual 400K visitors/month to the site being crawled use any of these so-called search engines, and they probably never will.

Here's the problem, and maybe I'm just using Nutch as an example because this virulent trend is so easy to spot from a single source: the number of things attempting [they didn't succeed] to crawl my site daily would easily become a significant portion of my daily traffic if I let them all in, which is insane.

What happens when this trend reaches its natural conclusion?

Where it's heading is that crawlers will soon exceed the actual visitors in terms of daily pages downloaded, as more and more search engines, aggregators, and spybots come online looking for more ways to sell a slice of the internet to an ever-increasing bunch of specialized niche markets. Not to mention we're still dealing with all the scrapers, link checkers and downright dumb things like referrer checkers abusing our bandwidth.

It's out of control and someone needs to put the brakes on this nonsense.

Someday soon crawlers, with the exception of the big search engines, will need to ask permission to get on just about any website of scale, and will need to make a compelling argument why they should be allowed to index the site. The day of just taking what you want and doing what you want with it will surely come screeching to a halt as the burden of all this bandwidth usage starts to hit the hosting companies and trickles down to the webmasters.

Maybe the webmasters will fight back first and take control before it's too late.

Here's hoping.

Monday, June 12, 2006


There's a new company called Pronto with a product in beta that not only crawls your site but also, via a browser plug-in, displays message toasts while visitors are looking at your products.

For instance, someone is looking at the widget your online store is selling and suddenly a window pops up telling them they can get this widget cheaper elsewhere, trying to direct your shoppers away from your store.

Basically, this takes that kind of service one step further by incorporating it into the browser, and the potential harm to all the smaller online stores is enormous.

Anyone running any kind of ecommerce or affiliate site will definitely want to block this:

Here's the critical info on this crawler:

User Agent: "RedCarpet/1.3 ("
Actual IPs used:
Complete blocks of allocated IPs:
RedCarpet, Inc. INFLOW-9359-113352-18374 (NET-66-45-38-80-1) -

RedCarpet, Inc. INFLOW-9359-113352-19316 (NET-66-179-107-112-1) -

RedCarpet, Inc. INFLOW-9359-113352-19482 (NET-216-183-117-128-1) -
This is definitely a company no ecommerce site wants crawling, unless you're sure you have the best prices, so block 'em!
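
One wrinkle if you do this yourself: whois hands you start and end addresses, not CIDR blocks. Modern Python will convert the ranges for your denylist; placeholder addresses below, since the real ones are in the whois output above:

    from ipaddress import ip_address, summarize_address_range

    # Placeholder start/end pair standing in for a whois range.
    RANGES = [("192.0.2.80", "192.0.2.95")]

    for start, end in RANGES:
        for net in summarize_address_range(ip_address(start), ip_address(end)):
            print(net)  # 192.0.2.80/28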

Bad Karma is a potential DDoS threat

Some guy has put out a piece of freeware called Referrer Karma which fetches the referring page and checks to see if it actually has a link to the site referred. If no link to your site exists on the referring page, it slams the door on the visitor, assuming it's a referrer spammer.
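
As best I can tell, the check amounts to something like this; a conceptual sketch only, since I haven't read the actual Referrer Karma code:

    import urllib.request

    def referrer_links_back(referrer_url, my_site):
        """Fetch the claimed referrer and look for a link back to my_site.

        Note the outbound fetch fires for every unverified referrer,
        which is exactly what makes forged Referer headers dangerous."""
        try:
            with urllib.request.urlopen(referrer_url, timeout=10) as resp:
                page = resp.read(65536).decode("utf-8", "replace")
        except (OSError, ValueError):
            return False
        return my_site in page

Keep that unconditional outbound fetch in mind for what follows.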

Two problems with that approach:

  1. Links that pass thru redirect pages on directory sites will fail this test every time, as the referrer is the redirect page itself, not a web page with links on it.
  2. Sites that block bots, like mine, toss out error pages when stupid user agents appear, and VOILA! the visitor coming from my site gets bounced by this stupid script.
Here's the info:
"Referrer Karma/2.0"
Next, let's explore my concern with potential vulnerabilities in Referrer Karma.

If you think about the implementation of Referrer Karma for a minute, you'll realize it would allow one kiddie script to pull off a DDoS attack. This could be accomplished by issuing thousands of requests to a bunch of sites running Referrer Karma, with each request containing a faked referrer pointing at the target site you're attacking.

You wouldn't need to wait for the page request to complete; just send out a ton of requests to a bunch of servers and terminate the socket as soon as the webserver starts responding. No need to download the resulting page, as Referrer Karma has already done your dirty deed for you by hitting the other site and asking for the requested page.

Ask for a few thousand pages in a few seconds from a bunch of sites using Referrer Karma, then step back and watch the fun as the target server melts.

Lack of Intelligence Competence Crawler

Well, here's yet another site, called the Intelligence Competence Center, trying to crawl the web looking for things they can sell to various industries.

Here's the crawler details:
"iCCrawler ("

"iCCrawler ("

all IPs it's used with my site...
I'm really getting sick and tired of these fucking corporate leeches that keep crawling [pun intended] out of the woodwork.

Sunday, June 11, 2006

HK Creepy Crawler

Been seeing this same bot "Java/1.4.1_04" asking for the same pages from the following IPs multiple times:

Don't know if it's a hosting account or ISP, but it's worth keeping an eye on these IPs.
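
If you want to watch for this kind of repetition in your own logs, a quick tally does it; a sketch, again assuming Apache combined log format:

    import sys
    from collections import Counter

    # Count (ip, path) pairs; the same page hit repeatedly from the
    # same IP is the tell-tale pattern described above.
    hits = Counter()
    for line in sys.stdin:
        parts = line.split()
        if len(parts) > 6:  # ip ident user [date tz] "METHOD path HTTP/x"
            hits[(parts[0], parts[6])] += 1

    for (ip, path), n in hits.most_common(20):
        if n > 1:
            print("%4d  %s  %s" % (n, ip, path))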