Sunday, November 12, 2006

Heritrix Activity Report

Heritrix isn't being adopted at the same rapid pace as Nutch, but it keeps showing up from more and more places.

Here's the list of sightings. The one that gives me the biggest giggle is the first, which claims to be "google.com" but actually came from Mannheim University in Germany.

134.155.241.9 "Mozilla/5.0 (compatible; heritrix/1.10.0 +http://google.com)"

137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"

137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"

152.163.214.140 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"

152.163.214.141 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"

152.163.214.144 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"

193.40.192.35 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://erika.nlib.ee)"

195.39.35.118 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.researcher.cz)"

198.162.51.70 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"

207.241.233.35 "Mozilla/5.0 (compatible;archive.org_bot/heritrix-1.9.0-200608171144 +http://pandora.nla.gov.au/crawl.html)"

209.128.119.17 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"

209.128.119.46 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"

216.182.228.85 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"

217.91.71.203 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.schluetersche.de)"

24.8.197.68 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

67.162.138.161 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

71.229.152.72 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

71.56.215.150 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"

72.20.99.46 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://www.accelobot.com)"

87.98.198.194 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"

The other one I found amusing was Accelobot, which claims to "help automate market research." I wonder if their research showed them I wasn't interested in their help?

It's not nearly as popular as other tools, but unfortunately it's picking up a little steam.

We'll keep an eye on this and let people know when it hits epidemic proportions.
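
For anyone who wants to keep the same kind of watch on their own logs, here is a minimal sketch in Python. It assumes each line looks like the sightings listed above (an IP address followed by the quoted user agent); real access logs usually carry more fields, so the pattern would need adjusting. The names `parse_sighting` and `tally_by_operator` are made up for this illustration.

```python
import re
from collections import Counter

# Assumed line format, copied from the list in this post:
#   <ip> "Mozilla/5.0 (compatible; heritrix/<version> +<operator url>)"
# The "[/-]" also accepts the "heritrix-1.9.0-..." variant seen above.
SIGHTING = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"'
    r'[^"]*?heritrix[/-](?P<version>\d+(?:\.\d+)*)'
    r'[^"]*?\+(?P<url>[^\s)"]+)'
)


def parse_sighting(line):
    """Return (ip, version, operator_url), or None if the line is not a Heritrix hit."""
    match = SIGHTING.search(line.strip())
    if match is None:
        return None
    return match.group("ip"), match.group("version"), match.group("url")


def tally_by_operator(lines):
    """Count Heritrix sightings per self-declared operator URL."""
    counts = Counter()
    for line in lines:
        parsed = parse_sighting(line)
        if parsed is not None:
            counts[parsed[2]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '134.155.241.9 "Mozilla/5.0 (compatible; heritrix/1.10.0 +http://google.com)"',
        '24.8.197.68 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"',
        '67.162.138.161 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"',
    ]
    for url, hits in tally_by_operator(sample).most_common():
        print(hits, url)
```

Keep in mind that the operator URL is whatever the person running the crawler typed in, as the "google.com" entry shows, so the IP address is the more trustworthy half of each sighting.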

2 comments:

Scott Huot said...

I just started a new site about 2 weeks ago, and heritrix is my first unknown spider on the scene (compatible; heritrix/1.8.0 +http://crawlerx51.com)

It's above the msnbot for activity but below google and yahoo.

I'm just very slowly building the site (there's about 12 pages so far) - I have only one link to it anywhere so that the indexing can start, but crawlerx51 found it pretty damn quick.

I also used an xml sitemap that I pinged at the moreover server - that could be it also.

It's interesting looking at the stats for a beginning site.

Anonymous said...

I've already spoken to Accelovation, the owner of the bot. The bot was hammering a number of my sites for hundreds of hours (literally...days...3-4 hits per minute).

Management did nothing when I asked that the bot stop its antics.

I asked some powerful friends ("men in black") to take a look at the firm from the vantage point of the Irvine Skunkworks. Their comments:

1) The firm's servers are relatively weak when it comes to security and intrusion (people who live in glass houses shouldn't be throwing stones for days on end). Unsecured laptops, mostly Macs, are hooked right into the network and are "accessible," if you know what I mean.

2) There seems to be a significant Israeli/Russian criminal angle at work behind the scenes at this company. Several of their top java/php folks are on Interpol's "black hat" CID list. The "market research business" the company crows about must be a day job for most of them.

3) There's been lots of talent cross-over between this firm and Google. Many of those who work here came from Google: they were fired by Google for intellectual property or other malfeasance issues.

4) Google may be a major client (read: only) of theirs according to insiders.