Thursday, May 18, 2006

Another Web 2.0 Scraper Company

Don't roll your eyes and think that Bill's just making a fuss as this company claims they scrape:

Real-time Data Collection - Technologies for crawling, monitoring and scraping newly posted web content including the content from the “deep web.”
See, I'm not making this shit up, honest to god admitted scrapers!

Not only that, they proudly display the number of sources they scrape, updated constantly, on their home page.

As best I can tell, they crawl without looking at robots.txt, and they don't identify the source of the crawler beyond the default "Jakarta Commons-HttpClient" user agent. Their primary interest in my site seems to be crawling content referenced from my XML feed.
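For anyone who wants to flag this kind of visitor in their own logs, here is a minimal Python sketch, assuming you already have the User-Agent string in hand. The pattern list and the function name `is_generic_client` are my own illustration, not anything this company publishes; the two patterns are simply the library default agents quoted in this post.

```python
import re

# Hypothetical pattern list: the library-default user agents quoted in this post.
# Extend it with whatever shows up in your own logs.
GENERIC_CLIENT_PATTERNS = [
    re.compile(r"^Jakarta Commons-HttpClient/"),
    re.compile(r"^Java/\d"),
]


def is_generic_client(user_agent: str) -> bool:
    """Return True if the user agent is a bare HTTP-library default."""
    return any(p.search(user_agent) for p in GENERIC_CLIENT_PATTERNS)
```

A user agent is trivial to change, so a check like this is only a first filter and not a substitute for blocking by address.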

I'm not sure what information they could possibly think is on my site that could help "Institutional Investors leverage the latest technology and data to make better investment decisions" but they'll just have to be in the dark and use the Magic 8-ball from now on.

They have been seen using these IPs: "Jakarta Commons-HttpClient/3.0" "Jakarta Commons-HttpClient/3.0" "Jakarta Commons-HttpClient/3.0" "Jakarta Commons-HttpClient/3.0"
They are all part of the Geometric Group:
Geometric Group DP-206-188-0-0 (NET-206-188-0-0-2) -
I'd just block the whole range and be done with it and hope we don't cause the market to crash.
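If you would rather do the range check in application code than in the server config, a rough sketch with Python's standard `ipaddress` module could look like this. The network below is a placeholder from the reserved documentation range (203.0.113.0/24), not the real range, and `BLOCKED_NETWORKS` and `is_blocked` are names I made up for the example; substitute the range you actually see in your logs.

```python
import ipaddress

# Placeholder only: 203.0.113.0/24 is a reserved documentation range,
# NOT the range discussed in this post. Replace it with your own.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address; treat it as not blocked.
        return False
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Blocking the whole range this way keeps working when they change user agents, which is the main reason to prefer it.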

UPDATE: They switched to Java in 2007!

01/22/2007 "Java/1.5.0_03"
01/22/2007 "Java/1.5.0_06"

Then, mysteriously, they stopped pinging my server on 03/15/2007 after a year of being fed garbage.

Think someone finally realized they were getting bounced?


Anonymous said...

Hm, I have these, too.

Maybe they're worthy of my second IP block (after Layered Technologies).

Anonymous said...

You must have hit a nerve because they have changed the wording (and the URL).

"Identifying, harvesting and processing massive amounts of Internet data in real-time"

and the url now is:

That was the reason I blacklisted them in the first place. Just the word "Scrape" was enough for me...

Anonymous said...

And they changed user-agents, too:
I now have

* Java/1.5.0_06
* Mozilla/5.0

Da Scritch said...

I just have traffic from Never had this IP visiting me before. See the UA:

- - [17/Sep/2007:23:30:40 +0200] "GET /blog.php/ HTTP/1.1" 200 52701 "-" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv: Gecko/20070515 Firefox/"
- - [17/Sep/2007:23:30:43 +0200] "GET /blog.php/ HTTP/1.1" 200 52683 "-" "Mozilla/5.0"
- - [17/Sep/2007:23:30:43 +0200] "GET /dotclear/rss.php HTTP/1.1" 302 367 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20070515 Firefox/"
- - [17/Sep/2007:23:30:45 +0200] "GET /blog.php/post/2007/09/17/RIP-URL-URI-IRI-INRI-W3C HTTP/1.1" 200 38304 "-" "Firefox/1.0 (Windows; U; Win98; en-US; Localization; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)"

Anonymous said...

And now my htaccess is 22kb long :(

Still one less undesirable getting access to my content :)

Thanks for the tips. I've been looking for info on this IP range for a while.