Don't roll your eyes and think Bill's just making a fuss, because this company openly claims to scrape:
Real-time Data Collection - Technologies for crawling, monitoring and scraping newly posted web content including the content from the "deep web."
See, I'm not making this shit up, honest-to-god admitted scrapers!
Not only that, they proudly display the number of sources they scrape, updated constantly, on their home page.
As best I can tell, they crawl without looking at robots.txt and don't identify the source of the crawler other than as "Jakarta Commons-HttpClient". Their primary interest in my site seems to be crawling content referenced from my XML feed.
I'm not sure what information they could possibly think is on my site that could help "Institutional Investors leverage the latest technology and data to make better investment decisions", but they'll just have to be in the dark and use the Magic 8-ball from now on.
They have been seen using these IPs:
22.214.171.124 "Jakarta Commons-HttpClient/3.0"
126.96.36.199 "Jakarta Commons-HttpClient/3.0"
188.8.131.52 "Jakarta Commons-HttpClient/3.0"
184.108.40.206 "Jakarta Commons-HttpClient/3.0"
They are all part of the Geometric Group:
Geometric Group DP-206-188-0-0 (NET-206-188-0-0-2)
220.127.116.11 - 18.104.22.168
I'd just block the whole range and be done with it and hope we don't cause the market to crash.
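For anyone who wants to do the same, here's a rough sketch in Python of blocking by address range and user agent. Everything in it is an assumption for illustration: the range is a documentation placeholder (203.0.113.0 through 203.0.113.255), not the real one, and the names `is_blocked`, `BLOCKED_NETWORKS` and `BLOCKED_AGENT_PREFIXES` are made up. It's one way to do it, not necessarily how any particular server does.

```python
import ipaddress

# Hypothetical placeholder range; substitute the first and last address
# of the range you actually want to block.
RANGE_FIRST = ipaddress.ip_address("203.0.113.0")
RANGE_LAST = ipaddress.ip_address("203.0.113.255")

# Turn the first-last range into one or more CIDR networks.
BLOCKED_NETWORKS = list(
    ipaddress.summarize_address_range(RANGE_FIRST, RANGE_LAST)
)

# User-agent prefixes to refuse; extend with whatever shows up in your logs.
BLOCKED_AGENT_PREFIXES = ("Jakarta Commons-HttpClient",)


def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if the request should be refused."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # This sketch treats an unparseable address as suspicious.
        return True
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    # str.startswith accepts a tuple of prefixes.
    return user_agent.startswith(BLOCKED_AGENT_PREFIXES)
```

Checking the range first means a change of user agent alone doesn't get a crawler back in, as long as it keeps coming from the same addresses.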
UPDATE: They switched to Java in 2007!
01/22/2007 22.214.171.124 "Java/1.5.0_03"
01/22/2007 126.96.36.199 "Java/1.5.0_06"
Then, mysteriously, they stopped pinging my server on 03/15/2007 after a year of being fed garbage.
Think someone finally realized they were getting bounced?