Saturday, May 20, 2006

RED ALERT #2 - Distributed IP Scraper on BBCOM

This one is virtually identical to the scraper spotted on Vericenter, same profile to the letter.

Claims to be the same exact browser:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20060124 Firefox/
This spider was seen operating from these IP addresses:
This IP range belongs to BBCOM:
OrgName: Backbone Communications, Inc.
Address: 515 South Flower Street
Address: Suite 4350
City: Los Angeles
StateProv: CA
PostalCode: 90071
Country: US

NetRange: -
I'm blocking for the moment, but keeping an eye on the rest of their network in case these scrapers switch to a new block.

Another Proximity Alert on PCCW

Getting multiple hits from the same IPs over and over asking for the exact same web page, no images, nothing else, just the same web page day after day. One day the user agent was blank and another day it was simply "Mozilla/5.0".

This is the range of IPs that just keep repeating the same request.
Appears to be coming from HK, can't tell if it's a shared DHCP situation or what, but I'm blocking just to be safe.

Another DUMBASS Crawler

This one hails from South Africa, tried to crawl "/#top" as "GET /%23top"
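
A real browser strips the "#top" fragment before sending the request, so an encoded "#" in the path is a strong hint that something is blindly following hrefs. A minimal sketch of that check, assuming the raw request target is available as a string; the function name is hypothetical and this is not the bot blocker's actual code:

```python
from urllib.parse import unquote


def requests_fragment(request_target: str) -> bool:
    """Return True if the path part of the request contains a '#',
    either literal or percent-encoded as %23."""
    # Only look at the path; a %23 inside a query string can be legitimate.
    path = request_target.split("?", 1)[0]
    return "#" in unquote(path)
```

With this, `requests_fragment("/%23top")` is true while an ordinary `/index.html` is not.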

You poor dumb fucker, BUSTED!

RED ALERT - Distributed IP Scraper Hosted on Vericenter

Here's a real sneaky scraper using distributed IPs, running a bot that almost appears designed to fly under my bot blocker's radar. No single IP address accessed enough pages or did anything obnoxious enough to set off any triggers, but the collective accesses set off a proximity alarm and they got nailed anyway.
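
One way to picture a proximity alarm like this is to count hits per network block instead of per IP. The sketch below is an illustration only, not the actual bot blocker: the /24 grouping, the threshold, and the function name are all assumptions, and the sample addresses in the usage note come from the reserved documentation ranges.

```python
import ipaddress
from collections import Counter


def collective_offenders(hits, prefix_len=24, block_limit=100):
    """Group per-request IPv4 addresses into network blocks and return
    the blocks whose combined hit count reaches block_limit.

    hits: iterable of IPv4 address strings, one entry per request.
    """
    per_block = Counter()
    for ip in hits:
        # strict=False lets us pass a host address and get its enclosing block.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        per_block[net] += 1
    return {str(net): n for net, n in per_block.items() if n >= block_limit}
```

Ten addresses in 192.0.2.0/24 making three requests each would stay under a per-IP limit of, say, five, yet together they add up to 30 hits for the block and trip a `block_limit` of 20.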

The scraper is pretending to be Firefox for Linux:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20060124 Firefox/
The range of IPs noticed in this scrape attack is as follows:
The host information is as follows:
OrgName: VeriCenter, Inc.
Address: 757 N Eldridge Parkway
City: Houston
StateProv: TX
PostalCode: 77079
Country: US
NetRange: -
Although the attack seems to be centered on the block at the Houston datacenter of Vericenter, I think I'm going to completely block Vericenter as it doesn't appear to have any ISP facilities [i.e. NO HUMANS] and see if anything else bounces off the bot blocker from their facilities.

Thursday, May 18, 2006

Another Web 2.0 Scraper Company

Don't roll your eyes and think that Bill's just making a fuss as this company claims they scrape:

Real-time Data Collection - Technologies for crawling, monitoring and scraping newly posted web content including the content from the “deep web.”
See, I'm not making this shit up, honest to god admitted scrapers!

Not only that, they proudly display the number of sources they scrape updated constantly on their home page.

They appear to crawl without looking at robots.txt best I can tell, don't identify the source of the crawler other than it's the "Jakarta Commons-HttpClient", and their primary interest in my site seems to be attempting to crawl content referenced from my XML feed.

I'm not sure what information they could possibly think is on my site that could help "Institutional Investors leverage the latest technology and data to make better investment decisions" but they'll just have to be in the dark and use the Magic 8-ball from now on.

They have been seen using these IPs: "Jakarta Commons-HttpClient/3.0" "Jakarta Commons-HttpClient/3.0" "Jakarta Commons-HttpClient/3.0" "Jakarta Commons-HttpClient/3.0"
They are all part of the Geometric Group:
Geometric Group DP-206-188-0-0 (NET-206-188-0-0-2) -
I'd just block the whole range and be done with it and hope we don't cause the market to crash.

UPDATE: They switched to Java in 2007!

01/22/2007 "Java/1.5.0_03"
01/22/2007 "Java/1.5.0_06"
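
Both agent strings are stock defaults: "Jakarta Commons-HttpClient/3.0" is what Commons HttpClient sends when nobody sets a user agent, and "Java/1.5.0_xx" is what Java's built-in URL connection sends. A minimal sketch of flagging that kind of client, assuming the user agent arrives as a plain string; the prefix list and function name are illustrative and not the blocker's real rules:

```python
# Default user agents of HTTP libraries seen in the logs above.
LIBRARY_AGENT_PREFIXES = ("Jakarta Commons-HttpClient/", "Java/")


def is_library_agent(user_agent: str) -> bool:
    """Return True if the user agent is an untouched HTTP library default."""
    # str.startswith accepts a tuple and matches any of the prefixes.
    return user_agent.strip().startswith(LIBRARY_AGENT_PREFIXES)
```

Matching on the prefix rather than the full string keeps the rule working when the crawler's operator upgrades the library or the Java runtime.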

Then mysteriously, stopped pinging my server on 03/15/2007 after a year of being fed garbage.

Think someone finally realized they were getting bounced?

Wednesday, May 17, 2006

Blue Frog Legs and Spam for Dinner

Normally I don't comment on the news and such but Blue Security shutting down their anti-spam Blue Frog operation is stunning. Obviously the spammers were feeling the pressure and the plan was working, or they wouldn't have attacked. Then in this shocking turn of events the anti-spamming Generals leading the attack not only retreated, they resigned from the effort!

So what if the spammers brought down a few servers and services here and there?

Isn't that the whole idea to get those spamming bastards out in the open so people can track them and block their asses once and for all, put them out of business?

When you start a war you certainly don't pack up and go home the minute you get a bloody nose so I suspect that there's a lot more to this story as people just don't fold so easily from a purely technological war. With all the money at stake in spamming, I'm suspecting the threats got a bit more personal which resulted in the sudden shut down, but that's purely speculating on my part.

If this escalated into seriously dangerous territory, where it seemed to be heading, the big service providers, government and everyone else would've gotten involved and put a permanent stop to those responsible.

Unfortunately you Blue Pussies gave up before we ever got a chance for it to get really interesting.

Maybe someone with balls will step up and continue where you left off.

Tuesday, May 16, 2006

Port 80 Proxies Expose Themselves

Don't know whoever the fucking morons are writing those stupid fucking proxy servers, but a shitload of them were just blocked today when we noticed they were appending ":80" to the URL and it shows up in the HTTP_HOST parameter.

Normally HTTP_HOST just has something like "" but when the connection is initiated from a certain cluster of proxy servers it shows up as "" which is trivial to block.
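
A minimal sketch of that check, assuming the Host header value is handed over as a string. The function name and the example hostnames in the usage note are made up for illustration. Keep in mind that an explicit ":80" is legal HTTP, so this is a heuristic that happens to fit the browsers I see, not a protocol violation:

```python
def host_has_explicit_default_port(host_header: str) -> bool:
    """Return True if the Host header spells out the default port 80.

    Browsers normally omit the port when it is the default, so an
    explicit ':80' hints that something else built the request.
    """
    return host_header.strip().lower().endswith(":80")
```

For example, `host_has_explicit_default_port("www.example.com:80")` is true, while "www.example.com" and "www.example.com:8080" both pass untouched.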

Saved me a shitload of work tracking down the IPs to block.

Thank you VERY MUCH you stupid fucking assholes!

Search Engine Harvesting

Probably never would've noticed this but the referral string changing set off my referral spam trap. Turned out it wasn't a referral spammer at all but someone crawling my site trying to mine specific topic search engine results from both Google and Yahoo.

Very strange behavior too.

Cloaked as MSIE, downloaded all the images as well to remain hidden, yet crawled so fast it also set off a speed trap.
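
A speed trap of this sort can be as simple as counting requests per IP inside a sliding time window. The sketch below is only an illustration under assumed limits; the class name, the thresholds, and the use of plain float timestamps are all hypothetical and not the real trap:

```python
from collections import defaultdict, deque


class SpeedTrap:
    """Flag an IP once it makes more than max_hits requests
    within window_seconds."""

    def __init__(self, max_hits=10, window_seconds=5.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request and return True if the IP is now too fast."""
        recent = self.hits[ip]
        recent.append(timestamp)
        # Drop requests that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits
```

Because the count covers every request, downloading all the images to look like a browser only makes the crawler hit the limit sooner.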

Stupid stupid stupid.

Dragonfly Crawler

No clue what this thing is and nobody seems to have any real information about it, but it's been seen visiting from 5 different IP addresses. "dragonfly(" "dragonfly(" "dragonfly(" "dragonfly(" "dragonfly("
Claims to be associated with ( and the IP address is so close it's possible.

If anyone has any additional information it would be helpful.

When Innovations Collide

Some web crawler has hit my site a few times called Heritrix which appears to be written mostly by the team at, the same team that created ia_archiver for those of you that haven't had your coffee yet.

Yes, it supports robots.txt, but if you didn't know this damn thing existed you wouldn't bother blocking it now would you?

People writing crawlers wonder why webmasters get pissed tracking and opting out of all this nuisance crawling on their websites, but I digress, that's an old rant.

The real amusement is that Heritrix claims their technology is designed to "collect the digital artifacts of our culture and preserve them for the benefit of future researchers and generations" which is a bunch of pretty language to try to sidestep downloading a website without permission, especially when the webmaster probably isn't aware of your crawler, doesn't matter how you try to candy coat it.

Now comes the fun part, let's see who was using it and why!

Today's attempted crawl was HUGE so it's safe to assume this thing has been on my site in the past and apparently the crawler was even banned on a previous IP address: "Mozilla/5.0 (compatible; heritrix/1.6.0 +
Today the crawler used a different IP, could be DHCP, could be on purpose to sidestep the previous ban. Who knows, but it only got a couple of pages before the doors were automatically slammed by the bot blocker: - "Mozilla/5.0 (compatible; heritrix/1.6.0 +"
With my curiosity in overdrive, it was time to research and see why they were crawling my site. Not a clue as there's nothing but a "This Web Site Coming Soon" site under construction page, but the WHOIS for the site was very revealing.
Michael Osofsky
1758 Shoreline Blvd. Suite B
Mountain View, California 94043
United States

Registered through:
Created on: 09-Mar-05
Expires on: 09-Mar-07
Last Updated on: 06-Mar-06

Administrative Contact:
Osofsky, Michael
1758 Shoreline Blvd. Suite B
Mountain View, California 94043
United States
(650) 968-4741 Fax --

Technical Contact:
Osofsky, Michael
1758 Shoreline Blvd. Suite B
Mountain View, California 94043
United States
(650) 968-4741 Fax --

Domain servers in listed order:
This Michael seems to be involved with a company called Accelovation and he seems to be big in the innovation circles, having founded the MIT Innovation Club.

According to the Accelovation website:
Accelovation is the first and only Market Discovery System (MDS) that allows innovators to mine the online world for insights into unmet needs, trends, innovations and market activity.
Sound familiar?

We crawl you and use your information without permission to make a profit.

Where have we heard this before?

I'll bet they'll be surprised at my attitude about this, but they should try reading some webmaster forums and find out that what they're doing probably isn't welcome without permission and without some clue posted about what the benefits are to the webmaster for allowing his site to be crawled. Yada yada yada, we've been down this path a few times before and it's getting old.

Sorry, but your innovation collided with my innovation called a bot blocker.

Your crawl is denied, and thanks for playing.