Thursday, September 07, 2006

Counting Scrapers on your Abacus

Had a couple of persistent little fuckers hosted at Abacus, which I've been monitoring for months now, that just keep trying and trying to download a boatload of pages.

The specific IPs of these boxes are:

206.225.82.155 "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"

206.225.91.164 "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"

206.225.83.179 "Evaal/0.7.2 (Evaal search engine; http://evaal.coml; bot@evaal.com)"

216.55.161.38 "Java/1.4.1_04"

216.55.142.118 "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"

216.55.162.3 "PEAR HTTP_Request class ( http://pear.php.net/ )"

216.55.147.80 "sna-0.0.1 mikeelliott@hotmail.com"
Toss in a couple of proxies:
206.225.85.127
206.225.86.86
And some other miscellaneous bullshit not worth mentioning.

Here's what to block:
OrgName: Abacus America Inc.
OrgID: ABAC
NetRange: 206.225.80.0 - 206.225.95.255

OrgName: Abacus America Inc.
OrgID: ABAC
NetRange: 216.55.128.0 - 216.55.191.255
Now you've been COMPLETELY BLOCKED so count THAT on your Abacus!
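For anyone whose firewall or script wants CIDR notation instead of a start and end address, here's a rough sketch of turning the two NetRange entries above into networks and testing an address against them. It assumes Python 3 and its standard ipaddress module; the names NET_RANGES, ranges_to_networks and is_blocked are made up for illustration, not part of any real tool.

```python
import ipaddress

# NetRange values copied from the whois output above.
NET_RANGES = [
    ("206.225.80.0", "206.225.95.255"),
    ("216.55.128.0", "216.55.191.255"),
]


def ranges_to_networks(ranges):
    """Collapse each (first, last) pair into the shortest list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.ip_address(first), ipaddress.ip_address(last)
            )
        )
    return networks


BLOCKED = ranges_to_networks(NET_RANGES)


def is_blocked(ip):
    """True if the dotted-quad string falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)


if __name__ == "__main__":
    for net in BLOCKED:
        print(net)
    print(is_blocked("206.225.82.155"))
```

Both ranges happen to collapse to a single network each, so the block list stays short.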

More Evolving Scrapers

Like I've been reporting, they're all going stealth.

I keep seeing the user agent change from this:

62.163.33.234 "Java/1.4.1_04"
To this:
62.163.33.234 "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
Soon the usual blocking methods won't work whatsoever.

Wake up and smell the COPY before it's too late!
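One way to catch this kind of costume change is to remember every user agent each IP has shown up with, and flag any IP that has used both a tool-style agent and something else. The sketch below is only an illustration under assumptions: the TOOL_UA pattern list is a hypothetical starting point, and AgentTracker is a made-up helper, not anything I actually run.

```python
import re
from collections import defaultdict

# Hypothetical list of tool-style user agent prefixes; extend to taste.
TOOL_UA = re.compile(r"^(Java/|PEAR HTTP_Request|libwww-perl|Python-urllib)", re.I)


class AgentTracker:
    """Remembers which user agents each IP address has presented."""

    def __init__(self):
        self.seen = defaultdict(set)

    def record(self, ip, user_agent):
        self.seen[ip].add(user_agent)

    def is_suspect(self, ip):
        """Suspect: more than one agent seen, and at least one is tool-style."""
        agents = self.seen.get(ip, set())
        return len(agents) > 1 and any(TOOL_UA.match(a) for a in agents)


if __name__ == "__main__":
    tracker = AgentTracker()
    tracker.record("62.163.33.234", "Java/1.4.1_04")
    tracker.record("62.163.33.234", "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)")
    print(tracker.is_suspect("62.163.33.234"))
```

This only helps if the bot was sloppy enough to show its real agent at least once. An IP that starts out stealth sails right through.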

Block the Bots Tonight

Time for a little lunacy break for people feeling blue battling the bad bots.

Sing along boys and girls...

Sung to the tune of "Rock Around the Clock"
with apologies to Bill Haley and the Comets.

One, two, three bots, four bots, blocked.
Five, six, seven bots, eight bots, blocked,
Nine, ten, eleven bots, twelve bots, blocked,
We're gonna block all the bots tonight.

Put your firewall on and lock 'em out,
We'll have some fun when they scream and shout,
We're gonna block all the bots tonight,
We're gonna block, block, block, their scraping blight.
We're gonna block, gonna block, all the bots tonight.

When the block strikes two, three and four,
If the scrapers slow down we'll yell for more,
We're gonna block all the bots tonight,
We're gonna block, block, block, their scraping blight.
We're gonna block, gonna block, all the bots tonight.

When the server dings five, six and seven,
We'll be right in bot blocker heaven.
We're gonna block all the bots tonight,
We're gonna block, block, block, their scraping blight.
We're gonna block, gonna block, all the bots tonight.

When it's eight, nine, ten, eleven too,
I'll be blocking bots and so will you.
We're gonna block all the bots tonight,
We're gonna block, block, block, their scraping blight.
We're gonna block, gonna block, all the bots tonight.

When the counts hit twelve, we'll laugh and yell,
As a dozen bad bots have just gone to hell!
We're gonna block all the bots tonight,
We're gonna block, block, block, their scraping blight.
We're gonna block, gonna block, all the bots tonight.

University of Toronto Goes Bat Shit for VPI

Something coming from the University of Toronto keeps making periodic pitstops at my server and only requests _vpi.xml, like I give a shit about this file.

142.150.4.114 [kahuna.erin.utoronto.ca.] "Firebat 2.5.22" "/_vpi.xml"
Looks like a bunch of bullshit to me. I tried to weed through the ramblings about the Jabber groupchat protocol, but I've never had anything remotely related on my server, which brings up the million dollar question: why is this little fucker looking for it?

Dunno what the motives are, but they didn't get far. Back to class, asshole.

Tuesday, September 05, 2006

Firefox Memory Leaks

Leaving Firefox 1.5 up and running for days without ever closing it always seems to eventually cause issues, like the swap drive running non-stop or something.

Anyway, I decided to keep the Windows Task Manager up and running all the time so I can monitor Firefox performance, and it appears there are some serious memory leaks, plus issues where closing or stopping a download may not stop the thread reading the data in the background.

One easily reproduced problem involves stopping a very large page while it's downloading, we're talking thousands and thousands of lines of text: it appears to keep loading into memory even after it's no longer visible, pushing the memory footprint up to 200MB+ with only a couple of tabs open.

Sure hope they do some better testing on the 2.0 code, as I may switch back to IE 7 if it's substantially better. One thing Microsoft does know how to do is keep their code from leaking memory and leaving zombie threads running in the background.

Sunday, September 03, 2006

Scrapers4U.de

Today I noticed another hit from this same server farm in Germany with something pretending to be a Windows browser:

62.75.218.82 [elbe016.server4you.de.] requested 16 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x4.90)"
So I checked my archives and sure enough it's been here a time or two before attempting to get inside, and there were some hits from other associated IPs in their range.

Who hosts this mess appears to be intergenia.de:
inetnum: 62.75.128.0 - 62.75.255.255
org: ORG-iGCK1-RIPE
netname: DE-INTERGENIA-20010727
descr: intergenia AG
Which also owns plusserver.de, server4you.de, server4you.com, netfabrik.de, and some end user services whose IPs may be a part of intergenia.de's range, no clue.

The plusserver.de, server4you.de and netfabrik.de all appear to use this range:
inetnum: 217.172.167.0 - 217.172.169.255
netname: PLUSSERVER-1
descr: PlusServer - Dedicated Premium Serverhosting
descr: http://www.plusserver.de
The server4you.com seems to have this block:
OrgName: Server4You Inc.
NetRange: 69.64.32.0 - 69.64.63.255
Comment: http://www.server4you.com
Which means the crawler that started this search still can't be pinned down to a specific hosting block for server4you, other than that the reverse DNS claims it's server4you.de. I poked around doing a few nslookups in that range and they return either static-ip-62-75-*-*.inaddr.intergenia.de or someserver.server4you.de, so I'm a little hesitant just to block the whole intergenia.de range.

So it looks like I'll block the obvious hosting ranges by IP and server4you.de by reverse DNS for now.
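Blocking by reverse DNS could look something like the sketch below. It's a minimal illustration, assuming Python 3 and the standard socket module; BLOCKED_SUFFIXES, reverse_dns and is_blocked_host are hypothetical names, and the lookup is passed in as a parameter so the matching logic can be tried without touching the network.

```python
import socket

# Hostname suffixes to refuse; only server4you.de for now, per the post.
BLOCKED_SUFFIXES = (".server4you.de",)


def reverse_dns(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return None
    return host


def is_blocked_host(ip, lookup=reverse_dns):
    """True if the reverse DNS name of ip ends in a blocked suffix."""
    host = lookup(ip)
    if not host:
        return False
    # whois and log output often show a trailing dot; normalize before matching.
    host = host.rstrip(".").lower()
    return any(
        host == suffix.lstrip(".") or host.endswith(suffix)
        for suffix in BLOCKED_SUFFIXES
    )


if __name__ == "__main__":
    print(is_blocked_host("62.75.218.82"))
```

Matching on the leading dot keeps a lookalike such as notserver4you.de from being caught by accident. Keep in mind a PTR record is controlled by whoever owns the IP block, so this is a convenience filter, not proof of anything.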

Bots from ServerDeli at Mediopia

Something came crawling from ServerDeli hosted at Mediopia, and it was the typical bot with an invalid user agent (notice the space between "compatible" and ";") that never asks for robots.txt, just pages.

Here's the crawler info:

209.125.47.35 [win1.serverdeli.com.] requested 26 pages as "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
Sorry, but my site isn't a deli snack for whatever bullshit you're running.
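That stray space is easy to test for. Here's a minimal sketch, assuming Python 3; the BOGUS_COMPATIBLE pattern and looks_forged name are just illustrations, and it only catches this one tell, nothing more.

```python
import re

# A genuine MSIE user agent reads "(compatible; MSIE ..." with no space
# before the semicolon. The scraper strings in these posts have "compatible ;".
BOGUS_COMPATIBLE = re.compile(r"\(compatible\s+;")


def looks_forged(user_agent):
    """True if the user agent has whitespace between 'compatible' and ';'."""
    return bool(BOGUS_COMPATIBLE.search(user_agent))


if __name__ == "__main__":
    print(looks_forged("Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"))
    print(looks_forged("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

The day the bot author fixes the typo this check goes blind, so treat it as one signal among several.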

It always gives me pause when I see a webhosting company using HOTMAIL addresses for their contact information:
OrgName: MEDIOPIA TECHNOLOGIES (IMA'D W/ 69998
OrgID: MTIW6
Address: 9507 34TH AVE
City: JACKSON HEIGHTS
StateProv: NY
PostalCode: 11372
Country: US

NetRange: 209.125.47.0 - 209.125.47.255
CIDR: 209.125.47.0/24
NetName: ATWORK-65024-55156
NetHandle: NET-209-125-47-0-1
Parent: NET-209-125-0-0-1
NetType: Reassigned
Comment:
RegDate: 2005-05-02
Updated: 2005-05-02

OrgTechHandle: ACH48-ARIN
OrgTechName: CHICO, ALFREDO
OrgTechPhone: +1-718-476-0313
OrgTechEmail: MYMEDIOPIA@hotmail.com
So it looks like blocking 209.125.47.* wouldn't hurt anything.

Core-Project Hijacks an IP

Saw these idiots again today looking for FrontPage on my server:

207.226.161.69 - "POST /_vti_bin/_vti_aut/author.dll HTTP/1.1" 404 1176 "-" "core-project/1.0"
207.226.161.69 - "HEAD / HTTP/1.0" 200 - "-" "-"
207.226.161.69 - "POST /_vti_bin/_vti_aut/author.dll HTTP/1.1" 404 1176 "-" "core-project/1.0"
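Probes like the ones above are simple to tally per IP. The sketch below assumes Python 3 and the trimmed-down log format shown in this post (IP, dash, quoted request, status), not a full Apache combined log; LOG_LINE and frontpage_probes are hypothetical names for illustration.

```python
import re
from collections import Counter

# Matches the abbreviated log lines quoted above:
#   IP - "METHOD path HTTP/x.y" status ...
# The whitespace around the dash is optional.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+)\s*-\s*"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def frontpage_probes(lines):
    """Count requests per IP whose path starts with /_vti_bin/."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").lower().startswith("/_vti_bin/"):
            hits[match.group("ip")] += 1
    return hits


if __name__ == "__main__":
    sample = [
        '207.226.161.69 - "POST /_vti_bin/_vti_aut/author.dll HTTP/1.1" 404 1176 "-" "core-project/1.0"',
        '207.226.161.69 - "HEAD / HTTP/1.0" 200 - "-" "-"',
    ]
    print(frontpage_probes(sample))
```

On a server that has never had FrontPage installed, any hit on /_vti_bin/ is a probe by definition, so even a count of one is worth a look.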
The IP appears to be dedicated to a single customer hosted on Rackco.com:
cigar-review.com
cigarreview.com
Sadly, Rackco has shared and dedicated hosting so I was unable to easily pin down if this was a compromised server or some little script monkey running in a different account on a shared server.

I guess the only thing that amuses me is how some random script in shared hosting, if that is indeed the case, would crawl out using a different IP than the server default.

Traceroute has a few clues:
ge6-14.colo02.ash01.pccwbtn.net (206.223.115.48)
ge13-1.br01.ash01.pccwbtn.net (63.218.44.125)
209-8-237-222.rackco.net (209.8.161.222)
mike.rackco.com (209.8.238.194)
cigar-review.com (207.226.161.69)
Still nothing pointing out more than one IP to block.

Ah well, either way, I can't seem to narrow down the IP range assigned to Rackco because rwhois.cais.net isn't responding and ARIN just shows the major block assigned to PCCW, formerly "Beyond The Network America, Inc."

We'll keep an eye on this one.