Saturday, March 29, 2008

WHO is Scraping My Site!

Note the lack of a question mark in the title: this wasn't a question about "WHO?" but an actual statement about "WHO!", and by that I mean the WHO, as in an office of the World Health Organization.

My server registered 411 page requests from 203.94.76.59, which is a non-portable address assigned to the WHO Representative Office in Sri Lanka.

Here's the IP and UA:

203.94.76.59
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

Here's the WHOIS:

inetnum: 203.94.76.56 - 203.94.76.63
netname: WHO-SLT-LK
country: LK
descr: WHO Representative Office
descr: 385, Health Inform. Centre, Suwasiripaya, Deens Road, Colombo-10
admin-c: NS198-AP
tech-c: NS198-AP
status: ASSIGNED NON-PORTABLE
mnt-by: MNT-SLT-LK
source: APNIC

person: Network Administrator SLTNet
nic-hdl: NS198-AP
address: Sri Lanka
country: LK
mnt-by: MNT-SLT-LK
source: APNIC

It pretended to be a human browser, like so many of them do these days, by pulling all the images from the index page, and then it took off ripping pages like a bandit.

It wasn't even a smart bot, as the first link it hit off the index page was my bot trap, which is flagged in robots.txt as a no-crawl zone and easily avoided, so it definitely wasn't human.

Of course the robots.txt file is my other bot trap but what the hell.

Then it went screaming along asking for the next 409 pages at 2-3 pages a second.
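
For anyone wanting to catch this kind of thing automatically, here's a rough sketch of the two signals described above: a hit on the bot trap, and page requests arriving faster than a human would click. The trap path, the thresholds and the class name are all made up for illustration; the post doesn't say what my actual setup uses.

```python
from collections import defaultdict, deque

# Hypothetical values: the trap URL and thresholds are assumptions for this sketch.
TRAP_PATHS = {"/bot-trap/"}
MAX_HITS = 5           # page requests tolerated...
WINDOW_SECONDS = 3.0   # ...inside this sliding window


class ScraperDetector:
    """Flags a client that hits the trap or requests pages too quickly."""

    def __init__(self):
        self._recent = defaultdict(deque)  # ip -> timestamps of recent page requests
        self.flagged = {}                  # ip -> reason it was flagged

    def record(self, ip, path, timestamp):
        """Record one page request; return the reason if the IP is flagged, else None."""
        if ip in self.flagged:
            return self.flagged[ip]
        if path in TRAP_PATHS:
            self.flagged[ip] = "hit bot trap"
            return self.flagged[ip]
        hits = self._recent[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        if len(hits) > MAX_HITS:
            self.flagged[ip] = "too fast"
            return self.flagged[ip]
        return None
```

Feed it page requests only (not images), with timestamps in seconds. At 2-3 pages a second a client would trip the rate check within a few seconds, while someone clicking a link every couple of seconds never would.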

It would appear that WHO should check out the health of their computer network as something is rotten in their offices in Sri Lanka.

Friday, March 28, 2008

REBI-Shoveler Digging for Korean Search Engine

REBI-Shoveler must be easily overlooked as it's very unusual to go to a search engine and type in the user agent and get no authoritative hit from any bot hunter whatsoever. There were tons of hits from various web stat pages but nothing I could easily find that gave me any clue what in the hell this thing was.

With this little information, all I knew was that it came from Korea. Otherwise I was stumped:

116.122.36.150 "REBI-Shoveler v0.1"

Finally I decided to see if I could find any more clues in the several years of bot tracking archive files I keep and, sure enough, there was a single original hit on my server that contained the answer I was looking for:

116.122.36.48
"REBI-Shoveler/RS Ver. -100.0 (REBI's great worker ... ; http://rebi.co.kr; deisys@rebi.co.kr)"

This bot operates out of multiple IPs in the 116.122.36.* range. Here's a little translation for you from their site about REBI, but it makes no mention of robots.txt, and the bot didn't ask for the file when it visited my site today, so it's behaving badly.
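
If you want to run the same "did it even ask for robots.txt?" check on your own logs, here's a minimal sketch. It assumes you've already parsed your access log into (ip, user agent, path) tuples; the function name and that tuple layout are my invention for the example. It groups by user agent rather than IP, since a bot like this one spreads itself over several addresses.

```python
def agents_skipping_robots(requests):
    """Return the sorted user agents that never requested /robots.txt.

    `requests` is an iterable of (ip, user_agent, path) tuples, a
    hypothetical layout for already-parsed access log entries.
    """
    seen = set()      # every user agent that made any request
    fetched = set()   # user agents that asked for robots.txt at least once
    for _ip, user_agent, path in requests:
        seen.add(user_agent)
        # Ignore any query string when comparing the path.
        if path.split("?", 1)[0] == "/robots.txt":
            fetched.add(user_agent)
    return sorted(seen - fetched)
```

Grouping by user agent will lump together every visitor sharing a common browser string, so treat the result as a list of suspects to look at, not a verdict.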

Now you know who REBI is that's shoveling shit off of your server.

Enjoy.

We'll Have Anon Of That, John Doe Must Go

Looks like JonDonym, the internet anonymisation service, is actively operating, as those little anonymous hits are coming from their servers.

I have a couple of actual scrapes happening from their IPs. Who would suspect abuse of anon proxies, right?

Here's a couple of examples of activity:

141.76.45.34 [proxy1.anon-online.org.]
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

141.76.45.35 [proxy2.anon-online.org.]
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

Don't know what other IPs it operates from but 141.76.45.* and anything resolving to anon-online.org are blocked for now.
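
A block like that boils down to two checks: is the address inside 141.76.45.*, or does its reverse DNS name end in anon-online.org? Here's a rough sketch of that logic. The function and list names are made up for the example, and the hostname is passed in by the caller so the sketch doesn't do any DNS lookups itself.

```python
import ipaddress

# 141.76.45.* written as a /24 network, plus the domain named in the post.
BLOCKED_NETWORKS = [ipaddress.ip_network("141.76.45.0/24")]
BLOCKED_DOMAINS = ["anon-online.org"]


def is_blocked(ip, hostname=None):
    """Return True if the IP or its (optional) reverse DNS name is blocked."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    if hostname:
        # Reverse DNS names often carry a trailing dot; normalise before comparing.
        host = hostname.rstrip(".").lower()
        return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
    return False
```

Matching on "." plus the domain, rather than a bare suffix, keeps a lookalike such as notanon-online.org from being blocked by accident.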

Good luck with your John Doe anonymity while I work on my taxes as you've just been H&R Blocked!

With tax deadlines close at hand I couldn't resist ;)

Monday, March 24, 2008

Please Install Flash - Idiots Guide To Flash Web Stupidity

Time to rant about a big pet peeve of mine: that little line of JavaScript that detects whether or not Flash is installed, and the stupid shit developers do when it fails.

For a little introduction to the problem, I run Firefox with NoScript enabled globally for security purposes. I can easily enable JavaScript with a click, except some developers do some really stupid shit that's costing their clients visitors.

Here's a few brain dead examples of Flash sites done wrong in the hands of idiots:

1. When JavaScript is disabled, a blank page often results without even a hint. It looks broken, and visitors go away thinking you're stupid as dirt for putting up a blank page.

2. Redirecting visitors to a "Please Download Flash" page is just asinine. When visitors then enable JavaScript so your Flash will work, they're stuck on some other stupid page instead of where they wanted to go. Yup, frustrate your visitors and they'll just go elsewhere, where sites aren't developed by designers that rode the short yellow bus to VoTech.

3. Using the NOSCRIPT tag to incorrectly tell us we don't have Flash installed. That tag actually means we have JavaScript disabled, and you have no fucking clue if we have Flash installed or not until we turn on JavaScript, you fucking idiots. Tell us correctly to ENABLE JAVASCRIPT to run the site in your NOSCRIPT tag, and then let the JavaScript tell us we don't have Flash installed.

I'm sure I'll have some other addendums later but these are the top 3 offending things moronic Flash site developers do off the top of my head.

Anyone else got a pet Flash peeve?