Friday, July 28, 2006

SCRAPER BUSTED #7 - Yahoo Loves Scraping Porn Spammers

NSFW - Unfortunately I can't bust these scrapers completely as they got snared with an earlier version of my error pages that didn't contain the IP address to tie back to the source of the scraping. It's sad really, as I'd like to complete the loop with these particularly slimy vermin but maybe I'll get the chance next time if they keep up this crap.

Anyway, they do have my content buried in their sites and it's searchable only in Yahoo as neither Google nor MSN seem to have these slimeballs indexed, at least not with my content in the pages.

The various domains and subdomains that contain my scrapings, and we're talking a whole page, not a snippet, are here:
I didn't link to the actual page where the scrapings are as some innocent webmasters have been violated severely with this mess and I'd prefer not to give any link love to these pages to make the problem worse.

These bottom feeders seem to be registered mostly out of Ecuador and are hosted on ServerBeach, Webazilla, and a few other places but appear to be using HOSTPOINTS.NET for their DNS. I'm thinking any of the places listed in the U.S. will get served with a DMCA notice and since it's a dedicated box, and not a shared server, the ISP probably has little wiggle room when it comes to compliance other than shutting the server down.

Don't get me wrong, I don't have a problem with porn sites, people gotta spank it and someone provides them with the pictures to help keep their seeds out of the gene pool, it's a public service.

However, when these fucking porn spammers start scraping pages from innocent websites and then linking our pages to their festival of filth as a means to get traffic off our backs, then it's a problem, a real big fucking problem.

If the fine people at Yahoo want to drop me a line I'll be more than happy to convey the specifics privately so you can get this slime off your search.

SCRAPER BUSTED #6 - Keyword Stuffing Porn Scraper Nailed

NSFW - This scraper is one of those that uses a website blender to scramble content around, so it scrambled its own IP address too, but 3 parts of the IP were more than enough to lock onto this site scraper/spammer with a blank user agent.

The website I found still contained the IP scrambled as 69, 186, 189 and 50, not in any particular order.

However, this gave me a clue:

So it looks like that IP starts with "69.50."

The winning combination logged in my archive was which had a blank user agent.
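For anyone who wants to repeat the trick, here is a rough sketch of the unscrambling step: try every ordering of the four octets, keep the ones that fit the "69.50." clue, and look for those in the access log with a blank user agent. The function names and the (ip, user_agent) log shape are made up for illustration and are not the actual archive format.

```python
from itertools import permutations


def candidate_ips(octets, prefix=""):
    """Return every dotted-quad ordering of the octets that starts with prefix."""
    candidates = {".".join(p) for p in permutations(octets)}
    return sorted(ip for ip in candidates if ip.startswith(prefix))


def blank_agent_hits(candidates, log_entries):
    """Keep candidates that appear in the log with a blank user agent.

    log_entries is an iterable of (ip, user_agent) pairs; that shape is an
    assumption for this sketch, not the real archive format.
    """
    blank_ips = {ip for ip, agent in log_entries if agent.strip() in ("", "-")}
    return [ip for ip in candidates if ip in blank_ips]


# The four scrambled octets from the scraped page, plus the "69.50." clue.
print(candidate_ips(["69", "186", "189", "50"], prefix="69.50."))
```

Four distinct octets give 24 possible orderings, and the prefix clue cuts that down to two, which is few enough to check against the log by hand.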

This mess claims to be hosted at which doesn't even have a fucking web page.
Probably the upstream colo running this mess, hard to say.

These sites are NSFW so beware, and disable javascript before poking around as there are some eye-opening keyword stuffed pages with links to massively keyword stuffed pages at the root of "" and the others, but you get redirected elsewhere if you have javascript turned on.
Nothing much was learned from WHOIS in this case as each domain is registered to someone different in various countries, and the only thing they seem to have in common is the DNS servers:
Name Server:NS1.2ESTDO.COM
Name Server:NS2.2ESTDO.COM
However, a little DNS digging turned up this pile of identical keyword stuffed shit sites on their servers:
The info and range for Intercage, if they still exist, that hosts this garbage is:
OrgName: InterCage, Inc.
OrgID: INTER-359
Address: 1955 Monument Blvd.
Address: #236
City: Concord
StateProv: CA
PostalCode: 94520
Country: US

NetRange: -
Did a quick check and there are at least 3 scrapers coming from that range: "" "" "Snoopy v1.2"
Where there's smoke, there's fire, block block block.

Thursday, July 27, 2006

Prolific Spammer Thwarted

Last month I wrote that Blog Spam Is Not A Problem, as I was easily able to write a quick filter that eliminated all automated spam from my site [not this site]. Since then I have only suffered through a handful of manual posts that I had to delete, which is a drop in the bucket compared to an automated flood.

Just out of curiosity, I checked my 'bounced post' log today to see who was the most prolific at their attempted spamming and some Ukrainian fucker from wins the prize.

All of his attempted posts start out as domains that aren't even active, sometimes not even registered, and once they become active are quickly disabled:

Account has been suspended

This account has been suspended because of abuse. Account user, please contact support for more information.
Most of the domains, with a few exceptions, claim to be registered to this asshole with what appears to be fake information:
Registrant Name:Rick Quickly
Registrant Organization:N/A
Registrant Street1:Lenina 120-345
Registrant Street2:
Registrant Street3:
Registrant City:Moscow
Registrant State/Province:0
Registrant Postal Code:0
Registrant Country:CN
Registrant Phone:+7.95345567
Registrant Phone Ext.:
Registrant FAX:
Registrant FAX Ext.:
Keep in mind, not a single one of these attempted spams got past my new filter, but this asshole is very determined and they come non-stop.

Here's a shitload of domains you can filter out:
For the naysayers that whine that blocking spam is hard, you can see for yourself it's not THAT hard and these idiots can be easily stopped.
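If you want to put a domain list like that to work, one simple approach is to pull the hostnames out of every link in a submitted post and bounce the post if any of them is on the list. This is only a sketch of that idea and not the filter described above, whose internals aren't shown here; the function names and the regex are assumptions.

```python
import re
from urllib.parse import urlsplit

# Loose match for http(s) links embedded in a post body.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_hosts(text):
    """Return the lowercase hostnames of every http(s) URL found in text."""
    hosts = set()
    for url in URL_RE.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, nothing to extract
        if host:
            hosts.add(host.lower().rstrip("."))
    return hosts


def is_spam(text, blocked_domains):
    """True if any linked host is a blocked domain or a subdomain of one."""
    for host in linked_hosts(text):
        for domain in blocked_domains:
            if host == domain or host.endswith("." + domain):
                return True
    return False
```

Matching on the registered domain plus its subdomains matters because spammers like to rotate throwaway subdomains under the same name.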

Now go whine some more while I bask in the comfort of automatic spam stopping.

Tuesday, July 25, 2006

Danger and AvantGo mobile devices passed the test

A couple of the mobile providers, Danger and AvantGo, made reasonable user agents, unlike the usual cesspool of mobile user agents, and they passed the bot blocker without slamming into a brick wall.

Here's their user agents:

Mozilla/5.0 (Danger hiptop 2.0; U; AvantGo 3.2)

Mozilla/4.0 (compatible; AvantGo 6.0; FreeBSD)
Wow, look at that, simple, straightforward, what a concept.

Will the real Google please stand up?

The same lame attempts to crawl as "google" always come from the same IPs in China, so I decided to take a look and see what else these brain damaged fart knockers were doing.

Here's the wannabe-Googlers from China: "google" "google" "google" "google" "google"
So I looked a little deeper to see what else they were calling themselves and got the following:

Activity in the 61.135.131.* block: google "" sohu agent
Activity in the 220.181.26.* block: "google" "sogou spider" "" "sohu agent" "" "sogou spider" "sogou spider" "" "sogou spider"
Looks like the same exact bullshit being run from two blocks of IPs so zap their ass and have a nice day.
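Zapping a whole block is easy to do in code. Here's a minimal sketch using Python's standard ipaddress module, with the two blocks above written as /24 networks; the is_blocked name is made up for the example, and where you call it (server config, application code, firewall script) is up to you.

```python
import ipaddress

# The two blocks named in the post, written as /24 networks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("61.135.131.0/24"),
    ipaddress.ip_network("220.181.26.0/24"),
]


def is_blocked(ip_text, networks=BLOCKED_NETWORKS):
    """True if the address falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not a parseable address; let another rule decide
    return any(address in network for network in networks)
```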

GoDaddy Suspended DesertWalls

A couple of weeks ago I reported 1&1 Web Host Goes Spamming and now GoDaddy appears to have suspended the source of the spam which was

Total solutions
4520 Ficus Tree Road
Boynton beach, Florida 33436
United States

Registered through:, Inc. (
Created on: 10-Jul-06
Expires on: 11-Jul-07
Last Updated on: 10-Jul-06

Administrative Contact:
Cohen, Bruce
Total solutions
4520 Ficus Tree Road
Boynton beach, Florida 33436
United States
(561) 734-8122

Domain servers in listed order:

Not that it really solves any problems as the spammers will just factor the $10 domain names into the cost of doing business but it's nice to see them being punished just a little.

Monday, July 24, 2006

LinkTiger Declawed

Another fucking link checker, there are so many I barely mention them anymore but something called LinkTiger had good headline potential.


Here's da shit: - "Mozilla/5.0 (compatible; linktiger/1.0; +
Didn't ask for robots.txt, some do, some don't, and is hosted on the dedicated server jungle at which is another worthwhile place to block:
OrgName: iPowerWeb, Inc.
NetRange: -
Sorry LinkTiger, my Serengeti is off limits to you.

Taggers Target Their Ass With My Foot

Some bullshit crawler called " bot" from the Land Down Under looks at robots.txt with the lame ass default user agent of the Python library before exposing their actual user agent.
- "GET /robots.txt" "Python-urllib/1.16"
- "GET /" "" " bot"
Reverse DNS of the IP resolves to which redirects to which is a bunch of Digg wannabe mother fuckers and the bot claims it's crawling for and it all looks related, boring as hell, who gives a fuck.

Go fuck a kangaroo and keep my server out of your mental masturbation clusterfuck operation as I don't want to be digged, dugg or tagged, just piss off.

Ah, I feel better now.

Sunday, July 23, 2006

Block Browsezilla malware

This so-called browser named Browsezilla used to get onto my site with the old UA:

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Browsezilla; .NET CLR 1.1.4322)"
Then those obnoxious assholes stepped it up a level by inserting a goddamn HTML hyperlink into the UA which my bot blocker stops instantly thinking it's referrer spam.
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; <a href=>Browsezilla </a>)
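A check for this kind of junk doesn't need to be fancy. The sketch below flags any user agent that carries an HTML tag or an href attribute; it's an assumed approximation of the idea, not the actual rule in the bot blocker, and the has_markup name is hypothetical.

```python
import re

# Markup has no business in a user agent: look for tags or an href attribute.
MARKUP_RE = re.compile(r"<\s*/?\s*[a-z][^>]*>|href\s*=", re.IGNORECASE)


def has_markup(user_agent):
    """True if the user agent string contains HTML-looking markup."""
    return MARKUP_RE.search(user_agent) is not None
```

Run against the two strings above, the old Browsezilla UA passes clean and the one with the embedded hyperlink gets flagged.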
Almost felt bad for the assholes using this shit browser being blocked and almost unblocked it for them, but decided to do a little research first and HOLY SHIT this thing appears to be evil.

Read this InternetNews alert and this Panda alert and this detailed post by Oliver on a NotePet and you'll probably agree it's a good idea all around just to block Browsezilla so their users will abandon this fucking malware before it's an epidemic.

I'm thinking it's best left blocked and possibly redirected to the InternetNews article to scare the shit out of people using the damn thing.


Nothing pisses me off more than some company like EmeralShield sending a bot that masks who they are when they request robots.txt files and then proceeds to crawl with the actual user agent name.

Look at this shit:
- "GET /robots.txt" "-" "-"
- "GET /" " Web Spider ("
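Spotting this bait-and-switch in a log can be automated. Here's a rough sketch that remembers which user agent each IP used for robots.txt and reports the first time the same IP shows up with a different one. The (ip, path, user_agent) tuple format and the function name are assumptions for the example, and since shared proxies can put several clients behind one IP, a hit is a reason to look closer rather than proof.

```python
def mismatched_agents(requests):
    """Yield (ip, robots_agent, crawl_agent) for IPs whose agents differ.

    requests is an iterable of (ip, path, user_agent) tuples in log order.
    Each IP is reported at most once.
    """
    robots_agent = {}
    reported = set()
    for ip, path, agent in requests:
        if path == "/robots.txt":
            # Remember the agent this IP used when asking for robots.txt.
            robots_agent[ip] = agent
        elif ip in robots_agent and agent != robots_agent[ip] and ip not in reported:
            reported.add(ip)
            yield ip, robots_agent[ip], agent
```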
I was curious who these dumb fucks were so I checked and found a thread on WebmasterWorld and then some bigger horseshit in their forum.
We also use the webbot with our web filter service. Customers visit sites that we don't know about and we use the webbot to go and dig the site. In this case we are looking primarily to filter porn for our customers. The site pages that are downloaded are fed into a scan engine that attempts to determine if the site is objectionable or not.
Well dig this, your customers can grow the fuck up and be adults about the 'net as you aren't digging my fucking website as I have too many little piss ants like your crawler all trying to get a piece of my website so you get ... NOTHING! NOT A SINGLE FUCKING PAGE!

Dig that?

I bet not.

Crayon Crawler Outside the Lines

I can't determine if the supposedly kid friendly Crayon Crawler is a crawler or the browser, their site is down in flames at the moment, but anyone stupid enough to put the word CRAWLER in the user agent deserves to get dumped anyway.

Probably a browser, who cares:
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; Crayon Crawler)"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Crayon Crawler; FunWebProducts; .NET CLR 1.1.4322; MSN 9.0;MSN 9.1; MSNbQ002; MSNmen-us; MSNcOTH; MPLUS)"
"Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; Crayon Crawler; SV1)"
Don't need any snotty faced zip-popping brats on my site anyway.


Finding Cellphone User Agents Rosetta Stone

I'm sure I could upgrade how I process user agents to accommodate all these mobile freaks that can't tear themselves away from the internet but I'm not sure I want to mess with it.

In order to lock out most of the bad bots I only allow user agents that start with "Mozilla/" or "Opera/" as the very first part of the string which seems to work real well.
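In code, that rule is about as small as a filter gets. This is a minimal sketch of the prefix test only, with a made-up function name; a real blocker would layer more checks on top.

```python
ALLOWED_PREFIXES = ("Mozilla/", "Opera/")


def passes_prefix_check(user_agent):
    """True only if the user agent string begins with an allowed prefix."""
    return user_agent.startswith(ALLOWED_PREFIXES)


# A few user agents seen on this blog, good and bad.
for ua in (
    "Mozilla/5.0 (Danger hiptop 2.0; U; AvantGo 3.2)",
    "Python-urllib/1.16",
    "Snoopy v1.2",
    "",
):
    print(repr(ua), passes_prefix_check(ua))
```

Because the test anchors on the very start of the string, a user agent that merely contains "Mozilla/" somewhere in the middle still gets bounced.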

Well, unfortunately the assholes that make cellphones don't seem to give a shit about fitting into an easily identifiable group of browsers and have a bazillion user agents.

Something like this doesn't even fit the mobile user agent definitions:

HTC-8100/1.2 Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; PPC; 240x320) UP.Link/
The only upside here is this one tells us its screen size in the UA which is useful to know.

Why can't all of these dickheads at least do ONE THING in the user agent that just screams out "THIS IS A WIRELESS DEVICE OR CELLPHONE" like prefixing them all with "WAP/" or something civilized like that instead of having to know all the goddamn vendors and part numbers?

The closest thing I came to finding a reasonably identifiable fingerprint for a mobile device was looking for "Profile/MIDP", "MMP/" or "Configuration/CLDC" which seem to be a few good checks for most things mobile.

Just look at examples of all this gibberish:
Nokia6600/1.0 (4.09.1) SymbianOS/7.0s Series60/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.0
Samsung-SPHA880 AU-MIC-A880/2.0 MMP/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.1
SANYO-S750/2.130 UP.Browser/ (GUI) MMP/2.0
BlackBerry8700/4.1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/102

SonyEricssonP900/R102 Profile/MIDP-2.0 Configuration/CLDC-1.0

MOT-C650/0B.D2.23R MIB/2.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0 (Google WAP Proxy/1.0

Vodafone/1.0/703SH/SHG001 Browser/UP.Browser/ Profile/MIDP-2.0 Configuration/CLDC-1.1 Ext-J-Profile/JSCL-1.2.2 Ext-V-Profile/VSCL-2.0.0

SCH-A950 UP.Browser/ (GUI) MMP/2.0

LGE-PM225/1.0 UP.Browser/ (GUI) MMP/2.0

SHARP-TQ-GX25/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.0 UP.Browser/ (GUI) MMP/2.0 UP.Link/

and on and on and on...
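The three-marker fingerprint mentioned above boils down to a substring check. Here's a small sketch of it; the looks_mobile name is made up, and the marker list is just the three strings from this post, not a complete catalog of mobile signatures.

```python
MOBILE_MARKERS = ("Profile/MIDP", "MMP/", "Configuration/CLDC")


def looks_mobile(user_agent):
    """True if the user agent carries any of the known mobile markers."""
    return any(marker in user_agent for marker in MOBILE_MARKERS)
```

Every sample in the list above trips at least one marker, while the HTC-8100 string quoted earlier contains none of them, which is exactly the kind of hole that makes this approach only a partial fix.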
Yeah, I read the spec on wireless user agents and it's just a big old fucking mess of gibberish that opens the door for scrapers to pretend they're cell phones with javascript disabled and scrape the fuck out of a website.

Well you could argue that wireless devices don't use too many pages so just limit their access and I'll counter that containment method with a shitload of anonymous proxies and/or a small fleet of $2/month hosting accounts.

The problem with blocking proxies is just about all of these freaking toys with browsers use proxy servers to convert web pages to a few lines of links and text so just blocking any old proxy they use will typically block them altogether.

It's a gaping hole that can barely be contained and my best strategy to date is only allowing these devices access via IPs that resolve to wireless service providers, which is sketchy at best.
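A reverse DNS gate along those lines might look something like the sketch below: look up the hostname for the IP and only let the device through if it ends in a known carrier domain. The carrier suffixes here are placeholders, the function names are invented, and this is not the actual code behind the blocker. Whoever controls the reverse zone can make a PTR record say anything, so treat a match as a hint and not as proof.

```python
import socket

# Hypothetical carrier domains; replace with the providers you actually trust.
CARRIER_SUFFIXES = (".carrier.example", ".wireless.example")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def is_carrier_ip(ip, suffixes=CARRIER_SUFFIXES, lookup=reverse_lookup):
    """True if the IP's reverse DNS name ends in a trusted carrier suffix."""
    host = lookup(ip)
    if not host:
        return False
    return host.lower().rstrip(".").endswith(suffixes)
```

The lookup parameter is there so the check can be exercised with canned data instead of live DNS.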

Why is this sketchy?

Someone can scrape over a 3G network at speeds of 400K-700K or better.

Not highly likely, but definitely possible and easily doable.

I'm getting annoyed as the tighter I make the noose, the more obvious its weaknesses become, thanks to the swiss cheese that is the internet and all the bungling engineers.

It's a prime example of that old technology law:
"If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization"

Roffle from CCBill

I keep getting hit by this bizarre user agent Roffle and there isn't much information at all that I can find about this on the net as it doesn't even appear in most of the user agent resource sites.

Here's what I know about it: "Roffle/l.ol(compatible; MSIE 6.0; Windows NT 5.0;"
In my case it always crawls from the same IP address which oddly enough is owned by CCBill LLC:

whois NET-64-38-240-0-1

OrgName: CCBill LLC
Address: 1501 W 17th St
City: Tempe
StateProv: AZ
PostalCode: 85281
Country: US

NetRange: -
So why in the hell does an internet billing system have a bot crawling the net?

Guess I could ask them, but that's not as much fun as figuring it out with clues on the net!