Saturday, June 24, 2006

SCRAPER BUSTED #1 - Vipse Corporation using InetURL

Well, perhaps it's not cool to out people but it's also not cool to try to steal my shit, scramble my stuff with other people's stuff so it looks like it's a paragraph written by a drunken monkey, slap AdSense on it and THEN have the balls to attribute that gibberish source as being my domain name.

So fuck it, here we go...

Mind you, the only real data they have is from before I put the bot blocker online, so as they either expand or update, the bot blocker is replacing what they have with my errors.

It actually took me a while to bust this scraper because their code kept chopping up the data I was feeding them, so it took a while to find a page with the scraper's IP, but finally I was able to locate who they were and review their activity in my scraper archive.
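
For anyone wondering how a scraped page can give up the scraper's IP, here is a minimal sketch of the general idea, not the actual bot blocker code. It assumes every page served gets stamped with the requester's address; the `served-to:` marker format and both function names are made up for illustration.

```python
import re

# Hypothetical marker format; any unique, greppable string would do.
MARKER_RE = re.compile(r"served-to:(\d{1,3}(?:\.\d{1,3}){3})")

def tag_page(body, client_ip):
    """Append a marker carrying the requester's IP to the page body."""
    return f"{body}\n<p>served-to:{client_ip}</p>"

def recover_ips(scraped_text):
    """Pull every stamped IP back out of text found on a scraper's site."""
    return MARKER_RE.findall(scraped_text)

page = tag_page("Some page content", "213.203.184.30")
print(recover_ips(page))
```

If the scraper chops the page up before republishing it, the marker may get cut in half, which would explain why it can take a while to find an intact one.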

This scraper's IP shows scraping from Italy:

213.203.184.30 "InetURL/1.0"
The scrapings from that IP address ended up on loghinuovi.net, 9-shopping.us, and some other places as this appears to be a full blown scrape and spam operation.

According to whois, this is our scraper:
Vipse Corporation
Ryan's Place
High Street
St Johns, Antigua WI PO Box 744
AG
A little bit of research shows this scraper has a ton of crap sites:

They are mostly NonSense™ sites (that's what I call gibberish AdSense sites) like cellulari.us, loghi.us, loghi-suonerie.us, suonerie.us, suonerie-loghi.us, and anzwers.us, for a short list. Some may not even work anymore, but they have buried landing pages with black on black text and all sorts of NonSense.

After hunting around, it appears the root AdSense account, according to "advertise on this site" from cellulari.us, is tied to www.categorico.com.

Doing a little more research, a whois on categorico.com shows this owner:
whois categorico.com

Noago Srl
Via Vittorio Veneto 25
Borgomanero, Italy Novara 28021
IT
Which explains the original scraping IP from Italy.

TA DA!

You can scrape me but you cannot hide.

Update...

The following aroma from Roma dropped in and translated this page:
Referring Link http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLG,GGLG:2006-23,GGLG:en&q=noago srl
Host Name host229-2.pool8250.interbusiness.it
IP Address 82.50.2.229
Country Italy
Region Piemonte
City Novara
Coincidence?

I think not...

Friday, June 23, 2006

Be trendy and ameliorate your blog!

Not to brag, but my vocabulary is pretty extensive, yet I had to pause today at a link request email from someone trying to "ameliorate our google positioning".

OK, in context it was obvious what the word meant but I'm scratching my head going "who the fuck uses that word?" as I read a lot and can't remember actually seeing that word in print.

For those of you that still haven't figured it out, ameliorate means improve, so perhaps the author of this email was trying to ameliorate my vocabulary, or just use it to catch my attention.

Heck, if it was just an attention grabbing ploy it sure as hell worked.

This could be a great word to start using in marketing:

Ameliorate, the new and improved word for improve!

I can see it now:

"NEW ULTRA TIDE! It's ameliorated!"


People seeing that on a box of Tide in the store would buy the fuck out of it just because it sounds like a technical chemical type word and they won't admit they don't know what the word means. They will just have to have it because god knows, if you wash your clothes with anything else it wouldn't be ameliorated now would it?

What a concept, let's start an ameliorated trend today!

Tuesday, June 20, 2006

Crawlers with Toolbars?

If you believe this user agent that came from China, the crawler has the Alexa Toolbar installed.

211.100.25.206 "Crawler Mozilla/4.0( compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; Alexa Toolbar)"
I'll assume everything after the word Crawler was just stuck in the user agent with hopes to get past the filters but ironically enough, the word Crawler is what set off the traps in the first place.
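
A keyword trap like that can be sketched in a few lines. This is only an illustration under assumptions: the post confirms that "Crawler" trips the trap, while the other words in `TRAP_WORDS` and the function name are hypothetical.

```python
import re

# Only "crawler" comes from the post; the other trap words are assumed.
TRAP_WORDS = ("crawler", "spider", "scraper")
TRAP_RE = re.compile("|".join(TRAP_WORDS), re.IGNORECASE)

def trips_trap(user_agent):
    """Return True if any trap word appears anywhere in the user agent."""
    return TRAP_RE.search(user_agent) is not None

ua = ("Crawler Mozilla/4.0( compatible; MSIE 6.0; Windows NT 5.1; "
      "SV1; Maxthon; Alexa Toolbar)")
print(trips_trap(ua))
```

Padding the string with browser tokens does nothing here, because the match looks at the whole user agent and not just the part that resembles a browser.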

Now let's have a loud "NEENER NEENER!" for whoever pulled that boner.

New Nutch from Germany

These Nutch things are like Energizer bunnies, they just keep coming and coming and coming...

87.139.106.60 "NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
It's like a bad zombie movie that never stops with all their arms outstretched as they all try to get you, one page at a time...

Monday, June 19, 2006

Yahoo Research

Caught this unidentified beast belonging to Yahoo trying to access pages on 4 different days:

209.73.174.15
research5.corp.scd.yahoo.com
"Wget/1.9+cvs-stable (Red Hat modified)"
Don't know what they were doing but it didn't work, they didn't get what they wanted ;)

Sunday, June 18, 2006

RedKernel is Backasswards

This isn't a new bot but what they did made me laugh out loud:

66.55.143.162 - "GET / HTTP/1.1" "RedKernel WWW-Spider 2/0 (+http://www-spider.redkernel-softwares.com/)"
66.55.143.162 - "GET /robots.txt HTTP/1.1" "RedKernel WWW-Spider 2/0 (+http://www-spider.redkernel-softwares.com/)"
66.55.143.162 - "GET / HTTP/1.1" "RedKernel WWW-Spider 2/0 (+http://www-spider.redkernel-softwares.com/)"
66.55.143.162 - "GET /robots.txt HTTP/1.1" "RedKernel WWW-Spider 2/0 (+http://www-spider.redkernel-softwares.com/)"
That's right, they asked for the home page and then robots.txt, in reverse order, twice in a row.

Why would you ask for the home page and then check to see if my site permits your bot to crawl?

Do you think you're just entitled to the home page regardless?
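
Spotting this kind of backwards behavior in a log is easy enough. Here is a rough sketch, assuming the log has already been parsed into (ip, path) pairs in arrival order; the function name and input shape are made up for the example.

```python
def fetched_before_robots(requests):
    """Return the set of IPs whose first request was not /robots.txt.

    `requests` is an iterable of (ip, path) tuples in arrival order.
    """
    first_path = {}
    for ip, path in requests:
        # Remember only the very first path seen for each IP.
        first_path.setdefault(ip, path)
    return {ip for ip, path in first_path.items() if path != "/robots.txt"}

log = [
    ("66.55.143.162", "/"),
    ("66.55.143.162", "/robots.txt"),
    ("66.55.143.162", "/"),
    ("66.55.143.162", "/robots.txt"),
]
print(fetched_before_robots(log))
```

Normal browsers never ask for robots.txt, so a check like this only makes sense for visitors that have already identified themselves as bots.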

How bizarre as their website doesn't even mention robots.txt:
REMOVE a website from our link directory:

Our www-spider (= a bot like googlebot) is an automated crawler/indexer.
So it works with meta names like all other bots.

You just have to insert in your meta names: (only in your index)
<meta name="RK_WWW_Spider" content="noindex">

Then just wait the next crawl of your website. The meta name will be detected and your site will not be indexed. It can takes few months before your site is crawled.
Meta tags?

What a joke, say it boys and girls "block! block! block!", now isn't that better?

Can't see the Layeredtech for the Forex

Another layeredtech.com hosted beast called Netforex crawled out of the woodwork today with HTML in the user agent, which is the most obnoxious shit ever. Nothing at their website yet except a default Apache page.
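
Catching HTML in a user agent takes one regular expression. A minimal sketch, assuming that anything shaped like a tag is grounds for a block; the pattern and function name are illustrative and not the actual bot blocker rule.

```python
import re

# Matches anything shaped like an HTML tag, e.g. <a href=...> or </a>.
HTML_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

def has_html(user_agent):
    """Return True if the user agent contains something like an HTML tag."""
    return HTML_TAG_RE.search(user_agent) is not None

ua = ("<a href='http://www.netforex.org'> Forex Trading Network "
      "Organization </a> (info@netforex.org)")
print(has_html(ua))
```

No legitimate browser or well-behaved crawler puts markup in its user agent, so a rule like this should rarely catch anything it shouldn't.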

It looked at robots, amazing, then got slapped when it hit the home page.

Isn't bot blocking automation wonderful?

72.232.204.58 - "GET /robots.txt HTTP/1.0" "<a href='http://www.netforex.org'> Forex Trading Network Organization </a> (info@netforex.org)"

72.232.204.58 - "GET / HTTP/1.0" "<a href='http://www.netforex.org'> Forex Trading Network Organization </a> info@netforex.org"
Just makes me positive that blocking the entire range of Layeredtech was also a good idea.
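
Blocking a whole hosting range comes down to a CIDR membership test. The sketch below uses Python's standard `ipaddress` module. The range `72.232.0.0/16` is a placeholder I picked so the example IP falls inside it; it is NOT Layeredtech's actual allocation, which you would need to look up in whois.

```python
import ipaddress

# ASSUMPTION: placeholder range for illustration only, not a real
# Layeredtech allocation. Look up the actual ranges before using this.
BLOCKED_RANGES = [ipaddress.ip_network("72.232.0.0/16")]

def is_blocked(ip):
    """Return True if the IP falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(is_blocked("72.232.204.58"))
```

The tradeoff with range blocking is that it also shuts out anything legitimate hosted in the same space, which is fine if nothing you want ever visits from there.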