Saturday, July 01, 2006

Return of the Jeteye

Beyond the obvious Star Wars reference in the title of this post, the Force is not strong with this crawler. This Jeteye crawler is bucking for the Empire's buggy software award today as it came and asked for robots.txt a whopping 11 consecutive times, and THEN asked for 11 other pages.
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
- "GET /robots.txt HTTP/1.1" "jeteyebot/0.1;"
Can anyone really write software that bad after grade school?

I weep for the future.
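
For anyone who wants to catch this kind of robots.txt loop in their own logs, here's a minimal sketch in Python. It assumes you've already pulled out the request paths for one crawler, in the order they arrived; the function name is made up for illustration and isn't part of any tool I run.

```python
from itertools import groupby


def longest_robots_run(paths):
    """Return the longest streak of back-to-back /robots.txt requests.

    `paths` is the ordered list of request paths from a single crawler
    (a hypothetical, already-parsed view of the access log).
    """
    longest = 0
    # groupby splits the sequence into runs of robots.txt / not robots.txt
    for is_robots, run in groupby(paths, key=lambda p: p == "/robots.txt"):
        if is_robots:
            longest = max(longest, sum(1 for _ in run))
    return longest
```

Feeding it `["/robots.txt"] * 11 + ["/page1.html"]` returns 11. A sane crawler should score 1.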

Investor Relations Group Running Email Harvester

Things just get stranger and stranger, but MG Investor Relations appears to be running a known spam harvester program.
Who knows, maybe the server is hacked and a spammer is running this from a subdomain without the owners knowing about it.

Who cares.


First Look - Bladder Fusion Leaks Onto Web

Some Comcast customer must be pissing their pants laughing after changing the name of their user agent to bladder fusion.
"bladder fusion 1.4.0"

No other activity has been recorded from that address so we don't know what this little pisser is all about.

Let me know if you see this one tinkling on your website.

Another Nutch Batch Hatched

Here's another new batch of Nutch crawlers bugging the shit out of my website today.
"NutchCVS/0.7.2 (Nutch;;"
"NutchCVS/0.7.2 (Nutch;;"
"NutchCVS/0.7.2 (Nutch;;"

This is getting old. It's like one of those biblical plagues that just comes in overwhelming numbers: locusts, frogs and now Nutch.

Wednesday, June 28, 2006

SCRAPER BUSTED #5 - Site is so bad McAfee SiteAdvisor blocked the page load!

Well, I knew some scrapers were bad but this one takes the cake.
User Agent: lwp-trivial/1.41
The scraping was tracked to a page on

When I went to open the page in Internet Explorer I got this error instead. may cause a breach of browser security.

We blocked your browser from visiting this site.

In our tests, attempted to make unauthorized changes to our test PC by exploiting a browser security vulnerability. This is a serious security threat which could lead to an infection of your PC.
The proud owner of this lovely site is:

Registrant Name:Sid Wongvorakul
Registrant Street1:979 Rutland Dr
Registrant City:Memphis
Registrant State/Province:TN
Registrant Postal Code:78243
Registrant Country:US
The site's information is as follows: (
The site is hosted by this now blocked company:
Address: AccessIT - Hosting Services
Address: 75 Broad Street, Suite 1902
City: New York
StateProv: NY
PostalCode: 10004
Country: US

NetRange: -
The SCRAPING IP came from iWeb Technologies which I'll assume is hosting a proxy site that was used to scrape and is now also on my blocked list.

OrgName: Groupe iWeb Technologies inc.
OrgID: GIT-20
Address: 3185, rue Hochelaga
City: Montreal
StateProv: QC
PostalCode: H1W-1G4
Country: CA
NetRange: -
From Canada to Tennessee and finally landing in New York, my scraped data took a wild trip and ended with a McAfee warning, sheesh.

Nastiest thing I've run into so far, but I doubt it will be the worst.

HTTPanties on ThinkGeek

The ThinkGeek website has some amusing undies for sale called HTTPanties.

Currently they offer the following printed on your crotch covers:

  • "200 OK" if you feel frisky
  • "403 Forbidden" which is obvious
  • "411 Length Required" for size queens
  • "413 Requested Entity Too Large" which most of us will never see
We would like to suggest these additional HTTP offerings:
  • "101 Switching Protocols" for bisexuals
  • "300 Multiple Choices" for swingers
  • "301 Moved Permanently" for transsexuals
  • "304 Not Modified" for virgins
  • "307 Temporary Redirect" for having an affair
  • "400 Bad Request" for that time of the month
  • "405 Method Not Allowed" stamped on the ass if not into anal
  • "409 Conflict" when she's pissed
  • "416 Requested Range Not Satisfiable" when you didn't make her come
  • "417 Expectation Failed" when he couldn't get it up
And for ladies that don't like adult toys, vegetables, and other non-human phallic replacement parts:
  • 415 Unsupported Media Type
Last but not least, if she was all horny and went to bed without you, waiting while your dumb ass kept watching football or some shit, and she fell asleep don't be surprised to see this when you finally make it to bed:

  • 504 Gateway Timeout

User Agent: "Microsoft Internet Explorer"

After seeing this user agent coming from Raytheon my curiosity got the best of me, so I looked up any other occurrences of this exact user agent and got the list below. None of them asked for more than a page or two, the occurrences are very rare, and the requests originate from corporate proxies and dial-ups, so I don't see any pattern to this yet.

03/07/2006 "Microsoft Internet Explorer"
04/05/2006 "Microsoft Internet Explorer"
05/04/2006 "Microsoft Internet Explorer"
05/31/2006 "Microsoft Internet Explorer/4.40.426 (Windows 95)"
06/11/2006 "Microsoft Internet Explorer"
06/28/2006 "Microsoft Internet Explorer"
Just keeping an eye on this one as well.
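
If you want to run the same kind of lookup on your own logs, here's a rough sketch in Python. It assumes the user agent is the last double-quoted field on each log line (true for the common combined log format, but check yours), and the function name is just something I made up for the example.

```python
import re

# Assumption: the user agent is the final "quoted" field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def exact_agent_hits(log_lines, agent):
    """Return the log lines whose user agent is exactly `agent`."""
    hits = []
    for line in log_lines:
        match = UA_FIELD.search(line)
        if match and match.group(1) == agent:
            hits.append(line)
    return hits
```

Note that an exact comparison skips longer variants such as the "Microsoft Internet Explorer/4.40.426 (Windows 95)" entry; swapping `==` for `startswith` would pick those up too.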

Odd Raytheon Behavior

OK, not sure what this is about, but this is just WEIRD....

The IP came out of nowhere and asked for only these files which was very bizarre out of context.
- "GET /page1.html HTTP/1.0" "Microsoft Internet Explorer"
- "GET /favicon.ico HTTP/1.0" "Microsoft Internet Explorer"
- "GET /page2.html HTTP/1.0" "Microsoft Internet Explorer"
- "GET /favicon.ico HTTP/1.0" "Microsoft Internet Explorer"
Could it be something trying to test for bookmarks pages still existing and update favicon.ico for those pages?

If anyone has a clue, as usual, I'm curious.


FYI, someone at Raytheon really keeps an eye on the blogs as this was barely posted an hour before they read it. Not so unusual as many companies seem to be spying on anything said about them all the time and I see this constantly, just not so quick!


OrgName: Raytheon Company
Address: 141 Spring St
City: Lexington
StateProv: MA
PostalCode: 02421
Country: US

Why Block Proxy Sites

People ask why I block anonymous proxy servers and here's a prime example coming from our friends at ThePlanet.

Not only is someone using Firefox from this location but it appears they're cloaking information to get Google to crawl through the proxy as well which can result in hijacked pages.

06/23/2006 "Mozilla/5.0 (compatible; Googlebot/2.1; +"

06/25/2006 "Mozilla/5.0 (compatible; Googlebot/2.1; +"

06/25/2006 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060508 Firefox/"

06/27/2006 "Mozilla/5.0 (compatible; Googlebot/2.1; +"

06/28/2006 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060508 Firefox/"
That, in a nutshell, is why I block proxy servers.
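
The tell here is one address showing up as both Googlebot and a regular browser. A minimal Python sketch of that check follows, assuming you have (ip, user agent) pairs parsed out of your log already. The function name and the example addresses are hypothetical, and a plain substring test for "Googlebot" is a simplification.

```python
from collections import defaultdict


def mixed_identity_ips(hits):
    """Return IPs seen with both a Googlebot and a non-Googlebot user agent.

    `hits` is an iterable of (ip, user_agent) pairs.
    """
    kinds_by_ip = defaultdict(set)
    for ip, agent in hits:
        kind = "crawler" if "Googlebot" in agent else "browser"
        kinds_by_ip[ip].add(kind)
    # An address wearing both hats is a likely proxy
    return sorted(ip for ip, kinds in kinds_by_ip.items() if len(kinds) == 2)
```

Anything this returns deserves a closer look before Google indexes your pages under someone else's URL.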

First Look - Retrieve_Title

It's hit my site twice now asking for a whopping 1 page each time, and I can't seem to find out anything else about it except it seems to be coming from Florida using a service called Neosmart.
"Retrieve_Title"
Doesn't seem to be a big threat, as the name of the bot implies it might be an automated link checker or something, but we've got our eye on it just the same.

Any information on this one is kindly appreciated.

Tuesday, June 27, 2006

First Look - Pussycat from India

No clue what this critter is up to as it only attempted 2 page requests and went away.
"PussyCat 1.0, Murzillo compatible"
Anyone else spotted this "Murzillo" compatible cat?

Update -

Definitely a spammer as it's trying to hit my submit page but the idiot is using a GET instead of a POST so it would always be rejected, not to mention I don't allow his user agent to get onto the site in the first place.

This slimeball tried to hit again from a new location and this time I noticed it's coming via a proxy server:
"PussyCat 1.0, Murzillo compatible"
Forwarded IP ->
So I looked up the original hit and it was also via proxy server:
"PussyCat 1.0, Murzillo compatible"
Forwarded IP ->
Looks like the spammer is trying to cover his tracks routing thru various proxy servers but he's stupid and keeps using the same user agent which is so easily blocked.

Nobody claimed spammers were smart and this one just proves it.
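
Both of the rejections described above are easy to sketch. Here's a minimal Python version, assuming a hypothetical "/submit" path and a made-up blocklist. The status codes (403 and 405) are my choice for the example and not necessarily what the site really returns.

```python
# Hypothetical values for illustration only
BLOCKED_AGENT_SUBSTRINGS = ("pussycat",)
SUBMIT_PATH = "/submit"


def decide(method, path, user_agent):
    """Return the HTTP status a request would get under these two rules."""
    agent = user_agent.lower()
    # Rule 1: blocked user agents never get onto the site at all
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return 403
    # Rule 2: the submit page only accepts POST
    if path == SUBMIT_PATH and method != "POST":
        return 405
    return 200
```

Switching proxies changes the IP but not the user agent, so rule 1 keeps catching him no matter where he routes from.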


Don't know what this is all about, but it smells like spam of some sort.

Y!TunnelPro is a plug-in for Yahoo Instant Messenger but this morning it showed up as a user agent from 3 different networks in 18 seconds. Never saw it before, ever, and it hasn't come back since.

07:49:00 "Y!TunnelPro"
07:49:04 "Y!TunnelPro"
07:49:17 "Y!TunnelPro"
Your guess is as good as mine, anyone else see this garbage?
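
A burst like this, one user agent from several networks within seconds, can be flagged automatically. Here's a minimal sketch in Python, assuming hits are already parsed into (timestamp, source, user agent) tuples. The 30-second window, the 3-source threshold and the function name are all arbitrary choices for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def burst_agents(hits, window=timedelta(seconds=30), min_sources=3):
    """Return user agents seen from `min_sources` distinct sources in `window`.

    `hits` is an iterable of (timestamp, source, user_agent) tuples, where
    `source` is whatever identifies the network (IP, netblock, etc.).
    """
    by_agent = defaultdict(list)
    for when, source, agent in hits:
        by_agent[agent].append((when, source))

    flagged = set()
    for agent, events in by_agent.items():
        events.sort()
        # Try every hit as the start of a window and count distinct sources
        for i, (start, _) in enumerate(events):
            sources = {src for when, src in events[i:] if when - start <= window}
            if len(sources) >= min_sources:
                flagged.add(agent)
                break
    return flagged
```

The scan is quadratic per user agent, which is fine for a day's worth of oddball agents but not something to run over a huge log without tightening it up.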

Sunday, June 25, 2006

SCRAPER BUSTED #4 - Snoopy Porn Spammer

Here's another scraper that's very prolific that can no longer hide:
"Snoopy v1.2"
The IP address of the scraping comes from the US:
OrgName: InterCage, Inc.
OrgID: INTER-359
Address: 1955 Monument Blvd.
Address: #236
City: Concord
StateProv: CA
PostalCode: 94520
Country: US

NetRange: -
NetHandle: NET-69-50-160-0-1
Parent: NET-69-0-0-0-0
NetType: Direct Allocation
RegDate: 2003-06-04
Updated: 2005-09-01
All of the sites are in the Ukraine, go figure.

The scrapings are cloaked and now they use NOARCHIVE after getting whacked in Google not too long ago.

I wonder who reported them for cloaking, hmmmm....


I unlinked the old Google searches as they are obsolete.

So far in 2008 I only see Snoopy here: "Snoopy v0.94" DOMINET (NET-66-128-60-0-1) -

SCRAPER BUSTED #3 - London Scrapings on Latvian Websites

Some subdomain on a Latvian website finally coughed up an IP address so I could link the scraping with the scraper. The IP address that did the scraping used a blank user agent and appears to come from the UK.

I can't really tell what actually is since I can't read it, and it has a TON of subdomains, but a few of those are definitely occupied by a scraper in This is an oddball scraper as all the pages are cloaked and redirect elsewhere so it's more difficult to get information about this one but not impossible.

Problem is, I'm not that bored to dig further, busted, blocked, banned, bye bye.

Webzone uses

Finally tracked down the to its spammy search engine owner called WebZone. ""
address is
Whois says:
Domain Name:
Domain servers in listed order:
Ties the scraper/crawler to the same hosting company, good enough for me.
The AdSense account on this scraper, um, search engine links to

Time to block,, blah.

SCRAPER BUSTED #2 - sunshineholidayrental

Tracked another scraping bastard that uses a blank user agent from the IP address of which appears to be a Lithuanian scraper/spammer with a dedicated server at

Here are a few of this asshole's active sites:
Some of them may look almost legit at first glance but trust me, there are tons of keyword and phrase stuffed bullshit pages hiding behind those seemingly innocent facades.

All the AdSense accounts claim to be owned by which currently seems to be a Joomla default site.

Blocked HostDime a while back, looks like it was a good idea.

Brinkster is hosting a referral spammer or is the spammer

This JavaScript referral spam keeps showing up in my logs, and since several of these were for Brinkster themselves, I figured they had something to do with it.

03/14/2006 "<script>'')</script>"

04/29/2006 "<script>'')</script>"

05/16/2006 "<script>'')</script>"

06/25/2006 "<script>'')</script>"
It's obvious these idiots wanted publicity and now they have it, and it's not good publicity.

Trying to get people to buy hosting by breaking a webstats page to auto-redirect is pretty foul.
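
This trick only works on stats pages that print the referrer into HTML without escaping it. Here's a minimal Python sketch of the defense, assuming a hypothetical stats page that renders each referrer in a table cell. The function name is invented for the example; `html.escape` is from the standard library.

```python
import html


def render_referrer_cell(referrer):
    """Return a table cell with the referrer shown as inert text."""
    # Escaping turns <, >, & and quotes into entities, so an injected
    # <script> tag is displayed on the page and never executed
    return "<td>" + html.escape(referrer, quote=True) + "</td>"
```

With that in place, the injected script just shows up in the report as ugly text for everyone to laugh at.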