Saturday, May 27, 2006

Scrapers Impacting Conversion Rates?

Was just reading on Threadwatch about the report on declining conversion rates for online stores and suddenly had an epiphany that scrapers may be involved in this equation.

Let's assume that these conversion rate facts and figures include many of the non-human stealth crawlers that I'm blocking on a daily basis. Your average online retailer probably isn't even aware of this situation, and you know they're being scraped just like the rest of us, maybe even scraped MORE than the rest of us, who knows.

Using one of my websites as an example: it averages 13,500 visitors a day, and the 50-200 stealth crawlers being blocked account for roughly 0.4% - 1.5% of that daily traffic, which would definitely impact the conversion rate for any store with similar numbers.
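To put rough numbers on it, here's a quick sketch of how counting bot hits as "visitors" drags down a measured conversion rate. The visitor and crawler figures are the ones from this post; the order count is made up purely for illustration:

```python
def adjusted_conversion_rate(raw_visits, orders, bot_hits):
    """Conversion rate after removing bot 'visitors' from the denominator."""
    human_visits = raw_visits - bot_hits
    return orders / human_visits

# Figures from this post: ~13,500 daily visits, 50-200 stealth crawlers blocked.
# 270 orders/day is a hypothetical number chosen to give a 2.0% raw rate.
raw_visits, orders = 13_500, 270
raw_rate = orders / raw_visits
low = adjusted_conversion_rate(raw_visits, orders, 50)
high = adjusted_conversion_rate(raw_visits, orders, 200)
print(f"raw {raw_rate:.2%}, adjusted {low:.2%} - {high:.2%}")
```

The effect per store is small, but across an industry-wide survey of stores it could easily nudge the aggregate numbers downward.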

Perhaps scrapers on a very large website getting a million visitors a day wouldn't have much impact unless the site attracts a lot more scrapers than my site. However, a smaller online retailer with similar traffic to the site I'm protecting would obviously notice a difference in their conversion rate, a HUGE difference, just by adjusting their stats to include pages downloaded by stealth crawlers.

Just another example of how scraping and stealth crawling is BAD FOR THE WEB and needs to be stopped.

ServePath Is Being Banned

Found a bunch of random stuff coming from a hosting company called ServePath today while running historical analysis on a batch of IPs.

Now these are the visible crawlers that came from ServePath:

PEAR HTTP_Request class ( )
"Jakarta Commons-HttpClient/3.0"
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
NutchCVS/0.7.1 (Nutch;;
Here's the whole range:
OrgName: ServePath, LLC
NetRange: -
I'm going to block the whole thing to find out whether any stealth crawlers that haven't tripped any alarms yet are operating out of that location, and see what happens.
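Blocking a whole provider boils down to a netblock check on every request. A minimal sketch using Python's standard `ipaddress` module, with a documentation-range CIDR as a stand-in since the actual ServePath range isn't reproduced above:

```python
import ipaddress

# Hypothetical stand-in netblock -- substitute the provider's real NetRange.
BANNED_NETS = [ipaddress.ip_network("198.51.100.0/24")]

def is_banned(ip: str) -> bool:
    """True if the client IP falls inside any banned provider range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BANNED_NETS)

print(is_banned("198.51.100.77"))   # inside the banned range
print(is_banned("203.0.113.5"))     # outside it
```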

Amazon's A9 Amateur Hour

Guess what boys and girls?

We've all been forced to OPT-IN to yet another non-standard "web tool" that Amazon's A9 has thrust upon us. A9's blog said they introduced this crap last July but it's obviously been so low key compared to everything else hitting my server that I overlooked this small slice of idiocy.

This has been showing up in my logs for a while now:

"GET /siteinfo.xml HTTP/1.1" 404 1639 "-" "Java/1.5.0_04"
The only reason I noticed it today was that the number of times it hit the server escalated, and they're racking up a bunch of 404 errors requesting a file I'd never heard of, which is just idiotic.

Ever hear of an internet standard called ROBOTS.TXT, which would tell you whether I even want you looking for this stupid file on my server?

Apparently not, as the only file being hit is "siteinfo.xml".

Had to resort to a reverse DNS lookup just to find out it was Amazon doing this stupid crap. Didn't the vaudeville programmers who wrote this joke ever hear of setting the USER AGENT to identify who and what this is, instead of Java/1.5?

Amazon, if you happen to read this, pay very close attention: many web applications bombard my server daily with a user agent of "Java/1.whatever", and they are all BLOCKED, so you will never get access to siteinfo.xml until you properly identify yourself.
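A filter for that default Java user agent can be as simple as a prefix match. This is a minimal sketch, not my actual blocker, and the well-behaved agent in the second example is hypothetical:

```python
import re

# Matches the default Java HTTP client user agent, e.g. "Java/1.5.0_04".
JAVA_UA = re.compile(r"^Java/1\.\d")

def blocked_by_user_agent(ua: str) -> bool:
    """True when the request should be denied for using a default Java agent."""
    return bool(JAVA_UA.match(ua))

print(blocked_by_user_agent("Java/1.5.0_04"))        # blocked
print(blocked_by_user_agent("ExampleBot/1.0 (+https://example.com/bot)"))  # hypothetical, allowed
```

Point being: a properly named user agent sails right past this rule; the lazy default gets the door slammed.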

Here's a sample "siteinfo.xml" file that you can install in your root web directory:
<?xml version="1.0"?>
<siteinfo xmlns="">
  <name>Amazon SiteInfo Sucks</name>
  <text>Doesn't use standards</text>
  <text>Doesn't identify itself</text>
</siteinfo>
I commented on their blog about the lack of professionalism and standards in this implementation, but it's awaiting moderation and I doubt they'll let my less-than-happy comments be published, but we shall see.

Thursday, May 25, 2006

PlanetLabs Bombards Server - Abused or Compromised?

Well, here's a new one uncovered this week: a tipster wishing to remain anonymous sent me a very suspicious-looking log file snippet with a bunch of identical accesses from over 130 IP addresses over the span of a couple of hours.

After doing a little research, it looks like this "attack" came from a consortium of computers called PlanetLab, hosted at various universities and research institutions around the world, and this appears to be only a portion of the network that was aimed at our tipster's server. We don't know at this point whether this was an isolated demonstration of their network, whether it was being abused by a member, or whether a hacker has breached the protocol, but the potential for damage here is huge.

Their website claims the following stats:

PlanetLab currently consists of 668 machines, hosted by 325 sites, spanning over 25 countries. Most of the machines are hosted by research institutions, although some are located in co-location and routing centers (e.g., on Internet2's Abilene backbone). All of the machines are connected to the Internet. The goal is for PlanetLab to grow to 1,000 widely distributed nodes that peer with the majority of the Internet's regional and long-haul backbones.

Below are a sample of the log files, the IPs involved, and the reverse DNS of all the IPs, which is how we figured out this was probably PlanetLab. Other files were accessed as well, but browsers don't typically request robots.txt, so that alone was enough to suspect something was wrong with this situation and treat it as a potential attack.

If this was an actual PlanetLab project aimed at crawling the web undetected and aggregating tons of data, then it failed miserably. Now that we know who you are and where you are, our servers will be watching to see if you strike again.

If this was an unauthorized test then PlanetLab better beef up security as this network is one big DDoS attack just waiting to happen under control of the wrong person.

Here's a sample snippet of the log file:

- - [11/May/2006:08:45:18 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:45:18 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:45:18 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:45:59 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:46:03 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:46:03 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:46:38 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
- - [11/May/2006:08:46:39 -0400] "GET /robots.txt HTTP/1.1" 200 452 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc3 Firefox/1.0.6" "-"
Here's the complete list of IPs involved:
To make sense of this mess, I ran them all thru NSLOOKUP to see if any patterns emerged, and the common theme was .EDU and PLANETLAB all over the place.

Here's the reverse DNS on all the IPs for your viewing pleasure; the names that resolved included:

planetlab5.Millennium.Berkeley.EDU
planetlab6.Millennium.Berkeley.EDU
planetlab7.Millennium.Berkeley.EDU
planetlab8.Millennium.Berkeley.EDU
planetlab9.Millennium.Berkeley.EDU
planetlab10.Millennium.Berkeley.EDU
planetlab11.Millennium.Berkeley.EDU
planetlab14.Millennium.Berkeley.EDU
planetlab15.Millennium.Berkeley.EDU
planetlab16.Millennium.Berkeley.EDU
planetlab2.cs.Virginia.EDU
planetlab-2.EECS.CWRU.Edu
crt1.PLANETLAB.UMontreal.CA
crt3.PLANETLAB.UMontreal.CA
planetlab2.CS.UniBO.IT

A couple of the IPs had no reverse DNS at all:

** server can't find NXDOMAIN
** server can't find NXDOMAIN
Best we can tell, PlanetLab was definitely involved in this, and I'm very upset that an organization like this would aim a large section of its network at a single server, all at the same time, without permission.

This is abuse, pure and simple, without even proper user agent attribution, and I welcome them to come here and tell us what really happened.

While we're waiting on PlanetLab to respond, and I wouldn't hold my breath, I'm going to block the IPs listed above and probably ban anything with "planetlab" or "planet-lab" in the reverse DNS name until further notice.
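A reverse-DNS ban like that might look something like this sketch. Note that `gethostbyaddr` raises for hosts with no PTR record (like the NXDOMAIN entries above), so those fall through to other rules:

```python
import socket

BANNED_RDNS = ("planetlab", "planet-lab")

def name_is_banned(host: str) -> bool:
    """True if a reverse DNS name contains a banned substring."""
    return any(s in host.lower() for s in BANNED_RDNS)

def banned_by_rdns(ip: str) -> bool:
    """Look up the PTR record for an IP and apply the substring ban."""
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror, OSError):
        return False  # no reverse DNS -- let other rules decide
    return name_is_banned(host)

print(name_is_banned("planetlab11.Millennium.Berkeley.EDU"))  # banned
print(name_is_banned("crawl.example.com"))                    # hypothetical name, allowed
```

The substring check is deliberately case-insensitive since, as the list above shows, PlanetLab hostnames come in every capitalization imaginable.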

RED ALERT #4 - NiceBot Neighborhood

Found another distributed IP batch sitting in a hosting farm claiming to be "nicebot".

Nicebot my ass...

Here's the range of IPs spotted with user agent nicebot:

- nicebot
- nicebot
- nicebot
- nicebot
- nicebot
- nicebot
- nicebot
- nicebot
- nicebot
NSLOOKUP claims they belong to ServerPronto.

Non-authoritative answer: name =
So I think I'm going to just block this range from ServerPronto as it's a hosting farm:
Serverpronto INMM-69-60-114-0 (NET-69-60-114-0-1) -
Some of you might naively think you can just block "nicebot" with rewrite rules and solve the problem. However, my research has shown that many of these bots eventually change names once they get blocked by too many sites. You're best off blocking the source permanently so they don't slip thru the cracks next week crawling as something like "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; BTOW V9.0; SV1)", which you can't detect.
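The point in code: if you block by source network, the user agent string becomes irrelevant. This sketch assumes the ServerPronto 69.60.114.0 block mentioned above is a /24:

```python
import ipaddress

# The ServerPronto block from the whois above, assumed here to be a /24.
BANNED_NET = ipaddress.ip_network("69.60.114.0/24")

def allow_request(ip: str, user_agent: str) -> bool:
    """Blocking by source network catches the bot no matter what it calls itself."""
    del user_agent  # deliberately ignored -- names change, netblocks don't
    return ipaddress.ip_address(ip) not in BANNED_NET

print(allow_request("69.60.114.5", "nicebot"))        # denied
print(allow_request("69.60.114.5", "Mozilla/4.0 (compatible; MSIE 6.0)"))  # renamed, still denied
```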

Remember, they get desperate when you cut off their source of revenue and they'll attempt to adapt, so use the best prevention up front: lock them out by location and don't waste your time chasing changing user agent names.

Bots gone WILD!

This is just a follow-up on a couple of the bots using distributed IPs I've highlighted recently which just won't take NO for an answer. Ever since their little cluster of scraping IPs was uncovered and blocked, they've kept up non-stop daily requests for hundreds of pages per scraper.

These bots are very nasty so if you weren't paying attention the first time, go back and block THIS, THIS and THIS as they are some hungry-assed bots that need to be stopped.

Wednesday, May 24, 2006

BEZEQINT-HOSTING has a scraper

Coming from the lovely land of Israel is a scraper from Bezeq International. I can't tell if this is a hosting IP or a DSL connection; I'm guessing hosting, but it could just be DHCP.

Like I can read their website, feh!

Anyway, the bot always claims to be:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)
Comes from the following addresses:
So block those at a minimum, and the whole range if you want to be safe.

Ta ta, no scrape for you!

Monday, May 22, 2006

Odd traffic from Hong Kong, Middle-East and Africa

Has anyone noticed any huge spikes in traffic from Hong Kong, Saudi Arabia, South Africa or Dubai lately?

They're setting off alarms all over the place with my bot blocker, and it's all coming from shared networks, so I can't tell yet if it's a lot of people using a few IPs or a few crawlers going crazy in a scraper haven.

I'm thinking about just setting the whole bunch of them to "CAPTCHA-mode", which is the equivalent of forcing them to log in before accessing my site. This will quickly reveal the source of the activity based on the number of unanswered CAPTCHAs vs. valid responses from humans.
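A crude sketch of that CAPTCHA-mode tally; the idea being that humans eventually solve the challenge and crawlers never do. The 50% solve-rate threshold is just an assumption for illustration:

```python
def classify_source(captchas_served: int, captchas_solved: int) -> str:
    """Classify an IP by how often its CAPTCHA challenges get solved."""
    if captchas_served == 0:
        return "unknown"
    solve_rate = captchas_solved / captchas_served
    # Assumed cutoff: real visitors solve most challenges, bots solve none.
    return "likely human" if solve_rate > 0.5 else "likely bot"

print(classify_source(20, 0))   # crawler hammering away, never answering
print(classify_source(3, 2))    # person who fat-fingered one attempt
```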

Let's see what happens next, I'll keep you posted ;)

RED ALERT #3 - GoDaddy hosting distributed scraper

This one may have just moved to a new location, as I've seen similar activity before which then stopped. These new antics have been going on for a week now, and I waited just to make sure it was really coming from a common location, which appears to be a block of IPs on some GoDaddy hosting farm.

This creepy crawler doesn't use any user agent string whatsoever and keeps asking for pages like "/#top" and other stupid stuff. Below is the range of IPs and the number of pages asked for just today. You'll note it was a slow day for them, asking for only 75 pages, but the day isn't over yet.

[] requested 30 pages as ""
[] requested 15 pages as ""
[] requested 15 pages as ""
[] requested 15 pages as ""
Performed an nslookup and got this:

Non-authoritative answer: name =

When I did a whois on the IP, out came the surprise:

OrgName: Go Daddy Software, Inc.
OrgID: GDS-31
Address: 14455 N Hayden Road
Address: Suite 226
City: Scottsdale
StateProv: AZ
PostalCode: 85260
Country: US
178.128.0 - 178.255.255

Now do a whois on
NetRange:

Registrant:
Special Domain Services, Inc.
14455 N Hayden Rd
Scottsdale, Arizona 85260
United States

Registered through:
Created on: 30-Mar-98
Expires on: 29-Mar-12
Last Updated on: 07-Feb-06

Not sure it makes sense to block the entire GoDaddy IP range, so for now that range is all I'm blocking unless I see more rogue activity in their network.

BTW, has anyone noticed how many sneaky crawler networks I'm busting now that I have proximity alarms in place to spot organized activity?

This proximity alarm is great because it doesn't care whether the crawlers ask for 1 page or 100 pages; the minute it detects multiple IP addresses in a similar range doing these things, it pops up on my radar. The best part is that a distributed crawler doesn't even have to use more than one IP address per day: as long as it breaks one of my "bad bot rules" on each visit, the IP is flagged and archived. The proximity report of archived bad-bot activity then exposes those archived bots operating from a single location.
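The proximity report described above boils down to grouping archived offender IPs by network prefix and flagging any prefix with multiple offenders. A minimal sketch, using /24 as the "similar range" and documentation-range IPs as stand-ins:

```python
import ipaddress
from collections import defaultdict

def proximity_report(flagged_ips, min_ips=2):
    """Group archived bad-bot IPs by /24 and report ranges with multiple offenders."""
    groups = defaultdict(set)
    for ip in flagged_ips:
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        groups[net].add(ip)
    return {str(net): sorted(ips) for net, ips in groups.items() if len(ips) >= min_ips}

# Stand-in archive: three offenders clustered in one /24, one loner elsewhere.
archive = ["198.51.100.10", "198.51.100.23", "198.51.100.77", "203.0.113.5"]
print(proximity_report(archive))
```

The loner never shows up in the report, but the clustered trio gets exposed as a single operation, even if each IP only ever broke one rule on one visit.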

Pretty tricky, eh?

You stupid bots better wise up quick, you can't hide behind a bank of IPs, your days are numbered!

Sunday, May 21, 2006

Publicly Available Website

That's the buzzword you hear most often when you confront someone crawling your site, especially a corporation: it's a "Publicly Available Website".

Well, just because something is publicly available doesn't mean you have the right to do whatever you like with it. It's publicly available for the PUBLIC, meaning visitors, to read individual pages, and it's also available to the 6 search engines that I permit to crawl my site. Other than that, just like any other publicly available business, I have the RIGHT TO RESTRICT ACCESS for anyone else I so desire.

For instance many brick and mortar businesses say "No Shoes, No Shirt, No Service".

Well my website has similar rules "No Humans, No Permission, No Service".

If I even get a whiff of a robot on the site, permission denied.

You corporate and private scrapers had better get over your loser mantra: putting a website online, even on a public network, does NOT give everyone complete access to do whatever they feel like with that site. There are terms of service on the site which distinctly prohibit the use of unauthorized tools to crawl it, and if you have to ask what's authorized, then you don't have permission in the first place, so go away.

The site doesn't have a "GNU Free Documentation License", instead it has one of those funny things called a "copyright" which means I own it, not YOU. Additionally, I pay for the server, not YOU. Which means, it's up to ME what is and isn't allowed, even when it's a "Publicly Available Website", NOT YOU!

Let's make it so simple even a 2 year old can understand it:

The website is MINE! MINE! MINE! ALL MINE! and NOT YOURS!

Is that language clear enough for the mental midgets scraping the web to comprehend?