Saturday, April 15, 2006

Alexa Hiding in the Shadows

What the hell's going on with Alexa trying to crawl my site with NO user agent string!

Had an attempted anonymous crawl from 209.237.238.224 which is:

whois 209.237.238.224
Alexa Internet ALEXA-INTERNET (NET-209-237-237-0-1)
209.237.237.0 - 209.237.238.255
They still read the robots.txt file, but no clue who it was:
209.237.238.224 - - "GET /robots.txt HTTP/1.0" 200 111 "-" ""
What the hell's going on, Alexa?

Did someone break the crawler, or are you just trying to sneak under the radar after everyone blocked your ass?

You claim to crawl as "ia_archiver" but I don't see any ID here whatsoever!
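
If you want to catch these anonymous hits in your own logs, a quick Python sketch along these lines will do it. It assumes the stock Apache combined log format (the line quoted above looks to have its timestamp trimmed out, so the pattern treats the timestamp as optional), and you should adjust it to whatever your server actually writes:

import re
import sys

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
# The timestamp is optional here so a trimmed line like the one above still matches.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ (?:\[[^\]]+\] )?"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def anonymous_hits(log_path):
    """Yield (ip, request) for every hit made with a blank or '-' user agent."""
    with open(log_path) as log:
        for line in log:
            m = LOG_LINE.match(line)
            if m and m.group("agent") in ("", "-"):
                yield m.group("ip"), m.group("request")

if __name__ == "__main__":
    for ip, request in anonymous_hits(sys.argv[1]):
        print(ip, request)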

Please explain, inquiring minds want to know!

Friday, April 14, 2006

Cloaked Bots Snared, Charted and Graphed

Since improving my bot blocker's profiling algorithm so that cloaked bots are trapped more accurately, the statistical data has been piling up, and now that data has been compiled and graphed for your viewing pleasure.

Hopefully this will show those who think I'm trying to stop beneficial spiders that those spiders aren't what I've been going on about whatsoever.

For those new to the blog we'll quickly define a couple of terms.

  • Cloaked Bot - a web crawler that uses the user agent string of a browser like Internet Explorer, Firefox or Opera
  • Blocked Agents - those crawlers that plainly identify themselves, such as Googlebot, but are unwanted and typically blocked via robots.txt, .htaccess, or other methods.
The chart below shows the number of pages requested by cloaked bots pretending to be browsers (in purple) compared to blocked agents that have a plainly identifiable user agent string. One thing that's immediately obvious is that page requests by the cloaked bots far exceed the number of page requests made by all the other blocked agents combined.



The trend analysis of the cloaked bots reveals that, other than a couple of recent spikes, their page requests are on a slow but steady decline over the period charted. This could be a direct result of blocking them and sending them pages of error messages, since they aren't getting what they need. Note that over the same period the normal blocked agents keep attempting to request pages at a steady pace, despite the fact they're getting nothing but error pages, with no obvious trend evident at this time.
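
For the curious, numbers like these don't take anything fancy to produce: bucket every blocked request by category, count per day, and fit a slope over the daily totals to spot a trend. Here's a bare-bones Python sketch of that approach; the agent and browser lists are stand-ins for illustration, not my actual block lists, and the slope is just a plain least-squares fit:

from collections import Counter

# Example lists only, stand-ins for illustration, not my actual block lists
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")             # strings cloaked bots hide behind
BLOCKED_AGENTS = ("ia_archiver", "larbin", "WebCopier")   # plainly named but unwanted

def classify(agent):
    """Label a blocked request as a cloaked bot or a plainly identified agent."""
    if any(name in agent for name in BLOCKED_AGENTS):
        return "blocked agent"
    if any(token in agent for token in BROWSER_TOKENS):
        return "cloaked bot"
    return "other"

def daily_tally(blocked_requests):
    """blocked_requests: iterable of (date, user_agent) pairs the blocker stopped."""
    counts = Counter()
    for date, agent in blocked_requests:
        counts[(date, classify(agent))] += 1
    return counts   # per-day, per-category totals ready for charting

def trend_slope(daily_counts):
    """Least-squares slope over a series of daily totals; negative means a decline."""
    n = len(daily_counts)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den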

While some of this may look underwhelming at first, realize that many crawls would've been for hundreds or thousands of pages that are now being stopped before they ever start. Much of the attempted crawling appears to be repeat offenders that already have a complete listing of all the pages of the web site from prior visits. Some cloaked bots do manage to get the site map before being stopped and try to crawl all the links they know about, albeit fruitlessly.

At a minimum this information proved my theory, for my website anyway, that it's not the bots you know and can see that are the real problem; it's the bots you can't see.

With all this in mind, do you think you really know how many actual visitors and page views your site gets?

Thursday, April 13, 2006

Spy vs. Blog Cache

OK, this is getting ridiculous. Go spy on someone who's stupid enough to fall for your tricks.

I reported on blog spying the other day and the direct spying stopped showing up in my activity log, but now I'm seeing information spies watching my site via Google cache.

Let me be the first to tell you sneaky snoopers that you've met your match this time. The gloves are off, so keep fucking with me so I can add every trick in your book to my bot blocker product. Just keep showing off trying to spy on me, because every time you play a new trump card, you can never use that trump card against me again, or against anyone who installs my bot blocker in the future.

Talk about playing with fire, sheesh.

Did you think that you could stop hitting my blog and hit my cache and not be noticed?

Look at this listing from Google cache that was scrubbed of identifiable data to mask these people:

http://72.14.203.104/search?q=cache:incredibill.blogspot.com/ %22[THE IP OF THEIR WEB SITE]%22

Very clever: scanning Google for the domain name of my blog and narrowing the search to LINKS to your site from my blog by your IP address instead of your domain name, hoping it would go unnoticed.

Maybe other people are stupid and fall for this shit, but I noticed that IP address, did an NSLOOKUP, and VOILA! It was the same block of IPs and the same spybot company name that I've blocked on my main website and discussed in this blog before. Now this continuing surveillance is bordering on harassment. Bordering, hell, it IS harassment.
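
In case anyone wonders how a cache view even shows up in a log: the cached copy generally still pulls images and stylesheets straight off your server, and the referer on those hits carries the whole Google cache query along with it. Here's a quick Python sketch for fishing those out, assuming you just want to scan the raw access log:

import re

# A referer pointing at Google's cache carries the whole search query with it
CACHE_REFERER = re.compile(r'"[^"]*search\?q=cache:[^"]*"')

def cache_snoops(log_path):
    """Yield raw log lines whose referer is a Google cache view of your pages."""
    with open(log_path) as log:
        for line in log:
            if CACHE_REFERER.search(line):
                yield line.rstrip()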

Just go away, we're done already.

UPDATE: After lunch I set the blog to NOARCHIVE so as soon as Google and all the rest update and kill the cache we'll see what they try next.

Tuesday, April 11, 2006

Beware Blog Spies

This isn't the first time I've noticed this, but it's definitely the quickest my blog has ever attracted the spying operations that monitor information about specific topics.

The first time it was overtly obvious what was happening: I posted about corporate crawlers that spy on websites, and the very next day my blog was under daily surveillance from these corporate clowns.

The second most obvious time started yesterday when I posted the blurb "Dubious from Dubai", which immediately started getting hits from what appeared to be automated tools performing automated queries for Dubai on search.blogger.com. One of them was from 217.164.235.140, which according to whois is in Dubai: "P.O. Box 1150, Dubai, UAE". The other hits were from several places in New York, which makes me wonder who's camping on any information about these Arab countries, and for what purpose?

I'm not a cloak and dagger kind of guy but sometimes you just see things that raise your eyebrows.

Monday, April 10, 2006

Yugoscrapia My Ass

The "Get a Fucking Clue" award for today goes to whoever the hell is sitting behind 212.102.136.25 hiding as Mozilla/5.0 (compatible; MSIE 5.0) from Yugoslavia that has spent over 8 hours now downloading 1,000+ error messages from my site and is currently still going strong.

Well pal, whoever you are, my automation is better than your automation because mine has a built-in BULLSHIT DETECTOR that stopped yours from doing its job.
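
I'm not going to spell out the whole bag of tricks, but the crudest piece of any bullshit detector is plain request rate: no human with a browser pulls 1,000+ pages over 8 hours without a break. A stripped-down Python sketch of just that one idea, where the window and threshold are example values and not my real settings:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # look at the last hour (example value)
MAX_REQUESTS   = 120    # more than this smells automated (example value)

recent_hits = defaultdict(deque)   # ip -> timestamps of recent page requests

def looks_automated(ip, now=None):
    """Return True once an IP exceeds the request threshold inside the window."""
    now = time.time() if now is None else now
    hits = recent_hits[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS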

Dubious from Dubai

Just got hit for a couple hundred pages each from this triumvirate of IPs from Dubai:

195.229.241.181
195.229.241.180
195.229.241.187
Reverse lookup said NXDOMAIN and whois didn't turn up much either, so I'm just gonna block 195.229.241.* and be done with it.
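
Blocking a wildcard range like that is a one-liner if your blocker keeps a list of banned networks. A minimal Python sketch using the standard ipaddress module, nothing specific to my setup:

from ipaddress import ip_address, ip_network

# 195.229.241.* expressed as a /24, so the whole block gets shown the door
BANNED_NETWORKS = [ip_network("195.229.241.0/24")]

def is_banned(ip):
    """True if the requesting IP falls inside any banned range."""
    address = ip_address(ip)
    return any(address in network for network in BANNED_NETWORKS)

# is_banned("195.229.241.181") -> True, is_banned("198.51.100.7") -> False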

Data Mining Kills the User Agent String

FACT: Your website is raw material for the Internet data mining industrial machine.

Much like the California Gold Rush, there's a lot of free money on the table and everyone is scrambling to grab their share on the Internet. This time, instead of sifting through rocks looking for gold nuggets, they're using bots instead of shovels and crawling websites instead of standing in a creek. The purpose of all this web crawling is to sift information out of a variety of websites, looking for the gold nuggets of content that will bring in free money in the form of internet advertising. Everyone wants a share of the free money, and your website could contain just the right couple of nuggets of gold that the Internet claim jumpers need to succeed.

Simplistic filtering of the User Agent string to block these claim-jumping bots has definitely become obsolete, because most undesirable bots already don't identify themselves as anything unique and try to hide their presence; the prize is too big to let a webmaster stop them. Don't think this behavior is limited to simple content thieves trying to capitalize on your hard work with AdSense, as there are several corporations that I've caught in my snare and probably a bunch more lurking behind IPs that don't expose them with a simple reverse DNS lookup.
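
If it isn't obvious why user agent filtering is a lost cause, here it is in a few lines of Python: the string is whatever the client feels like sending, so any scraper can claim to be Firefox and waltz straight past a blacklist. The URL below is a placeholder, not a real target:

from urllib.request import Request, urlopen

# The user agent header is whatever the client decides to send; nothing verifies it
request = Request(
    "http://www.example.com/",   # placeholder URL for illustration only
    headers={"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/1.5"},
)
page = urlopen(request).read()   # sails right past any user agent blacklist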

What kind of data mining happens on your site?

  • Search Engines
  • Data Aggregators
  • Web Copiers/Offline Readers
  • Copyright Compliance
  • Branding Compliance
  • Corporate Security Monitoring
  • Media Monitoring (mp3, mpeg, etc.)
  • Link Checkers
  • Privacy Checkers
  • Content Scrapers (pure theft)
  • so on and so forth

Other than search engines, which provide a valuable service bringing you traffic, many of these so-called services are just one-way bandwidth hogs that not only earn money off your back but make you pay for the privilege!

Not all of the aforementioned services try to hide who they are, and the more legit ones still check robots.txt and present a user agent string so you can opt out (don't get me started) of their service. However, as the free money flows on the internet, so does the desire not to get caught and stopped, which is exactly where the spy services and scrapers come in.

More and more crawlers every day pretend to be users rather than admit what they truly are and let the webmaster stop them, and that trend seems to be growing rapidly as the stakes get higher.

Use robots.txt and .htaccess while you can, but you're only stopping the good guys, as everything else has gone underground and there doesn't appear to be any reversal of that trend anytime soon.