Monday, April 10, 2006

Data Mining Kills the User Agent String

FACT: Your website is raw material for the Internet data mining industrial machine.

Much like the California Gold Rush, there's a lot of free money on the table and everyone is scrambling to grab their share on the Internet. This time, instead of standing in a creek sifting through rocks with shovels, they're crawling websites with bots. The purpose of all this web crawling is to sift through a variety of websites for gold nuggets of content that will help bring in free money in the form of Internet advertising. Everyone wants a share, and your website could contain just the right couple of nuggets that the Internet claim jumpers need to succeed.

Simplistic filtering of the User Agent string to block these claim-jumping bots has definitely become obsolete. Most undesirable bots no longer identify themselves as anything unique and try to hide their presence, because the prize is too big to let a webmaster stop them. Don't think this behavior is limited to simple content thieves trying to capitalize on your hard work with AdSense: I've caught several corporations in my snare, and there are probably a bunch more lurking behind IPs that a simple reverse DNS lookup doesn't expose.
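To make the point concrete, here is a minimal sketch of the kind of simplistic User Agent filter described above. The blocklist tokens and both sample User Agent strings are made up for illustration and are not taken from any real log. The sketch only shows that a substring blocklist catches a bot that announces itself and misses one that borrows a browser's string.

```python
# Hypothetical blocklist: these tokens are illustrative assumptions,
# not a recommended or complete list.
BLOCKED_TOKENS = ("examplescraper", "sitesucker", "wget")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User Agent string contains a blocklisted token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# A bot that honestly identifies itself (made-up name).
honest_bot = "ExampleScraper/1.0 (+http://example.com/bot)"

# The same kind of bot sending a browser-style string instead.
disguised_bot = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(is_blocked(honest_bot))     # True: the filter catches it
print(is_blocked(disguised_bot))  # False: the filter lets it through
```

The filter has nothing to match against once the bot stops naming itself, which is why this approach mostly stops the bots that were willing to be stopped.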

What kind of data mining happens on your site?

  • Search Engines
  • Data Aggregators
  • Web Copiers/Offline Readers
  • Copyright Compliance
  • Branding Compliance
  • Corporate Security Monitoring
  • Media Monitoring (mp3, mpeg, etc.)
  • Link Checkers
  • Privacy Checkers
  • Content Scrapers (pure theft)
  • so on and so forth

Other than search engines, which provide a valuable service by bringing you traffic, many of these so-called services are just one-way bandwidth hogs that not only earn money off your back, but you get to pay for the privilege!

Not all of the aforementioned services try to hide who they are, and the more legit ones still check robots.txt and present a user agent string so you can opt out (don't get me started) of their service. However, as the free money flows on the Internet, so does the desire not to get caught and stopped, as with the spy services and scrapers.

Every day, more and more crawlers pretend to be users rather than admit what they truly are, which would let the webmaster stop them, and that trend seems to be growing rapidly as the stakes get higher.

Use robots.txt and .htaccess while you can, but you're only stopping the good guys; everything else has gone underground, and there doesn't appear to be any reversal of that trend coming anytime soon.
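As a rough illustration of why robots.txt only stops the good guys, the sketch below uses Python's standard urllib.robotparser to play the part of a well-behaved crawler checking the rules before fetching. The robots.txt content, the bot name ExampleScraper, and the paths are all hypothetical. The check runs entirely on the crawler's side, so a bot that skips it, or that asks under a browser's name, is not held back by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named bot entirely and keep
# everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, path: str) -> bool:
    """Ask the parsed rules whether this user agent may fetch this path."""
    return parser.can_fetch(user_agent, path)


# A polite bot that checks under its real name is told to stay out.
print(may_fetch("ExampleScraper", "/articles/1.html"))  # False

# The same bot asking under a browser-style name is waved through.
print(may_fetch("Mozilla/4.0 (compatible; MSIE 6.0)", "/articles/1.html"))  # True
```

Nothing on the server enforces either answer; the file is a request, and only crawlers that choose to honor it are affected.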

2 comments:

T.J. said...

Hi,
Not really a comment, but I would be interested to hear your views on webaroo. There is a thread about it in the Usenet newsgroup alt.www.webmaster and I thought you might be interested in what is being said there.
Title of thread in the newsgroup is, "Another site sucker rears its hideous head"

IncrediBILL said...

If you've read enough of my blog you would know the types of things Webaroo are doing would put them on top of my shitlist.

I wouldn't mind so much if they asked permission, but like everyone else, they feel entitled to do whatever they want with your content which is the wrong approach.

Hopefully, we'll be doing something about changing this mentality in the near future when people will have to BEG to crawl sites.