Saturday, February 18, 2006

Robots.txt gives Bad Bots clues to access

In a rather lengthy debate with the owner of Majestic-12 on WebmasterWorld, the issue of robots.txt came up over and over, and I finally revealed that robots.txt is arcane and a real problem in the world of scrapers because it gives them clues to accessing your content.

Not only does robots.txt reveal which user agents may be blocked in the .htaccess file, but it also reveals which agents are allowed into your server. Any rogue bot that can't get access to your site can simply examine the robots.txt file and use any allowed user agent name to get past the barricades.
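
To show how little effort this takes, here is a minimal Python sketch of what a scraper could do with a robots.txt file that names individual agents. The function name, the sample file, and the agent names FriendlyBot and BadBot are all made up for illustration, and the parsing is deliberately simplified: a group counts as "blocked" only if it contains "Disallow: /".

```python
def named_agents(robots_txt):
    """Return (allowed, blocked) sets of agent names found in robots.txt text.

    Simplification: a named agent is 'blocked' if its group has 'Disallow: /',
    otherwise it is treated as 'allowed'. The wildcard '*' is ignored because
    it reveals no specific agent name.
    """
    allowed, blocked = set(), set()
    agents, rules = [], []
    seen_rule = False

    def flush():
        # Sort the current group's named agents into allowed or blocked.
        names = {a for a in agents if a != "*"}
        if "/" in rules:
            blocked.update(names)
        else:
            allowed.update(names)

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                flush()
                agents, rules, seen_rule = [], [], False
            agents.append(value)
        elif field in ("disallow", "allow"):
            seen_rule = True
            if field == "disallow":
                rules.append(value)
    if agents:
        flush()
    return allowed, blocked


# Hypothetical robots.txt that names specific agents.
REVEALING = """\
User-agent: FriendlyBot
Disallow:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

allowed, blocked = named_agents(REVEALING)
print("names worth spoofing:", sorted(allowed))
print("names known to be blocked:", sorted(blocked))
```

Run against a file that only has a "User-agent: *" group, the same function returns two empty sets, which is the whole point: there is nothing to learn from it.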

My recommendation is to use a generic robots.txt file such as the following:

User-agent: *
Disallow: /stayout.html
Disallow: /keepaway.html
Disallow: /cgi-bin/
Disallow: /someotherdirectory/

Then allow and disallow robots privately, in your .htaccess file only, to keep that information away from prying eyes and give even the lamest of the scrapers fewer clues about how to penetrate your site's defences.
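
The private side of this can be sketched as follows. This is Python rather than actual .htaccess directives, purely to illustrate the decision being made on the server; the token list and the function name are hypothetical, and a real setup would express the same check in the web server's own configuration.

```python
# Hypothetical private allowlist. It lives on the server and is never
# published, unlike the entries in robots.txt.
ALLOWED_AGENT_TOKENS = ("friendlybot", "mozilla")


def status_for(user_agent):
    """Return 200 if the User-Agent header contains an allowed token, else 403."""
    ua = (user_agent or "").lower()  # treat a missing header as empty
    if any(token in ua for token in ALLOWED_AGENT_TOKENS):
        return 200
    return 403


print(status_for("FriendlyBot/1.0"))  # allowed token present
print(status_for("BadBot/2.0"))       # no allowed token
```

A bot that is turned away only sees the 403. It gets no list of names that would have worked.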


UPDATE: the Majestic-12 conversation at WMW is on hold pending review now, so maybe I spilled too many beans on site security issues. It was a great debate; I hope it comes back only slightly altered.

3 comments:

thebear said...

I agree.

Bill, have you looked at what Google has for cache entries for this blog's home page?

http://64.233.179.104/search?q=cache:XlTc7_uXCSEJ:incredibill.blogspot.com/+site::incredibill.blogspot.com&hl=en&gl=us&ct=clnk&cd=1

Shouldn't it look somewhat like this?

http://incredibill.blogspot.com/2006_01_29_incredibill_archive.html

IncrediBILL said...

Google is always behind, which I've blogged about before. They have some newer content indexed but the cache is out of date. I have no clue, they're a mess in this dept.

thebear said...

No, it isn't a question of being behind.

It is a question of the totally missing right hand navigation column.

It almost looks like a case of cloaking.