Friday, March 03, 2006

POP QUIZ: Your Site Already Has a Spider Trap?

Most web site owners already have a spider trap on their site and don't even know it. There are about 3 pages that humans almost NEVER READ and spiders gobble up daily. All you have to know is which pages these are, then grep for them in your access logs and VOILA! You see a list of mostly spiders hitting your web site that can be blocked at will.
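If grep isn't handy, here's a rough Python sketch of the same idea. It assumes your server writes the usual Common or Combined Log Format, and the trap path `/trap-page.html` is just a placeholder I made up so the quiz isn't spoiled; swap in whatever pages you decide are your traps.

```python
import re
from collections import Counter

# Matches the start of a Common/Combined Log Format line:
# IP ident user [timestamp] "METHOD path protocol" ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')


def trap_visitors(lines, trap_paths):
    """Count hits per IP address on any of the given trap paths."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, path = match.groups()
        # Drop any query string before comparing against the trap list.
        if path.split("?", 1)[0] in trap_paths:
            hits[ip] += 1
    return hits


# Made-up sample lines using documentation-only IP addresses.
sample = [
    '192.0.2.10 - - [03/Mar/2006:10:00:00 -0800] "GET /trap-page.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [03/Mar/2006:10:00:05 -0800] "GET /index.html HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [03/Mar/2006:10:01:00 -0800] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/4.0"',
]

visitors = trap_visitors(sample, {"/trap-page.html"})
for ip, count in visitors.most_common():
    print(ip, count)
```

To run it on a real log, pass an open file instead of the sample list, e.g. `trap_visitors(open("access.log"), {...})`, with the file name adjusted to wherever your server keeps its logs.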

Once you get a list of who's been looking at your spider trap pages, simply take each IP in the list and grep for all of its activity in your access log. When you see hits to all pages and no images loaded, it's a clincher you've got a spider. Just don't get carried away and block Google/MSN/Yahoo.
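The "pages but no images" check can be sketched in Python too. Treat this as a starting point, not gospel: the list of image extensions and the `min_pages` cutoff are my own assumptions, and it again assumes Common or Combined Log Format.

```python
import re

# Matches the start of a Common/Combined Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')

# Assumed list of image extensions; extend it to match your site.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico")


def activity_summary(lines, ip):
    """Count image and non-image requests made by a single IP."""
    pages = images = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group(1) != ip:
            continue
        # Ignore the query string and letter case when checking the extension.
        path = match.group(2).split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTENSIONS):
            images += 1
        else:
            pages += 1
    return {"pages": pages, "images": images}


def looks_like_spider(summary, min_pages=5):
    """Heuristic: plenty of non-image requests and not a single image."""
    return summary["pages"] >= min_pages and summary["images"] == 0
```

Call `activity_summary(open("access.log"), ip)` for each suspect IP and eyeball the counts before blocking anything; a human browsing with images turned off would trip the same heuristic.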

Even if an entry in your access log says Googlebot as the user agent, it may not be Google. Run a reverse DNS lookup on the IP and make sure it resolves to a host in Google's domain; you can do this on DNS Stuff if you don't have other tools available.
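If you'd rather script the lookup, here's a minimal sketch using Python's standard `socket` module. Two things are my additions rather than anything stated above: the suffix list (`googlebot.com` and `google.com`) and the extra forward lookup that checks the hostname points back at the same IP, since whoever controls an IP's reverse DNS can make it claim any name. The `reverse` and `forward` parameters are hypothetical hooks so the function can be tried without touching the network.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot hosts.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, suffixes=GOOGLE_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves into one of the suffixes
    and that hostname resolves forward to the same ip (IPv4 only)."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS record, or the lookup failed
    if not hostname.lower().endswith(suffixes):
        return False  # resolves somewhere else entirely
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # Forward-confirm: the claimed hostname must map back to this IP.
    return ip in addresses
```

Usage is just `is_verified_crawler("192.0.2.1")` with the IP from your log in place of that placeholder. For MSN or Yahoo you would pass a different `suffixes` tuple, which you'd need to look up yourself.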

Now, anyone want to guess which 3 pages or files on a web site are spider traps?

I know I give you all a lot of information, but anyone should be able to figure this out by staring at any typical web site and seeing which links you would never click.

If nobody can figure it out MAYBE I'll tell you on Monday, if I'm in the mood and if I remember.

Come on people, POP QUIZ! Post your guesses, don't be shy!


Anonymous said...

My thoughts would be:
Legals, Terms of use and sitemap xml

Jim said...

well we know one is robots.txt

IncrediBILL said...

Come on folks, those are good but you're missing an obvious one...

Cool Noise said...

Privacy policy?

IncrediBILL said...

Very nicely done!

There might be one or two more labels I've seen on various sites like "User Policy" or something absurd like that but you've all covered the basics.

Everyone gets an 'A' on your Bot Busting 101 exam.

Class dismissed.

Anonymous said...

The only visits to my copyright pages come from spiders. I find this somewhat ironic.