Most of you web site owners already have a spider trap on your web site and don't even know it. There are about 3 pages that humans almost NEVER read but spiders gobble up daily. All you have to know is which pages these are, then grep for them in your access logs and VOILA! You have a list of mostly spiders hitting your web site that can be blocked at will.
Once you have a list of who's been looking at your spider trap pages, simply take each IP in the list and grep for all of that IP's activity in your access log. When you see hits to pages but no images loaded, it's a clincher you've got a spider. Just don't get carried away and block Google/MSN/Yahoo.
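If you want to automate that eyeball check, here's a minimal sketch of the idea in Python. It assumes a standard combined-format log named access.log and a couple of placeholder trap paths, so swap in your own file name and whatever your real trap pages are:

    import re
    from collections import defaultdict

    TRAP_PAGES = {"/terms.html", "/privacy.html"}   # placeholders for your own trap pages
    IMAGE_EXT = re.compile(r"\.(gif|jpe?g|png|ico)\b", re.IGNORECASE)

    hits = defaultdict(list)                        # ip -> list of requested paths
    with open("access.log") as log:
        for line in log:
            parts = line.split('"')
            if len(parts) < 2:
                continue
            ip = parts[0].split()[0]                # first field is the client IP
            request = parts[1].split()              # e.g. ['GET', '/index.html', 'HTTP/1.1']
            if len(request) >= 2:
                hits[ip].append(request[1])

    for ip, paths in hits.items():
        # Hit a trap page and never loaded a single image? Smells like a bot.
        if any(p in TRAP_PAGES for p in paths) and not any(IMAGE_EXT.search(p) for p in paths):
            print(ip, "looks like a spider:", len(paths), "requests, no images")

Anything it prints is worth a second look before you block it, since some legitimate proxies and text browsers never fetch images either.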
Even if an entry in your access log claims Googlebot as the user agent, it may not be Google, so check where the IP resolves with a reverse DNS lookup and make sure google.com is in the domain. You can do the lookup on DNS Stuff if you don't have other tools available.
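If you'd rather script the check than paste IPs into DNS Stuff, the usual trick is a reverse lookup followed by a forward confirmation. Here's a rough sketch using Python's socket module; the googlebot.com/google.com suffixes are what Google's crawlers are known to resolve to, and the sample IP is just an illustration:

    import socket

    def is_real_googlebot(ip):
        """Reverse-resolve the IP, check the hostname, then forward-confirm it."""
        try:
            host = socket.gethostbyaddr(ip)[0]        # reverse DNS lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward = socket.gethostbyname(host)      # forward lookup must match the original IP
        except socket.gaierror:
            return False
        return forward == ip

    # Example: an address in Google's published crawler range should come back True.
    print(is_real_googlebot("66.249.66.1"))

The forward confirmation matters because anyone can fake a PTR record for their own IP space, but they can't make Google's DNS point back at it.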
Now, anyone want to guess which 3 pages or files on a web site are spider traps?
I know I give you all a lot of information, but anyone should be able to figure this out by staring at any typical web site and seeing which links you would never click.
If nobody can figure it out MAYBE I'll tell you on Monday, if I'm in the mood and if I remember.
Come on, people, POP QUIZ! Post your guesses, don't be shy!
Friday, March 03, 2006
POP QUIZ: Your Site Already Has a Spider Trap?
Posted by IncrediBILL at 3/03/2006 06:26:00 PM
5 comments:
My thoughts would be:
Legals, Terms of use and sitemap xml
Come on folks, those are good but you're missing an obvious one...
Privacy policy?
Very nicely done!
There might be one or two more labels I've seen on various sites like "User Policy" or something absurd like that but you've all covered the basics.
Everyone gets an 'A' on your Bot Busting 101 exam.
Class dismissed.
The only visits to my copyright pages come from spiders. I find this somewhat ironic.