Thursday, August 23, 2007

State of Spider Verification One Year Later

A year ago at SES in San Jose we made a big fuss about not being able to validate if the spiders were truly coming from the search engines or being spoofed.

At the time, some people maintained lists of known valid spider IP addresses, while others authorized entire IP ranges for various datacenters in case the spiders switched to new IPs, which happened frequently.

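Finally, the big 4 search engines have all gotten on board and implemented round trip DNS checking for spider verification, with Google leading the pack back in September '06, right on the heels of SES San Jose. Round trip DNS means doing a reverse lookup on the visiting IP to get a hostname, then a forward lookup on that hostname to confirm it resolves back to the same IP.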

Here's the implementation timeline:

09/20/06 - How to verify Googlebot on Google's Webmaster Central Blog

11/29/06 - Ask has round trip DNS support as well. Not sure of the exact date, but it appears Ask beat out Microsoft based on a post on Matt Cutts' blog. I remember them mentioning this at one of the conferences last year, definitely PubCon at a minimum. If someone from Ask wants to give us an official date, that would be nice.

11/29/06 - Search robots in disguise on the Live Search team's blog. I remember that when I asked the search engine panel at PubCon when they were going to follow Google's lead on this issue, the Live Search guy's hand shot right up and he said they already had it done.

Look at how quick and responsive 3 search engines were to webmaster complaints about spoofing issues.

...and barely getting it done before SES San Jose '07

06/05/07 - Yahoo! Search Crawler, Slurp, has a new Address and Signature Card on the Yahoo! Search Blog.

Better late than never, and it would probably have been a big embarrassment had another year passed without keeping up with the competition.

Other spiders that appear to have implemented round trip DNS validation, to name a few off the top of my head, include Exabot, Furlbot, Twiceler, VoilaBot, even a few aggregators like BecomeBot, and a whole lot more, so it's catching on.

Then you have stragglers like Gigabot that don't even bother setting any reverse DNS whatsoever, so you have to do a whois on the IP address just to see if the IP block is assigned to their company or not. Come on people, get with the program!

Obviously we still have a few search engines that need to catch up, but at least all the major players can now be verified, and a simple PHP script using round trip DNS verification can stop proxy hijackers and scrapers that spoof the search engines.
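
For anyone who wants to see the logic spelled out, here is a minimal sketch of the round trip check, written in Python rather than PHP. It is not the script mentioned above, just an illustration. The function names are made up for this example, and the hostname suffixes are assumptions based on what the engines' announcements described, so check each engine's own documentation before relying on them.

```python
import socket

# Assumed hostname suffixes for illustration only; verify against each
# search engine's own documentation before using them in production.
TRUSTED_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net", ".search.live.com")


def reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host):
    """Return the list of IPv4 addresses a hostname resolves to."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_verified_spider(ip, suffixes=TRUSTED_SUFFIXES,
                       reverse=reverse_lookup, forward=forward_lookup):
    """Round trip DNS check: IP -> hostname -> IP must come back the same."""
    host = reverse(ip)
    if not host:
        return False  # no reverse DNS at all, cannot verify this way
    host = host.lower().rstrip(".")
    if not host.endswith(tuple(suffixes)):
        return False  # hostname is not in a trusted crawler domain
    # Anyone can set a fake PTR record on their own IP block, so the
    # forward lookup on the claimed hostname has to confirm the IP.
    return ip in forward(host)
```

You would only run this when the user agent claims to be one of the big spiders, and cache the result per IP so you aren't doing two DNS lookups on every hit. The lookups are passed in as parameters so the logic can be tried out without touching real DNS.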


Anonymous said...

Yup, just when we thought that just maybe we could rely on SEs to provide valid information.

What is this, Yahoo! ? - - [31/Aug/2007:18:42:54 +0000]

Anonymous said...

And here comes the real Slurp, - - [31/Aug/2007:18:42:54 +0000]

Look at the time, each bot requested the same page. Ink got a 403 and Slurp a 200