Sunday, December 23, 2012

Honors Robots.txt My Ass

I thought I'd seen everything until I stumbled onto a service called "All Site Search", which has the following to say about robots.txt:

Do you care about robots.txt?

Yes, we do, but we will still index everything. However, the things indexed that match the statements in your robots.txt file, will be marked. This way, you can yourself decide if those things should be visible to the visitors to your site, or perhaps you want it to be visible only to you. This is easily done by a parameter setting.

The purpose of robots.txt is to tell the spider not to crawl the pages in the first place. It is not something to ignore while crawling and then enforce only when indexing.

If the webmaster has installed a robots.txt entry for their spider "alsRobot2" or "alsRobot3" then why in the hell would they need to define it again? The webmaster either wants it crawled or they don't. Period.
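
For comparison, here is a minimal sketch of what a well-behaved spider does: it checks robots.txt before fetching anything and skips what it is told to skip. It uses Python's standard urllib.robotparser. The spider names are the ones the service publishes; the example.com URLs, the /private/ rule and "SomeOtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban the two All Site Search spiders outright,
# and keep every other bot out of /private/ (an illustrative path).
ROBOTS_TXT = """\
User-agent: alsRobot2
User-agent: alsRobot3
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_urls(user_agent, urls, robots_txt=ROBOTS_TXT):
    """Return only the URLs this user agent may crawl under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A compliant crawler filters BEFORE fetching, not after indexing.
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://example.com/",
    "http://example.com/private/notes.html",
]

print(allowed_urls("alsRobot2", urls))     # nothing is crawlable
print(allowed_urls("SomeOtherBot", urls))  # only the public page
```

The point is that the disallowed pages never get requested at all, so there is nothing left to "mark" afterwards.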

It also means someone could use this free service to index anyone's website and, since the spider basically ignores robots.txt, get access to the entire site regardless of the webmaster's wishes.

Another bot best blocked in .htaccess: it doesn't use robots.txt properly, and a hard block covers you in case the service gets used maliciously for whatever reason.
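
A rough sketch of such a block, assuming Apache with mod_rewrite enabled and assuming the spider's User-Agent header actually contains the string "alsRobot" (check your own access logs before relying on that):

```apache
# Assumption: the spider identifies itself with "alsRobot" somewhere
# in its User-Agent header (covers alsRobot2 and alsRobot3).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} alsRobot [NC]
# Answer every matching request with 403 Forbidden.
RewriteRule .* - [F,L]
```

This only works as long as the bot announces itself honestly; if it ever changes or fakes its user agent, the rule will need updating.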