Sunday, December 23, 2012

Honors Robots.txt My Ass

I thought I'd seen everything until I stumbled onto a service called "All Site Search" which has the following to say about robots.txt:

Do you care about robots.txt?

Yes, we do, but we will still index everything. However, the things indexed that match the statements in your robots.txt file, will be marked. This way, you can yourself decide if those things should be visible to the visitors to your site, or perhaps you want it to be visible only to you. This is easily done by a parameter setting.
The purpose of robots.txt is to tell the spider not to crawl the pages in the first place, not to be ignored while crawling and then enforced only when indexing.
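For comparison, here is a rough sketch of what a well-behaved spider does: it checks robots.txt before requesting a page and skips anything disallowed. It uses Python's standard urllib.robotparser. The robots.txt contents and the "SomeOtherBot" name are made up for illustration, and "alsRobot3" is assumed to be the token the spider matches against.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: shut out alsRobot3
# entirely, and keep every other bot out of /private/.
ROBOTS_TXT = """\
User-agent: alsRobot3
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Return only the URLs this user agent is permitted to crawl."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
urls = [
    "http://www.somesite.com/",
    "http://www.somesite.com/private/notes.html",
]

# A compliant crawler filters its queue BEFORE fetching anything.
print(allowed_urls(parser, "alsRobot3", urls))
print(allowed_urls(parser, "SomeOtherBot", urls))
```

The point is that the filtering happens before the fetch. A disallowed page never gets requested, so there is nothing to "mark" in an index afterwards.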

If the webmaster has installed a robots.txt entry for their spider "alsRobot2" or "alsRobot3" then why in the hell would they need to define it again? The webmaster either wants it crawled or they don't. Period.

It also means someone could use this free service to index anyone's website, since they basically ignore robots.txt, to get access to the entire site regardless of the webmaster's wishes.

Another bot best blocked in .htaccess, since it doesn't use robots.txt properly, just in case the service gets used maliciously for whatever reason.
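A minimal sketch of such a block, assuming Apache with mod_rewrite enabled and assuming the spider's User-Agent header actually contains "alsRobot" (I only know those names from the robots.txt context, so check your own access logs first):

```
# Refuse any request whose User-Agent contains "alsRobot" (case-insensitive).
# The pattern is an assumption based on the names alsRobot2 and alsRobot3.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} alsRobot [NC]
RewriteRule .* - [F,L]
```

The [F] flag answers with 403 Forbidden, so the bot gets nothing regardless of how it treats robots.txt.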

1 comment:

Anonymous said...

Hi Bill,


You write:
"It also means someone could use this free service to index anyone's website, since they basically ignore robots.txt, to get access to the entire site regardless of the webmaster's wishes."


This is of course not possible. Look further down on the page
http://www.allsitesearch.com/als_faqs.htm
It says "Can someone else use my search results?"


It is not possible to do what you are saying. For example, you cannot register the site "www.somesite.com" at All Site Search unless you are the owner of the domain "somesite.com".


Additionally, you cannot "steal" someone else's search. If, for example, John is the owner of "www.somesite.com" and has registered and implemented the search on his site, you cannot put his search on your site. The reason for this is that the search must be executed from the site for which the registration was done.


Thus, the site owner is in 100% control, including over what to show. The following simple line hides all robot-excluded material.
<input type="hidden" name="ro" value="1">
This is an extended service. The site owner may then have two search interfaces: one that displays everything and is used only by the site owner, and another interface for the public.



Would be grateful if you updated your blog.



The very Best Regards,


Jan-Olof Granlund, Munax