Thursday, January 01, 2009

MSNBOT Crawled Thru Javascript!

Today I caught MSNBOT-MEDIA crawling thousands of links that were only accessible thru javascript.

These links were intended only for human visitors and were reachable only via javascript, so I never added them to robots.txt...

... until today

Here are the MSNBOT specifics for this crawl:
"GET /feedback.html?id=1010101234 HTTP/1.0"
"msnbot-media/1.0 (+"
I have a page used for site feedback on various page elements, and each link on the page has an OnClick handler like this:
<a href="#" OnClick="OpenFeedback(1010101234)">
Elsewhere in the code is the actual function:
function OpenFeedback(id) {'feedback.html?id=' + id,.....')
MSNBOT appears to have pieced the two together and was crawling thousands of links such as "/feedback.html?id=1010101234" and so on, page after page.
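
To show how little it takes to piece those two bits together, here's a rough sketch of one way a crawler could do it without ever executing the javascript. I have no idea how MSNBOT actually does it; this is just pattern matching on the page source. The sample markup, the window.open call inside the function body, the guessUrls name, and the assumption that the page lives at the site root are all made up for illustration.

```javascript
// Hypothetical page source, modeled on the snippets above.
// The window.open(...) body is an assumption, not the real function.
const html = `
<a href="#" OnClick="OpenFeedback(1010101234)">Feedback</a>
<a href="#" OnClick="OpenFeedback(1010101235)">Feedback</a>
<script>
function OpenFeedback(id) { window.open('feedback.html?id=' + id); }
</script>`;

// guessUrls is a hypothetical helper: it never runs the page's script,
// it only matches text patterns in the source.
function guessUrls(source) {
  // Step 1: find one-parameter functions whose body concatenates a
  // single-quoted string with that parameter, and remember the string.
  const templates = new Map();
  const fnRe = /function\s+(\w+)\s*\(\s*(\w+)\s*\)\s*\{[^}]*?'([^']+)'\s*\+\s*\2/g;
  for (const m of source.matchAll(fnRe)) {
    templates.set(m[1], m[3]); // e.g. OpenFeedback -> feedback.html?id=
  }

  // Step 2: find onclick handlers calling one of those functions with a
  // numeric literal, and glue the literal onto the remembered string.
  const urls = [];
  const callRe = /onclick\s*=\s*"(\w+)\((\d+)\)"/gi;
  for (const m of source.matchAll(callRe)) {
    const prefix = templates.get(m[1]);
    if (prefix !== undefined) {
      urls.push('/' + prefix + m[2]); // assumes the page sits at the site root
    }
  }
  return urls;
}

console.log(guessUrls(html));
```

With the sample markup above, this returns "/feedback.html?id=1010101234" and "/feedback.html?id=1010101235". Two regular expressions are enough for javascript this simple, which is the whole problem.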

I have no clue if this was a manual job done just for the site in question or some new pet project testing their ability to crawl javascript, but the game has definitely changed.

To put it bluntly, javascript itself is no longer sufficient to curtail crawlers on the web, at least not simple javascript.