Thursday, January 01, 2009

MSNBOT Crawled Thru Javascript!

Today I caught MSNBOT-MEDIA crawling thousands of links that were only accessible thru javascript.

These links were intended only for human use and reachable only via javascript, so they were never added to robots.txt...

... until today
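
For readers wondering what such an entry might look like, here is a minimal sketch, not the exact rule used on the site. It assumes the feedback page sits at /feedback.html and that the crawler honors the msnbot-media user-agent token taken from its own user-agent string. Disallow rules match by path prefix, so the ?id=... variants should be covered as well.

```text
# Hypothetical robots.txt entry; the path and user-agent token are assumptions
User-agent: msnbot-media
Disallow: /feedback.html
```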

Here are the MSNBOT specifics used for this crawl:

65.55.235.202 "GET /feedback.html?id=1010101234 HTTP/1.0"
"msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
I have a page used for site feedback for various page elements and each link on the page has an OnClick command like this:
<a href="#" OnClick="OpenFeedback(1010101234)">
Elsewhere in the code is the actual function:
function OpenFeedback(id) {
window.open('feedback.html?id=' + id,.....')
}
MSNBOT appears to have pieced the two together, combining the id from each OnClick with the URL built inside the function, and was crawling thousands of links such as "/feedback.html?id=1010101234" and so on, page after page.
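
To show how little it takes to do this without running any javascript at all, here is a rough sketch of the idea. It is a guess at the general technique, not MSNBOT's actual method. The function name assembleUrls, the sample markup, and the second argument to window.open are all made up for illustration.

```javascript
// Hypothetical sketch: rebuild crawlable URLs from OnClick handlers by
// pattern matching alone. This is NOT known to be how MSNBOT does it.

// Sample markup in the style of the feedback links described above.
const html = `
<a href="#" OnClick="OpenFeedback(1010101234)">Feedback</a>
<a href="#" OnClick="OpenFeedback(1010101235)">Feedback</a>
`;

// Sample script; the second window.open argument is an assumption.
const script = `
function OpenFeedback(id) {
  window.open('feedback.html?id=' + id, 'feedback');
}
`;

function assembleUrls(html, script) {
  // Step 1: find one-parameter functions that call
  // window.open('<prefix>' + <param> and remember the URL prefix by name.
  const templates = new Map();
  const fnPattern =
    /function\s+(\w+)\s*\(\s*(\w+)\s*\)\s*\{[^}]*?window\.open\(\s*'([^']*)'\s*\+\s*\2\b/g;
  for (const m of script.matchAll(fnPattern)) {
    templates.set(m[1], m[3]);
  }

  // Step 2: find onclick="Name(digits)" attributes and substitute the
  // numeric argument into the matching prefix.
  const urls = [];
  const callPattern = /onclick\s*=\s*"(\w+)\((\d+)\)"/gi;
  for (const m of html.matchAll(callPattern)) {
    const prefix = templates.get(m[1]);
    if (prefix !== undefined) {
      urls.push('/' + prefix + m[2]);
    }
  }
  return urls;
}

console.log(assembleUrls(html, script));
```

Two regular expressions are enough for a handler this simple, which is why only simple javascript is at risk from this kind of shortcut. Anything that builds the URL in several steps would need a real script engine to follow.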

I have no clue if this was a one-off done by hand just for the site in question or some new pet project testing their ability to crawl javascript, but the game has definitely changed.

To put it bluntly, javascript itself is no longer sufficient to curtail crawlers on the web, at least not simple javascript.

7 comments:

Anonymous said...

I trust that it passed the reverse dns lookup script that you have set up? Are you able to see the links in microsoft's webmaster central yet?

Anonymous said...

Wow...I was quite surprised to see this happening. Please keep the thread up to date if you start seeing those in webmaster central! You got a Sphinn from me mate.

Anonymous said...

Bots are getting smarter, there's no doubt about that. Every time I look at my logs it seems like they've learnt at least one new trick.

This is going to be big news for all those webmasters who've been using JS to hide links. And there must be loads, because before nofollow there weren't that many other options.

Anonymous said...

For the readers, as he didn't mention it - yes, that IP resolves to Microsoft Live Search.

Anonymous said...

From an SEO point of view this is a very interesting development that we have been watching. What about text in RSS feeds being used as back links? This new development has opened all sorts of questions. One thing is for certain: Google and others are starting to crawl and index Flash a lot better with the use of image scanning software that can read the text on an image. Well spotted Bill. Will be keeping an eye on this post. Keep it up mush!!!

Berni said...

Well this is nothing new. I now and then get requests with Javascript shreds like
GET /self.location}
If the IP range is not in the scope of my target audience, it gets blocked.

Bradford Web Design said...

I have just caught MSNBOT at the same thing and googled what was going on and found your blog! Have you worked out a way to stop it yet?