Wednesday, November 22, 2006

Exalead Preview Violating Webmasters' Content

It's been ages since I've wandered over to Exalead and played with it for a while. I get a few spurious hits from their cute little search engine so I thought I'd explore for a bit and see what it had to offer.

Oh look, nice layout, thumbnails, click on the thumbnails and get a site preview...

Oh my god, they downloaded my page in real-time, stripped out my JavaScript so they could frame it without my frame buster working, and the page looks like shit now.

I'm speechless...

Not to mention infuriated that they would violate my content in such a manner.

If you want to just block the preview mode, they send a request like this:

193.47.80.78 "GET / HTTP/1.1" "http://www.exalead.com/search" "NG/4.0.2897.395"

So blocking "^NG/" in .htaccess should do it. Add "NOARCHIVE" to all your pages too, just to make sure they don't pull up an old copy; it would REALLY piss people off if they don't honor NOARCHIVE.

If you just want to block the bot, it's "Exabot/3.0".

If you just want to block them completely, they crawl from here:
inetnum: 193.47.80.0 - 193.47.80.255
netname: EXALEAD
route: 193.47.80.0/24
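If you do your blocking in application code instead of .htaccess, here's a rough Python sketch of the three levels above: preview requests, the bot, and the whole network. The user agent patterns and the /24 come straight from the log line and whois data in this post; the function and constant names are made up for illustration, and I'm assuming the bot keeps "Exabot" somewhere in its user agent.

```python
import ipaddress
import re

# Values from the post; all names here are hypothetical.
EXALEAD_NET = ipaddress.ip_network("193.47.80.0/24")
PREVIEW_UA = re.compile(r"^NG/")    # preview fetcher, e.g. "NG/4.0.2897.395"
CRAWLER_UA = re.compile(r"Exabot")  # crawler, e.g. "Exabot/3.0" (assumed substring match)


def classify(ip: str, user_agent: str) -> str:
    """Return 'preview', 'crawler', 'network', or 'other' for one request."""
    if PREVIEW_UA.search(user_agent):
        return "preview"
    if CRAWLER_UA.search(user_agent):
        return "crawler"
    # Fall back to the address range for anything else coming from their block.
    if ipaddress.ip_address(ip) in EXALEAD_NET:
        return "network"
    return "other"


if __name__ == "__main__":
    print(classify("193.47.80.78", "NG/4.0.2897.395"))
```

Block on "preview" only, on "preview" plus "crawler", or on anything that isn't "other", depending on how far you want to go.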
Just another reason why webmasters will keep hating some web sites and search engines: they just don't get it. So fuck 'em, they can't play in my sandbox any more.

7 comments:

Cd-MaN said...

Hello.

I understand that it's your content on the website and you can do whatever you want with it, but isn't this a little overzealous? If you insist that only humans should request your pages, you (a) will lose some visitors who use some funky setup or are from the wrong IP block and (b) you'll kill off the chance for your content to be exposed on other pages.

IncrediBILL said...

My traffic has been increasing, not decreasing, since I started taking control of my content - and it's not this blog either.

If they want to index me in Exalead, then they'll have to do something about the preview mode thing, it's over the top.

Anonymous said...

Hello,

Until now we had not had any bad feedback from webmasters about our preview. But I understand your point: we will add the ability for webmasters to refuse to have their pages previewed and thumbnailed, by using these two meta tags:
meta name="robots" content="nopreview"
or
meta name="robots" content="nothumbnail"
It should be online very soon.
Would that be ok for you ?

Anonymous said...

exaleadguy said:
"Would that be ok for you ?"

Certainly wouldn't be OK for me!

I manage over 15 sites, and having to retrospectively hand edit every one of hundreds of pages is a huge task.

Your method also adds an extra line of code that valuable search engines will have to wade through before they get to the all-important content.

tmaster said...

No, the robot does not comply with simple, basic robots.txt commands to not load images.

User-agent: *
Disallow: /images/

IncrediBILL said...

ExaleadGuy,

First, your crawler should tell us where to find information about your bot, including the exact user agent names you look for. Whether it's "Exabot" or "NG" or whatever, I'd prefer the preview at least be "Exabot NG" so people would know they're related.

I would suggest a user agent string like:

"Exabot/3.0 http://www.exalead.com/bot.html"

I would prefer a robots.txt entry for your bot so I don't have to clutter up all my webpages on the site in question, which has about 40K pages.

User-agent: Exabot
Option: nopreview
Option: nothumbnail

For people that can't access robots.txt, such as many bloggers, your suggestion is fine and they can add it to their template.

Allow them to be combined for brevity:

meta name="robots" content="nopreview, nothumbnail"

tmaster said...

Looks like they think using robots.txt would be too confusing, based on the reply they posted here on my site.

If the usage of the thumbnails tend to generalize in search engines, the webmasters will certainly take it into account in the writing of their robots.txt file and we will be able to use it again but for the moment it would be deceptive both for webmasters and end users.