Tuesday, November 06, 2007

Munax Stealth Crawler

Stumbled upon a stealth crawler hitting my site from multiple IPs, and it turned out to belong to Munax, who admit right up front that they haven't named their crawler and that it poses as a legit user, which is pretty damned scummy.

My guess is they figured out they couldn't access sites with good security under a bot name, so they decided to get around it by not having one. Here's the bullshit excuse they use:

Our crawler does not have a "name", yet. Instead it announces itself to be a standard web browser, a "Mozilla 4.0" kind-of-browser compatible with the browser Microsoft Internet Explorer 6.0, running on the Windows NT 5.1 operating system. The reasons for this are: (a) Today, web servers are intelligent enough to react on the type of user agent. If our crawlers had a name, say MunaxRob or something like that, many web servers would not know about it and would return junk or maybe nothing at all. (b) We want the web server to return a page to us where the page looks as close as possible to a page that can be viewed with a standard web browser. This, to create the best possible indexing in our database and a WYSIWYG experience for anybody that is visiting our search engine.
Well listen up fuckheads, there's a reason we would return junk or nothing at all: we don't want your goddamn spider crawling our fucking website!

What part of FUCK OFF! don't you understand? What drives you to bypass our security and crawl regardless of whether we want you to or not?

Amazingly they admit their IP range:
Your site might have been visited by our crawlers, with network addresses in the range of 82.99.30.2 - 82.99.30.73. Here is a short FAQ answering some of the questions you might have:
I've confirmed this crawl range in my logs:
82.99.30.15 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.17 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.21 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.25 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.26 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.30 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.33 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.37 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.45 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.54 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.67 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Well, this fucking crawler is now blocked.

Bunch of bullshit....
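For anyone who wants to check their own logs, here's a rough sketch in Python of the kind of range test involved. It assumes log lines shaped like the ones above (IP, a space, then the user agent), and the helper names are made up for illustration. It is not what my own blocking setup actually runs.

```python
import ipaddress

# Range Munax announces in its FAQ (copied from the quote above).
FIRST = ipaddress.ip_address("82.99.30.2")
LAST = ipaddress.ip_address("82.99.30.73")


def in_announced_range(ip: str) -> bool:
    """True if ip is an IPv4 address inside the announced range."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on junk input
    return addr.version == 4 and FIRST <= addr <= LAST


def covering_network(first, last):
    """Smallest single CIDR block that contains both first and last."""
    net = ipaddress.ip_network(f"{first}/32")
    while last not in net:
        net = net.supernet()  # widen the prefix by one bit
    return net


def flag_log_lines(lines):
    """Yield (ip, user_agent) for lines whose IP is in the announced range."""
    for line in lines:
        ip, _, user_agent = line.strip().partition(" ")
        try:
            hit = in_announced_range(ip)
        except ValueError:
            continue  # first field was not an IP address, skip the line
        if hit:
            yield ip, user_agent


if __name__ == "__main__":
    sample = [
        "82.99.30.15 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "82.99.30.67 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    ]
    for ip, user_agent in flag_log_lines(sample):
        print(ip, user_agent)
    # One CIDR block wide enough to cover everything they announce.
    print(covering_network(FIRST, LAST))
    # Exact list of CIDR blocks for the announced range, nothing more.
    for net in ipaddress.summarize_address_range(FIRST, LAST):
        print(net)
```

The single covering block is simpler to drop into a block list but catches a few addresses outside the announced range; the exact list from `summarize_address_range` is tighter but longer. Pick whichever suits your firewall or server config.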

5 comments:

Sam Jones said...

Why not block the whole Munax range 82.99.30.0/25:

% Information related to '82.99.30.0 - 82.99.30.127'
inetnum: 82.99.30.0 - 82.99.30.127
netname: MUNAXNET
descr: Munax AB
country: SE
person: Jan-Olof Granlund
address: Munax AB
address: Artillerigatan 6
address: 114 51 Stockholm
address: Sweden
phone: +46 (0)73 932 3678
e-mail: info@munax.com

and yes their crawling FAQ is amazingly arrogant and contemptuous of site owners.

Note how they say that, while they respect robots.txt, they will (a) **still** retrieve your home page regardless, AND (b) regardless of any robots.txt settings, STILL INDEX ANY PAGES ON YOUR SITE THAT ARE LINKED TO FROM OTHER SITES, "assuming that those links must be OK to index since other sites are allowed to link to your site".

And note the instructions on removing your site from their index. Basically they are saying 'we might remove your content, if you are very lucky'. Arseholes!!

=======================
"Do you honour the robots.txt protocol ?

Yes we do. However, the crawler will always (almost) fetch the first page of the site, i.e. the page of the root URL "/". This is for ranking calculation reasons. When we leave beta state we will most likely change this so the first page will be skipped too. Also, if other sites links to multimedia on your site, the crawler will index those links, assuming that those links must be OK to index since other sites are allowed to link to your site.

The crawlers will ignore a robots.txt file if it is not correctly written.


How do I exclude my site from being indexed ?

Remove NOSPAM from the email address info@NOSPAMmunax.com and send an email with the subject "Exclude my site from indexing, code: 84jdur74ud". In your email you should state the full URL of the site. Also, note that others might want to have your site excluded, so be sure to use a correct senders email address. It should have the same domain name as the site you want to exclude.

Because of being in beta state and so many things to do and so many requests to serve, your site might not be excluded until the next time we crawl & index the web.

=================
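For contrast, here is a rough sketch of what honouring robots.txt looks like when a crawler does it properly, using Python's standard `urllib.robotparser`. The `allowed` helper, the `ExampleBot` name and the example.com URLs are invented for illustration; this says nothing about how Munax's crawler is actually written.

```python
from urllib import robotparser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that wants no crawling at all.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# A site that only fences off one directory.
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    bot = "ExampleBot"  # hypothetical crawler name
    # Under a blanket Disallow the root page is off limits too.
    print(allowed(BLOCK_ALL, bot, "http://example.com/"))
    # So is media that some other site happens to link to.
    print(allowed(BLOCK_ALL, bot, "http://example.com/media/clip.mp3"))
    # With a partial rule, only the listed path is blocked.
    print(allowed(BLOCK_PRIVATE, bot, "http://example.com/"))
    print(allowed(BLOCK_PRIVATE, bot, "http://example.com/private/page.html"))
```

Under a blanket `Disallow: /`, a compliant crawler gets no exception for the root URL or for files that other sites link to, which is exactly where the FAQ quoted above departs from the protocol.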

Johann said...

Well Bill, you're kinda late to the Munax party. :-)

IncrediBILL said...

I'm never late to the party, I just like making an entrance...

Doug said...

Aha! How interesting. I left a comment, for the first time, on one of your posts the other day (http://incredibill.blogspot.com/2008/03/rebi-shoveler-digging-for-korean-search.html) and lo and behold, 3 days later I get hit by the Munax stealth crawler on one of my blogs, and I ended up back here from a Google search on Munax. I doubt, however, that the two have anything to do with each other :) Good information, thanks, keep up the good work! Now excuse me while I go pop their fucking subnet into my block list. Bunch of fucking leeches.

Mick said...

I posted this somewhere else but I can't find it now, so here goes again:
Has anyone heard of the Setooz search engine? They visited my site today.
http://www.setooz.com/
The page just says it's under construction but the title tells it all!