Sunday, August 31, 2008

MJ12BOT's Dirty Little Secret

Many of you have seen MJ12bot hammering your site from IPs all over the world, both the legit crawler and the fake virus version 1.0.8.

"Mozilla/5.0 (compatible; MJ12bot/v1.2.3; http://www.majestic12.co.uk/bot.php?+)"
Everyone likes an underdog and we tolerate these crawlers when they claim such a noble purpose:
We do spider the Web for the purpose of building a distributed search engine with fast and efficient downloadable distributed crawler that will enable people with broadband connections to help contribute to, what we hope, will become the biggest search engine in the world.
It's been crawling for years and I've never seen any traffic from this damn thing, has anyone?

Then the "offshoot" of this so-called search engine emerges: Majestic SEO, which claims to be "a commercial offshoot from Majestic-12".

What do they do with the data gathered when they crawl your site?

Here's a direct quote, including typos and bad grammar:
competitive reports are now avaialble, you can buy credits and then use them to see information any domain!
So let's get this right, if we let you crawl our server then you'll let our competitors buy information that can be used against us?

We have a simple solution to this distributed problem:
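# Return 403 Forbidden to any request whose User-Agent contains "MJ12bot" (any case)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC]
RewriteRule .* - [F]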
Compete with that.

21 comments:

Connie said...

I already had this one on a black list. Not sure why?

Shouldn't using a white list or opt-in list stop this user agent?

IncrediBILL said...

Of course the whitelist will stop this little bugger, but not everyone whitelists.

I'm just spreading the word on what these crawls are being used for so others are aware they're selling their SEO soul by letting it crawl.

corey said...

those mother *******

Johann said...

Btw: Gigablast Inc. also does something like this. Not sure if you allow Gigabot but I don't.

IncrediBILL said...

Gigablast is different though, it's a real search engine and has real blocks of IP addresses so you can verify it's really Gigablast.

Not so with MJ12BOT as it's completely parasitic and leeches bandwidth off people running the distributed crawler and then uses the data for something other than what their project claimed, which is slimy.

Worse yet, you can't validate it's really MJ12BOT so it could be a scraper or that damn virus.

Think they share the profits with all those volunteers running the distributed crawlers?

Anonymous said...

AIUI, the "commercial" version only examines parts of your site that your competitors can see anyway just by directly going there, and doesn't hack or nefariously access private and confidential information in any way. So while you may not like it, I don't see how you can honestly claim that letting it in is "selling your SEO soul" or that it is in some way illegitimate. It even identifies itself, distinctly from the volunteer version, from what I can discern based on your blog post.

If it is surreptitiously or nefariously installed on the machines that run it, then that is a problem, but an entirely separate issue from your apparent complaints.

IncrediBILL said...

My competitors can't crawl my site, they aren't allowed. Nor is much of anyone beyond the top 4 search engines.

It's also unlikely my competitors can manually wade through hundreds of thousands of pages without automation.

Therefore, my competitors can't get that pretty compiled report with information at their fingertips if this bot is blocked.

Besides, you miss the point that people are being asked to run this distributed crawler for the noble cause of contributing to some search engine, not selling the results out to snooping competition.

If the SEO reports were the original goal, I doubt very many would volunteer to run the distributed crawler without some form of compensation for helping generate the raw data needed for the commercial results.

GaryK said...

I'm so glad someone has the balls to speak out against Majestic. Keep up the great work Bill.

alexc said...

Hello,

I am the MJ12bot creator and founder of the project. I don't really think we deserved the harsh words posted above and let me explain why.

We respect people's decision not to let us crawl their sites - we obey robots.txt very well and support stuff like Crawl-Delay that many crawlers don't. However I think that while some might not approve of what we do and how we do it, it is a bit unfair to label us "completely parasitic" because we are not. We also want webmasters to benefit from the information we have - that's why Majestic-SEO gives free reports to verified sites, just like Google does. I guess Google, which makes billions on your data, is not parasitic, and we, who are trying to create an alternative, are? Of course you'd say that Google gives you traffic, but how can we generate good traffic to your sites if we can't (at the moment) rank results as well as Google?

We are building a search engine and we had 1 bln pages indexed in 2006. Problem was that as more pages were added we were getting less and less relevant results. This was because we did not and could not (at the time) analyse web graph and pick the best pages that need to be crawled and indexed and then ranked depending on anchor text, backlinks etc. So we had to create anchor index - note that Majestic-SEO was launched almost 4 years since we started the project - that was certainly not the original goal, but a stepping stone in survival to actually achieve our original goal.

We simply can't go further without learning how to rank pages better, for this we needed Anchor Index and now that it is created (and using most of available hardware) we need to earn money to buy more hardware for full text index.

Do you know how much hardware it takes to crawl and analyse graph of the web? Lots. So far around 50 TB (highly compressed data) used just for the anchor index (web graph) and we simply can't full-text index all data because we would need a lot more resources. UK is not USA and we don't get here millions of dollars of VC money for stupid but easy to implement ideas (ours is very complex and I don't think stupid), so we have to earn it. I just got back from viewing local data center and was quoted £1300 per month for rack for 16A power, and £2450 for 32A - that does not even count cost of servers to fill up that rack. If Google was still indexing 4 bln pages (when the project started that had that many) we would not need all that, but they moved to 30-40 bln and to actually pick that many for indexing we need to crawl and analyse more of the other (often useless) pages.

Competitor who might want information about _your_ site won't be interested in your internal links (and this is the only thing you protect by disallowing crawler), in fact only external backlinks will be analysed. At the moment Yahoo's Site Explorer can be used or by digging Google. By blocking our bot you are not achieving anything apart from losing chance to analyse your own backlinks for free (as long as you allow MJ12bot) on our site.

Anyone who actually follows our project carefully would see that this strategy was discussed with our project members regularly - we are NOT selling this information behind people's backs - check out the forum and you will see easy proof of it. Of course we will share profits with project members - 20% of them, in fact we go much further than that - project members (and anyone can join) will become actual shareholders so that they would benefit from the value of the company. Just how many companies do anything like this? Did YouTube share money after Google's acquisition with the people who actually made it happen by uploading their videos? No, they didn't! But we would! Does this sound like "parasitic" behavior? I don't think so - I think (and obviously I am biased in that) that we are acting well above board, doing things that are right rather than what commercially makes sense. So there are a fair few things we simply don't do - like re-selling crawled content (full-text or HTML), unlike some other search engines - we simply don't do it and won't do it. But what we can't do is continue trying to beat Google without having any revenues whatsoever - it was a difficult choice on what exactly to do and it seemed that by creating Anchor Index we will achieve objectives of being more relevant and also earning money to further the project.

IncrediBILL has got his own unique view on banning bots that I do not agree with and we had a number of arguments about that, but I respect his decisions - I just think it would be fair to avoid jumping to conclusions and spend some time researching the subject before making harsh statements.

regards,

AlexC

P.S. All our bots are clearly self-identified, we obey robots.txt and in no way we allow or encourage crawler installations on computers not owned by the person who runs it.

IncrediBILL said...

Welcome Alex, and I didn't think the discussion was too harsh, but sorry you felt that way.

In reviewing your site again I'll correct myself in that you do have a couple of links to your SEO site from the search engine site. Of course that means nothing to people that haven't visited your site in over a year and have no clue that their data is being repurposed.

Did you send an email to every site you indexed telling them of this change so they could opt-out?

I didn't think so.

Now let me explain a SYMBIOTIC vs PARASITIC relationship.

SYMBIOTIC: Google, Yahoo and Live all give me something in return called traffic which I'm able to monetize. Google and Yahoo also have revenue sharing programs (AdSense and YPN) that send me fat checks every month for letting them crawl my site.

PARASITIC: Takes and gives nothing in return. Any crawler that can't show me the money or the traffic is one I'm showing the door.

Free reports? Already get free reports, don't need any more free reports.

Shareholders? I have lots of stock in very viable internet startups that went south so I'm well stocked for internet toilet paper, don't need any more of that.

Well Identified Bot? How? There's no reverse -> forward round trip DNS checking to prove the bot is who it claims to be, just the user agent which even a virus used so that's completely unreliable. All of the established search engines, even many of the new ones, now use round trip DNS to stop the spoofing so you can argue all you want but that's one you have to chalk up in the LOSE column no matter what you say so let's move past it.
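
The reverse -> forward round trip DNS check mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `verify_crawler_ip`, the `allowed_suffixes` parameter, and the injectable lookup arguments are hypothetical and not taken from any search engine's documentation. The standard library forward lookup used here, `socket.gethostbyname_ex`, handles IPv4 only.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return True if ip passes a reverse -> forward DNS round trip.

    allowed_suffixes is a hypothetical list such as [".example-bot.test"],
    i.e. the domains a crawler operator says its hosts resolve under.
    """
    try:
        # Step 1: reverse (PTR) lookup gives the hostname registered for the IP.
        hostname = reverse_lookup(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record, so nothing to verify against

    # Step 2: the hostname must sit under a domain the operator controls.
    if not any(hostname.endswith(suffix.lower()) for suffix in allowed_suffixes):
        return False

    try:
        # Step 3: forward lookup must lead back to the very same IP,
        # otherwise anyone could fake the PTR record for their own address.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

A user agent string alone never reaches step 1, which is the point being made: without published hostnames to check against, there is nothing to put in `allowed_suffixes`.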

You said:
"Competitor who might want information about _your_ site won't be interested in your internal links"

That's incorrect because competitors *DO* want internal links and anchor text because the way the site is internally architected is just as important to SEO as external links and anchor text. Perhaps you should experiment with your competition and see how pages are actually ranked before you go any further as that statement showed ignorance about SEO in general.

Lastly, nobody cares how much money you're spending on hardware or that you can't convince a VC to invest. Those are your problems, not ours.

Perhaps had you focused on a narrow niche, where you built a small full text index for a specific industry or segment of the web, instead of making the big fat crawl, you would already have a viable search engine, could have worked out how to return relevant results for that and would have a completely viable service to sell already, poised to expand.

Unfortunately, it appears you bit off more than you can chew and now are suffering the consequences of trying to do too much too soon, and we're all tired of waiting.

End of story.

IncrediBILL said...

One last thought I forgot to add to the previous post:

To be honest, most of us would like to see you be successful which is why we initially allowed you to crawl our sites in the first place.

I was serious when I said that you should scale back and focus on a niche area and run with it because you could do a much better and more thorough job than the general search engines and expand from there.

alexc said...

I will try to keep it short.

First of all we do not provide internal anchor text/backlink analysis. Only external backlinks are analysed/shown.

"Did you send an email to every site you indexed telling them of this change so they could opt-out?"

That would be email spamming and we don't do it. We don't even know emails of every indexed site - we are not email harvester you know, all emails we find on pages are dropped in analysis - we don't even save them. I don't think you really thought through this suggestion.

Indeed we are not sending traffic yet to your sites but we are working hard to solve it - my personal firm belief is that unless we reach Google's level of relevancy (or better) we won't have a good chance - Google's growth in their index (as well as frequency of updates) forced us to rethink a number of things.

I think it is important to keep in perspective that we have been working on this for almost 4 years now (next month it will be 4). We started offering Majestic-SEO competitor analysis only in July 2008 - I think this clearly demonstrates that it was not our original intention to do anchor index - 4 years is a very long time, most companies go bust during this time and we were also forced to think how we can keep R&D for the next 4 or probably more years. In USA people can get crazy VC money, but not in the UK and it is not in my nature to tell people bull**** in order to get money. So we are forced to earn it doing something with the data/software we created in order to survive.

Scaling down is something that I considered. But I am not interested in making 1 bln full text index that only searches for some vertical information - even in this case we need to master backlinks and anchor text analysis as it is the key to relevancy.

I appreciate you don't care about endless problems that need to be resolved in projects like this and the fact that we need to earn money to actually just survive (let alone buy a lot of hardware) is not important to you.

Let me just explain what's important for us: we obey robots.txt, that's very important for us, recently we added another layer designed to make this support more robust. When our bot sees Crawl-Delay used for some other bot it will assume a courtesy delay of 5 seconds because we think that a webmaster that used Crawl-Delay (even not for our bot) is probably sensitive about crawling, so we try to moderate ourselves. We also now choose fewer urls to crawl from sites and we do better session id filtering than before. When Majestic-SEO was created we also thought of webmasters to offer them an alternative to Google Webmaster Central - this is also free for your own sites, why do we do it? Because we wanted to give something back to people - webmasters who do allow us to crawl their sites, we don't yet have the ability to full-text index everything but at least at that point we had something that can help webmasters and we offered that.
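
The courtesy-delay rule described above could look something like the sketch below. It is not MJ12bot's actual code: the function name `effective_crawl_delay`, the simplified robots.txt parsing, and the precedence order (own group, then `*`, then courtesy delay) are all assumptions made for illustration. Only the 5 second figure comes from the comment itself.

```python
COURTESY_DELAY = 5.0  # seconds, the figure quoted in the comment above


def effective_crawl_delay(robots_txt, bot_name, default_delay=0.0):
    """Pick a crawl delay for bot_name from the text of a robots.txt file.

    Assumed precedence: a Crawl-delay in the bot's own group, then one in
    the "*" group, then the courtesy delay if any other bot has one.
    """
    name = bot_name.lower()
    specific = wildcard = None
    other_has_delay = False
    current_agents = []
    in_rules = False  # True once the current group has listed a directive

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after directives starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            continue
        in_rules = True
        if field != "crawl-delay":
            continue
        try:
            delay = float(value)
        except ValueError:
            continue  # ignore malformed delays
        if any(a and a != "*" and a in name for a in current_agents):
            specific = delay
        elif "*" in current_agents:
            wildcard = delay
        else:
            other_has_delay = True

    if specific is not None:
        return specific
    if wildcard is not None:
        return wildcard
    return COURTESY_DELAY if other_has_delay else default_delay
```

For example, a robots.txt that sets `Crawl-delay: 10` only for a hypothetical `Slurp` group would make this sketch return 5.0 for `MJ12bot`.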

Also there is no "dirty secret" - our Majestic-SEO site is clearly affiliated with Majestic-12 project, and there is a link from MJ12bot page to it. This, yet again, shows that we are open and not trying to hide - if we were indeed "parasitic" bad people we'd hide that affiliation and you'd never know. That's what some other search engines do when they resell your data - you don't bash their "dirty secrets" because they are actually secrets, ie done not in an open way.

We are already gaining better understanding of relevancy with our anchor index - this allows us to crawl smarter, and it will allow us to actually understand how to rank better - the key to success of any serious search engine, unlike many others we (I think) will actually have a chance.

As I said we respect robots.txt, so we will naturally respect your decision to block our bot from crawling. What I think however is important is to see the whole story and make a slightly more informed decision that won't just be based on "dirty secret" and "parasitic" phrases that do not explain nearly 4 years of work on our very hard to implement project.

alexc said...

Oh one more last thing (I promise). Ask yourself, how did I find this page in the first place? Am I a fan of Bill? Did I randomly get here through some blog roll thingy? No.

I found this page by looking at backlinks shown by Yahoo Site Explorer. I was doing competitors analysis from that search engine - it allowed me to find this page. I mainly use it to see how we can improve to their level in order to be finding new pages and crawling them as fast as them. It's a complex research work, but I find it particularly ironic that with all this bashing of Majestic-SEO you are ignoring Yahoo Site Explorer that actually allowed me to find this page!

IncrediBILL said...

Google's index growth is irrelevant in terms of what you're doing because you aren't Google, or even close, or near Yahoo or Live for that matter.

You only need the most popular pages to start returning relevant results and go from there.

Maybe you would rather tell me how well you respect robots.txt instead.

Hell, I could grab a copy of nutch or heritrix and be in the search engine game at some level in very little time and it would also respect robots.txt.

BTW, that crazy VC money is in the UK and Canada too and it requires something called a viable business plan, not BS, which is what they would call 4 years without a revenue stream.

If your plan, as you implied, is to keep afloat another 4 years going down this same path, the search industry will have changed so much by the time you get something working that it'll be completely irrelevant.

Besides, I never said you weren't working hard, lots of people work hard and fail. I merely suggested you forget that Google grail and work on something more attainable that can provide value today.

alexc said...

> Maybe you would rather tell me
> how well you respect robots.txt
> instead.

I think we respect it very well. Do you have any evidence to the contrary? We support crawl-delay and we also support the recommendation to stay away from sites that return 403 Forbidden to robots.txt requests, and we even go further - some other hard server errors (like 500) on such requests would keep our bot away. We are easily accessible by email on the bots page (you can see our email in very big letters) and I am happy to say that we are getting very few MJ12bot related emails - the ratio is probably one email per 3 bln crawled urls, and not every email is even about a robots.txt issue. So I think we are pretty good about it.
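
The status-code policy described above might be sketched like this. The function name `may_crawl_site` is hypothetical, and treating every 5xx response as a "hard server error" is an assumption, since the comment only names 500 as an example. Treating other responses (including 404) as "no restrictions" is likewise a common crawler convention assumed here, not something stated in the comment.

```python
def may_crawl_site(robots_status):
    """Decide whether to crawl a site at all, given the HTTP status code
    returned for its /robots.txt request."""
    if robots_status == 403:
        # Forbidden on robots.txt is read as "robots are not welcome here".
        return False
    if 500 <= robots_status <= 599:
        # Assumption: any 5xx counts as a hard server error; stay away.
        return False
    # Anything else: go on to parse robots.txt (or crawl if there is none).
    return True
```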

-----------

Google's index growth is very relevant - the speed and depth of crawling of some (but not all sites) comes from their very good knowledge of the web graph. When we crawled first few billions of pages I had them indexed - there was lots of junk, spam, and just not very valuable pages. The problem is how to measure this algorithmically - there is no other good scalable way to do that but to analyse web graph. That's the main motivation for the anchor index. Now using that index I expect to identify actual good pages that are worthy for inclusion into full-text index.
Problem for this is that really you need to crawl 15-20 times more data to find worthy candidates. Another problem is to know whether we actually crawl about the same stuff as Google or we are stuck in some bad spammy sites. It is very difficult to actually run large scale web crawl because we have to make decisions on which pages and how many from each site to crawl with many billions more urls available for crawling that we have to avoid crawling. That's very hard to decide and anchor index right now helps us do it.

Sure you can grab Nutch and be in this "game", but you will soon discover the very same problems we discovered and how will you try to solve it? The solution comes from web graph analysis but it was very difficult to actually build search engine that can allow to deal with hundreds of billions of urls and backlinks. We've been working on solving this for the last 18 months, and also on trying to ensure we actually crawl quality sites - this is very hard to define algorithmically.

Say look at our anchor index quality research - notice that the first builds of the index that we started tracking for quality date back at least a year, but in reality work started on solving scalability problems at start of 2007. Look at numbers and notice that quality of index was growing as it helped detect bugs in software and focus on more important sites - we have not started selling competitive information until July 2008, which is almost 4 years since start of the project.

Bill, I am interested in working on things that I do right now - these problems certainly have higher complexity than monetary returns, but I like intellectual challenge and I want to continue this work. We can't however do it indefinitely without any revenues so something had to be decided and it was deemed (suggested by me but agreed with the community that supports our project) that the path we took is the right one.

I am not interested in talking to VCs because they would force the project to be much more commercial - right now what we do is not dirty and certainly not a secret, so your blog post is factually incorrect.

You might consider working on something that targets the Google holy grail an insane thing - fair enough, nobody forces you to do so. For me however it is work of great interest even though at times I have to deal with things like this that do not help us much.

I don't want to argue more about it - I think any readers who manage to read so far will have enough food for thought to decide who is right here.

And just to cover one thing very clearly:

"Think they share the profits with all those volunteers running the distributed crawlers?"

Yes we absolutely will, our project members over time have developed a substantial amount of trust towards what I do personally as they can see how the work progresses regularly, we have not had any profits to distribute in the last 4 years but Majestic-SEO helps us now earn and move closer towards our objectives without having to beg for money from VCs. I will be back to post here when we do the first pay out - it will be this year.

IncrediBILL said...

You have some real funny ideas about VC because good ideas will have VCs fighting over who gets the investment.

It's the other ideas that have to beg.

Hope you hit pay day eventually but we all know the highway to hell is paved with good intentions.

alexc said...

I do have first hand experience with VCs and also I do have business education - I know the questions those guys will be asking, and while I can answer them successfully now it is only because there is Majestic-SEO around - otherwise it would be a chicken-egg situation: no traffic to search means it can't be ad supported, but most importantly I don't believe a new search engine will be successful unless it is at least as good as Google in terms of relevancy. Majestic-SEO is what allows us to understand relevancy better and channel this knowledge into building decent full-text search, it should also help fund it - if you think this approach is "too commercial" then you sure won't like what VCs would have pushed for.

> we all know the highway to hell is
> paved with good intentions.

That is true Bill, which is why it is important to look at previous road that was walked on - for us it is almost 4 years of work, and Majestic-SEO is a logical step forward - without this backlinks index we would not have improved our crawling and we would certainly have no chance in being relevant - if you look at search engines like Cuil you will see that they have problem ranking sites well, that's the same problem we had - too much content, you need very good understanding of webgraph and anchor text before having a decent chance to rank well and fast.

I'd personally prefer not to worry about things like revenue, marketing etc - sadly if we don't do it then we will certainly fail, and I don't think it would be justified after all that effort - interesting that just yesterday new backlink was found pointing to majesticseo.com - it was found in our daily crawl from your blog home page - pretty amazing really :)

IncrediBILL said...

So how fresh is your index that you sell to SEOs?

How can it compete with what Yahoo's giving away free in terms of being up-to-date?

alexc said...

> So how fresh is your index that you sell to SEOs?

We are not as fresh as we'd like - current full index merge is up to May 2008, however whenever a domain is added (say verified for free just like in GWT) then any newly found backlinks in daily crawl will be shown. We use this primarily to analyse whether our crawl is improved (and it did big time this summer). New full index update is just in the works right now, it should finish near end of the month and we will be doing this more frequently - probably once every 2 months or maybe even once every month.

Next month we will also switch to regular recrawls of important sites, newly found backlinks will be fed via daily updates so at that point I hope we will become pretty quick at finding new links on popular sites.

This is not just for SEO - this (and many other) operations are important for relevancy purposes - anchor index allows us to model this behavior and compare it with G/Y, and see how we can improve our crawling. It is much easier to see what to improve when operating with backlinks - Google is very good when it comes to up to date indexing - picking up new info, so good that without it is very difficult (if possible) to compete with them, so usage of anchor index helps us solve this complex problem as well.

> How can it compete with what
> Yahoo's giving away free in terms
> of being up-to-date

Yahoo is indeed pretty fresh - however they are limited to 1000 urls and they don't show the best backlinks (Google shows even worse quality and quantity). More importantly with our focus on analysis of what affects relevancy (which was the primary motivation for the anchor index) we will be able to calculate metrics that Yahoo won't show - for example real weight of backlinks, checking which ones come from bad neighbourhoods etc.

Thing is - some backlink may come from domain that is actually heavily backlinked from dodgy places, or spammy backlinks - this is the kind of information that helps determine relevancy and we will be able to calculate things like this, I don't think Yahoo and especially Google will ever want to make anything like this available.

Sorry for too much text - this is pretty complex topic to explain in a few words :(

Anonymous said...

> My competitors can't crawl my site, they aren't allowed. Nor is much of anyone beyond the top 4 search engines.

> It's also unlikely my competitors can manually wade through hundreds of thousands of pages without automation.

No, they just get Yahoo or Google to do so for them.

> Therefore, my competitors can't get that pretty compiled report with information at their fingertips if this bot is blocked.

Not from Majestic, but from Yahoo?

> Besides, you miss the point

NO. I do not miss any point and you will desist from insulting me in public!

Your competitors seeing what's in full public view simply is not evil in any way, shape, or form. Are you so paranoid that you'd also object if you had a bricks-and-mortar presence and one of them drove by and saw the exterior of your place? Or even browsed inside like a customer could? What about it being visible to them on Google Earth and Google Street View?

I just don't get reacting so strongly to people accessing information that has been made public as opposed to, say, confidential trade secrets!

IncrediBILL said...

You still miss the point, but I wouldn't expect anything less.

Just because information is made publicly available doesn't mean it's available to be used however anyone wants.

For instance, Getty and Corbis publicly post images but if you republish that image, your ass will be sued.

If I opt to allow some site to crawl, or block them, it's my business either way, not yours.

Now go crawl back under that little anonymous rock of yours and have a nice steaming hot cup of STFU.