Saturday, March 04, 2006

FIRST SIGHTING: Sproose Goose got Plucked

Yet another Silicon Valley startup search engine called Sproose came crawling this morning, tagged as sproose/0.1-alpha and using Nutch. Their site claims they have seed funding from VCs, as reported elsewhere, but you can't run any searches yet as they are currently building their Knowledge Rank™, which sure sounds a lot like PageRank, huh?

The first knowledge they gained when they hit my site was that they don't rank high enough to crawl my content, and they got the door automatically slammed in their faces as an unauthorized bot. Sorry boys, robots.txt is so '90s; we use razor wire around the compound to keep people out these days.
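
(For the curious: conceptually the gate is just a whitelist check. Here's a rough Python sketch with made-up lists, not my actual blocker, which does a lot more.)

    # Sketch of a whitelist-only bot gate (illustrative only).
    APPROVED_CRAWLERS = ("googlebot", "slurp", "msnbot")  # hypothetical short list
    BOT_HINTS = ("bot", "crawler", "spider", "nutch")

    def allow(user_agent: str) -> bool:
        ua = user_agent.lower()
        if any(name in ua for name in APPROVED_CRAWLERS):
            return True   # whitelisted search engines get in
        if any(hint in ua for hint in BOT_HINTS):
            return False  # unauthorized bot: slam the door (serve a 403)
        return True       # looks like a browser; other checks would apply

    print(allow("sproose/0.1-alpha (Nutch)"))  # False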

You may have seed capital, but I require being wined and dined before you may crawl my 40K pages, or just email me with a PLEASE, because this entitlement mentality of crawling every site and running up our costs just because you've been funded is BULLSHIT.

BTW, the people who wrote NUTCH should be hauled out in front of a firing squad and shot, as I'm seeing more and more crawling from their little engine that couldn't, constantly bouncing off my site.

22 comments:

Anonymous said...

What's so bad about Nutch?

Seems like a good search engine platform to me...

Please, can you give some details?

IncrediBILL said...

Nothing is really wrong with Nutch, except that every scraper and his brother are aiming it at websites, sucking down bandwidth and chewing up CPU resources on servers that don't belong to them.

Basically, it's a very nice tool that gives everyone a better sledgehammer to scrape with, which just makes my bot blocker more needed than ever.

True, it will read robots.txt, but being open source, how hard do you think it is to eliminate that little hindrance to scraping?

IMO, Nutch is like a handgun on the internet; you should need a license to use the damn thing!

Anonymous said...

Hey dude, here's a tip to save you all the trouble of outsiders "sucking down bandwidth," "chewing up CPU," and all that security and monitoring - TAKE YOUR SERVER OFF THE PUBLIC NETWORK.

Anonymous said...

Ummm actually yeah it is....

Anonymous said...

yea dude, you suck. take your network off the net or get off the pot.

IncrediBILL said...

Sorry you morons feel that way, but websites are for people, not leeches with crawlers running up bandwidth.

They are being blocked, they are being stopped, and this will not be a problem for anyone wanting it stopped much longer.

Anonymous said...

I went to that site, Sproose. It does not look like an authentic attempt at making something real. I bet it's one guy playing with Nutch who chose your domains as a test mule and let her rip. If I had venture funding, the first thing I'd do is spend some money on a better design.

Anonymous said...

I think you're being a grumpy webmaster. Search engines provide a valuable public resource. Without them we'd be back in '93, when we had to search through 99 pages of junk just to find one speck of gold. And they do this for free. Google could have made a mint back in the early days by charging for search, but they didn't, because it went against their values.

Do spiders incur resource hits on web servers? Yes, slightly. But they mostly stick to text, and the main draw on web servers is the images and multimedia.

Moreover, most web sites want to be found. As long as a spider obeys robots.txt and page metadata (and I don't know of any search engine that doesn't, certainly no reputable ones), and as long as the spider uses a reasonably selective algorithm to keep from indexing the same page over and over again, I don't think that spider is doing anything wrong. You can't fault a search engine just because it doesn't have market share. It may not be big now, but it may be the up-and-coming search engine of the future, or it may live on just satisfying a niche. Remember, Google was once an upstart search engine funded by VC capital too (and Excite, Lycos, WebCrawler, and Infoseek used to be major players). It doesn't matter; this stuff changes over time. The spider is trying to do something productive, and the costs associated with it are incidental (and zero if you block it with robots.txt).
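
(Obeying robots.txt costs a crawler author almost nothing, by the way; Python even ships a parser for it in the standard library. A sketch, using a hypothetical site:)

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")  # hypothetical site
    rp.read()

    # A polite spider asks before every single fetch.
    page = "http://www.example.com/some/page.html"
    if rp.can_fetch("sproose/0.1-alpha", page):
        print("allowed; go ahead and crawl")
    else:
        print("disallowed; skip it")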

Over time, as bandwidth, CPU cycles, and storage become cheaper and cheaper, it'll be easier for upstarts to jump into the search engine game. At that point we may see such a proliferation of search engines that they just become redundant. It'll shift (back) to a game of attracting eyeballs, rather than a competition of algorithms, speed, breadth of database, and frequency of crawls. And given time, I expect the superfluous search engines will just die down and fade away, not because of the technological burden of maintaining them, just because there's no point in having them. But by the time bandwidth becomes cheap enough for search engines to spread out of control, bandwidth will be cheap enough on the web server side that it won't matter whether you're indexed by one search engine or by fifty. The costs will be negligible. It'll always be cheaper for the web server hosting the page than for the search engine doing the crawl.

Personally I think you just like touting your bot-blocking technologies.

IncrediBILL said...

Very nicely stated, but we aren't talking about legitimate spiders most of the time, and what they use the data for has ZERO benefit to my website.

I let the five search engines that send me traffic into the site; everything else gets bounced.

Most of what I'm blocking looks like humans surfing the internet: they do not look like bots, they do not use robots.txt, and they take whatever they want and do whatever they want with it.

They are not humans, as humans don't ask for 400 pages in 10 seconds or anything silly like that.
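
(That rate test alone is a few lines of code. A rough sketch with made-up thresholds, not my actual blocker:)

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10  # made-up window
    MAX_REQUESTS = 50    # made-up cap; no human pulls 50+ pages in 10 seconds

    recent = defaultdict(deque)  # IP address -> timestamps of recent requests

    def looks_like_a_scraper(ip: str) -> bool:
        now = time.time()
        q = recent[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()               # drop hits that fell out of the window
        return len(q) > MAX_REQUESTS  # 400 pages in 10 seconds trips this easily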

Do some research on SCRAPERS and you'll be able to discuss this more logically.

Anonymous said...

I'd feel honored if my site were crawled by search engines. - Roboo

IncrediBILL said...

Well, I'd feel honored if you got the point that these aren't real search engines crawling.

Anonymous said...

The problem is, idiots disable the script that SHOULD follow robots.txt and then collect emails for spamming. Not only that, they DO try to index a zillion pages in a minute, bringing a server to its knees. So a few scumbags make it bad for the good ones. For those that DO these cutesy tactics, you need a truckload of horse shit delivered to your kids' sandbox.

Anonymous said...

Do you really think these idiots need Nutch to do their scraping? Writing programs that follow HTML links is really simple. Have you ever done a tree traversal? From what I know, Nutch's interesting capabilities are for search. The software does not even support distributed crawling; it's mostly irrelevant for a search engine (ref. http://wiki.apache.org/nutch/FAQ#head-a76c464e63a6bfb80ace2044202fbe14c1b9cda5).
Talk about a sledgehammer...
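
(To make that concrete: here is roughly what a naive link-follower looks like with nothing but the Python standard library. It's a sketch, not anything Nutch actually does, and the page limit is an arbitrary safety cap.)

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=10):
        seen, queue = set(), [start_url]
        while queue and len(seen) < limit:
            url = queue.pop(0)          # breadth-first traversal of the link tree
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", "replace")
            except Exception:
                continue                # dead or unreadable link; move on
            parser = LinkParser()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return seen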

In short: the problem for a search engine is not crawling, but searching.

Do you really think that with a tool like Nutch everyone will start crawling the entire Web to build their own home-made Google search page? How would that be possible without a cluster of thousands of servers at their disposal?

Please don't mix up scrapers and search engines.

IncrediBILL said...

Did I say traversing the web was hard?

Now define scraper: anyone that scrapes content for profit, including the search engines, is technically a scraper. The difference is that there are beneficial scrapers and those we don't want on our site.

Now let's discuss NUTCH-based scrapers.

Most don't even change the user agent, so we know who they are [red flag], and they crawl millions of pages that don't even have relevance to their niche, which is DAMNED annoying.

It doesn't need to be millions of Nutch users; it could be just hundreds and still be annoying.

Don't worry, they aren't all using Nutch; the other half is using Heritrix, or whatever the hell it's called, from the WayBack Machine.

Either way, don't need 'em, don't want 'em.

Just because you build it doesn't mean it has to be allowed to crawl.

Anonymous said...

By the way, does anyone here actually know what the Sproose Goose is without using Google (or some Nutch-based SE)? :)

Anonymous said...

"Sub-optimal" indeed... You want all of the benefits and none of the problems entailed by freedom. Read some literature (how about Michel Foucault's "Discipline and Punish" as a wake-up call?) and get an education outside computers.

- Written by someone who's been in computers since 1968 and was probably the first kid his age [possibly of any age] to implement a Latin-to-English translator in IBM 360 assembler.

Anonymous said...

The only thing you show with your answers is your arrogance and presumptuousness. Do you have no good arguments, that you need to insult most of the people who disagree with you?

If you spent the time you waste hunting bots on a real job, you could probably afford the extra bandwidth you seem to need...

Anonymous said...

You've just proved my point.

I was hoping to read intelligent arguments and real data as to how much bandwidth the evil Nutch users might consume from your valuable connection and how to avoid it... but that's probably asking too much from your blog. Luckily there are plenty of intelligent and educated people who can do it for you.

Goodbye; I don't think I'll be coming back to your wonderful and most interesting blog. I don't like wasting my valuable time.

Anonymous said...

In my opinion, all bots that are not connected to a major search engine should be blocked; that's why I am also writing blocking software.

The DMCA protects copyright holders, and we can reserve the right for our content to be displayed only to users and not allow bots to copy it.

US law also makes it illegal to use a trick or false information to gain access to a computer system. Pretending to be a human by faking a user agent to get past a ban is a violation of this law.

Just because something is on the net does not make it public domain, and copyright holders can limit you to the right to view, not to copy or store.

Scrapers scan Google for the sites with top listings, then copy your content to make a duplicate site and make money off the ads. I have even seen ones that display a copy of your site when Google comes by, so it looks just like your site in the listings, but when a user clicks on it he sees another site without your content.

It is theft!

Anonymous said...

Just a quick note that many firewalls let you set a connection threshold and temporarily ban an IP address that is making too many connections.

Anonymous said...

"Google could have made a mint back in the early days by charging for search but they didn't, because it went against their values."

ROFLMAO!
By the time Google came along there were plenty of free search engines (remember AltaVista? Lycos?).

Google won out because they built the best way to monetize search without annoying people (Google ads).

Maybe we should lighten up on Nutch, but if anyone disables robots.txt in their crawler, yeah, that's real sleazy.

Anonymous said...

Come on Bill, tell Amazing n co what you really think ...

- I'm here after server load issues, maybe coz of bad spiders/bots, maybe coz of referrer spam (tried blocking by words/phrases in the URL, but the words n phrases I'm seeing keep ballooning way beyond "holdem").
Seen your recommendation re Referrer Karma, and checked this: will try.
Likewise will try the AlexK file vs bots.
Hoping I can set em up correctly.

I write stuff and take photos; and just coz I put them on the net, my words and images (and those of others posting to my sites) aren't for simply stealing.
Nor do I like having my bandwidth used excessively and my server load increased by selfish folk.

But there you go, it takes all sorts; maybe some others here leave the doors to their homes open when they're out, so people can wander in, use electricity, and take whatever they wish.