Sunday, April 08, 2007

Webaroo's Content Stealing PulseBot Flatlined

If you've never seen Webaroo before, it's a service where the concept of copyright has obviously been completely glossed over.

Here's what it says on their website:

Webaroo servers crawl the web, analyze web pages and automatically select the subset of pages with the greatest diversity and quality in the least storage size. These pages are then packaged into topic-specific "Web Packs" that can be downloaded by users onto their devices. Once downloaded, users can search and browse that content on the go.
Here's an English to English translation:
Webaroo takes whatever copyrighted content of yours we want and repackages it for our customers without permission. Of course we do it without permission because nobody knows about Webaroo in the first place so they won't stop us or the many bot names. Isn't it cool how we're going to steal your shit and pack it up so others can download it and now they don't even need to bother visiting your website? Wicked!
Look at the total number of bot names coming from their crawler's IP address.
64.124.122.228 "WebarooBot (Webaroo Bot; http://64.124.122.252/feedback.html)"

64.124.122.228 "PiyushBot (Piyush Web Miner; http://piyush.com/feedback.html)"

64.124.122.228 "RufusBot (Rufus Web Miner; http://www.webaroo.com/rooSiteOwners.html)"

64.124.122.228 "RufusBot (Rufus Web Miner; http://64.124.122.252/feedback.html)"

64.124.122.228 "SumeetBot (Sumeet Bot; http://64.124.122.252/feedback.html)"

64.124.122.228 "PsBot (PsBot; http://64.124.122.252/feedback.html)"

64.124.122.228 "pulseBot (pulse Web Miner)"
Hell, if you were trying to stop them using robots.txt it's a lost cause as the bot names seem to get changed faster than a baby's diaper.
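
To make the robots.txt problem concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content is hypothetical (it is not any real site's file), and it only shows how name-based rules are matched, not what Webaroo's crawler actually honors. A rule that names two of the bots above does nothing against a third name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans two of the bot names listed above.
ROBOTS_TXT = """\
User-agent: WebarooBot
Disallow: /

User-agent: RufusBot
Disallow: /
"""

def allowed_agents(robots_txt, agents, url="/index.html"):
    """Return {agent: True/False} for whether robots.txt lets each agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

result = allowed_agents(ROBOTS_TXT, ["WebarooBot", "RufusBot", "pulseBot"])
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Every new name needs a new rule, and that only helps if the bot reads robots.txt at all.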

I would just block their range of IPs; it's more convenient.
Webaroo MFN-B843-64-124-122-224-27 (NET-64-124-122-224-1)
64.124.122.224 - 64.124.122.255
That's how you stop name changing bots, the firewall way.
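
The actual firewall rule depends on whatever firewall you run, so it isn't shown here. As a rough sketch of the range math only, Python's standard `ipaddress` module can turn the first and last address quoted above into CIDR form and test a visitor's address against it. The `is_blocked` helper is a made-up name for illustration:

```python
import ipaddress

# First and last address of the range quoted above.
FIRST = ipaddress.ip_address("64.124.122.224")
LAST = ipaddress.ip_address("64.124.122.255")

# Collapse the range into CIDR blocks; this one is a single /27.
BLOCKED_NETS = list(ipaddress.summarize_address_range(FIRST, LAST))

def is_blocked(remote_addr, blocked=BLOCKED_NETS):
    """Hypothetical helper: True if remote_addr falls inside any blocked network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in blocked)

print(BLOCKED_NETS)
print(is_blocked("64.124.122.228"))  # the crawler address from the log lines
print(is_blocked("64.124.123.1"))    # an address outside the range
```

The user agent never enters into it, which is the whole point: the bot can rename itself all day and still come from the same addresses.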

Package THAT into a topic-specific "Web Pack" and download it.

15 comments:

bull said...

Better deny the whole 64.124.0.0/15

cdman83 said...

I (yet again) would like to remind you that maybe you are taking this too far and making this too personal.

First of all, about your point of bots costing money: you are probably referring to other sites, because all the bandwidth for this blog is paid for by Google.

Second of all: there is a thing called copyright. If you are so nervous about "alternative consumption" of your content, the least you could do is put up a copyright notice on your blog (again, I'm talking about your blog since I don't know what other sites you have), which stipulates the conditions under which we can access your content (for example, there are a variety of pre-written licenses over at creativecommons.org).

Thirdly: if the bot really obeys the robots.txt, I see no reason for blocking IP addresses.

Finally: This service reproduces the whole page, does not steal content. It is an innovative idea and you start to sound like a luddite with all this complaining.

And a comment to your commenter: sure, block an entire ISP (that IP range is owned by the Texas based Metromedia Fiber Network), that will show them!

zCat said...

Good catch Bill, caught the "pulseBot" just now but didn't realise it was part of an extended Webaroo family.

@cdman83: According to Wikipedia, a "Luddite" is someone who makes information available via a website to genuine human users with a live connection to the website, and reserves all rights to restrict presentation of information in any manner they see fit. Including to "innovative" services who don't have the courtesy to ask if it's OK what they are doing. (Actually it doesn't say that, but as it's Wikipedia could be easily made to do so ;-).

Anonymous said...

Yep ... for about five seconds before someone changed it back...

IncrediBILL said...

cdman83,

Hell yes I take anything personally that impacts my income, especially when it involves using my own content against me.

If you only had a clue how the scrapings end up competing against the sites they scrape in the search engines, and it was impacting your income, you would be singing a different tune.

What part of "reproducing the whole page is outright theft and copyright infringement" don't you get?

Snippets are fair use, taking whole pages is not.

Besides, there's a big difference between protecting copyright and being a luddite, silly boy.

Also, what part of Webaroo's multiple bot names bypassed you?

For a bot to obey robots.txt you need to know the robot's name in advance, and when it's changing all the time, who in the hell knows what they do or don't honor.

And NO, I'm not discussing protecting this silly blog.

IncrediBILL said...

BTW cdman83, my site blocks EVERYTHING by default. I don't blacklist, I whitelist, so this stuff never reaches my actual content.

I just report this stuff so others know what's bouncing off my OPT-IN firewall.

Anonymous said...

Hank Aaron takes a mighty swing at Bill with the cl00bat:

http://www.dklevine.com/general/intellectual/against.htm

IncrediBILL said...

Nobody took a swing at me and you should really learn to read because his opening paragraph defeats the whole thesis.

Copyrights and patents give the original creator, who may have spent a large sum of money to bring the creation to light, protection over copycats that try to quickly and cheaply reverse engineer their technology without licensing it.

The IP protection doesn't stop innovation whatsoever. IP protection actually makes people think for themselves and find new ways to do the same things without stealing the ideas of others OR they can license existing work and then improve upon it.

Maybe I should steal a copy of your so-called "Hank Aaron"'s website and post it somewhere claiming the work was all mine and slap PPC and affiliate ads all over the pages and see if he doesn't shit a massive brick.

Anonymous said...

Maybe someday you'll get a C&D for thinking in English and told to pay up because eventually Disney managed to wangle themselves a monopoly on that, too, and you'll shit an entire brick house.

More likely, the system'll collapse under its own weight long before then.

And the opening paragraph you quoted is the official rationale that the authors would have proceeded to deconstruct and demolish if you'd bothered to read any further. Oh, but they're economists, learned men of great experience and wisdom, and as such they use lots of big words, so you couldn't read any further, at least not without returning to school and completing grade 5 first...

But I expect you'll sail along in your present state of (lack of) knowledge until eventually that C&D shows up. And it will -- more and more creators of "intellectual property" are getting dinged for not being completely original, and making up a whole new thing from scratch in a complete vacuum unconnected with reality. Game makers get sued for using the word "hobbit". Filmmakers because a 3-second snatch of song can be heard over the radio in one scene, and they didn't track down and pay protection to whichever record label "owns" the singer. Musicians because they go bada-baba-BING at some point and those exact same five drum beats have, shocker, been used before by some other drummer that happens to be "owned" by a different label. And so forth.

cdman83 said...

First of all I would like to ask everybody to stick to rational comments. This is a subject matter that many people (including myself) feel very strongly about, but let's try to have a civilized conversation.

First of all my point still stands that you do not disclose the license under which you publish this blog. Do you disclose it on the sites you are protecting?

Second of all, what is the financial burden created by bots who obey the robots.txt? (Of course robots which don't should be blocked) If a bot obeys the robots.txt and you exclude it because of the whitelisting approach you could exclude services like Krugle (they also made some mistakes - as they admit in the linked blog post - but they tried to remedy it as fast as possible - and for added bonus - they are using Nutch for crawling). Now I don't know if this applies to your sites (probably not), but the general idea is that someday someone could come up with an idea which could interest you, but you turn them away.

Thirdly: there is no such thing as Intellectual Property. There is copyright, trademark and industrial secrets. Listen to Richard Stallman.

Finally: in my opinion it is morally wrong to claim that you created something from nothing and you deserve all the money from it (what copyright does). We all are standing on the shoulders of giants, we should appreciate that and try to give something back (just to make the idea more concrete: did you donate / contribute back to Apache, PHP, Python, Linux, Bind, etc?)

IncrediBILL said...

Anonymous, that's just wishful thinking on your part as I've been the guy sending the C&D's and DMCA's to thieves, not the guy getting them.

Not like I'm such a genius that people have to steal from me but they're just lazy fucks that try to make bank off my work so fuck 'em, DMCA and C&D to the rescue.

FWIW, I license a lot of things already, including software, so I'm OK with the system as I know how to properly play in the sandbox.

It's people that think they should get something for nothing that have issues with copyrights and patents.

cdman83 said...

"It's people that think they should get something for nothing that have issues with copyrights and patents."

The people supporting idiotic patents and overextended copyright are the ones thinking that they should get something for nothing (does the fact that Disney built its success on public domain material and then started lobbying for perpetual copyright, or Amazon's one-click shopping, sound ethical?). Meanwhile very talented people who would like to do something because they have a passion for it (like the open source movement) are blocked by such lawyerese.

Of course there are a lot of people who are just thieves, but one should (imho) not assume from the start that someone is a thief, because "everybody is assumed innocent until proven guilty" is a great idea.

cdman83 said...

And one more thing about the "patent system":

All of them were built (including the US) on the idea of first not respecting the existing patents, building the nation up, then starting to bully others into respecting the new patents, the foundation of which were some stolen ideas if you subscribe to the mantra that ideas are property. Just another angle which shows that patents are hypocritical.

IncrediBILL said...

Oh well, I'm off to file a patent.

Enjoy.

Anonymous said...

Lol, good old days. PulseBot was mine, I forgot to turn on the flag to respect robots.txt, I was very new anyways, first job ;)