Thursday, June 15, 2006

Hanzo:web Social Archiving is Social Copyright Infringement

OK boys and girls, it's time to get pissed off as all notion of copyright and control of your site content has been tossed out the window as the fine folks over at hanzo:web ARCHIVE your site content on demand!

That's right, you click on their bookmarklet and TA DA! your page gets archived WITHOUT YOUR PERMISSION on someone else's server.

Here's the most priceless quote on their site:

Only you can save the Web!
So who's going to save the web from some bullshit like this?

Did you bother asking webmasters if they want their websites saved?

I don't want to be archived, I don't need to be saved, take your archiving toys and go fuck yourselves!

SPIT ALERT - PUT DOWN YOUR DRINK!

I just about wrecked a keyboard while sipping soda when I ran across this:

Respect for content

All archived pages, links and sites are stored exactly as they appeared on the web. Pictures, objects, links and flash are all retained as they are, preserved as originally conceived.

RESPECT FOR CONTENT?

Are you fucking kidding me?

Where's the respect for my fucking copyright?

You'll be archiving pages WITHOUT PERMISSION, possibly with someone's AdSense account embedded, and someone can sit on your site click frauding accounts to death or stealing content, while the webmaster can't even detect that anyone is accessing the pages via the archive.

When they "archive" your page it gets crawled by the following:
87.98.198.194 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"

inetnum: 87.98.198.192 - 87.98.198.207
netname: hanzoweb
descr: Hanzo Archives Ltd
Now look at this shit coming from their servers:
87.98.198.194 "GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
87.98.198.194 "GET /robots.txt" "Python-urllib/1.16"
87.98.198.194 "GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
87.98.198.194 "GET /robots.txt HTTP/1.0" "Python-urllib/1.16"
87.98.198.194 "GET /" "Python-urllib/2.4"
87.98.198.194 "GET /" "Python-urllib/2.4"
So it's looking at robots.txt but what user agent are they looking for?

I dug around on their site and didn't see it, so I have no clue what the Python-urllib is looking for in robots.txt, but it really doesn't matter because the FAQ page plainly states that they don't give a flying fuck about your robots.txt file, they'll archive it anyway no matter WHAT YOU SAY MR. WEBMASTER and make it private:
The original crawl was subject to restrictions by robots.txt. This means that any archived content will be marked as private for browsing by the person crawling it, therefore, unless its your own archive, you will not see this content.
Sounds to me, as a webmaster, they're saying "FUCK YOU!".

Well, I blocked your service, so this webmaster is replying in kind: "FUCK YOU!" No trespassing allowed.

This is a huge problem as people will be snapping copies of anything for any reason and you, the webmaster, will have no control over what Hanzo:web stores or displays nor what these people do with your content after the fact.

BTW, before people start flaming me that I should've "contacted" them to find out what they were looking for in the robots.txt file: if they were doing it right, the path to this information would've been in the user agent string, just like all the other sites do, or highlighted in the FAQ.
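For anyone wondering why the user agent token matters so much, here's a rough sketch using Python's standard urllib.robotparser. The robots.txt content, the "heritrix" token and the example.com URLs are my assumptions for illustration only; I have no idea what token Hanzo actually matches on, which is the whole problem.

```python
from urllib import robotparser

# Hypothetical robots.txt: blocks a crawler identifying as "heritrix"
# and keeps everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: heritrix
Disallow: /

User-agent: *
Disallow: /private/
"""

def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    # The rule written for "heritrix" applies to that token...
    print(allowed("heritrix", "http://example.com/"))
    # ...but a fetcher announcing itself as Python-urllib only hits the "*" rules.
    print(allowed("Python-urllib/1.16", "http://example.com/"))
```

In this sketch a rule written for one token does nothing against a fetcher announcing a different one, so unless the crawler documents its token, the only rule you can count on is the catch-all "*".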

Nice idea but your draconian implementation doesn't deserve a second chance and it's blocked, out of mind, not a problem for me anymore.

FWIW, my bot blocker already stopped them from getting anything in the first place, but I'm blocking their whole range of IPs just to make sure nothing slips through the cracks, like stealth crawling, as they have already demonstrated a complete lack of respect for everyone's website.
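If you want to do the same, here's a minimal sketch of a range check with Python's standard ipaddress module. The start and end addresses come straight from the whois output quoted above; the is_blocked helper is a name I made up, and how you hook it into your own firewall or bot blocker is up to you.

```python
import ipaddress

# Range from the whois record quoted above (netname: hanzoweb).
RANGE_START = ipaddress.ip_address("87.98.198.192")
RANGE_END = ipaddress.ip_address("87.98.198.207")

# Collapse the start/end pair into the CIDR network(s) that cover it.
BLOCKED_NETWORKS = list(
    ipaddress.summarize_address_range(RANGE_START, RANGE_END)
)

def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside the blocked range."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)

if __name__ == "__main__":
    print([str(net) for net in BLOCKED_NETWORKS])
    print(is_blocked("87.98.198.194"))  # the crawler IP from the logs
```

The sixteen addresses collapse to a single CIDR block, which is the form most firewalls want anyway.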

8 comments:

Anonymous said...

Bill, you're gonna have a heart attack one of these days. I'd hate to see that happen. I'd lose a valuable source of information. :)

IncrediBILL said...

Sometimes things get lost in translation as it looked like you were peddling their wares!

Sorry about that.

Anonymous said...

This is an interesting topic, as I do understand the desire that some people have to "preserve" a "snapshot" of a website in anticipation of the eventuality that interesting, important content will be moved, removed or flat out disappear. The archiving of information, especially digital information, is something that historians have already been wailing about. I think it's important to note the difference between these "digital historians" and the assholes who want to scrape your content and profit through click-fraud. If there were some way for them to display old content with permission, complete attribution and NO profiteering, things might be different. Funny thing is, in a hundred years nobody's going to remember our blogs anyway ;)

IncrediBILL said...

I think it's OK for them to archive sites willing to be archived, but taking it forcibly without permission violates copyright.

Anonymous said...

A. You don't claim copyright here. (There's not a single notice on the page)
B. What's the difference between this and offline viewing?

IncrediBILL said...

A) I had a copyright notice but it appears the damn template is messed up. It doesn't matter anyway, as copyright is implied the minute you create the work unless claimed otherwise, and I wasn't discussing my BLOG content

B) offline viewing wasn't the issue, someone else was making a public repository and...

C) The site I was protecting from Hanzo:web wasn't the blog at all, but that's irrelevant as they have no business copying what doesn't belong to them under any circumstances.

Anonymous said...

This is a public repository. There's no difference between an archiver and a saved web page.

IncrediBILL said...

Except the archiver is a service, one I don't wish to participate in, which is why (not on this blog) they were banned in the firewall.

Simple enough.