OK boys and girls, it's time to get pissed off as all notion of copyright and control of your site content has been tossed out the window as the fine folks over at hanzo:web ARCHIVE your site content on demand!
That's right, you click on their bookmarklet and TA DA! your page gets archived WITHOUT YOUR PERMISSION on someone else's server.
Here's the most priceless quote on their site:
Did you bother asking webmasters if they want their websites saved?
I don't want to be archived, I don't need to be saved, take your archiving toys and go fuck yourselves!
SPIT ALERT - PUT DOWN YOUR DRINK!
I just about wrecked a keyboard while sipping soda when I ran across this:
RESPECT FOR CONTENT?
Respect for content
All archived pages, links and sites are stored exactly as they appeared on the web. Pictures, objects, links and flash are all retained as they are, preserved as originally conceived.
Are you fucking kidding me?
Where's the respect for my fucking copyright?
You'll be archiving pages WITHOUT PERMISSION, possibly with someone's AdSense code embedded, and someone can sit on your site click-frauding accounts to death or stealing content, while the webmaster can't even detect that anyone is accessing the pages via the archive.
When they "archive" your page it gets crawled by the following:
22.214.171.124 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
Now look at this shit coming from their servers:
inetnum: 126.96.36.199 - 188.8.131.52
descr: Hanzo Archives Ltd
184.108.40.206 "GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
So it's looking at robots.txt but what user agent are they looking for?
220.127.116.11 "GET /robots.txt" "Python-urllib/1.16"
18.104.22.168 "GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
22.214.171.124 "GET /robots.txt HTTP/1.0" "Python-urllib/1.16"
126.96.36.199 "GET /" "Python-urllib/2.4"
188.8.131.52 "GET /" "Python-urllib/2.4"
I dug around on their site and didn't see it, so I have no clue what the Python-urllib fetcher is looking for in robots.txt. It really doesn't matter, though, because the FAQ page plainly states that they don't give a flying fuck about your robots.txt file: they'll archive it anyway no matter WHAT YOU SAY MR. WEBMASTER and just make it private:
The original crawl was subject to restrictions by robots.txt. This means that any archived content will be marked as private for browsing by the person crawling it, therefore, unless its your own archive, you will not see this content.
Sounds to me, as a webmaster, they're saying "FUCK YOU!".
Well, I blocked your service, so this webmaster is replying in kind: "FUCK YOU!" No trespassing allowed.
This is a huge problem as people will be snapping copies of anything for any reason and you, the webmaster, will have no control over what Hanzo:web stores or displays nor what these people do with your content after the fact.
BTW, before people start flaming me that I should've "contacted" them to find out what they were looking for in the robots.txt file: if they were doing it right, the path to this information would've been in the user agent string, just like all the other sites do, or highlighted in the FAQ.
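For anyone wondering why the user agent token matters so much, here's a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `heritrix` token are pure guesses for illustration, since nothing on their site says which name the crawler actually looks for, and `allowed` is just a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The "heritrix" token is an assumption;
# Hanzo does not document which token their crawler honors.
ROBOTS_TXT = """\
User-agent: heritrix
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(agent_token, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets a crawler using agent_token fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    # A crawler identifying itself as "heritrix" is shut out entirely.
    print(allowed("heritrix", "http://example.com/"))
    # A fetcher identifying itself as "Python-urllib" only gets the "*" rules.
    print(allowed("Python-urllib", "http://example.com/"))
    print(allowed("Python-urllib", "http://example.com/private/page.html"))
```

The point: a rule written for one token does nothing if the fetcher checks robots.txt under a different name, which is exactly why the token needs to be published somewhere a webmaster can find it.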
Nice idea but your draconian implementation doesn't deserve a second chance and it's blocked, out of mind, not a problem for me anymore.
FWIW, my bot blocker already stopped them from getting anything in the first place, but I'm blocking their whole range of IPs just to make sure nothing slips through the cracks, like stealth crawling, as they have already demonstrated a complete lack of respect for everyone's website.
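If you want to do the same, the basic idea looks something like the sketch below. This is a minimal illustration, not the actual bot blocker mentioned above. The network `192.0.2.0/24` is a placeholder documentation range, not Hanzo's real allocation (look that up in whois yourself), and `is_blocked` is a hypothetical function name.

```python
import ipaddress

# Placeholder range only (reserved for documentation).
# Substitute the range from the crawler's whois record.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# User agent substrings seen in the log lines above, compared in lowercase.
BLOCKED_AGENT_TOKENS = ("heritrix", "python-urllib")


def is_blocked(remote_ip, user_agent):
    """Return True if the request comes from a blocked network or agent."""
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    print(is_blocked("192.0.2.55", "Mozilla/5.0 (Windows NT 10.0)"))
    print(is_blocked("198.51.100.7", "Python-urllib/2.4"))
    print(is_blocked("198.51.100.7", "Mozilla/5.0 (Windows NT 10.0)"))
```

Checking the IP range first matters because a user agent string is trivially faked, so the range check is what catches stealth crawling from the same servers.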