Sunday, July 08, 2007

Dynamic Robots.txt is NOT Cloaking!

If I read just one more post that claims using dynamic robots.txt files is a form of CLOAKING it might be enough to drive me so far over the edge that it would make "going postal" look pale by comparison.

For the last time, I'm going to explain why it's NOT CLOAKING to the mental midgets that keep clinging to this belief so they will stop this idiotic chant once and for all.

Cloaking is a deceptive practice used to trick visitors into clicking on links in the search engine and then showing the visitor something else altogether, a bait and switch practice. Technically speaking, cloaking is a process where you show specific page content to a search engine that crawls and indexes your site and show different content to people that visit your site via those search results from that search engine.

Robots.txt files are never indexed by a search engine, so they never appear in that engine's search results, which means a human will never see robots.txt in the search engine, click on it, and get served a different result on your website.

See? NO FUCKING CLOAKING INVOLVED!

Since the robots.txt file is only for robots, and humans shouldn't be looking at your robots.txt file in the first place, showing the human "Disallow: /" is perfectly valid even though you may show an actual robot other things, as the human isn't allowed to crawl anyway.

Let's face it, some of the stuff in our robots.txt file might be information we don't want people looking at or hacking around as it's just that: PRIVATE.

Additionally, robots.txt tells all of the other scrapers and various bad bots which user agents are allowed, so if you're allowing some less-than-secure bot to crawl your site, the scrapers can adopt that user agent to gain unfettered crawl access.

Dynamic robots.txt is ultimately about security, not cloaking. Nosy people or unauthorized bots that look at robots.txt are sometimes instantly flagged as denied and blocked from further site access, so keep your nose out and you won't have any problems.
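For the curious, the whole idea fits in a few lines. This is only an illustrative sketch, not anyone's actual implementation; the names (ALLOWED_BOTS, serve_robots_txt) and the whitelist are made up for the example:

```python
# Hypothetical dynamic robots.txt: trusted crawlers get the real rules,
# everyone else (browsers, scrapers, unknown bots) gets a blanket deny.

ALLOWED_BOTS = ("googlebot", "slurp", "msnbot")  # example whitelist only

DENY_ALL = "User-agent: *\nDisallow: /\n"

REAL_RULES = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Disallow: /cgi-bin/\n"
)

def serve_robots_txt(user_agent: str) -> str:
    """Return the robots.txt body appropriate for this requester."""
    ua = user_agent.lower()
    if any(bot in ua for bot in ALLOWED_BOTS):
        return REAL_RULES   # whitelisted crawler sees the real rules
    return DENY_ALL         # everyone else sees deny-all
```

Wire /robots.txt up to something like that and a nosy human with a browser sees nothing but "Disallow: /".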

If you still think it's cloaking, consider becoming a temple priest for the goddess Hathor as a career in logical endeavors will probably be too elusive.

75 comments:

Anonymous said...

It's your site and your responsibility is to protect it from thieves. You decide how your visitors view your content. The visitor should never be given the right to decide how you manage and protect what is yours. If all you're doing is trapping thieves and unwanted visitors, what does it matter what "label" anyone wants to put on site and content protection.

Anonymous said...

Of course you neglect to mention how cloaking justifies, nay pretty much requires stealth crawling to combat it...

As for Hathor, sorry, but I've never really considered having a Goa'uld implanted in my abdomen to be my preferred kink.

IncrediBILL said...

Stealth crawling is easy to combat, especially if you make the mistake of looking for a robots.txt file, and just as easy if you don't and get snared in the spider trap within a page or two.

... and you forget to mention how stealth crawling most likely breaks unauthorized access laws, minor detail.
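(The "spider trap" Bill mentions can be sketched in a handful of lines: a URL that is disallowed in robots.txt and hidden from human visitors, so anything that fetches it outs itself as a bot. All names below are illustrative, not from any real blocker:)

```python
# Minimal spider-trap sketch: the trap path is listed as Disallow in
# robots.txt and linked invisibly, so compliant bots and humans never
# request it -- any client that does gets its IP banned on the spot.

TRAP_PATH = "/trap/do-not-follow/"   # also disallowed in robots.txt

banned_ips: set[str] = set()

def check_request(ip: str, path: str) -> bool:
    """Return True if the request may proceed, False if blocked."""
    if ip in banned_ips:
        return False
    if path.startswith(TRAP_PATH):
        banned_ips.add(ip)           # anything fetching the trap is a bot
        return False
    return True
```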

Anonymous said...

Re: Stealth Crawling

If you show up at my front door wearing a mask and holding a gun I will treat you as a thief.

If a spider wants to crawl in stealth mode it will get the same content as anyone else until it is detected for what it is. After detection the "stealth spider" will get what any other spider of this kind gets.

IncrediBILL said...

Stealth spiders are kind of stupid and tend to get 2-3 pages before they get shut down so I'm not terribly concerned about them.

Anonymous said...

If a page is visible to the public, retrieving it cannot conceivably break any unauthorized access laws.

IncrediBILL said...

That would be incorrect as I've researched the problem and run it past a couple of legal eagles.

If you deliberately bypass security measures designed to keep certain types of activity out of a server, you are most likely breaking the unauthorized access laws. Those laws don't distinguish between PUBLICLY AVAILABLE and PRIVATE data, just that the perpetrator's intent is to bypass security measures designed to keep him out.

The laws are very loosely written so it's just a matter of time before someone uses them against a scraper.

Anonymous said...

If it's legal for me to use Firefox to retrieve a particular page, then it should be legal for me to use the software of my choice at my end to retrieve that same page. That software might be IE. It might be Opera. It might be Safari. It might prefetch the page in anticipation of my needs. It might crawl it for some other reason. I don't see how it even makes a difference at the other end, in all cases getting simply an HTTP GET request for a particular URL from a particular IP address, which will cost it the same in bandwidth and server resources to honor.

Overloading the server with abnormally large numbers of requests would fall afoul of existing denial-of-service rules.

Republishing whatever I retrieved without permission could (depending on factors ranging from CC or other liberal licensing to fair use) run afoul of copyright laws.

In those situations, I might be fair game under the existing laws covering such things -- although in the case of simply bogging down the server, they'd have to make a case that I'd caused $5000 worth of damage somehow, which would probably require I saturate their bandwidth for hours, unlikely to happen with a well-designed browser, even one that does prefetching or other automation of tasks. Or even if I were to somehow get and operate a spider that was reasonably well-designed.

Any laws that might, in effect, somehow make it legal to access a site with one browser and illegal with another are overreaching. To paraphrase, the laws are very loosely written so it's just a matter of time before someone strikes them down as unconstitutional and someone else drafts narrower replacements.

WebGeek said...

Great post Bill. I fully agree. Security and cloaking are two different issues. People seem to forget that web sites have been serving dynamic content for a while, and that means serving different content to different visitors, which is personalization, not cloaking. These same people forget that the definition of cloaking as it relates to search engines involves intent to deceive search engines so that they can get better rankings. If your goal is security, and not rankings, and you're not attempting to deceive search engines, then no cloaking is involved.

IncrediBILL said...

Thanks WebGeek, at least you get it unlike our Anonymous poster above you.

Sorry Anon, prefetch is blocked on my server as anticipation of your needs isn't my concern and any other type of crawl gets hammered as my TERMS OF SERVICE forbid unauthorized crawls and software is in place to enforce those rules.

Besides, this isn't just about my server, and you really don't have to make a big case of $X damages with the laws in question, and using this law sidesteps copyright as it's all about unauthorized access which is MUCH easier to prove.

Did you ever read this post?

http://incredibill.blogspot.com/2006/11/legality-of-stealth-robots_116474846285292987.html

I ran it across a couple of lawyers' desks; they claimed my interpretation of the law was actually "too narrow".

You can debate it until you're blue in the face, I really don't give a shit except that you're boring the piss out of me.

So unless you're a lawyer or know one that says the two I talked to are wrong, this conversation is going nowhere.

Anonymous said...

The difference between personalization and cloaking? I'd say the line is crossed when there's deception or discrimination going on, and someone gets a shittier deal than someone else (or than Googlebot).

Bill, I never signed anything agreeing to your terms of service, so you can shove them.

Bill, your URL got truncated so it won't be usable.

Bill, even if I had signed something agreeing not to crawl your site, if I later did so that's breach of contract, a civil matter and not a criminal breach of antihacking laws. If I dictionary-attacked your password and defaced your web site that would be hacking. If you think a stealth crawler is "illegal hacking" you obviously have no real experience of what serious hacking means! But a general trait of antihacking/"unauthorized access" laws is that you have to have fraudulently gained access to a resource that you weren't granted permission for. If I fetch something from your public_html, regardless of what software I use to do so the fact that it's in your public_html kind of means I am authorized to fetch it. If I'd got at the /etc/passwd file on the server somehow (useless though it probably is, if you have half the brains you routinely brag about), then I'd've probably committed some sort of unauthorized access.

You might be able to make a case in the event someone's bot defeated your captcha, on the grounds that it's fraudulent to pass a captcha without being human.

You also might be able to make a case in the event someone's bot reads your robots.txt and then goes where it was just told it wasn't welcome.

As for your two lawyers, well, it's easy to find lawyers sympathetic to almost any legal theory or point of view. Whether something would hold up in court or not is an entirely separate issue. I could probably find a lawyer to say that your whole blog here ought to be illegal -- in China, anyway. :)

Certainly, if the lawyers insisted that if you ever tried their legal theory they were to be paid up front and regardless of the eventual outcome, I wouldn't put too much stock in their claims.

IncrediBILL said...

>> Bill, your URL got truncated so it won't be usable.

I just checked, the entire line was there but visually truncated with CSS so if your IQ was really that high you could've selected the line, used copy/paste like I did, and it worked.

Obviously you're too smart for me so fuck off.

Unauthorized Access laws aren't civil breach... so blah blah horseshit blah and I didn't find two sympathetic lawyers, I hit two random lawyers in this space, so blah blah I'm as bored as shit with this conversation as any human could possibly ever be blah.

If you knew so much you would know you DON'T use lawyers to file criminal cases, you file a complaint with the police, they subpoena the records, then the DA gets involved if it's interesting and they think they can win so it's no $$$ out of any pocket.

Problem is you can't threaten criminal action vs. a civil case or you'll get in all sorts of trouble. If you file and win a criminal case the civil case is a cake walk.

Anonymous said...

Gee, I got the asshole so riled up he doesn't even make any sense now!

IncrediBILL said...

Gee, strong words from someone too stupid to copy a truncated URL off a page.

Anonymous said...

"If a page is visible to the public, retrieving it cannot conceivably break any unauthorized access laws."

I want you to try your theory at NYT and let us know how many pages you are allowed to read before you must sign in. I believe the number is 5 or 6 before sign-in is required, and Googlebot has crawled the whole site. NYT decides how many you get, not you.

Content on a site is the property of the site and you don't have the right to see it just because you want to. However, if you abide by the rules and guidelines set out by the owner of the content you may get to access it.

Do you have the right to crawl a site just because you want a copy of it on your hard drive/server? - No you don't

"Any laws that might, in effect, somehow make it legal to access a site with one browser and illegal with another are overreaching."

No they aren't. Try changing the UA, on any of those browsers, to Googlebot or Slurp. Both Google and Yahoo have stated that if a UA doesn't pass a "double" reverse DNS lookup it is being spoofed. If you use UA spoofing on many sites you will be denied access.

The user must abide by the rules and conditions established by the site owner. Content isn't available to you merely because you want it.
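(The "double" lookup mentioned above is forward-confirmed reverse DNS: resolve the IP to a hostname, check the hostname belongs to the crawler's domain, then resolve the hostname forward and make sure it maps back to the same IP. A rough standard-library sketch, with the allowed suffixes as example values only:)

```python
import socket

def verify_crawler(ip: str,
                   allowed_suffixes=(".googlebot.com", ".google.com")) -> bool:
    """Forward-confirmed reverse DNS check for a claimed search crawler."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        if not host.endswith(allowed_suffixes):
            return False                                 # wrong domain
        forward_ips = socket.gethostbyname_ex(host)[2]   # forward lookup
        return ip in forward_ips                         # must round-trip
    except OSError:
        return False
```

A scraper can put "Googlebot" in its User-Agent header, but it can't make Google's reverse DNS zone vouch for its IP.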

Anonymous said...

Re: truncated URL -- you mean smart enough not to waste my time.

Re: other anonymous user -- authoritarian fascist prig. While sites may try to enforce such rules themselves, I believe that short of overt hacking it is a civil not criminal matter, or at least should be. And if no contract has been signed and then breached, and no damages have occurred, not even a civil matter (no lawsuits).

Anonymous said...

"Re: other anonymous user -- authoritarian fascist prig."

Content isn't for you to view just because you want to. What don't you understand? If I find my content on your server you had better have a contract with me; if you distribute any of my content you need a contract.

My site users are allowed to use a browser and if one does use a "site ripper" that adversely affects my server that user had better have a contract.

We haven't even begun to mention running malicious code in a form field or trying to establish too many connections at the same time.

At no time has a site owner needed a contract to define what is clearly unacceptable use of their site, server and content.

Anonymous said...

You're describing a) copyright infringement and b) DoS attacks, both of which are already illegal. A quiet crawl that doesn't result in republishing anything clearly fails to cause either of these effects. Let viewers prefetch if they want to -- subject to not overloading the server of course. :P

I don't like any suggestions that sites should have the right to arbitrarily limit what technology can be used to access them beyond the obvious need to avoid being overwhelmed with excessive traffic. There are disabled people and others with special needs out there for whom unusual software may be an important benefit, without harming your servers any or somehow magically guaranteeing copyright infringement.

As for my right to view the content, I don't see why I shouldn't be able to so long as I can cover the marginal cost of doing so, which is approximately zero if it's much less than gigs and gigs of data.

IncrediBILL said...

re: "I don't like ... limit what technology can be used "

Too bad for you then.

When you start paying everyone's bills then you can make all the rules.

Until then, draconian bot blocking rules are in full effect.

BTW, site rippers aren't a DoS attack until they overload your server so please use the proper terminology.

Anonymous said...

"A quiet crawl" is prohibited in my robots.txt, so just because you are in possession of software doesn't give you the right to use it.

"that doesn't result in republishing". Seeing as you haven't paid me for my content, any form of duplication is "republishing" in my eyes.

"As for my right to view the content", you don't have the right to view the content of a site just because you want to. Site visitors are allowed to view the content as long as they abide by my acceptable use policy (robots.txt file).

Anonymous said...

Screw both of you. Your AUPs are not binding on me if I haven't signed anything, according to the laws of the country in which I live. The most you can do if I do something that you don't like is refuse me service -- if you can even identify me to do so; you cannot sue me or involve the police, as there's no illegality or breach of contract, not according to any laws I am subject to. So there. :P

IncrediBILL said...

re: refuse me service

I've been trying but you're like a kicked puppy that just keeps coming back for more and can't take a fucking hint.

Anonymous said...

Yeah, I noticed -- your captcha often has to be done twice to work, and I'm damned sure I'm not misspelling anything. Obviously your attempts to block me are half-assed at best, which somehow fails to surprise me.

IncrediBILL said...

You are such a simpleton, the captcha is BLOGGER'S, not mine, I have no control over it.

I never attempted to block you, I just assumed you would get sick of being berated for being the mental midget that you are and just go away.

Who knew you weren't that smart.

Anonymous said...

You chose this host; therefore you're responsible for the captcha on your blog, since if you don't like the way it behaves you can move elsewhere. Since you don't, I assume you like it intermittently failing and draw the obvious conclusion.

P.S. you're an asshole.

IncrediBILL said...

Excuse me?

I'm not the one complaining about the captcha as I've never seen it fail intermittently or otherwise.

Just more of your typical babbling bullshit nonsense as usual.

Anonymous said...

Yeah ........
Bill, I think we have got to know the user and/or writer of what is being referred to as "Special Needs Software". Blocking, trapping and feeding this one "incriminating" pages should be even more fun now.

Anonymous said...

Are you now accusing me of being an "evil scraper"? Because I assure you I am not. I have never copied and republished wholesale from a website (yours or anyone else's) and don't plan to.

Besides, the content on your site just isn't that good; if I ever did want to make an mfa site this is one of the last places on Earth I'd choose to scrape to feed it. :P

Anonymous said...

User-agent: *
Disallow: /

The above means no bots allowed for any reason, even a "special needs bot" with a spoofed UA.

Cache-control: no-cache ... no exception for user-agent caches

Cache-control: no-store ... means do not store any part of either this request or any response to it. This applies to both single-user and shared caches.


"I have never copied and republished wholesale from a website" You don't have the right to duplicate any part of any site I own.
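(For what it's worth, the headers listed above take one line apiece to emit; a minimal standard-library sketch, with the handler name invented for the example:)

```python
# Toy HTTP handler that sends the Cache-Control directives quoted above,
# telling caches not to reuse or store the response.

from http.server import BaseHTTPRequestHandler

class NoStoreHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"private content"
        self.send_response(200)
        self.send_header("Cache-Control", "no-cache, no-store")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Pass the class to http.server.HTTPServer to actually serve it.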

Anonymous said...

Private copying is fair use, numbskull. Read 17 USC section 117 and nearby. I can cache a local copy of your web site if I want to, and a) you have no way of knowing I did so and b) you'd have no legal leg to stand on if you somehow found out.

zCat said...

I can cache a local copy of your web site if I want to

Correction: you can try and cache a local copy, but Incredibill is also at liberty to limit your access to the site, which may hinder your attempt to make a full private copy.

Another analogy: say I have a shop. Anyone is free to come in and browse around - that's the whole point of a shop - but if someone spends the whole day there, fingering the merchandise and generally not being any benefit to the business, it's my right to ban that person from the shop.

Anonymous said...

"Private copying is fair use" may be applied if the content was obtained without violating the acceptable use policy of the server.

User-agent: *
Disallow: /

If your "Special Needs Site Ripping Content Duplication Software" violates my robots.txt file, you didn't have the right to access the content in the first place. You can't duplicate what you don't have the right to have.

Anonymous said...

You fanatics make no sense, and don't even know it.

First off, copying material from a web site for private use is not "fingering merchandise in a shop" by any stretch. Not even if it's an ecommerce site, since you can't actually handle the goods without buying them first and having them shipped. Actually it's "relieving the burden on your servers" -- in other words, it's window shopping without taking up room inside the store that could instead be taken up by a customer more likely to buy something. Hell, it's not even taking up sidewalk room in front of the window -- it's standing there for a moment, taking a photograph, and then vacating the space.

As far as "authorized" goes, if I manually retrieved and used right-click-save-as on a bunch of pages that's clear fair use. If I automate the process at my end, but in such a way that it makes no material difference to the server (it sees the same hits, retrieving the same files, costing the same bandwidth, at similar intervals, and the same amount of stuff, zero, gets republished without permission), I fail to see any basis for claiming that you've been wronged since as far as you and your servers are concerned EXACTLY THE SAME THING HAPPENED.

Nonetheless I think it would be wise for advocates for users' rights to push for legislation or a court precedent establishing affirmatively that choice of user-agent and feature-set of user-agent is a user's free choice and does not affect what is considered authorized or what is considered fair use, and that a web master's sole remedy for scrapers is to use existing copyright infringement law or similarly, and to ban them from the site, and that a web master's sole remedy for sources of excessive traffic is to throttle traffic on a per-source basis, ban unrepentant sources, and levy DoS charges against those that are clearly intentionally being abusive. That sounds like it's all webmasters need to protect their legitimate interests. In fact I think I'll go see that the EFF becomes aware of this emerging issue in users' rights now...

Anonymous said...

"If I automate the process at my end, ....." Once the process is automated, the robots.txt file stipulates what is acceptable on the website.

User-agent: *
Disallow: /

Means no bots allowed.

Site visitors and their actions are monitored and, if suspect, even if they are logged in, they must prove they are in fact human. The content is free to read as long as visitors abide by my policy.

IncrediBILL said...

re: "If I automate the process at my end"

If you automate it, your bot won't be able to answer the captcha after a few pages and your IP will be blocked, so automate until you're blue in the face and make my day.

Then you can go whine to the EFF until your ass bleeds that you were caught doing stupid shit and got banned from 100s if not 1000s of websites that monitor postings of "bot activity"... oh WAH! fucking WAH!

I'd love to watch them laughing at you, it would be priceless.
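(The gating Bill describes, where an IP only gets a few automated page views before everything turns into a captcha challenge, could be sketched like this; the threshold and every name here are invented for illustration:)

```python
# Toy captcha gate: each IP gets PAGE_LIMIT free page views; after that,
# requests are challenged until the visitor proves they're human.

from collections import defaultdict

PAGE_LIMIT = 3
page_counts: defaultdict = defaultdict(int)
verified_humans: set = set()

def needs_captcha(ip: str) -> bool:
    """Record one page view; True means challenge instead of serving content."""
    if ip in verified_humans:
        return False
    page_counts[ip] += 1
    return page_counts[ip] > PAGE_LIMIT

def mark_human(ip: str) -> None:
    """Call when the visitor solves the captcha."""
    verified_humans.add(ip)
```

A browser-driving human answers the challenge and carries on; an unattended bot stalls at the gate.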

Anonymous said...

Actually, your robots.txt is advisory, and certainly only applies to someone who retrieves it. Users have the right to their choice of browser software so long as they don't generate excessive server loads. Period. You can use technical means to try to frustrate users exercising their rights, but that doesn't change the facts. And in the future I hope to see such deliberate frustration of users' rights made illegal, and the lot of you control freaks arrested.

I'd also like to see you shut up now. This is getting pointlessly repetitive.

Anonymous said...

"Users have the right to their choice of browser software .... " No you don't! What is it you don't understand? I don't care who produced the software, even if it's MightySoft.

The server and site are my property. Acceptable use is defined by me. If you don't like it, piss off. If you violate any of my usage policies you will be banned.

IncrediBILL said...

re: "robots.txt is advisory"

Um, not if you want to crawl the site without being banned, it's not.

re: "deliberate frustration of users' rights made illegal, and the lot of you control freaks arrested."

LMFAO - you mean scrapers should be allowed to scrape and people that try to protect their servers and content should be in the wrong?

Listen shit head, on my server you HAVE NO RIGHTS except those I grant. If I don't grant you access based on any number of things, such as your geo-location or the IP coming from a hosting company, you can't get in.

When YOU start paying MY bills then you can do whatever the fuck you want.

Until then it's still MY wallet footing the bills so I get to decide who gets in, how many pages they get for free, etc.

It's like being a guest in someone's house in that you should abide by the host's rules, at least polite people do, which you obviously aren't. If you insult the host or do something equally stupid they just throw your rude ass out of their house and lock the front door.

If you can't understand that my web site is basically my online house, and you're a virtual guest expected to be polite in my online house, then don't be surprised when I throw your dumb ass out.

What a fucking loon.

Anonymous said...

""Users have the right to their choice of browser software .... " No you don't!"

Yes we do.

"LMFAO - you mean scrapers should be allowed to scrape and people that try to protect their servers and content should be in the wrong?"

No, I mean that scrapers should be liable under existing copyright laws and people that try to arbitrarily restrict non-infringing use (other than by simply capping per-user bandwidth use) are in fact in the wrong.

"Listen shit head, on my server you HAVE NO RIGHTS except those I grant."

Now, now, there's no call for being uncivil here. I've been quite calm and civil here after all. And I have all the rights my country grants to its citizens, whether you like it or not. Your only recourse is to try to keep me out -- and I can legally try to sneak back in, within reason. Especially since I'm not really doing anything but typing at my own computer. If your HTTP server grants a request it just authorized my access on your behalf and as long as I didn't get it to do so by cracking a password or something, there was no fraud involved. For instance I can spoof my geographical location if I please -- you have no legal right to know it without deception after all. You can certainly limit bandwidth use. You can certainly prosecute infringement of your copyrights. I don't see why you don't feel that that is sufficient to protect your interests. The only reason I can think of is that one of your interests IS trampling on users' interests, which could otherwise be accommodated at no cost to you.

And your web site is not like the inside of your house. It is like the front of your shop, including the windows for shoppers to look in through. You have the right to set prices of merchandise and prevent shoplifting. You do not have the right to restrict people's use of the sidewalk out front, or charge them simply to look in the window, or any of that. Besides the fact that doing so is monumentally stupid for a variety of reasons -- consider that any real-world shop doing that will end up turning away lots of customers and making less money.

IncrediBILL said...

If I don't own the sidewalk you're right, I can't stop them from walking by but I sure as hell can keep people from coming inside because of that time honored tradition "NO SHIRT, NO SHOES, NO SERVICE" and "WE RESERVE THE RIGHT TO REFUSE SERVICE TO ANYONE".

If you aren't wearing your MSIE, Firefox or Opera shirt, that's too bad for you, no admission.

Lynx? Sorry, those 2-3 hits a month get booted.

If a real world shop turned away the bandwidth-stealing, copyright-infringing thugs I've punted they would make *MORE* money, not less, just like I have, and not waste time fighting copyright abuses.

I think you miss the point entirely in that bot blocking is both server resource protection and preemptive copyright protection. Doing both actually makes a better experience for the real users as they aren't encountering overloaded servers or misleading content used by the bandits to lure them off to nasty sites elsewhere.

Besides, some sites are only designed for one browser and the site simply won't run on anything other than MSIE which tramples on the user's browser choices far worse than anything I've ever done.

What you claim is "accommodated at no cost to you" is such BULLSHIT you have no clue. Tracking down and legally pursuing copyright infringement issues is much more time consuming and costly than installing a bot blocker. The technological solution wins hands down in the grand scheme of things in saving time, money, stress and providing a better overall end user experience.

You obviously have no experience running a high volume site or what's involved, there is a huge cost in abuse, and stopping it saves money.

OK, maybe a few users get hassled a bit, but the needs of the many outweigh the needs of the few. Those few users will have to answer a captcha now and then or maybe set their browser UA to show a mainstream product name.

OH WAH!

Anonymous said...

""Users have the right to their choice of browser software .... " No you don't!"

Yes we do.


Go find me something that can back up "Yes we do".

Copying an entire website is not protected by fair use.

http://www.usatoday.com/marketing/tos.htm

Anonymous said...

http://www.chicagotribune.com/services/site/chi-copyright,0,3270120.htmlstory
"You may not scrape or otherwise copy our Content without permission."

Anonymous said...

"preemptive copyright protection"

There is no right in law to this.

"some sites are only designed for one browser"

Yes -- bad, non-standards-compliant sites that can go straight to hell (windows update heads up that list of course)

"Tracking down and legally pursuing copyright infringement issues are much more time consuming and costly"

I weep for the greedy copyright holders of the world. No, really! I do!

Seriously I have no problem, and the majority of the public have no problem, with putting the burden of proof and of enforcement on copyright holders. In fact it's high time copyright went the way of the dodo. It's ludicrous to charge thousands of times the marginal cost for something. Anyway scraping wouldn't be an issue if another anon user's suggestion for google were implemented whereby it would treat the oldest working site with the same content as the non-supplemental one and use the maximum pagerank of the duplicates as the pagerank of the hit. The result would be that all a scraper's MfA site could do would be to increase YOUR traffic (and if theirs got high enough even your pagerank). Google could also just ignore the other sites. The idea being if the original site is online the oldest site is going to be the original, which had to be up before a copy could be made. Google searches would tend to lead people to the original and it would be your revenue-generating ads getting seen and not the scraper's.

This is a fix that would please everybody and obviate the perceived need for user-hostile behavior by the likes of you.

Last but not least, your technical measures certainly should not be illegal to try to circumvent. If you are going to try such shit, then at least have the good graces to have to WORK for it and be in an arms race, rather than just push a button and automatically get your way. A healthy marketplace balances all interests and allows no one player to impose their will unilaterally and without recourse by the push of a button; instead the more they want it and the more anyone else doesn't the more they have to work for it. That's as it should be.

"Go find me something that can back up "Yes we do"."

Bill of Rights? Constitution? You know those various documents that say that that which is not expressly forbidden by the government is allowed?

Simple common sense? That without users your sites are worthless so users' wishes ARE important?

"Copying an entire website is not protected by fair use."

Copying it for your private use only is -- and caching happens all the time anyway. That can actually reduce load on your precious servers when people go back to reread something in the future!

You can whine and moan and say you don't like it or even that you don't allow it but you can't actually forbid it with legal teeth unless you get everyone to sign something before they read a single page.

"You may not scrape or otherwise copy our Content without permission."

They can say that -- it's free speech. They can't enforce it, save against genuine copyright infringement such as republishing chunks of it elsewhere.

They can't even detect it if it's done right. Certainly not if it's done manually, or by a driver-assisted tool that rate-limits and randomizes its activity and punts to a human whenever it encounters a captcha. You can't punish what you can't detect. Which means you should just focus on fighting genuine infringement, or making it harmless. If Google adopted the suggested change above that would make scraping harmless except for the greedy sites that want to actually charge people money just to read stuff, instead of using ads or sales of real goods to make ends meet.

IncrediBILL said...

re: "In fact it's high time copyright went the way of the dodo."

Actually, it's time you and your ridiculous comments go the way of the dodo as people that produce copyrighted works have the right to make money off of those works.

Otherwise newspapers, books, magazines, movies, photography, art, music, and websites would cease to exist if those that made them weren't compensated.

And this statement is just stupid:
"They can't even detect it if it's done right. Certainly not if it's done manually, or by a driver-assisted tool that rate-limits and randomizes its activity and punts to a human whenever it encounters a captcha. You can't punish what you can't detect."

You may be surprised what can be detected because stealth works both ways.

I detect things done in all sorts of bizarre ways because the text has bugs and poison pills in it.

Use my text from some of my sites as-is and it will stand out in the search engine with a big red flag and they are easily busted.
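[The poison-pill idea can be sketched: serve pages with a unique, invisible nonsense token derived from a secret, then look for that token to spot verbatim copies. A minimal Python sketch -- the secret, the hidden `<span>`, and every name here are hypothetical illustrations, not Bill's actual implementation:]

```python
import hashlib

def poison_token(site_secret, page_id):
    """Derive a unique, innocuous-looking marker for one page.

    The marker is a nonsense word unlikely to occur naturally, so any
    page elsewhere containing it was almost certainly copied verbatim.
    """
    digest = hashlib.sha256(f"{site_secret}:{page_id}".encode()).hexdigest()
    # Map hex digits to consonants to produce a word-shaped token.
    letters = "bcdfghjklmnpqrstvwz"
    return "".join(letters[int(c, 16) % len(letters)] for c in digest[:10])

def watermark_page(html, token):
    """Embed the marker where a human reader will not notice it."""
    return html.replace(
        "</body>", f'<span style="display:none">{token}</span></body>'
    )

def is_copied(candidate_text, site_secret, page_id):
    """Check whether text found elsewhere carries this page's marker."""
    return poison_token(site_secret, page_id) in candidate_text
```

[Searching a search engine for the token then flags verbatim copies with essentially no false positives.]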

And this clincher...

""Copying an entire website is not protected by fair use."

Copying it for your private use only is "

BULL-FUCKING-SHIT!

Even the public library limits the number of pages you can copy from a book as "fair use".

Try copying certain types of copyrighted material at Kinkos, or a photolab even, and they'll tell you NO and show you the door.

And this gem...
"I weep for the greedy copyright holders of the world. No, really! I do!"

Well I weep for the scrapers of the world that when faced with impending technology that will put them out of business run from blog to blog like chicken little crying the sky is falling, that copyright holders have no rights, we have no right to grant limited rule-based access to our servers and all sorts of stupid shit.

If you knew what you were talking about it would sure make a difference as it feels like I'm having a conversation with a 10 year old.

Anonymous said...

"Actually, it's time you and your ridiculous comments go the way of the dodo as people that produce copyrighted works have the right to make money off of those works."

Yes; however, restricting copying is a privilege, not a right, granted by society at the people's sufferance. As such it's subject to revocation if the people will it and the legislature respects the will of the people.

"Use my text from some of my sites as-is and it will stand out in the search engine with a big red flag and they are easily busted."

That's scraping and republishing. That is copyright infringement and yes, it's easily detected. I was talking about someone copying part of your site for their own private use only -- no republishing (so no copyright infringement) and no abnormally heavy traffic to your server.

"Try copying certain types of copyrighted material at Kinkos, or a photolab even, and they'll tell you NO and show you the door."

And your point is? That's a private business choosing not to expose itself to potential liability for contributory infringement in case you use them to make copies you intend to distribute rather than keep to yourself. Copy in your own home for your own use and it's another matter entirely.

"Well I weep for the scrapers of the world"

I don't care for scrapers either. I do care that their bad acts might poison the well for everyone else, including users who have no intention of redistributing anything without permission but want to browse with the technology of their choice for their own convenience. I don't see the harm in them doing so so long as they don't commit copyright infringement and don't (accidentally) DoS your servers. In fact, a lot of these technologies would let them e.g. retrieve stuff they'll want to read from your servers at off-peak hours and retrieve all of it only once, RELIEVING the burden on your servers.

That you argue against such things shows that you value control for its own sake far more than you do either a) your copyrights or b) your server's resources, no matter how much you may protest.

Also, since this thread is below the fold now, it shouldn't be getting this much attention. Why haven't I had the last word yet??

IncrediBILL said...

re: "I was talking about someone copying part of your site for their own private use only"

Dude, you are so behind the curve as I can find that as well. The javascript embedded in all my pages shows up if they are online at any time they open my pages.

As a matter of fact, I hit "BREAK IN FOR CHAT" just to scare the shit out of the little fuckers when I see the page is coming from a personal computer. You can see that page close so fast you can't help but laugh.

re: "however, restricting copying is a privilege, not a right, granted by society at the people's sufferance"

Excuse me?

It's my god given right to control distribution up to the point of making it PAID access.

What sufferance?

Explain exactly how people suffer if they can't copy something and I'll give you a gilded pile of bullshit as an appropriate award for your reply.

Re this jewel of ignorance:
"That you argue against such things shows that you value control for its own sake far more than you do either a) your copyrights or b) your server's resources, no matter how much you may protest."

You don't know shit, never did, and still don't.

I ran my servers wide open with no blocks for 7 years until the underbelly of the internet went out of control and started knocking my sites offline and copied material started to appear EVERYWHERE.

I was never a control freak but when faced with online extinction, I responded with the most powerful counter response possible.

You really don't get it that my actions were for sheer online survival and nothing more.

Put that in your kindergarten bubble pipe and smoke it...

Anonymous said...

"The javascript embedded in all my pages shows up if they are online at any time they open my pages."

???

"It's my god given right to control distribution up to the point of making it PAID access."

No, it is not. Check the legal code sometime, and the constitution. It's a privilege, and it has limitations. For example there's this thing called "fair use".

"Explain exactly how people suffer if they can't copy something"

That's not quite the meaning of sufferance, and as for explaining, well, that would take a whole freaking book. Fortunately, it already exists.

"I was never a control freak but when faced with online extinction"

Online extinction? Well, copycat sites knocking you down into "supplemental" in Google is really Google's fault; keeping the original (oldest known) link the non-supplemental one would make scraping pointless and harmless.

It's also interesting that the vast majority of webmasters don't feel the need to take such drastic measures or blog belligerently about it...

Anonymous said...

... What planet did you come from?

If you didn't understand what Bill was saying about his use of js, you aren't capable of writing the software you want to use to copy websites.

Fair use isn't applied until after you can prove you obtained the content in violation of a robots.txt file and/or the terms of service (TOS) of the website (copyright holder).

"It's also interesting that the vast majority of webmasters don't feel the need to take such drastic measures..." Head over to the NYT with that idea and let's see how far you get.

"As we are not lawyers, we cannot judge the legal issues involved." Your link produced drivel in under 30 seconds.

Your use of "supplemental" tells me you don't understand its use in regards to an SE.

Your knowledge is scraped and the software you want to use is a scraper.......

Log stats provide a lot more than an IP. If I believe that any IP has violated what I have stipulated as acceptable on my website, I can get the address (location of the computer in the USA) of whoever I believe has stolen from me.

IncrediBILL said...

It's like talking to fucking rain man...

The use of JavaScript, such as the code snippet from LivePerson or some of the open-source versions of that type of online service, shows me someone is reading my page as long as a) they are connected to the internet and b) they have JavaScript active, which most people do.

If I see someone reading a copied version I can just break in for chat directly from that pilfered page residing on their desktop.
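[Bill doesn't publish his snippet, but the server side of such a phone-home beacon might look like the following Python sketch. The assumption is that embedded JavaScript on every page requests a beacon URL and passes along `location.href`; the trusted host list, function name, and log format here are all hypothetical:]

```python
import time
from urllib.parse import urlparse

TRUSTED_HOSTS = {"www.example.com"}  # hypothetical: the site's own hostnames

presence_log = []  # (timestamp, client_ip, page_url, flagged_as_copy)

def record_beacon(client_ip, page_url):
    """Record a phone-home ping from the embedded script.

    Returns True when the page reporting in is NOT served from a trusted
    host -- i.e. someone opened a saved copy (a file:// URL or a foreign
    domain) while connected to the internet, which is exactly the moment
    a "break in for chat" could be triggered.
    """
    host = urlparse(page_url).hostname
    is_copy = host not in TRUSTED_HOSTS
    presence_log.append((time.time(), client_ip, page_url, is_copy))
    return is_copy
```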

Now do you understand or do you need a Sesame Street video to fully dumb it down into terms you can comprehend?

re: "...vast majority of webmasters don't feel the need to take such drastic measures..."

Ahahahaha, if you could ever make a valid argument it would probably send shock waves around the world like an intellectual tsunami.

The vast majority of webmasters don't even know how to build a website, let alone protect it. If it weren't for tools like Blogger, WordPress, and other plug-and-play solutions, those webmasters wouldn't be on the web in the first place.

Most of them don't even know they need a robots.txt file and many of them whine about being scraped in just about every forum around.

Not a problem, as some webhosts have actually implemented solutions that cover ALL their customers, which I think I've even blogged about, so it's already taken care of by default for one large group of people.


re: "copycat sites knocking you down into "supplemental" in Google"

You really don't know what scrapers do or how they use the content and your continued ignorance shows every time you make such bullshit statements.

Also, that link you posted to that 'freaking book' is just pure whiny assed nonsense but thanks for sharing.

BTW, hope your next post is as entertaining as the last as you've become the source of a good daily chuckle for many people following this thread.

Anonymous said...

"If you didn't understand what Bill was saying about his use of js, you aren't capable of writing the software you want to use to copy websites."

He didn't exactly explain himself very well, and I'm certainly capable of disabling or (selectively) enabling JS in any web software I might someday write.

"Fair use isn't applied until after you can prove you obtained the content in violation of a robots.txt"

That is just plain nonsense -- fair use has to do with copying and copyright law, and has nothing to do with robots.txt which so far as I am aware is not the subject of any laws at all, only Internet standards.

"Your knowledge is scraped and the software you want to use is a scraper"

No, I am not a scraper. If you must know I am an advocate for users' rights, particularly on behalf of the disabled and the poor. There is an enormous difference.

"Log stats provide alot more that an IP. IF I believe that any IP has violated what I have stipulated what is acceptable on my website I can get the address(location of the computer in the USA) that I believe has stolen from me."

1. Log stats will show IP, time, and retrieved page. Are you now saying you'll be implementing unwritten rules that limit how many pages can be retrieved, even at a low rate and sporadically over time, and banning users for violating them? That will piss off a lot of humans that aren't using any kind of assistive technology, you know. That's just plain wrong. If it's in public_html it's fair game for someone to link to it and view it, pretty much by definition. If you'd rather not have the general public viewing a given page, put it behind a password. It's that simple.

2. An IP address will tell you what ISP someone uses, what country (often not the USA), and often what city or approximate area of that country, but it won't give you their street address or name. Getting those requires a subpoena, and getting that requires probable cause be demonstrated. That someone browsed some of your public_html Web pages (my God! The horror!) hardly suffices as probable cause. Submit your complaint that this user viewed seventy of my pages over the past three weeks to a judge and get laughed at; see if I care. Get doubly-laughed-at when the IP is Brazilian or South African or Taiwanese.

3. Nobody is contemplating stealing anything here, or even infringing your precious copyrights. Get over it.

"If I see someone reading a copied version I can just break in for chat directly from that pilfered page residing on their desktop."

If I read your site from a copy on my desktop there is nothing pilfered about it. It's called manual caching. It relieves the burden on your servers that you were so worried about. If that page is easily accessed online anyway, the only difference doing it this way makes is that it saves your servers some work. If it's unaltered enough that your childish script tricks work, then your stupid ad banners also still work and still get you revenue too! And it's easy to disable JS or even strip it down to just the text if need be anyway.

"You really don't know what scrapers do or how they use the content"

Yes I do. They try to knock you down into supplemental so their copy comes up tops in a Google search, and then their ads instead of yours generate revenue for them instead of for you. If Google made the suggested ranking change that would be nipped in the bud, as the nonsupplemental result to come up would still be yours and people would see your site, with your ads generating revenue for you, no matter what.

You just don't like the idea that it could actually be fixed in a way that didn't justify draconian measures and insistence on ironclad control by webmasters, because you have such contempt for users and want to be able to smack them around and you're afraid of losing your current, flimsy justifications for doing so.

You're a pitiable creature.

"BTW, hope your next post is as entertaining as the last as you've become the source of a good daily chuckle for many people following this thread."

Nobody is supposed to be following this thread at all now. It's below the fold. I'm supposed to have had the last word due to nobody checking it and posting new comments anymore. Why hasn't that happened yet? Something nasty you did so I'm forced to keep repeating myself to defend myself? Stop it and let it die please.

IncrediBILL said...

Like I said, you really don't understand what scrapers do as you're still doing your little narrow minded supplemental chant.

I'd go into more details about how the advanced scrapers leverage the SE's but then I'd be educating the script kiddies, worse yet YOU would learn something, and we certainly wouldn't want that.

You're not only wrong about that, but exhibit even further technical ignorance about this post being "below the fold" that you keep babbling about.

There's something new called "Subscribe to Post Comments" so people can track a thread forever regardless of whether it's above or below the fold.

Wow, isn't that amazing?

You certainly aren't.

Hard to take anything seriously from someone that has been proven, over and over in a single thread, not to know shit.

Enjoy the ride.

Anonymous said...

"Like I said, you really don't understand what scrapers do"

Then perhaps you should explain it better?

[Several insults]

Eh? Nothing left after all the useless content free insulting blather is skipped?

I'm disappointed in you.

Now, to all of you who want to tell me what I can and cannot do with my b0x and the bits and bytes I have on it: you'll have to pry my keyboard from my cold dead fingers before I'll give up my full rights as System Administrator on the hardware I own. Now get the fuck off my internet. :P

IncrediBILL said...

I'm not sure which anonymous posted that, as it ended up sounding like a pro-bot-blocker.

Very odd.

Someone missed their meds.

Anonymous said...

You simply don't get it, do you?

You can do what you like at the server side, including block obvious abuse of your bandwidth.

I can do what I like at the client side, including automate stuff for convenience where the outward behavior of my machine is not noticeably altered.

And legal types can stay the fuck out of it unless there's a DoS attack or equivalently crippling and abusive traffic overload, or copyright infringement in the form of stuff republished without author permission.

Simple, no?

IncrediBILL said...

If my server side code doesn't like what your client side code does, such as PRE-FETCHing a bunch of pages you probably won't look at, or attempting to OPEN ALL 40 links from my RSS feed at once, that's when the fun starts and you'll find yourself wondering why you're staring at a captcha or worse yet, a page telling you to stop doing that shit and come back when you have learned to stop being a web hog.

Anonymous said...

Heh heh heh ... if I did design and use such software it would retrieve pages only one at a time and at irregular and fairly lengthy intervals, same as a human clicking links, and in a traversal order that would make it totally unobvious that it wasn't exactly that.
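[The pacing described above -- one page at a time, at irregular and fairly lengthy intervals -- could be sketched like this in Python. An exponential distribution gives the bursty, long-tailed gaps of a human clicking links rather than a bot's fixed interval; the mean gap chosen here is an arbitrary assumption, and only the timing logic is shown:]

```python
import random

def human_like_delays(n_pages, mean_gap_s=45.0, seed=None):
    """Generate irregular inter-request gaps resembling a human reader.

    An exponential distribution is bursty and long-tailed: mostly short
    pauses with occasional long ones. A fixed interval between requests
    is the telltale signature of a naive bot.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_gap_s) for _ in range(n_pages)]
```

[As Bill points out in the next reply, though, randomizing the gaps only hides the rhythm, not the cumulative page count.]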

The only way I can see to detect and screw with that, which mind you doesn't excessively burden your servers nor automatically guarantee by some magic that stuff will be republished without your permission, is if you require a captcha to be answered just to retrieve and browse any pages at all, regardless of the activity pattern. Try doing that and watch your site traffic plummet to zero as everyone finds a friendlier site to surf.

Speaking of captchas, I think blogger's is trying to send you a message to be gfraid, be very gfraid. :)

IncrediBILL said...

Sorry, but the slow random scrape will still get snagged as a scraper because attempting to look like a human still fails if you do it long enough.

I know I can stop the slow crawl as I've already done it, nailed a few of 'em.

Tracking activity for a period of hours (days?) after the last page access catches even a slow crawl, because each page access resets the timeout value and keeps the tracker actively monitoring, so your crawler, no matter how slow, will trigger a challenge eventually.

It is theoretically possible to scrape my site from a single IP address assuming you're willing to wait 109 years until the scrape completes.
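[The reset-on-access tracker described here can be sketched in a few lines of Python. The window length and challenge threshold are hypothetical placeholders, not Bill's actual settings:]

```python
import time

TRACK_WINDOW_S = 48 * 3600   # forget a visitor this long after their last hit
CHALLENGE_AT = 100           # cumulative pages before a captcha appears

visitors = {}  # ip -> {"last_seen": timestamp, "pages": count}

def register_hit(ip, now=None):
    """Count one page access and decide whether to serve a challenge.

    The timeout resets on EVERY access, so even a one-page-per-hour
    crawl stays tracked and eventually crosses the threshold.
    """
    now = time.time() if now is None else now
    rec = visitors.get(ip)
    if rec is None or now - rec["last_seen"] > TRACK_WINDOW_S:
        rec = {"last_seen": now, "pages": 0}   # new visitor, or tracking expired
    rec["pages"] += 1
    rec["last_seen"] = now                     # reset the timeout
    visitors[ip] = rec
    return "challenge" if rec["pages"] >= CHALLENGE_AT else "ok"
```

[A crawl slow enough to let the window expire between hits escapes the counter, which is where the "109 years" figure comes from.]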

Anonymous said...

I assume this is because it eventually tries to fetch every page at your site?

But I'm not talking about scraping. I'm talking about prefetching and other intelligent caching and other local-use stuff. This would not go spidering your whole site. It would only grab stuff linked from a page a human visited, or perhaps stuff linked with particular words even. 99% of your site it would ignore. Just like a human surfer. Any surf assistant worthy of the name would avoid grabbing lots of random stuff its human master was uninterested in, however slowly.

IncrediBILL said...

OK, let's say your average user looks at 4 pages. That means even prefetch skews the normal user access patterns by downloading either too many pages or some of the lesser-used pages, which would trip the challenge.

Besides, prefetch is evil and should be abolished because all it does is ramp up bandwidth waste.

Why is this a problem?

Do the math!

As millions of surfers start prefetching they'll choke the existing pipelines. When the pipelines get overloaded, the infrastructure gets upgraded to handle the load, and we all have to pay for it in our monthly bills.

So forget my server, bot blocking, or anything else as it's just a stupid fucking idea in general.

Anonymous said...

And you're just a fucking arsehole. :P Anyway, the real problem is that the existing infrastructure is used inefficiently -- one web site is being slashdotted while half the network capacity elsewhere in the world sits idle, and the site goes unavailable? WTF?

All of the data underlying a site could be provided by way of a DHT network, and suddenly only dynamic stuff would require a central server at all; all that server has to do is keep track of the latest versions of things and direct clients to the appropriate DHT hash-keys for those bits and for whatever script combines them into a whole. Once a site-deployment technology based on DHT nodes exists (with each installed browser providing some storage and forwarding capacity), a new web will be born that will make this shoddy thing look like gopher. And it will pay for itself; users running the DHT node/browser software will pay for their usage by providing incremental capacity to the global system, so all of the costs and benefits are distributed.

Of course, big centralized for-profit businesses -- the ones that like having a big, loud, easy-to-find site while grassroots stuff and small-business alternatives stay hard to find -- will hate this sort of decentralized system. Well, tough titties. They can try to fight the future but they can't win.

IncrediBILL said...

I knew this conversation would finally devolve into that same fantasyland communist DHT share the world bullshit again.

Go peddle crazy elsewhere.

Anonymous said...

Where's the communism in that? It's a micropayment system, like the one BitTorrent has under the hood. It doesn't get much more capitalist than that. It's also decentralized. It doesn't get much less Soviet than that.

IncrediBILL said...

You might as well run around telling people to use BETA instead of VHS and stop wasting our time.

Anonymous said...

Content distribution is up to the copyright holder. If you cause content that you don't own or have the rights to to be distributed, you're looking for legal trouble.

A website has the right to regulate who or what accesses its content. A browser has a cache that is for private personal use only. Using software that is prohibited by the site owner to feed a "Content Distribution Network" has nothing to do with "fair use."

BitTorrent has overcome many legal issues to the point that it is an opt-in service. You don't have the right to opt my content in.

IncrediBILL said...

Don't use logic and facts!

You'll just inspire him to continue his free-the-net free-the-data end-of-copyright tirades.

Anonymous said...

I think you've misunderstood me. I'm not suggesting people copy your web site and put it on p2p systems. I'm pointing out that before long, site operators are going to cut costs by using p2p-like systems themselves to make their site available, and this new web will start to supersede the old same as the old superseded gopher. In other words, existing web sites will have competition from new technology. Faster, cheaper, better competition. Look out!

P.S. "ythsex" ... interesting choice of captcha doncha think?

IncrediBILL said...

To cut costs?

Dedicated servers and 2,000GB of bandwidth a month can be had for as little as $200, sometimes less!

Site owners that opt for P2P networks opposed to a dedicated server aren't making any fucking money in the first place and are a joke.

How in the hell would you run a forum on a P2P network? There has to be code and a database somewhere, and the same goes for a blog, a directory, and many other types of sites. The technology to manage such complex sites over a P2P network, and all the failure points, would be mind-boggling.

Again, please, go peddle crazy elsewhere.

Twisted said...

Oh, really? Then explain how a forum system called Frost, based on a distributed p2p system, manages to work. Yep -- I shit you not. It's been done. The key thing is that already-existing posts are stored in a distributed hashtable; it's only tying the new stuff in with the existing and maintaining an index that doesn't work so easily. Dynamic content requires only that a pointer to the "latest version" be updatable, and you get near-permanent archiving for free. (Stuff no longer even linked to may be garbage-collected, so not always truly permanent.)
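[The "latest version pointer" scheme can be sketched with a toy in-process stand-in for the DHT: immutable blobs keyed by their own content hash, plus one mutable pointer per name. All names here are illustrative, not Frost's actual protocol:]

```python
import hashlib

class TinyDHT:
    """Toy stand-in for a distributed hashtable: immutable blobs keyed
    by their own hash, plus one mutable 'latest version' pointer per name."""

    def __init__(self):
        self.blobs = {}     # content hash -> bytes (immutable, deduplicated)
        self.pointers = {}  # site/topic name -> hash of latest version

    def put(self, data):
        """Store an immutable blob; old versions stay archived forever."""
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        return key

    def publish(self, name, data):
        """Store a new version and repoint 'latest' at it -- the only
        mutable operation, and the only thing a central server would
        need to coordinate."""
        key = self.put(data)
        self.pointers[name] = key
        return key

    def latest(self, name):
        """Resolve the mutable pointer to the current content."""
        return self.blobs[self.pointers[name]]
```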

Oh and did I mention that for most people $200 a month is a rather large chunk of change to be spending just to have a web site? That's most nonbusinesses. Businesses of course are always cost-conscious, at least if they don't have a monopoly. If the $2400 a year expense can be reduced to $10 a year or so, they'll jump at it once the technology matures.

Of course it means control freaks have to give up trying to control distribution, and realize where the real value is and where the real money is to be made online. It isn't content itself; content is a commodity. The market drives its price rapidly down to zero no matter how hard you try to prop the price up. Smart online businesses make money some other way. Amazon comes to mind. And Google...

IncrediBILL said...

Did I say a distributed forum wouldn't work? No, but you just explained why it would work like shit, which is why it should be free on a P2P system, because nobody would pay for such a service.

Usenet was the original distributed forum and it was marginally OK at best. Content was often censored by nodes the data passed through or in some cases simply lost altogether.

Love that P2P concept, not.

Besides, if you can't afford $200/month to run an actual business it's time to get the fuck off the 'net and flip burgers.

You forget that people that need some cheap hosting already have dirt-cheap services available that suck, and they probably still don't suck as much as what you're suggesting.

Using a P2P service to run a web site would probably cross that fine line between SUCKS and FUCKING SUCKS.

At this point I figure you just like to argue for the sake of arguing just so you can hear the sound of your own bullshit as it clicks across your keyboard.

Anonymous said...

I agree with Twisted's post.

Incredibill writes:
[p2p implementations automatically suck]

Wrong.
[X]! (Bzzzzt!)

[people who do not run their own business should not use the internet]

Incorrect!
[X][X]! (Bzzzzzzt!)

[Usenet is horribly broken and useless]

Complete and utter bollocks!!
[X][X][X]! (BZZZZZZZZZT!!!) That's three strikes! You're out!

Better luck next game!

IncrediBILL said...

That comment wasn't worth the bytes used for storage.

I didn't say "people who do not run their own business should not use the internet" so it's obvious you can't even fucking read.

WRONG! BZZZZT!

Anonymous said...

You've now graduated from merely delusional-seeming to definitely lying:

"Besides, if you can't afford $200/month to run an actual business it's time to get the fuck off the 'net"

Non-businessmen should "get the fuck off the 'net". Your own words.

Sorry. YHL. HAND.

IncrediBILL said...

I meant business operators, not your average internet users. If you can't afford $200 for a server you aren't running a real business in the first place.

Anonymous said...

Yeah, yeah, right. :P It isn't about affording it anyway. It's about cutting costs that can be cut. The business that cuts a cost another one doesn't, and still delivers the same (or better) service, wins.

IncrediBILL said...

Yes, and the business that relies on a shoddy, ramshackle P2P cluster fuck and goes belly up because it can't control its own site gets just what it deserves.

BTW, I always have the last word.