Wednesday, November 05, 2008

Temporarily Block HotLinking To Find Copyright Abusers

Blocking hotlinks is usually considered a method used to conserve bandwidth and stop leeching of images off your server. However, you can also use hotlink blocking to quickly and easily find all those sites using your content.

The most common solution on Linux servers running Apache is to add the following hotlink blocking code to your .htaccess file.

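RewriteEngine On
# Let requests with an empty referrer through
RewriteCond %{HTTP_REFERER} !^$
# Let requests referred by your own site through (example.com is a placeholder)
RewriteCond %{HTTP_REFERER} !^http(s)?://(.*\.)?example\.com [NC]
# Everything else asking for an image gets a 403 Forbidden
RewriteRule \.(jpeg|jpg|gif|png)$ - [F]

Obviously you want to change example.com to the domain name of your site before adding this to .htaccess on your site.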
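
If you want to sanity-check which referrers the two RewriteCond lines would let through before touching a live site, the same patterns can be tried with grep. This is only a rough sketch: would_block is a made-up helper name, example.com stands in for your own domain, and it models just the two referrer conditions, not the image extension check in the RewriteRule.

```shell
# Hypothetical helper: succeeds (exit 0) if a request carrying this
# referrer would fail both RewriteCond exemptions and so get the 403.
would_block() {
    referer=$1

    # First condition: an empty referrer is always allowed through.
    [ -z "$referer" ] && return 1

    # Second condition: referrers from our own domain are allowed.
    # -i mirrors the [NC] flag; example.com is a placeholder.
    if printf '%s\n' "$referer" | grep -Eiq '^https?://(.*\.)?example\.com'; then
        return 1
    fi

    # Anything else is a foreign referrer and would be blocked.
    return 0
}

# Example use:
#   would_block "http://scraper.test/page.html" && echo "blocked"
```

Note that the pattern is not anchored at the end, so a referrer such as http://example.com.evil.test/ would slip through both here and in the .htaccess version.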

Now once you've added this code the fun begins as you sit back a few hours and wait for all the "403 forbidden" codes to start filling up your access log file.

A simple grep on your log file will then generate a nice list of the sites in the referrer field that are hotlinking your images, or, as is often the case, doing something much worse.

grep "\.jpg" access_log | grep " 403 "

grep "\.gif" access_log | grep " 403 "

The first part of each command locates all the ".jpg" (or ".gif") requests, then the second part filters out everything but the " 403 " forbidden errors.
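
If the raw grep output gets long, the two steps can be folded into one pass that tallies the referring hosts. This is a rough sketch rather than a polished tool: list_hotlinkers is a made-up name, and it assumes your server writes the Apache "combined" log format, where the referrer is the second quoted field on each line.

```shell
# Hypothetical helper: count 403'd image requests per referring host.
# Assumes the combined log format:
#   IP - - [date] "GET /path HTTP/1.1" status bytes "referrer" "user agent"
list_hotlinkers() {
    awk -F'"' '
        # Split on double quotes: $2 is the request line,
        # $3 is " status bytes ", $4 is the referrer.
        $3 ~ /^ 403 / && tolower($2) ~ /\.(jpe?g|gif|png)[ ?]/ {
            # "http://host/path" split on "/" puts the host in parts[3].
            split($4, parts, "/")
            if (parts[3] != "") hits[parts[3]]++
        }
        END { for (host in hits) print hits[host], host }
    ' "$1" | sort -rn
}

# Example use:
#   list_hotlinkers access_log
```

The busiest hotlinkers end up at the top of the list, which is a handy order for working through the notices.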

After a day or two you'll have a nice list of sites to send C&Ds, DMCA notices, and all sorts of fun stuff.

Now disable your hotlink blocking script or remove it from your .htaccess file.

Why disable hotlink blocking?

Because hotlink blocking encourages people to actually download your images, which makes the process of finding stolen images way more difficult. A temporary hotlink block shows you everyone doing this just long enough to take corrective measures; then you leave your site wide open again and wait for the next batch of idiots to start hotlinking.

Hope a few of you find this little tip handy!


Emil Vikström said...

Why not just skip the .htaccess part and grep directly for the rows with inferior hotlinking (replace example.com with your own domain)?

grep '\.jpg' access.log | grep '[0-9] "http:' | grep -v '[0-9] "http://[^/]*example\.com'

Anonymous said...

Better yet, do what I did once to a bunch of myspace hotlinkers.

Took the hotlinked picture, renamed it on my server. I took a picture of my ass and uploaded it to my server with a nice note saying for all the thieves out there to kiss it.

Needless to say the hotlinks dropped like flies.

Anonymous said...

Interesting. Kind of a backdoor approach to an old persistent problem. I think I like it!

sallreen said...

Copyright laws were designed to protect those in society whom we celebrate and honor, often representative of the lowest paid workers, the artists. We don’t expect to take freely work from our doctors, lawyers, plumbers, electricians, mechanics, or others whose work we value and honor with compensation.

Anonymous said...

> "Just because information is made publicly
> available doesn't mean it's available to be
> used however anyone wants."
> Sure it does

No it doesn't.

This is what copyright laws are for: giving the
original author the ability to publish his/her
creation without making it legally possible for
every thieving bastard to do with it as he/she
pleases.

Copyrighted works are made available for perusal
by you, but with certain limitations as set forth
by the original author in a license or similar.

> The only way for it to be otherwise is if you somehow
> had a natural right to tell me what I could or could
> not do, alone or with only other consenting adults,
> inside the privacy of my own home.
> You don't.

You are completely missing the point. This is not
inside anyone's house, Bill is talking about people
who are stealing content and making it available on
the Internet. Huge difference.

Anonymous said...

"No it doesn't."

Yes it does.

"This is what copyright laws are for"

Copyright laws are of questionable legitimacy, but even supposing them to be perfectly good, they do not give an author the right to control downstream use. They give an author the right to charge for copies, and to limit distribution to authorized distributors.

Read up on the first sale doctrine, among other things. And look at the letter of copyright law. It does allow a book author to decide who can publish the book. It does not allow a book author to decide a buyer can't read it through some particular brand of spectacles, or on the bus, or between the hours of 7pm and midnight.

Similarly, copyright law does allow Bill to decide what sites can mirror his content. It does allow Bill to charge money for access. It does not, however, grant him any right to forbid browsing it with a particular browser, or on a mobile device, or between the hours of 7 and midnight.

Ergo, to the extent that a user's choice of browser (or whatever) does not impinge upon the bandwidth consumption, and so long as the user does not republish the content without Bill's permission, Bill has no legal claim against that user.

"Copyrighted works are made available for perusal
by you, but with certain limitations as set forth
by the original author in a license or similar."

That is a complete misunderstanding of copyright. Copyright provides for exactly two limitations: whether and under what conditions the user can distribute copies of the work, and whether and under what conditions the user can "publicly perform" the work. This covers the user scraping Bill's website and republishing it without Bill's permission, but not the user simply browsing Bill's website with browser X instead of browser Y.

"You are completely missing the point."

No, I am not.

See above, and read up on copyright law and what it really does and does not grant as exclusive rights to the authors of works.

"Bill is talking about people
who are stealing content and making it available on
the Internet."

Bill was actually talking about his supposedly having an actual cause of legal action against people for accessing his site with nonapproved-by-Bill browser software, in another thread. Unfortunately, rather than debate the matter honestly, Bill chose to keep deleting my reply and then block all new comments to that thread. So I reposted my reply to this one, and kept reposting it when he continued to keep deleting it -- making it clear that he has to either accept that my comment will persist in being available on the site or else he'll have to completely disable comments site-wide. Unfortunately, he deleted it again (shame on him!) and left your reply, and also, by breaking up the thread among different blog posts, has made the issue more confusing.

On the other hand, hotlinking is also (probably) not copyright infringement.

It certainly is not distributing without authorization. The only thing a hotlinker distributes is text like "a href=xyz.html" or "img src=xyz.html". If merely pointing someone to an authorized copy of the work constituted copyright infringement, then telling someone that the bookstore has copies of the new Stephen King book would be copyright infringement.

And it is, by hypothesis, an authorized copy. (Hotlinking to an unauthorized copy *might* be some form of contributory or vicarious infringement, though it still would not be direct infringement.)

There might be a case made that hotlinking with embedding may qualify as an unauthorized "public performance" of the work. (So "img src=" but not "a href=" linking.)

As I understand it, though, the idea was not to attack hotlinking itself, but to use hotlinking to detect the subset of scraped copies that didn't make their own copies of images. That catches unauthorized republishing of the site, which does appear to be straightforward copyright infringement. I have no argument there, save my questioning of copyright law's legitimacy and desirability in general. So no legal argument. Nor did I attempt to make one.

Sorry for the confusion. If there was a little less censorship around here, there would likely not have been any, though.

The original comment I made will be reposted following this comment. Both comments will keep being reposted if they get deleted. Both are legitimate, on-topic comments regarding issues discussed regularly at this blog, after all.

John Arundel said...

Hi Bill,

I happened to see your post in the WebmasterWorld forums on 'Wordtracker attempts crawling my site':

I'm an engineer here at Wordtracker and thought you might appreciate an explanation of what happened. We have a 'lateral search' tool which helps people find related keywords for a particular search. This involves looking at the pages returned for the original search, and attempting to identify any keywords found in the page.

So our servers weren't attempting to crawl your site, but one or more of your pages must have been returned as a search result for a keyword entered by one of our users. We definitely don't crawl sites automatically or try to spider your content; however, we appreciate that some people object to receiving traffic from our servers, so we maintain a list of domains which are automatically removed from any searches. If you'd like yours added to that list, please email me directly and I'll arrange it for you - and do pass this information on to anyone else who you think may need it.

Please accept my apologies for the trouble you've been caused.