Friday, June 02, 2006

Locating and Blocking Proxy Servers

Since some of my readers want to know how I'm doing it, here are a few tips on how you too can eliminate the anonymous proxies from your site. You probably won't get them all, and you might get a few false positives as well, but it's better to have some defense against this menace than none at all.

A large number of these proxy servers are on .EDU domains because of all the bleeding heart crap about making information free for all without censorship and being able to surf without fear of retribution. That's a very noble and altruistic motive, but you open the doors for competitive spying, scrapers, phishing thieves and a lot more, so don't take this the wrong way when I don't appreciate what you're doing with our tax and college dollars and send out a big "FUCK YOU" to establishments of higher learning that permit this bullshit. If people in other countries don't like being censored, let them overthrow their fucking government. It's not our problem, and my server and copyrighted content shouldn't be vulnerable to attack because of the gaping holes opened up by your bleeding heart asses, but I'm off on a tangent.

The other groups of asshole proxies are the many web-based CGI and PHP proxy servers (like eatmoreblueberries) being used to bypass restricted internet access imposed on corporate, library and school networks. Well I'm sorry but you're supposed to be WORKING or STUDYING so let me give you a big "FUCK YOU" as well. Not only do they download your pages, they strip out YOUR ads and insert their OWN ads, assholes. So for all you slackers using those proxies, zip it up, close the porn sites, go back to work, and get a life you little fuckers as MySpace isn't it.

So, with a bit of ranting aside, back to blocking proxies...

New proxy servers pop up every 5 seconds so my method requires multiple techniques:

  1. Import lists of known proxies and block them
  2. Look for proxy environment variables
  3. Test the IP for typical proxy ports and see if it works
  4. Check for a port number appended to your domain, which the lame proxies do
  5. Monitor for known crawlers coming thru a proxy

1. Import Lists

This step is pretty obvious and can be automated by downloading the lists from a few well-known proxy list sites, or if you're lazy you can subscribe to a service or two already doing that.

It probably doesn't hurt to validate these proxies, which can be done automatically; otherwise your list will grow without bound, since proxies appear and disappear very quickly.
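
Here's a rough sketch of the lookup side of this step. It assumes the downloaded list has been saved as plain text with one IPv4 address per line, optionally followed by ":port", with "#" for comments. That file format and the function names parseProxyList and isListedProxy are made up for illustration; real list sites each use their own formats.

```php
<?php
// Hypothetical helper: turn a proxy list (one IPv4 per line, optional
// ":port" suffix, "#" comment lines) into a lookup table keyed by IP.
function parseProxyList(string $text): array
{
    $blocked = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comments
        }
        // Drop a trailing ":port" if the list includes one
        $ip = preg_replace('/:\d+$/', '', $line);
        // Ignore anything that isn't a valid IPv4 address
        if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
            $blocked[$ip] = true;
        }
    }
    return $blocked;
}

// Hypothetical helper: is this visitor IP on the imported list?
function isListedProxy(string $ip, array $blocked): bool
{
    return isset($blocked[$ip]);
}
```

Keying the array by IP keeps each lookup cheap even when the list gets long.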

2. Proxy Environment Variables


You can check for the following:
HTTP_VIA
HTTP_X_FORWARDED_FOR
HTTP_PROXY_CONNECTION

Yes, those will tell you a proxy made the request, but remember that AOL and many others are also proxies, so it gets more complicated: you have to build up a list of known good proxies vs. all the rest and do further processing on those you don't know.

FYI, the really good anonymous proxies don't send that information so you'll never know it's a proxy.
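
A minimal sketch of that check in PHP, assuming you pass in $_SERVER and keep your own array of known good proxy IPs. The function names and the shape of the $knownGood list are assumptions for illustration, not a finished design.

```php
<?php
// Return the names of any proxy-related variables present in $server.
// In a real request you would pass $_SERVER.
function proxyHeadersFound(array $server): array
{
    $suspect = ['HTTP_VIA', 'HTTP_X_FORWARDED_FOR', 'HTTP_PROXY_CONNECTION'];
    $found = [];
    foreach ($suspect as $name) {
        if (!empty($server[$name])) {
            $found[] = $name;
        }
    }
    return $found;
}

// Hypothetical policy: known good proxies pass, anything else that
// announces itself as a proxy gets flagged for further processing.
function needsFurtherChecks(array $server, array $knownGood): bool
{
    $ip = $server['REMOTE_ADDR'] ?? '';
    if (in_array($ip, $knownGood, true)) {
        return false; // on the known good list, let it through
    }
    return count(proxyHeadersFound($server)) > 0;
}
```

Typical use would be something like needsFurtherChecks($_SERVER, $knownGood) near the top of the page, with the flagged IPs handed off to the slower tests.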

3. Test for Proxy Ports

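This looks simple but it's way more complicated to get right.

In PHP you can check to see if you can open port 80 on the incoming IP to see if it's an open proxy like this:

// Try to connect to port 80 on the visitor's IP, giving up after 5 seconds
$fp = @fsockopen($theIP, 80, $errno, $errstr, 5);
if ($fp) {
    // OPEN PORT: something is listening there, possibly a proxy
    fclose($fp);
}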

But that's very simplistic, as most don't use :80. They use :8080 and other weird port numbers like :3128 to avoid whatever the admins are currently blocking.
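
Extending the port 80 test to several ports might look like the sketch below. The port list is just the three numbers mentioned above, the 2 second timeout is a guess, and the function name openProxyPorts and the injectable $connect parameter are my own additions so the loop can be tried without touching the network. Keep in mind an open port only tells you something is listening; to confirm it's really a proxy you'd still have to send a request thru it.

```php
<?php
// Try each candidate port on $ip and return the ones that accepted a
// connection. $connect is a hypothetical hook for testing; by default
// it uses fsockopen() with a short timeout.
function openProxyPorts(string $ip, array $ports = [80, 8080, 3128], ?callable $connect = null): array
{
    $connect = $connect ?? function (string $ip, int $port): bool {
        $fp = @fsockopen($ip, $port, $errno, $errstr, 2);
        if ($fp) {
            fclose($fp);
            return true;
        }
        return false;
    };

    $open = [];
    foreach ($ports as $port) {
        if ($connect($ip, $port)) {
            $open[] = $port;
        }
    }
    return $open;
}
```

Since each closed or filtered port can cost you the full timeout, this is exactly the kind of thing you want running after the page has been served, not before.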

Not to mention, some proxies are very slow, so you want to do the exhaustive testing in post-page processing so you don't slow down the user experience on the front end of the page. You only have to do this once per IP, but if you test before serving the page, someone could think your website is down when the process takes too long. Worst case, you get a positive answer that it's a proxy after they've only accessed one page, and you block the next page.

Once you detect the proxy, add it to your proxy lists built in step 1 above and you'll never have to worry about this one again.

You may end up blocking IPs from colleges and universities, but remember our alma mater, good old FU.

4. Port Numbers Appended to Domain

The dumbest of the dumb append a port number to your domain name, which is easy to test for in the HTTP_HOST variable. The only exception I've had to make to this rule so far is for the poor dumb bastards still using prodigy.net.mx. It astonished me that Prodigy still existed, even as a name on a block of IPs!
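
A minimal sketch of the HTTP_HOST test, assuming your site runs on the default port so that any explicit port is suspicious. The function name and the $allowedPorts parameter are made up for illustration; exceptions like the one above would still need their own handling by IP.

```php
<?php
// True if the Host value carries an explicit port that isn't one you
// actually serve on. $allowedPorts is a hypothetical knob for sites
// that really do run on a non-default port.
function hostHasUnexpectedPort(string $httpHost, array $allowedPorts = []): bool
{
    if (!preg_match('/:(\d+)$/', $httpHost, $m)) {
        return false; // no port appended
    }
    return !in_array((int) $m[1], $allowedPorts, true);
}
```

In a live page the call would be something like hostHasUnexpectedPort($_SERVER['HTTP_HOST'] ?? '').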

5. Proxy Crawl Thru

What some of these dumb fuck proxy operators do is set up a directory, probably a clone of DMOZ or some shit, and cloak it to the search engines.

When you see things like Googlebot, Mediabot, Msnbot, etc. hitting your servers from outside their known range of IPs, it can only mean one of two things:
  1. Someone is trying to spoof the user agent to get onto your server
  2. The crawler is coming thru a proxy port

The best defense is to throw up an error or something in this case. I've had some pages hijacked by this nonsense, so you really don't want to serve up real pages in this event, as the search engines simply aren't that smart.

BTW, before serving up an error message, it's wise to do a reverse DNS lookup to make sure Googlebot isn't really on a new block of IPs owned by google.com.
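
One way to sketch that lookup in PHP is a reverse lookup, a check on the domain, then a forward lookup to confirm the hostname maps back to the same IP. The function names are made up, and the list of domains you pass in is an assumption you'd maintain yourself for each crawler you care about.

```php
<?php
// Does a reverse-DNS hostname fall under one of the expected domains?
function hostInDomains(string $host, array $domains): bool
{
    $host = strtolower(rtrim($host, '.'));
    foreach ($domains as $domain) {
        $domain = strtolower($domain);
        // Match the domain itself or any name ending in ".domain"
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

// Reverse lookup, domain check, then forward lookup to confirm the
// hostname really resolves back to the IP that hit your server.
function isRealCrawler(string $ip, array $domains): bool
{
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false; // malformed IP or no reverse DNS entry
    }
    if (!hostInDomains($host, $domains)) {
        return false;
    }
    $forward = gethostbynamel($host);
    return $forward !== false && in_array($ip, $forward, true);
}
```

For something claiming to be Googlebot you might call isRealCrawler($ip, ['googlebot.com', 'google.com']). The forward lookup matters because anyone who controls the reverse DNS for their own IPs can make it say whatever they want.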

Summary

Probably not as simple as you had hoped, but a couple of these techniques are very straightforward and will stop some level of the proxy nonsense without fear of blocking innocents.

Good luck trying this and may all your proxy requests bounce off your server like a rock skipping across a pond.

2 comments:

Anonymous said...

Bill,

Thanks for the info.

-JayW

Unknown said...

Hi,
I came up with similar concepts for blocking open proxies, but didn't have the bit of detail you showed here, ty. As well as checking for port 80 etc., as these could be open on some legitimate sites, I was considering checking for patterns of available ports and comparing that with profiles of real web sites to get a score-type result for the likelihood of it being an open proxy.

Tim