Tuesday, January 23, 2007

Google Still Crawling Proxy Sites and Hijacking Pages

Google has been crawling through cloaked pages on proxy sites for quite some time. It got better for a while, but suddenly there's a new rash of this crap and they don't seem to be able to stop it.

Here's an example of hijacked pages associated with a proxy. I'm not sure Google properly detects the pages as duplicate content, or else the listings would say supplemental. One search for a specific example did block the results, but look how high these proxies rank when you include the supplemental results. In another sample there were 2 proxy sites with a hijacked page in the top 20 results; just look for the word proxy in the URLs.

Obviously hijacking is still a problem years after it was originally reported. It's a little better than before, but the beast still exists in the underbelly of their SERPs.

The following is a sample of some recent proxy sites Google crawled through:

http://free4.hostrocket.com/~hzhostr/
http://www.78y.net/
http://www.akiyan.com/hrgn/
http://codedup.com/proxy/
http://stealthclick.com/phproxy/
http://www.anonymonline.com/
http://www.jamesh.us/cgiproxy/
http://www.anonoxy.com/ (redirected from lay-low.net)
http://www.proxymod.com/
http://myspaceproxy.gr/
There are a ton of proxies and Google may crawl through them as well, but I'm only tracking the ones that get linked to my sites. A couple of the above seem to be inactive at the moment. They are volatile and may resurface on those domains, but even their current 404 status doesn't remove the results from Google, which is very odd.

Now this one was quite amusing:
http://codedup.com/proxy/
Their hosting support left them a message:
I have chmodded nph-proxy.pl to 000 and changed it over to the root
user's ownership. Please do not install this script elsewhere or use
it on our service before contacting support. Thank you.
Well, it appears you've been spanked!

Why Google can't detect that they're being led astray by proxy sites cloaking links is amazing to me. Yahoo and MSN don't seem to have this problem to the same extent, as I rarely catch them crawling via a proxy, at least not as frequently as Google does.

NOTE: This post contains time-sensitive information and links to searches that report a current problem and may not be accurate in the future.

1 comment:

bob said...

Yes, but a responsible proxy owner would add this to their robots.txt
(which Google honours 100%):

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /cgiproxy/

or

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /browse.php # or whatever script or directory proxifies the pages

etc..
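
For anyone who wants to sanity-check rules like these before deploying them, here is a rough sketch using Python's standard urllib.robotparser. It only shows how that one parser reads the rules and says nothing definitive about how Google's crawlers behave. The host names proxy.example and victim.example are made up, and the trailing "*" on the Mediapartners-Google token is dropped because Python's parser compares agent tokens literally.

```python
from urllib.robotparser import RobotFileParser

# Rules adapted from the first variant in the comment above.
# Assumption: the trailing "*" after Mediapartners-Google is omitted,
# since Python's parser treats the agent token as a literal string.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /cgiproxy/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Hypothetical proxied URL; both host names are placeholders.
    proxied = "http://proxy.example/cgiproxy/nph-proxy.pl/http/victim.example/"
    for agent in ("Googlebot", "Mediapartners-Google"):
        verdict = "allowed" if parser.can_fetch(agent, proxied) else "blocked"
        print(f"{agent}: {verdict}")
```

With these rules the general crawler is kept out of the proxy script while the AdSense crawler can still fetch it, which is the split the robots.txt above is aiming for.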