Saturday, October 25, 2008

Google Analytics Finds Bandits and Proxies!

Google Analytics has a Hostnames report that most people overlook. It normally displays the hostname of the site your visitor landed on, like example.com. However, you'll probably notice a bunch of IP addresses and other interesting entries in this list, including sites that may have stolen your content!

To see what I'm referring to, log into your Google Analytics account and go to Visitors -> Network Properties -> Hostnames.

Many of the IPs listed will belong to Google or Yahoo translator services and the like, and you wouldn't want to block any of these. Other IPs and host names will be proxy servers in data centers you've probably never heard of, and possibly host names of places that have your stolen content posted!
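If the list is long, sorting it by hand gets old fast. Here's a rough Python sketch of how you might split the entries into your own hostnames, bare IPs, and everything else. It assumes you've copied the hostnames out of the report into a plain list; the function name and the sample hostnames are made up for illustration, and nothing here talks to Google Analytics itself.

```python
import ipaddress


def group_hostnames(entries, own_hosts):
    """Split Hostnames report entries into 'own', 'ip', and 'foreign' buckets.

    entries   -- hostnames copied out of the report (assumed to be plain strings)
    own_hosts -- the hostnames you actually serve your site from
    """
    own = {h.strip().lower().rstrip(".") for h in own_hosts}
    groups = {"own": [], "ip": [], "foreign": []}
    for entry in entries:
        host = entry.strip().lower().rstrip(".")
        if host in own:
            groups["own"].append(entry)
            continue
        try:
            # Parses both IPv4 and IPv6 literals; anything else raises ValueError
            ipaddress.ip_address(host)
            groups["ip"].append(entry)
        except ValueError:
            # Not yours and not an IP: caches, translators, proxies, or scrapers
            groups["foreign"].append(entry)
    return groups


if __name__ == "__main__":
    # Hypothetical sample data using reserved example names and addresses
    report = ["www.example.com", "203.0.113.7", "scraper.example.net"]
    for bucket, hosts in group_hostnames(report, {"www.example.com"}).items():
        print(bucket, hosts)
```

The "foreign" bucket is the one to read closely, and the "ip" bucket is the one to run through WHOIS before blocking anything.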

Now expand the date range for your report to show all your Hostname data as far back as Google has been tracking your site, and see what people have been doing with your site all this time.

It's probably not worth trying to block old individual proxy IPs, as proxy sites come and go all the time. Most likely, though, you'll find these IPs belong to data centers that host lots of servers, and perhaps that proxy has simply moved to a new IP, so now you have another data center you can block.
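To check whether an IP from the report already falls inside a data center range you block, something like the following Python sketch would do. The ranges here are placeholders (reserved documentation networks), not real data center allocations, and the names are hypothetical; you'd fill in the real ranges from your own WHOIS lookups.

```python
import ipaddress

# Placeholder block list: documentation-only networks standing in for
# whatever data center ranges you have decided to block.
BLOCKED_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def blocked_range_for(ip, ranges=BLOCKED_RANGES):
    """Return the blocked network containing ip, or None if it isn't covered."""
    addr = ipaddress.ip_address(ip)
    for net in ranges:
        if addr in net:
            return net
    return None


if __name__ == "__main__":
    for ip in ("203.0.113.99", "192.0.2.1"):
        hit = blocked_range_for(ip)
        print(ip, "already blocked via" if hit else "not covered", hit or "")
```

Any IP that comes back as not covered is a candidate for a WHOIS lookup, and possibly a new range for the list.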

Fun fun fun!

The list of actual host domain names, not the IPs, is what I found most useful, as a few of those turned out to be idiots who managed to scrape a page or two from my site and still had my Google Analytics tracking codes on their pages!

Enjoy this new toy while I start sending C&Ds to the idiots with my tracking codes still on their sites.

5 comments:

g1smd said...

I always have several profiles defined within each analytics account:

One has two filters that:
- include only stuff browsed on "www.domain.com" *and*
- exclude anyone that has a "staff" cookie.
This is the one used for normal operation.

One has a filter "everything not www.domain.com" which finds all the copies (as well as sees people looking at SE caches {where that is allowed} etc). This gets looked at a couple of times per month.

One has a filter that lists "everything browsed by people with a 'staff' cookie", just to see that staff are still cookied-up.

The last profile has no filters and shows "everything". It isn't used for anything in particular.

Staff have their own password-protected "start" page which has but one purpose - get the "staff" cookie on to their machine so they are no longer included in the stats. They are encouraged to always start there, just in case they now have a new browser and/or new PC.

IncrediBILL said...

Yes, I did see SE caches as well.

People do need to verify the IP with WHOIS to make sure they don't block Google or Yahoo services by accident.

IncrediBILL said...

FYI, before we confuse people about SE caches, what you see is your Google Analytics code being loaded from the SE cached page.

Since I use NOARCHIVE, there are no SE cache references in my hostname list, so I completely forgot to mention it.

Mitch Miller said...

Hi Bill -
I have found content many times,
never thought to look for Analytics too.

Thanks for posting this,
Mitch

Brandon Kelly said...

I discovered a duplicate site ... using this feature.

Same issue -- a proxy site duplicating my stuff that still has my Google Analytics code, address, and phone number!! Though they took the time to change the logo.

My Google Analytics traffic surged as soon as this site came online, though. It appears the traffic from the proxy site is pretending to be a Mozilla user, without any host names or location information.

I'm thinking I should block any requests where I can't resolve the host name.

I've informed Google of the rogue proxy site. Hopefully that helps.

Thanks for the great blog.