The other day I was tightening up my bot blocker security a bit, not only verifying that requests come from Google's IP ranges but also checking exactly which bots were asking for information, instead of the carte blanche approach of "if it's from Google, it must be good," which was a bullshit assumption.
Sure enough, I found something crawling my site at a pretty good pace today, and it was someone using the Google translator to scrape AND translate my site all at the same time.
Isn't that amusing!
Pretty sure it wasn't any type of Googlebot, as it didn't ask for robots.txt and requested things like "/#top", which Google doesn't try to crawl. Nor would a human in a browser send that request, since browsers strip the "#fragment" before the request ever goes over the wire, so it's a bad bot using a loophole.
So follow along kiddies to what I've done to date:
- Locked Googlebot access by known ranges of Google IPs to stop Googlebot spoofing
- Installed NOARCHIVE to stop scraping via Google's cache index
- Blocked PROXY servers when Google comes crawling through one to avoid page hijacking
- Tightened security to look specifically for Googlebot or Mediapartners only, to avoid nonsense via the web accelerator or the other dubious services they provide
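The first item above, locking Googlebot down to verified Google address space, can also be done without maintaining an IP list at all: reverse-DNS the claimed Googlebot's IP, check that the hostname lands in googlebot.com or google.com, then forward-resolve that hostname and confirm it round-trips to the same IP. Here's a minimal sketch in Python (the function name and domain tuple are my own, not from any particular blocker):

```python
import socket

# Domains that legitimate Google crawlers reverse-resolve into;
# anything else claiming to be Googlebot gets treated as a spoof.
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def is_real_googlebot(ip):
    """Verify a claimed Googlebot via reverse DNS plus forward confirmation."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)   # reverse: IP -> hostname
    except socket.herror:
        return False                            # no PTR record: fail it
    if not host.endswith(GOOGLE_DOMAINS):
        return False                            # wrong domain: spoofed UA
    try:
        forward = socket.gethostbyname(host)    # forward: hostname -> IP
    except socket.gaierror:
        return False
    return forward == ip                        # must round-trip to the same IP
```

The forward-confirm step matters because anyone can publish a bogus PTR record for their own IP; they can't make Google's DNS resolve that hostname back to it.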
What a joke Google, what a joke...
This is why I keep ranting about PROXY servers being bad: yet ANOTHER example of how any type of proxy, which is in effect what the Google translator is, can be exploited.
How can I prove to you it's a bot?
When bad behavior is detected, my bot blocker will CHALLENGE the requests with a captcha of some sort; it might be a simple one, might be a hard one. This crawler coming through the translator asked for 159 pages. Up to a point those were all unanswered captchas, then messages about being blocked for bad behavior, and it still kept going, asking for different pages one after another at a rapid pace.
CHALLENGE: 184.108.40.206 [hs-out-f136.google.com.] requested 159 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1),gzip(gfe) (via translate.google.com)"

Now some of you might point out that it could've been a lot of people going thru the proxy server at the same time trying to translate pages. That's easy for me to refute, as I track the proxy information, if present, when I log bogus page requests, and most of them came from the same IP address in Brazil.
Proxy Detected -> VIA=1.0 translate.google.com (TWS/0.9), 1.0 proxy.google.com:80 (squid)
name = 201-35-249-163.bnut3703.dsl.brasiltelecom.net.br.
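A proxy trail like the one above can be pulled straight out of the request headers: Google's translator announces itself in the standard Via header, and the originating client typically shows up in X-Forwarded-For. A rough sketch of that logging step (the header names are standard; the helper function is my own invention):

```python
def real_client(headers):
    """Extract the proxy chain and originating client IP from request headers.

    Returns (via_chain, client_ip); either may be None if the request
    didn't come through a cooperating proxy.
    """
    via = headers.get("Via")                  # e.g. "1.0 translate.google.com (TWS/0.9), ..."
    xff = headers.get("X-Forwarded-For")      # first entry is the original client
    client = xff.split(",")[0].strip() if xff else None
    return via, client

# The kind of request the translator proxy forwards:
hdrs = {
    "Via": "1.0 translate.google.com (TWS/0.9), 1.0 proxy.google.com:80 (squid)",
    "X-Forwarded-For": "201.35.249.163",
}
via_chain, client_ip = real_client(hdrs)
```

Counting bogus requests per *client* IP rather than per proxy IP is exactly what lets you tell one abuser in Brazil apart from a crowd of legitimate translator users.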
Oh joy, more shit to debug.
Thank you Google.
FYI, I asked Matt Cutts to pony up the actual IPs of Googlebot so I could be more precise, and his answer was:
IncrediBILL, I don’t think we’ve done so in the past because it changes from time to time, and we didn’t want to give bad/stale information.

Earth to Google, just post the damn IP list for all your crawlers and those of us using it for security will worry about updating our sites. Maybe you should announce new IPs with some lead time, like 7 days in advance, to give everyone a chance to update. Put the list in an XML file and we can automate updating our security, not a problem, really, as it's better than letting idiots scrape my site via the swiss cheese security on your translator!
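If Google ever published such a file, consuming it would be trivial on our end. A sketch of what that automation could look like, assuming a completely made-up feed format (the element names, CIDR ranges, and file layout below are all hypothetical; no such feed exists):

```python
import ipaddress
import xml.etree.ElementTree as ET

# Entirely hypothetical feed; Google publishes no such file today.
SAMPLE_FEED = """<crawlers>
  <crawler name="Googlebot">
    <range effective="2006-06-01">66.249.64.0/19</range>
  </crawler>
  <crawler name="Mediapartners">
    <range effective="2006-06-01">64.233.160.0/19</range>
  </crawler>
</crawlers>"""

def load_ranges(xml_text):
    """Parse the hypothetical crawler feed into network objects."""
    root = ET.fromstring(xml_text)
    return [ipaddress.ip_network(r.text) for r in root.iter("range")]

def is_allowed(ip, ranges):
    """True if the IP falls inside any published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)

ranges = load_ranges(SAMPLE_FEED)
```

Fetch the file nightly, rebuild the allow list, and the "stale information" excuse evaporates, especially with that 7-day lead time on new ranges.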