Friday, July 14, 2006

Scraping via Google Translator

The other day I was tightening up bot blocker security just a bit, so that it not only verifies requests are coming from the Google IP range but also checks specifically which bots are asking for information, instead of the carte blanche approach of "If it's from Google, it must be good," which was a bullshit assumption.

Sure enough, I found something crawling my site at a pretty good pace today and it was someone using the Google Translator to scrape AND translate my site all at the same time.

Isn't that amusing!

Pretty sure it wasn't any type of Googlebot as it didn't ask for robots.txt and requested things like "/#top" which Google doesn't try to crawl, nor would a human in a browser send that request, so it's a bad bot using a loophole.
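
Those two tells can be written down as a tiny check. This is only a sketch in Python, not the actual bot blocker code; the function name and the idea of passing in a list of requested paths are assumptions made for illustration.

```python
def crawler_red_flags(paths_requested):
    """Return the reasons a client's request history looks like a bad bot."""
    reasons = []
    # Browsers strip "#fragment" before sending a request, so a "#" in the
    # path suggests the client is replaying raw hrefs scraped from the HTML.
    if any("#" in path for path in paths_requested):
        reasons.append("fragment in request path")
    # Well-behaved crawlers ask for robots.txt; people in browsers don't, so
    # this only means something for a client already crawling at bot speed.
    if "/robots.txt" not in paths_requested:
        reasons.append("never asked for robots.txt")
    return reasons
```

Calling `crawler_red_flags(["/", "/#top", "/archive"])` reports both reasons, while a history that includes `/robots.txt` and no `#` paths comes back empty.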

So follow along, kiddies, with what I've done to date:

  • Locked Googlebot access by known ranges of Google IPs to stop Googlebot spoofing
  • Installed NOARCHIVE to stop scraping via Google's cache index
  • Blocked PROXY servers when Google comes crawling through one to avoid page hijacking
  • Tightened security to specifically look for Googlebot or Mediapartners only to avoid nonsense via the web accelerator or other nonsense services they provide

Then, after ALL THAT, I find Google has yet another vulnerability, which is the translator. It has probably been used to scrape me for months now, and they don't seem to care when someone is asking for pages at 1 second or less per page either.
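
The IP and user agent rules from the list above boil down to a small decision table. Here's a rough Python sketch under loud assumptions: the IP ranges are documentation placeholders (NOT real Google addresses), and the function name and return labels are made up for illustration.

```python
import ipaddress

# Placeholder ranges for illustration only (RFC 5737 documentation blocks).
# These are NOT real Google addresses; a real list has to come from somewhere.
ASSUMED_GOOGLE_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
# The only crawlers let through, matched as substrings of the user agent.
ALLOWED_CRAWLER_TOKENS = ("Googlebot", "Mediapartners")


def classify_request(remote_ip, user_agent):
    """Decide what to do with a request based on its IP and user agent."""
    ip = ipaddress.ip_address(remote_ip)
    from_google = any(ip in network for network in ASSUMED_GOOGLE_RANGES)
    claims_crawler = any(token in user_agent for token in ALLOWED_CRAWLER_TOKENS)
    if claims_crawler and from_google:
        return "allow"      # a named crawler from a known range
    if claims_crawler:
        return "block"      # spoofed crawler, or a crawl relayed via a proxy
    if from_google:
        return "challenge"  # some other Google service, e.g. the translator
    return "visitor"        # everything else goes through the normal checks
```

The third branch is the one that matters here: a request from a Google address that isn't one of the named crawlers no longer gets a free pass.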

What a joke Google, what a joke...

This is why I keep ranting about PROXY servers being bad: this is yet ANOTHER example of how any type of proxy, which in effect is what Google translator is, can be exploited.

How can I prove to you it's a bot?

When bad behavior is detected, my bot blocker will CHALLENGE the requests with a captcha of some sort. Might be a simple one, might be a hard one. But this crawler via the translator asked for 159 pages which, up to a point, were all answered with captchas it never solved, then with messages about being blocked for bad behavior, and it still kept going, asking for different pages one after the other at a rapid pace.

CHALLENGE: [] requested 159 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1),gzip(gfe) (via"
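
The challenge-then-block escalation can be sketched roughly like this in Python. It is not the real bot blocker: the class name, the `block_after` threshold, and the response labels are all hypothetical, and it assumes the client has already been flagged for bad behavior.

```python
from collections import defaultdict


class ChallengeTracker:
    """Counts pages requested by a flagged client that never solves a captcha."""

    def __init__(self, block_after=5):
        self.block_after = block_after  # hypothetical threshold
        self.unanswered = defaultdict(int)

    def on_request(self, client_key, solved_captcha=False):
        """Return the response for one request from a flagged client."""
        if solved_captcha:
            # A human got through, so forget the strikes against this client.
            self.unanswered.pop(client_key, None)
            return "serve page"
        self.unanswered[client_key] += 1
        if self.unanswered[client_key] > self.block_after:
            return "blocked for bad behavior"
        return "show captcha"
```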

Now some of you might point out that it could've been a lot of people going thru the proxy server at the same time trying to translate pages. That's easy for me to refute, as I track the proxy information, if present, when I log bogus page requests, and most of them came from the same IP address in Brazil.

CHALLENGE "Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-BR; rv:1.7.8) Gecko/20050511 Firefox/1.0.4,gzip(gfe) (via"

Proxy Detected -> VIA=1.0 (TWS/0.9), 1.0 (squid)
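
Telling "one scraper" apart from "lots of people behind the same proxy" comes down to tallying the challenged requests by who is really behind them. A minimal Python sketch follows. It assumes the proxy reports the client in an `X-Forwarded-For` header, which is an assumption and not something the log above shows; the function names are invented for illustration.

```python
from collections import Counter


def originating_ip(remote_ip, headers):
    """Best guess at who is really asking, plus whether a proxy was seen."""
    forwarded = headers.get("X-Forwarded-For", "")
    behind_proxy = "Via" in headers or bool(forwarded)
    # The left-most X-Forwarded-For entry is the client the first proxy saw.
    # It is self-reported, so treat it as a hint, not as proof.
    first_hop = forwarded.split(",")[0].strip()
    return (first_hop or remote_ip), behind_proxy


def tally_origins(challenged_requests):
    """Count challenged requests per originating address."""
    counts = Counter()
    for remote_ip, headers in challenged_requests:
        origin, _ = originating_ip(remote_ip, headers)
        counts[origin] += 1
    return counts
```

If `tally_origins(...).most_common(1)` accounts for nearly all of the challenged requests, the "crowd of translators" theory doesn't hold up.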


On top of all that, it looks like Google jacks up my javascript in the captcha when they run it thru the translator, so if a legitimate visitor, unlike the crawling asshole from Brazil, does something that invokes a challenge, they're just fucked as they can't break out.

Oh joy, more shit to debug.

Thank you Google.

You know what's real fucking hysterical about Google breaking my javascript captcha?

The cheap ass CGI proxy servers run by kiddies trying to get to MySpace from school don't even break my javascript, so this is truly some PhD worthy software that broke my shit.

FYI, I asked Matt Cutts to pony up the actual IPs of Googlebot so I could be more precise and his answer was:

"IncrediBILL, I don’t think we’ve done so in the past because it changes from time to time, and we didn’t want to give bad/stale information."

Earth to Google, just post the damn IP list for all your crawlers and those of us using it for security will worry about updating our sites. Maybe you should include new IPs with a lead time like 7 days in advance to give everyone a chance to update. Put the list in an XML file and we can automate updating our security, not a problem, really, as it's better than letting idiots scrape my site via your swiss cheese security on your translator!
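
To show how little work the automation would be, here's a Python sketch that reads such a list. The XML layout is entirely hypothetical (no such feed exists as far as this post knows): the element names, the `cidr` and `effective` attributes, and the sample addresses are all made up for illustration.

```python
import ipaddress
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical feed format; the addresses are documentation placeholders.
SAMPLE_FEED = """
<crawler-ips>
  <range cidr="192.0.2.0/24" effective="2006-07-01"/>
  <range cidr="198.51.100.0/24" effective="2006-07-21"/>
</crawler-ips>
"""


def load_ranges(xml_text, today):
    """Split the published ranges into those live today and those announced."""
    active, upcoming = [], []
    for node in ET.fromstring(xml_text.strip()).iter("range"):
        network = ipaddress.ip_network(node.get("cidr"))
        starts = date.fromisoformat(node.get("effective"))
        # Ranges with a future date are the advance notice: load them early.
        (active if starts <= today else upcoming).append(network)
    return active, upcoming
```

Run daily from cron, `load_ranges(feed_text, date.today())` would keep the allow list current, and the `upcoming` list is where the 7 days of lead time would show up.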

1 comment:

Yavo said...

haha.. thanks for that post man - really educational on how screwed up things can be if you are not actually into scraping content but the other way around - creating it. i will be launching my first article directory in the next few days, so am currently working on saving my ass from spam bots. hope you don't mind if I take an idea or two from your blog :)

keep em coming - i see you got just 2 posts for 2012. i recommend you to get the best out of your blog this year as you might experience a drop of visits for christmas, due to genocide, mass panic or inter-dimensional wormhole with the size of the Earth.