Sunday, July 23, 2006

Finding the Cellphone User Agent Rosetta Stone

I'm sure I could upgrade how I process user agents to accommodate all these mobile freaks that can't tear themselves away from the internet, but I'm not sure I want to mess with it.

To lock out most of the bad bots, I only allow user agents whose string starts with "Mozilla/" or "Opera/", which seems to work real well.
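
A minimal sketch of that prefix check, assuming Python (the post doesn't say what the filter is actually written in). The names ALLOWED_PREFIXES and is_allowed_user_agent are made up for illustration:

```python
# Prefixes named in the post; anything else gets locked out.
ALLOWED_PREFIXES = ("Mozilla/", "Opera/")


def is_allowed_user_agent(user_agent):
    """Return True only if the UA string begins with an allowed prefix."""
    # str.startswith accepts a tuple and checks each prefix in turn.
    return (user_agent or "").startswith(ALLOWED_PREFIXES)
```

Note that a string with "Mozilla/" buried in the middle still fails this check, which is exactly why so many phones get locked out.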

Well, unfortunately the assholes that make cellphones don't seem to give a shit about fitting into an easily identifiable group of browsers and have a bazillion user agents.

Something like this doesn't even fit the mobile user agent definitions:

HTC-8100/1.2 Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; PPC; 240x320) UP.Link/"
The only upside here is that this one tells us its screen size in the UA, which is useful to know.
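
Pulling that screen size out could look something like this. It's a rough sketch in Python that assumes the size always shows up as WIDTHxHEIGHT the way it does in the string above; the function name screen_size is hypothetical:

```python
import re

# Matches tokens such as "240x320"; assumes 2 to 4 digits per dimension.
SCREEN_SIZE = re.compile(r"\b(\d{2,4})x(\d{2,4})\b")


def screen_size(user_agent):
    """Return (width, height) if the UA advertises a screen size, else None."""
    match = SCREEN_SIZE.search(user_agent)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```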

Why can't all of these dickheads at least do ONE THING in the user agent that just screams out "THIS IS A WIRELESS DEVICE OR CELLPHONE", like prefixing them all with "WAP/" or something civilized like that, instead of making us know all the goddamn vendors and part numbers?

The closest I came to finding a reasonably identifiable fingerprint for a mobile device was looking for "Profile/MIDP", "MMP/" or "Configuration/CLDC", which seem to be a few good checks for most things mobile.
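
As a rough illustration, that fingerprint test boils down to a substring check. This is a Python sketch, not the actual filter, and the name looks_mobile is made up:

```python
# The three markers named above; matching any one counts as "mobile".
MOBILE_MARKERS = ("Profile/MIDP", "MMP/", "Configuration/CLDC")


def looks_mobile(user_agent):
    """Return True if the UA contains any of the known mobile markers."""
    return any(marker in user_agent for marker in MOBILE_MARKERS)
```

It only proves the string contains the marker, not that a phone sent it, and a UA like the HTC-8100 one carries none of the three markers, so it slips through.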

Just look at examples of all this gibberish:

Nokia6600/1.0 (4.09.1) SymbianOS/7.0s Series60/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.0

Samsung-SPHA880 AU-MIC-A880/2.0 MMP/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.1

SANYO-S750/2.130 UP.Browser/ (GUI) MMP/2.0

BlackBerry8700/4.1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/102

SonyEricssonP900/R102 Profile/MIDP-2.0 Configuration/CLDC-1.0

MOT-C650/0B.D2.23R MIB/2.2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0 (Google WAP Proxy/1.0

Vodafone/1.0/703SH/SHG001 Browser/UP.Browser/ Profile/MIDP-2.0 Configuration/CLDC-1.1 Ext-J-Profile/JSCL-1.2.2 Ext-V-Profile/VSCL-2.0.0

SCH-A950 UP.Browser/ (GUI) MMP/2.0

LGE-PM225/1.0 UP.Browser/ (GUI) MMP/2.0

SHARP-TQ-GX25/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.0 UP.Browser/ (GUI) MMP/2.0 UP.Link/

and on and on and on...

Yeah, I read the spec on wireless user agents, and it's just a big old fucking mess of gibberish that opens the door for scrapers to pretend they're cell phones with JavaScript disabled and scrape the fuck out of a website.

Well, you could argue that wireless devices don't request too many pages, so just limit their access. I'll counter that a scraper beats that containment method with a shitload of anonymous proxies and/or a small fleet of $2/month hosting accounts.

The problem with blocking proxies is that just about all of these freaking toys with browsers use proxy servers to convert web pages to a few lines of links and text, so blocking any old proxy they use will typically block the devices altogether.

It's a gaping hole that can barely be contained, and my best strategy to date is to only allow these devices access via IPs that resolve to wireless service providers, which is sketchy at best.
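
For illustration only, a reverse-DNS check along those lines might be sketched like this in Python. The carrier suffixes below are placeholders rather than real provider domains, and the names is_carrier_ip and default_reverse_lookup are hypothetical:

```python
import socket

# Hypothetical hostname suffixes; swap in the carriers you actually trust.
CARRIER_SUFFIXES = (".carrier-one.example", ".carrier-two.example")


def default_reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def is_carrier_ip(ip, reverse_lookup=default_reverse_lookup,
                  suffixes=CARRIER_SUFFIXES):
    """Return True if the IP's hostname ends with a trusted carrier suffix."""
    hostname = reverse_lookup(ip)
    if not hostname:
        return False
    # suffixes must be a tuple so str.endswith can try each one.
    return hostname.lower().rstrip(".").endswith(suffixes)
```

The lookup is passed in as a parameter so the check can be exercised without hitting DNS. This sketch trusts the PTR record as-is and does no forward confirmation of the hostname.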

Why is this sketchy?

Someone can scrape over a 3G network at speeds of 400-700 Kbps or better.

Not highly likely, but definitely possible and easily doable.

I'm getting annoyed because the tighter I make the noose, the more obvious its weaknesses become, thanks to the Swiss cheese that is the internet and all the bungling engineers.

It's a prime example of that old technology law:
"If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization"


Ocean said...

Bill, drop me a line. I have a few tricks you can use to tell the spoofers from the real mobile browsers, which will help cut down on false positives for most of the new mobile browsers.

GaryK said...

You sure get around Ocean. :)