Thursday, June 21, 2007

Javascript Cloaked Spam Pages Baffle Search Engines

Recently I ran across a large series of scraper sites that are the ultimate in cloaking openly to the search engines. The pages I see when I view the source are the same pages cached by the search engines, nothing special there, so a search engine crawling from outside its own IP range to check for cloaking would see the same page.

However, access those pages with javascript enabled and you are instantly redirected to a wide variety of affiliate pages. The trick is that these pages all contain a single embedded link to a heavily obfuscated javascript file that redirects you to the affiliate pages.
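To make the mechanism concrete, here is a stripped-down sketch of what such a script might look like. The URLs are invented and the obfuscation is reduced to trivial string splitting; the real files are scrambled far more heavily:

```javascript
// Hypothetical contents of the obfuscated file (e.g. r.js) that the cloaked page
// pulls in with a single embedded <script src="..."> link. The destination is
// hidden behind trivial string splitting here for illustration only.
var h = ["http://", "affiliate-offer", ".example", "/?id=123"];
// Only a javascript-enabled visitor ever executes this line; a crawler that just
// fetches and caches the HTML never sees the redirect.
window.location.replace(h.join(""));
```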

The scraping to build these cloaked pages came from 216.75.15.26, which is in the cari.net IP range:

OrgName: California Regional Intranet, Inc.
NetRange: 216.75.0.0 - 216.75.63.255

Just goes to show you that traditional cloaking is a thing of the past as the war has escalated into obfuscated javascript. The only way I see the search engines winning this war is to actually execute that javascript and see if the resulting action was to take the visitor away from the page.
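As a rough sketch of that check (using a modern scriptable headless browser, Puppeteer, purely for illustration), the idea boils down to letting the page's scripts run and comparing where you end up with where you asked to go:

```javascript
// Sketch only: load a page with scripting enabled and flag it if its javascript
// navigates the visitor off to another host.
const puppeteer = require('puppeteer');

async function looksLikeJsCloaking(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0', timeout: 15000 });
  const landedOn = new URL(page.url());  // where the scripts actually left us
  await browser.close();
  // If executing the page's javascript moved us to a different host, the copy a
  // non-executing crawler cached is not what a real visitor sees.
  return landedOn.hostname !== new URL(url).hostname;
}

// Example usage (hypothetical URL):
looksLikeJsCloaking('http://scraped-page.example/whatever')
  .then(flagged => console.log(flagged ? 'possible javascript cloaking' : 'looks clean'));
```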

Just goes to show that the people claiming here in comments recently that "Stealth crawling is necessary to keep honest webmasters honest" are out of their league and don't really know what the score is on the web. These sites aren't honest even when they're in plain sight, no stealth needed, they simply worked around it.

Wonder what they'll think up next?

6 comments:

Anonymous said...

The only way I see the search engines winning this war is to actually execute that javascript

This would mean SEs have to have a solution for every client-side scripting/coding, just to make sure the pages are not cloaked.

Not impossible, but cumbersome, and definitely would slow things down.

Anonymous said...

Au contraire, mon capitan ... it proves my point about stealth crawling. In fact, it means that search engines need to begin working by puppeting a normal Web browser, robotically driving it, from a random residential IP. The result is to truly see the same page a human does, right down to any Javashit shenanigans and evil popups. And of course to potentially get infected, so the search machines need to be firewalled from anything really valuable at the search engine's company. Of course, with virus detection they can detect sites trying to infect them and flag them as dangerous, something Google already tries to do.

Actually, this suggests the shape of the likely Google-killer: a distributed-computing search engine project. People with broadband always-on connections would donate a bit of computer time and bandwidth by running an agent that occasionally wakes up to puppet their browser, visit a few sites to index them, send some data, and shut down again while the machine is otherwise idle. In the process:

* Each machine treads very lightly, using only a little bandwidth at remote sites. The overall coordination divvies up web pages in such a way that pages resolving to the same IP address are trickled out over time slowly enough that even the collective mass of participating computers won't use a noticeable amount of any one site's bandwidth.
* The machines look like ordinary user-agents coming from ordinary residential IPs, because they are.
* The project gets an automatically honest view of the Web, and of any particular site.
* It's easy to get your own site into the index if you participate -- just activate the agent and tell it to visit your site.
* Participants who find a broken link or a site's changed since it was indexed can fix this just as easily.
* The agent would be designed with inbuilt safeguards against being misused to cause DoS, with a fixed timeout between visits to pages hosted at the same IP (see the sketch after this list).
* Incentives to participate would include being part of the Google-killer, plus the ability to easily get your own sites indexed and to get broken/outdated links corrected.
* The distributed nature of the spidering would be easier on the network than the heavy usage currently concentrated near Google's network addresses.
* The distributed nature of the spidering would make such a project cost a fraction of a traditional large-scale search operation, with only the index storage and maintenance being centralized.
* Using Kademlia or a similar technology, even the index need not be centralized. Then the central costs plummet to nearly zero, and the total cost is amortized over the participants. It would then be sensible to make participation a requirement for using the index.
* It would be a platform suitable for piggybacking research projects.
* If the index included aggregate opinion ratings from the user base, it could also provide content rating, and search rankings could tank for sites people felt did something nefarious -- including scraper sites.
* The downside is that participants might put their computers at enhanced risk of infection with spyware or whatever. A specialized, standards-compliant browser masquerading as IE might be used, one which lacks IE's many vulnerabilities and in particular cannot install software or run ActiveX controls or BHOs, but which would appear to remote sites to do so successfully, except that those sites would never receive any feedback from the "successfully installed" client-side components.
* A variation on the scheme would add the option for the agent to monitor a participant's normal Web surfing and quietly index the pages they visited behind the scenes. There would have to be some obfuscation to protect the privacy of the participant, so that who indexed which site couldn't be determined easily. The automatic background activity at idle times would provide additional deniability that the user visited that porn site or whatever. A distributed index would enable further obfuscations, after the fashion of Ian Clarke's Freenet project. In the simplest case, it would just be easy (or even automatic) to route the indexing data through Tor to obfuscate participant IP addresses.
* Freenet-style methods for making it robust in the presence of a few hostile nodes would be employed, given that something that became "the Google killer" would invite attempts to game the system in various ways.
* A fully distributed, Freenet-style (or even Freenet-derived?) crawler/index would have massive advantages. It would be indestructible short of a huge worldwide disaster. It would be resistant to all kinds of legal pressures or other attempts at forcible restraint, so it could index what certain people don't want indexed but it's in everyone else's interest to index. It could excerpt or even provide Wayback-style archives of web history, backups for dead pages, and the like without legal monkeywrenches being thrown into the works, benefiting everyone except the strict IP-controllists who hate the idea of not having total dictatorial control over the use of every byte they publish. Site operators get a faithful mirror (down to their revenue-generating ads, unless blocked at user request) when their site is down; users get more reliable sites...
* The index can also naturally map the occurrence and boundaries of any censorship. For example, the exact boundaries in space and the list of blocked sites for the Great Firewall of China would be mapped if there were many participants both in and out of China. Error messages, as well as sites popular elsewhere that no one there even tries to visit, would provide the data points.
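A minimal sketch of the per-IP politeness rule mentioned above, assuming a hypothetical agent written in javascript; the cooldown value and function names are invented for illustration:

```javascript
// The agent keeps a timestamp per host and refuses to fetch again until a fixed
// cooldown has passed, so even many participants together stay light on any one site.
const MIN_DELAY_MS = 60 * 1000;       // assumed fixed timeout between visits to one host
const lastVisit = new Map();          // host -> timestamp of the last fetch

function mayVisit(url, now = Date.now()) {
  const host = new URL(url).hostname; // a real agent would key on the resolved IP
  const last = lastVisit.get(host) || 0;
  if (now - last < MIN_DELAY_MS) return false;  // too soon; skip or requeue
  lastVisit.set(host, now);
  return true;
}

// Example: the second request to the same host inside the cooldown is refused.
console.log(mayVisit('http://example.com/a'));  // true
console.log(mayVisit('http://example.com/b'));  // false
```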

Dave said...

The outcome of the war against spam is not guaranteed.
Of course I hope the good guys win, but I wonder if it will take some kind of legal intervention (e.g., scraping, which violates copyright law, frequently goes unchallenged).

One problem with anonymous's idea: spammers would sign their own computers up as "ordinary residential users", and fake the data for their own websites.

IncrediBILL said...

Good Lord! I step out for a couple of hours and come back to find Anonymous is writing goddamn manifestos on my blog.

At least with quotes like "Au contraire, mon capitan" I can assume he's possibly a ST:TNG fan and not all bad.

Maybe I'll read the whole tirade and comment later when I have nothing better to do after a few hours of no-limit poker.

IncrediBILL said...

Well, I read those comments and there's a few minutes of my life I'll never get back.

Reads like a Marxist "Search Manifesto" for the internet.

Some of your ideas have already been tried here:

http://www.planet-lab.org/
http://www.majestic12.co.uk/

Neither is without vulnerabilities.

I was laughing so hard when I realized you want to put control over crawling the web into the very hands of the people causing the problems for Google in the first place that I figured I must be being pranked.

No major project with serious financial backing, especially your fantasy 'google-killer', could ever be effectively run off a distributed network without being infiltrated and abused by hackers and spammers, especially considering that the same network of computers could easily be infected with a botnet that could literally take over the distributed crawl and dictate the results.

Good one, you should write comedy.

Maybe you should focus your attention on real issues like "FBI Says at Least One Million Computers Infected by Botnets" before going off the deep end into fantasyland.

http://news.yahoo.com/s/afp/20070615/tc_afp/usitcrimehackers

Enjoy the light reading.

Anonymous said...

I never claimed there wouldn't be security issues to work out. But consider that Google's ranking uses the way pages are linked to, so a lot of how it ranks pages is already crowdsourced. The pages themselves are obviously crowdsourced. It's the logical next step to crowdsource the searching and even the index too. A few rotten apples trying to game the system will be swamped by the law of averages and the honesty of the majority in this case. If designed right, it would require compromising a majority of the machines to achieve anything more than a noticeable dip in speed. Even one million compromised machines out of however-many-billion is a joke in those terms. Also, indexing of voluntarily visited sites is supposed to come from manual browsing; when that activity (as opposed to assigned sites) shows characteristics of automation, such as dozens of pages visited in a second, it can be detected.
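A rough sketch of that detection heuristic, with the window and threshold values invented purely for illustration:

```javascript
// Flag a participant whose "manually browsed" page submissions arrive faster
// than a human could plausibly click.
const WINDOW_MS = 1000;            // look at the last second of activity
const MAX_PAGES_PER_WINDOW = 5;    // assumed threshold, purely illustrative
const recent = [];                 // timestamps of recent manual submissions

function looksAutomated(now = Date.now()) {
  recent.push(now);
  while (recent.length && now - recent[0] > WINDOW_MS) recent.shift();
  return recent.length > MAX_PAGES_PER_WINDOW;
}
```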

Oh, and you seem to be laboring under the delusion that "anonymous" is the handle of a single user, rather than meaning what it says. Get that brain checked; minor symptoms like that might be an early sign of something serious, like a tumor, and with those, as a rule, the earlier it's caught and treatment begins, the better the prognosis.