Tuesday, June 05, 2007

TextDigger Caught Using Stealth Shovel

Some semantic search thing called TextDigger stumbled into my spider trap today.

I have nothing against semantic search. I'm not an anti-semantite (that's not the word you think it is; read it twice, I made it up just to be punny), but I'm definitely anti-stealth crawler.

According to the bot blocker, TextDigger requested 136 pages after being challenged while using the following user agent:

64.124.138.164 [nat1.textdigger.com]
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Here's their IP range:

TextDigger MFN-B849-64-124-138-160-28 (NET-64-124-138-160-1)
64.124.138.160 - 64.124.138.175

Not sure if what hit my server was their actual main crawler or not, but they aren't gaining any brownie points with me by crawling in stealth, whatever the reason.
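For anyone who wants to check their own logs against that range, here is a minimal Python sketch. It is not part of the bot blocker mentioned above; the names `TEXTDIGGER_NETS` and `in_textdigger_range` are made up for illustration, and the only data it uses is the range quoted in the post.

```python
import ipaddress

# The range quoted in the post: 64.124.138.160 - 64.124.138.175
FIRST = ipaddress.ip_address("64.124.138.160")
LAST = ipaddress.ip_address("64.124.138.175")

# summarize_address_range yields the CIDR blocks that exactly cover the range.
# TEXTDIGGER_NETS is a hypothetical name, not something from the bot blocker.
TEXTDIGGER_NETS = list(ipaddress.summarize_address_range(FIRST, LAST))


def in_textdigger_range(ip: str) -> bool:
    """Return True if the given IPv4 address falls inside the quoted range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TEXTDIGGER_NETS)


if __name__ == "__main__":
    print(TEXTDIGGER_NETS)
    print(in_textdigger_range("64.124.138.164"))
```

The range collapses to a single /28 block, which matches the "-28" at the end of the network name in the listing.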

10 comments:

Anonymous said...

Stealth crawling is necessary to keep honest webmasters honest. Otherwise they can show search engine bots whatever they want to see, even while presenting humans with something different.

And it already happens, quite a lot. A whole bunch of sites show Googlebot the entire contents of their site, but show the hapless humans that click a link from a Google search only a login page and try to extort a spammable email address, other personal information, or even money from them before letting them read. In effect, they make Google their unwitting accomplice in a bait-and-switch scam. (The "bait" is seemingly freely available information; the "switch" is the sudden demand that the user pay money, or at least accept their spam.)

Google can prevent this in basically three ways.
1. They can simply rely on non-automated spot checks, user complaints, and the like. Obviously this is a game of whack-a-mole with Google always playing catch-up. Also obviously, if they depend on user complaints they've already failed, since each user complaint means someone ALREADY got dinged by the bait-and-switch.
2. They can crawl from random IP addresses and through proxies. The evil sites cannot simply let a particular IP range see their content and force everyone else to pay up. Instead they have to let anything crawling as "Googlebot" in, which lets humans freely read anything Googlebot sees, as it should be, but only if they jump through hoops to spoof their UA as "Googlebot" for the sites that otherwise try to scam them. It also has the effect that any crawler can get a free pass by calling itself "Googlebot", whether it really is or isn't. Webmasters can't tell who or what is really crawling them.
3. Or Google can stealth its crawler completely, using various IPs and UA strings. This isn't any worse for webmasters, who still can't tell who or what is really crawling them, but it's better for users. Sites can't discriminate based on UA *or* IP without risking offending the almighty Google, so they have to be honest and present the same view of their site to human visitors as they do to Google. Users don't have to jump through UA-spoofing hoops; Google users can be confident that the search results and excerpt text they see really do reflect what they'll find at the other end of the link. Sites can still lock content away behind registerwalls, but they now have to pay the price that what they lock up doesn't get indexed by search engines and drive traffic; they can no longer bait and switch, nor have their cake and eat it too. If they want to use evil barrier methods to monetize their traffic, then they have to pay to attract that traffic by advertising, either with sponsored links at search engines or banner ads or even radio or TV. If they monetize their traffic in other ways such as banner ads (or not at all) THEN they can get free search engine placement and the resulting free traffic. (And of course they can still put some stuff out in the open to entice people to visit their site, then on their own site offer more to people who register. But the enticing stuff that the search engines index has to be freely available now; no bait and switch.)

So your being against stealth crawling is tantamount to being in favor of webmasters being able to rip off users, force Google and other search engines to be accomplices in bait-and-switch scams, and generally being able to double-dip, get free traffic without giving anything back to that traffic for free, and act all evil.

I hope this doesn't mean that your own business model is based on feeding Googlebot googlebait and visitors a "please register" page. If it is, then shame on you!

IncrediBILL said...

OK, you ran off topic about Google as my post was about TextDigger, not Google at all.

You can justify it any way you want, but stealth crawling is just the slimy thing robots do when they want inside with or without permission, or without detection.

Stealth crawlers STEAL and reuse content or SPY on sites, all sorts of things that make them unwanted on my server so who gives a shit what they do to others, it's what they do to ME that makes me block 'em.

Of course they fake out Google and other crawlers, but if I stop them from faking their crawl on my site they don't have my data to cloak to Google.

As a matter of fact, crawling in stealth may be illegal trespass in California.

I ran the law across a lawyer's desk and they agreed with my interpretation, so it's just a matter of time to find the right case to test this theory in court.

Besides, if the government can put a spammer up on IDENTITY THEFT charges for simply using the names in his mail list as the FROM address, my hypothesis isn't too far-fetched.

IncrediBILL said...

One last thought on your comment...

No web crawler needs to run 100% stealth to check a site.

The crawler only needs to randomly access 1-2 pages to verify the integrity of the site, no more, so there is ZERO justification for a full-scale stealth crawl.
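As a rough illustration of that kind of spot check, here is a Python sketch. It assumes the crawler can fetch a page two ways (once identified as a bot, once looking like a browser) and compares the results for a small random sample. The function name `spot_check`, the two fetch callables, and the exact-match comparison are all assumptions for illustration; real pages vary between requests, so an actual check would need a fuzzier comparison.

```python
import random
from typing import Callable, Iterable, List, Optional


def spot_check(
    urls: Iterable[str],
    fetch_as_bot: Callable[[str], str],
    fetch_as_browser: Callable[[str], str],
    sample_size: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the sampled URLs whose content differs between the two fetches.

    fetch_as_bot and fetch_as_browser are hypothetical callables supplied by
    the caller; this sketch does no network access of its own.
    """
    rng = rng or random.Random()
    candidates = list(urls)
    # Only a handful of pages are sampled, never the whole site.
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    return [url for url in sample if fetch_as_bot(url) != fetch_as_browser(url)]


if __name__ == "__main__":
    # Toy stand-ins for real fetches: /a is cloaked, /b is not.
    bot_view = {"/a": "full article text", "/b": "hello"}
    browser_view = {"/a": "please log in", "/b": "hello"}
    print(spot_check(["/a", "/b"], bot_view.get, browser_view.get))
```

Passing the fetchers in as arguments keeps the sampling logic separate from however the pages are actually retrieved.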

Forrest said...

Pardon my ignorance, but exactly what is stealth crawling?

Obviously, people have strong feelings about this, as we've seen: "So your being against stealth crawling is tantamount to being in favor of webmasters being able to rip off users, force Google and other search engines to be accomplices in bait-and-switch scams, and generally being able to double-dip, get free traffic without giving anything back to that traffic for free, and act all evil."

Already I can tell that's a major logical leap without being sure exactly what people are talking about ... but it seems I might ought to learn more.

IncrediBILL said...

Forrest, in a nutshell there are bots that identify themselves and bots that try to hide by pretending to be someone using Internet Explorer or Firefox. Those bots that pretend to be MSIE or Firefox are considered stealth crawlers because they're trying to fly in under the radar without being identified and they ignore all the rules of the game (robots.txt) that civilized internet crawlers obey.

Unfortunately, stealth is as stealth does, and most stealth does some pretty stupid shit since a rudimentary PHP or Perl script can easily identify and block the majority of them.
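To give a feel for how rudimentary such a check can be, here is a sketch in Python rather than PHP or Perl. It is not the actual bot blocker; the trap path, the class name, and the rule itself (a client that claims to be MSIE or Firefox but requests a URL that robots.txt disallows and no human would see) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical URL prefix: disallowed in robots.txt and linked invisibly,
# so neither a polite crawler nor a human should ever request it.
TRAP_PATH = "/trap/"

# Browser tokens named in the comment above.
BROWSER_TOKENS = ("MSIE", "Firefox")


@dataclass
class StealthDetector:
    """Flags IPs that claim to be a browser yet wander into the trap."""

    flagged: Set[str] = field(default_factory=set)

    @staticmethod
    def claims_browser(user_agent: str) -> bool:
        return any(token in user_agent for token in BROWSER_TOKENS)

    def observe(self, ip: str, path: str, user_agent: str) -> bool:
        """Record one request; return True if this IP should be blocked."""
        if path.startswith(TRAP_PATH) and self.claims_browser(user_agent):
            self.flagged.add(ip)
        return ip in self.flagged


if __name__ == "__main__":
    detector = StealthDetector()
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
    print(detector.observe("64.124.138.164", "/index.html", ua))
    print(detector.observe("64.124.138.164", "/trap/page1", ua))
```

Once an IP is flagged, every later request from it is reported as blockable, whatever page it asks for.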

Anonymous said...

The other side of the story is: a site that knows the search crawlers from the humans can give the search crawlers better access, then smack the humans with a demand for money when they follow the search engine links, and other evil stuff like that. So you get for example Google hits with interesting, relevant-looking excerpt text, click the link, and see ... a login page and a bunch of credit card company logos. No sign of the excerpt text anywhere. You've just been bait-and-switch scammed, and Google's just been made an unwitting accomplice in the scam! Only crawling as "Mozilla 4.0 (compatible; MSIE ...)" from a random IP address can guarantee a search engine sees and indexes the same thing that a human would see visiting the page, and therefore that the search engine's organic results are honest. Anyone who wants to be a hit for content they've locked behind a paywall or registerwall can pay to take out an ad or be a "sponsored link", but doesn't deserve free referrals.

IncrediBILL said...

That's not the other side of the story at all, that's called a PAY site. They exist: sites like the Wall Street Journal.

Not sure what your point is because not everything on the 'net is free, never has been, never will be.

I'd prefer to know the data I need exists, paid or not; whether I pay to see it is up to me, not Google or anyone else.

However, if it's a scam site I can simply do a chargeback and get my money back and they get dinged $35 for the chargeback to boot, so the scammers will get hosed, not the end user, so again, who gives a shit?

Anonymous said...

I have no problem with pay sites per se. But they shouldn't get free referrals. If they want to be top hits in Google's rankings they should have to do one of two things -- either buy a paid placement "sponsored link" from Google, or provide some free content that Google indexes and that drives traffic their way, but that humans can freely read, and which gets "bodies in the shop" that might decide they are interested in the members-only stuff (or might not).

However, showing Google stuff that humans can't freely read is cheating. Your free traffic from organic search rankings should come from your free content, and only from your free content.

This applies regardless of what search engines, besides Google, might be catered to.

IncrediBILL said...

If I remember correctly, Matt Cutts has recently said that those sites must show at least 1 page freely to the visitor or they'll be considered cloaked, so I'm pretty sure this issue has already been decided.

Anonymous said...

Yeah -- the login page.

I want the stricter requirement: nothing appears in the excerpt text or otherwise as a "teaser" that isn't either freely viewable with one click on the accompanying search engine link or part of a paid ad instead of an organic hit.

IOW, paid ads can contain excerpts from registration-required content, so long as these aren't deceptive (the excerpted-from text is actually available if you register), but free ads (i.e. organic placement) can only contain excerpts from no-registration-required content. Click the link, see the text from which it was excerpted.

Fair's fair. Getting free stuff requires giving something freely.