Monday, March 26, 2007

WorldWebWide Scrapes LookDumb

Today my content tracking bugs led me to something on worldwebwide.net that originated from LookSmart.

The IP address where the data was originally crawled from:

60.88.242.64 -> sv-crawlfw4.looksmart.com
This is nothing new, as LookSmart seems to be a scraping target; I've already reported the same thing happening with GoodBidWords.com, which contained scraped LookSmart listings.


8 comments:

Anonymous said...

All this bot blocking stuff is going to be academic real soon; you know that, right?

Thing is, before long all the major search engines are going to be stealth crawling. They'll have to, just to present honest search rankings to their users that are based on what humans browsing the results would actually see.

The big weakness of the current system, where search bots come from recognizable IPs and use familiar UA strings, is that site operators can easily configure their servers to show search engine bots whatever they want them to see, and something entirely different to humans: content to a bot, but a login page and a "Register using Paypal" button to a human, or content to a bot and a bunch of ads to a human. Some of these sites show scraped content to the bot; some show original stuff, but lock it behind some sort of barrier when a human comes knocking. There are other bait-and-switch tactics too, such as showing something relevant to bots and a political rant to humans.

Ultimately, the only way for search engines to base their rankings on what the sites actually provide to humans who visit is for them to see the same thing humans do, and the only way to do that is to pass for one of them. So they must stealth crawl, at least to double-check the results of regular crawls, and then why not just stealth crawl all the time and do half the work?
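For illustration, this is roughly all it takes on the site operator's side (a minimal sketch assuming a Python WSGI app; the UA tokens, IP prefix, and page text are made up, not anyone's real configuration):

    # Minimal sketch of the UA/IP cloaking described above.
    # The bot signatures and page bodies are illustrative only.
    BOT_UA_TOKENS = ("googlebot", "slurp", "msnbot")
    BOT_IP_PREFIXES = ("66.249.",)  # e.g. a published crawler range

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        ip = environ.get("REMOTE_ADDR", "")
        looks_like_bot = (any(t in ua for t in BOT_UA_TOKENS)
                          or ip.startswith(BOT_IP_PREFIXES))

        if looks_like_bot:
            body = b"<html>the full article text, free for the crawler to index</html>"
        else:
            body = b"<html>Register using Paypal to read this article.</html>"

        start_response("200 OK", [("Content-Type", "text/html"),
                                  ("Content-Length", str(len(body)))])
        return [body]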

So before long, trying to detect and block all stealth crawlers will, even if it succeeds, only earn you a self-inflicted gunshot wound to the foot.

OTOH, the current obsession with control over site content and gathering accurate visitor stats is just a passing phase. The current central-server-per-site implementation of the web is unstable against a superior alternative that will mature technologically in the next decade: distributed hash tables with transparent caching and proxying. Those will automatically scale to meet demand, so being smothered in traffic (e.g. slashdotted) goes away as a problem (there's a toy sketch of how the load spreads at the end of this comment). The costs of hosting something become tiny, too, or rather get distributed over the viewership: by operating a node that gives access to the DHT-web, viewers provide some caching, proxying, and storage themselves, which means people pay for the content they browse by helping to host it with their own storage, CPU, and bandwidth. No more need for ad revenues and accurate stat tracking.

Of course, this is also around the time that quaint notions like "copyright" become of more interest to historians than to practicing attorneys. Already de facto unenforceable, and increasingly under fire from critics from every point on the political spectrum, it'll surprise me if it lasts as long as 2020 in any meaningful form.
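Here's that toy sketch of the load-spreading (this is not any real DHT's API; systems like Chord or Kademlia add routing and replication on top): every key lands on whichever node's ID is closest, so each node that joins takes a share of the storage and traffic.

    import hashlib

    # Toy illustration of how a DHT spreads keys over participating nodes.
    def key_for(data):
        return int(hashlib.sha1(data).hexdigest(), 16)

    def responsible_node(key, node_ids):
        # Each node owns the keys nearest its own ID (XOR distance,
        # Kademlia-style); more nodes joining means less load per node.
        return min(node_ids, key=lambda node: node ^ key)

    nodes = [key_for(("node-%d" % i).encode()) for i in range(8)]
    page_key = key_for(b"http://example.com/some/page")
    print("%040x" % responsible_node(page_key, nodes))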

IncrediBILL said...

Never mind, it's late.

Obviously anonymous was a scraper.

Anonymous said...

If you're taking any kind of advanced debating classes, I suggest dropping those courses, as it's highly improbable that you'll pass with debating skills like those. You didn't address a single point raised in the comment to which you were responding! For example, that major search engines will have to stealth crawl. If they won't, how will they avoid being patsies for the deceptive ploys mentioned in the original comment?

I should note here that I've seen this stuff too, particularly with Google. Text in Google's summary that doesn't occur anywhere on the page linked to, for instance. I've seen several pages where something appears briefly and is then almost instantly replaced by means of "meta refresh" directives in the source. The something is jumbled phrases with popular query terms strewn all over -- obvious Googlebot bait -- and the redirect goes to some advertisement. Google can't detect this sort of thing unless it does sophisticated things like follow redirects and pretend to be human.

So what happens if Google doesn't pretend to be human? Sites like those quickly learn to send different content to Google's IP ranges to get whatever search rankings they want. People complain, Google employees manually demote the ones that get investigated, and the vast majority get away with it -- Google is relegated to playing whack-a-mole while the moles breed like rabbits. Google quickly becomes useless. Google goes out of business. Google shareholders sue the CEO and Board of Directors. Or the latter avoid all this by having Google at least sometimes spider from oddball IP ranges disguised as Firefox or IE. And then if you block Google from crawling this way, either you don't get indexed or it looks like you're trying to deceive Google about what people will see when they surf your site. Either way you lose...
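To be concrete about the meta refresh trick: catching it means at least parsing out the refresh and seeing where a human actually ends up, instead of just indexing the bait text. A rough sketch of that check (Python; the regex, URL, and wording are purely illustrative):

    import re
    import urllib.request

    # Rough sketch: fetch a page the way a plain crawler would, then look for
    # the "meta refresh" that bounces humans somewhere else entirely.
    META_REFRESH = re.compile(
        r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
        r'content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)',
        re.IGNORECASE)

    def check_for_bait(url):
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        match = META_REFRESH.search(html)
        if match:
            print("Indexed text is bait; humans get bounced to", match.group(1).strip())
        else:
            print("No meta refresh found")

    check_for_bait("http://example.com/keyword-stuffed-page")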

I'm not as qualified to comment on the copyright stuff, or the hash stuff. But if there is a way to share the hosting load across everyone browsing and make a web that's immune to slashdotting, vandalism, individual sites going down, access discrimination, and problems of that nature, then it is likely to supersede the current web, much the way the web superseded gopher -- remember gopher?

BTW, your captcha code is badly broken. It often rejects the first attempt but accepts the second, when both were typed correctly. Sorry I can't tell you precisely what's wrong with it, but it might be related to the amount of time that passes between typing in the code and hitting the submit button, or to whether additional text is entered in the top form after the code is entered. Have your experts take a look at it and fix it; it's annoying.

IncrediBILL said...

I only address points that I think have any validity, but I'll humor you with a couple of comments just because I'm bored and have nothing better to do at the moment than address you, as your anonymous post is the most important thing to happen since Apache displayed its first web page.

Googlebot doesn't need stealth, as Google can use the community to rank bad SERPs the same way DIGG works, without any stealth whatsoever. Even if Google did go stealth, they would only need to hit one page at random, or a couple per domain, to see if anything funny is going on, so I don't see complete stealth crawls anytime soon; they aren't needed to spot problems.
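For what that kind of one-page spot check would look like (a sketch only; the UA strings and the similarity cutoff are arbitrary, and a real check would compare extracted text rather than raw HTML):

    import difflib
    import urllib.request

    # Sketch of a one-page spot check: fetch the same URL as a bot and as a
    # "human" browser, then see whether the two versions roughly agree.
    def fetch(url, user_agent):
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        return urllib.request.urlopen(req).read().decode("utf-8", "replace")

    def looks_cloaked(url):
        as_bot = fetch(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
        as_human = fetch(url, "Mozilla/5.0 (Windows; U; Windows NT 5.1)")
        similarity = difflib.SequenceMatcher(None, as_bot, as_human).ratio()
        return similarity < 0.5  # wildly different pages -> something funny

    print(looks_cloaked("http://example.com/"))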

Indeed, you aren't qualified to comment on the hosting load issues, because caching makes dynamic, database-driven sites, which many big sites are, work poorly; that's why many already use directives to tell cache servers not to retain their content.

When you start caching pages, you eliminate the ability to deliver personalized pages based on geo-targeting, personal preferences, search patterns, etc.
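The directives I mean are ordinary response headers, and personalization is exactly what forces them on you; a minimal sketch with Python's standard library (the page body and port are made up):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Minimal sketch of a dynamic, per-visitor page telling caches to back off.
    class PersonalizedPage(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"<html>Hello, visitor from your geo-targeted region</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            # Tell intermediate caches this response is for one user only.
            self.send_header("Cache-Control", "private, no-store, max-age=0")
            self.send_header("Vary", "Cookie")  # responses differ per session
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), PersonalizedPage).serve_forever()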

Needless to say, it won't work.

Besides, if you can't figure out that the captcha belongs to Blogger, and not me, how can I take the rest of your posts on technical issues seriously?

Anonymous said...

This is hardly a better effort than your last one. For one thing, you assumed that the two earlier posts were from the same author. For another, you did not consider that blocking even "1 or 2 page" stealth crawls will stop Google from detecting fraud, and that it's much cheaper and more reliable to build a good DB from the outset by stealth crawling than to build a bad DB, crowdsource complaints, and spend eternity playing whack-a-mole against one deceptive site at a time while, every time you smack one down, two more spring up. Look at how terrible some Google searches have become, littered with obvious junk, and you will realize that Google's current approach simply won't scale any further.

Finally, I don't suppose it occurred to you that a) basic presentation of content doesn't require dynamic, DB-driven anything, and b) just making the DB itself directly accessible in a distributed manner, and having a smart client put the pieces together, should eventually displace current dynamic approaches? Plus, by using hash tables you get automatic versioning and journaling for free. Of course, that won't sit well with the types that want to hijack people's surfing to shove ads down their throats, or rewrite history and completely erase what they said in the past, or whatever, but what the hey. Those types will keep running old-fashioned centralized WWW sites with fewer and fewer users while the majority gravitate to a faster, more reliable platform that gives them more control over the experience. Market forces will do the rest.
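The versioning point in miniature (a toy content-addressed store, not any real DHT's API): because the key is the hash of the content, an edit can never overwrite the old version; it just gets a new address.

    import hashlib

    # Toy content-addressed store: the key IS the hash of the content,
    # so every edit produces a new key and old versions stay retrievable.
    store = {}

    def put(content):
        key = hashlib.sha256(content).hexdigest()
        store[key] = content
        return key

    v1 = put(b"My post, first draft.")
    v2 = put(b"My post, quietly rewritten after the fact.")
    assert v1 != v2                               # the edit got a new address
    assert store[v1] == b"My post, first draft."  # history is still there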

As for the captcha, while of course it's provided by your hosting provider, it is still appearing on your web pages, and it is still failing to work correctly. Take some responsibility! Complain to them if they are providing shoddy tools. You of all people should believe you deserve better from your hosting provider.

IncrediBILL said...

Good grief, you like to babble endlessly about nonsense.

It's a shame the captcha isn't completely broken.

Anonymous said...

a$$wipe

IncrediBILL said...

>> there are two morons who read this blog...

Actually, I think there's 1 moron that writes this blog and a few hundred morons that read it.

Some of them are even brave enough to post comments! :)