Friday, March 24, 2006

WebCorp Crawls Why?

Saw this entry on my blocked log a couple of days in a row now:

03/24/2006 193.60.130.67 "WebCorp/1.0"
So I looked it up and sure enough it's WebCorp (webcorp.uce.ac.uk).
Went to their website and they have some mental mind fuck linguistic mashup running that they call SEARCH, and it's the slowest thing I've seen since I used a Commodore 64.

However, it claims to cull results from Google, AltaVista, Metacrawler and AllTheWeb, so why in the hell is this thing attempting to crawl my website if my content isn't even being taken into consideration for their results?

Sorry, you PhDs in linguistics are just too smart for me so maybe you have some higher purpose for attempting to crawl that just escapes us bot blocking neanderthals.

Here's a phrase you cunning linguists might know - "fuck off"

Thursday, March 23, 2006

How Clever Yet so STOOOOOPID!

These assholes that crawl my site must think I'm looking for browser-like behavior such as image loading, or else they're using an API to drive Firefox and run a complete browser session, ads and all, in an effort to completely mask their tracks.

Now let's follow the fun antics of this crawler:

63.197.247.13 - - "GET /robots.txt HTTP/1.1" 200 111 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"

Reading robots.txt with a browser?

I should've stopped you here but I didn't, just because I know you won't get far, 10-20 pages tops, and since I'm a bit on the sadistic side, I've been letting crawlers like you continue on past robots.txt lately just to see how well the rest of my traps are working.

STRIKE #1

Now you access a page the robots.txt told you not to use?

More importantly, you can only see this page name if you're looking in robots.txt or finding it hidden in my HTML. Normal visitors don't see this page.

63.197.247.13 - - "GET /dont_click_this_page.html HTTP/1.1" 200 12895 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"

STRIKE #2

Then your stupid-ass program continues to attempt to load 60 more pages, not noticing that it's getting a captcha after stepping into the spider trap and, after a few pages of that, error messages telling you that YOU'VE BEEN BUSTED.

STRIKE #3 - YOU'RE OUTTA THERE!

Which just goes to show that even the ones that do things that would bypass my 3 page robot stopping technique still get stopped in record time. Not only that, they could've been stopped before the first page as reading robots.txt is a cardinal sin for a browser.

However, if I'd stopped them that fast, I wouldn't have had the fun of getting more profile information from this type of crawler.

The harder they try, the harder they fall, as someone obviously went to some extremes to pull this off and still failed in the end.
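
For anyone who wants to play along at home, here's a rough sketch of how those three strikes could be scored from a list of requests. This is NOT my actual bot blocker, just a minimal Python illustration. The function names, the "Mozilla/" browser test and the 3-page threshold are all made up for the example; the only real detail is the trap page name from the log above.

```python
# Hypothetical sketch only: not the real bot blocker.
# The trap page name comes from the log excerpt above.
TRAP_PATHS = {"/dont_click_this_page.html"}


def looks_like_browser(user_agent):
    # Crude assumption: anything claiming "Mozilla/" is posing as a browser.
    # A real filter would have to whitelist legitimate crawlers that also
    # start their user agent string this way.
    return user_agent.startswith("Mozilla/")


def find_strikes(requests, max_pages_after_trap=3):
    """Return {ip: [reasons]} for a sequence of (ip, path, user_agent)."""
    strikes = {}
    pages_after_trap = {}
    for ip, path, user_agent in requests:
        seen = strikes.get(ip, [])
        if path == "/robots.txt" and looks_like_browser(user_agent):
            # Strike 1: browsers have no business reading robots.txt.
            reason = "browser UA read robots.txt"
        elif path in TRAP_PATHS:
            # Strike 2: the page is only named in robots.txt or hidden HTML.
            reason = "requested trap page"
        elif "requested trap page" in seen:
            # Strike 3: still crawling after walking into the trap.
            pages_after_trap[ip] = pages_after_trap.get(ip, 0) + 1
            if pages_after_trap[ip] < max_pages_after_trap:
                continue
            reason = "kept crawling after trap"
        else:
            continue
        if reason not in seen:
            strikes.setdefault(ip, []).append(reason)
    return strikes
```

Feed it the two log lines above plus a few more page requests from the same IP and that IP comes back with all three reasons, while a normal visitor who never touches robots.txt or the trap page doesn't show up at all.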

Cyveillance Keeping An Eye on My Blog

You tell the world Cyveillance is snooping on your website and sure enough here they come snooping on my blog too.

Referring Link: http://blogsearch.google.com/blogsearch?hl=en&q=cyveillance&ie=UTF-8&scoring=d
IP Address: 65.213.208.155
Country: United States
Region: Virginia
City: Arlington
ISP: Cyveillance
We know you're watching us watching you spy on us.

FYI, they also seem to have this range of IPs:
CYVEILLANCE UU-65-213-208-128-D4 (NET-65-213-208-128-1)
65.213.208.128 - 65.213.208.159
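
If you want to check visitors against that range, here's a minimal Python sketch using the standard ipaddress module. The names BLOCKED_NETWORKS and is_blocked are just made up for the example; the two end addresses are the ones listed above.

```python
import ipaddress

# The first and last addresses of the range listed above.
FIRST = ipaddress.ip_address("65.213.208.128")
LAST = ipaddress.ip_address("65.213.208.159")

# Collapse the range into CIDR blocks (this one happens to be a single /27).
BLOCKED_NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def is_blocked(ip_text):
    """True if the dotted-quad address falls inside a blocked network."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in net for net in BLOCKED_NETWORKS)
```

The visitor above, 65.213.208.155, lands inside the range, while 65.213.208.160 is the first address outside it.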

Wednesday, March 22, 2006

Stop that bot, 3 pages or LESS

Some days you just wake up and BINGO! you have the best damn idea you've had in a long time. That's right, something so simple and as plain as the nose on my face just slapped me upside the head today and I'm positive I can shut down bots masquerading as humans within 3 pages or less of starting a crawl.

However, it was a real bonus day for me as I came up with not one but TWO new techniques to add to my arsenal of bot blocking weapons. The only downside is that both of these tricks require changes to the web pages in order to work, but it's well worth the trouble if it can stop bots dead in less than 3 pages.

We'll be testing for a week or so to see if it's really as effective as I think it is and to verify it's not snaring humans. I'll let you all know how it works, but I'm REALLY excited as this KICKS ASS so far!

Anyone know a good patent attorney that's reasonably priced? ;)

More Fun With Stupid Bots

The website I'm protecting from all these idiots uses JavaScript navigation for some pull-down menus, and a couple of the scrapers appear to be scraping the JavaScript itself, looking for embedded URLs.

Unfortunately for them, but lucky for me, their code is mildly brain damaged and doesn't bother checking whether the extracted text is a valid URL, so these morons are trying to request page names like "/getURLfromMenu" from the server.

Now that I've noticed this little tidbit I went back and checked my bot blocker's error logs and it's already stopped this from about 40 different IPs in the last couple of weeks. A couple of the others attempting this had invalid user agents that haven't been able to get into the site for many months, meaning they must have scraped that name long ago, so this had been going on a LONG time before I caught them.

At least I have another bit of criteria to add to my instant block list.

Stealing a line from Seinfeld with apologies to the Soup Nazi:
NO SCRAPE FOR YOU!
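
Here's roughly what that extra block criterion could look like, as a minimal Python sketch. The SCRIPT_ONLY_NAMES set and the is_script_scrape function are hypothetical names for illustration, not what my blocker actually runs; the only real value is the "/getURLfromMenu" request mentioned above.

```python
from urllib.parse import urlsplit

# Hypothetical list: names that exist only inside the site's JavaScript
# and never as real pages, so no human visitor would ever request them.
SCRIPT_ONLY_NAMES = {"getURLfromMenu"}


def is_script_scrape(request_target):
    """True if the last path segment is a name lifted from the JavaScript."""
    path = urlsplit(request_target).path
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    return last_segment in SCRIPT_ONLY_NAMES
```

Any request where this returns True is an instant block candidate, since the only way to come up with that page name is to scrape it out of the script.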

Monday, March 20, 2006

Clone Wars

Who are all these wannabe assholes?

I've been the original IncrediBILL online since I first got my hands on a 300 bps modem and if you don't believe ME, then ask my wife FRANtastic!

Most of these fuckers were probably hopping from ball to ball just to keep from landing in their Dad's tissue or being the glue sticking the pages of the Playboy together, or maybe shitting yellow in a diaper at the time.

Look at this shit, IncrediBill's to the left, IncrediBill's to the right, the fuckers are crawling out of the goddamn woodwork!

For the love of god make up your own fucking names.

Sunday, March 19, 2006

Educating the Public About Scraping

After talking to a lot of people lately, many of them webmasters and Silicon Valley internet savvy types, it has become obvious that they are simply oblivious to the entire problem with rogue bots and scrapers. Most people I've been discussing this with are aware of crawlers and they're aware of things like robots.txt, but they're completely in the dark about what goes on when bots bypass the so-called standards. Ultimately, they leave the conversation with a new level of fear about the security of their online content and run to the nearest console and start searching for unauthorized usage of their content which, as we all know, they typically find without too much trouble.

The real eye-opener for most that aren't building sites that thrive off Webmaster Welfare™ (aka AdSense) seems to be the entire AdSense economy that fuels the bottom-feeding scraper sub-culture that Google has unwittingly created. Once they understand the motivation, not only is it clear to them why scrapers scrape, but many wonder why they didn't think of it first! Then it's obvious that the low hanging fruit has universal appeal and everything on the net is fair game for the unethical types that pluck that fruit at any cost.

So the question that remains, after this small sampling of industry savvy folks, is how widespread is the blissful ignorance of this pandemic?

Wonder how many people learn something about this for the first time when they hit this web site and just think I'm a paranoid loon with a tinfoil hat dancing with a flute celebrating the summer solstice?

I suspect the depth of the problem is not known by most, even by webmasters fighting one-off copyright infringement. Those that have even a hint think it's being overblown, and from what I've seen in my banned log files over the last week, it will get a lot worse before it gets better.