Thursday, March 23, 2006

How Clever Yet so STOOOOOPID!

These assholes that crawl my site must think I'm looking for browser-like behavior such as image loading, or else they're using an API to drive Firefox and completely mask their tracks by running a full browser session, ads and all.

Now let's follow the fun antics of this crawler:

63.197.247.13 - - "GET /robots.txt HTTP/1.1" 200 111 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"

Reading robots.txt with a browser?

I should've stopped you here, but I didn't because I know you won't get far, 10-20 pages tops. Since I'm a bit on the sadistic side, I've been letting crawlers continue on past robots.txt lately just to see how well the rest of my traps are working.

STRIKE #1
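For anyone who wants to spot this in their own logs, here's a minimal sketch of the check, assuming log lines shaped like the one above. The function name, the regex, and the list of browser tokens are made up for illustration and are not the code running on this site.

```python
import re

# Hypothetical list of substrings that suggest a browser user agent.
BROWSER_TOKENS = ("Firefox/", "MSIE ", "Safari/", "Opera/")

# Pulls IP, request, status, size, referrer and user agent out of a log line
# shaped like the one shown above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) .*?"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def browser_read_robots(line):
    """Return the client IP if a browser-like agent fetched /robots.txt, else None."""
    m = LOG_RE.match(line.strip())
    if not m or m.group("path") != "/robots.txt":
        return None
    if any(token in m.group("agent") for token in BROWSER_TOKENS):
        return m.group("ip")
    return None

line = ('63.197.247.13 - - "GET /robots.txt HTTP/1.1" 200 111 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) '
        'Gecko/20060111 Firefox/1.5.0.1"')
print(browser_read_robots(line))  # prints 63.197.247.13
```

A real browser has no reason to ask for robots.txt on its own, so any IP this returns is worth watching even before it touches a page.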

Now you access a page the robots.txt told you not to use?

More importantly, you can only see this page name if you're looking in robots.txt or finding it hidden in my HTML; normal visitors never see this page.

63.197.247.13 - - "GET /dont_click_this_page.html HTTP/1.1" 200 12895 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"

STRIKE #2
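The trap idea can be sketched roughly like this: treat every Disallow path in robots.txt as bait and flag any client that requests one. The robots.txt content below is an assumed example (the real file isn't shown here), and both helper names are hypothetical.

```python
def disallowed_paths(robots_txt):
    """Collect Disallow paths from robots.txt text, lumping all user-agent groups together."""
    paths = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths

def hit_trap(request_path, traps):
    """True if the requested path starts with any disallowed prefix."""
    return any(request_path.startswith(prefix) for prefix in traps)

# Assumed robots.txt content, for illustration only.
ROBOTS = """User-agent: *
Disallow: /dont_click_this_page.html
"""

traps = disallowed_paths(ROBOTS)
print(hit_trap("/dont_click_this_page.html", traps))  # prints True
print(hit_trap("/index.html", traps))                 # prints False
```

Since no human ever sees a link to the bait page, a hit on it means the client either read robots.txt and ignored it or scraped the hidden link out of the HTML.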

Then your stupid-ass program keeps trying to load 60 more pages, never noticing that you're getting a captcha after stepping into the spider trap and, after a few pages of that, error messages telling you that YOU'VE BEEN BUSTED.

STRIKE #3 - YOU'RE OUTTA THERE!
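The escalation described above (trap page, then captchas, then error pages) could look something like the sketch below. The class name, the in-memory bookkeeping, and the threshold of three captcha pages are all assumptions; "a few pages" is as specific as this post gets.

```python
from collections import defaultdict

CAPTCHA_PAGES = 3  # assumption: captcha pages served before the hard block

class SpiderTrap:
    """Hypothetical per-IP escalation: page -> captcha -> blocked."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.trapped = set()             # IPs that have stepped into the trap
        self.strikes = defaultdict(int)  # requests seen per IP since the trap hit

    def respond(self, ip, path):
        """Return 'page', 'captcha' or 'blocked' for this request."""
        if ip in self.trapped:
            self.strikes[ip] += 1
            return "captcha" if self.strikes[ip] <= CAPTCHA_PAGES else "blocked"
        if path in self.trap_paths:
            # Serve the trap page itself, then challenge everything after it.
            self.trapped.add(ip)
        return "page"

trap = SpiderTrap({"/dont_click_this_page.html"})
ip = "192.0.2.1"  # documentation address, not a real crawler
responses = [trap.respond(ip, "/index.html"),
             trap.respond(ip, "/dont_click_this_page.html")]
responses += [trap.respond(ip, f"/page{i}.html") for i in range(5)]
print(responses)
# prints ['page', 'page', 'captcha', 'captcha', 'captcha', 'blocked', 'blocked']
```

This sketch leaves out the part where a human who solves the captcha gets released; a crawler that never looks at what it's being served just rides the counter straight into the block.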

Which just goes to show that even the crawlers that do things to bypass my 3-page robot-stopping technique still get stopped in record time. Not only that, they could've been stopped before the first page, since reading robots.txt is a cardinal sin for a browser.

However, had I stopped them that fast, I wouldn't have had the fun of getting more profile information from this type of crawler.

The harder they try, the harder they fall: someone obviously went to some extremes to pull this off and still failed in the end.

2 comments:

Niels said...

Hi Uncle Bill :)

So you serve the "visitor" a captcha when you suspect it's a bot?

Without giving away too much information, could you make a post about all the techniques/traps you use? (with spicy anecdotes, of course)

"Uncle Bill's hunting diary" I love it.

IncrediBILL said...

Most of it's already in the blog scattered around, and spiced to taste, mine, not yours ;)

If you read enough you'll either figure it out, be totally confused, have a sick headache, or all of the above.