Sunday, March 05, 2006

Best Bot Ever - Almost!

Tonight I saw the future of sneaky bots, and it did everything it could to look like a human, so I'm thinking they used a developer toolkit to drive the crawl via MSIE. This thing downloaded images and banner ads from 3rd party servers, ran JavaScript, and even accessed AdSense ads, so it was as convincing as anything you can imagine.

Spooky and amazing in how well it cloaked itself.

I couldn't tell by looking at the log files either, very impressive.

Then, after patiently crawling for a whopping 11,097 seconds, or more than 3 hours for those of you that can't divide 11,097 / 3600 in your head, it exceeded my max page count, which is set fairly high.

Then it proceeded to very slowly and stealthily ask for 20 more pages after being told it had exceeded its daily limit of pages.

BUSTED!

However, the point is that if it had been just a bit smarter and realized it was getting a repeated error page, I'd have never known it wasn't a human.

Not a good turn of events whatsoever!

Nasty.

5 comments:

Nebraska said...

Is it possible that your site has become the ultimate testing grounds for new bots? If I were running bots I would try to find out your site and constantly probe the defenses.

thebear said...

For crying out loud Bill, a simple kiddy script can drive a normal browser and you know what normal browsers do.

Yep, that's correct, they download images, run both Java and JavaScript, and if directed to follow links would clicky on banners etc.

I have a script that displays a webpage and snapshots it so I have a picture of the page, I think you can find examples on the web.

nebraska,

Yep Bill is setting himself up as a bot defense circumvention proving ground. But he is up to the task.

IncrediBILL said...

Of course a kiddie script could drive this, but where it strays from being a kiddie script is how it accessed the content: very random, looked like a human with a purpose doing research.

If they hadn't kept asking for random pages after the fact I would've never noticed, because a human who's trapped in my snare hits RELOAD, RELOAD, backs up a page, hits RELOAD and storms off all pissed off.

The bot just kept moving forward looking for new things without knowing it was just snared, that's the difference.
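To make that difference concrete, here is a rough Python sketch of the idea, not the actual code running on the site. The class name, the tiny page limit and the threshold of 5 new pages are all made up for illustration. A snared visitor who keeps reloading pages already seen stays merely "limited", while one who keeps asking for pages never requested before gets flagged.

```python
from collections import defaultdict

# Hypothetical numbers: the real limit is only described as "fairly high".
DAILY_PAGE_LIMIT = 3
NEW_PAGES_AFTER_LIMIT = 5


class SnareTracker:
    """Tracks what each visitor does after hitting the daily page limit."""

    def __init__(self, limit=DAILY_PAGE_LIMIT, threshold=NEW_PAGES_AFTER_LIMIT):
        self.limit = limit
        self.threshold = threshold
        self.served = defaultdict(int)      # pages actually delivered
        self.seen = defaultdict(set)        # URLs delivered before the limit
        self.new_after = defaultdict(set)   # distinct new URLs asked for while snared

    def record(self, visitor, url):
        """Return 'ok', 'limited' (gets the error page) or 'bot'."""
        if self.served[visitor] < self.limit:
            self.served[visitor] += 1
            self.seen[visitor].add(url)
            return "ok"
        # Over the limit. Reloading a known page adds nothing here, and
        # reloading the same blocked page only ever counts once.
        if url not in self.seen[visitor]:
            self.new_after[visitor].add(url)
        if len(self.new_after[visitor]) >= self.threshold:
            return "bot"
        return "limited"
```

A set of distinct URLs is used rather than a plain counter so that a frustrated human hammering RELOAD on one page never reaches the threshold, no matter how many times they hit it.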

Besides, kiddies are impatient, this was a very good fake, best I've ever seen.

Still not good enough but NEXT TIME?

Nebraska said...

I am curious if you notice any performance issues with all the bot traps you have running?

IncrediBILL said...

There are potential performance problems, but my site is dynamic in the first place so opening TWO MORE databases and doing a couple more updates is kind of trivial.

Then again, getting knocked offline for 90 minutes from a high speed scraper is a HUGE performance problem which hasn't happened in many months since I started busting their asses.

If you have a static site you might notice a little extra bump, but I'm refining the techniques now: doing some simple work on the front end of the page before delivery, and the time-consuming tasks at the back end after the page has been flushed to the visitor.

The net effect of moving the heavy duty computing to the back end is that a scraper might get a handful of pages before the front end task can identify it as something to halt.
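A minimal Python sketch of that front end / back end split might look like the following. It is only an illustration under assumed names (`TwoStageFilter`, `handle`, `run_backend` and the `analyze` callback are all hypothetical), with an in-memory set and queue standing in for the databases.

```python
from collections import deque


class TwoStageFilter:
    """Cheap check before delivery, expensive analysis after the flush."""

    def __init__(self, analyze):
        # analyze(visitor, url) -> True if the visitor should be halted.
        # Stands in for the slow, heavy duty work (hypothetical callback).
        self.analyze = analyze
        self.blocked = set()
        self.pending = deque()

    def handle(self, visitor, url):
        # Front end: a single set lookup, so the page is barely delayed.
        if visitor in self.blocked:
            return "403"
        # Defer the expensive work until after the page has gone out.
        self.pending.append((visitor, url))
        return "200"

    def run_backend(self):
        # Back end: runs after the response has been flushed to the visitor.
        while self.pending:
            visitor, url = self.pending.popleft()
            if self.analyze(visitor, url):
                self.blocked.add(visitor)
```

Anything requested between two `run_backend()` passes is still served, which is exactly the handful of pages a scraper gets away with before the front end check starts refusing it.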

Lots of ways to do this, but I'm still trying to approach it in such a way that it's viable to be installed in a stand alone web hosting account vs. something installed server wide, so the average joe webmaster will be able to take advantage of this without a webhost knowing or caring about it being installed.

Bottom line is it's a trade-off: a little more CPU juice is used per page, but it averages out because a lot LESS is used over the course of the day by the abusers on my site.