Wednesday, March 22, 2006

More Fun With Stupid Bots

The website I'm protecting from all these idiots uses JavaScript navigation for some pull-down menus, and it appears that a couple of the scrapers are trying to scrape the JavaScript itself, looking for URLs embedded in the script.

Unfortunately for them, but luckily for me, their code is mildly brain damaged and doesn't bother parsing the extracted information to see if it's a valid URL, so these morons are trying to access a page name from the server like "/getURLfromMenu".
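Here's a rough sketch of how a check like that could look. This is not my actual bot blocker, just a minimal illustration in Python, and it assumes "getURLfromMenu" is an identifier lifted from the menu script rather than a real page. The names JS_TRAP_NAMES and is_script_name_request are made up for the example.

```python
# Hypothetical trap list: identifiers that only exist inside the site's JavaScript.
# Only "getURLfromMenu" comes from the post; a real list would be built from
# the site's own scripts.
JS_TRAP_NAMES = {"geturlfrommenu"}


def is_script_name_request(path: str) -> bool:
    """Return True when the last path segment is a known JavaScript identifier."""
    # Drop any query string, then any trailing slash.
    without_query = path.split("?", 1)[0].rstrip("/")
    # Keep only the last segment, since a scraper may resolve the bogus
    # name relative to whatever directory it was crawling.
    last_segment = without_query.rsplit("/", 1)[-1]
    return last_segment.lower() in JS_TRAP_NAMES


if __name__ == "__main__":
    for requested in ("/getURLfromMenu", "/index.html", "/menus/getURLfromMenu?x=1"):
        verdict = "block" if is_script_name_request(requested) else "allow"
        print(requested, verdict)
```

No real browser would ever request one of those names as a page, so a hit is a pretty safe signal that something is blindly harvesting strings out of the script.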

Now that I've noticed this little tidbit, I went back and checked my bot blocker's error logs, and it has already stopped this from about 40 different IPs in the last couple of weeks. A couple of the others attempting this had invalid user agents, which means they didn't scrape the script recently, since those user agents haven't been getting into the site for many months. So this had been going on a LONG time before I caught them.
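For anyone who wants to run the same kind of check on their own logs, here is a minimal Python sketch. The log format is an assumption made for the example (one request per line, IP first, requested path second, separated by whitespace), and tally_trap_hits is a made-up name. The sample IPs are from the reserved documentation ranges, not from my logs.

```python
from collections import Counter


def tally_trap_hits(log_lines, trap_path="/getURLfromMenu"):
    """Count requests for the bogus path, grouped by client IP.

    Assumes a hypothetical log format: "<ip> <path> ..." on each line.
    """
    hits = Counter()
    for line in log_lines:
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) >= 2 and parts[1] == trap_path:
            hits[parts[0]] += 1
    return hits


if __name__ == "__main__":
    sample = [
        "192.0.2.1 /getURLfromMenu",
        "192.0.2.1 /getURLfromMenu",
        "198.51.100.7 /index.html",
        "203.0.113.9 /getURLfromMenu",
    ]
    hits = tally_trap_hits(sample)
    print(len(hits), "distinct IPs requested the bogus path")
```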

At least I have another criterion to add to my instant block list.

Stealing a line from Seinfeld with apologies to the Soup Nazi:
NO SCRAPE FOR YOU!

4 comments:

Dan said...

Hey Bill - I've heard that the box men are creating smarter bots that can get past most bot checks. Some are even making their way into IRC chat sessions - have you heard anything about this?
Thanks,
D

IncrediBILL said...

They haven't come up against my bot checks yet, but I'll block 'em where they stand.

IRC chat, eh? Nothing I'm working with, so let 'em in :)

Anonymous said...

Bill, I'm not a webmaster, but need a small website "Locked down" or blocked from spiders or other agents (not crawlers from regular search engines). Is there someone who can do this, a business or some software?
Thanks, Mary

IncrediBILL said...

Mary,

If you're on Linux there are a bunch of .htaccess files out there that will keep most of the crap out. Sadly, nobody's doing what I'm doing yet and stopping the ones that MASK as a browser, which is a large portion of the crawlers these days.

Keep your eyes on this site; when it goes public, we'll let you know.