Looksmart using Nutch?

When I was looking thru the blocked bot log today I ran across a single Nutch hit that caught my attention, which upon closer inspection appears to be Looksmart playing with Nutch, not even identifying themselves as Looksmart.


This was the entry:

03/11/2006 07:07:16 BAD_AGENT "NutchCVS/0.05 (Nutch;;" "index.html"
So I looked it up and there they were:
Address: Non-authoritative answer: name =

Open Source replacing jobs in failing companies?

Think someone got fired in the search dept. down there, if you can call it that.

Stupid Bots Can't Parse

The other day I posted about the idiots trying to locate "/#" and "/#top" but I didn't even notice a couple of nasty bots that aren't handling SGML or HTML properly and are leaving things like "&amp;" in the URIs.

Well, that just made my life a WHOLE BUNCH EASIER as a couple of the worst offenders are really sloppy like that so at the moment I've got at least 5 snares running just on their idiocy alone.
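
A minimal sketch of that kind of snare, assuming Python and a raw request path pulled from the access log. The function name and the marker list are hypothetical, not the code actually running here:

```python
# Hypothetical snare: flag request paths that only a sloppy parser would send.
# A real browser decodes "&amp;" to "&" before requesting a link and never
# sends the "#fragment" part of a URL to the server.
SLOPPY_MARKERS = ("&amp;", "#")


def looks_like_sloppy_bot(request_path: str) -> bool:
    """Return True if the raw request path contains a tell-tale marker."""
    return any(marker in request_path for marker in SLOPPY_MARKERS)
```

Feed it "/#top" or "/page.php?a=1&amp;b=2" and it flags the request; a properly decoded "/page.php?a=1&b=2" sails through.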

BTW, if I haven't mentioned it lately, that little bastard gnootBot is still attempting to crawl my site all these days later. Persistent little fucker, I'll give it that much.

Saturday is Scrapeday™?

Did I miss a memo?

Was Saturday designated Scrapeday™ and nobody told me?

Today's attackers should've been filmed and released onto video as "Bots Gone Wild" as there were as many as 5 at a time hitting me that were masking as a browser, well masking the user agent only, not their bad behavior.

The boys over a must've noticed they were being stopped and tried to send a new can of whoop ass from the following IPs:

I'm so done with them now that anything coming from is just going to get the big FUCK YOU from first contact.

The other thing that caught my attention today was several scrapers that used multiple IPs to avoid detection, big shock, but were being tracked by the cookie again, which stopped the morons as they hit several landmines. It's just priceless to think someone goes to all the trouble to use what most likely was a series of proxies to avoid detection but doesn't dump the cookie when the IP switches to a different network.
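
The cookie trick is easy to sketch. This is only an illustration under my own assumptions: the class name `CookieTracker` and the `max_ips` threshold are made up, and a real tracker would also have to forgive legit visitors whose IP changes now and then:

```python
from collections import defaultdict


class CookieTracker:
    """Hypothetical tracker: remember every IP seen for each tracking cookie."""

    def __init__(self, max_ips: int = 3):
        self.max_ips = max_ips  # assumed threshold, tune to taste
        self.ips_by_cookie = defaultdict(set)

    def record(self, cookie_id: str, ip: str) -> bool:
        """Record a hit; return True once the cookie has hopped across too many IPs."""
        self.ips_by_cookie[cookie_id].add(ip)
        return len(self.ips_by_cookie[cookie_id]) > self.max_ips
```

Call `record()` on every request and treat a True result as "same visitor, too many networks", no matter how many proxies were burned along the way.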

God I love idiots, they just make my work easier.

Unipeak Privacy Scraper

Don't you just love all these so-called privacy proxy sites like these monkey spankers over at Unipeak that claim to "Filter out unwanted advertising such as banners and popups" when in fact they insert their own unwanted ads on the page AFTER TAKING MINE OFF!

They appear to be downloading your pages from so just block their asses.

Friday, March 10, 2006


We're moving into the next phase of bot busting and need your help!

Anyone out there willing to open their access logs to let the Mad Scientist take a peek by letting my scraper analyzer comb a month's worth of data to see what kind of shit is hitting your fan?

I would be interested in other sites that generate traffic from 100K-400K visitors a month on this first pass but won't turn down 1M+ visitors either. Your site and traffic will be kept 100% confidential with only vague statistical data comparisons used from the final analysis.

The purpose of this experiment is to see if you're being abused by the same IPs and agents, to see if there is commonality in the spider traps being used, etc. and if not, WHO is abusing you and would my current scripts stop them dead in their tracks.

Basically, this is a quest for help to build the bot blocker Rosetta Stone.

In return you'll get a report of all potentially bad activity happening on your server that you could take action against immediately.

If you wish to contact me privately and anonymously, not in blog comments, then I suggest private mail to IncrediBILL on SearchEngineWatch, WebmasterWorld and ThreadWatch.

When we have enough sample subjects we'll post a comment here that we're closed for submissions.

Be a part of internet history that's about to unfold, submit your logs today!

Thursday, March 09, 2006

Matt Cutts Confirms People Are Stupid

Rarely do I write about something posted on someone else's blog but Matt's post about "How to sign up for WebmasterWorld" is absolutely priceless. According to Matt the problem arises when he references people to WebmasterWorld and then people write to him asking if there is a way to get into WebmasterWorld for free because the login screen implies you have to pay to join.

Let's give Brett Tabke kudos for such brilliant marketing to underplay the free registration to access WebmasterWorld because there is a lot of valuable information over there and after all, he's running a business and not a charity. He once told me how much he pays a month for his servers and bandwidth and it's a shitload. I can't really say I blame Brett for making people think they need to open their wallets to get inside.

After a small amount of begging and pleading with Matt to kill the thread, I realize he's just too nice a guy to the people he's trying to help and isn't concerned that he's foisting idiots onto WebmasterWorld that can't even figure out a simple IQ test to get inside.

According to Matt:

if I wanted to post a pointer to a WMW thread, I didn’t want to get “how do I do it?” questions. And I’ve seen that from some people with high IQ.
Well just how freaking smart can they be Matt?

If you put cheese in a maze even a rat can find it eventually if they're hungry enough so why coddle these whiners that can't even help themselves to a free registration?

Not that I mind helping people, but I draw the line at holding their hands and wiping their asses when all they need to do is read and click, it's nothing mind bending.

If you want coddling then Matt's your guy, and a nice one too, no question about it.

I'll continue challenging people to think for themselves as I'm a firm believer in that teach a man to fish theory.

Googlenoia [warning, major rant]

What the hell is wrong with all you people?

Don't you have a fucking life that doesn't revolve around Google?

The last few days it's nothing but AdSense is crashing, BigDaddy is broken, Google stock is tumbling, blah blah get a fucking life blah.

First, losing some traction in the search engines thanks to some PhDs having brain farts at the 'plex doesn't mean your AdSense is broken or any of the conspiracy theories you can concoct to add to the mystique of AdSense. People using AdSense dance around it like the monkeys banging on the monolith in the opening scenes of 2001: A Space Odyssey. Come on fucknuts, unwrap the tightly wound tinfoil hats and use that steaming heap of gray shit called brains and realize that trends change, advertisers and budgets wax and wane, Google tests things now and then and on top of the list SHIT HAPPENS.

Funny, once upon a time Yahoo stock was skyrocketing and like all things that go up it came down but for some reason Google is different and held to different standards. Well too fucking bad you pseudo-religious Google freaks, it's not different. Google made a lot of people a lot of money, some would call them filthy fucking rich, and if you were too stupid to buy early or sell high then FUCK YOU for being a moron so stop whining and move on. For what it's worth you should probably sell at a huge loss before you lose everything and you're homeless living in your goddamn car because it's not going to rebound, the wild ride is over.

Last but not least, those crybabies that come out in droves every time any search engine changes, especially Google, and you or your customers go up and down from #1 to #5 or heaven forbid your sorry ass slipped to PAGE TWO and you're not in the top 10 anymore.

Well for those of you whining about your SERPs I have 2 words:

With hundreds of thousands of pages competing for these top keywords you were lucky you got there in the first place and the fact that you couldn't survive a new way of indexing is just too damn bad. Nobody OWES you that position in the search engine and you don't have the right to demand getting it back, adapt and deal with it or just fuck off as I'm sick of hearing all your shit.

I'll bet conversely someone else is happy as shit they moved up and are dancing in the aisles getting all that free traffic that your whiny ass is bitching about as one man's tragedy is another man's blessing when it comes to search engines.

This Googlenoia is just getting so fucking old, can't we find something new to talk about?

Tell me about your latest project, found any good blogs lately?



Wednesday, March 08, 2006

Dumbest of the Dumb

Ever see a crawler look for "/#top" on your web site?

I'm still laughing over whoever wrote that shit.

The dumb fucker came back today 3 times and tried to get "/#top" every time.

Appears to be some stupid fucker from The Netherlands using DHCP as each access was from something like

Funny Funny Scraper Shit

Well, my scraper challenge page contains a URL in the sticky challenge loop that can vary per page. It lets a human get past right away but keeps spiders looping until they try to index that particular new link, which would let them pass if they indexed it quickly. Of course it's too late by the time they actually index it, and they're locked out for a while.

Ok, now the funny shit, these idiot spiders are now coming back looking for this link as an actual page so if you're not already in the sticky spider loop and ask for the page directly, WHAMMO!, you go DIRECTLY TO JAIL, DO NOT SCRAPE PAST GO, DO NOT SCRAPE 200 PAGES!
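
Stripped way down, the "go directly to jail" rule looks something like the sketch below. It's a simplification under stated assumptions: `TRAP_PATH`, the two sets and `handle_request` are all invented for the example, the link is fixed instead of varying per page, and the timing part of the loop is left out:

```python
# Hypothetical state; a real site would keep this in a database or session store.
in_challenge_loop = set()  # client ids currently being shown the challenge page
jailed = set()             # client ids that are locked out

TRAP_PATH = "/challenge-link"  # made-up path for the example


def handle_request(client_id: str, path: str) -> str:
    """Return 'jail', 'pass', or 'serve' for one request."""
    if client_id in jailed:
        return "jail"
    if path == TRAP_PATH:
        if client_id in in_challenge_loop:
            # Followed the link while being challenged: let them through.
            in_challenge_loop.discard(client_id)
            return "pass"
        # Asked for the challenge link cold, as if it were a real page.
        jailed.add(client_id)
        return "jail"
    return "serve"
```

The only way to know that link exists is to have scraped it out of the challenge page earlier, which is exactly why asking for it cold is a dead giveaway.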

It's like a high-tech comedy show at times and you just sit back sipping bourbon waiting for the first asshole to set foot in the latest snare.

Ah, my side is killing me from laughing so hard at these fucking idiots.

Anonz Azz booted to infinity and beyond!

That's right, someone was on my site that came from and I'll bet you're shocked that it has something to do with their website aren't you?

Well a BOOT TO HEAD for both of them!

And one for Jenny and the wimp...

Maxwell's Silver Hammer

Most of you probably think I'm very brash and run amok implementing things on my server all willy-nilly with hardly a concern for the damage that I might be inflicting on my visitors but that's further from the truth than you can imagine. I'm actually very cautious and do a lot of testing with each new approach I phase into my bot buster by first executing the rules and giving me a preview of what would happen for a day or so before I make the rules live.

That means to date all the IPs I've been banning are being banned in software so that I could monitor their returns and activity to verify they're really a permanent source of abuse or a one-shot attack from a dynamic IP.

Well, enough of them are returning on a regular basis that I've decided it's time to start the next phase of the project which I call Maxwell's Silver Hammer where it will decide automatically that the source of abuse is bad enough and just drop them in the .htaccess file so they simply bounce off the server and don't even tie up my scripts keeping an eye on them anymore.

So here we go sweating bullets that this code won't accidentally crash some night and leave the .htaccess file all banged up and bring the site down to its knees.
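
One common way to take some of the sweat out of it is to never edit the live file in place: write a complete new copy next to it and swap it in with one atomic rename. A rough sketch, assuming Python and old-style Apache "Deny from" access rules; `write_htaccess_bans` and its parameters are hypothetical, not the hammer itself:

```python
import os
import tempfile


def write_htaccess_bans(path: str, banned_ips: list[str], base_rules: str = "") -> None:
    """Rewrite the file at `path` with base_rules plus one 'Deny from' line per IP."""
    lines = [base_rules.rstrip("\n")] if base_rules else []
    lines += [f"Deny from {ip}" for ip in sorted(set(banned_ips))]
    content = "\n".join(lines) + "\n"

    # Write to a temp file in the same directory, then swap it in atomically,
    # so a crash mid-write never leaves a half-written .htaccess behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".htaccess.tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        # mkstemp creates the file owner-only; the web server must be able to read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Here `base_rules` would carry whatever the file needs besides the bans, for instance "Order Allow,Deny" and "Allow from all". If the script dies halfway, the old .htaccess is still sitting there untouched.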

Progress, gotta love it.

Monday, March 06, 2006

Pathologically Extreme

Yep, that's what my bot busting obsession was called today in private email.

Now that I'm "pathologically extreme" I must thank the person for his bluntness as it did bring up the point that there's a lot more to this bot busting issue than someone sitting on the sidelines only casually familiar with my quest and this blog may know.

In all fairness, if you told me I'd be on this quest to abolish unauthorized access to my site 12 months ago I would've laughed in your face and said "what harm does a little crawling do anyway?" and yes, I used to hold those tightly wound content control freaks in low regard as misguided time wasting fools.

However, then I decided to get out of the consulting game and focus more attention just on my own web sites which bring in a decent revenue stream without all of the whining and hassles of customers.

That's when all hell broke loose as suddenly both the spammers and scrapers started hammering my old server so hard it was going down all the time. Not physically crashed mind you, but it was just so busy serving the needs of spammers and scrapers that my income needs weren't being met whatsoever. We're talking DOS attacks because of the sheer speed and volume of this nonsense and the server just didn't respond for 5, 10, 15 and the worst was 90 minutes at a shot. It got SO BAD at one point I had to completely get rid of server side spam filtering as that tool itself could use up all the CPU when some spammer came along doing a pump and dump of spam.

These shameless greedy bastards were impacting my site, my SERPs, my wallet and really pissing me off - the shit had to stop.

First was the easy part which was just getting rid of the spam. I blocked email coming from most of Asia and Russia which eliminated the majority of the high speed spam dumps and gave me some breathing room to work. Then I made the only way to contact me a form on the web sites, eliminating all email addresses but 2, and literally set the server not to BOUNCE emails but REJECT emails. I did this because bounced emails still come into your server and it attempts to send a response back, but most spam has a bogus reply address, so thousands of bounce emails quickly fill up the queue and your email system grinds to a screaming halt processing bounce deliveries all day long. Trust me on this, just REJECT those undeliverable emails, no bandwidth or CPU wasted at all as they just bounce off your server harmlessly never to be seen again.

Guess what?

Asia and Russia are no longer blocked as REJECTing their emails stopped them from being a threat.

At this point there are never more than 10 emails sitting in my mail queue at any time and the spam that gets thru is literally a handful of emails a day, blissfully under control, I love it.

However, after solving this problem the old server was still going down like it was under a spam attack and after a while I came to the conclusion my site was probably just too busy to handle the load and my older slower server just couldn't deal with the demands of all the visitors, search engines, etc. and set out to upgrade.

Now, with a big fast shiny dual Xeon box it's back up and running faster than ever.

Two weeks later some fuckers took it offline for 90 minutes in the middle of the night and I lost my shit, that was it, the straw that broke the camel's back, no more Mr. Nice Guy.

.... this was war....

Then the whole process kind of evolved into a huge eye opening adventure at this point and being a naturally curious guy and a programmer with a huge ego [yes, I am IncrediBill and I can stop these bastards] it kind of took on a life of its own.

First, stopping the high speed scrapers was easy, totally child's play.

Next, the sheer volume of scraping became apparent once I was monitoring real-time site activity while squashing the high speed scrapers and looking for other unauthorized resource wasting bots.

Evolution just kept happening as one thing led to another: stopping more scrapers unearthed even more scrapers, the errors I fed scrapers unveiled tons of sites with MY SHIT on them, and it turned out many people had apparently built AdSense-incentivized businesses based on bottom feeding off my business and in the process were diluting keywords I was earning money from, using my own content against me.

OK, now THAT pissed me off even more.

So while some of you may call the depths and extremes I'm taking to protect my shit as "pathologically extreme" my side of the story is self-defense for my very survival and I'll be damned if some bottom-feeding leeches are going to take me down without a good fucking fight.

Yes, that's it people, as far as I'm concerned at this point it's a fight to the death, theirs and not mine if I have anything to say about it. So far I've spent a ton of time and money addressing the issues and at this point it's paying off but for how long only time will tell but it's definitely going to be a death match for one of us.

Now, go buy my CDs and t-shirts in the lobby to help fight the cause and invite me as a motivational speaker at your next nerd conference to spread the word!

Nah, that would be WAY too extreme!

UK Scraperz in da Hood

There really aren't too many truly persistent jerks out there and I hate to rat out sites or IPs unless I'm sure they are truly rotten but this wins the award for most annoying IP address of the year. I blocked a long time ago and it just never stops, it's relentlessly attempting to crawl, day after day, asking for thousands of pages and doesn't fucking take NO! for an answer.

A quick reverse DNS on this idiot shows it's apparently on some UK server farm, possibly run by, that IDs itself as

Out of curiosity I went back and ran a reverse DNS on all my blocked IPs and big shock, this server farm has a few hits in my list.
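
If you want to run the same check over your own block list, a bulk reverse lookup is only a few lines. This sketch assumes Python's standard `socket` module; the function name `reverse_dns` and the swappable `resolver` argument are my own additions for the example:

```python
import socket


def reverse_dns(ips, resolver=socket.gethostbyaddr):
    """Map each IP to its PTR hostname, or None if the lookup fails."""
    results = {}
    for ip in ips:
        try:
            hostname, _aliases, _addresses = resolver(ip)
            results[ip] = hostname
        except OSError:  # covers socket.herror and socket.gaierror
            results[ip] = None
    return results
```

Sort the results by hostname and the server farms that keep turning up in your list tend to jump right out.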

Matter of fact, I just reviewed my logs for today and all of the 88.208.19* IPs listed above hit the server so it appears to be distributed scraping.

So I'm thinking it's probably not such a bad idea just to block their whole damn neighborhood based on the non-stop abuse I'm getting from a couple of these IPs.

I would block at a minimum:

I'd keep an eye out on anything from this range as well:

inetnum: -

They claim to resell dial-up and broadband otherwise I'd suggest blocking the whole damn network but at this point I'm not sure 100% if we're looking just at servers or surfers but so far blocking what I've blocked doesn't appear to have any negative impact on my site except to stop their stupid bots from downloading my content.
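
Blocking a whole neighborhood in software comes down to a range check. A minimal sketch, assuming Python's `ipaddress` module; the network below is a placeholder from the documentation address space, not the range discussed above, and `is_blocked` is a hypothetical name:

```python
import ipaddress

# Placeholder range only; swap in the networks you actually want to block.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Keeping the ranges in one list makes it easy to start small and widen the block later if the abuse keeps coming from next door.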

Let me know if anyone else is seeing activity from these guys and what IPs you're seeing as this is about as bad as I've seen and they need to be stopped.

Alexa Bowling for Matt Cutts

Don't know who did it or how they did it but it was a stroke of genius to load Alexa up with a bunch of related sites for Matt Cutts that are obviously spam.

Wonder how hard this was, just write a script using the MSIE toolkit with Alexa's toolbar installed to just referer spam the crap out of his site?

Is it possible from a single IP to game Alexa or would someone need to enlist a legion of anonymous proxy servers to pull this off?

Or, could it be so simple that someone has dissected the Alexa toolbar API and just fed it a load of junk about Matt?

We may never know unless someone confesses but I'm thinking it might be worth a giggle to try and see what it takes to make something like that happen.

Good thing I'm not bored today!

Sunday, March 05, 2006

Best Bot Ever - Almost!

Tonight I saw the future of sneaky bots and it did everything to look like a human so I'm thinking they used a developer toolkit to drive the crawl via MSIE. This thing downloaded images, banner ads from 3rd party servers, ran javascript and even accessed AdSense ads so it was as convincing as anything you can imagine.

Spooky and amazing in how well it cloaked itself.

I couldn't tell by looking at the log files either, very impressive.

Then after patiently crawling for a whopping 11,097 seconds or more than 3 hours for those of you that can't divide 11,097 / 3600 in your head, it exceeded my max page count which is set fairly high.

Then it proceeded to very slowly and stealthily ask for 20 more pages after being told it had exceeded its daily limit of pages.


However, the point being if it had been just a bit smarter to realize it was getting a repeated error page I'd have never known it wasn't a human.
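
The tell here was simple: a human stops when shown the same error page, a bot keeps asking. A sketch of that check with made-up numbers, since the real limits aren't given; `MAX_PAGES_PER_DAY`, `MAX_IGNORED_WARNINGS` and `check_visitor` are all assumptions, and the daily reset is left out:

```python
from collections import Counter

MAX_PAGES_PER_DAY = 1000   # assumed cap; the real one is just "fairly high"
MAX_IGNORED_WARNINGS = 5   # assumed tolerance before calling it a bot

pages_served = Counter()
warnings_ignored = Counter()


def check_visitor(visitor_id: str) -> str:
    """Return 'ok', 'limit', or 'bot' for one page request."""
    if pages_served[visitor_id] < MAX_PAGES_PER_DAY:
        pages_served[visitor_id] += 1
        return "ok"
    # Over the cap: count how many times the limit page gets ignored.
    warnings_ignored[visitor_id] += 1
    if warnings_ignored[visitor_id] > MAX_IGNORED_WARNINGS:
        return "bot"
    return "limit"
```

A visitor that keeps requesting pages after the limit page has been served a handful of times gets tagged as a bot, however human it looked up to that point.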

Not a good turn of events whatsoever!


Every Webmaster's Worst Nightmare

Here I sit working on my website late at night, we're talking LATE at 3am, so I can test some new bot blocking technology in the middle of the night when the traffic is low so impact will be minimal if I screw up and knock everyone offline.

Ok, upload some code, bug, quick patch, bug, something weird going on so hop onto SSH and look at server and something is chewing up CPU like crazy and website isn't responding properly. Whew, it's an automated nightly update that only lasts a couple of minutes and everything is back to normal.

Test some code, find another minor bug, upload some new code, run the page, and it goes and goes and goes. What the heck, did I just blow something big time? Check SSH again, staggers a little and stops, then generates a socket error and closes.

Holy Fuck! Server Down!

So I click here and there in my browser, EVERYTHING is down!

YAY! I didn't crash the server, the cable modem is offline.

30 harrowing minutes later the damn cable modem comes back online and everything on the server is running just fine except I left a debugging message that's displaying on all the pages.

Remove that message, test it again, everything seems to be OK, time for bed.


Bracing for nightmares tonight....

Bandwidth sucking Arachmo

Some speedy little crawler from Japan named Arachmo that I've never seen before set off my speed trap tonight so you may want to just block this pest before it hits your server. "Mozilla/4.0 (compatible; Arachmo)"
At one point this little bastard was asking for more than 10 pages a second.
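
For the curious, a speed trap can be as simple as counting hits per IP in a sliding one-second window. This is a sketch under my own assumptions, not the trap running here: the threshold, the window and the name `too_fast` are all hypothetical:

```python
from collections import defaultdict, deque

MAX_HITS = 10         # assumed threshold: more than this per window trips the trap
WINDOW_SECONDS = 1.0  # assumed window size

recent_hits = defaultdict(deque)


def too_fast(ip: str, now: float) -> bool:
    """Record a hit at time `now` (in seconds) and report whether the IP is speeding."""
    hits = recent_hits[ip]
    hits.append(now)
    # Drop hits that have fallen out of the window.
    while hits and now - hits[0] >= WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_HITS
```

Pass in the request timestamp on every hit; anything asking for more than 10 pages inside a second comes back True.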

Maybe Godzilla will just step on the damn server while fighting with Mothra some day!