Thursday, November 10, 2011

Bot Blacklisting vs. Whitelisting, are you a convert yet?

I'm still shocked that after all these years people are not only still practicing the ancient black art of blacklisting, but I'm even more shocked to see several so-called website content security products recently released that rely on blacklisting as their primary defense.

Are they fucking kidding?

Do people really pay good money to chase an endless supply of bots?

Let's explore the blacklisting dam vs. whitelisting dam metaphor to get a simple grasp on this issue. For those not familiar with the problem, blacklisting is like building a dam on a river with a big gaping hole in the middle. While it holds back some water, or bad bots in this instance, that damn blacklisting dam still lets most of it spill through, a total waste of time and money. Whitelisting, on the other hand, is like a real dam that holds everything back except for the controlled spillway, aka the whitelisted items, which are the only things allowed to pass. Therefore, just like damming a river, common sense dictates you build a solid dam with whitelisting, control all those bots, and do it right the first time.

Blacklisting is a pretty futile methodology, obviously the choice of masochistic webmasters. Look at the amount of time and resources wasted maintaining a blacklist: tons of bot entries, plus lots of log analysis and processing power just to keep up with them. Heck, all a bad bot has to do to defeat your blacklist is change its user agent name every single time it accesses your site.

Simply combine any two random words from the dictionary and you've got a new bot name that can bypass any blacklist. Hell, pick almost any single word from the dictionary and you'll defeat the blacklist; two words is overkill, really. Some bots merely send a couple of random strings of gibberish as a user agent, which works perfectly against silly tactics like blacklisting.
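To see just how cheap that bypass is, here's a minimal sketch. The blacklist entries and word list are made up for illustration; the point is that a membership check only catches names it has already seen:

```python
import random

# Hypothetical blacklist of known bad bot user agents.
BLACKLIST = {"BadBot/1.0", "EvilCrawler/2.3", "ScraperPro"}

def is_blocked(user_agent):
    """Blacklist check: only blocks names already on the list."""
    return user_agent in BLACKLIST

# A bad bot defeats it by gluing two random dictionary words together.
words = ["river", "cloud", "stone", "maple", "delta", "ferret"]
fake_agent = random.choice(words).title() + random.choice(words).title() + "/1.0"

is_blocked("BadBot/1.0")  # a known entry gets caught
is_blocked(fake_agent)    # a fresh random name sails right through
```

No matter how many entries you add, the freshly invented name is never on the list, which is the whole problem.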

Now examine the simple implementation of a whitelist. There aren't that many beneficial things that crawl your site, and most sites can thrive with a whitelist of fewer than 20 entries, maybe 100 max, instead of the hundreds or thousands of items in a blacklist. Small lists are easy to maintain, with negligent processing required to validate the list in real time and low impact on server load.
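A whitelist check really is that small. The crawler names below are just sample entries, and a real deployment would also pass ordinary browser user agents, but this sketches the idea:

```python
# Hypothetical short whitelist of beneficial crawlers -- most sites
# need well under 20 entries.
WHITELIST = ("Googlebot", "Bingbot", "Slurp", "DuckDuckBot")

def is_allowed(user_agent):
    """Everything not matching the whitelist is kicked to the curb by default."""
    return any(name in user_agent for name in WHITELIST)

is_allowed("Mozilla/5.0 (compatible; Googlebot/2.1)")  # whitelisted crawler passes
is_allowed("MapleStone/1.0")                           # random made-up bot is denied
```

Notice the logic is inverted from a blacklist: the made-up bot name is rejected with zero effort, because it was never granted access in the first place.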

Using any raw logfile analysis program, it's easy to identify what should be whitelisted in mere minutes. Best of all, whitelisting means you can spend your spare time actually working on your site instead of chasing bad bots to blacklist, since everything not whitelisted is automatically kicked to the curb by default with no extra effort on the part of the webmaster.
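If you don't have a log analyzer handy, a few lines of scripting do the same job. This sketch assumes Apache combined-format logs (the sample lines are invented) and tallies the last quoted field, the user agent, so the legitimate crawlers worth whitelisting float to the top:

```python
import re
from collections import Counter

# The user agent is the last quoted field in a combined-format log line.
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"$')

def top_agents(log_lines, n=20):
    """Count user agents so frequent crawlers stand out as whitelist candidates."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)

# Made-up sample log lines for illustration.
sample = [
    '1.2.3.4 - - [10/Nov/2011:00:00:01 -0500] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [10/Nov/2011:00:00:02 -0500] "GET /a HTTP/1.1" 200 128 "-" "MapleStone/1.0"',
    '1.2.3.4 - - [10/Nov/2011:00:00:03 -0500] "GET /b HTTP/1.1" 200 256 "-" "Googlebot/2.1"',
]
top_agents(sample)  # Googlebot/2.1 shows up twice, the one-off junk agent once
```

Run that over a day or two of real logs and your whitelist practically writes itself.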

Those I've actually convinced to convert to whitelisting in the past have done nothing but sing its praises.

Compare that to those still blacklisting: they don't have any spare time to sing.


Anonymous said...

Hi Bill,

How do you ensure you don't accidentally block a Google bot?

IncrediBILL said...

Googlebots are validated using reverse DNS, the same way we've been doing it since 2006 and the same sane way we stopped 302 proxy hijacking. It works.
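The reverse-DNS validation described here is the full-circle check Google itself recommends: reverse-resolve the IP, confirm the hostname falls under Google's crawler domains, then forward-resolve that hostname and make sure it maps back to the same IP. A minimal sketch (the domain check is the part a spoofer can't fake):

```python
import socket

def hostname_is_google(hostname):
    """True only for hostnames under Google's crawler domains."""
    return hostname.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip):
    """Full-circle DNS check: reverse lookup, verify the domain, then
    forward lookup to confirm it resolves back to the same IP.
    The user-agent string alone proves nothing -- anyone can send it."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

A fake Googlebot either reverse-resolves to some other domain or doesn't forward-resolve back to its own IP, so it fails one of the two legs every time.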

Anonymous said...

Fully agree about white listing.

But where you said, "negligent processing", I think you really meant "negligible processing". :-D

Bruce said...

I suspect the lack of a straightforward way to whitelist is the biggest reason it's not being used (e.g., I'm guilty).

I did whitelist using robots.txt, but of course that doesn't stop anyone not following the rules. The good part is it did noticeably reduce the bot traffic on my site. It might also be a good way to test a real whitelist, as I inadvertently blocked a few bots I needed/wanted.
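For reference, a robots.txt whitelist along the lines Bruce describes names each allowed crawler and then disallows everyone else; the crawler names here are just examples:

```
# Crawlers on the whitelist may fetch everything.
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Everyone else is asked to stay out entirely.
User-agent: *
Disallow: /
```

As noted, this only governs bots polite enough to obey robots.txt, which is exactly why it's a good dry run before enforcing a real server-side whitelist.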

18x66 said...

I agree with your observations. I have been using a honey-pot type trap, which works but of course lets an awful lot slide. In the beginning it was not too much trouble to read through access logs and make adjustments, and later through digested logs, but there is no way I will ever keep up with it all these days. I have just started looking into whitelisting, which naturally brought me here. Do you have links to how-to info for those who are not competent PHP or JavaScript programmers? I have a lot of reading to do.

Anonymous said...

If you believe in whitelisting so much, why do you use a free service like this? Why don't you host your own blog and write about it?