Thursday, December 06, 2007

Bad Behavior Needs Behavior Modification

WebGeek recently reported on Bad Behavior Behaving Badly, where he got locked out of all his own blogs, was listed as an enemy of the state, put on the FBI's 10 most wanted geek list, and all sorts of things.

OK, I'm exaggerating but read his post and it's close enough.

Anyway, there was something he mentioned being concerned about:

"If left unattended in this state for a long time, a site could lose valuable search engine rankings, after the spiders of the Big 3 (Google, Yahoo, and MSN) find that they are locked out repeatedly with 403 errors."
Since he mentioned it: I've looked over the source code for Bad Behavior before, and the way it validates robots isn't something I'd put on my website because it relies solely on IP ranges, and those ranges are incomplete based on the raw information I've collected from the crawlers themselves.

The search engines have clearly stated that they may expand into new IP ranges at any time without notice, and the only official way to validate their main crawlers, Googlebot for instance, is full round-trip DNS checking, with IP ranges as a backup just in case they make a mistake.
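For reference, that full round-trip check (reverse lookup, then forward lookup back to the same IP) boils down to something like this in PHP. This is a minimal sketch for illustration only, not Bad Behavior's code:

// Full round-trip (reverse then forward) DNS check for Googlebot.
// A minimal sketch for illustration only - not Bad Behavior's code.
function is_real_googlebot($ip)
{
    $host = gethostbyaddr($ip);    // e.g. crawl-66-249-66-1.googlebot.com
    if ($host === false || $host === $ip) {
        return false;              // no usable PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;              // reverse name isn't in Google's domain
    }
    $forward = gethostbynamel($host);                       // forward lookup of that name
    return is_array($forward) && in_array($ip, $forward);   // must map back to the same IP
}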

So this code could easily be obsolete at any time:
if (stripos($ua, "Googlebot") !== FALSE || stripos($ua, "Mediapartners-Google") !== FALSE) {
    require_once(BB2_CORE . "/google.inc.php");
}

// Analyze user agents claiming to be Googlebot
function bb2_google($package)
{
    if (match_cidr($package['ip'], "66.249.64.0/19") === FALSE && match_cidr($package['ip'], "64.233.160.0/19") === FALSE) {
        return "f1182195";
    }
    return false;
}
Even more importantly, I've tracked Google crawlers in the following IP ranges, which is 2 more ranges than Bad Behavior has in its code! (A rough sketch of a broader backup check follows the list below.)
64.233.160.0 - 64.233.191.255
66.249.64.0 - 66.249.95.255
72.14.192.0 - 72.14.239.255
216.239.32.0 - 216.239.63.255
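For illustration only, a backup check against those observed ranges could look something like this. It's a sketch based on my own logs, so these ranges can go stale too, and it isn't Bad Behavior's actual code:

// Backup check against the Google ranges I've observed (sketch only;
// these ranges come from my own crawler logs and can go stale as well).
function ip_in_observed_google_ranges($ip)
{
    $ranges = array(
        array("64.233.160.0",  "64.233.191.255"),
        array("66.249.64.0",   "66.249.95.255"),
        array("72.14.192.0",   "72.14.239.255"),
        array("216.239.32.0",  "216.239.63.255"),
    );
    $addr = sprintf('%u', ip2long($ip));   // unsigned so 32-bit PHP compares correctly
    foreach ($ranges as $r) {
        $low  = sprintf('%u', ip2long($r[0]));
        $high = sprintf('%u', ip2long($r[1]));
        if ($addr >= $low && $addr <= $high) {
            return true;
        }
    }
    return false;
}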
The same criticism applies to how the other bots are validated: Bad Behavior needs a little more robustness in the validation code so that it isn't accidentally blocking valid robots from indexing web pages. Unless I'm missing something, I don't even see where Yahoo crawlers are specifically validated (I'm tracking 11 IP ranges for Yahoo), and MSNBOT was missing the 131.107.0.0/16 CIDR range, etc.

As it stands, the code doesn't have all the IP ranges that I've seen used for any of the major search engines, so there is some risk, albeit not a big one, that legitimate search engine traffic is being bounced.

Not only that, but the MSIE validation is full of holes and most of the stealth crawlers I block will zip right through Bad Behavior and scrape the blog.

I think WebGeek is right; I would disable the add-in until those issues are resolved.

10 comments:

Anonymous said...

It's better than nothing. Add the IPs to whitelist.php and, if you want to, run RDNS in your .htaccess file.

I'm using it now on one site coupled with another script. Even though it isn't bullet proof, it's better than running wide open.

IncrediBILL said...

I think you missed the point: it's not whether it's bullet proof, it's whether you're going to lose pages that can't be indexed because the anti-spoof bot validation isn't close to what it should be.

Anonymous said...

I don't mind being educated when I'm missing the point.

If one adds the SE IPs to the "whitelist" and runs RDNS (my RDNS is executed in the httpd.conf file), any Gbot coming off an IP that doesn't pass RDNS would never get to the script. With the SE IPs whitelisted, the script would only be examining "other" requests, right?

Bill, if I'm "out to lunch" with the above reasoning let me have it :).

IncrediBILL said...

You're not out to lunch, but then again you're doing some of the work that Bad Behavior should be doing, and it's a chain of command problem.

What happens if your .htaccess or httpd.conf file allows something validated by RDNS but BB bounces it because your BB whitelist isn't up to date?

OOOPSIE!

That page won't be indexed.

Does BB log these and tell you it bounced a page requested by something claiming to be Googlebot?

Sure hope so or you'll have lots of unseen issues.

It's the chain of command thing: your .htaccess or httpd.conf will let Googlebot in with valid RDNS, but then BB will bounce it if the IP range isn't in the whitelist. New crawler IP addresses appear without warning, and sometimes RDNS fails.

The last thing the code COULD do in real time (mine does), after all other mechanisms fail and before bouncing the unvalidated crawler, is a WHOIS lookup to see if it's a block owned by Google, Yahoo or Microsoft.

I use RDNS, an IP list, and WHOIS as a fallback check before bouncing, and if it fails all 3 it's probably not legit. However, BB's IP list is all they use, it appears incomplete, and without RDNS you're going to make mistakes and bounce some valid bots.

I think I linked out to a WMW article about this happening a couple of days ago with MSNBOT; the MSN boys messed up the RDNS on a new set of IPs. Very nasty.

The bottom line is BB by itself can cause trouble, and even doing RDNS in front of BB isn't a complete solution because BB may still bounce it when something like the MSNBOT mistake happens.

I'm very cautious about this stuff and cover all my bases the best I can, which is why I raised this issue after WebGeek's story: so that others wouldn't get burned in the search engines by what is, IMO, incomplete crawler validation with lots of opportunity for bouncing real bots.

Remember the trinity:
RDNS, IP list and WHOIS

If it fails all 3 what more could you do in automated validation?

Flag the bounce for human review.
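In rough pseudo-PHP, the trinity looks something like this. It's just a sketch of the idea with hypothetical helper names, not my actual code:

// The trinity in rough pseudo-PHP - a sketch of the idea, not my actual code.
// validate_by_rdns(), ip_in_known_ranges() and flag_for_human_review() are
// hypothetical helpers standing in for the real thing.
function validate_crawler($ip, $ua)
{
    if (validate_by_rdns($ip, $ua))     return true;   // 1. full round-trip DNS
    if (ip_in_known_ranges($ip))        return true;   // 2. known IP ranges as backup
    if (whois_says_search_engine($ip))  return true;   // 3. WHOIS on the netblock owner
    flag_for_human_review($ip, $ua);                    // failed all 3: bounce and log it
    return false;
}

// Hypothetical WHOIS helper: ask ARIN who owns the block (plain text, port 43).
function whois_says_search_engine($ip)
{
    $fp = fsockopen("whois.arin.net", 43, $errno, $errstr, 10);
    if (!$fp) return false;
    fwrite($fp, $ip . "\r\n");
    $response = "";
    while (!feof($fp)) {
        $response .= fgets($fp, 1024);
    }
    fclose($fp);
    return preg_match('/Google|Yahoo|Microsoft/i', $response) > 0;
}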

Anonymous said...

Okay, I hear you, and yes, it's better to alert those who believe it's an install-and-forget option.

"Does BB log these and tell you it bounced a page requested by something claiming to be Googlebot?"

My modded version does log and that's why I know about and use the whitelist.php file. The output of that log almost dropped me on my ass when I first started running it. In my configuration, the script is doing all it can.

For me it's the fact that it's free. Did I expect bugs and look for them? You bet I did. Was I surprised when I found bugs and deficiencies? No, I expected to find them. This script can be used by those who are willing to work with it. Is it ready for prime time? NO! Not yet.

I need my own blog so I can vent my own rants also :). Thanks for the air time, Bill.

IncrediBILL said...

ban proxies, the difference is you appear to be one of the few that understand what's going on with crawlers, IPs, and RDNS, with the added bonus that you seem to be able to program to fix the deficiencies.

Your average webmaster doesn't have a clue about what's crawling the net, whether it's valid or not, and can't program whatsoever, so IMO they're probably better off using nothing than a DIY kit that can bite them in the ass.

Maybe you should offer all your changes back to the author so he'll improve the code, since it's under a Creative Commons license.

Anonymous said...

I'm going to disagree with Bill on his own blog ...... :)

I ran BB on a G-validated test site with no RDNS and monitored the log files. After approx 500,000 G-crawled pages, the only IP range BB failed on was 72.14.192.0/18. G webmaster tools didn't show any 403s for the pages that BB bounced.

Bill, the script poses very little danger to the end user compared to the major problems of all those unmodded open source CMSs/blogs. If BB bounces a few pages, that is a very minor problem compared to the dup content issues of the CMSs/blogs being used.

BB isn't perfect, but even unmodded it is better than nothing.

S Allen, heads up. I can append query strings to your URLs. This means I can create 100s of pages for each and every page on your site. BB is the very least of your worries. I know this because some "Dirt Bag" did it to me from behind a proxy.

I have a lot of respect for Bill and what he is doing. My disagreement should only be viewed in a professional manner.

IncrediBILL said...

Did I say I saw G crawl outside a narrow range very often?

Nope.

But when it happens you're screwed without an RDNS check or a WHOIS test as a failsafe.

I won't trust my sites to something that can't do those two things in the event their IP list has become stale.

That's just me and I make my living off the web so excuse me for liking my free money! ;)

Unknown said...

I'm going to go one step further and criticize incrediBILL for not contacting the author of Bad Behavior about these issues before making a public posting.

Thanks for nothing.

I've never seen Google's crawlers using the other IP ranges you posted. They're used by Google employees themselves (actual humans using actual web browsers) and other Google services, such as their translator, Web Accelerator, etc. If you have any logs which show Googlebot using those ranges, I'd love to see them.

Finally, the decision to use hard-coded IP address ranges is a calculated design decision. Yes, it means the ranges have to be updated every so often; this has happened once in two and a half years, with msnbot. That was detected and fixed in a matter of hours. Bad Behavior is built for speed; any of the other methods you've proposed would be unacceptably slow.

IncrediBILL said...

Why do I need to contact the author?

I read another blog post which reminded me of something I noticed in the code and I just commented on that fact.

Welcome to the Blogosphere! ;)

My IP ranges are built automatically, so when Googlebot, or OTHER bots from Google (I track them all), crawl from those IPs at least once, that IP range gets included.

Googlebot isn't the only Google crawler working from the 'plex. Especially with a blog, something like Feedfetcher, AdsBot-Google, or even Google-sitemaps needs a free pass to access the site and should also be spoof-free.

72.14.199.15 "AdsBot-Google (+http://www.google.com/adsbot.html)"
72.14.194.27 "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"

So on and so forth, probably things you don't want accidentally snared.

FYI, I build for speed, accuracy AND dependability. Reverse DNS isn't slow except for the 1st hit, assuming you only do it ONE TIME per IP per 24 hours and cache the result. Then I use known IP ranges as a backup in case the RDNS fails or that bot doesn't have RDNS support yet, which is just as FAST as what you're doing.

Finally, WHOIS is the last resort in real-time checking; it's done ONE TIME per 24 hours, cached as well, and only used if the other 2 tests have already failed.

While you would bounce them, I do something slow ONE TIME, validate it, and track that the IP is validated for 24 hours before doing it again.

The computer on the other end of the connection doesn't really care how slow I am validating that bot the 1st time and if it was going to fail anyway, who cares if I'm holding up the connection for another second before dropping them with a 403?

So my methods aren't unacceptably slow and are more reliable, especially if someone forgets to update the software.
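In a nutshell, the caching part is nothing fancier than this. It's a simplified file-based sketch of the idea, not my actual implementation, and validate_crawler() stands in for the RDNS/IP list/WHOIS chain sketched earlier:

// The 24-hour cache in a nutshell - a simplified file-based sketch, not my
// actual implementation. validate_crawler() is the hypothetical chain above.
function is_validated_crawler($ip, $ua)
{
    $cache_file = sys_get_temp_dir() . "/botcheck_" . md5($ip);
    if (file_exists($cache_file) && (time() - filemtime($cache_file)) < 86400) {
        return trim(file_get_contents($cache_file)) === "ok";  // cached verdict still fresh
    }
    $ok = validate_crawler($ip, $ua);    // slow path: RDNS, IP list, WHOIS - done ONCE
    file_put_contents($cache_file, $ok ? "ok" : "bad");
    return $ok;
}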

Besides, I've tested my methods on sites processing millions of pages per month; the overhead isn't perceptible to the end user and it easily survived a few DDoS attacks, so that's plenty fast enough.

We can agree to disagree on methods, but I'm sure we both agree the bots need to stop; I just suggested taking a little more care not to block a few good bots by accident.

That's all.