Saturday, September 23, 2006

Whitelist OPT-IN htaccess file

People are always asking me how to build an OPT-IN .htaccess file, which I advocate as opposed to the traditional blacklist methods.

The problem with OPT-IN is that it's VERY unforgiving, and you really need to check your visitor stats to make sure you're letting in all the crawlers that are sending you traffic.

Below is a bare-bones sample of how it works. Anything not in the list gets a 403 Forbidden error, so you'll probably need to add more items and refine this for your particular website.

Sample .htaccess file for Apache 2.0:

#allow just search engines we like, we're OPT-IN only

#a catch-all for Google
BrowserMatchNoCase Googlebot good_pass
BrowserMatchNoCase Mediapartners-Google good_pass

#a couple for Yahoo
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase Yahoo-MMCrawler good_pass

#looks like all MSN bots start with msnbot or Sand
BrowserMatchNoCase ^msnbot good_pass
BrowserMatchNoCase SandCrawler good_pass

#don't forget ASK/Teoma
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

#allow Firefox, MSIE, Opera etc., will punt Lynx, cell phones and PDAs, don't care
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

#Let just the good guys in, punt everyone else to the curb
#which includes blank user agents as well

<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>
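Since OPT-IN is so unforgiving, one practical safeguard is to always allow your own address so a typo in the agent list can't lock you out of your own site. A sketch using the same Apache 2.0 directives as above (192.0.2.10 is a hypothetical example address; substitute your own static IP):

```apache
<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
#hypothetical example address - substitute your own static IP
allow from 192.0.2.10
</Limit>
```
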

Just save the above as a file named ".htaccess" in your httpdocs or root web folder in your hosting account and all the crazy bots abusing your site will get bounced from now on.

Remember, anything not listed will no longer have access, so be careful and make sure everything your site needs is allowed in the list.

Enjoy.

10 comments:

JayW said...

You mean I can get rid of all this?

Thanks :)

RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PlantyNet_WebRobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^gamekitbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ichiro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^avuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bdfetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AIBOT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Libby [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Jakarta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^mysearch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OmniExplor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP/4.2.2 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^POE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SearchIndy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^rameda [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Huron [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SocietyRobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Snapbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Exabot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^voyager [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^updated [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^psycheclone [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^wwwster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^IRLbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Shim [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^findlinks [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^TMCrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nusearch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^OCP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^blaiz [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Survey [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmeraldS [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper|NutchCVS?) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
# Tools - undo curl after yahoo verification
# RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC]
RewriteRule .* - [F]

OrangeSkidoo said...

It seems to work, and it's a much more efficient solution, but the error log is filling up with supposedly denied requests. The access log lists the requests with code 200, but the error log has 'client denied by server configuration' for the same requests -- even the ones I know worked, because they were made by me!

Also oddly, I use mod_rewrite, and these 'denied requests' have duplicate entries where that comes into play:

[Tue Sep 26 22:49:56 2006] ... /home/...
[Tue Sep 26 22:49:56 2006] ... /home/.../index.php
[Tue Sep 26 22:51:09 2006] ... /home/.../archive
[Tue Sep 26 22:51:09 2006] ... /home/.../index.php

Any idea why that's happening?

IncrediBILL said...

For some assistance with implementing a more robust version of my OPT-IN file, take a peek over at a thread started about this at IHelpYou's site.

They're really getting into it and helping each other complete their OPT-IN files.

A couple of bots I overlooked that should be added to my OPT-IN list are:

Feedfetcher-Google
YahooFeedSeeker
Google-Sitemaps
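In .htaccess terms, that's three more lines alongside the ones in the post. A sketch, assuming these agent strings match what shows up in your logs (always verify against your own stats):

```apache
#feed readers and sitemap fetchers I overlooked
BrowserMatchNoCase Feedfetcher-Google good_pass
BrowserMatchNoCase YahooFeedSeeker good_pass
BrowserMatchNoCase Google-Sitemaps good_pass
```
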

Anonymous said...

Hi Bill

I have spent some time reading about this opt-in list, but couldn't get it to work with .htaccess.

I visited the external link you gave, and noticed Savvy1 posted this code from .htaccess (regarding user agents that bypass the minimum filter you set up):

<Limit GET POST PUT HEAD>
order allow,deny
allow from env=good_pass
deny from env=bad_pass
</Limit>

But that does ONLY limit user agents marked "bad_pass".

So how do you set up a more "bullet proof" list in .htaccess, based on the opt-in approach plus blocking of bots that spoof the first words of a legitimate user agent?

Thanks :-)
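The "bullet proof" combination asked about above can be sketched like this: whitelist as in the post, then strip good_pass back off agents that start with "Mozilla" but betray themselves later in the string. This relies on the !env-variable form of BrowserMatchNoCase clearing the variable, and the spoofer patterns shown are hypothetical examples, not a vetted list:

```apache
#whitelist as before
BrowserMatchNoCase ^Mozilla good_pass
#hypothetical spoofer patterns - remove the pass they just earned
BrowserMatchNoCase (libwww|HTTrack|Nutch) !good_pass

<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>
```

Because the BrowserMatch lines are evaluated in order, the later !good_pass match wins, so a bot claiming "Mozilla/4.0 (compatible; HTTrack ...)" still gets the 403.
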

IncrediBILL said...

Not sure why you're having issues with the opt-in approach. I installed that script in one of my .htaccess files for a lesser-used site, tested it using a tool that allowed me to spoof user agents, and it whacked a bunch of stuff just like it was supposed to.

It's not a comprehensive list, though; it needs more additions.

Anonymous said...

IncrediBILL

I have a long list of "deny from" entries in my .htaccess file, mostly IPs for proxies (hate those damn proxies).

The OPT-IN UA list should work fine separately as long as it has the <Limit GET POST PUT HEAD> block (right?)

I tried changing my UA to "LinkCheck/0.1" - and it passed without problems ...

Any idea?

Anonymous said...

I have solved the .htaccess problem regarding the opt-in list.

Thank you for the "idea".

It is a pity that there is no place to get help finding solutions for blocking spam, bots, etc.

I have spent several days securing some websites, learning more Perl, Apache .htaccess, etc.

It is "easier" to find the problem and information, but harder to do something about it.

Would have been nice with a forum or a download area with examples and secure scripts (etc).

But thanks again Bill :-)

IncrediBILL said...

Sorry I didn't have a full solution but debugging problems in Apache drives me bonkers, not to mention lack of needed functionality, which is why I use a script instead.

Glad I was able to help some though.

ewel said...

Hi Bill, I think you have made a very valuable contribution by proposing a whitelist method to keep bad bots out.

Still I am missing an important element in all this. I have read as much as I could find about this subject, but one thing I cannot find is a copy&paste solution for beginners.

I am one of many who are using a content management system to make a nice but amateur-built website, and who need a good anti-bad bot solution but do not understand enough to fill in the blanks. Lacking a good copy&paste solution we amateurs no doubt keep things interesting for attackers which ultimately is not in the interest of the internet community.

In other words, I think I understand why you are not publicising exactly what you did and how you did it, but would it not make sense for experts like yourself to guide beginners like me in enough detail to enable us to help clear the internet of the rubbish of malevolent actors?

Having read about blacklist solutions it is clear to me that these are only effective if they are kept up to date, which is something that an amateur webmaster is unlikely to do. Also blacklisting solutions seem to be quite complex and beyond the grasp of most beginners. Your whitelisting solution on the other hand seems much more understandable and practical.

Would you be able to post or point to a solution which I could just copy and paste into my htaccess file? Or perhaps a script that I can simply upload?

LKovacs said...

Could blocking data centers mean that I could stop the spam referrers from RU hitting my site?