Sunday, July 15, 2007

Rehabilitating Massive Amounts of 404 Errors

One of my sites used to get as many as 100K 404 errors in a single month.

Leading cause of this problem?

SEARCH ENGINES!

That's correct, the #1 cause was search engines, but they were just a symptom of a bigger problem, not the root cause. Sloppy scrapers and crappy wannabe search engines and directories that mucked up the URLs were the true culprits. Then the major search engines crawled those sloppy sites, indexed the mucked-up URLs, and that's when all the 404 fun started.

Obviously my bot blocking stopped the scraping, so the source of the mucked-up URLs eventually faded away, but that still left a serious amount of junk in the search engine crawler queues to clean up.

Some of the links had everything from an ellipsis dropped in the middle to fragments of a JavaScript onclick() appended to the end. My personal favorites were the Windows script kiddies who didn't realize Linux servers are case sensitive and converted all my links to lower case. There were plenty of other errors, but you get the point of the kind of damage that can be inflicted by homemade crawlers written by incompetent assholes.
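
To give you an idea of the cleanup involved, here's a rough Python sketch of stripping that kind of junk off a requested path. The patterns and the helper name are just illustrative guesses at the garbage described above, not the code that actually ran on my server:

import re

def strip_obvious_junk(path):
    """Strip the kinds of junk described above off a requested path.

    The patterns here (appended JavaScript fragments, stray ellipses)
    are illustrative guesses, not an exhaustive list.
    """
    # Drop everything from an appended JavaScript fragment onward,
    # e.g. /page.htmlonclick(return false) or /page.htmljavascript:void(0)
    path = re.sub(r"(?i)(javascript:|onclick\().*$", "", path)
    # Remove any ellipsis (three or more dots) dropped into the path
    path = re.sub(r"\.{3,}", "", path)
    # Collapse doubled slashes left behind by the cleanup
    path = re.sub(r"/{2,}", "/", path)
    return path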

There were obvious ways to clean up the search engines, but those didn't address the immediate issue of visitors hitting 404 errors. Since I didn't want actual visitors following these mucked-up links to land on a 404 error page, I set about logging every 404 and redirecting the ones that could be recovered to the intended page. Many of the mucked-up links contained enough of the original path that I could identify the original page and put the request back where it belonged. Over time the corrections began to stick in the search engines, and eventually the 404 responses dwindled to a much smaller, manageable number.
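
Here's a rough Python sketch of that recovery step. The hard-coded URL map and the log file name are hypothetical stand-ins for the real site structure, so treat it as the idea rather than the actual handler:

import logging
import re

# Hypothetical sample of the site's real URLs; in practice this map
# would be built from a sitemap or the filesystem, not hard-coded.
REAL_PATHS = [
    "/Articles/Widget-Review.html",
    "/Forum/Index.html",
]

# Lowercased path -> correctly cased original, so case-mangled requests
# can be put back where they belong.
LOWER_TO_REAL = {p.lower(): p for p in REAL_PATHS}

logging.basicConfig(filename="recovered_404s.log", level=logging.INFO)

def recover_404(requested_path):
    """Return (status, location) for a request that would otherwise 404."""
    # Strip the obvious junk (same idea as the earlier sketch), then case-fold.
    cleaned = re.sub(r"(?i)(javascript:|onclick\().*$|\.{3,}", "", requested_path).lower()
    real = LOWER_TO_REAL.get(cleaned)
    if real is None:
        # Fall back to the longest real path still contained in the mangled URL.
        matches = [p for low, p in LOWER_TO_REAL.items() if low in cleaned]
        real = max(matches, key=len) if matches else None
    if real:
        logging.info("recovered %s -> %s", requested_path, real)
        return 301, real  # permanent redirect to the intended page
    logging.info("unrecoverable 404: %s", requested_path)
    return 404, None

With something like that in place, a request for /articles/widget-review.htmlonclick(return false) comes back as a 301 to /Articles/Widget-Review.html instead of a 404, and anything that can't be matched just gets logged for a second look.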

Just another reason to be diligent about blocking unwanted crawlers and scrapers, as nothing good ever came from letting them crawl.