Thursday, November 08, 2007

How to Super Charge Your Link Checker

Most external link checkers can only detect the simple problems with your links, such as servers being offline, missing pages (404 errors), or some other type of server error that makes your outbound link technically broken. These old school link checkers don't know how to detect the many soft 404 errors that send a "200 OK" status even though the page is gone. Worse yet, traditional link checkers aren't smart enough to detect whether the sites you link to have changed hands and now sit in a domain park, have been converted to a porn site, or contain malware.

Here are a few tips for those who want to super charge their link checker to detect domains that have transitioned into domain parks or parked pages and to catch those soft 404 errors.

1. Do a full trip DNS check on your domain names.

A full trip DNS check runs in this order: domain name -> ip address -> host name returned by the reverse lookup of that ip address
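Here is a minimal sketch of a full trip DNS check in Python, assuming the forward lookup uses `socket.gethostbyname` and the reverse lookup uses `socket.gethostbyaddr`. The function names and the entries in `PARKED_HOST_MARKERS` are hypothetical placeholders, not names from any real domain park, so swap in whatever your own lookups turn up.

```python
import socket

# Hypothetical markers: replace these with the host name fragments you
# actually see in reverse lookups of parked domains.
PARKED_HOST_MARKERS = ("parked.example", "domainpark.example")


def full_trip_dns(domain, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve domain -> IP address -> reverse DNS host name.

    Returns (ip, reverse_host); a value is None when its lookup fails.
    """
    try:
        ip = forward(domain)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None, None
    try:
        # gethostbyaddr returns (hostname, aliases, ip_addresses)
        reverse_host = reverse(ip)[0]
    except OSError:  # socket.herror when there is no PTR record
        return ip, None
    return ip, reverse_host


def looks_parked(reverse_host, markers=PARKED_HOST_MARKERS):
    """True when the reverse DNS name contains one of the known markers."""
    if not reverse_host:
        return False
    host = reverse_host.lower()
    return any(marker in host for marker in markers)
```

Calling `full_trip_dns("example.com")` and passing the second value to `looks_parked` gives a quick first screen. The `forward` and `reverse` parameters exist only so the lookups can be swapped out when testing without a network.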

For some parked sites, the full trip DNS lookup returns these domains:

Parked pages on GoDaddy are a bit more complex because it's a combination of parkwebwin + but not too terrible to interpret:

2. Whois Lookup for more detailed information.

If the full trip DNS check fails to uncover anything useful, then the WHOIS information for the domain name and/or IP address might yield interesting results. You might find the site is hosted at which runs, a domain park, or is hosted at (duh!) or shows DNS servers such as NS1.PARKED.COM.
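A rough sketch of the WHOIS step, assuming you talk to a WHOIS server directly over TCP port 43 (the plain-text protocol from RFC 3912). The default server `whois.iana.org` usually only refers you on to the registry's own WHOIS server, so treat that default, the function names, and the `Name Server:` / `nserver:` line formats as assumptions to adjust for the registries you deal with. The only marker in the list is the NS1.PARKED.COM example from above.

```python
import socket

# Name server suffixes that suggest a domain park. PARKED.COM comes from the
# NS1.PARKED.COM example in the text; add others as you find them.
PARKING_NS_MARKERS = ("PARKED.COM",)


def whois_query(query, server="whois.iana.org", port=43, timeout=10):
    """Send one WHOIS query and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def name_servers(whois_text):
    """Collect host names from 'Name Server:' or 'nserver:' lines."""
    servers = []
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("name server", "nserver") and value.strip():
            servers.append(value.split()[0].upper())
    return servers


def uses_parking_dns(whois_text, markers=PARKING_NS_MARKERS):
    """True when any listed name server belongs to a known parking domain."""
    return any(
        ns == marker or ns.endswith("." + marker)
        for ns in name_servers(whois_text)
        for marker in markers
    )
```

Keeping the parsing separate from the network call means you can run `uses_parking_dns` over WHOIS text you have already saved.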

3. Examine the redirects and landing page names.

If you process your own redirects when you request the URL, you can observe that certain types of soft 404 errors redirect to the server's home page or to a standard default page served up by an admin control panel. Additionally, some parked pages pass through intermediate redirects that clearly show the request is being sent to a landing page, and those can be trapped as well.

Some sites return a "200 OK" but land on a page name like "404error.html" or "404.asp", and there is a long list of these. Unfortunately, just looking for any page with "404" in the page name will kick out many false positives, but recording the error page names you run into will quickly build up a good list to match against.
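One way to sketch the redirect check in Python, assuming `urllib.request` with a redirect handler that records every hop. The class and function names are made up for illustration, the pattern only covers the two page names mentioned above, and the "redirected to home page" rule is a simple guess at what a home page redirect looks like.

```python
import re
import urllib.request
from urllib.parse import urlsplit

# Only the two page names from the text; extend this as you record more.
ERROR_PAGE_NAMES = re.compile(r"^(404error\.html|404\.asp)$", re.IGNORECASE)


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follows redirects as usual but remembers every hop."""

    def __init__(self):
        self.chain = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.chain.append(newurl)
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def fetch_chain(url, timeout=10):
    """Return every URL visited, starting with the one requested."""
    handler = RecordingRedirectHandler()
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout):
        return [url] + handler.chain


def soft_404_reason(chain):
    """Explain why a redirect chain looks like a soft 404, or return None."""
    start, final = urlsplit(chain[0]), urlsplit(chain[-1])
    page_name = final.path.rsplit("/", 1)[-1]
    if ERROR_PAGE_NAMES.match(page_name):
        return "error page name"
    if len(chain) > 1 and start.path not in ("", "/") and final.path in ("", "/"):
        return "redirected to home page"
    return None
```

Feeding the result of `fetch_chain(url)` into `soft_404_reason` flags the link for a closer look. A hard error such as a real 404 raises `urllib.error.HTTPError` from `opener.open`, which your regular link checking already handles.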

Some samples of various types of 404 pages and URLs you might find:

4. Examine the page content

The least accurate method is to actually process the content of the landing page and look for various fingerprints that can be used to detect a site gone bad. Simple phrases such as "this site is temporarily not available" or "this web site coming soon" can spot sites that are no longer active. The problem with this method is that the text fingerprints can easily be changed and may generate some false positives, which makes it the least reliable check. However, it's often the final recourse for detecting 100s of bad pages, so you just keep updating your list of fingerprints as you find them and manually double check these types of broken links for false positives.
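A small sketch of fingerprint matching, using only the two phrases quoted above as the starting list. The function names are hypothetical, and the matching is a plain case-insensitive substring test after collapsing whitespace, which is one simple choice among many.

```python
import re

# Starting list taken from the phrases in the text; keep adding to it.
FINGERPRINTS = (
    "this site is temporarily not available",
    "this web site coming soon",
)


def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def matched_fingerprints(page_text, fingerprints=FINGERPRINTS):
    """Return the fingerprints found in the page, in list order."""
    haystack = normalize(page_text)
    return [fp for fp in fingerprints if normalize(fp) in haystack]
```

Any page that returns a non-empty list goes into the pile for a manual double check rather than being removed automatically.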

5. Compare the previous WHOIS profile

Save copies of all the WHOIS information you get during link checking and use it in future link checks to detect ownership changes. Assuming the link checker passes the site after all of the above checks, compare the current WHOIS information to what you saved the last time you checked the site. Odds are that if the site has changed hands it no longer contains the content you originally linked to, and it may be a link you want to remove.
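The comparison could look something like this sketch, assuming the WHOIS text uses `Key: value` lines and that the fields in `WATCHED_FIELDS` are the ones worth watching. WHOIS formats differ between registries, so the field names, function names, and JSON storage are all assumptions to adapt.

```python
import json

# Assumed field names; adjust to the WHOIS formats you actually receive.
WATCHED_FIELDS = ("registrar", "registrant name", "registrant organization", "name server")


def whois_profile(whois_text, fields=WATCHED_FIELDS):
    """Reduce raw WHOIS text to {field: sorted list of lowercased values}."""
    profile = {}
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip().lower(), value.strip().lower()
        if sep and value and key in fields:
            profile.setdefault(key, set()).add(value)
    return {key: sorted(values) for key, values in profile.items()}


def changed_fields(previous, current):
    """List the watched fields whose values differ between two profiles."""
    keys = sorted(set(previous) | set(current))
    return [key for key in keys if previous.get(key) != current.get(key)]


def save_profile(path, profile):
    """Store a profile as JSON for the next link check."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(profile, handle, indent=2, sort_keys=True)


def load_profile(path):
    """Load a profile saved by save_profile."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

If `changed_fields(load_profile(path), whois_profile(new_text))` comes back non-empty, the link is a candidate for review. A registrar change alone doesn't prove the site changed hands, so it's worth looking at which fields moved.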


Now you know all of my basic ingredients for building a super charged link checker and should have some ideas on how to spruce up your own. Building the ultimate link checker isn't something simple that can be accomplished in a day, nor does the work on it ever stop, because the internet is constantly changing. However, if you have a ton of outbound links or run a large directory, a super charged link checker is the only practical way to check links, and time spent building it is far better spent than checking tens of thousands of links by hand.
