Tuesday, August 15, 2006

Link Checkers Don't Understand

I'm having a few conversations with some link checker sites, and they're going nowhere.

ME: "Sorry, but I had to block your link checker. You're never going to find what you want, because I can't allow any of you to crawl 40K pages. Would you mind just telling me what you want to find, so I can tell you exactly where it is?"

Link Checkers: "Just point us to your links page with robots.txt"

ME: "The whole site is links, it's a directory, and robots.txt is EXCLUSION only, not INCLUSION. I can't tell you where to crawl, only where NOT to crawl, which is impractical with 40K pages anyway."

Link Checkers: "We stop after X pages anyway."

ME: "You're still wasting my bandwidth, as the odds of finding what you're looking for in the top-level pages are real slim. How about putting the site you're looking for in the referrer field, and I'll just redirect your crawler to the exact page you need?"

Link Checkers: "Error, does not compute, too logical, error, error, erroooooooorrrrrr...."

So there you have my current state of impasse with the link checking community.

As soon as they can come up with a compromise I'll unblock them, but until then NADA PAGE!


Lea said...

I'm about to write a linkchecker. My intention is to check that (generally) single pages on other sites aren't 404ing before I display links to them. This is for an informational site of mine that links to pages on obscure sites, and the checker is intended to minimise the hands-on maintenance I have to do.
I was planning to have it check something like weekly.
So, tell me, despite it being unlikely that I will be sending my checker to any of your sites, what would you like to see in the agent string to help identify it?
(Obviously I'll be obeying robots commands)

Anonymous said...

Lea, why don't you just get the HTTP headers?

IncrediBILL said...

Lea's doing the same thing I did, check one page and see if it still exists.

The problem is that you actually NEED the full page, not just the header, because some people lose their domains and what comes back is no longer the page you originally linked to.

Scan for words like this in the returned page text:

* domaincontender.com
* seeq
* directnic
* etc.

Case-insensitive search, obviously. Those are a couple of examples of how I detect pages that are now in domain parks and eliminate them.
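A minimal Python sketch of that kind of scan, for illustration only. The names `PARKED_MARKERS` and `looks_parked` are made up here, and the marker list holds just the three examples given above, so a real list would need to be longer. It assumes the page text has already been downloaded.

```python
# Hypothetical marker list: only the three examples named in the comment above.
PARKED_MARKERS = ("domaincontender.com", "seeq", "directnic")


def looks_parked(page_text, markers=PARKED_MARKERS):
    """Return the markers that appear in page_text, ignoring case."""
    lowered = page_text.lower()
    return [marker for marker in markers if marker.lower() in lowered]


if __name__ == "__main__":
    sample = "<html><body>This domain is parked courtesy of DirectNIC</body></html>"
    hits = looks_parked(sample)
    if hits:
        print("Probably parked, matched:", hits)
```

This is plain substring matching, so a short marker can also match inside an unrelated word. Treat a hit as a reason to review the link by hand rather than as proof.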

Another possible way is to retain the IP address of each linked site and store it for future reference.

Scanning all the domains for their current IP address is a real easy way to spot sites that have moved or simply gone away, and it's fast as hell too.

With that narrowed list of sites with changed IPs, you can scan just those sites' pages in a flash.
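The IP comparison could be sketched roughly like this in Python. The function names and the `stored_ips` dictionary (host name mapped to the last IP address seen) are assumptions for illustration, not anything from an actual checker. The lookup uses the standard library's `socket.gethostbyname`, which returns a single IPv4 address.

```python
import socket


def resolve(host):
    """Return the host's current IPv4 address, or None if it no longer resolves."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def changed_hosts(stored_ips, resolver=resolve):
    """Return {host: (old_ip, new_ip)} for hosts whose address differs from the stored one."""
    changed = {}
    for host, old_ip in stored_ips.items():
        new_ip = resolver(host)
        if new_ip != old_ip:
            changed[host] = (old_ip, new_ip)
    return changed


if __name__ == "__main__":
    # Hypothetical stored data; in practice this would come from wherever the links are kept.
    stored = {"example.com": "192.0.2.10"}
    for host, (old_ip, new_ip) in changed_hosts(stored).items():
        print(host, "was", old_ip, "now", new_ip or "unresolvable")
```

A changed address only says the site is worth a closer look. Hosts that rotate addresses or change providers will show up too, which is why the full-page scan is still run on the narrowed list.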

Lea said...

Thanks Bill

(anonymous - even if I do get the headers, I still need an agent string!)