Thursday, October 26, 2006

Google, Yahoo and MSN Like Indexing Pure Garbage Sites

The other day I was working on a link-checking filter so I could comb through many thousands of linked sites and automatically eliminate from my index any that no longer contain valuable content.

What I did was make a filter that profiles the information on each page, looking for signals that a site has reverted to a default registrar page or default hosting page, or has become part of a domain park or a scraper site.
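For anyone curious what that kind of page profiling might look like, here is a minimal sketch in Python. It is not my actual filter: the phrase list, the 50-word cutoff, the threshold, and the function names (`strip_tags`, `parked_score`, `looks_parked`) are all made up for illustration, and a real list would need tuning against known parked pages.

```python
import re

# Hypothetical signal phrases (assumed for illustration, not a tested list).
PARKED_SIGNALS = [
    r"this domain (name )?is for sale",
    r"domain (is )?parked",
    r"buy this domain",
    r"coming soon",
    r"under construction",
    r"default web ?site page",
    r"sponsored listings",
]


def strip_tags(html):
    """Crudely drop script/style blocks and tags, then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip().lower()


def parked_score(html):
    """Count how many garbage-page signals appear in the visible text."""
    text = strip_tags(html)
    hits = sum(1 for pattern in PARKED_SIGNALS if re.search(pattern, text))
    # Very little visible text is treated as one extra, weak signal.
    if len(text.split()) < 50:
        hits += 1
    return hits


def looks_parked(html, threshold=2):
    """Flag a page once enough independent signals line up."""
    return parked_score(html) >= threshold
```

Requiring two or more signals is a deliberate choice in this sketch, so a short but legitimate page is not thrown out on word count alone.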

After successfully detecting and filtering out many sites that had fallen by the wayside, I started to wonder if the search engines actually indexed all of this crap.

Sure enough, a quick check of Google, Yahoo and MSN confirmed that the search engines eat these shit sites like candy, even though they can be easily detected and eliminated by profiling the page content, checking the whois information, or a combination of both.

What purpose does indexing these millions of garbage web sites serve for any search engine?

I mean seriously, the scraper spam sites are one thing, but these are so easily detected that there's no rhyme or reason for them to show up as results for any search, since they are 100% crap.

Anyone from one of the major search engines mind dropping a note to explain why hundreds of thousands of cloned garbage sites are being indexed?

We'd really love to hear from you on this topic, so please feel free to post a comment :)

2 comments:

Anonymous said...

They may be being indexed, but the important thing is: do any of them rank anything worth a damn? Probably not.

They are just a small part of the 'noise' indexed by all the engines.

If you filter them out, you may as well filter out all the other seemingly useless, valueless content in their indexes as well, like people's old log files and countless other examples of garbage the SEs manage to net.

However, what is garbage to one person may be gold to another. For someone else's purposes, the presence of these sites in the index may be useful.

And perhaps the spiders need to visit and index those sites just in case they suddenly turn into something useful too.

So once again, the issue is not whether they are indexed, but how they rank.

I suspect under most of the major engines these garbage sites do indeed rank like garbage.

Anonymous said...

I think leaving these sites indexed, but not showing them as relevant search results, comes in quite handy for spam research. If someone is actually looking for text snippets by spambots, then the crap should show up, but it should not whenever a regular search is done*.

--
* An example could be a search for "free" "widgets", which ideally yields pages offering free widgets and not some dumbarse's doorway hell redirecting to a PPC-sponsored rubbish page.