Thursday, October 19, 2006

Nutch used to advertise Houxou?

I keep seeing this crawler for Houxou: "HouxouCrawler/Nutch-0.8.2-dev ('s nutch-based crawler which serves special interest on-line communities;; crawler at houxou dot com)"
When you go to their link it doesn't say anything about the crawler, it just shows you their homepage. I'm not sure what special interest on-line communities you can possibly be serving when you can't even post the page your user agent link claims is on your website.

Before I gave up altogether, I decided to see what I could come up with in Google. I found some interesting results, but the site appears to be down:
Nutch: search results
help. Hits 1-9 (out of about 9 total matching pages): WHOIS - ... 20030922 source: RIPE person: Monu Ogbe address: 15 Penman Close, ... - 10k - Supplemental Result - Cached - Similar pages

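Nutch: Search Help
Searches for English words are not case-sensitive, so searching for NuTcH is the same as searching for nUtCh. ... score details) shows how Nutch scored the page. (anchors) shows the anchor text pointing to the page that Nutch indexed. ... - 7k - Supplemental Result - Cached - Similar pages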

So what's the deal?

Why is Houxou crawling with a link to a missing page about bots?

Is this just a ploy to get webmasters trying to figure out what the Houxou crawler is to look at their hosting services?

Who knows. Guess we'll just have to wait and see, but it smells fishy to me.

Smiley Face User Agent

This should be filed under "What the fuck is wrong with people".

Here's the user agent, with a smiley face hyperlinked to some bullshit website in The Netherlands:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; <A HREF=>;-)</A>; .NET CLR 1.1.4322; InfoPath.1)"
Looks like some asshole might've done this to his browser, as the request did have a Google referrer, so it's probably a real human that landed on my site.

Well pal, you got an error message when you hit my site didn't you?

Bet you're not so fucking smiley faced now.

Stupid shit.
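
For anyone who wants to trip on this kind of user agent automatically, here's a rough Python sketch. It just checks whether the User-Agent string contains anything shaped like an HTML tag, which no normal browser sends. The function name has_embedded_markup is made up for illustration, and this is not necessarily how my own filter caught it.

```python
import re

# Matches anything shaped like an HTML tag, e.g. <A HREF=...> or </A>.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def has_embedded_markup(user_agent: str) -> bool:
    """Return True if the User-Agent string contains HTML-like tags.

    Hypothetical helper for illustration only.
    """
    return bool(_TAG_RE.search(user_agent))


# The smiley face user agent from this post.
smiley = ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          '<A HREF=>;-)</A>; .NET CLR 1.1.4322; InfoPath.1)')

# An ordinary browser user agent for comparison.
normal = ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) '
          'Gecko/20041107 Firefox/1.0')

if __name__ == "__main__":
    for ua in (smiley, normal):
        verdict = "reject" if has_embedded_markup(ua) else "allow"
        print(verdict, "-", ua)
```

A check like this only looks at the shape of the string, so it would also flag any legitimate client that happened to put angle-bracket tags in its user agent.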

Sunday, October 15, 2006

Netsweeper Caught Using Multiple Brooms

In the badly behaving corporate bots dept. we offer Netsweeper as our newest entry from Canada. They run one of those content filtering companies that think they should be allowed to crawl your site no matter what, just to protect their clients.

Sorry, but we happen to disagree with all these content filtering spiders that feel the need to crawl without any regard for robots.txt and we really don't need a whole buttload of content filtering companies scanning the fucking web.

Yes, I threw in the word fucking just so your asshole spider will flag this post as bad content and none of your goddamn customers can read it, so blow that out your ass.

Let's see what Netsweeper runs:
"webcollage/1.127"
"NutchCVS/0.7.2 (Nutch;;"
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
These IP addresses have the following host names: -> ->
Let's just cut to the chase. Here's the information to block their ass:
CustName: Netsweeper
Address: 4-512 Woolwich Street
City: Guelph
StateProv: ON
PostalCode: N1H-3X7
Country: CA
RegDate: 2003-04-08
Updated: 2003-04-08

NetRange: -
Ta ta Netsweeper, you've been blocked and swept under my rug.
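
If you want to do the same, here's a minimal Python sketch of the idea. Two of the three user agents above carry a crawler name you can match on, but the third is a plain Firefox string, so that one has to be caught by address instead. The function should_block and both constants are hypothetical names for illustration. The network in BLOCKED_NETWORKS is a placeholder from the reserved documentation range, NOT Netsweeper's actual NetRange, so swap in the real range from the WHOIS record before using it.

```python
import ipaddress

# Crawler tokens taken from the user agents listed in this post (lowercased).
BLOCKED_UA_TOKENS = ("webcollage/", "nutchcvs/")

# PLACEHOLDER ONLY: 192.0.2.0/24 is the reserved documentation range
# (TEST-NET-1). Replace it with the real NetRange from the WHOIS record.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def should_block(remote_addr: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked token or network.

    Hypothetical helper for illustration only.
    """
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True
    # A browser-looking user agent can only be caught by its address.
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)


FIREFOX_UA = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) "
              "Gecko/20041107 Firefox/1.0")

if __name__ == "__main__":
    print(should_block("198.51.100.7", "webcollage/1.127"))
    print(should_block("192.0.2.44", FIREFOX_UA))
    print(should_block("198.51.100.7", FIREFOX_UA))
```

Checking the address as well as the user agent matters here because a stock Firefox string is indistinguishable from a real visitor, and blocking on it alone would lock out actual people.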