Saturday, June 17, 2006

New Nutch Sighting at Rediff

Just to make sure our list doesn't grow stale, here's the new Nutch of the day:

203.199.83.162
pro3.rediffmailpro.com
"NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
No clue why a business email company needs a web crawler but that's where it came from.

I noticed the Nutch developers picked up on my previous post and are discussing forcing the default user agent to be changed (which, yet again, it wasn't in today's sighting) and ways to reduce the amount of actual crawling of individual websites by Nutch.

Good luck on that effort guys, we can use it!

Thursday, June 15, 2006

Hanzo:web Social Archiving is Social Copyright Infringement

OK boys and girls, it's time to get pissed off as all notion of copyright and control of your site content has been tossed out the window as the fine folks over at hanzo:web ARCHIVE your site content on demand!

That's right, you click on their bookmarklet and TA DA! your page gets archived WITHOUT YOUR PERMISSION on someone else's server.

Here's the most priceless quote on their site:

Only you can save the Web!
So who's going to save the web from some bullshit like this?

Did you bother asking webmasters if they want their websites saved?

I don't want to be archived, I don't need to be saved, take your archiving toys and go fuck yourselves!

SPIT ALERT - PUT DOWN YOUR DRINK!

I just about wrecked a keyboard while sipping soda when I ran across this:

Respect for content

All archived pages, links and sites are stored exactly as they appeared on the web. Pictures, objects, links and flash are all retained as they are, preserved as originally conceived.

RESPECT FOR CONTENT?

Are you fucking kidding me?

Where's the respect for my fucking copyright?

You'll be archiving pages WITHOUT PERMISSION, possibly with someone's AdSense account embedded, and someone can sit on your site click-frauding accounts to death or stealing content, and nobody can even detect that the pages are being accessed via the archive.

When they "archive" your page it gets crawled by the following:
87.98.198.194 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"

inetnum: 87.98.198.192 - 87.98.198.207
netname: hanzoweb
descr: Hanzo Archives Ltd
Now look at this shit coming from their servers:
87.98.198.194 "GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
87.98.198.194 "GET /robots.txt" "Python-urllib/1.16"
87.98.198.194 "GET / " "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
87.98.198.194 "GET /robots.txt HTTP/1.0" "Python-urllib/1.16"
87.98.198.194 "GET /" "Python-urllib/2.4"
87.98.198.194 "GET /" "Python-urllib/2.4"
So it's looking at robots.txt but what user agent are they looking for?

I dug around on their site and didn't see it, so I have no clue what the Python-urllib is looking for in robots.txt, but it really doesn't matter because the FAQ page plainly states that they don't give a flying fuck about your robots.txt file, they'll archive it anyway no matter WHAT YOU SAY MR. WEBMASTER and make it private:
The original crawl was subject to restrictions by robots.txt. This means that any archived content will be marked as private for browsing by the person crawling it, therefore, unless its your own archive, you will not see this content.
Sounds to me, as a webmaster, they're saying "FUCK YOU!".

Well, I blocked your service, so this webmaster is replying in kind: "FUCK YOU!" No trespassing allowed.

This is a huge problem as people will be snapping copies of anything for any reason and you, the webmaster, will have no control over what Hanzo:web stores or displays nor what these people do with your content after the fact.

BTW, when people start flaming me that I should've "contacted" them to find out what they were looking for in the robots.txt file, if they were doing it right, the path to this information would've been in the user agent string just like all the other sites do, or highlighted in the FAQ.
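
For what it's worth, here's a minimal sketch of why that token matters, using Python's standard urllib.robotparser. The "heritrix" token and the example.com URL are my assumptions for illustration only; nothing on their site confirms what their fetcher actually matches on.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "heritrix" is an ASSUMED token, not one
# documented anywhere by hanzo:web.
ROBOTS_TXT = """\
User-agent: heritrix
Disallow: /

User-agent: *
Disallow:
"""


def allowed(token, url, robots_txt=ROBOTS_TXT):
    """Return True if a crawler identifying as `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    # Compare the guessed token against the other agent seen in the logs.
    for token in ("heritrix", "Python-urllib"):
        print(token, allowed(token, "http://example.com/page.html"))
```

If the crawler matches on some other token, the first rule never fires and the catch-all rule lets it in, which is why the token needs to be published somewhere obvious.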

Nice idea but your draconian implementation doesn't deserve a second chance and it's blocked, out of mind, not a problem for me anymore.

FWIW, my bot blocker already stopped them from getting anything in the first place but I'm blocking their whole range of IPs just to make sure nothing slips through the cracks like stealth crawling as they have already demonstrated a complete lack of respect for everyone's website.
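
If you want to block the same range, here's a rough sketch of turning the inetnum range quoted above into CIDR form with Python's standard ipaddress module. The is_blocked helper is a made-up name for illustration, not part of my actual bot blocker.

```python
import ipaddress

# Start and end of the inetnum range from the whois output quoted above.
FIRST = ipaddress.ip_address("87.98.198.192")
LAST = ipaddress.ip_address("87.98.198.207")

# Collapse the start/end pair into the CIDR block(s) that cover it exactly.
BLOCKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def is_blocked(ip):
    """Hypothetical helper: True if `ip` falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKS)


if __name__ == "__main__":
    print([str(net) for net in BLOCKS])
    print(is_blocked("87.98.198.194"))
```

The CIDR form is usually what a firewall or server deny rule wants, so it saves listing all sixteen addresses by hand.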

Why is MonsterCommerce mining my site?

Didn't think this was worth mentioning until it happened several days in a row.

This is what's requesting various web pages:

72.32.59.114
wordsmith.monstercommerce.com
"Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"
So the question is what do they want?

Are they scraping my directory for potential customer leads?

Very curious indeed.

Wednesday, June 14, 2006

Green Template Must Go

OK, I can't stand it anymore, some of the colors in the blog are making me crazy and I just tweaked a couple of fonts because I could barely read the block quotes.

Going to either tweak this template some more or ditch it for something else altogether.

Any suggestions?

Tuesday, June 13, 2006

How Much Nutch is TOO MUCH Nutch?

Not too long ago I set off a storm of comments when I called the writer of nutch on the carpet, claiming his creation was being used excessively and was abusing my server all over the place.

I was told by the legions of nutchies out there that I sucked, was told to get off the public network, called everything from an idiot to a grumpy webmaster and worse. They all claimed that nutch was a wonderful thing and made search engines that were beneficial and I should stop complaining, shut up and let them crawl.

Bullshit.

Being a patient man, I sat back and waited to collect enough data to show those nutchies that the usage of nutch is growing out of control and I really don't need 100+ unique IP addresses, everywhere from Turkey to Japan, crawling my goddamn website.

Theoretically, if these 100 crawlers ask for my max of 40K+ pages each that's over 4 million pages served, assuming I let them have them in the first place, mostly for no purpose whatsoever.

Here's the list of the nutch plague seen on my site recently:

124.32.246.36 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

124.32.246.45 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

128.208.6.200 NutchCVS/0.7.1 (Nutch running at UW; http://crawlers.cs.washington.edu/; sycrawl@cs.washington.edu)

128.208.6.226 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.227 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.77 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

129.242.19.138 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

129.34.20.19 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

129.78.64.106 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.16.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.16.220 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.211.84.21 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

136.165.45.122 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

137.43.154.203 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

147.202.90.2 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.24 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.245 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.26 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.27 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.68 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.85 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

166.214.93.76 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.117 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.118 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.119 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.120 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.121 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.122 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.252.148.51 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.113.130.205 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.131.194.84 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.147.0.44 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.244.218.1 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

209.131.61.1 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.174.3.130 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.196.73.193 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

210.245.31.15 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.245.31.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.12.114.238 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.127.226.60 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.137.33.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.156.230.210 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.58.116.72 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.132.175.101 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.186.36.107 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.251.133.12 Misterbot-Nutch/0.7.1 (Misterbot-Nutch; http://www.misterbot.fr; nutch at misterbot.fr)

216.93.185.12 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

220.218.159.50 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.114.253.210 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.116.237.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.221.237.35 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.222.153.250 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.224.226.18 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.186.61.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.187.12.236 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.87.139.90 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.115 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

60.248.9.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

61.135.151.175 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

62.129.132.47 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.168.188.151 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.40.36.87 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

63.133.162.98 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.105.36.210 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.151.112.44 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.241.242.18 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.10 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.60 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.34.172.78 BurstFind Crawler 1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; crawler@burstfind.com)

64.34.180.167 Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.38.10.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.103 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.107 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.108 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.125 Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://www.krugle.com/crawler/info.html; webcrawler@krugle.com)

65.220.67.9 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

65.9.20.49 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

65.91.114.3 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.108.32.4 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.15.68.234 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.162.5.43 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.207.120.226 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.243.31.34 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.111.28.139 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.52.101.242 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.205.124.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.205.127.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.248.26.83 Comrite/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.55.233.28 Argus/1.1 (Nutch; http://www.simpy.com/bot.html; feedback at simpy dot com)

70.197.81.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.30.97.106 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.56.66.216 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.96.99.254 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.241.153.125 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.35.163.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.0.207.162 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.2.25.67 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

72.5.173.12 sdcresearchlabs-testbot/0.8-dev (www.shopping.com/bot.html; http://lucene.apache.org/nutch/bot.html; researchbot@shopping.com)

72.51.37.148 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.203.142.109 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

83.246.79.28 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

84.191.111.92 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
So we cranked all these IPs thru the reverse DNS grinder just to see who was running all this nutch without changing the default crawler strings. There are some that failed reverse DNS but I'm too lazy today to bother with WHOIS'ing that list so anyone that feels compelled to dig deeper feel free to post the results as a comment.
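
For anyone who does want to dig deeper, here's a minimal sketch of that kind of lookup in Python. It's not the tool I actually ran; the function names are made up for illustration, and reverse_dns needs a working resolver to return anything.

```python
import ipaddress
import socket


def arpa_name(ip):
    """Build the in-addr.arpa name that a PTR lookup queries for `ip`."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_dns(ip):
    """Return the PTR hostname for `ip`, or None if the lookup fails.

    Failures such as NXDOMAIN or SERVFAIL surface as OSError subclasses,
    so they are all reported as None here.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return None

# Example (requires network access):
#   for ip in ("124.32.246.36", "128.208.6.200"):
#       print(arpa_name(ip), reverse_dns(ip))
```

Anything that comes back as None is a candidate for the WHOIS pass I skipped.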

The reverse DNS of the IPs:
36.246.32.124.in-addr.arpa name = 124x32x246x36.ap124.ftth.ucom.ne.jp.
45.246.32.124.in-addr.arpa name = 124x32x246x45.ap124.ftth.ucom.ne.jp.
200.6.208.128.in-addr.arpa name = qbert.cs.washington.edu.
226.6.208.128.in-addr.arpa name = zork.cs.washington.edu.
227.6.208.128.in-addr.arpa name = nethack.cs.washington.edu.
77.6.208.128.in-addr.arpa name = pacman.cs.washington.edu.
138.19.242.129.in-addr.arpa name = vortex05.cs.uit.no.
19.20.34.129.in-addr.arpa name = yktgi01e0-s5.watson.ibm.com.
106.64.78.129.in-addr.arpa name = www-cacheC.usyd.edu.au.
140.16.112.131.in-addr.arpa name = ginga.ks.cs.titech.ac.jp.
220.16.112.131.in-addr.arpa name = endeavor.furui.cs.titech.ac.jp.
21.84.211.131.in-addr.arpa name = burum.labs.cs.uu.nl.
122.45.165.136.in-addr.arpa name = webmining.spd.louisville.edu.
203.154.43.137.in-addr.arpa name = dhcp-892b9acb.ucd.ie.
** server can't find 2.90.202.147.in-addr.arpa: SERVFAIL
24.195.67.164.in-addr.arpa name = cairo.ee.ucla.edu.
245.195.67.164.in-addr.arpa name = tacoma.ee.ucla.edu.
26.195.67.164.in-addr.arpa name = archer.ee.ucla.edu.
27.195.67.164.in-addr.arpa name = gutman.ee.ucla.edu.
68.195.67.164.in-addr.arpa name = treviso.ee.ucla.edu.
85.195.67.164.in-addr.arpa name = chandra.ee.ucla.edu.
** server can't find 76.93.214.166.in-addr.arpa: SERVFAIL
** server can't find 117.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 118.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 119.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 120.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 121.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 122.240.203.193.in-addr.arpa: NXDOMAIN
51.148.252.193.in-addr.arpa name = io.x-echo.com.
** server can't find 205.130.113.203.in-addr.arpa: NXDOMAIN
84.194.131.203.in-addr.arpa name = ldnatvip.livedoor.com.
44.0.147.203.in-addr.arpa name = proxy.ji-net.com.
1.218.244.203.in-addr.arpa name = u1.gpu42.samsung.co.kr.
1.61.131.209.in-addr.arpa name = nat1.burbank.corp.yahoo.com.
130.3.174.210.in-addr.arpa name = pae0382.tokyte00.ap.so-net.ne.jp.
193.73.196.210.in-addr.arpa canonical name = 193.192h.73.196.210.in-addr.arpa.
193.192h.73.196.210.in-addr.arpa name = aa2003080550002.userreverse.dion.ne.jp.
** server can't find 15.31.245.210.in-addr.arpa: NXDOMAIN
** server can't find 18.31.245.210.in-addr.arpa: NXDOMAIN
*** Can't find 238.114.12.212.in-addr.arpa.: No answer
60.226.127.212.in-addr.arpa name = 212-127-226-60.cable.quicknet.nl.
** server can't find 140.33.137.212.in-addr.arpa: NXDOMAIN
210.230.156.212.in-addr.arpa name = dsl.static212156230210.ttnet.net.tr.
** server can't find 72.116.58.212.in-addr.arpa: NXDOMAIN
101.175.132.213.in-addr.arpa name = webserver4.octrium.nl.
107.36.186.213.in-addr.arpa name = ns3490.ovh.net.
12.133.251.213.in-addr.arpa name = ns31793.ovh.net.
12.185.93.216.in-addr.arpa name = customer-reverse-entry.216.93.185.12.
50.159.218.220.in-addr.arpa name = 220x218x159x50.ap220.ftth.ucom.ne.jp.
210.253.114.221.in-addr.arpa name = 221x114x253x210.ap221.ftth.ucom.ne.jp.
114.237.116.221.in-addr.arpa name = 221x116x237x114.ap221.ftth.ucom.ne.jp.
** server can't find 35.237.221.221.in-addr.arpa: NXDOMAIN
250.153.222.24.in-addr.arpa name = blk-222-153-250.eastlink.ca.
18.226.224.24.in-addr.arpa name = blk-224-226-18.eastlink.ca.
164.61.186.58.in-addr.arpa name = 58-186-61-xxx-dynamic.hcm.fpt.vn.
236.12.187.58.in-addr.arpa name = adsl-dynamic-pool-xxx.fpt.vn.
90.139.87.58.in-addr.arpa name = p578b5a.tokyte00.ap.so-net.ne.jp.
115.240.160.59.in-addr.arpa name = 59.160.240.115.static.vsnl.net.in.
114.9.248.60.in-addr.arpa name = 60-248-9-114.HINET-IP.hinet.net.
** server can't find 175.151.135.61.in-addr.arpa: NXDOMAIN
47.132.129.62.in-addr.arpa name = HOSTED-BY.PBTECH.COM.
151.188.168.62.in-addr.arpa name = host-62-168-188-151.adsl.caucasus.net.
87.36.40.62.in-addr.arpa name = pythagoras.portal.o2.ie.
98.162.133.63.in-addr.arpa name = mail.visvo.com.
210.36.105.64.in-addr.arpa name = h-64-105-36-210.snvacaid.covad.net.
44.112.151.64.in-addr.arpa name = customer-reverse-entry.64.151.112.44.
18.242.241.64.in-addr.arpa name = sv-fw.looksmart.com.
10.88.242.64.in-addr.arpa name = sv-crawl.looksmart.com.
60.88.242.64.in-addr.arpa name = sv-crawlfw4.looksmart.com.
78.172.34.64.in-addr.arpa name = slimy.vhosting.com.
167.180.34.64.in-addr.arpa name = server7.springright.com.
26.10.38.64.in-addr.arpa name = server1.netsweeper.com.
** server can't find 103.164.71.64.in-addr.arpa: NXDOMAIN
** server can't find 107.164.71.64.in-addr.arpa: NXDOMAIN
** server can't find 108.164.71.64.in-addr.arpa: NXDOMAIN
** server can't find 125.164.71.64.in-addr.arpa: NXDOMAIN
9.67.220.65.in-addr.arpa name = mirror.setnine.com.
49.20.9.65.in-addr.arpa name = adsl-9-20-49.mia.bellsouth.net.
** server can't find 3.114.91.65.in-addr.arpa: NXDOMAIN
4.32.108.66.in-addr.arpa name = cpe-66-108-32-4.nyc.res.rr.com.
234.68.15.66.in-addr.arpa name = bdsl.66.15.68.234.gte.net.
43.5.162.66.in-addr.arpa name = 66-162-5-43.static.twtelecom.net.
226.120.207.66.in-addr.arpa name = firewall.net-sweeper.com.
** server can't find 34.31.243.66.in-addr.arpa: SERVFAIL
** server can't find 139.28.111.67.in-addr.arpa: SERVFAIL
242.101.52.67.in-addr.arpa name = rrcs-67-52-101-242.west.biz.rr.com.
164.124.205.68.in-addr.arpa name = 164.124.205.68.cfl.res.rr.com.
94.127.205.68.in-addr.arpa name = 94.127.205.68.cfl.res.rr.com.
83.26.248.69.in-addr.arpa name = c-69-248-26-83.hsd1.nj.comcast.net.
28.233.55.69.in-addr.arpa name = HVAR.SIMPY.com.
79.81.197.70.in-addr.arpa name = 79.sub-70-197-81.myvzw.com.
106.97.30.70.in-addr.arpa name = CPE00095b51f4c9-CM0013718d007c.cpe.net.cable.rogers.com.
216.66.56.70.in-addr.arpa name = 70-56-66-216.tukw.qwest.net.
** server can't find 254.99.96.70.in-addr.arpa: NXDOMAIN
125.153.241.71.in-addr.arpa name = pool-71-241-153-125.nycmny.fios.verizon.net.
79.163.35.71.in-addr.arpa name = 71-35-163-79.tukw.qwest.net.
162.207.0.72.in-addr.arpa canonical name = 162.160/27.207.0.72.in-addr.arpa.
162.160/27.207.0.72.in-addr.arpa name = link.enhancededge.com.
67.25.2.72.in-addr.arpa name = h72-2-25-67.bigpipeinc.com.
** server can't find 12.173.5.72.in-addr.arpa: NXDOMAIN
148.37.51.72.in-addr.arpa name = server1.properazzi.com.
109.142.203.81.in-addr.arpa name = 81-203-142-109.user.ono.com.
*** Can't find 28.79.246.83.in-addr.arpa.: No answer
92.111.191.84.in-addr.arpa name = p54BF6F5C.dip.t-dialin.net.
Looks like just about everybody's running it from colleges to corporations and even Uncle Bob crawling the web from a dial-in, talk about slow, but where is the benefit for those of us being abused with it?

I'll admit that a few of the nutches actually resulted in search engines showing up online but who uses these search engines? Best I can tell, none of the actual 400K visitors/month to my site that's being crawled use any of these so-called search engines and probably never will.

Here's the problem, and maybe I'm just using nutch as an example because it's so easy to spot this virulent trend with a single source, but the amount of things attempting [they didn't succeed] to crawl my site daily would easily become a significant portion of my daily traffic if I let them all in which is insane.

What happens when this trend reaches its natural conclusion?

Where it's heading is that the crawlers will soon exceed the actual visitors in terms of daily pages downloaded as more and more search engines, aggregators, and spybots come online looking for more ways to sell a slice of the internet to an ever increasing bunch of specialized niche markets. Not to mention we're still dealing with all the scrapers, link checkers and downright dumb things like referrer checkers abusing our bandwidth.

It's out of control and someone needs to put the brakes on this nonsense.

Someday soon crawlers, with the exception of the big search engines, will need to ask permission to get on just about any website of scale, and will need to make a compelling argument why they should be allowed to index the site. The day of just taking what you want and doing what you want with it will surely come screeching to a halt as the burden of all this bandwidth usage starts to hit the hosting companies and trickles down to the webmasters.

Maybe the webmasters will fight back first and take control before it's too late.

Here's hoping.

Monday, June 12, 2006

RED ALERT - ECOMMERCE SITES PROTECT YOURSELF PRONTO!

There's a new company called Pronto that has a product in beta that not only crawls your site but also, via a browser plug-in, displays message toasts while visitors are looking at your products.

For instance, someone is looking at the widget your online store is selling and suddenly a window pops up telling your visitors that they can get this widget cheaper elsewhere and tries to direct your shoppers away from your store.

Basically, this takes something like a Shopping.com-type service one step further by incorporating it into the browser and the potential harm to all the smaller online stores is enormous.

Anyone running any kind of ecommerce or affiliate site will definitely want to block this:

Here's the critical info on this crawler:

User Agent: "RedCarpet/1.3 (http://www.pronto.com/robots.html)"
Actual IPs used:
66.45.38.54
66.45.38.56
66.45.38.59
66.45.38.86
66.45.38.88
66.45.38.90
66.45.38.92
66.45.38.91
66.45.38.94
66.179.107.117
216.183.117.132
216.183.117.135
Complete blocks of allocated IPs:
RedCarpet, Inc. INFLOW-9359-113352-18374 (NET-66-45-38-80-1)
66.45.38.80 - 66.45.38.95

RedCarpet, Inc. INFLOW-9359-113352-19316 (NET-66-179-107-112-1)
66.179.107.112 - 66.179.107.127

RedCarpet, Inc. INFLOW-9359-113352-19482 (NET-216-183-117-128-1)
216.183.117.128 - 216.183.117.143
This is definitely a company no ecommerce site wants crawling unless you're sure you have the best prices so block 'em!
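
Before relying on those blocks alone, it's worth checking the observed IPs against them. Here's a quick sketch with Python's standard ipaddress module; the helper names are made up for illustration. By this check, 66.45.38.54, 66.45.38.56 and 66.45.38.59 fall outside the three allocated ranges listed above, so you'd want to block those individually or match on the user agent as well.

```python
import ipaddress

# IPs actually seen, copied from the list above.
OBSERVED = [
    "66.45.38.54", "66.45.38.56", "66.45.38.59", "66.45.38.86",
    "66.45.38.88", "66.45.38.90", "66.45.38.92", "66.45.38.91",
    "66.45.38.94", "66.179.107.117", "216.183.117.132", "216.183.117.135",
]

# Allocated ranges, copied from the whois blocks above.
ALLOCATED = [
    ("66.45.38.80", "66.45.38.95"),
    ("66.179.107.112", "66.179.107.127"),
    ("216.183.117.128", "216.183.117.143"),
]


def to_networks(ranges):
    """Convert (first, last) address pairs into a flat list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)))
    return networks


def uncovered(ips, networks):
    """Return the IPs that are not inside any of the given networks."""
    return [ip for ip in ips
            if not any(ipaddress.ip_address(ip) in net for net in networks)]


if __name__ == "__main__":
    nets = to_networks(ALLOCATED)
    print([str(net) for net in nets])
    print(uncovered(OBSERVED, nets))
```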

Bad Karma is a potential DDoS threat

Some guy has something he's put out as freeware called the Referrer Karma which gets the referring page and checks to see if it actually has a link to the site referred. If no link to your site exists on the referring page it slams the door on the visitor assuming it's a referrer spammer.

Two problems with that approach:

  1. Links that pass thru redirect pages from directory sites will fail this test every time as the referrer is the redirect page itself, not a web page with links on it.
  2. Sites that block bots, like mine, toss out error pages when stupid user agents appear and VOILA! the visitor from my site gets bounced off by this stupid script.
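
To make the first problem concrete, here's a rough sketch of the kind of check described above, going only by my reading of how the script behaves. This is NOT Referrer Karma's actual code; the class and function names and the example.com host are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the hostnames of every <a href="..."> found in a page."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            host = urlparse(href).hostname
            if host:
                self.hosts.add(host.lower())


def referrer_links_to(referrer_html, my_host):
    """Hypothetical check: does the referring page contain a link to my_host?"""
    collector = LinkCollector()
    collector.feed(referrer_html)
    return my_host.lower() in collector.hosts


if __name__ == "__main__":
    normal_page = '<p><a href="http://www.example.com/post">a post</a></p>'
    redirect_page = ('<html><head><meta http-equiv="refresh" '
                     'content="0;url=http://www.example.com/"></head></html>')
    print(referrer_links_to(normal_page, "www.example.com"))
    print(referrer_links_to(redirect_page, "www.example.com"))
```

A redirect page like the second one sends a perfectly legitimate visitor but contains no anchor link at all, so a check along these lines rejects it.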
Here's the info:
65.98.116.226
cp5.secserverpros.com.
"Referrer Karma/2.0"
Next, let's explore my concern with potential vulnerabilities with Referrer Karma.

If you think about the implementation of Referrer Karma for a minute you'll realize it would allow one script kiddie to potentially pull off a DDoS attack. This could be accomplished by issuing thousands of requests to a bunch of sites running this Referrer Karma, with each request containing a faked referrer pointing to the target site you're attacking.

You wouldn't need to wait for the page request to complete, just send out a ton of requests to a bunch of servers and terminate the socket when the web server responds that data is ready. No need to download the resulting page as Referrer Karma has already done your dirty deed for you by hitting the other site asking for the requested page.

Ask for a few thousand pages in a few seconds from a bunch of sites using Referrer Karma and step back and watch the fun as the target server melts.

Lack of Intelligence Competence Crawler

Well here's yet another site called the Intelligence Competence Center trying to crawl the web looking for things they can sell to various industries.

Here's the crawler details:

212.227.103.133
s15208971.onlinehome-server.info.
"iCCrawler (http://www.iccenter.net/bot.htm)"

also...

82.165.39.218
p15197600.pureserver.info
"iCCrawler (http://www.iccenter.net/bot.htm)"

all IPs it's used with my site...

212.227.103.133
212.227.93.221
82.165.39.218
I'm really getting sick and tired of these fucking corporate leeches that keep crawling [pun intended] out of the woodwork.

Sunday, June 11, 2006

HK Creepy Crawler

Been seeing this same bot "Java/1.4.1_04" asking for the same pages from the following IPs multiple times:

210.177.215.25
210.177.215.28
210.177.215.29

Don't know if it's a hosting account or ISP, but it's worth keeping an eye on these IPs.