Tuesday, June 13, 2006

How Much Nutch is TOO MUCH Nutch?

Not too long ago I set off a storm of comments when I called the writer of nutch on the carpet, claiming his creation was being used excessively and abusing my server all over the place.

I was told by the legions of nutchies out there that I sucked, was told to get off the public network, and was called everything from an idiot to a grumpy webmaster and worse. They all claimed that nutch was a wonderful thing that made beneficial search engines, and that I should stop complaining, shut up and let them crawl.

Bullshit.

Being a patient man, I sat back and waited to collect enough data to show those nutchies that the usage of nutch is growing out of control and I really don't need 100+ unique IP addresses from everywhere from Turkey to Japan crawling my goddamn website.

Theoretically, if these 100 crawlers each ask for my max of 40K+ pages, that's over 4 million pages served, assuming I let them have them in the first place, mostly for no purpose whatsoever.

Here's the list of the nutch plague seen on my site recently:

124.32.246.36 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

124.32.246.45 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

128.208.6.200 NutchCVS/0.7.1 (Nutch running at UW; http://crawlers.cs.washington.edu/; sycrawl@cs.washington.edu)

128.208.6.226 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.227 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.77 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

129.242.19.138 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

129.34.20.19 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

129.78.64.106 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.16.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.16.220 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.211.84.21 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

136.165.45.122 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

137.43.154.203 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

147.202.90.2 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.24 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.245 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.26 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.27 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.68 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.85 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

166.214.93.76 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.117 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.118 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.119 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.120 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.121 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.122 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.252.148.51 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.113.130.205 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.131.194.84 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.147.0.44 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.244.218.1 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

209.131.61.1 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.174.3.130 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.196.73.193 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

210.245.31.15 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.245.31.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.12.114.238 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.127.226.60 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.137.33.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.156.230.210 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.58.116.72 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.132.175.101 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.186.36.107 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.251.133.12 Misterbot-Nutch/0.7.1 (Misterbot-Nutch; http://www.misterbot.fr; nutch at misterbot.fr)

216.93.185.12 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

220.218.159.50 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.114.253.210 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.116.237.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.221.237.35 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.222.153.250 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.224.226.18 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.186.61.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.187.12.236 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.87.139.90 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.115 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

60.248.9.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

61.135.151.175 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

62.129.132.47 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.168.188.151 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.40.36.87 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

63.133.162.98 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.105.36.210 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.151.112.44 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.241.242.18 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.10 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.60 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.34.172.78 BurstFind Crawler 1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; crawler@burstfind.com)

64.34.180.167 Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.38.10.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.103 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.107 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.108 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.125 Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://www.krugle.com/crawler/info.html; webcrawler@krugle.com)

65.220.67.9 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

65.9.20.49 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

65.91.114.3 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.108.32.4 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.15.68.234 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.162.5.43 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.207.120.226 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.243.31.34 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.111.28.139 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.52.101.242 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.205.124.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.205.127.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.248.26.83 Comrite/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.55.233.28 Argus/1.1 (Nutch; http://www.simpy.com/bot.html; feedback at simpy dot com)

70.197.81.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.30.97.106 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.56.66.216 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.96.99.254 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.241.153.125 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.35.163.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.0.207.162 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.2.25.67 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

72.5.173.12 sdcresearchlabs-testbot/0.8-dev (www.shopping.com/bot.html; http://lucene.apache.org/nutch/bot.html; researchbot@shopping.com)

72.51.37.148 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.203.142.109 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

83.246.79.28 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

84.191.111.92 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
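
For anyone who wants to sort a list like this automatically, here is a minimal Python sketch. It assumes each entry is an IP address followed by a user agent shaped like "name/version (description; url; contact)", as in the list above. The names parse_line, uses_default_contact and DEFAULT_CONTACTS are made up for illustration and are not part of nutch.

```python
import re

# Contact addresses that appear in the unmodified agent strings listed above.
DEFAULT_CONTACTS = {
    "nutch-agent@lucene.apache.org",
    "nutch-agent@lists.sourceforge.net",
}

# "<ip> <agent> (<description>; <url>; <contact>)"
LINE_RE = re.compile(
    r"^(\d{1,3}(?:\.\d{1,3}){3})\s+(.+?)\s+"
    r"\(([^;()]*);\s*([^;]*);\s*([^)]*)\)$"
)


def parse_line(line):
    """Split one "IP user-agent" entry into its parts, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ip, agent, description, url, contact = match.groups()
    return {
        "ip": ip,
        "agent": agent,
        "description": description.strip(),
        "url": url.strip(),
        "contact": contact.strip(),
    }


def uses_default_contact(entry):
    """True when the crawler operator never changed the stock contact address."""
    return entry["contact"] in DEFAULT_CONTACTS
```

Run over the list above, a check like this separates crawlers such as Misterbot-Nutch, which set their own contact, from the installs that still carry the stock nutch-agent address.
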

So we cranked all these IPs thru the reverse DNS grinder just to see who was running all this nutch without changing the default crawler strings. Some failed reverse DNS, but I'm too lazy today to bother WHOIS'ing that list, so anyone who feels compelled to dig deeper, feel free to post the results as a comment.
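
The reverse DNS grinder can be approximated in a few lines of Python. This is a sketch, not the tool actually used for this post: ptr_name and reverse_dns are hypothetical helper names, and the lookup relies on whatever resolver the machine is configured with, so results may differ from the listing here.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the in-addr.arpa name that a reverse lookup for ip queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_dns(ip):
    """Return the hostname registered for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # Covers failed lookups such as the NXDOMAIN and SERVFAIL cases.
        return None
```

Calling ptr_name("124.32.246.36") gives the same 36.246.32.124.in-addr.arpa name that heads the first entry of the listing.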

The reverse DNS of the IPs:
36.246.32.124.in-addr.arpa name = 124x32x246x36.ap124.ftth.ucom.ne.jp.
45.246.32.124.in-addr.arpa name = 124x32x246x45.ap124.ftth.ucom.ne.jp.
200.6.208.128.in-addr.arpa name = qbert.cs.washington.edu.
226.6.208.128.in-addr.arpa name = zork.cs.washington.edu.
227.6.208.128.in-addr.arpa name = nethack.cs.washington.edu.
77.6.208.128.in-addr.arpa name = pacman.cs.washington.edu.
138.19.242.129.in-addr.arpa name = vortex05.cs.uit.no.
19.20.34.129.in-addr.arpa name = yktgi01e0-s5.watson.ibm.com.
106.64.78.129.in-addr.arpa name = www-cacheC.usyd.edu.au.
140.16.112.131.in-addr.arpa name = ginga.ks.cs.titech.ac.jp.
220.16.112.131.in-addr.arpa name = endeavor.furui.cs.titech.ac.jp.
21.84.211.131.in-addr.arpa name = burum.labs.cs.uu.nl.
122.45.165.136.in-addr.arpa name = webmining.spd.louisville.edu.
203.154.43.137.in-addr.arpa name = dhcp-892b9acb.ucd.ie.
** server can't find 2.90.202.147.in-addr.arpa: SERVFAIL
24.195.67.164.in-addr.arpa name = cairo.ee.ucla.edu.
245.195.67.164.in-addr.arpa name = tacoma.ee.ucla.edu.
26.195.67.164.in-addr.arpa name = archer.ee.ucla.edu.
27.195.67.164.in-addr.arpa name = gutman.ee.ucla.edu.
68.195.67.164.in-addr.arpa name = treviso.ee.ucla.edu.
85.195.67.164.in-addr.arpa name = chandra.ee.ucla.edu.
** server can't find 76.93.214.166.in-addr.arpa: SERVFAIL
** server can't find 117.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 118.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 119.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 120.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 121.240.203.193.in-addr.arpa: NXDOMAIN
** server can't find 122.240.203.193.in-addr.arpa: NXDOMAIN
51.148.252.193.in-addr.arpa name = io.x-echo.com.
** server can't find 205.130.113.203.in-addr.arpa: NXDOMAIN
84.194.131.203.in-addr.arpa name = ldnatvip.livedoor.com.
44.0.147.203.in-addr.arpa name = proxy.ji-net.com.
1.218.244.203.in-addr.arpa name = u1.gpu42.samsung.co.kr.
1.61.131.209.in-addr.arpa name = nat1.burbank.corp.yahoo.com.
130.3.174.210.in-addr.arpa name = pae0382.tokyte00.ap.so-net.ne.jp.
193.73.196.210.in-addr.arpa canonical name = 193.192h.73.196.210.in-addr.arpa.
193.192h.73.196.210.in-addr.arpa name = aa2003080550002.userreverse.dion.ne.jp.
** server can't find 15.31.245.210.in-addr.arpa: NXDOMAIN
** server can't find 18.31.245.210.in-addr.arpa: NXDOMAIN
*** Can't find 238.114.12.212.in-addr.arpa.: No answer
60.226.127.212.in-addr.arpa name = 212-127-226-60.cable.quicknet.nl.
** server can't find 140.33.137.212.in-addr.arpa: NXDOMAIN
210.230.156.212.in-addr.arpa name = dsl.static212156230210.ttnet.net.tr.
** server can't find 72.116.58.212.in-addr.arpa: NXDOMAIN
101.175.132.213.in-addr.arpa name = webserver4.octrium.nl.
107.36.186.213.in-addr.arpa name = ns3490.ovh.net.
12.133.251.213.in-addr.arpa name = ns31793.ovh.net.
12.185.93.216.in-addr.arpa name = customer-reverse-entry.216.93.185.12.
50.159.218.220.in-addr.arpa name = 220x218x159x50.ap220.ftth.ucom.ne.jp.
210.253.114.221.in-addr.arpa name = 221x114x253x210.ap221.ftth.ucom.ne.jp.
114.237.116.221.in-addr.arpa name = 221x116x237x114.ap221.ftth.ucom.ne.jp.
** server can't find 35.237.221.221.in-addr.arpa: NXDOMAIN
250.153.222.24.in-addr.arpa name = blk-222-153-250.eastlink.ca.
18.226.224.24.in-addr.arpa name = blk-224-226-18.eastlink.ca.
164.61.186.58.in-addr.arpa name = 58-186-61-xxx-dynamic.hcm.fpt.vn.
236.12.187.58.in-addr.arpa name = adsl-dynamic-pool-xxx.fpt.vn.
90.139.87.58.in-addr.arpa name = p578b5a.tokyte00.ap.so-net.ne.jp.
115.240.160.59.in-addr.arpa name = 59.160.240.115.static.vsnl.net.in.
114.9.248.60.in-addr.arpa name = 60-248-9-114.HINET-IP.hinet.net.
** server can't find 175.151.135.61.in-addr.arpa: NXDOMAIN
47.132.129.62.in-addr.arpa name = HOSTED-BY.PBTECH.COM.
151.188.168.62.in-addr.arpa name = host-62-168-188-151.adsl.caucasus.net.
87.36.40.62.in-addr.arpa name = pythagoras.portal.o2.ie.
98.162.133.63.in-addr.arpa name = mail.visvo.com.
210.36.105.64.in-addr.arpa name = h-64-105-36-210.snvacaid.covad.net.
44.112.151.64.in-addr.arpa name = customer-reverse-entry.64.151.112.44.
18.242.241.64.in-addr.arpa name = sv-fw.looksmart.com.
10.88.242.64.in-addr.arpa name = sv-crawl.looksmart.com.
60.88.242.64.in-addr.arpa name = sv-crawlfw4.looksmart.com.
78.172.34.64.in-addr.arpa name = slimy.vhosting.com.
167.180.34.64.in-addr.arpa name = server7.springright.com.
26.10.38.64.in-addr.arpa name = server1.netsweeper.com.
** server can't find 103.164.71.64.in-addr.arpa: NXDOMAIN
** server can't find 107.164.71.64.in-addr.arpa: NXDOMAIN
** server can't find 108.164.71.64.in-addr.arpa: NXDOMAIN
** server can't find 125.164.71.64.in-addr.arpa: NXDOMAIN
9.67.220.65.in-addr.arpa name = mirror.setnine.com.
49.20.9.65.in-addr.arpa name = adsl-9-20-49.mia.bellsouth.net.
** server can't find 3.114.91.65.in-addr.arpa: NXDOMAIN
4.32.108.66.in-addr.arpa name = cpe-66-108-32-4.nyc.res.rr.com.
234.68.15.66.in-addr.arpa name = bdsl.66.15.68.234.gte.net.
43.5.162.66.in-addr.arpa name = 66-162-5-43.static.twtelecom.net.
226.120.207.66.in-addr.arpa name = firewall.net-sweeper.com.
** server can't find 34.31.243.66.in-addr.arpa: SERVFAIL
** server can't find 139.28.111.67.in-addr.arpa: SERVFAIL
242.101.52.67.in-addr.arpa name = rrcs-67-52-101-242.west.biz.rr.com.
164.124.205.68.in-addr.arpa name = 164.124.205.68.cfl.res.rr.com.
94.127.205.68.in-addr.arpa name = 94.127.205.68.cfl.res.rr.com.
83.26.248.69.in-addr.arpa name = c-69-248-26-83.hsd1.nj.comcast.net.
28.233.55.69.in-addr.arpa name = HVAR.SIMPY.com.
79.81.197.70.in-addr.arpa name = 79.sub-70-197-81.myvzw.com.
106.97.30.70.in-addr.arpa name = CPE00095b51f4c9-CM0013718d007c.cpe.net.cable.rogers.com.
216.66.56.70.in-addr.arpa name = 70-56-66-216.tukw.qwest.net.
** server can't find 254.99.96.70.in-addr.arpa: NXDOMAIN
125.153.241.71.in-addr.arpa name = pool-71-241-153-125.nycmny.fios.verizon.net.
79.163.35.71.in-addr.arpa name = 71-35-163-79.tukw.qwest.net.
162.207.0.72.in-addr.arpa canonical name = 162.160/27.207.0.72.in-addr.arpa.
162.160/27.207.0.72.in-addr.arpa name = link.enhancededge.com.
67.25.2.72.in-addr.arpa name = h72-2-25-67.bigpipeinc.com.
** server can't find 12.173.5.72.in-addr.arpa: NXDOMAIN
148.37.51.72.in-addr.arpa name = server1.properazzi.com.
109.142.203.81.in-addr.arpa name = 81-203-142-109.user.ono.com.
*** Can't find 28.79.246.83.in-addr.arpa.: No answer
92.111.191.84.in-addr.arpa name = p54BF6F5C.dip.t-dialin.net.

Looks like just about everybody's running it, from colleges to corporations and even Uncle Bob crawling the web from a dial-in (talk about slow), but where is the benefit for those of us being abused with it?

I'll admit that a few of the nutches actually resulted in search engines showing up online, but who uses these search engines? Best I can tell, none of the actual 400K visitors/month to my site that's being crawled use any of these so-called search engines, and they probably never will.

Here's the problem, and maybe I'm just using nutch as an example because it's so easy to spot this virulent trend with a single source: the number of things attempting [they didn't succeed] to crawl my site daily would easily become a significant portion of my daily traffic if I let them all in, which is insane.

What happens when this trend reaches its natural conclusion?

Where it's heading is that the crawlers will soon exceed the actual visitors in terms of daily pages downloaded as more and more search engines, aggregators, and spybots come online looking for more ways to sell a slice of the internet to an ever-increasing bunch of specialized niche markets. Not to mention we're still dealing with all the scrapers, link checkers and downright dumb things like referrer checkers abusing our bandwidth.

It's out of control and someone needs to put the brakes on this nonsense.

Someday soon crawlers, with the exception of the big search engines, will need to ask permission to get on just about any website of scale, and will need to make a compelling argument why they should be allowed to index the site. The day of just taking what you want and doing what you want with it will surely come screeching to a halt as the burden of all this bandwidth usage starts to hit the hosting companies and trickles down to the webmasters.

Maybe the webmasters will fight back first and take control before it's too late.

Here's hoping.

15 comments:

Anonymous said...

That was an entertaining post from March which was before I found your blog so I wasn't aware of it.

Jeeze what a bunch of idiots. They sound like spoiled pubescent script kiddies.

I’m not technical enough to write software to block these assholes so I have to do it the old-fashioned way – log file analysis, analytics, session management and monitoring via ssh/putty.

My .htaccess file is huge, but when I remove the blocks (which I did briefly over the weekend) the load on the server (dual AMD 248s, 4 GB RAM) jumps up to 80+ and pages time out.

I don’t even try to run the bastards down anymore because of the amount of time it consumes. I’ve got all the usual UA filters in place. If they are masking their UA and it’s a US ISP I block the C range. If it’s an overseas ISP or any hosting company I block their entire CIDR range.
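
The range blocking described here can be sketched in Python with the standard ipaddress module. It treats a "C range" as the /24 containing the offending address, which is an assumption about what is meant; c_range and is_blocked are illustrative names, and the sample addresses are taken from the list in the post.

```python
import ipaddress


def c_range(ip):
    """Return the /24 network (the "C range") that contains ip."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def is_blocked(ip, blocked_networks):
    """True if ip falls inside any of the blocked networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in blocked_networks)


# Block the whole /24 around one offending crawler address.
blocked = [c_range("64.71.164.103")]
```

With that single entry, is_blocked("64.71.164.125", blocked) is true while an address outside the range, such as 64.34.172.78, still gets through. A wider CIDR block works the same way once the real allocation is known.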

It really pisses me off that I have no recourse other than blocking them. They cost me money, time, screw up the user experience and endanger my site with potential dup-content penalties.

I’m fed up. Let me know when your scraper blocker is ready for beta. . .

-JayW

Anonymous said...

do:

ssh user@webserver
vi ... /htdocs/robots.txt
User-agent: *
Disallow: /

your problem is solved.
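
What those two robots.txt lines do can be checked with Python's standard urllib.robotparser. This is a small sketch in which is_allowed is a made-up helper and example.com is a placeholder site; keep in mind that robots.txt only works on crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Apply the given robots.txt lines and report whether url may be fetched."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return rules.can_fetch(user_agent, url)


# The rule set from the comment: every agent, whole site off limits.
allowed = is_allowed(
    ["User-agent: *", "Disallow: /"],
    "NutchCVS/0.7.1",
    "http://example.com/any/page.html",
)
```

Here allowed comes out false for the nutch agent, and for any other agent, because the wildcard rule covers every path.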

Too bad that you reverse-IP-lookup, logfile-voyeur script kiddies know so little about internet technology.
:-)

IncrediBILL said...

Too bad Anonymous can't read, or he would know from my blog that not a single nutch got a single page, just server errors bouncing them off the site.

I don't have to look at log files and I don't have to tinker with robots.txt, it's all automated.

I'm sure you thought you were being clever while embarrassing yourself.

Better luck next time.

;)

Anonymous said...

Your post somehow intimates that the visitor calling you a grumpy webmaster wasn't correct :).

As you know, Bill, I've got two sites that use nutch: mozdex and acrosscan. Both crawl politely and announce their presence with both a correct useragent and an email address that I receive and read (of course those addresses are scraped by people and I get spammed, but am I blaming folks like you, Bill? :)).

Further, while I haven't measured recently, there were over 1 million searches being done on mozdex last time it was measured - and I believe that will grow over time.

I'm ultimately refuting two points: first, other search engines CAN potentially bring in traffic - block them now without the chance for them to seed and grow and it'll be bad for everyone in the future. Sure mozdex isn't a google buster, but it does have potential to bring in traffic even if that's not quite realized yet.

Secondly, I trust it's clear from my actions, and others', that nutch isn't the problem; the user is. As you noted, nutch is like a gun. It's not the gun's fault. I've got my 'gun' stored properly and handle it with respect. Others shoot off all over the place. Not the gun's fault.

Nutch is a great opensource tool - IMO it's the only viable opensource project that allows for a commercial quality SE and that's a noble thing.

IncrediBILL said...

Wheel,

Sadly, you're one of the few good ones out there that even bother changing your user agent string.

If I get off my lazy butt I'll let mozdex in the door some day soon.

Besides, I didn't say there weren't ANY good uses for nutch, nor did I say I would block them all, but 100+ unique instances of nutch wanting access? That's never going to happen, especially when they don't identify themselves with anything except the default user agent.

Guess I'm concerned what will happen when there's 1,000 nutches wanting in the door or 3,000 as more people figure out how to use it, or even more.

Most people don't even have a robots.txt file on their web site, and they'll be all excited about all this traffic, then wonder why they have so many visitors a day and nobody clicks on their ads. That will be amusing.

But you're right, it's the people that are the problem, not Nutch.

Yes, I know it's too easy to attack the technology, just like saying if we didn't have guns there would be no shootings. That's correct, there would be stabbings instead, and take away the knives and we're back to good old-fashioned strangulation, but I digress...

Anonymous said...

Let's talk about responsibility.

If you build and sell weapons like guns, land mines or a-bombs, you are co-responsible for any damage caused by use of these weapons. It's too easy to say that only the user is the problem.

I think programmers are responsible for their code, too. Any publication of potentially dangerous programs like Nutch is irresponsible.

A great many people want to start their own search engine to get rich and famous like the g-guys. There are many more than 3,000 potential Nutch users in the world.

IncrediBILL said...

Actually, most just want to scrape enough cash to eke out a living or build a nest egg off someone else's work.

Anonymous said...

I agree with Bill. I went through the logs of one of my sites and I did some serious thinking about the abuse of bandwidth by Nutch and other bots, including locations.

My findings were that most abuse and spybots came from (1) Asia (2) Europe and (3) Latin America. In the USA a lot of spybots hang around ^38. Then we have "content filtering" companies that think it's OK to consume bandwidth that I PAY FOR, then resell their product to other parties - it violates my copyright notice too.

Although you may consider it "radical," we had no use for anyone to even view the site from these locations and they were all banned by IP. Junk bots and bandwidth use took a dive. Now when I see hits, they are, except on rare occasions, from real people looking at the site. (I really think most webmasters would be astounded at how few hits are real people looking at your site!)

Another thing to do, and it is rather amusing, is to write a PHP script at the top of your pages so the offending bot or IP will get a "white page" - it won't even have source. I can't post it here, but do a search for: PHP $botlist array exit. This should lead you to the correct PHP to give Nutch a blank page! LOL!
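
The same "white page" idea can be sketched in Python as WSGI middleware. This is not the PHP script referred to here, just an illustration of the technique under assumptions: BOTLIST, is_listed_bot and blank_page_for_bots are hypothetical names, and matching on a lowercase substring of the user agent is only one possible rule.

```python
# Hypothetical list of user-agent substrings to turn away.
BOTLIST = ["nutch"]


def is_listed_bot(user_agent, botlist=BOTLIST):
    """True if any listed token appears in the user agent, ignoring case."""
    lowered = user_agent.lower()
    return any(token in lowered for token in botlist)


def blank_page_for_bots(app, botlist=BOTLIST):
    """Wrap a WSGI app so listed bots get an empty page and no source."""
    def wrapped(environ, start_response):
        if is_listed_bot(environ.get("HTTP_USER_AGENT", ""), botlist):
            headers = [("Content-Type", "text/html"), ("Content-Length", "0")]
            start_response("200 OK", headers)
            return [b""]
        # Everyone else is passed through to the real site untouched.
        return app(environ, start_response)
    return wrapped
```

Wrapping the site's WSGI application with blank_page_for_bots(app) is all the wiring needed. Anything that masks its user agent slips past a check like this, so it complements IP blocking rather than replacing it.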

Anonymous said...

Now Amazon.com is trying to scrape sites for some unknown reason. You'll notice it went 403. Poor Jeff Bezos the Bozo!

GET / HTTP/1.0" 403 - "-" "NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)"


Then there are the hundreds of attempted hits each week from various computer science departments at places like UCLA, the University of Indiana and the University of Washington (all US Defense contractors). I guess the U.S. Government is now getting universities to do their dirty work with nutch as a tool.

403 TO YOU NUTCH BABY!

Anonymous said...

I say any Nutch is too much Nutch. It's a simple matter of "Do I gain anything from letting this bot crawl my site?" The answer is no. I'm not down with hacked together scripts and "Search engines" that this claims to power. I haven't seen any real value or traffic driven from it so it's banned.

Anonymous said...

Now Apple.com is into nutch too!

Multiple hits from a17-201-22-87.apple.com GET /robots.txt HTTP/1.0" 403 - "-" "nutchCVS/Nutch-0.8.1 (nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"

So, are they looking for iTunes, iPods, or fake Macs on my site? Dumb! Dumb! Dumb!

I'm also getting pounded by a company called Websense using Time Warner IPs - all go 403.

OH! They are US Government Contractors too... Give "W" my regards! LOL! US Government Registrations for Websense:

http://www.bpn.gov/bincs/begin_search.asp



Company Name: WEBSENSE INC
CAGE Code: 1TGU0
UPC:
DUNS Number:
JCP Cert. Number:
ZIP Code: 92121
State: CA
Phone: 8007231166


CAGE Information


Company Name: WEBSENSE INC
CAGE Code: 1TGU0
Status: A - Active Record
Parent CAGE:
Address: 10240 SORRENTO VALLEY RD
P.O. Box:
City: SAN DIEGO
ZIP: 92121
CAO-ADP: S0514A - HQ0339
State: CA
County: SAN DIEGO
Voice Phone Number: 858-320-8000
Fax Phone Number: 858-458-2950
Date CAGE Code Established: 4/4/2001
Last Updated: 1/18/2005

AND/////////////


Company Name: WEBSENSE INC.
CAGE Code: 09AQ6
UPC:
DUNS Number: 878553643
JCP Cert. Number:
ZIP Code: 92121
State: CA
Phone: 8583208000


CAGE Information


Company Name: WEBSENSE INC.
CAGE Code: 09AQ6
Status: A - Active Record
Parent CAGE:
Address: 10240 SORRENTO VALLEY RD
P.O. Box:
City: SAN DIEGO
ZIP: 92121 - 1605
CAO-ADP: S0514A - HQ0339
State: CA
County: SAN DIEGO
Voice Phone Number: 858-320-8000
Fax Phone Number: 858-458-2955
Date CAGE Code Established: 4/17/1997
Last Updated: 2/17/2006


CCR Information


Company Name: WEBSENSE, INC.
Address: 10240 SORRENTO VALLEY RD
City: SAN DIEGO
ZIP Code: 92121 - 1605
State: CA
Point of Contact: SALES
Contact Phone: 8583208000

Anonymous said...

This one is Websense:

43.5.162.66.in-addr.arpa name = 66-162-5-43.static.twtelecom.net

In fact any of the curious things you get from .twtelecom.net is Websense - that crazy little company that loves to rip off your bandwidth to resell your copyrighted property as a "web filtering" company.

There were a lot of .edu hits from nutch. They are U.S. Government contractors.

A friend of mine at a GiGaPoP told me the latest government ploy is to try to hit your site using a .edu supercomputer facility and if that doesn't work they then switch to "k12" servers that are operated under the university's control. For example, if they hit your site and you block .edu in one second or less they will hit you using a .k12.whatever_state to cache your site. I had already heard what was up so I was ready for the ruse.

Anonymous said...

Correction in my post:

66-162-5-43.static.twtelecom.net

This is the Websense hit. I got another in there by mistake.

IncrediBILL said...

Interesting, Jeff. I'd sure like to see proof that it's the gov ripping sites and not just edu school search projects.

Besides, nobody can cache my main site as it's protected by real-time anti-rip technology unless you have 5K+ non-consecutive IPs at your disposal.

You may snag a few random pages here or there, but the odds of ripping the entire site are slim without getting caught.

Anonymous said...

The universities under contract to the U.S. Government (GigaPops and university computer science centers) have stopped spying for the government using nutch after it was discovered by people who had inside knowledge of operations at various university computer science centers.

To give you an example, Stanford University Computer Science Center in California switched from nutch to Twiceler. A hit looks like this:

Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)

The registration indicates the owner of cuill.com is:

Registrant:
Tom Costello
Tom Costello
1127 Thorntree CT
San Jose, CA 95120
US
Email: costello@cs.stanford.edu

Visit the URL: http://www.cuill.com

Now that looks like a valid robot! LOL! When will these U.S. Government - University GigaPop computer science center contractors ever give up?