Comments on IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Author: IncrediBILL (http://www.blogger.com/profile/14244934627308399202)
Feed updated: 2023-10-18T05:54:12.748-07:00

2007-06-25T13:09:00.000-07:00
The universities under contract to the U.S. Government - GigaPops and university computer science centers have stopped spying for the government using nutch after it was discovered by people who had inside knowledge of operations at various university computer science centers.

To give you an example, Stanford University Computer Science Center in California switched from nutch to
Posted by Anonymous

2007-04-08T00:08:00.000-07:00
Interesting Jeff, I'd sure like to see proof that it's the gov ripping sites and not just edu school search projects.

Besides, nobody can cache my main site as it's protected by real-time anti-rip technology unless you have 5K+ non-consecutive IPs at your disposal.

You may snag a few random pages here or there, but the odds of ripping the entire site are slim without getting
Posted by IncrediBILL

2007-04-07T18:21:00.000-07:00
Correction in my post:

66-162-5-43.static.twtelecom.net

This is the Websense hit.
I got another in there by mistake.
Posted by Anonymous

2007-04-07T18:19:00.000-07:00
This one is Websense:

43.5.162.66.in-addr.arpa name = 66-162-5-43.static.twtelecom.net

In fact any of the curious things you get from .twtelecom.net is Websense - that crazy little company that loves to rip off your bandwidth to resell your copyrighted property as a "web filtering" company.

There were a lot of .edu hits from nutch. They are U.S. Government contractors.
Posted by Anonymous

2007-02-08T20:08:00.000-08:00
Now Apple.com is into nutch too!

Multiple hits from a17-201-22-87.apple.com "GET /robots.txt HTTP/1.0" 403 - "-" "nutchCVS/Nutch-0.8.1 (nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"

So, are they looking for iTunes, iPods, fake MACs on my site? Dumb! Dumb! Dumb!

I'm also getting pounded by a company called Websense using Time Warner IPs
Posted by Anonymous

2007-02-06T16:44:00.000-08:00
I say any Nutch is too much Nutch. It's a simple matter of "Do I gain anything from letting this bot crawl my site?" The answer is no. I'm not down with hacked-together scripts and the "search engines" that this claims to power. I haven't seen any real value or traffic driven from it, so it's banned.
Posted by Anonymous

2006-10-26T17:00:00.000-07:00
Now Amazon.com is trying to scrape sites for some unknown reason.
You'll notice it went 403. Poor Jeff Bezos the Bozo!

"GET / HTTP/1.0" 403 - "-" "NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)"

Then there are the hundreds of attempted hits each week from various computer science departments at places
Posted by Anonymous

2006-09-02T12:37:00.000-07:00
I agree with Bill. I went through the logs of one of my sites and I did some serious thinking about the abuse of bandwidth by Nutch and other bots, including locations.

My findings were that most abuse and spybots came from (1) Asia, (2) Europe and (3) Latin America. In the USA a lot of spybots hang around ^38. Then we have "content filtering" companies that think it's OK to consume
Posted by Anonymous

2006-06-17T23:10:00.000-07:00
Actually, most just want to scrape enough cash to eke out a living or build a nest egg off someone else's work.
Posted by IncrediBILL

2006-06-17T03:09:00.000-07:00
Let's talk about responsibility.

If you build and sell weapons like guns, land mines or a-bombs, you are co-responsible for any damage caused by use of these weapons. It's too easy to say that only the user is the problem.

I think programmers are responsible for their code, too.
Any publication of potentially dangerous programs like Nutch is irresponsible.

Very
Posted by Anonymous

2006-06-14T12:32:00.000-07:00
Wheel,

Sadly, you're one of the few good ones out there that even bother changing your user agent string.

If I get off my lazy butt I'll let mozdex in the door some day soon.

Besides, I didn't say there weren't ANY good uses for nutch, nor did I say I would block them all, but 100+ unique instances of nutch wanting access? That's never going to happen, especially when
Posted by IncrediBILL

2006-06-14T12:14:00.000-07:00
Your post somehow intimates that the visitor calling you a grumpy webmaster wasn't correct :).

As you know, Bill, I've got two sites that use nutch: mozdex and acrosscan. Both crawl politely and announce their presence with both a correct user agent and an email address that I receive and read (of course those addresses are scraped by people and I get spammed, but am I blaming folks like
Posted by Anonymous

2006-06-14T11:51:00.000-07:00
Too bad Anonymous can't read, as he would know from my blog that not a single nutch got a single page besides server errors bouncing them off the site.

I don't have to look at log files and I don't have to tinker with robots.txt, it's all automated.

I'm sure you thought you were being clever while embarrassing yourself.

Better luck next time.

;)
Posted by IncrediBILL

2006-06-14T10:26:00.000-07:00
do:

ssh user@webserver
vi ... /htdocs/robots.txt
User-agent: *
Disallow: /

your problem is solved.

Too sad that you reverse-ip-lookup, logfile-voyeur script kiddies know so little about internet technology.
:-)
Posted by Anonymous

2006-06-14T03:36:00.000-07:00
That was an entertaining post from March which was before I found your blog so I wasn't aware of it.

Jeez, what a bunch of idiots. They sound like spoiled pubescent script kiddies.

I'm not technical enough to write software to block these assholes so I have to do it the old-fashioned way: log file analysis, analytics, session management and monitoring via ssh/putty.
Posted by Anonymous
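
Two approaches come up in the comments above: the blanket robots.txt rule (User-agent: * / Disallow: /) and blocking by user agent with a 403. The following is only a rough sketch of both, not anyone's actual setup. The function names is_allowed and looks_like_nutch and the example.com URL are made up for illustration; the user agent strings are the ones quoted in the comments. Note that robots.txt only affects crawlers that choose to honor it, which is why the second check looks at the request itself.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted in the comments: every compliant crawler is
# asked to stay away from every path.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch url.

    Hypothetical helper: it parses the given robots.txt text locally
    instead of downloading it from a server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def looks_like_nutch(user_agent: str) -> bool:
    """Hypothetical server-side check: case-insensitive match on 'nutch'.

    This is an assumption about how a 403 rule could be keyed; the blog
    author's automated blocking is not described in the comments.
    """
    return "nutch" in user_agent.lower()


if __name__ == "__main__":
    page = "http://example.com/any/page"  # placeholder URL
    print(is_allowed(BLOCK_ALL, "Nutch", page))
    print(looks_like_nutch("NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.)"))
```

A substring match like this only catches crawlers that leave "nutch" in their user agent string, so it says nothing about bots that change or fake it.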