Saturday, November 25, 2006
MSIE 7 and Firefox 2 Still Not Reasonably Secure
We heard all the security hype when MSIE 7 and Firefox 2 came out, and it turns out it was tons of hype and hoopla that was completely meaningless. They'll stop us from being phished, but those Java trojan horse and worm vulnerabilities still exist and have a revolving door into your computer if you have Java enabled.
This issue was highlighted in a recent post, McAfee SiteAdvisor Green Lights Notorious Malicious Sites, but I thought I'd post about it again in case people missed the part at the bottom of that long post pointing out that all of these vulnerabilities existed long before either version shipped, and they simply didn't fix them or give us reasonable controls to hinder the problem.
The simple solution to avoid things like the Win32/Agent.RX trojan is to disable Java altogether: not JavaScript, but Java itself. The problem is there are a lot of useful applets all over the net, especially the fun ones like the games on Pogo.com or Yahoo Games, so eventually we'll want to turn Java back on in the browser for those sites.
Now the hard question:
Just how hard would it be for the browsers to allow us to enable Java and JavaScript per site?
This was a very blatant oversight of a well-known vulnerability, yet it still exists in recently released products without any type of protection other than completely disabling Java. If a per-site Java option exists, I sure missed it as I snooped around the options before posting this. If it's there, it's buried somewhere in the basement of the options or I'm blind, as nothing hopped out about this issue other than disabling Java altogether.
Funny, they have silly options for privacy freaks to ask about cookies or remembering passwords, and all sorts of other good things, but when it comes to real security, WHAMMO! Here comes the trojan without so much as a warning.
If the browser can warn about a site installing add-ons without first asking permission, how hard can it be to simply ask first whether we want to load Java?
That's a very strong statement from at least 2 browser providers, who have made it very clear they don't give a shit whether we get hacked if we have Java enabled. The technology to stop the browser from loading Java without asking permission is so simple that an apprentice programmer could implement it.
Had my virus scanner not been up to date, I'd have been screwed, pure and simple.
Gee, thanks browser makers, thanks for these major security updates.
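To make the "ask first, per site" idea concrete, here is a rough sketch in Python. It is purely illustrative: nothing here is an actual MSIE or Firefox API, and `SitePermissions`, `ask_user` and the in-memory store are hypothetical names. The browser would ask the first time a host tries to load an applet and remember the answer for that host.

```python
from urllib.parse import urlsplit


class SitePermissions:
    """Remembers a per-host allow/deny decision for Java applets."""

    def __init__(self, ask_user):
        # ask_user is a hypothetical callback: takes a host, returns True/False.
        self._ask_user = ask_user
        self._decisions = {}

    def may_load_applet(self, page_url):
        host = (urlsplit(page_url).hostname or "").lower()
        if host not in self._decisions:
            # First applet on this host: ask once and remember the answer.
            self._decisions[host] = bool(self._ask_user(host))
        return self._decisions[host]


def trust_games_only(host):
    """Stand-in for a real dialog: say yes only to the games site."""
    return host == "pogo.com" or host.endswith(".pogo.com")


perms = SitePermissions(ask_user=trust_games_only)
print(perms.may_load_applet("http://www.pogo.com/games/"))
print(perms.may_load_applet("http://www.euc2005.com/"))
```

The point of the design is that the question is asked once per site, the same way the cookie prompts already work.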
Posted by IncrediBILL at 11/25/2006 05:24:00 PM
5 comments
More Bot Activity in Bezeqint
I had a chance to look for more possible Picscout crawling activity in another block of bezeqint.net IPs and found a rash of it. Some was definitely bot activity; other hits had a fairly small sample, and nothing was definitive except the crawl speed, which can also be explained away by pre-fetch technology.
Here's a definitive bot in that range:
88.152.15.7 Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
Just another garden variety scraper, or is Picscout sharing IPs with Bezeqint's other customers?
About 20 others set off alarms, but nothing was quite as aggressive as the one IP listed, which asked for 50+ pages at intervals of 5-10 seconds, and that was after it was being challenged, so it's a bot to be sure.
Who the bot belongs to is the question.
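For anyone wondering what "setting off alarms" on crawl speed might look like, here is a minimal sliding-window sketch. The class name and both thresholds are made up for illustration and are not the settings actually used on this site.

```python
from collections import deque


class RateAlarm:
    """Flags a client that requests too many pages in a short window."""

    def __init__(self, max_hits=20, window_seconds=300):
        # Both thresholds are illustrative guesses, not real settings.
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = {}

    def record(self, ip, timestamp):
        """Record one page request; return True once the IP trips the alarm."""
        hits = self._hits.setdefault(ip, deque())
        hits.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_hits


alarm = RateAlarm()
# 50 requests, one every 7 seconds, like the aggressive IP described above.
tripped = [alarm.record("88.152.15.7", t * 7) for t in range(50)]
print("alarm tripped:", any(tripped))
```

With these made-up numbers, a visitor reading a page a minute never trips the alarm, while a steady 5-10 second crawl does.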
I'm considering blocking this range, given the number of alarms that were set off.
inetnum: 88.152.170.0 - 88.152.255.255
netname: ADSL-CUSTOMER-CONNECTION
role: BEZEQINT NETWORKING TEAM
route: 88.152.176.0/20
Not that everyone should block the whole thing, but the one IP address referenced was definitely a problem child.
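Most firewalls and block lists want CIDR blocks rather than a start-end range, so here is a small sketch using Python's standard `ipaddress` module to convert the inetnum range above. The helper name `in_range` is just for illustration.

```python
import ipaddress

first = ipaddress.ip_address("88.152.170.0")
last = ipaddress.ip_address("88.152.255.255")

# Collapse the inetnum range into the smallest list of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))
for block in blocks:
    print(block)


def in_range(ip):
    """True if ip falls inside any block covering the inetnum range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in blocks)
```

Note that the single route listed in the whois output covers only part of the inetnum range, which is why the conversion yields more than one block.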
So much data, so little time...
Posted by IncrediBILL at 11/25/2006 04:55:00 PM
0 comments
Labels: Scrapers
McAfee SiteAdvisor Green Lights Notorious Malicious Sites
McAfee's SiteAdvisor is a great idea and I've been a big fan as it helps avoid many bad sites. However, they're obviously not catching certain things that some of the more clever malicious site owners are doing to avoid their detection. This has led to them green lighting one of the most malicious sites I've seen and this guy has a bunch of them just waiting for unsuspecting visitors.
In this instance, SiteAdvisor gave a completely false sense of security.
CAUTION: Some of the links below may try to inject a worm or trojan.
Here are the SiteAdvisor results for http://www.euc2005.com/, which claim it's perfectly safe, which is blatantly wrong:
Here's the site when you click to visit http://www.euc2005.com/:
I clicked the link "Czym jest GIS" which claims to be loading DIRECTIONS and up pops the bogus search page and my anti-virus goes off claiming that the site was attempting to install a trojan from http://tisall.info/e/us02/e.cab. Additionally, note the yellow warning bar at the top of MSIE 7 claiming the site was trying to install an add-on to the browser at the same time.
SiteAdvisor would do themselves a favor by just red-flagging anything related to Inhoster, where the trojan attempted to download from, as they appear to be a haven for spammers, scrapers and other malicious activity, and numerous bad references to their hosting can be found all over the net.
Just for giggles I checked a few more bad domains I knew, and SiteAdvisor hadn't checked any of them yet. However, this one below blew my mind because all of the URLs displayed in Yahoo were the actual CAB files themselves, and SiteAdvisor didn't even warn me that clicking on a .cab file might be a bad idea.
Come on guys, this is a no brainer, if you actually find a listing in a search engine linking directly to the virus or worm file, or a suspicious file type such as a .cab or .exe, you should at least put up the yellow CAUTION symbol at a minimum.
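The check being asked for is about as simple as it gets. A minimal sketch, assuming only the two file types named above; `needs_caution` is a hypothetical name and this is not how SiteAdvisor actually works.

```python
from urllib.parse import urlsplit

# Illustrative list only; the post names .cab and .exe.
SUSPICIOUS_EXTENSIONS = (".cab", ".exe")


def needs_caution(url):
    """True if the URL path points straight at a risky file type."""
    path = urlsplit(url).path.lower()
    return path.endswith(SUSPICIOUS_EXTENSIONS)


print(needs_caution("http://tisall.info/e/us02/e.cab"))
print(needs_caution("http://www.euc2005.com/"))
```

Looking only at the path means a query string tacked onto the end doesn't hide the file type.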
IMO the real fault here isn't that McAfee SiteAdvisor missed these files, it's that the browser allows certain files to be executed randomly without asking. For the love of god, the browsers have options to ask per site if you want a stinking COOKIE, which can do no immediate harm to your computer. Yet something as vulnerable as MSIE lets a trojan just start downloading automatically, without warning or controls, and only when it looked like something was an add-on did I even get a warning from MSIE 7.
What's most amazing is both Firefox 2 and MSIE 7 are NEW RELEASES yet still vulnerable to some particularly nasty problems that have been around for ages, and neither of them did anything to protect against this in their latest releases.
Is everyone at these browser companies asleep at the wheel?
Hopefully SiteAdvisor can figure out what they missed that allowed this rogue site to be green-listed and avoid these problems moving forward, as it's obvious they're the only ones even trying to help; the browsers just let the problem remain in all their new versions.
P.S. The company hosting these sites, theplanet.com, has been notified about the problem and we'll all be watching to see if these domains continue to function.
Posted by IncrediBILL at 11/25/2006 02:40:00 PM
1 comment
Friday, November 24, 2006
Google Image Search Used for Copyright Abuse Mashup
Today I came across a bunch of slimy mashup sites that combine images from Google Image Search (your images) with affiliate ads. The attempt is to try to make it look like a legitimate directory or search engine for the topic but what's happening is your images are being used without permission and being attributed to other sites.
CAUTION: Some of the links below may try to inject a worm or trojan.
Go to one of the sites here:
http://www.euc2005.com/photography/Digital-photography.html
See those images?
The images are directly from Google Image Search for "Digital Photography":
http://images.google.com/images?hl=en&q=Digital+photography&btnG=Search+Images
Here's the URL to the images from that page:
http://images.google.com/images?q=tbn:RaA8sUkYvwGMCM:http://www.saugus.net/Photos/images/pemigewasset_river.jpg
http://images.google.com/images?q=tbn:WHUZv_rlRPX82M:http://www.jungleboffin.com/images/artoriginals/digitalpower/6.jpg
http://images.google.com/images?q=tbn:dXhig_gAZriKQM:http://joecarr.ca/astro/images/2003/2003N707a.jpg
http://images.google.com/images?q=tbn:XoEoJdT29qBX8M:http://www.kanaphoto.com/img/digital_7.jpg
Same crap going on from this Polish site too:
http://buy-xenical-us.qo.pl/
Note that the domain EUC2005.COM is a dummy domain, it's actually pulling up searches in a frame from this site:
http://f-mf.org/search.php?q=digital+photography
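If you want to check whether one of these pages is using your images, the thumbnail URLs appear to carry the original image address after a `tbn:<id>:` prefix. A minimal sketch that pulls it out, assuming that format holds; `thumbnail_source` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def thumbnail_source(thumb_url):
    """Return the original image URL embedded in a tbn: thumbnail URL, or None."""
    query = parse_qs(urlsplit(thumb_url).query)
    value = query.get("q", [""])[0]
    # Expected shape: "tbn:<thumbnail id>:<source URL>"
    parts = value.split(":", 2)
    if len(parts) == 3 and parts[0] == "tbn" and parts[2]:
        return parts[2]
    return None


thumb = ("http://images.google.com/images?q=tbn:RaA8sUkYvwGMCM:"
         "http://www.saugus.net/Photos/images/pemigewasset_river.jpg")
print(thumbnail_source(thumb))
```

Run that over the image tags on a suspect page and compare the results against your own domain.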
Just for giggles, I did a search for PHOTO.NET in their little search window to see what came up:
Here's the same search in Google Images:
This mess all seems to be hosted on theplanet.com, big shock, on at least 4 servers that I can find, click the IP below for a list of domains:
74.52.114.114
74.52.114.115
74.52.114.116
74.52.114.117
I found some of this same crap on every server, click some links from the home pages of these domains and you'll see the same old shit like this:
http://homepage-building.info/carl-bucherer/
or this:
http://kdcconstruction.net/morel/
or this:
http://internetuniversityincome.com/pantech/
etc.
Now let's see who appears to be behind this mess:
Domain ID:D128714698-LROR
Domain Name:F-MF.ORG
Created On:11-Sep-2006 11:05:33 UTC
Last Updated On:11-Nov-2006 03:50:00 UTC
Expiration Date:11-Sep-2007 11:05:33 UTC
Sponsoring Registrar:Direct Information PVT Ltd dba PublicDomainRegistry.com (R27-LROR)
Status:OK
Registrant ID:DI_2372832
Registrant Name:Soodkhet Kamchoom
Registrant Organization:N/A
Registrant Street1:2002 E. Tamarack Road
Registrant Street2:
Registrant Street3:
Registrant City:Altus
Registrant State/Province:Oklahoma
Registrant Postal Code:73521
Registrant Country:US
Registrant Phone:+001.5806436662
Registrant Phone Ext.:
Registrant FAX:
Registrant FAX Ext.:
Registrant Email:soodkhet@zlex.org
Name Server:NS1.F-MF.ORG
Name Server:NS2.F-MF.ORG
Someone else has our copyright infringing buddy listed in an MVPS HOSTS file for some bad things as well:
# [Soodkhet Kamchoom]
127.0.0.1 alllinx.info
127.0.0.1 dinet.info #[Trojan.Win32.Small.EV]
127.0.0.1 eqash.net #[eTrust.Win32/Secdrop.JU]
127.0.0.1 frdolls.net
127.0.0.1 frlynx.info
127.0.0.1 joutweb.net
127.0.0.1 linim.net #[eTrust.Win32/Secdrop.JU]
127.0.0.1 linxlive.net
127.0.0.1 lipdolls.net
127.0.0.1 nwframe.net #[Win32/Nitwiz.A]
127.0.0.1 zllin.info #[MHTMLRedir.Exploit][Win32/Dialer.KM]
Now let's see where the base of search operations F-MF.ORG resides:
host F-MF.ORG
F-MF.ORG has address 66.230.138.195
whois 66.230.138.195
OrgName: ISPrime, Inc.
OrgID: IPRM
Address: 25 Broadway
Address: 6th Floor, Suite #2
City: New York
StateProv: NY
PostalCode: 10004-1086
Country: US
NetRange: 66.230.128.0 - 66.230.191.255
CIDR: 66.230.128.0/18
I looked at the adjacent server IP 66.230.138.194 and BINGO! there's some of the domains listed (in bold) in the MVPS HOSTS files, amazing isn't it?
alllinx.info
cleanchain.net
drefus.org
eqash.net
frlynx.info
frsets.info
joutweb.net
linim.net
linxlive.net
nwframe.net
recdir.org
There's obviously more, but I'm bored chasing this idiot at this time, maybe later.
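The overlap between the MVPS HOSTS entries and the domains found on the adjacent IP can be computed rather than eyeballed. A minimal sketch using only the entries quoted in this post; `hosts_domains` is a hypothetical helper name.

```python
def hosts_domains(hosts_text):
    """Return the set of domains listed in HOSTS-file style text."""
    domains = set()
    for line in hosts_text.splitlines():
        entry = line.split("#", 1)[0].split()  # drop comments, split fields
        if len(entry) >= 2:
            domains.add(entry[1].lower())
    return domains


mvps_entries = """\
127.0.0.1 alllinx.info
127.0.0.1 dinet.info #[Trojan.Win32.Small.EV]
127.0.0.1 eqash.net #[eTrust.Win32/Secdrop.JU]
127.0.0.1 frdolls.net
127.0.0.1 frlynx.info
127.0.0.1 joutweb.net
127.0.0.1 linim.net #[eTrust.Win32/Secdrop.JU]
127.0.0.1 linxlive.net
127.0.0.1 lipdolls.net
127.0.0.1 nwframe.net #[Win32/Nitwiz.A]
127.0.0.1 zllin.info #[MHTMLRedir.Exploit][Win32/Dialer.KM]
"""

# Domains listed above as living on the adjacent server IP.
on_adjacent_ip = {
    "alllinx.info", "cleanchain.net", "drefus.org", "eqash.net",
    "frlynx.info", "frsets.info", "joutweb.net", "linim.net",
    "linxlive.net", "nwframe.net", "recdir.org",
}

overlap = sorted(hosts_domains(mvps_entries) & on_adjacent_ip)
for domain in overlap:
    print(domain)
```

The same function works on a full downloaded HOSTS file, so any list of domains found on a suspect server can be checked the same way.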
I've been advocating everyone block access from known datacenters and proxy servers for quite some time to stop scraping and other abuse so had the Googlers listened, and I know they heard me, this abuse wouldn't be happening right now and webmasters wouldn't have to deal with this level of abuse.
Sorry to say, I'm going to have to add this line to all my robots.txt for Google, Yahoo and MSN until they resolve this vulnerability:
Disallow: /images/
Why won't they listen when I explain what the vulnerabilities are?
Why must we the webmasters have to deal with this garbage?
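Before deploying a rule like that, it can be sanity-checked with Python's standard `urllib.robotparser`. This is only a sketch: the `User-agent` tokens (Googlebot, Slurp, msnbot) and the example.com URLs are assumptions, since only the `Disallow` line is given above.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt group: three crawler tokens sharing one Disallow rule.
robots_txt = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Images are off limits to the listed crawlers; ordinary pages are not.
print(parser.can_fetch("Googlebot", "http://example.com/images/photo.jpg"))
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))
```

Keep in mind robots.txt only stops crawlers that choose to honor it, which is the whole problem with scrapers.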
Firing up the DMCA letters now, several search engines and ISPs are about to be served...
If your images show up on their pages, join me in fighting this good fight.
Posted by IncrediBILL at 11/24/2006 04:22:00 PM
11 comments
Wednesday, November 22, 2006
Exalead Preview Violating Webmasters Content
It's been ages since I've wandered over to Exalead and played with it for a while. I get a few spurious hits from their cute little search engine so I thought I'd explore for a bit and see what it had to offer.
Oh look, nice layout, thumbnails, click on the thumbnails and get a site preview...
Oh my god, they downloaded my page in real-time, stripped out my javascript so they could frame it without my frame buster working, and the page looks like shit now.
I'm speechless...
Not to mention infuriated that they would violate my content in such a manner.
If you want to just block the preview mode, they send a request like this:
193.47.80.78 "GET / HTTP/1.1" "http://www.exalead.com/search" "NG/4.0.2897.395"
So blocking "^NG/" in .htaccess should do it, and add "NOARCHIVE" to all your pages just to make sure they don't pull up an old copy, as that would REALLY piss people off if they don't honor NOARCHIVE.
If you just want to block the bot, it's "Exabot/3.0".
If you just want to block them completely, they crawl from here:
inetnum: 193.47.80.0 - 193.47.80.255
netname: EXALEAD
route: 193.47.80.0/24
Just another reason why webmasters will keep hating some web sites and search engines because they just don't get it so fuck 'em, they can't play in my sandbox any more.
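If you filter in application code instead of .htaccess, the three blocking options in this post boil down to two user agent patterns and one network. A minimal Python sketch of that matching logic; `should_block` and its `block_network` flag are hypothetical names.

```python
import ipaddress
import re

PREVIEW_UA = re.compile(r"^NG/")       # preview fetcher, per the log line above
CRAWLER_UA = re.compile(r"Exabot/")    # the regular crawler
EXALEAD_NET = ipaddress.ip_network("193.47.80.0/24")


def should_block(ip, user_agent, block_network=False):
    """Decide whether a request matches the Exalead patterns in this post."""
    if PREVIEW_UA.search(user_agent) or CRAWLER_UA.search(user_agent):
        return True
    # Optionally block everything coming from their address range.
    return block_network and ipaddress.ip_address(ip) in EXALEAD_NET


print(should_block("193.47.80.78", "NG/4.0.2897.395"))
```

Blocking by network is the only option that still works if they change the user agent.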
Posted by IncrediBILL at 11/22/2006 10:24:00 AM
7 comments
Tuesday, November 21, 2006
Bezeqint Hosts Scrapers, Spammers and more
The previous post about Hunting Picscout assumes that they are operating out of bezeqint.net which is where their website is hosted. The decision to block the first range of bezeqint.net from the previous post was easy because it appears to be a data center where residential customers wouldn't be blocked.
Then there is this other barrage of crap coming from what claims to be BEZEQINT-CABLES which may be residential but I can't read Hebrew so who knows. Anyone that can translate Bezeqint's site and give us more clues would be greatly appreciated.
This one particular IP tried to crawl about 300 pages:
84.110.241.167 "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"
Garden variety scraper or part of Picscout?
Hard to say.
However, while researching bezeqint.net looking for Picscout, I found a new rash of activity, but these hits were all in my spam trap: no referrers, all one-shot attempts to post something that was blocked, mostly about Viagra.
The spammers all came from these blocks:
inetnum: 84.110.208.0 - 84.110.223.255
netname: BEZEQINT-CABLES
inetnum: 84.110.224.0 - 84.110.239.255
netname: BEZEQINT-CABLES
inetnum: 84.110.240.0 - 84.110.255.255
netname: BEZEQINT-CABLES
I don't know what in the hell is going on with Bezeqint, but I think I'm going to start tracking whether I'm getting any legitimate traffic from there, and if not, I'll just block their entire network, as nothing good seems to be coming from them.
Here's the big list, which makes me wonder if it's DHCP or a botnet:
84.110.208.4 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.211.29 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.217.105 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.217.116 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.217.192 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.220.146 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.220.90 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.224.132 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.224.15 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.224.15 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.224.152 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.225.61 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.225.84 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.225.95 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.226.179 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.226.248 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.226.93 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.226.94 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.227.133 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.227.175 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.228.115 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.228.126 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.229.104 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.229.189 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.229.240 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.229.250 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.229.73 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.107 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.12 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.134 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.154 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.200 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.52 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.231.99 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.232.216 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.232.239 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.232.5 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.177 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.193 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.207 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.229 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.245 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.252 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.233.39 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.234.113 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.235.110 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.236.103 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.236.112 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.236.116 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.236.157 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.236.8 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.236.93 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.237.49 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.237.93 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.238.117 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.238.221 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.238.37 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.239.139 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.239.69 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.240.110 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.240.242 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.240.39 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.240.42 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.241.132 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.241.149 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.241.163 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.241.187 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.241.45 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.241.98 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.242.118 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.242.141 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.242.86 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.242.88 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.243.107 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.243.125 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.243.17 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.243.86 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.244.148 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.244.185 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.244.201 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.244.240 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.244.254 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.244.4 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.245.122 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.245.124 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.245.154 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.245.247 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.246.10 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.246.223 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.246.226 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.246.41 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.247.126 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.247.28 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.248.165 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.248.226 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.249.117 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.249.201 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.249.217 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.249.218 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.120 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.131 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.155 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.189 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.213 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.68 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.71 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.250.87 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.251.112 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.251.141 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.251.150 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.251.80 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.252.10 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.252.133 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.252.165 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.252.178 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.252.44 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.253.151 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.253.186 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.253.83 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.254.213 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.254.237 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.254.33 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.254.67 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.255.214 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.255.248 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.255.55 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.255.81 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
84.110.255.84 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
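A quick way to see how a list like this spreads across the three BEZEQINT-CABLES ranges is to count distinct addresses per block. A minimal sketch with a few addresses from the list as sample input; `hits_per_block` is a hypothetical name, and this doesn't settle the DHCP-versus-botnet question, it just shows the spread.

```python
import ipaddress
from collections import Counter

# The three BEZEQINT-CABLES ranges above, written as CIDR blocks.
BLOCKS = [ipaddress.ip_network(n) for n in
          ("84.110.208.0/20", "84.110.224.0/20", "84.110.240.0/20")]


def hits_per_block(ips):
    """Count distinct source IPs seen in each netblock."""
    counts = Counter()
    for ip in set(ips):                      # repeat visitors count once
        addr = ipaddress.ip_address(ip)
        for block in BLOCKS:
            if addr in block:
                counts[str(block)] += 1
    return counts


sample = ["84.110.208.4", "84.110.224.15", "84.110.224.15", "84.110.241.132"]
print(hits_per_block(sample))
```

Feed it the whole spam trap log to get the real counts per block.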
Posted by IncrediBILL at 11/21/2006 12:19:00 PM
2 comments
Labels: Scrapers
Monday, November 20, 2006
Abotcalypse Now
My goodness, I must be making those people that run bad bots afraid because the number of blog and comment posts decrying how I'm about to destroy the web are cropping up almost daily since my presentation at PubCon.
Assuming that the people making these posts are the same people causing the problems that drive me to the "extremes of mouth-frothing, profanity, and severe bot-blocking" you might think that they would take note that they are responsible for the bad behavior that might stop the next Google from being born.
Unfortunately I'm the "arrogant twit" as I'm so self-obsessed with my mission from hell that I'm oblivious to how my technology will wreck the net.
Riiiiiiiiiiiight.
You people running scrapers need to take stock of the damage you'll ultimately cause and stop trying to pass the blame my way.
I'm just a supplier of weapons in the war on bad bots so surrender now and save the web.
Nah, that would be too easy as they don't think they do anything wrong.
In summary, not being someone that wants to disappoint the author of that amusing post, I laughed so hard reading it because it was just too fucking funny. There's your profanity.
Enjoy.
Posted by IncrediBILL at 11/20/2006 10:20:00 PM
2 comments
Labels: Bad Bots
Saturday, November 18, 2006
Good Scrapers, Bad Scrapers and Tinkerers, OH MY!
Someone posting on Freedom to Tinker as Neo said a similar thing to Greg Yardley's post that my bot blocking endeavors are going to stop tinkerers and end innovation on the web which is patently untrue.
The only thing my bot blocker is going to do is give any webmaster, even non-technical neophytes, easy access to tools for monitoring and controlling access to their sites that are both easy to understand and easy to administer. No more cryptic crap. The software will show them what's accessing their site so they can make informed decisions about what should crawl and what shouldn't. That's what it's all about, knowledge, as knowledge is power and gives the webmaster the upper hand.
I'm not the only one blocking everything either as Brett Tabke of WebmasterWorld blocked everything from crawling for a while just to see what was bouncing off his firewall. What Brett decided to do was just require logins from people coming from bad internet neighborhoods. Since most websites don't have logins and subscriptions, my solution was to use captchas when bad behavior happens.
Yes, I'll admit I'm on a tear and block everything under the sun but I have a real purpose in my madness which is feeding bread crumbs to the rest of the creepy crawlers hitting my site so I know who they are, where they came from and where the content appears when it's indexed by search engines.
However, I don't intend to enforce my particular brand of blocking on everyone who decides to use my bot blocker, as one size doesn't fit all. The software has lots of options that the webmaster can set and, assuming the webmaster checks his control panel now and then, it shows what new things are on the web and lets him grant or deny access.
I don't foresee my bot blocker causing Neo's or Yardley's apocalyptic view of the web whatsoever but I do foresee the following changes:
- New bots and people tinkering might just have to ask the network of bot blockers for permission first to get access, not a big deal and easily done.
- Sloppy bots will go away or be fixed when they get stopped doing dumb things.
- User agents will be unique per site or software, no more Java/1.5.0_03 so they can either learn how to set the UA or stay off the net.
- Good scrapers that scrape for directories, that actually provide real links to sites, will need to identify themselves or go away.
- Bad scrapers will be in serious jeopardy as the scraping noose closes.
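To give a feel for how rules like the ones in this list might combine, here is a deliberately simplified sketch. It is not the actual bot blocker; `decide`, `approved_bots` and the generic user agent patterns are hypothetical choices for illustration.

```python
import re

# Illustrative patterns for default library user agents, e.g. "Java/1.5.0_03".
GENERIC_UA = re.compile(r"^(Java/|libwww-perl/|Python-urllib/)")


def decide(user_agent, approved_bots, misbehaving=False):
    """Return 'allow', 'captcha', or 'deny' for one request."""
    if GENERIC_UA.match(user_agent):
        return "deny"        # never bothered to set a unique user agent
    if any(bot in user_agent for bot in approved_bots):
        return "allow"       # asked permission and was granted access
    if misbehaving:
        return "captcha"     # challenge when bad behavior happens
    return "allow"


print(decide("Java/1.5.0_03", {"Googlebot"}))
```

The webmaster controls the approved list, which is where the opt-in part comes from.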
It's just the bottom feeding scrapers and spammers that will be in serious trouble and we may see botnets emerge to do the bidding of the nastiest of the crawlers.
OOOPS!
Too late, botnets already exist and other groups are actively fighting the botnets.
So what am I missing that bot blocking technology will cause?
Oh yes, the return of MANNERS, COURTESY and RESPECT FOR COPYRIGHT which means asking permission, being OPT-IN, not just taking what you want regardless of the webmaster's wishes.
When you ask to crawl my site it's a business arrangement, you want to build a business and ask MY PERMISSION to be included in your business.
This is how it works in the real world.
If you want to do business with someone you have to ask first.
It would appear that many think respect and courtesy aren't part of the Internet, and the sense of entitlement to content just because it's on a PUBLIC NETWORK is flat wrong.
Walmart is technically a public place, anyone can just walk in the door, but if you walked into Walmart and did what most scrapers do on the web, they would call the cops and haul your ass off to jail. Before you respond that Walmart is a private company, even the Public Library frowns on people doing what scrapers do: they have signs posted above the copying machines warning you about copyright and that you can copy only small quantities for personal use.
I'm just giving webmasters the same control Walmart has:
NO SHIRT. NO SHOES. NO SERVICE.
Pretty simple.
The webmasters will be able to control their site as much as technology allows. If we get to the point that Neo suggests where every visitor has to enter a captcha before they can access any website, I suspect some legislation will possibly occur that will make crawling without permission an offense and the Australians are already working on legislation which is flawed, but they are heading in that direction.
I'm just making the tool, not telling people how to implement it.
The choice is up to the internet, webmasters and politicians how this all plays out, not me.
Posted by IncrediBILL at 11/18/2006 08:40:00 PM
4 comments
Labels: Bad Bots
Google's Anti-Phish ROCKS!
After reading all of the whiners and complainers going on and on about how anti-phish in browsers was going to give people a false sense of security I decided to put it to the test today when a phishing email landed in my Inbox.
Within minutes of the arrival of the phishing email, I enabled the Google anti-phish in Firefox 2.0 and went to the site linked in the email:
http://g-lec.com/data/cont/news/sicherung/einfach_millionaer1/wells/
The very minute the screen loaded Google popped up an alert:
Here's the page without the Google alert covering it:
I'll try the anti-phish a few more times as the opportunity arises, but this first test was impressive. As soon as I get around to installing IE 7 then I'll test their anti-phish as well.
Way to go Google!
You get a nice well deserved pat on the back for this one!
Posted by IncrediBILL at 11/18/2006 04:56:00 PM
3 comments
Labels: Phishing
eBay is Scraping?
Caught this story on WebmasterWorld about eBay scraping and sure enough found evidence of the same thing in my site.
The first IP is definitely a stealth bot, it's blocked, yet keeps asking for pages over the last couple of months.
216.113.181.67 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461; .NET CLR 1.1.4322)"
This IP address used a banned user agent so it would never be allowed to crawl in the first place yet still asked for a couple of page names it already knew about, weird.
216.113.168.141 "Java/1.5.0_09"
Here's eBay's info so you can block whatever the hell they're doing:
OrgName: eBay, Inc
OrgID: EBAY
Address: 2145 Hamilton Ave
City: San Jose
StateProv: CA
PostalCode: 95008
Country: US
NetRange: 216.113.160.0 - 216.113.191.255
CIDR: 216.113.160.0/19
What will they do with the information they collect, sell it to the highest bidding scraper?
Posted by IncrediBILL at 11/18/2006 01:16:00 PM
0 comments
Labels: Scrapers
Stealing T-H-U-N-D-E-R-S-T-O-N-E's Thunder
Here's another LayeredTech scraper busted for your amusement.
They call themselves a search engine and a web crawler, but when I can't find any information that ties the crawler back to the source without jumping through extraordinary means, such as feeding them bread crumbs to chase through the internet, I call them scrapers.
Here's the scraper or web crawler as they call it:
72.232.181.210 "Mozilla/2.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)"
Here's where the scrapings end up:
http://www.buyersindex.com/
Apparently this thing is probably the Webinator by Thunderstone Software but it's hard to tell as the user agent has no link to any crawler information and a quick casual review of either website turned up nothing about the crawler.
It's not exactly like they're hiding or anything but it isn't completely above board either by not divulging who's crawling and why.
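The "no link to any crawler information" test is easy to automate, at least roughly. This is only a sketch of the idea, not how my blocker does it: the has_contact_info name and the regular expression are assumptions, and all it checks is whether a URL or an email address appears anywhere in the user agent string.

```python
import re

# A URL or an email address anywhere in the user agent counts as contact info.
CONTACT_RE = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+", re.IGNORECASE)


def has_contact_info(user_agent: str) -> bool:
    """Return True if the user agent points back to a site or mailbox."""
    return CONTACT_RE.search(user_agent) is not None
```

A crawler can still put a bogus URL in there, so this only catches the ones that don't bother to identify themselves at all.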
Posted by IncrediBILL at 11/18/2006 10:56:00 AM
0 comments
Labels: Scrapers
Will Google Really Banish Scrapers?
Many people at PubCon, including some major companies, were telling me their tales of scraper horror. All the stories were similar about being endlessly abused and they were having trouble getting the problem under control or just gave up in frustration. Several people even asked the search engines what they were going to do about scrapers in the Q&A of some PubCon sessions and got the old "we're working on it" response which I think is half-hearted.
When you consider that AdSense technology fuels most scraper sites it's obvious Google could simply look at any AdSense account serving up ads from a multitude of locations which is usually a clue there's something rotten happening. Not that everyone with AdSense on multiple domains is bad, but when you see a single AdSense account used on thousands of locations, you know there's a good chance it's all crap. However, Google probably makes way too much money from scrapers just to eliminate them altogether. What's more than likely to happen is Google might drop scrapers from the Google index but leave their AdSense accounts intact so that the revenue stream continues from these sites being found in Yahoo and MSN.
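To show how simple that first pass could be, here's a hedged sketch. I have no idea what Google actually runs internally; the (publisher_id, domain) input shape, the flag_accounts name and the threshold are all hypothetical. It just counts distinct domains per account and flags the ones that go over the limit.

```python
from collections import defaultdict


def flag_accounts(sightings, threshold=1000):
    """Return publisher IDs seen on more than `threshold` distinct domains.

    `sightings` is an iterable of (publisher_id, domain) pairs, a made-up
    input format used only for this illustration.
    """
    domains = defaultdict(set)
    for pub_id, domain in sightings:
        domains[pub_id].add(domain.lower())  # treat domains case-insensitively
    return sorted(p for p, seen in domains.items() if len(seen) > threshold)
```

Anything flagged this way is only a candidate for a closer look, since plenty of legit publishers run one account across multiple domains.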
Perhaps we can hope Yahoo and MSN figure out how to detect and eliminate scrapers first and put our friends at Google between a rock and a hard spot with the dilemma of scrapers vs. AdSense revenue. Either Google would have to clean up their search results to make the users happy or leave the scrapers in to make the stockholders and bean counters happy, which could backfire either way. Needless to say, I don't see scrapers going away any time soon because the financial incentives to keep them are just too great.
Meanwhile, I recommend reporting scrapers on Google's Report a Spam Result page and see if Google is serious about getting rid of scrapers when found.
Posted by IncrediBILL at 11/18/2006 09:28:00 AM
0 comments
Sunday, November 12, 2006
Billed as a RoadBlock to the Semantic Web
Got a sudden burst of traffic from Greg Yardley's site today and noticed the topic was about "The coming semantic web roadblock" which I find amusing as I loathe the onslaught of data miners that hit my site and block their asses automatically on a daily basis.
Greg raised a couple of issues that I've heard a few times from other people: that my technology will block everything and prevent new search technology from becoming established, and that it could block things that are currently providing value for your site. That's not entirely true, nor is it my intent at all.
Remember, my primary goal is to make the websites using my product OPT-IN or whitelist things that want access instead of OPT-OUT or blacklist which doesn't work at all.
When you first install this bot blocking tool, it's in a PREVIEW mode by default which means you can see what it would be blocking but no action is being taken. It's completely passive when it's in PREVIEW mode and doesn't even challenge possible stealth scrapers, so it may not know if they're human or not but will take a guess. That means you can observe what's going on with your website for days or even weeks and then authorize anything that's providing value before turning the product LIVE and blocking the rest.
Now the next thing that's important to know is that the product records and reports new user agents that appear, so you will see in REAL TIME when something new, never before seen, hits the site. Remember, since we're OPT-IN, we haven't decided if these new things are good or bad yet, so the first time they visit the site they'll get bounced whether or not they honor robots.txt. The next time they visit, if the webmaster decided to let them in, they'll be allowed to crawl without issue.
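For anyone who wants to see the OPT-IN flow spelled out, here's a toy sketch. It is not the product's code: the OptInGate and Mode names are invented, and it keys on the user agent string alone, which ignores the stealth scraper detection entirely.

```python
from enum import Enum


class Mode(Enum):
    PREVIEW = "preview"  # observe only, never block
    LIVE = "live"        # block anything the webmaster has not approved


class OptInGate:
    def __init__(self, mode=Mode.PREVIEW):
        self.mode = mode
        self.whitelist = set()  # user agents the webmaster has approved
        self.seen = set()       # every user agent observed so far
        self.new_agents = []    # first-time sightings, in arrival order

    def authorize(self, user_agent):
        """Webmaster decision: let this user agent crawl from now on."""
        self.whitelist.add(user_agent)

    def check(self, user_agent):
        """Record the visit and return True if the request should be served."""
        if user_agent not in self.seen:
            self.seen.add(user_agent)
            self.new_agents.append(user_agent)
        if self.mode is Mode.PREVIEW:
            return True  # passive: report what would happen, block nothing
        return user_agent in self.whitelist
```

In PREVIEW everything gets through but new_agents still fills up, so the webmaster can review it and authorize whatever is providing value before flipping to LIVE.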
To summarize, it's up to each webmaster, not me, my tool or my service, to decide whether the Semantic Web becomes a reality.
I prefer to think of Web 3.0 as the Democratic Web so if the majority decides to vote the Semantic Web out, who am I to argue?
Posted by IncrediBILL at 11/12/2006 04:50:00 PM
0 comments
Heritrix Activity Report
Heritrix isn't being adopted at the same rapid pace as Nutch is, but it keeps showing up from more and more places.
Here's the list of sightings. The one that gives me the biggest giggle is the first, which claims to be "google.com" but came from Mannheim University in Germany.
134.155.241.9 "Mozilla/5.0 (compatible; heritrix/1.10.0 +http://google.com)"
137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"
137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"
152.163.214.140 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"
152.163.214.141 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"
152.163.214.144 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://wiki.office.aol.com/wiki/SEO)"
193.40.192.35 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://erika.nlib.ee)"
195.39.35.118 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.researcher.cz)"
198.162.51.70 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"
207.241.233.35 "Mozilla/5.0 (compatible;archive.org_bot/heritrix-1.9.0-200608171144 +http://pandora.nla.gov.au/crawl.html)"
209.128.119.17 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"
209.128.119.46 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://innovationblog.com)"
216.182.228.85 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
217.91.71.203 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.schluetersche.de)"
24.8.197.68 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"
67.162.138.161 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"
71.229.152.72 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"
71.56.215.150 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://crawlerx51.com)"
72.20.99.46 "Mozilla/5.0 (compatible; heritrix/1.8.0 +http://www.accelobot.com)"
87.98.198.194 "Mozilla/5.0 (compatible; heritrix/1.4.0 +http://www.hanzoweb.com)"
The other one I found amusing was the Accelobot which claims to "help automate market research" and I wonder if their research showed them I wasn't interested in their help?
Not nearly as popular as other tools, but picking up a little steam unfortunately.
We'll keep an eye on this and let people know when it hits epidemic proportions.
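If you keep your own sightings in the same "IP plus quoted user agent" layout as the list in this post, a few lines of Python will tally them. This is just a sketch: the LINE_RE pattern, parse_sighting and ips_per_operator are names I made up, and the pattern only handles the heritrix user agent shapes seen in the list.

```python
import re
from collections import Counter

# Captures: IP, heritrix version, and the operator URL that follows the "+".
LINE_RE = re.compile(
    r'^(\S+)\s+"[^"]*heritrix[/-](\d+(?:\.\d+)*)[^"]*\+(\S+?)\)"$'
)

# A few lines copied from the list in this post.
SAMPLE = [
    '134.155.241.9 "Mozilla/5.0 (compatible; heritrix/1.10.0 +http://google.com)"',
    '137.82.84.97 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"',
    '198.162.51.70 "Mozilla/5.0 (compatible; heritrix/1.6.0 +http://www.worio.com/)"',
    '207.241.233.35 "Mozilla/5.0 (compatible;archive.org_bot/heritrix-1.9.0-200608171144 +http://pandora.nla.gov.au/crawl.html)"',
]


def parse_sighting(line):
    """Return (ip, version, operator_url), or None if the line doesn't match."""
    m = LINE_RE.match(line.strip())
    return m.groups() if m else None


def ips_per_operator(lines):
    """Count distinct IPs seen for each operator URL."""
    ips = {}
    for line in lines:
        parsed = parse_sighting(line)
        if parsed:
            ip, _version, url = parsed
            ips.setdefault(url, set()).add(ip)
    return Counter({url: len(addrs) for url, addrs in ips.items()})
```

Counting distinct IPs per operator rather than raw lines keeps repeat visits from the same address from inflating the numbers.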
Posted by IncrediBILL at 11/12/2006 03:28:00 PM
2 comments
Tracking HTTrack Website Downloader
I'm just curious why over 100 people in the last few months thought they could just download my whole website (not this blog) with HTTrack?
What were these dumb fucks going to do with it once they got it anyway?
- Run a scraper script on the results?
- Blatantly republish the content with their own template?
- Run some data mining scripts on it?
- Keep a copy just for shits and giggles?
Here's a list of attempts to download the site:
12.218.132.246 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
151.196.39.206 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
151.44.39.130 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
151.57.203.117 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
157.150.112.6 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
160.75.107.93 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
166.102.234.113 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
168.209.97.34 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
193.194.84.227 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
193.253.222.153 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
194.138.39.53 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
194.51.93.106 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
194.57.91.165 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
195.115.20.132 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
195.229.242.53 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
195.246.48.241 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
196.1.179.77 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
196.30.245.149 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
196.31.142.11 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
200.170.96.119 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
201.0.55.48 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.147.168.130 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.58.205.163 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.65.119.252 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.83.173.59 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
202.90.87.7 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
203.189.231.13 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
203.87.188.194 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
206.223.8.30 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
208.102.27.19 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
208.255.142.57 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
208.255.142.57 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.117.81.45 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.129.60.250 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.200.201.102 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.200.201.214 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.200.203.48 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.251.8.5 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.81.218.82 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
212.93.224.35 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.136.106.252 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.216.199.2 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.228.0.86 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
213.23.124.2 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
216.108.210.225 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
216.76.80.93 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
220.247.221.131 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
24.205.6.210 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
61.90.220.86 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
62.210.102.125 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
62.57.32.142 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
64.222.233.72 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
68.220.248.94 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
69.22.0.123 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
69.88.8.6 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
69.88.8.6 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
70.71.114.43 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
71.227.195.118 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
71.70.233.219 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
72.255.6.100 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
74.132.128.2 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.103.33.75 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.144.203.67 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.144.234.32 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.170.26.10 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.170.39.87 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.191.116.41 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
80.53.155.234 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.208.36.91 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.245.178.4 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.246.203.43 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.250.148.63 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.29.232.56 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.50.176.143 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.56.85.53 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
81.90.175.201 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.16.147.149 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.225.167.110 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.228.167.150 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.239.139.105 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.242.65.70 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.245.61.27 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.248.45.214 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.65.0.229 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.66.135.81 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.83.202.247 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.93.27.229 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
82.93.27.229 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.135.199.34 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.135.224.26 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.16.51.174 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.179.163.75 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.93.133.158 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
83.93.133.158 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.162.79.29 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.245.166.176 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.4.209.62 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.6.122.9 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.72.193.77 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
84.90.2.1 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
86.195.214.61 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
86.68.132.131 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
87.218.59.4 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
87.81.178.38 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
87.89.114.228 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
88.139.139.203 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
88.73.106.129 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
89.54.130.12 "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
The best part is, trying to download my site from my server gets them all automatically banned.
Greed will get you nowhere, not on my site anyway!
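The automatic ban doesn't need to be fancy. Here's a rough sketch of the idea, not my actual code: the AutoBan class and its in-memory set are stand-ins for whatever your server really uses, and matching on the word "httrack" in the user agent is the only rule.

```python
class AutoBan:
    """Ban an IP the first time it shows up with a blocked user agent."""

    def __init__(self, blocked_tokens=("httrack",)):
        self.blocked_tokens = tuple(t.lower() for t in blocked_tokens)
        self.banned_ips = set()

    def allow(self, ip, user_agent):
        """Return False for banned IPs and for any request that earns a ban."""
        if ip in self.banned_ips:
            return False
        ua = user_agent.lower()
        if any(token in ua for token in self.blocked_tokens):
            self.banned_ips.add(ip)  # remember the offender for later visits
            return False
        return True
```

Because the IP goes into the ban set, coming back with a different user agent doesn't get them back in.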
Posted by IncrediBILL at 11/12/2006 01:39:00 PM
0 comments
Labels: Scrapers
Here a Nutch, There a Nutch, Everywhere a Nutch Nutch
Nutch usage seems to be breeding faster than cousins in Kentucky so I figured it was time to post a sequel to the original How Much Nutch is Too Much Nutch.
Here's a complete breakdown of every IP that I've seen using Nutch with the actual word Nutch in the user agent, for a grand total of 190 IPs crawling to date. Several of them like Cazoodle, MQBOT, and a few .EDU's are crawling from a block of IPs but the majority seem to be scattered all over the place.
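One rough way to spot the crawlers working from a block of IPs is to bucket the addresses by network. This is a sketch only: group_by_block is a made-up name, and the /24 prefix is an arbitrary guess at what counts as a block, since real allocations can be bigger or smaller.

```python
import ipaddress
from collections import defaultdict


def group_by_block(ips, prefix=24):
    """Map each network holding two or more of the given IPs to its members."""
    blocks = defaultdict(set)
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        blocks[str(net)].add(ip)
    return {net: sorted(members) for net, members in blocks.items() if len(members) > 1}
```

Feed it the Cazoodle and MQBOT addresses from the list and they collapse into one block each, while the one-off IPs drop out.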
Here's the list of all the creepy crawling Nutches:
124.32.246.36 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)Wasn't that fascinating reading?
124.32.246.45 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
128.208.3.173 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; raphael@unterreuth.de)
128.208.6.125 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.200 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.207 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.226 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.227 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.232 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.75 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.208.6.77 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)
128.95.1.189 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
128.97.88.68 ilial/Nutch-0.9-dev
128.97.88.70 ilial/Nutch-0.9-dev
129.242.19.138 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
129.34.20.19 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
129.78.64.106 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.137.86 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.139.202 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.139.205 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.139.206 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.139.211 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.139.212 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
13.1.139.213 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
131.112.125.102 asked/Nutch-0.8 (web crawler; http://asked.jp; epicurus at gmail dot com)
131.112.125.103 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
131.112.125.104 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
131.112.125.106 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
131.112.16.220 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
131.211.84.21 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
140.247.62.79 blogsearch/Nutch-0.9-dev
140.247.62.80 blogsearch/Nutch-0.9-dev
147.202.90.2 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
159.226.5.82 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
164.67.195.201 ilial/Nutch-0.9-dev
164.67.195.245 ilial/Nutch-0.9-dev
164.67.195.26 ilial/Nutch-0.9-dev
164.67.195.27 ilial/Nutch-0.9-dev
164.67.195.67 ilial/Nutch-0.9-dev
164.67.195.68 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
164.67.195.86 ilial/Nutch-0.9-dev
166.214.93.76 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
192.17.240.19 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.20 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.41 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.43 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.44 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.46 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.47 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.48 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.52 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.56 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.57 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.58 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.60 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.71 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.74 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
192.17.240.76 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)
193.145.45.68 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.203.240.117 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.203.240.118 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.203.240.119 HouxouCrawler/0.8-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)
193.203.240.120 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.203.240.121 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.203.240.122 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.252.148.51 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
193.42.229.3 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
195.72.131.70 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)
195.72.131.72 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)
195.72.131.73 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)
195.72.131.80 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)
203.113.130.205 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
203.147.0.44 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
203.199.83.162 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
203.244.218.1 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
207.176.224.241 Nutch/Nutch-0.8.1
207.176.224.245 Nutch/Nutch-0.8.1
207.214.93.42 MyNutch/V 0.3 (JP's Nutch Test Search Engine; jpnutch at yahoo dot com)
208.64.57.65 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
210.174.3.130 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
210.196.73.193 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
210.245.31.15 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
210.245.31.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
211.152.34.34 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
212.101.97.63 test/Nutch-0.8.1 (test; www.apache.org; test@apache.org)
212.12.114.238 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
212.137.33.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
212.156.230.210 BilgiBetaBot/0.8-dev (bilgi.com (Beta) ; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
212.58.116.72 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
213.132.175.101 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
213.157.204.141 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
213.251.133.12 Misterbot-Nutch/0.7.1 (Misterbot-Nutch; http://www.misterbot.fr; nutch at misterbot.fr)
216.182.225.186 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)
216.182.236.46 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)
216.182.237.45 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)
216.93.185.12 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
217.153.59.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
217.31.51.128 Megatext/Nutch-0.8.1 (Beta; http://www.megatext.cz/; microton@microton.cz)
218.25.39.81 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
220.130.191.231 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.232 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.233 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.234 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.235 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.236 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.237 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.238 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.239 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
220.130.191.240 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)
221.114.253.210 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
221.116.237.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
221.221.237.35 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
222.173.249.33 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
222.173.249.33 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
24.222.153.250 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
24.6.168.184 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)
58.186.61.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
58.187.12.236 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
58.215.74.242 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
58.215.75.2 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
58.87.139.90 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
59.160.240.115 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
59.160.240.116 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
59.160.240.183 Nutch-test/Nutch-0.9-dev
59.160.240.184 Nutch-test/Nutch-0.9-dev
59.160.240.185 Nutch-test/Nutch-0.9-dev
59.176.10.136 NutchCVS/0.01-beta (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
60.248.9.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
61.135.151.175 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
62.129.132.47 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
62.168.188.151 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
62.40.33.173 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
62.40.36.87 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
63.133.162.98 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
63.246.7.209 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
64.105.36.210 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
64.241.242.18 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
64.242.88.10 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
64.242.88.60 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
64.34.172.78 BurstFind Crawler 1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; crawler@burstfind.com)
64.34.180.167 Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
64.38.10.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
64.71.164.125 Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://www.krugle.com/crawler/info.html; webcrawler@krugle.com)
65.220.67.9 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
65.92.160.39 JLA/Nutch-0.8.1 (beta; http://dynamic.com/index.htm; info at test.com)
66.132.240.180 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
66.132.249.23 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
66.15.68.234 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
66.207.120.226 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
66.243.31.34 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
67.111.28.139 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
67.184.246.61 Nutch/Nutch-0.8 (Nutch Test; none; none)
67.52.101.242 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
68.178.171.109 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)
68.178.202.79 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)
68.205.124.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
68.205.127.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
68.97.222.117 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
69.248.26.83 Comrite/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
69.36.233.8 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
69.55.233.28 Argus/1.1 (Nutch; http://www.simpy.com/bot.html; feedback at simpy dot com)
70.143.79.234 JPNutchTest/Nutch-0.9-dev-JP-0.1 (JP Nutch Test; jpnutch at yahoo dot com)
70.197.81.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
70.56.66.216 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
70.90.188.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
70.96.99.254 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
71.216.0.210 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
71.217.33.149 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
71.241.153.125 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
71.35.163.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
72.0.207.162 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
72.2.25.66 abcxyz/Nutch-0.8 (nutchtesting; nutch; abc@xyz.com)
72.2.25.67 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
72.2.25.71 Nutch/Nutch-0.8
72.5.173.22 sdcresearchlabs-testbot/Nutch-0.9-dev (www.shopping.com/bot.html; researchbot@shopping.com)
72.51.37.148 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
72.84.30.230 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
75.44.225.44 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
81.173.148.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
81.173.155.210 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
81.203.142.109 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
81.93.168.211 TRankBot/Nutch-0.8.1 (T-Rank AS; http://www.trank.no/; robot at trank dot no)
83.246.79.28 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
84.191.111.92 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
84.231.72.32 agent/Nutch-0.8 (http://lucene.apache.org/nutch/bot.html)
84.231.74.47 nutch/Nutch-0.8.1
85.117.62.114 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
85.18.14.22 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
87.139.106.60 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
88.191.23.109 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
This is some crazy shit, almost like a DoS attack by non-stop web crawlers, and I suspect it will get even worse as more people try to mine the Internet for free money.
Load up the firewall and your .htaccess filters with protection and brace for impact.
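Here's a minimal sketch of that kind of user agent filter, written in Python rather than as actual .htaccess rules. It assumes a case-insensitive match on "nutch" is enough to catch the agents in the list above; the name `is_nutch_agent` is hypothetical, and a crawler that renames itself completely will slip past, so treat it as a first pass only.

```python
import re

# Assumption: every unwanted agent carries the word "Nutch" somewhere in its
# User-Agent string, as all the entries in the list above do.
NUTCH_PATTERN = re.compile(r"nutch", re.IGNORECASE)


def is_nutch_agent(user_agent: str) -> bool:
    """Return True when the User-Agent string mentions Nutch in any casing."""
    return bool(NUTCH_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    samples = [
        "NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; "
        "nutch-agent@lucene.apache.org)",
        "nutch/Nutch-0.8.1",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    ]
    for ua in samples:
        verdict = "block" if is_nutch_agent(ua) else "allow"
        print(verdict, ua)
```

The same substring test could be expressed as a rewrite condition on the server, but keeping it in one function makes it easy to run against a log file first and see what would get caught.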
Posted by IncrediBILL at 11/12/2006 12:52:00 PM (2 comments)
Thursday, November 09, 2006
JAP Anonymization Protects Scrapers' Privacy
Isn't this nice, the JAP anonymization service is so busy trying to protect people's privacy that they don't give a shit that people will use their technology to assault web servers. Their slogan proclaims "ANONYMITY ISN'T A CRIME", but aiding and abetting an assault on a server could be considered a crime, and it's questionable ethics at a minimum.
I got hit by someone utilizing their bullshit yesterday:
141.76.45.35 [proxy2.anon-online.org.] "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"
141.76.45.34 [proxy1.anon-online.org.] "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"
What these dipshits don't know is that the few pages of data they downloaded, before the bot blocker kicked in and stopped the assault, have all been injected with hidden tags using CSS. Humans don't see these tags but the scraper, when stripping the HTML to get my text, will expose these tags to the search engine, and then I'll be able to hunt them down like the dogs they are.
Anonymous doesn't mean anonymous for scrapers anymore, because even if you hide where you crawl from, this data showing up on the web will expose where you live, so be careful what you do with that data, you sneaky little bastards.
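For the curious, the hidden tag trick can be sketched roughly like this in Python. This is not the actual bot blocker code: the HMAC-based token, the `SECRET_KEY` value and every function name here are assumptions made up for illustration. The idea is just that each requesting IP gets its own marker, hidden from humans by CSS, which survives a scraper's naive tag stripping.

```python
import hashlib
import hmac
import re

# Hypothetical secret; in practice this would be private to the server.
SECRET_KEY = b"change-me"


def tracking_token(ip: str) -> str:
    """Derive a repeatable, hard-to-guess marker for one requesting IP."""
    digest = hmac.new(SECRET_KEY, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return "zx" + digest[:16]


def inject_hidden_tag(html: str, ip: str) -> str:
    """Add a CSS-hidden span carrying the marker, just before </body>."""
    tag = '<span style="display:none">{}</span>'.format(tracking_token(ip))
    if "</body>" in html:
        return html.replace("</body>", tag + "</body>", 1)
    return html + tag


def strip_html(html: str) -> str:
    """Naive tag stripping, roughly what a text scraper might do."""
    return re.sub(r"<[^>]+>", " ", html)


if __name__ == "__main__":
    page = "<html><body><p>Original article text.</p></body></html>"
    served = inject_hidden_tag(page, "141.76.45.35")
    scraped_text = strip_html(served)
    # The marker is invisible in a browser but present in the scraped text.
    print(tracking_token("141.76.45.35") in scraped_text)
```

Searching for the marker later and mapping it back to the IP that received it is what ties a republished copy to the original crawl.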
Posted by IncrediBILL at 11/09/2006 09:23:00 AM (5 comments)
Labels: Scrapers
Wednesday, November 08, 2006
Bot Blocking Obsession for Men
Today I realized that my bot blocking has become such an obsession that I'm almost worse than a cultist running around spreading the word of Bot.
What started as a simple effort to save my own website from virtually daily DoS attacks from Asia turned into a hobby as it was kind of fun looking for the next big thing hiding out there.
Then that turned into a product idea as it evolved and I realized the tools I needed didn't exist which is why I resorted to building them in the first place.
Then my universe turned upside down when I realized how much crap was going on that people weren't even aware of lurking under the cover of stealth on the net, THEN it became an obsession to build a product to restore privacy and control back to the web.
The upside is that obsession is a good quality for people trying to launch a new product, but it severely impacts your social skills when your mind can't get off the topic and you're burning all of your processing power 24/7 coming up with new insights and innovations for stopping crawlers.
Some days I wonder if I should call a priest so he can splash me with holy water and watch my head spin and spit green pea soup across the room like Linda Blair, but that's being POSSESSED, and I'm only OBSESSED.
The good news is the embroidery company sent my new Polo shirts and the logo translated to thread very nicely, I'm happy about it so now I have my uniform for Vegas next week ;)
Even better news is the graphics designer is working on the new layout for the bot blocker control panel and it should be HOT with CSS clickable bar charts with hover and shit, I can hardly wait to see them and bolt them into the software this weekend.
The crawler apocalypse is on the horizon, there's a light at the end of the tunnel, thanks to all you people for being patient as I wanted this bot blocker thing to be a real solution, not just rushed out the door, and taking my time has truly paid off in what my bot blocker is capable of doing in the real world.
The best is yet to come, I had polo shirts made, you can tell I'm serious ;)
Posted by IncrediBILL at 11/08/2006 11:38:00 PM (6 comments)
Saturday, November 04, 2006
Hunting PicScout, the Copyright Crawler Getty Uses
Everyone knows about PicScout used by Getty Images but nobody seems to know anything about PicScout's crawler: no user agent information, no IPs where they crawl from, nothing. When someone asked me if I knew anything about them I did a little research and nothing related could be found ANYWHERE, not even anything initially obvious in my bot blocker log files. Based on my initial observations PicScout actually seemed to be hiding better than all the other corporate crawlers I've researched to date, but maybe we can shed some light on this.
Not that I advocate copyright violation, as a matter of fact, I'm a staunch copyright defender.
However, attempting to crawl under the radar, refusing to honor robots.txt files or identify your bot in any fashion, and bypassing website security measures gets under my skin more than anything, so I picked up the gauntlet and tried to find signs of PicScout activity.
After the usual simple research methods failed, I decided to start by seeing where they were hosted.
host picscout.com
picscout.com has address 82.80.254.37
host 82.80.254.37
37.254.80.82.in-addr.arpa domain name pointer bzq-80-254-37.dcenter.bezeqint.net.
Ah ha!
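If you'd rather script that reverse lookup than run `host` by hand, here's a small Python sketch. The function names are hypothetical; `reverse_dns_name` only builds the in-addr.arpa name locally, while `lookup_hostname` does a live PTR query and so depends on the network and on whatever DNS returns today.

```python
import ipaddress
import socket
from typing import Optional


def reverse_dns_name(ip: str) -> str:
    """Build the in-addr.arpa name that a PTR query for this IP would use."""
    return ipaddress.ip_address(ip).reverse_pointer


def lookup_hostname(ip: str) -> Optional[str]:
    """Ask DNS for the PTR record; return None if nothing comes back."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # Covers lookup failures such as a missing PTR record or no network.
        return None


if __name__ == "__main__":
    ip = "82.80.254.37"
    print(reverse_dns_name(ip))
    print(lookup_hostname(ip))
```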
I remember a rash of activity I shut down from bezeqint.net a while back so I looked a little deeper into this angle.
inetnum: 82.80.248.0 - 82.80.255.255
netname: BEZEQINT-HOSTING
descr: BEZEQINT-HOSTING
country: IL
Ah yes, they're the guys from Israel that were hammering one of my servers.
I found a high volume of crawling from these IPs that was trapped by the bot blocker automatically and never answered the challenges, so it was definitely bot traffic.
82.80.249.195
82.80.249.196
82.80.249.197
82.80.249.201
82.80.249.202
82.80.249.203
82.80.249.204
82.80.252.130
These IPs have only been spotted using the two following user agents:
Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)
My theory is that this is PicScout attempting to crawl under the radar.
Check your logs people, see if you have any activity in this range, I think it's them.
I would just block this range out of principle at this point as those IPs crawling aren't honoring any internet standards, and if it is PicScout, blocking them could possibly save you a massive chunk of money if some web designer used stolen images building your website.
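As a rough sketch of what blocking that range looks like in code, the Python below turns the whois range 82.80.248.0 - 82.80.255.255 into CIDR form and tests addresses against it. The names `BLOCKED_NETWORKS` and `is_blocked` are hypothetical, and a real setup would more likely push the resulting CIDR into a firewall or .htaccess rule than check it per request.

```python
import ipaddress

# Range copied from the whois output for BEZEQINT-HOSTING.
FIRST = ipaddress.ip_address("82.80.248.0")
LAST = ipaddress.ip_address("82.80.255.255")

# summarize_address_range collapses a start/end pair into CIDR networks.
BLOCKED_NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))


def is_blocked(ip: str) -> bool:
    """Return True when the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print([str(net) for net in BLOCKED_NETWORKS])
    for ip in ("82.80.249.195", "82.80.252.130", "192.0.2.1"):
        print(ip, is_blocked(ip))
```

The whole range collapses to a single /21, which is the form most firewalls and server configs want anyway.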
UPDATE:
After posting this the fine people from PicScout visited the blog and revealed more information about their facilities.
The log showed this visit:
Host Name mail.picscout.com
IP Address 62.0.8.2
Country Israel
ISP Nv-picscout
The information I found from that, including another IP block, is here:
inetnum: 62.0.8.0 - 62.0.8.255
netname: NV-PICSCOUT
descr: NV-PICSCOUT
country: IL
admin-c: OG570-RIPE
tech-c: NN105-RIPE
status: ASSIGNED PA
mnt-by: NV-MNT-RIPE
mnt-lower: NV-MNT-RIPE
source: RIPE # Filtered
So, there's a few more IPs you might want to block, but I doubt they're scanning from the office.
UPDATE: Caught Getty keeping an eye on everyone today.
My blog log showed this:
Time: 12th June 2007 12:24:53 PM
Host Name outbound.gettyimages.com
IP Address 206.28.72.1
Country United States
Region Washington
City Seattle
ISP Getty Images
Referrer: http://www.webproworld.com/graphics-design-discussion-forum/56384-invoiced-getty-images-unlawful-use-images.html
It appears they were snooping on WebProWorld and followed the link here. The user agent claimed to be MSIE 6.0 but it's possibly an automated crawler, hard to say.
Anyway, we're watching you watch us, it works both ways.
Posted by IncrediBILL at 11/04/2006 05:13:00 PM (35 comments)
Monday, October 30, 2006
Net::Trackback Rocks D-Block
Why is it every time someone puts some code out on the net like Net::Trackback that some asshole will download it and then aim their new creation at my server?
This is where they attempted to hammer my server this morning:
209.9.169.66 [209-9-169-66.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.78 [209-9-169-78.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.67 [209-9-169-67.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.70 [209-9-169-70.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.71 [209-9-169-71.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.69 [209-9-169-69.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.68 [209-9-169-68.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.72 [209-9-169-72.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.73 [209-9-169-73.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.75 [209-9-169-75.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.74 [209-9-169-74.sdsl.cais.net.] "Net::Trackback/1.01"
Of course they got nothing but error messages for their troubles, but this is still.... BULLSHIT!
Can't even research the source as ARIN.NET's website won't load at this moment and CAIS.NET never responds to WHOIS inquiries and just hangs like this:
[Querying whois.arin.net]
[Redirected to rwhois.cais.net:4321]
[Querying rwhois.cais.net]
Never got a response...
Bunch of BULLSHIT, that's what this is!
Posted by IncrediBILL at 10/30/2006 09:21:00 AM (4 comments)