How often do we see that happy line of horse shit spread by every new startup that crawls the web about how minimal its impact will be?
Every fucking one of them claims it, but when you add them all together the bot traffic is quickly exceeding the human traffic.
Who the fuck am I kidding, on most sites the bots clearly outnumber the humans in pages read on a daily basis.
First we put the big search engines on top of the heap with Google, Yahoo and MSN crawling the crap out of your servers daily. Just the three of these guys can easily read as many pages as 10K visitors a day. Then throw in the wannabe search engines like Ask, Gigablast, Snap, Fast, etc. ad nauseam and it's over the top.
Now expand that list to include the international search engines like Baidu, Sogou, Orange's ViolaBot, Majestic12, Yodao, and on and on, tons of 'em.
Then we have all the spybots that feel entitled to crawl your site like Picscout, Cyveillance, Monitor110, Picmole, RTGI, and on and on.
Next add up all the specialty niche bots like Become, Pronto, OptionCarriere, ShopWiki, and all sorts of shit too numerous to mention.
Pile on top of this all the free fucking tools that every little shithead and make believe company uses to scrounge the 'net for god knows what, and god's not telling, like Nutch and Heritrix, plus the web downloaders, offline readers, and more.
Don't forget, many of these so-called search engines and shit now want screen shots as well so after they crawl your page they send a copy of Firefox or something to your site to download every page again plus every fucking image, never cached, over and over and over.
Did I forget to mention directories?
They'll want to link check you and get screen shots as well, don't leave them out or they'll feel fucking neglected.
Wait, there's more, those social sites like Eurekster, Jeteye, etc. that let people link to your shit and then come back banging on your site all the time to make sure that shit's still valid.
Then add up all the RSS feed readers and aggregators that pull down your RSS feeds that nobody ever fucking reads. Not to mention the RSS feed finders like IEAutodiscovery that run amok on your site just looking for RSS feeds ... FUCK!
If you run affiliate programs you have CJ quality bot or some shit hitting your site and if you run ads then the Google quality bot, it's always something.
Don't forget the assholes running the dark underbelly of the web with all the scrapers, spam harvesters, forum, blog and wiki spammers, botnets and other malicious shit pounding on our sites daily.
Add on top of all this shit Firefox, Google Web Accelerator and now AVG's toolbar all pre-fetching pages that will most likely never be read and holy shit, we're being swamped!
OK, now that we've identified all this bot traffic, where's all the fucking people?
Of course you think all those hits from MSIE and Firefox are people, right?
Hell no!
Are you out of your fucking mind?
Those hits are the scrapers, screen shot makers and companies like Cyveillance and Picscout that don't want you to stop them from crawling your site so they just pretend to be humans to get past the bot blockers.
Well guess what?
There are no fucking people on your site. The internet is now run for and used exclusively by bots.
Apparently you missed the memo.
Monday, May 12, 2008
Impact On Your Bandwidth Will Be Minimal My Ass
Posted by IncrediBILL at 5/12/2008 03:24:00 PM (5 comments)
Comparing Effectiveness of Anti-Virus Web Protection Methods
There are three basic methods being used at the moment to protect web surfers from potential dangers: static (stale), active and passive.
Static Web Protection
Various companies use the static method, which relies on crawling the web in advance to find vulnerabilities and then attempting to warn visitors about these problems as they are about to visit a web site. McAfee's SiteAdvisor and Google both take this approach, and it's obviously only as good as the last scan, since the malware can easily be cloaked and hidden from these somewhat obvious crawlers. Besides being easily fooled by cloaking, the data is always stale, meaning sites that were good even 10 minutes ago could now be infested with malware and sites previously infested could have been cleaned.
This method isn't optimal for anyone, and it can be a nightmare for websites tagged as bad to get off the warning list, assuming they ever find out they're on it in the first place, as their business goes down in flames from traffic going elsewhere.
Active Web Protection
The latest AVG 8 includes a Link Scanner and AVG Search-Shield, which aggressively check pages in Google search results that you're about to visit in real time to help protect the surfer. Unfortunately, AVG made several mistakes, some that could be deemed fatal flaws, which allow this technology to be easily identified so that malware and phishing sites can cloak to avoid AVG's detection. Even worse for webmasters, AVG pre-fetches pages in search results, and as adoption of this latest AVG toolbar increases it is quickly turning into a potential DoS attack on popular sites that show up at the top of Google's most popular searches.
While I think AVG's intentions were good, their current implementation easily identifies every customer using their product and causes webmasters needless bandwidth issues that could be easily resolved on their part with a cache server.
Passive Web Protection
The method used by Avast's Anti-Virus is a transparent HTTP proxy, meaning that all of your HTTP requests pass through an invisible intermediate proxy service that scans for potential problems in the data stream in real time. The data is always fresh, checked in real time, the user agent doesn't change and there are no pre-fetches or needless redundant hits on websites.
The only downside is you don't know the site is bad in advance but that can easily be the case with static protection due to stale data and/or cloaking and active protection due to cloaking.
The Best of All
While the three approaches all have their potential problems it appears a combination of all three is probably the best approach.
Bad Site Database:
The SiteAdvisor/Google type database approach is good for logging all known bad sites so they don't get a second chance to fool the other methods with cloaking once they are caught. This cuts down on redundantly checking a known bad site until the webmaster cleans it up and requests a review to clear the site's bad name.
Perhaps the Bad Site database concept needs to become a non-profit dot org so that all of the anti-virus companies can freely feed and use this database without all the corporate walls built up around the ownership of the data for the greater good, something like a SpamHaus type of thing or perhaps merged into SpamHaus.
Optimized Pre-Screening:
The AVG approach of pre-screening a site could be optimized by fixing the toolbar's user agent so it's not detectable and using a shared cache server to avoid behaving like a DoS attack on popular websites. The beauty is that the collective mind of all these toolbars with an undetectable user agent avoids the cloaking used to thwart detection associated with known crawlers. If the toolbar fed the results of these bad sites to the Bad Site Database, then there's a win-win for everyone.
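The shared cache part is simple enough to sketch. Here's a rough Python version, assuming a hypothetical scan_page callable that does the actual fetch-and-check and a made-up 10 minute freshness window. None of these names come from AVG, they're purely for illustration.

```python
import time


class VerdictCache:
    """Shared cache of scan verdicts so each URL is fetched once per TTL window.

    Hypothetical sketch: scan_page, the TTL value and the verdict strings are
    assumptions for illustration, not anything AVG actually ships.
    """

    def __init__(self, scan_page, ttl_seconds=600, clock=time.monotonic):
        self._scan_page = scan_page   # callable: url -> verdict string
        self._ttl = ttl_seconds
        self._clock = clock           # injectable so the TTL can be tested
        self._entries = {}            # url -> (verdict, time stored)

    def check(self, url):
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]          # fresh enough: no hit on the website
        verdict = self._scan_page(url)  # one real fetch covers every toolbar user
        self._entries[url] = (verdict, now)
        return verdict
```

With something like this sitting between the toolbars and the web, a thousand users running the same popular search would cost the top result one fetch per window instead of a thousand.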
Transparent Screening:
The final approach, the HTTP proxy screening used by Avast, should still be performed so that any site that manages to stay out of the bad site database and still eludes the active pre-screening of pages would hopefully get snared as the page loads into the machine.
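Stacked together, the three layers amount to a simple ordered check. A minimal Python sketch of that ordering follows. The function name, the return strings and all three arguments are hypothetical stand-ins, not anyone's real product API.

```python
from urllib.parse import urlsplit


def check_url(url, bad_sites, prescreen, scan_stream):
    """Run the three layers in order and report which one caught the site.

    All three arguments are hypothetical stand-ins:
      bad_sites   - set of hostnames already known to be bad
      prescreen   - callable(url) -> True if the pre-fetch looks malicious
      scan_stream - callable(url) -> True if the proxy sees malware on load
    """
    host = urlsplit(url).hostname or ""
    if host in bad_sites:
        return "blocked: bad site database"
    if prescreen(url):
        bad_sites.add(host)  # feed the result back to the bad site database
        return "blocked: pre-screening"
    if scan_stream(url):
        return "blocked: transparent proxy"
    return "allowed"
```

The cheap lookup runs first, so a site that has already been caught never gets another chance to cloak its way past the later layers.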
Summary
Even with all of this security piled up the chance of failure still exists, but the end user is protected from, and informed about, the threats present as much as humanly possible.
It would certainly be nice to see some of the anti-virus providers combine their efforts as outlined above to make the internet a safer place to visit.
Posted by IncrediBILL at 5/12/2008 11:46:00 AM (1 comment)
Sunday, April 27, 2008
Off By More Than One
Can you believe that someone is actually surfing the web using some free browser called Off By One that doesn't appear to have been updated in the last 2 years?
The user agent is as follows:
"Mozilla/6.0(compatible;OffByOne;Windows 2000)"
The irregular formatting convention triggered the bot trap with the lack of spaces alone.
Then it claims to be Mozilla 6.0 when it's probably Mozilla 3.0 at best.
Considering how few times, if ever, this browser has visited, it's obviously very rare.
Maybe some online nerd activist will get it declared as an endangered online species so it will become protected by law.
Don't laugh, you know it'll happen eventually...
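For the curious, the two things that looked wrong with that user agent (the missing spaces and the inflated Mozilla version) are easy to test for. This is only a sketch of the idea in Python, not the actual bot trap rules. The function name and the version cutoff of 5 are assumptions.

```python
import re

# Assumption: anything claiming a Mozilla major version above 5 is lying.
MAX_MOZILLA_MAJOR = 5


def user_agent_red_flags(ua):
    """Return the reasons (possibly none) a user agent string looks off."""
    flags = []
    if re.search(r"\S\(", ua):
        flags.append("no space before '('")
    if re.search(r";\S", ua):
        flags.append("no space after ';'")
    version = re.match(r"Mozilla/(\d+)\.", ua)
    if version and int(version.group(1)) > MAX_MOZILLA_MAJOR:
        flags.append("implausible Mozilla version")
    return flags
```

A normally formatted string like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" comes back with no flags, while the Off By One string trips all three.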
Posted by IncrediBILL at 4/27/2008 01:42:00 PM (5 comments)
Sunday, April 20, 2008
Reciprocal Link Exchange? Let's Swap!
For years I've been deleting all those emails asking me to exchange links and I won't swap links with any of that crap.
Suddenly I've had an epiphany and YES!, now I'll swap links with you, no problem!
I'm only agreeing to swap links as requested.
I'm not using NOFOLLOW on those links as requested.
You can see my links when you visit, online and visible as agreed.
Unfortunately my link swapping page will never be seen by Google, Yahoo, MSN or any other search engine but you'll see it just fine.
I'm going to hold up my end of the bargain, we swapped links, how about you?
Posted by IncrediBILL at 4/20/2008 03:32:00 PM (5 comments)
Kaushik, What Freaking Experiments?
I found this user agent coming out of Microsoft's Area 131 requesting that people "contact kaushik for these experiments" that kept hitting one of my servers.
131.107.0.96 "contact kaushik for these experiments"
So I did a little data mining of my own and searched Microsoft and couldn't decide if this experiment was from Kaushik #1 or Kaushik #2.
Both Kaushiks appear to be working for the Data Management, Exploration and Mining Group (DMX) at Microsoft, but which one ran this experiment?
OK, will the real Kaushik running these experiments please stand up?
BTW, was your experiment finding sites running bot blockers?
If so, you succeeded and your requests were stopped. ;)
Posted by IncrediBILL at 4/20/2008 02:24:00 PM (0 comments)
DNS Right But User Agent Wrong
Ran into a user agent from DNSRight today that claimed to be some link check tool that doesn't appear on their site.
66.240.236.220 "GET / "
"http://www.dnsright.com/" "DNSRight.com WebBot Link Ckeck Tool. Report abuse to: dnsr@dnsright.com"
So I ran some of their other tools that don't identify themselves at all.
66.240.236.220 "GET / HTTP/1.1" "-" "-"
They host this mess at cari.net so just block 'em.
OrgName: California Regional Intranet, Inc.
NetRange: 66.240.192.0 - 66.240.255.255
CIDR: 66.240.192.0/18
No more DNS Right or Left, it's now DNS Gone.
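If you'd rather do the blocking in code than in a firewall rule, the range check is a few lines with Python's standard ipaddress module. A minimal sketch follows. The list and function names are made up, and the only network in it is the CIDR from the whois output.

```python
import ipaddress

# The one range from the whois output; add more networks as needed.
BLOCKED_NETWORKS = [ipaddress.ip_network("66.240.192.0/18")]


def is_blocked(ip):
    """True if the client IP falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Calling is_blocked("66.240.236.220") returns True, since the /18 covers everything from 66.240.192.0 through 66.240.255.255.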
Posted by IncrediBILL at 4/20/2008 02:11:00 PM (0 comments)
Thursday, April 17, 2008
Picmole, Yet Another Spybot!
There must be good money spying on everyone because it seems a new company springs up almost weekly trying to claim their stake in this new gold rush.
How many fucking spybots do we need?
Today on the spybot circuit we're serving up a helping of Picmole that's using Heritrix to do its crawling. Surprisingly it still checks robots.txt but who knows if they'll honor it down the road because honoring robots.txt conflicts with accomplishing their stated goals.
Identifying their spider properly and crawling from easily identifiable IPs will also present them problems as their activities increase but being a new service they'll soon figure that out and probably go stealth like all the rest.
208.109.189.127 [ip-208-109-189-127.ip.secureserver.net.] requested 1 pages as "Mozilla/5.0 (compatible; heritrix/1.12.0 +http://www.picmole.com)"
Sorry, but your bot hit a firewall on your first attempt.
Abort, Retry, Ignore?
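As long as a bot is polite enough to name its crawler toolkit, spotting it takes almost no effort. Here's a hedged Python sketch of that kind of check. The blocklist, the function name and the URL regex are my own assumptions for illustration, not the actual firewall rules.

```python
import re

# Assumed blocklist: crawler toolkits mentioned in these posts.
BLOCKED_TOKENS = ("heritrix", "nutch")


def crawler_info(ua):
    """Return (token, contact_url) if the user agent names a blocked toolkit.

    Returns None when no blocked token is present; contact_url is None when
    the user agent carries no http(s) URL.
    """
    lowered = ua.lower()
    for token in BLOCKED_TOKENS:
        if token in lowered:
            match = re.search(r"https?://[^\s)]+", ua)
            return token, (match.group(0) if match else None)
    return None
```

Fed the Picmole string from the log line, this returns the toolkit token "heritrix" along with the advertised URL, which is exactly why stealth crawlers stop sending either one.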
Posted by IncrediBILL at 4/17/2008 03:59:00 PM (1 comment)
Favcollector Bandwidth Waster
Here's another product of Canada doing the stupidest shit ever seen, collecting favicons.
It came and grabbed my icon, then hit the home page which the bot blocker promptly stopped, so who knows what else it would've done beyond that.
66.207.217.138 [gaspra.crazylogic.net.] "Favcollector/2.0 (info@favcollector.com http://www.favcollector.com/)"
From their FAQ:
Favcollector is a spider that searches the internet for favicons. It downloads and stores these favicons for each site it visits. It will go back once a month to see if the favicon has changed and will download the new icon if it is has, effictivly creating an archive of all favicons on the internet.
Spider?
Spider my ass...
Spiders ask for robots.txt files, read them, and go away.
Not this piece of shit as it just comes and it takes what it wants without regard to the webmaster's wishes.
Not only that, a bunch of trademarked icons are now on their site without permission which will most likely make some crazed trademark enforcers start jumping up and down once they find that site.
BTW, run a damn spell checker on your site as the word is effectively, not "effictivly" unless that's the Canadian spelling.
Posted by IncrediBILL at 4/17/2008 10:46:00 AM (0 comments)
Canasasearchbot For Canasians, Oh Canasa!
It's hard to resist commenting on a bot that can't even spell its own name or its country name correctly.
206.248.137.34 [mycanadasearch.ca.] "canasasearchbot(http://www.mycanadasearch.ca/robots.html)"
However they got it right on their robots page:
User-agent: canadasearchbot
It did ask for robots.txt but who knows if it was looking for "canasasearchbot" or "canadasearchbot", total crap shoot.
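To see why the spelling matters, here's a quick Python sketch using the standard library's robots.txt parser. Only the User-agent line comes from their robots page. The Disallow rule and the helper name are made up for the demonstration, and this shows how Python's parser matches names, not necessarily how their bot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Disallow rule is an assumption for the demo.
ROBOTS_TXT = """\
User-agent: canadasearchbot
Disallow: /
"""


def allowed(agent, path="/"):
    """Ask the parser whether this agent name may fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

Here allowed("canadasearchbot") is False but allowed("canasasearchbot") is True, so a webmaster who follows the documented name blocks nothing if the bot matches on the name it actually sends.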
I tried their little search engine and it took it a really long time to come back with some really bad results.
Here's a "search tip", try searching your log file and examine what your crawler is putting in that log file before turning it loose on the world.
Nothing like that fine Canadian quality, eh?
Posted by IncrediBILL at 4/17/2008 10:13:00 AM (1 comment)
Monday, April 14, 2008
Mozshot Tries Taking a Screenshot
Yet another Firefox-based screen shot tool hit my other site today just in time to take a screen shot of an error message telling them they weren't allowed to take screen shots without permission.
Details:
61.206.125.245 [tempest.nemui.org.]
"Mozilla/5.0 (Gecko/20070310 Mozshot/0.0.20070628; http://mozshot.nemui.org/)"
This thing appears to be open source, oh joy...
Posted by IncrediBILL at 4/14/2008 05:46:00 PM (0 comments)
