Saturday, February 25, 2006

Real Impression Tracking in Bot Buster

There was an unexpected benefit to all this bot busting and information recording: I can now track, on a daily basis, all pages served to legitimate bots, blocked bots and actual people. This has allowed me to come up with some simple stats that show a breakdown of where pages are served, and I'm getting real close to matching exactly what you see in Google AdSense for page impressions.

This side effect alone could be an enormous benefit to people that need actual page impressions vs bot page impressions for selling advertising on their website and I've just made it a heck of a lot simpler to get that information since every single page passes thru my code in real time.
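For anyone wanting to try the same kind of breakdown, here is a minimal sketch of a daily tally split by visitor type. The class names (`human`, `allowed_bot`, `blocked_bot`) and the `ImpressionLog` class are made up for illustration; this is not the Bot Buster code, just one way the counting could work, assuming something upstream has already classified each request.

```python
from collections import Counter
from datetime import date

# Hypothetical labels for the three groups named in the post.
CLASSES = ("human", "allowed_bot", "blocked_bot")


class ImpressionLog:
    """Per-day tally of pages served, split by visitor class."""

    def __init__(self):
        self._days = {}  # maps a date to a Counter of visitor classes

    def record(self, day, visitor_class):
        """Count one page view for the given day and visitor class."""
        if visitor_class not in CLASSES:
            raise ValueError(f"unknown visitor class: {visitor_class}")
        self._days.setdefault(day, Counter())[visitor_class] += 1

    def breakdown(self, day):
        """Return the count for every class, including zeros."""
        counts = self._days.get(day, Counter())
        return {c: counts[c] for c in CLASSES}

    def human_share(self, day):
        """Fraction of the day's page views that went to people."""
        counts = self._days.get(day, Counter())
        total = sum(counts.values())
        return counts["human"] / total if total else 0.0


log = ImpressionLog()
today = date(2006, 2, 25)
for visitor in ["human", "human", "human", "allowed_bot", "blocked_bot"]:
    log.record(today, visitor)
print(log.breakdown(today), log.human_share(today))
```

The `human` count is the number you would compare against what an ad network reports for the same day.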

Coolio!

Cleverly Masked Bots Evolving

It would appear that the war over my content has been cranked up a level, as bots masquerading as browsers and modifying their behavior to appear like people seem to be on the rise. There are still some tell-tale signs that are easy to spot when you look at the server log, but a couple of bots that the bot blocker didn't catch are finding ways to game the system.

I didn't want to make the site more difficult for visitors but the only way to stop these guys would appear to be tossing in more random challenges like captchas and such after a pre-determined number of pages. To stop the typical captcha blow-thrus the challenges are very random and nobody could program a way to bypass them all as you don't know what they all are and I can add new ones daily if I wanted.
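A rough sketch of that idea: count pages per visitor and, once a threshold is hit, pick a challenge at random from a pool that can grow at any time. The challenge names, the threshold of 25 and the `ChallengeGate` class are all assumptions for illustration, since the real challenges and page count aren't listed here.

```python
import random

# Hypothetical challenge names and threshold; the real ones are not public.
CHALLENGES = ["captcha_image", "math_question", "click_the_word"]
PAGES_BEFORE_CHALLENGE = 25


class ChallengeGate:
    """Issue a randomly chosen challenge every `threshold` page views."""

    def __init__(self, challenges, threshold, rng=None):
        self.challenges = list(challenges)
        self.threshold = threshold
        self.rng = rng or random.Random()
        self.page_counts = {}  # visitor id -> pages seen since last challenge

    def on_page_view(self, visitor_id):
        """Return a challenge name when one is due, otherwise None."""
        count = self.page_counts.get(visitor_id, 0) + 1
        if count >= self.threshold:
            self.page_counts[visitor_id] = 0  # start counting again
            return self.rng.choice(self.challenges)
        self.page_counts[visitor_id] = count
        return None

    def add_challenge(self, name):
        """Grow the pool so a scraper can't hard-code every answer."""
        self.challenges.append(name)


gate = ChallengeGate(CHALLENGES, PAGES_BEFORE_CHALLENGE)
gate.add_challenge("pick_the_odd_picture")  # another made-up challenge
```

A scraper that learns to beat one challenge still has to get lucky on the draw, and the pool can change under it from one day to the next.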

There's also something I noticed which isn't earth shattering but only humans seem to use my javascript menu which is a HUGE tell. Robots navigate the text links only but humans love those drop down lists and that's a clear sign that differentiates the two of them most of the time.
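As a sketch of how that tell could be used, assume the menu links carry some marker (a query parameter, say) so the server can count menu clicks separately from text-link clicks. The `Session` fields and the 20-page cutoff below are hypothetical, and since the signal only holds most of the time, it flags sessions for a closer look rather than blocking them outright.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Click counts for one visitor (hypothetical fields)."""
    text_link_clicks: int = 0
    js_menu_clicks: int = 0


def looks_human(session):
    """Any use of the JavaScript menu counts as a strong human signal."""
    return session.js_menu_clicks > 0


def needs_closer_look(session, min_pages=20):
    """Flag long sessions that never touched the JavaScript menu."""
    pages = session.text_link_clicks + session.js_menu_clicks
    return pages >= min_pages and session.js_menu_clicks == 0


print(looks_human(Session(text_link_clicks=3, js_menu_clicks=1)))
print(needs_closer_look(Session(text_link_clicks=40)))
```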

At the end of the day, it's just like trying to secure money in a bank: no matter how hard you try, someone is going to rob you eventually. The best you can hope for is to get robbed as few times as possible without pissing off all your customers in the process.

Thursday, February 23, 2006

Fuck Your Intellectual Property

Some asshats claiming to "defend your brand" sent their little AIPBOT to crawl my pages looking for anything of their clients on my site. Listen up fucknuts, you can use my SEARCH tool and look for something being on my site but you can kiss my ass when it comes to a 40K page crawl just to see if I'm violating someone's precious brand name.

This sense of entitlement of everyone to crawl the web is really starting to piss me off.

Take a hike assholes.

Link Me Or Else!

Here we go again with a persistent raging linkaholic badgering the shit out of me to link to his crappy little directory website so he can get a few AdSense clicks.

Hi YouBigWebStudYou,

I sent you a link request for bullshit-directory.com to see
if you would be interested in exchanging links.

I realize you are probably but wanted to let you know that I will be
removing your link next Wednesday if I don't hear back from you.

You can verify your link is by going to:
http://realfuckingannoying.com

with the following details
Title- Link To Me Please
URL-www.imbeggingyou.com
Description-I'm the biggest pain in the ass link-to-my-site whining spammer you've ever seen so link to me now before I beg and plead more.

If you do add my site please use the below information and let me know
the location you added it so I don't remove the link to yours
unknowingly.

I hope to hear from you before 2nd March 2006 but if not then I'm bound
to remove the link from my site.

Well fucking remove me already and stop sending me this shit!

Didn't the dead silence after your first spam give you a fucking clue?

If I could get my hands on you they'd have a new opening homicide scene for CSI next week so just keep it up, your luck is about to run out.

Oh yeah, it's a good thing you're in India so CAN-SPAM can't be used against you and you aren't registered with GoDaddy so they can't blackmail you to get your domain back. However, if there is a god, one of those nasty little bugs in your water will give you atomic diarrhea and you'll shit out a vital organ and die.

Wednesday, February 22, 2006

Search Engines Let Scrapers Bypass Spider Traps!

Just when you thought you'd seen it all, the actual search engines themselves can be used by scrapers to bypass spider traps. The scrapers find all of the indexed page names from your site in Google or Yahoo and then download the pages using those known page names, thus side-stepping spider traps as they aren't actually spidering your site at all.

Therefore, just eliminating your pages from being CACHED in the search engines doesn't stop scrapers from still using the remaining data to their advantage.

Some days it just doesn't pay to get out of bed.

Tuesday, February 21, 2006

Another Plug-n-Scrape Component

Yet another toolkit letting armchair programmers attempt to grab my web pages.

Yawn.

This one IDs itself as:

IP*Works! V5 HTTP/S Component - by /n software - www.nsoftware.com

And their web site claims:

The HTTP component can be used to retrieve documents from the World Wide Web.

Might want to revise that to "used to be able to retrieve documents" as it went splat against my brick wall, but I found its calling card in my auto-blocked bot log.

Chitty Content

Must be the new wave of affiliate bots as I also got hit today by the Chitika ContentHit crawler or whatever the heck it is.

No way to verify this bot as reverse DNS for this IP address just claimed to be from Charter Communications.

71.10.233.52 Chitika ContentHit 1.0

This new bot of the day chit's just getting old.
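For reference, the usual way to verify a bot is a forward-confirmed reverse DNS check: look up the hostname for the IP, make sure it sits under a domain the bot's owner controls, then resolve that hostname and confirm it points back at the same IP. Below is a minimal sketch using Python's standard `socket` module. The function name `verify_bot_ip` and the injectable `reverse`/`forward` lookups are my own assumptions, and so is any domain suffix you pass in, since I don't know what domain a genuine Chitika crawler would resolve to.

```python
import socket


def verify_bot_ip(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check.

    `reverse` and `forward` default to real DNS lookups; they are
    parameters only so the logic can be exercised without a network.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:  # no PTR record or lookup failure
        return False

    # The hostname must be the allowed domain itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        # The hostname must resolve back to the IP we started with.
        return ip in forward(host)
    except OSError:
        return False
```

An IP whose reverse DNS lands on a consumer ISP fails the suffix test, which is the "no way to verify" outcome described above.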

Cell Phones and PDAs Can Piss Off

All these damn cell phones and PDAs have unique user agent strings, and for the last few months the handful that hit my website have all been told to piss off.

You people making cell phones and PDAs better wake up and smell the coffee as I'll be damned if I whitelist a bazillion user agents just to let your pissy products see 5 lines of my web site.

You all better come up with some better ideas for cell phone user agents as this unique name per phone shit isn't gonna fly.

CJ Quality Bot

Well here's a new one that I've never seen before from our friends at Commission Junction.

216.34.209.23 CJNetworkQuality; http://www.cj.com/networkquality
Unfortunately they bounced off the walls. Think I should let them in?

They might delist my site if I don't but based on the revenues I earned with them last month it's kind of a why bother IMO.

Fine, time to whitelist CJ, sigh.
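A whitelist entry for a bot like this could pair the user agent token with the address it came from, since a user agent string alone is trivial to fake. The sketch below uses the IP and user agent from the log line above; the `WhitelistEntry` structure is hypothetical, and pinning it to that single address is an assumption, as I have no idea what other addresses CJ crawls from.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class WhitelistEntry:
    ua_token: str  # substring expected in the user agent
    network: str   # CIDR block the bot is allowed to come from


def is_whitelisted(ip, user_agent, entries):
    """Allow a request only if both the user agent and the IP match an entry."""
    addr = ipaddress.ip_address(ip)
    ua = user_agent.lower()
    return any(
        e.ua_token.lower() in ua and addr in ipaddress.ip_network(e.network)
        for e in entries
    )


# Single address taken from the log line; a wider block is unknown.
ENTRIES = [WhitelistEntry("CJNetworkQuality", "216.34.209.23/32")]

print(is_whitelisted(
    "216.34.209.23",
    "CJNetworkQuality; http://www.cj.com/networkquality",
    ENTRIES,
))
```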