Saturday, January 21, 2006

We Want Your Ad Space

The next deluge of emails after the link swapping morons are those greedy little make believe media companies that want my ad space.

Must get 10-20 of these a week and they all want to sell my link space.

Let's examine the potential here as they claim they'll only take 25% or 50% to "sell my space" depending on which bullshit artist company sends the email.

If they really looked at my site they would notice I also sell ads direct and take 100% of the revenue plus some AdSense which is about 60%.

So do they really think they can compete with my direct sales plus Google AdSense?

Both pay many thousands a month already and both are pretty reliable as far as getting paid, especially my direct ad sales, so what's in it for me?

On the rare occasion that I've had these types on the phone I ask two questions:

  1. Do you have significant ads in my industry or 100% targeting to my topic?
  2. Can you guarantee that you can replace my ad income at my current levels with your advertisers?

So far the answer has always been NO on both counts so why do they persist?

Idiots.

Hi! I'm Looking for Link Partners!

Those emails always start the same with minor variations:

My name is BLAH.
I work for http://www.blah.com/
I am looking for link partners whose sites would be of benefit to our visitors.
Your site would be an excellent fit.

Sorry pal, but you obviously didn't visit my site.

My site is about WIDGETS and not web hosting, pills to make your pecker hard, weight loss, real estate or any other damn bullshit you're selling.

You know what would be an excellent fit?

My foot up your ass.

Now go away boy, you're bothering me.

Scraping Down, Ad Revenue Up!

Somewhere in the past I rambled about my revenues getting stomped when greedy crawlers and thieving high speed scrapers hammered the crap out of my servers, locking out legitimate visitors for minutes or hours during what could be considered a DoS attack based on the speed of requesting 40K pages.

Well here's a big shock, now that I've effectively stopped their asses cold my ad revenues have returned to the levels they were at about 3 months ago, before the crawling became a near epidemic.

Makes you wonder if some of my listings were suffering in the Big 3 search engines because of this, as my review of several log files showed that sometimes all 3 search engines were attempting to crawl at the same time some of the big time scrapers were tying up my server. That's when I suspected the Big 3 SEs timed out on those pages and lowered the listings in the SERPs, like they did when I recently caused a problem, and they've taken this long to get back where they were.

You know it was getting real bad when one night I get a wake-up call at 2am by my sister-in-law, who has sites on my server, calling to complain it went down 20 minutes ago while she was working on her web site and slightly afraid she did it. Pings to the server and various services confirmed it was up and running but unable to respond to data requests as it was just tied up, completely overloaded, and wouldn't even give me a prompt in SSH so I could kill the tasks and block the intruder. However, on that night I let it ride as I hate forced reboots that can crash a database (no sleep then) and sure enough it came back to normal an hour later as I expected.

Anyway, just thought someone that's been reading this saga might find this interesting as I certainly don't think it's a coincidence the ad revenue is bouncing back at the same time the crawlers have been beaten to a pulp.

Friday, January 20, 2006

PageBites Job/Resume Scraper

Well guess who stepped into my spider trap today but yet another robot from yet another start-up aggregator site called PageBites that thinks they have entitlement to make money sucking my bandwidth.

Like I've been telling you all for a long time now, you need to raise your shields to OPT-OUT your site to all crawlers and whitelist just the ones you want to stop this pandemic crawling of your sites. They profit off your backs, gobble up bandwidth costing more money, then muck up your web stats to the point you can't tell your advertisers how many real impressions they really get as crawlers are becoming a pretty decent percentage of so-called visitors on a daily basis.

Search engine spam, email spam, blog spam, now crawler spam.

It's really starting to get to the point that I think turning my spider trap into a product so everyone can take control of their own sites is looking more viable by the day.

You can all take your robots and go home, party over, your ass never got past the home page.

Firefox Scraping Giggle of the Day

Just had an amusing scrape attempt from something that claims to be Firefox with HTTP_REFERER set properly as it crawled and everything but it attempted to rip 299 pages in 211 seconds which set off the alarms instantly and automatically shut them down after only a fraction of the page requests at that speed.

Maybe there's a Firefox plugin that did this and if that's the case it's just getting Firefox users blocked but I don't care as this kind of behavior isn't welcome.

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; xxxx) Gecko/xxxxx Firefox/1.0.7

I was amused watching it happen in real time but you lamo scrapers didn't really think it would work did you?

What my scraper blocker used to do was just stop counting page requests after it hit a specific threshold and temporarily disabled the scraper to stop them. However, my latest modification keeps counting continued attempts so after the initial threshold trigger blocks them the page counter just keeps going to see how many pages they really wanted. Eventually the scraper will set off a second level threshold trigger that gives them a permanent ban automatically if the total page requests are too extreme.
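For anyone trying to picture the two-level trigger, here's a rough sketch in Python. It is not my actual blocker: the class name, the limits and the return values are all made up for illustration, and the expiry of the temporary block (resetting counts per time window) is left out to keep it short.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
SOFT_LIMIT = 50    # page requests before the temporary block kicks in
HARD_LIMIT = 200   # total page requests before the permanent ban


class ScraperCounter:
    """Counts page requests per IP and keeps counting after the first block."""

    def __init__(self, soft_limit=SOFT_LIMIT, hard_limit=HARD_LIMIT):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.counts = defaultdict(int)
        self.banned = set()

    def record(self, ip):
        """Record one page request; return 'allow', 'temp_block' or 'ban'."""
        if ip in self.banned:
            return "ban"
        # Keep counting even while blocked, to see how much they really wanted.
        self.counts[ip] += 1
        if self.counts[ip] > self.hard_limit:
            self.banned.add(ip)
            return "ban"
        if self.counts[ip] > self.soft_limit:
            return "temp_block"
        return "allow"
```

The point of the design is that the counter never stops at the first threshold, so a client that keeps hammering away after being blocked eventually crosses the second one on its own.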

Pretty obvious when it just keeps going that there's no human at the controls.

Too funny - got anything better to throw at me?

Boycott Bellsouth

Everyone is going on and on about the Bellsouth drama where the good old boys at BellSouth are trying to charge (blackmail) internet companies a premium for using Bellsouth's backbone.

You can all pontificate about these blowhards at BellSouth all you want but my strategy is more straightforward.

Put your money where your mouth is.

Get rid of DSL and switch to a cable modem, satellite or anything except DSL.

Toss your landline phone in the trash and use VoIP, cell or anything else as long as it isn't BellSouth.

We aren't in BellSouth territory, we're in PacBell country, but we were also sick of all their overpriced services so we virtually freed ourselves of all RBOC services except a bare-bones $25/month landline mostly for incoming calls and emergency backup for internet dialup for those rare times when the cable modem goes offline.

Besides, we have a better phone service package with our cell phone so they can take all their overpriced features and go pound sand.

Basically, if people bail from BellSouth en masse they will get the message loud and clear when those year end multi-million dollar corporate bonus packages the execs are fond of go down the crapper.

This isn't the 80's anymore and pretending to be a monopoly when you really aren't is just stupid.

Thursday, January 19, 2006

So Many Rants, So Little Time

Yesterday I was all riled up on a bazillion topics and instead of ranting all day decided "Fuck It!" and worked on my websites instead - not like you fuckers pay the bills.

Brief synopsis:

Gov't wants Google Searches:
They just want to know what we're searching for, not who's doing it, but as I see it the government is overstepping its boundaries and Google is vying for martyrdom by turning them down. Sorry, but I don't see the church anointing St. Google anytime soon, so if it doesn't contain personal information like IPs and such just turn it over, Google, you link baiting media hounds.

GoDaddy Shuts Down Websites:
Any moron that hosts with GoDaddy and can't abide by the terms of service gets what they fucking deserve so stop whining already, it's getting old.

German Judge Shuts Down Wikipedia.de:
Parents didn't want their son's name in the Wikipedia so they sue and now it's in 10,000 blogs instead just because some Oktoberfest liver transplant candidate masquerading as a German Judge doesn't know shit about the internet - PRICELESS

Last But Not Least:
Don't expect to get laid the rest of the week when your wife overhears you washing your hands at the sink muttering comments about "pussy fingers"

Until tomorrow...

Scrapers Don't Like Being Blocked

The last week has been getting more interesting as my banned scraper log file shows some rather interesting trends as they are squirming and thrashing trying to get around all the traps.

The most amusing is the ever changing user agent strings as they are definitely testing to see if I'm filtering based on specific user agent criteria and mostly they are right as everything is banned except http clients.

All of the legitimate search engines are being permitted based on their range of whitelisted IPs so trying to pretend to be Google, Teoma, Slurp, etc. will just instantly ban their IP for the day and repeated attempts might ban it permanently.
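A minimal sketch of that check, assuming a table of whitelisted IP ranges per crawler name. The ranges below are documentation-only placeholder networks, not real crawler addresses, and `classify` is a hypothetical name, not my actual code.

```python
import ipaddress

# Hypothetical whitelist: placeholder documentation ranges, NOT real crawler IPs.
WHITELISTED_RANGES = {
    "googlebot": [ipaddress.ip_network("192.0.2.0/24")],
    "slurp": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify(ip, user_agent):
    """Return 'allow' for a whitelisted crawler, 'ban' for a spoofed one,
    or 'unknown' when the user agent does not claim to be a known crawler."""
    addr = ipaddress.ip_address(ip)
    ua = user_agent.lower()
    for name, networks in WHITELISTED_RANGES.items():
        if name in ua:
            # Claims to be a search engine: the IP has to back that claim up.
            if any(addr in net for net in networks):
                return "allow"
            return "ban"
    return "unknown"
```

Anything that comes back "unknown" would still have to get past the other traps; this only covers the pretend-to-be-Google case.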

Almost as much fun as shooting fish in a barrel.

Wednesday, January 18, 2006

Courts Give Wendy's Chili Hoaxers the Finger

Sometimes you have to wonder about jurors as it's OK to kill your wife if you're O.J. or molest children if you're M.J., but mess with Wendy's Chili and your ass goes to jail for 9 years.

Guess people have their priorities.

Competitor Jumped the Shark

Nothing makes your morning like waking up to find an email from your competitor's latest mass mailing explaining how he's working on his web site and all these improvements, and fear sinks into your gut that you're about to be destroyed by something awesome.

You click the link with dread expecting to see COMPETITION 2.0 and as luck would have it you see HILARIOUS 2.0 instead.

I swear on a stack of religious mumbo jumbo that this guy used to have a site I considered a threat and now it looks more like some high school kid is doing his web design and things are broken all over the place.

Either he's thrashing trying to get some juice out of his site or he's lost his mind and it's about to go down in flames but either way it looks like a win-win for me based on what I'm seeing.

Google Analytics Accuracy Bullshit Challenge

Have you played with Google Analytics?

Has the happy horseshit syndrome settled in yet?

My web site shows the following stats:

1 direct access
2 http://domain1.com
3 http://www.google.com/search
4 http://domain2.com
5 http://search.yahoo.com/search
6 http://domain3.com
7 http://domain4.com
8 http://domain5.com
9 http://domain6.com
10 http://domain7.com

However, Google Analytics shows:

1 google
2 yahoo
3 (direct)
4 msn
5 aol
6 http://ga-domain1.com
7 http://ga-domain2.com
8 ask
9 aolsearch.aol.com
10 search

Best I can determine is Google combines all Google sources such as Google.com, Google.ca, Google.co.uk, etc. which makes it look more dominant as a single source but overall buries the actual weight of the individual Google sites in one big ass murky pile of BULLSHIT!
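To show how merging sources can reshuffle a referrer ranking, here's a small sketch. This is my guess at the effect, not how Google Analytics actually works; the function name and the crude ".google." hostname match are assumptions for illustration only.

```python
from collections import Counter
from urllib.parse import urlparse


def referrer_counts(referrers, merge_google=False):
    """Count referrers per hostname; optionally lump all Google hosts together."""
    counts = Counter()
    for ref in referrers:
        host = urlparse(ref).hostname or "(direct)"
        # Crude match for www.google.com, www.google.ca, google.co.uk, etc.
        if merge_google and ".google." in "." + host:
            host = "google"
        counts[host] += 1
    return counts


# Made-up sample log: one referring domain beats each Google site individually.
sample = [
    "http://domain1.com/a",
    "http://domain1.com/b",
    "http://www.google.com/search?q=x",
    "http://www.google.ca/search?q=y",
    "http://www.google.co.uk/search?q=z",
]
print(referrer_counts(sample).most_common(1))
print(referrer_counts(sample, merge_google=True).most_common(1))
```

Counted separately, domain1.com is on top of the made-up sample; merged, "google" jumps ahead of it.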

Worse yet is from my actual log files my #1 domain referrer shows as #62 in analytics.

Sorry Google, you can fool some of the people some of the time, and AdWords advertisers most of the time, but here at IncrediBILL's Random Rants we call this BULLSHIT!

Have a nice day.

Tuesday, January 17, 2006

Bot Busting Crawler Experiment Complete

Many weeks and log file combings after The Great Anti-Scrape Off started it's become quite obvious that the effort was an enormous success.

The last bit of technology, deployed a couple of nights ago to challenge robots masquerading as humans, seems to be stopping the last of them, so it would appear that my site is now reasonably safe from typical crawlers and bots.

If someone has access to 10,000 IP addresses all bets are off but most scraping and crawling operations, except those that appear to be hiding behind AOL, seem to have fairly limited resources.

The last couple of tricks deployed include:

  • Multiple checkpoint profiling to identify bots masquerading as humans
  • Randomized challenge techniques with anti-blow-thru detection so that the typical captcha defeating techniques won't work
  • Adaptive time monitoring for hit and run bots that seem to think they can get small chunks at a time and come back later under the radar for the next chunk

There may be other things going on out there in the wonderful world of scraping but it would take a fairly sophisticated scraper to bust through what's now currently in place.

The technology seems to work fine so far with up to 30K page views a day but it would be interesting to see how it would perform with 100K or 1M page views a day. What might be a bit challenging is the current implementation does quite a bit of database churn but for a medium-sized sampling that's only tracking a few hundred visitors at any time which is fairly insignificant.

In the end my scraper stopper is only protecting my database of content so any crawler can access about 10 pages without question such as the home page, about us, contact us, etc. so it will be painfully obvious to them that they are being blocked from delving deeper into the site.

The benefit to this approach is that all of the hard rules being used to block access to the full content don't break access to other technologies like the RSS feed, which appears to have all sorts of crappy homegrown readers that don't identify themselves. However, when the greedy homegrown readers try to behave like a crawler and step into the site to grab the content linked from the RSS feed they are blocked unless expressly whitelisted.
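Boiled down, the rule looks something like the sketch below. The paths and the function name are hypothetical; the real site has about 10 open pages and decides who counts as whitelisted with everything else described above.

```python
# Hypothetical open paths standing in for the home page, about us,
# contact us and the RSS feed.
OPEN_PATHS = {"/", "/about", "/contact", "/rss.xml"}


def may_fetch(path, is_whitelisted):
    """Open pages are served to anyone; everything deeper needs a whitelist entry."""
    if path in OPEN_PATHS:
        return True
    return is_whitelisted
```

So an anonymous feed reader can pull the feed itself, but following a link from the feed into the content gets it the blocked page.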

The additional benefit to allowing a handful of top level pages to be crawled is that the web site doesn't automatically drop out of view of lesser search engines or up and coming technology, which would happen with a more harsh approach using the .htaccess file blocking all access to any pages. Additionally, there is a nice message on every blocked page letting them know they're probably seeing that page because they are an unauthorized crawler and legitimate crawlers may contact the webmaster and petition for access.

Basically, my web site has become OPT-OUT to any aggregators, crawlers or scraping thieves and now they will need to ask for permission to be let inside and profit from my work. Assuming it's a mutually beneficial proposition then I'm sure I'll let them crawl the site.

Now comes the million dollar question of whether to convert this to PHP and attempt to find a market or just keep it under wraps and much less conspicuous so the scrapers can't study what I've done and find any loopholes in the technology to exploit.

One final thought:

Could you imagine an entire internet that is OPT-OUT from crawlers?

The ability for the next Google to crawl the web to prove their technology would be a severe challenge!

Local Radio is Dead

Now that I've had a Sirius Satellite Radio for a couple of weeks I'm hooked.

Normally when I'm driving around in the car I'm sitting there punching one button after another looking for music in a sea of commercials or looking for some talk radio worth a shit and find nothing. Prior to Sirius my little Zen Micro had become my default music player in the car so at least I could listen to something I wanted to hear and pay attention to my driving instead of the non-stop button pushing.

Heck, now I'm listening to more radio than I have in many years as we just got the home docking station so it's Sirius in the car, home and office, it's on a LOT.

Thanks to Howard for taking us kicking and screaming into a subscription radio service as I'd SIRIUS-ly be missing out on the radio revolution if I hadn't made this switch.

Monday, January 16, 2006

Scraping From AOL Users Possibly Confirmed

There has been past speculation that someone is hiding behind AOL's proxy servers doing scraping and tonight I just happened to catch it live and decided to try something.

The pages were downloading at a slow clip via AOL with a user agent like this:

Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; SV1; .NET CLR)

The minute I blocked them they tried about 10 more URIs with that user agent string and it suddenly changed to:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR ; .NET CLR ; Something I removed)

To me this proves there is someone cloaking under the bank of rotating AOL IP addresses and they just happened to download enough to catch my attention this time and tried a non-AOL browser once I stopped them.

Time to implement a little more sophistication in my bot blocker.

Sorry slick, you aren't.

BUSTED!

Cursing at Scrapers

Just for giggles I went looking to see what idiots were scraping and reposting my random rants and sure enough I found some asshole with a snippet of one of my rants in some made-for-adsense aggregator site and sure enough that page was running PSAs.

Listen up fuckwads, if you're gonna steal my shit then you better filter it to make it "family safe" for AdSense.

Fucking morons.

V7ndotcom Elursrebmem Punted by Google

Doing a few repetitive searches on Google just to see what ads were showing up resulted in this every now and then:

Your search - v7ndotcom elursrebmem - did not match any documents.

Does that mean some servers aren't updated yet with this SEO contest?

Not sure why people are having a contest to get to the top of a defective search engine.

Pay Per Crawl

With the web 2.0 aggregator craze at an all time high maybe it's time for a new business model of Pay-Per-Crawl. That's right, all the start-up leech sites wanting to waste our bandwidth should be sharing some of their VC loot with us just for the privilege.

I'm not talking about ripping anyone a new ass, but perhaps $5 per every thousand pages crawled would help cover my costs of dedicated servers and bandwidth.

Heck, if it wasn't for all the crawlers in the first place my prime site wouldn't even need a dual Xeon server to handle the load so all the bots are definitely running up my expenses so why in the hell shouldn't they share some of that cost?

Oops, they are sharing some of the cost now as I locked them all out so they can't profit from my hard work.

Too bad, so sad.

I'll take a check, money order, VISA or MASTERCARD to let your crawler back in but don't you dare ignore the crawl delay or anything else in my robots.txt or back out you go!

Who's Yer Daddy?

Did anyone notice advertising on the V7ndotcom Elursrebmem SEO contest?

So just to be a me-too I put up one serious ad:

Publishers Earn More
Free report on maximizing your ad
revenue - nothing to purchase
incredibill.blogspot.com

Then I put up one to make them go HUH?

Cat Tales of Horror
Cuddly Pets Terrorize Owners
Laugh, Cry, Run for your Lives
incredibill.blogspot.com

What possesses me to do this shit?

I'm old enough to know better, but spending my own money just to make people go "what the fuck is wrong with him?" may be a little over the top even for me ;)

v7ndotcom Elursrebmem Battle of Yahoo vs MSN

Yahoo leads the pack claiming about 151 results for V7ndotcom Elursrebmem

MSN in a close second returns 117 results containing V7ndotcom Elursrebmem

With sad ass Google returning only 10 results for V7ndotcom Elursrebmem

This is nothing new as I've been reporting Yahoo and MSN regularly tear Google a new ass when it comes to new content ever since I started this blog and Google is still in slow motion. Google is behaving more and more like your slightly slower older cousin with a learning disability.

Blank User Agents

Who in the hell is using blank user agents on the web?

I can't tell if these are SpamBots, Scrapers, Trackback Pings, all of the above but none of them are getting thru so it doesn't matter.

They must naively think if people block per user agent string that if you have no user agent string you'll slip thru the cracks which might've been true until my paradigm shift switched from blacklisting to whitelisting.
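Here's the difference in a few lines, as a sketch only. Both token lists are invented for the example and nowhere near what a real filter would need.

```python
# Hypothetical token lists, for illustration only.
BLACKLIST = ("wget", "libwww", "httrack")
WHITELIST = ("firefox", "msie", "opera", "safari")


def blacklist_allows(user_agent):
    """Blacklisting: let everything through unless it matches a known bad token."""
    ua = user_agent.lower()
    return not any(bad in ua for bad in BLACKLIST)


def whitelist_allows(user_agent):
    """Whitelisting: block everything unless it matches a known good token."""
    ua = user_agent.lower()
    return any(good in ua for good in WHITELIST)
```

A blank user agent matches nothing on either list, so the blacklist waves it through and the whitelist stops it cold.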

You people still blacklisting, and you know who you are, are wasting your time on a no-win scenario that just doesn't work.

Sunday, January 15, 2006

V7ndotcom Elursrebmem

The great race to win the SEO contest for "V7ndotcom Elursrebmem" is on and it's actually surprising to see results so early in the contest.

SEO's get your V7ndotcom Elursrebmem's in early and often and may the best V7ndotcom Elursrebmem placement win!

Yes, this is a lame attempt at V7ndotcom Elursrebmem stuffing.

Blocking Reveals Odd Traffic Hits

Now that I've tightened the noose even tighter on bots and user agents there are some really odd hits showing up in my auto-blocked log file when analyzed for any false positives.

Something very odd has surfaced with a lot of single page requests coming from seemingly random IPs with no identification whatsoever, just the page request, but doing a reverse DNS lookup can be very revealing.
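If you want to try the reverse DNS lookup yourself, a minimal sketch using Python's standard `socket.gethostbyaddr` looks like this. The wrapper name and the `resolver` parameter are my own additions so it can be exercised without touching the network.

```python
import socket


def reverse_dns(ip, resolver=socket.gethostbyaddr):
    """Return the hostname for an IP address, or None if the lookup fails."""
    try:
        # gethostbyaddr returns (hostname, alias list, IP address list).
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname
```

Run it over the IPs in a blocked-requests log and the anonymous single-page hits often stop being anonymous.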

One such revelation was something I've never heard of called Covenant Eyes with an anonymous spider hitting my site. Since Covenant Eyes is a paid service it's too bad, since they aren't going to be collecting any more free information by scanning my site. If you people figure out I'm blocking you and you really want access bad, maybe we can work out some financial terms for accessing my site!