Saturday, December 31, 2005

Slow Scrape Via AOL?

Something alluded to in a previous post, or maybe on someone else's site as it's all becoming one long blur: it appears there is slow crawling running across an entire block of 256 IP addresses.

What seems to be happening is that this scraping is coming from a couple of sources, and one of them is someone using AOL, as the IP resolved to an AOL proxy cache server. The implications are fairly disturbing in that blocking the scraper might also block a bunch of AOLers.

Many aren't aware that AOLers get issued a new IP address while surfing the internet, typically around every 15 minutes, so if a visitor isn't accepting cookies they are quite difficult to track when the IP changes.

It could be multiple visitors, but probably not, given the sequential nature of the pages being accessed and a couple of other factors that won't be mentioned so the scrapers can't fake the behavior being targeted.

So now comes the ultimate question, as this activity is quite obviously scraping, to block or not to block, THAT is the question!

If a block is put in place it would obviously have to be some adaptive technology that analyzes the activity and temporarily blocks only the IPs being used, until the suspicious activity subsides.
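One way such an adaptive, self-expiring block might work is a sliding-window request counter. This is just a minimal sketch; the window size, request limit and ban length below are invented numbers, not anything my tool actually uses:

```python
import time
from collections import defaultdict, deque

# Invented thresholds -- real values would be tuned per site.
WINDOW_SECONDS = 300      # look at the last 5 minutes of requests
MAX_REQUESTS = 120        # more than this in the window looks like scraping
BAN_SECONDS = 3600        # temporary ban, lifted once activity subsides

hits = defaultdict(deque)  # ip -> timestamps of recent requests
banned_until = {}          # ip -> unix time the ban expires

def allow_request(ip, now=None):
    """Return True if the request should be served, False if blocked."""
    now = time.time() if now is None else now
    # Lift expired bans automatically instead of blocking forever.
    if ip in banned_until:
        if now < banned_until[ip]:
            return False
        del banned_until[ip]
    q = hits[ip]
    q.append(now)
    # Slide the window: discard requests older than WINDOW_SECONDS.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        banned_until[ip] = now + BAN_SECONDS
        return False
    return True
```

The point of the expiring ban is exactly the AOL problem above: a legitimate AOLer who later inherits the scraper's proxy IP isn't locked out forever.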

Just what I need, to start out '06 chasing uber-scrapers.

Happy New Year scrapers, it might be your last.

Remember Me?

Don't you love all these web services that have the login with the checkbox "[ ] Remember Me On This Computer" under the password?

Then 30 minutes later you click something on the page and get "Session has timed out, click here to login again" or better yet, you go to their home page which has no clue who you are.

Did someone forget to click the checkbox on their website developer tool to "[ ] Implement Remember Me On This Site" or what?
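For the record, doing it right isn't rocket science: hand the browser a long-lived cookie holding a random token and keep a hash of that token server-side to look the visitor up once the short session dies. A minimal sketch, with the in-memory dict and names purely illustrative:

```python
import hashlib
import secrets

# Illustrative in-memory store; a real site would use its database.
remember_tokens = {}  # sha256(token) -> username

def issue_remember_cookie(username):
    """Create the long-lived token to put in the 'remember me' cookie."""
    token = secrets.token_urlsafe(32)
    # Store only a hash so a leaked table doesn't expose live cookies.
    remember_tokens[hashlib.sha256(token.encode()).hexdigest()] = username
    return token

def user_from_cookie(token):
    """Re-identify the visitor after the short session has expired."""
    return remember_tokens.get(hashlib.sha256(token.encode()).hexdigest())
```

If the services in question did even this much, their home pages would know who you are 30 minutes later.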

This is further proof that the idiots of the world are taking over and the Future 2.0 is not going to be pretty.

War on Arrogant AdBlocking Assholes

Enough is enough already, this means WAR!

It's bad enough we have people disabling Javascript and running Norton Firewall blocking ads but now some Firefox extension lets you just right click on any ad and POOF! it's gone.

Once installed, it's a snap to filter elements at their source-address.
Just right-click: Adblock: done.
The only upside at this writing is that Firefox, Norton Firewall and javascript disabling collectively account for under 10% of all pages being displayed, based on comparing web logs against advertisement impressions on a couple of fairly popular web sites.
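For anyone wondering how that under-10% figure is derived, it's nothing fancier than comparing raw pageviews from the web logs against the impressions the ad network reports, since a blocked ad shows up in the former but not the latter. The numbers below are invented for illustration:

```python
def blocked_share(log_pageviews, ad_impressions):
    """Fraction of logged pageviews where the ad apparently never rendered."""
    if log_pageviews == 0:
        return 0.0
    # Impressions can occasionally exceed pageviews (caching, log gaps),
    # so clamp at zero rather than report a negative block rate.
    return max(0.0, (log_pageviews - ad_impressions) / log_pageviews)

# Invented example: 50,000 logged pageviews, 46,000 impressions reported
# by the ad network -> roughly 8% of pages showed no ad.
share = blocked_share(50_000, 46_000)
```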

Why do they do this? In their own words: "If you're tired of all the intrusive adverts that are increasingly taking over the Internet, Adblock is for you."
Well, I got a better idea: if you're tired of the adverts that pay for the websites, thus allowing people to read the content for free, then stay the fuck off those sites, maybe the entire internet, or better yet, PAY the website directly to access their content!

That's right, you heard me you banner-blocking ad-stopping javascript-disabling assholes:

PAY TO PLAY or get the fuck off my bandwidth!

To counter-attack this ad blocking mentality let's take a page from the porn industry and create a new service like AdultCheck that would allow all the ad blockers access to all the member sites for an annual fee.

It's a perfect solution as the webmaster gets paid, the ad-blocking assholes get to see the content, and a new middleman web service gets rich off their ass.

Watch out blockers, your free ride is almost over.

Thursday, December 29, 2005

Webmasters Protect Your Own Damn Content!

What the hell is wrong with all these cry baby writers and webmasters crying foul that Google should protect their content?

  • Google didn't write your content, nor did MSN or Yahoo
  • Google didn't steal your content, nor did MSN or Yahoo
  • Google isn't a law enforcement agency, nor is MSN or Yahoo
  • Protect your own shit and grow the fuck up!
Putting the blame on Blogger and AdSense is just pandering to the masses. All sorts of blogs not on Blogger steal content as well, and they use all sorts of ways to monetize their sites other than AdSense, but of course if you aren't Google bashing then nobody gives a shit and won't read your flea-bitten bullshit blogs.

Hell, many scrapers aren't even using blogs, they just scrape crap into automatically generated pages on some free web server hosted in BFE (Bum Fuck Egypt) where your complaints will be ignored because your email isn't written in Swahili.

So what's a webmaster to do?
  • File a DMCA (Digital Millennium Copyright Act) complaint to the ISP hosting the site and Google, Yahoo and MSN. Since the search engines are all in the US they have to comply with US law.
  • Install some robot blocker/scraper stopper scripts (see link at page bottom)
  • Fix your damn blog software to make scraping harder; putting a month's worth of shit on one page makes it nearly impossible for scraper stopper scripts to tell someone downloading one big page every now and then from someone pulling many pages.
How do you file a DMCA complaint?

You should read the DMCA law first (pay special attention to the safe harbor provision) as it will help you get it right the first time. There are some specific things you must do, such as provide exact samples of the infringement, URLs, etc., and swear on a stack of religious reading material that you're the original author.
Typically 1-2 days and POOF! they are gone.

They will now be offline, or at a minimum locked out of the 3 search engines which is what they use to drive traffic in the first place. If they are using AdWords (or something similar) for PPC traffic, you may have to take that up with Google's AdWords dept. or whoever they use.

Now, to make sure you are fully covered legally, take your lazy ass to the US Copyright Office website and spend about $30 registering your copyright so you can drag US-based thieving assholes into court and take advantage of what's called "statutory damages", which can hit 6 figures.

TIP: What I do on my primary web site is set a "usage" fee on the policies page, just like photographers do for images, and set the page content licensing fee at $2,000 per page. Whether this holds up in court is another matter but it sure works for photographers with stolen copyrighted images so I'm willing to give it a whirl when someone steals my shit.

Wednesday, December 28, 2005

High Speed Scrapers Steal More Than Pages

The common thinking about scrapers is they just steal your content and make money off your hard work. However, the damages can be worse and more immediate if they overload your web server and stop other traffic from accessing your site while they scrape. Some of them are so aggressive it's practically a DOS (denial of service) attack until they get what they want.

The problem for dynamic, database-driven websites (like mine that I'm trying to protect) is that they tend to need more CPU resources than the normal garden-variety static web site, so a high speed scraper, and sometimes even a regular search engine bot, can easily overload a server's CPU. This can quickly escalate to the point that the web server is queued up with so many page requests that it is unable to respond to new ones and appears "offline" for seconds, minutes or even hours depending on just how aggressive a spider gets with its page requests.

Worst case, it may crash your server altogether under the strain.

The net result is that sites like mine that survive on advertising revenue, such as Google AdSense, suffer total income loss during temporary bot-induced service outages. Visitors that would normally be clicking on ads are sitting there waiting for pages that will never display. Therefore, these bots do more direct damage to your pocket than just wasting bandwidth and stealing content for their own sites, as the monetary losses can be quite immediate and, left unchecked, potentially devastating.

Prior to launching my new spider-trap/scraper blocker there were days when these scrapers impacted site income by as much as 20%-30% in a single day. How this was possible: they were pummeling the server at night while nobody was watching, and significant amounts of high-income-producing traffic from other time zones was lost.

This may be more of an issue for some webmasters than others, but website scrapers, other than legitimate search engines, just need to be put out of business as they provide zero value and do nothing but steal.

Protect your site today with this nifty PHP scraper blocker Alex Kemp has written.

Many of you might find Alex's tool very useful to integrate into your web sites and stop copyright theft and income loss today.

The Great Anti-Scrape-Off

Previous posts have mumbled about my new spider-trap anti-scraper tool that I added to my web site and I must say it's working so well I'm contemplating converting it into PHP so the masses can play with it.

Don't hold your breath as I'm fundamentally lazy.

The features are as follows:

  • Fast crawl auto-block
  • Slow crawl detection and optional blocking
  • Spider trap auto-block
  • Webmaster control panel that shows last 15 minutes of live activity by visitor
  • Manual ban/block from the control panel
  • Allowed spider pass thru with built in passes for Google, Yahoo and MSN
Overall it has put the skids on 99.9% of the scrapers and off-topic bots crawling my site within 2 days of being fully deployed. Previously I was just monitoring what it was doing, or going to do, in order to make sure it wasn't zapping legitimate spiders and visitors; now it seems ready for prime time, so it was set LIVE a couple of days ago with fingers crossed.
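For the curious, the spider trap part is conceptually simple: publish a path that robots.txt forbids, hide a link to it where no human would click, and auto-ban anything that fetches it anyway, while letting the big 3 pass through. A bare-bones sketch with invented paths, and note the user-agent check here is naively trusting (a real tool should verify claimed spiders with a reverse-DNS lookup rather than believe the string):

```python
TRAP_PATH = "/trap/do-not-follow/"   # also listed as Disallow in robots.txt
ALLOWED_BOTS = ("googlebot", "slurp", "msnbot")  # pass-thru for the big 3

banned_ips = set()

def check_visitor(ip, path, user_agent):
    """Return True to serve the page, False to block the visitor."""
    if ip in banned_ips:
        return False
    # Well-behaved spiders honor robots.txt and never reach the trap.
    if path.startswith(TRAP_PATH):
        if not any(bot in user_agent.lower() for bot in ALLOWED_BOTS):
            banned_ips.add(ip)   # hit the trap, no pass -> auto-block
            return False
    return True
```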

New enhancements I'm working on for next week aren't spider related but visitor related. These new 'sensors' will detect additional information per visitor, such as cookies, javascript and banner blocking being used, so I can remove old hacks that do some of this and build a centralized visitor knowledge base.

The final features in the visitor knowledge base will allow me to dynamically deploy the appropriate advertising model that each visitor can view when the second page is loaded, or theoretically allow me to start interjecting intermission pages to become a "subscriber".
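Those 'sensors' could be as simple as beacon files: every page references a normal beacon script plus a decoy with an ad-like filename, and the server classifies each visitor by which of the two actually got requested. A sketch under that assumption, with invented filenames:

```python
# Each page includes two <script> tags: a normal beacon ("b.js") and a
# decoy with an ad-like name ("ads_banner.js") that ad blockers filter.
seen = {}  # visitor id -> set of beacon files actually requested

def record_fetch(visitor_id, filename):
    """Called from the web server's log of beacon requests."""
    seen.setdefault(visitor_id, set()).add(filename)

def classify(visitor_id):
    """Classify a visitor from which beacons came back."""
    files = seen.get(visitor_id, set())
    if not files:
        return "javascript disabled"   # neither script ever executed
    if "ads_banner.js" not in files:
        return "ad blocker"            # beacon ran, decoy was filtered
    return "ads visible"
```

With that classification per visitor, serving a different advertising model (or a "subscriber" intermission page) on the second page load is just a lookup.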

So far so good, will keep everyone posted as to the effectiveness of the anti-scraper tools.

Tuesday, December 27, 2005

Voyager the Cosmix Crawler

Found another super slow web crawler downloading pages for Cosmix's Kosmix health search, which is completely off topic for my site. It was only moving at a snail's pace looking for pages with medical information, but being that far off topic I decided to block them on principle alone.

What all these crawlers and scrapers do is inflate my page views, and since my direct advertisers are embedded in every page, that inflates their page impressions and makes the click thru rates for the ads look worse than they really are.

Sorry unwanted spiders, take a hike.

Sensis Web Crawler

Maybe I'm leading a sheltered life, but I'd never heard of the Australian-based Sensis Web Crawler before this week, when I noticed it slow crawling my biggest web site over a period of days. I almost blocked this bot, but then I noticed it seemed to be a legit service, and Alexa claims it's reasonably popular (in their top 10,000 sites), so I let it crawl.

Probably one of the best behaved but persistent little bots I've seen in a while.

Sensis has only crawled about 1300 pages in 24 hours, which is about 54 pages an hour, so Sensis will probably be on my site for 30 days, depending on how deeply they index at their current rate of crawl.

This is much less abusive than Yahoo, MSN or Google, which take whatever they want, whenever they want, and sometimes as fast as they can get it, just to stay competitive with each other.

Maybe the Big 3 can take a clue from this little player and be a bit more web server and bandwidth friendly. Then again, maybe Sensis just doesn't have the bandwidth and computing horsepower to beat the crap out of my server like the Big 3.

Good luck Sensis, will keep an eye on my results down under.

Monday, December 26, 2005

Google Peeks on Purpose

Google must peek at pages they are expressly told not to index in robots.txt, on purpose, just to see what in the heck is being hidden. There is no technical reason imaginable that they can't read the same robots.txt everyone else is reading: neither Yahoo, MSN nor Teoma has ever crawled the pages marked off limits, yet Google just can't seem to control themselves and keep their damned bots off those pages.
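For reference, marking pages off limits is nothing more exotic than a Disallow rule, which the other three engines apparently manage to read just fine (the paths here are invented, not my actual ones):

```
# robots.txt -- honored by Yahoo, MSN and Teoma; Google, take note
User-agent: *
Disallow: /private/
Disallow: /testing/
```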

So which is it Google?

  • Everything at Google is still in BETA so what do you expect
  • Our engineering dept. just can't get all the bugs out, get over it
  • We peek regardless because we're Google and we can
I'd really like to know which it is as the competition doesn't seem to break those rules.