Tuesday, December 27, 2011

CrawlWall LIVES! Sign Up Now!

Some of you probably never thought you would see it happen, but CrawlWall LIVES!

Even though I've been sick as a dog for months, I started plugging away at the damn thing and doing some unit testing of various components on IncrediBILL.net, and now it's close to going into alpha testing.

Go sign up if you want to be a CrawlWall alpha tester; I think you'll like it.

BTW, approved alpha testers (and it will be a small group) get a free copy of the final shipping version of CrawlWall, so sign up quick!

Thursday, November 10, 2011

Bot Blacklisting vs. Whitelisting, are you a convert yet?

I'm still shocked that after all these years people are not only still practicing the ancient black art of blacklisting, but I'm even more shocked to see several so-called website content security products recently released that rely on blacklisting as their primary defense.

Are they fucking kidding?

Do people really pay good money to chase an endless supply of bots?

Let's explore the blacklisting dam vs. whitelisting dam metaphor to get a simple grasp on this issue. For those not familiar with the problem, blacklisting is like building a dam on a river with a big gaping hole in the middle of that damn dam. While it holds back some water, or bad bots in this instance, the damn blacklisting dam still lets most of it spill through, a total waste of time and money. Whitelisting, on the other hand, is like a real dam that holds everything back except for the controlled spill, aka the whitelisted items, which are the only things allowed to pass. Therefore, just like damming a river, common sense dictates you build a solid dam with whitelisting to control all those bots and do it right the first time.

Blacklisting is a pretty futile methodology, obviously the choice of masochistic webmasters. Look at the amount of time and resources wasted maintaining a blacklist: tons of bot entries, lots of log analysis and processing power just to keep up with them. Heck, all a bad bot has to do to defeat your blacklist is change its user agent name every single time it accesses your site.

Simply combine any two random words from the dictionary and you've got a new bot name that can bypass any blacklist. Hell, just pick almost any single word from the dictionary and you'll defeat the blacklist; two words is overkill, really. Some bots merely send a couple of random strings of gibberish as a user agent, which works perfectly to defeat silly tactics like blacklisting.
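
Don't believe me? Here's a quick PHP sketch (the blacklist entries are made-up examples, not anyone's real list) showing how a random two-word user agent sails right past a typical substring blacklist check:

<?php
// A typical blacklist check: block only the user agents you've seen before.
// These entries are hypothetical examples.
$blacklist = array('libwww-perl', 'Wget', 'HTTrack', 'EmailCollector');

function is_blacklisted($ua, $blacklist) {
    foreach ($blacklist as $bad) {
        if (stripos($ua, $bad) !== false) {
            return true; // known bad bot
        }
    }
    return false; // anything unknown gets a free pass
}

// Any two random dictionary words make a "new" bot the list has never seen.
$words = array('purple', 'walrus', 'cactus', 'monsoon');
$ua = $words[array_rand($words)] . ' ' . $words[array_rand($words)];
var_dump(is_blacklisted($ua, $blacklist)); // bool(false), right on through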

Now examine the simple implementation of a whitelist. There aren't that many beneficial things that crawl your site, and most sites can thrive with a whitelist of fewer than 20 entries, maybe 100 max, instead of the hundreds or thousands of items in a blacklist. Small lists are easy to maintain, and the negligible processing required to validate the list in real time means low impact on server load.
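
For comparison, here's a bare-bones PHP sketch of a whitelist check. The entries and the 403 handling are placeholders for whatever your site actually allows, and a real setup should also verify claimed crawlers with reverse DNS, since anyone can fake "Googlebot":

<?php
// Whitelist check: only named agents pass, everything else bounces.
// Entries are examples; build your own list from your logs.
$whitelist = array('Googlebot', 'Slurp', 'bingbot', 'MSIE', 'Firefox',
                   'Chrome', 'Safari', 'Opera');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$allowed = false;
foreach ($whitelist as $ok) {
    if (stripos($ua, $ok) !== false) {
        $allowed = true;
        break;
    }
}

if (!$allowed) {
    header('HTTP/1.1 403 Forbidden'); // unknown agent, kicked to the curb
    exit;
}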

Using any raw logfile analysis program, it's easy to identify what should be whitelisted in mere minutes. The best thing is that whitelisting means you can spend your spare time actually working on your site instead of chasing bad bots to blacklist, as everything not whitelisted is automatically kicked to the curb by default with no extra effort on the part of the webmaster.
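
For a quick and dirty starting point, a PHP sketch like this (assuming the standard Apache combined log format; the log path is an example) tallies the user agents in your raw access log so you can eyeball what deserves a whitelist entry:

<?php
// Tally user agents from an Apache combined-format access log.
// The path is an example; point it at your own log.
$counts = array();
$fh = fopen('/var/log/apache2/access.log', 'r');
while (($line = fgets($fh)) !== false) {
    // The user agent is the last double-quoted field in combined format.
    if (preg_match('/"([^"]*)"$/', trim($line), $m)) {
        $ua = $m[1];
        $counts[$ua] = isset($counts[$ua]) ? $counts[$ua] + 1 : 1;
    }
}
fclose($fh);

arsort($counts); // busiest agents first
foreach (array_slice($counts, 0, 25, true) as $ua => $hits) {
    echo $hits . "\t" . $ua . "\n";
}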

Those I've actually convinced to convert to whitelisting in the past have done nothing but sing its praises.

Compare that to those still blacklisting, they don't have any spare time to sing.

Tuesday, November 08, 2011

Tracking Domain Intel Site Bots

I've taken a recent interest in tracking down the shitload of crawlers from the domain intel sites out there that scrape your homepage, keywords, etc. and even display your AdSense and Google Analytics IDs.

Fucking asshats.

Posted a bunch of them on WebmasterWorld, so drop in there to get details about domainsoutlook.com, statshow.com, urbandata.com, zitetrendz.com, hostnology.in, dawhois.com, clearwebstats.com, whoare.us, w3who.net, diigo.com, domainspyer.com, spyrush.com, aboutthedomain.com, seeallweb.org, webdetail.org and a minor update on domaintools.com.

Yes, I got busy :)

Wouldn't mind some feedback on my post about Spider Tracking Links - Examining 2 Methods - that would be nice!

Saturday, October 15, 2011

Where's Bill Been Hiding?

I actually haven't been hiding, I was on the endangered species list.

I have been sliding down a slippery slope for about a year now and seem to be finally recovering.

Hopefully.

It all started around the beginning of the year when my doctor put me on a new drug that, after about a month, appeared to be causing a mild rash. We didn't stop the drug at the time because it also seemed to be working incredibly well. Bad idea; we should've stopped the drug immediately. However, from what I've learned about these kinds of drug rashes since, it probably wouldn't have made any difference in the outcome if we had stopped right away, as it was probably too late to stop what had already been set in motion.

Just some redness and a few blisters up around my shoulders, that was all it was when it started. Next thing you know, my entire torso is covered in this crap. By April it was a big old mess and I was in trouble. It spread over my head, all hair gone, beard, etc. POOF! It spread down my back, eventually the arms, and last but not least, it slowly spread down my legs until it reached my feet. Head to toe, I was covered in compromised skin with no real treatment options except time to let it heal.

Time my ass, it's been many months and it's still fucked up.

My skin then started cracking, bleeding, making sores, etc., so needless to say I became homebound for about 6 months. Eventually other weird shit started, like edema, most likely a long-term side effect of the prednisone I was taking. The edema got so bad I couldn't sit at a computer more than an hour or so a day before my arms and legs puffed up to the point I couldn't take it and had to lie down. That basically took me offline for quite some time, and then other fun things started to happen, like my fingernails falling off, which made it almost impossible to type until they were fucking gone altogether.

Then the real fun happened: the edema got so bad the doc tried taking me off the prednisone, but did it too fast and just about killed me. My skin had actually been getting better, but while being weaned off the prednisone too fast it went into reverse, and the next thing you know I'm 50% (or more) compromised and can barely sit or get out of a chair because it's all raw. Real fun stuff.

In a short period I went from being completely ambulatory to using a cane, a walker, and finally being pushed around in a wheelchair when I went out because I couldn't muster enough energy to walk more than a short distance.

I was fading fast and it was not looking good, so I finally got them to admit me to a burn unit about a month ago, since they specialize in wound care, to get my skin problems stabilized. For the first couple of days in the burn unit I was practically bedridden. Honestly, I didn't think I was ever walking out of that place. However, after spending close to 2 weeks there, with new medication and expert wound care, I started to bounce back and walked out on my own.

Currently my skin is still being treated with drugs and bandages, but the edema is gone, as well as the wheelchair, walker, and cane!

After 6 months of being stuck in the house I finally managed to go out to restaurants a few times, a couple of stores, and even drove the car! Might not sound like much, but trust me, when you've been stuck indoors in the same chair staring at the same TV for 6 months, it's HUGE!

So that's where I'm at: better but not totally healed, which could take months. But I'm now online and posting a lot, so that should give you a clue that I'm bouncing back and well enough to be taken off the endangered species list, or so I hope!

No clue if I'll ever be the same again, but it sure beats the alternative!

Thursday, June 30, 2011

Looking for LinkScrubber Alpha Testers!

After years of running an internal link checker that rocks, I'm taking it public!

Why LinkScrubber?

The answer was simple: the other link checkers sucked and just reported "200 OK" for many bad sites. They didn't catch domain parks, prize pages, SEO injection hacks, soft 404 pages, and so on, so it had to be written. Now LinkScrubber has a very extensive set of fingerprints for all of these sites and can detect more hidden link rot than you can imagine.
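
To give a feel for the fingerprinting idea, here's a rough PHP sketch. The patterns below are made-up illustrations, not LinkScrubber's actual fingerprint set, which is far more extensive:

<?php
// Sketch of fingerprint-based link rot detection: fetch the page and
// match the body against park/soft-404 signatures even when it's 200 OK.
// These patterns are illustrative examples only.
$fingerprints = array(
    '/this domain (?:is|may be) for sale/i'      => 'domain park',
    '/sponsored listings/i'                      => 'domain park',
    '/congratulations.*you(?:\'ve| have) won/is' => 'prize page',
    '/page (?:not found|cannot be found)/i'      => 'soft 404',
);

function check_link($url, $fingerprints) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code != 200) return "HTTP $code";
    foreach ($fingerprints as $pattern => $label) {
        if (preg_match($pattern, $body)) {
            return "200 OK but looks like a $label"; // hidden link rot
        }
    }
    return '200 OK, looks clean';
}

echo check_link('http://example.com/', $fingerprints) . "\n";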

LinkScrubber doesn't have a full crawler yet, since I originally didn't need one as I was checking links stored in a database, but a crawler is coming real soon.

If you have the programming skills, the LinkScrubber API is LIVE! If you want to try the API, just send an email to 'alpha' at the LinkScrubber.com domain and request your DEVKEY to activate the API. The LinkScrubber API docs and some sample PHP code are here, along with the current set of status return codes.

However, if you don't program, don't despair: UPLOAD YOUR LINKS! If you run Xenu or some other link checker, simply copy and paste the external link report (OBLs) into Notepad and upload it to LinkScrubber. It's that simple!

Give it a try and find hidden link rot your current link checker has no clue about.

You may be shocked at what you find!

Thursday, May 26, 2011

Anyone still listening?

I haven't blogged much for a variety of reasons this year, sorry about that.

Got a lot going on, anyone even still listening?

Probably not, don't blame ya.

Tuesday, May 24, 2011

Whitelisting, Not Blacklisting to stop bots!

Really getting sick of repeating myself, as people just don't seem to get it when it comes to blocking bots: blacklisting doesn't fucking work. Blacklisting requires wasting time chasing bots in access logs, huge-ass .htaccess files that slow Apache and impact server performance, and is easily bypassed by changing a single character in the user agent name.

Whitelisting, on the other hand, only tells the server what can pass; everything else bounces. Whitelists are usually short: Googlebot, Slurp, Bingbot, valid browsers, and nothing else. A short list is fast to process and doesn't slow Apache down whatsoever.

Then install a script to monitor for things that access robots.txt, spider trap pages, and natural spider traps like your legal and privacy pages, plus speedy or greedy accesses, and you've pretty much solved your scraper problems.
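
A spider trap page can be dirt simple. Here's a rough PHP sketch (the ban file path is an example; wire the list into your own front-end check): put it behind a link humans never see, disallow it in robots.txt, and anything that fetches it anyway has outed itself:

<?php
// Spider trap: record whoever fetched this page so they can be banned.
// Path is an example; hook the ban list into your front-end whitelist check.
$banlist = '/var/www/data/banned_ips.txt';

$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';

// Log the offender; subsequent requests from this IP can get a 403.
file_put_contents($banlist, date('c') . " $ip \"$ua\"\n", FILE_APPEND | LOCK_EX);

header('HTTP/1.1 403 Forbidden');
echo 'Bad robot.';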

But for fuck's sake, use your goddamn brain and WHITELIST or you're just wasting your fucking time and inviting scrapers, not blocking them.

Wednesday, April 27, 2011

When Webmastering Should Become A Capital Crime

A little project I've been working on has helped me unearth some of the sheer horrors that the search engines have to deal with on a daily basis, and some web designers and webmasters should literally be shot.

How about a fully functional web page that has no anchors in the HTML, none whatsoever, yet you can plainly see a working navigation menu on the page?

That's right, some dumb fuckers are doing 100% javascript navigation, nothing in HTML or CSS, completely included as a javascript navbar. Sadly, Google rewards these assholes by indexing their shit.

But this is trivial, let's move on...

How about an old domain name with a meta refresh to the new domain name.

No big deal, right?

Until the new domain 301s to an adjusted location. The page at that location has a javascript redirect to yet another location. Finally it lands on a "200 OK" response with no more redirects. What a maze from hell.

Getting ugly but I've seen worse.

The one that completely blew me away was some dumb fuckers with UTF-8 encoded websites being stored online as UTF-16 (UNICODE), so every letter you see on the screen is actually a double-byte character, which doubles the bandwidth for no obvious fucking reason except that some shithead self-proclaimed web fucking designer in some goddamn 3rd world shit hole country doesn't know how to save files properly when not coding in a UTF-16 character set.
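
If you're curious whether a site is pulling this stunt, here's a quick PHP sketch (assuming allow_url_fopen and the mbstring extension; the URL is a placeholder): grab the raw bytes and look for a UTF-16 byte order mark or the telltale NUL bytes sitting between the ASCII letters:

<?php
// Quick check for a page served as UTF-16 instead of UTF-8.
// The URL is a placeholder; test against any page you suspect.
$raw = file_get_contents('http://example.com/');

// A UTF-16 file usually starts with a byte order mark...
$bom16 = (substr($raw, 0, 2) === "\xFF\xFE" || substr($raw, 0, 2) === "\xFE\xFF");
// ...and mostly-ASCII text in UTF-16 has a NUL in every other byte.
$nuls = substr_count(substr($raw, 0, 1024), "\x00");

if ($bom16 || $nuls > 100) {
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16'); // what it should have been
    printf("UTF-16 page: %d bytes as served, %d bytes as UTF-8\n",
           strlen($raw), strlen($utf8));
}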

Wonder if Google penalizes them for slow loading pages or just for being goddamned stupid?

Some of the big ISPs like PacBell do some really stupid fucking shit too, like this:

home.pacbell.net/someaccount

302 redirects to:

http://home.pacbell.net/cgi-bin/sunset.cgi?old_request=/someaccount

200 OK contains a javascript redirect (IT guy out that day?) to:

http://pages.prodigy.net/cgi-bin/index.cgi?pwpurl=http://home.pacbell.net/someaccount

404 finally!

"Discontinuation of Prodigy Personal Web Pages (PWP) Support "
And this shit just goes on and on and on...

It's just one big fucking mess on the web and some webmasters and designers should be very fucking ashamed.

One thing is very obvious: the search engines are playing a very skillful game of interpreting javascript these days, otherwise many pages on the kinds of bullshit sites mentioned above would never be discovered.

Friday, January 28, 2011

GlueText Crawlers Identified and Blocked

Started noticing some leeched content showing up on a site called GlueText, so it got my curiosity up to see how they were gathering their content.

Turns out they were initially using the default libwww-perl user agent back in '09:

99.231.221.217 "libwww-perl/5.820"

Looks like they got a little smarter after being bounced by sites and switched to the old Netscape Navigator user agent for the Win98 version, which they still use today!

99.231.78.89 "Mozilla/4.76 [en] (Win98; U)"
GlueText appears to have historically used the following IPs:
99.231.78.89
CPE0024b2cbf30a-CM0016b536fb82.cpe.net.cable.rogers.com.

173.203.215.230
173-203-215-230.static.cloud-ips.com.

99.231.221.217
CPE0009a30119af-CM0016b536fb82.cpe.net.cable.rogers.com.

99.231.44.115
CPE002436a0fbf3-CM0017ee4740ec.cpe.net.cable.rogers.com.

76.65.207.92
TOROON63-1279381340.sdsl.bell.ca.
My most current test showed they were now using the following IPs, all from cloud-ips.com and all belonging to GlueText:
173.203.210.51
173.203.210.95
173.203.215.230
173.203.241.192
Other IPs still involved:
76.65.207.92 -> TOROON63-1279381340.sdsl.bell.ca

99.231.78.89 -> CPE0024b2cbf30a-CM0016b536fb82.cpe.net.cable.rogers.com
Doesn't request robots.txt, fakes a Netscape user agent to gain access without permission, doesn't appear to document how it crawls content, nor does it appear to give webmasters any way to opt out.

BAD ROBOT!

Blocked.
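
For anyone who wants to bounce it too, here's a minimal PHP sketch using the IPs and the faked user agent documented above; adapt it to your own setup, as an .htaccess deny works just as well:

<?php
// Bounce GlueText by IP or by its faked Netscape user agent.
// IPs are the ones observed above; adjust as it moves around.
$gluetext_ips = array(
    '173.203.210.51', '173.203.210.95', '173.203.215.230', '173.203.241.192',
    '76.65.207.92', '99.231.78.89',
);

$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Nobody legitimate is surfing with Netscape 4.76 on Win98 in 2011.
if (in_array($ip, $gluetext_ips) || strpos($ua, 'Mozilla/4.76 [en] (Win98; U)') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit('BAD ROBOT!');
}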