Spyveillance, Block 'em if you got 'em

OK, this must be a clue that my bot blocker has graduated to the head of the class, as I've snared 2 corporations bypassing security measures within 24 hours by pretending to be browsers.

Remember what I said about bot blocking being an onion that you keep peeling layer by layer?

The next one in our list of sneaky snoopers is Cyveillance, which has apparently been around for a while but went unnoticed until I cranked up the level of bot profiling on my site just a bit to see if I was missing anyone. BINGO! Got 2 big fish in a day by looking at the next layer of the onion.

According to what I've been reading at linuXgod's site, these boys spy for the RIAA, government and god knows who else or for what purposes. He's been trying to get them to stop crawling his site via a small back and forth of emails and they don't seem to be interested in complying.

My favorite quote is where they justify ignoring internet standards like robots.txt and masking the user agent string as a browser, "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)":

"Because many sites use redirection pages to route robots to special "indexing" pages, we identify our web crawler as an IE browser to ensure it receives the same content as the majority of web surfers on the internet and to allow our programmers to concentrate on a single interpretation of the html standard."

Well hell, doesn't that logic just make it fucking OK to ignore whether I want your robot on my server in the first place?

So you're justified in bypassing my security to stop browsers just to concentrate on a single html standard?

Well guess what, NO, YOU'RE NOT JUSTIFIED!

Here you go people, these are the IP ranges, so block them, as we're not being given any other means to detect this crawler:

Cyveillance QWEST-63-148-99-224 (NET-63-148-99-224-1) -


CYVEILLANCE UU-65-213-208-128-D4 (NET-65-213-208-128-1) -

Wish I had the bot blocker commercialized now to go mainstream and nail this nonsense.
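
For anyone who wants to do the range blocking in code rather than at the firewall, here is a minimal sketch of the idea, not the bot blocker itself. The two ranges in it are placeholders from the reserved documentation blocks, NOT the crawler's real ranges, and the names BLOCKED_RANGES and is_blocked are made up for illustration; look up the actual ranges and swap them in.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks (RFC 5737).
# These are NOT the real crawler ranges; substitute the ones you look up.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.128/25"),
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside any blocked range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: let the other layers of the onion decide.
        return False
    return any(addr in net for net in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.5", "198.51.100.200"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Blocking by range is a blunt instrument, but when the crawler lies about its user agent, the address is the only thing left to match on.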

Corporate Crawler Masking as MSIE

Well, color me stunned, shocked and appalled, as I ran into an actual real live corporation with a legitimate product that is deploying a crawler that sets the user agent as MSIE: "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ....)".

Yeah, that's right, forget robots.txt, forget letting you block them by normal user agent filtering means, they're getting into your website whether you like it or not because they have MANIFEST DESTINY!



These lovely sneaky snoopers that boldly bypass your firewalling efforts are Lightspeed Technologies and they appear to be operating from this IP range -

Just block them now, as this is about the lowest I've seen a corporate crawler get, and they should be blocked on principle alone for not honoring internet standards.

Terms of Service vs. Fair Use

Here's my next thought about how to combat ill-behaved spiders that include snippets from your website and claim fair use: include something in the TERMS OF SERVICE or LEGAL page on your website that prohibits unauthorized robots.

That way, even if they are within their rights of fair use, they've violated your terms of service and you possibly have an actionable item on your hands.

Thinking about running this one past a lawyer as we need some boilerplate text like the GNU license that can be distributed and used everywhere as leverage against scrapers.

Film @ 11

This Means War

While I was out to lunch this afternoon some nitwits that I've banned over and over trotted out a new IP address and tried to scrape 1,000 pages when nobody was watching.

Sorry pals, someone WAS watching, it was my little silent sentinel buddy that I wrote myself that blocked your ass after about 20 pages and sent you a nice whopping 900+ pages of error messages.

I think I've had enough of your shit though and perhaps it's time we see what ABUSE@SCRAPERHOSTING.COM has to say about your repeated attacks on my server.

Hope you get shut down or at a minimum find yourself in bed with Lorena Bobbitt and wake up with a Frankendick.
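
For the curious, the silent sentinel idea boils down to counting pages per IP and slamming the door once a visitor goes over the limit. What follows is a rough sketch of that idea, not my actual code. The class name Sentinel, the 60 second window and the exact cutoff of 20 are assumptions for illustration; the only number taken from above is the "about 20 pages".

```python
import time
from collections import defaultdict, deque

MAX_PAGES = 20       # assumed cutoff, from the "about 20 pages" above
WINDOW_SECONDS = 60  # assumed window; pick whatever suits your traffic


class Sentinel:
    """Hypothetical per-IP page counter that bans fast scrapers."""

    def __init__(self, max_pages=MAX_PAGES, window=WINDOW_SECONDS):
        self.max_pages = max_pages
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()

    def allow(self, ip, now=None):
        """Return True to serve the page, False to send an error instead."""
        now = time.monotonic() if now is None else now
        if ip in self.banned:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_pages:
            self.banned.add(ip)
            del self.hits[ip]
            return False
        return True


if __name__ == "__main__":
    sentinel = Sentinel()
    # Simulate a scraper asking for 1,000 pages at 10 pages per second.
    served = sum(sentinel.allow("203.0.113.9", now=i * 0.1) for i in range(1000))
    print("pages served:", served, "errors sent:", 1000 - served)
```

A real version would want the ban list to survive restarts and would need to let the legitimate search engines through, which is where the profiling layers come in.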

Knuckle Scraping Neanderthal

When a scraper reads your robots.txt file don't you think they would avoid the disallowed pages and directories?

Then would you believe the scraper reads your robots.txt file a SECOND time after downloading just a few pages, immediately opens the page that it's told to leave alone and WHAMMO! gets stopped?

How FUCKING STUPID can you be to write such brain damaged code?
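
The trap that stopped this one is simple enough to sketch: list a path under Disallow in robots.txt, then ban anything that opens it anyway. This is a bare-bones illustration only. The /trap/ path and the handle_request function are hypothetical, and the in-memory ban set stands in for whatever storage a real blocker would use.

```python
# Hypothetical trap path; it must also appear under Disallow in robots.txt.
TRAP_PATHS = ("/trap/",)

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

banned = set()  # IPs that walked into the trap


def handle_request(ip: str, path: str) -> int:
    """Return the HTTP status code to send for this request."""
    if ip in banned:
        return 403
    if path.startswith(TRAP_PATHS):
        # A well-behaved robot never gets here, because robots.txt told it
        # to stay out, and a human never sees a link to it.
        banned.add(ip)
        return 403
    return 200


if __name__ == "__main__":
    print(handle_request("203.0.113.5", "/robots.txt"))      # reads the rules
    print(handle_request("203.0.113.5", "/trap/page.html"))  # ignores them
    print(handle_request("203.0.113.5", "/index.html"))      # now locked out
```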

Chitika ContentHit IPs

Chitika took another shot at my server with the user agent "Chitika ContentHit 1.0", but this time tried a whole bunch of IPs on a single web page, one that Chitika doesn't even appear on, which was most amusing.

So there you have it: let 'em in, block their ass at your leisure.