Some of my "scraped" content kept showing up in places not expressly authorized to have it. This was driving me a little batty, because I was pretty sure the bot blocker wasn't letting these people through and that my code didn't have holes like Swiss cheese. Then I figured it out: there was a clue embedded in some of the data, which included one of my tracking bugs, and it turns out the data originated from Gigablast.
Knowing it came from Gigablast, I looked up Gigablast's partner list and VOILA! there was the site in question.
Now comes the dilemma of what to do about this situation: I'm not happy with a couple of their partners, and by allowing Gigablast, I'm permitting those partners access by default.
Worse yet, Google indexes the Gigablast data that's present in their partner sites, like Eurekster, so here you are competing with your own content in Google yet again via the Gigablast connection.
Since I really don't get any noticeable traffic from Gigablast or any of their partners, maybe it's time to cut the umbilical cord just to keep my own information from being used against me to rank their partner sites in Google.
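If cutting the cord entirely is the answer, standard robots.txt syntax already handles that. Gigablast's crawler identifies itself as Gigabot, so a complete block would look like this (assuming, of course, that Gigabot honors robots.txt, which any legitimate engine should):

```
User-agent: Gigabot
Disallow: /
```

That's the all-or-nothing hammer, though: it keeps my content out of the partner sites, but it also drops me from Gigablast itself.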
Looks like we need some robots.txt commands we could use to tell a search engine like Gigablast it's OK to index, but not to share with Snap, for instance.
Maybe implement something like this in robots.txt for search engine partner control:
User-agent: Gigabot

It feels almost as bad, if not worse, than battling a scraper, but this time I let this one in the front door with my blessings.
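No such partner-control directive actually exists in the robots exclusion protocol today, but a hypothetical extension might read something like this (Disallow-partner is an invented name here, purely wishful thinking, not part of any robots.txt standard):

```
User-agent: Gigabot
# An empty Disallow means Gigabot may index everything
Disallow:
# Hypothetical, unsupported directive: index here, but don't feed the data to these partners
Disallow-partner: Snap
Disallow-partner: Eurekster
```

Until something like that exists, the only real choices are the all-or-nothing block or living with the partner sites.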
To block or not to block, THAT is the question...