Friday, April 11, 2008

RTGI - The French Social Media Spybot

Yet another social media mining operation designed to track every bit of intel said about brands, people, politics and more.

From a translation of their site:

Our solutions simplify the identification of influential communities and monitoring of their conversations, to the benefit of businesses, communication agencies or research institutes.

RTGI's approach allows the analysis of the links and content generated by the citizens, journalists, consumers or activists, to draw the contours of communities conversations around your issues, brands and products and their real impact on your image online. RTGI have elaborated the linkfluence to give a unit of reliable measurement of the influence of the social web sites.
The highlighting was added to help you see how it facilitates spying on your ass without going to much effort to do so.

Heck, the French government is in their list of clients!
  • Government Information Service (GIS)
  • Ministry of the Economy, Finance and Employment
  • Picardy Regional Council (RENUPI)
Sheesh, didn't need to translate as they have an English .EU version too.

Oh well, I'm not rewriting it!

Continuing on...

George Orwell obviously didn't anticipate the internet and he was off by a few years, 24 to be exact, but his overall message of Big Brother watching us in 1984 is finally coming true in 2008.

Anyway, back to the details:
"mozilla/5.0 (compatible; RTGI; http://rtgi.fr/)"
The IPs they operate from are:
88.191.50.170 -> sd-8985.dedibox.fr.
91.121.108.180 -> t800.rtgi.eu.
91.121.25.182 -> merlin.rtgi.eu.
91.121.25.184 -> r2d2.rtgi.eu.
91.121.79.160 -> c3po.rtgi.eu.
The old address of 88.191.50.170 doesn't appear to have been active since 04/13/2007, so I probably wouldn't worry about it too much unless you just want to block that dedicated hosting range for good measure.
inetnum: 88.191.3.0 - 88.191.248.255
netname: FR-DEDIBOX
descr: Dedibox SAS
descr: Paris, France
route: 88.160.0.0/11
The dedicated host they currently use has this range of IPs:
inetnum: 91.121.0.0 - 91.121.31.255
netname: OVH
descr: OVH SAS
descr: Dedicated Servers
descr: http://www.ovh.com
So there you go, another way to make your site part of the anti-social media by keeping the snoops out.
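For anyone who wants to turn the details above into an actual block, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any real setup: the function and constant names are made up, and matching on the substring "rtgi" in the user agent is a crude guess that could catch unrelated strings. One thing worth noticing is that the OVH range quoted above (91.121.0.0 - 91.121.31.255) only covers the two 91.121.25.x addresses, so the sketch also lists every individual IP explicitly.

```python
import ipaddress

# User-agent token taken from the string quoted in the post; matching on a
# bare substring is an assumption made for this sketch.
RTGI_UA_TOKEN = "rtgi"

# whois "inetnum" ranges quoted in the post, as (first, last) addresses.
QUOTED_RANGES = [
    ("88.191.3.0", "88.191.248.255"),  # FR-DEDIBOX (the old host)
    ("91.121.0.0", "91.121.31.255"),   # OVH
]

# Individual addresses listed in the post.
LISTED_IPS = [
    "88.191.50.170",
    "91.121.108.180",
    "91.121.25.182",
    "91.121.25.184",
    "91.121.79.160",
]


def build_networks(ranges):
    """Convert first/last address pairs into a list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.ip_address(first), ipaddress.ip_address(last)
            )
        )
    return networks


BLOCKED_NETWORKS = build_networks(QUOTED_RANGES)
BLOCKED_IPS = {ipaddress.ip_address(ip) for ip in LISTED_IPS}


def is_blocked(remote_ip, user_agent):
    """Return True if the request matches the RTGI user agent or addresses."""
    if RTGI_UA_TOKEN in user_agent.lower():
        return True
    addr = ipaddress.ip_address(remote_ip)
    if addr in BLOCKED_IPS:
        return True
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Keep in mind that blocking the whole ranges shuts out every other customer on those dedicated hosting blocks too, which may or may not be what you want.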


Project Rialto's PRCrawler Is Data Mining?

Since I whitelist allowed bots, I've had Project Rialto blocked since the beginning, but I was curious what they were doing since they first showed up on my radar on 01/23/2008 and kept coming back over and over.

From one of their job ads:

We are designing high-performance algorithms and developing reliable, fault-tolerant and scalable real-time systems that can handle massive volume of data for in-depth analysis of user behavior to enable targeted advertising.

and...

Research and investigate academic and industrial data mining, machine learning and modeling techniques to apply to our specific business case
Oh boy!

It appears they want to crawl our sites and use that information to shove more ads in our face.

Somehow, I don't think so...

If you're going to mine data, shouldn't you get the URLs right?

The site they're attempting to "mine" is on a Linux box, where URLs are case sensitive, and my URLs all have upper and lower case in them. Yet the PRCrawler only asks for those URLs in all lower case, so even if I let them crawl my site they'd get nothing but 404s.

No wonder their home page says they're a "stealth company" because I'd hide too if I couldn't even get the proper case of the URLs right.

Here's their user agent:
"PRCrawler/Nutch-0.9 (data mining development project; crawler@projectrialto.com)"
They operate from the following IPs:
64.47.51.153
64.47.51.158
67.202.0.157
67.202.0.17
67.202.0.71
67.202.10.65
67.202.18.229
67.202.29.20
67.202.3.112
67.202.3.141
67.202.3.151
67.202.56.219
67.202.58.214
67.202.59.117
67.202.62.162
67.202.62.45
72.44.36.20
72.44.36.8
72.44.37.72
72.44.39.55
The first two were from masergy.com; the rest are all from compute-1.amazonaws.com.
host-64-47-51-153.masergy.com.
host-64-47-51-158.masergy.com.
I haven't seen anything from masergy.com since the initial contact but that's only 2 months ago so who knows.

Don't know where they primed the pump for their data mining operation since they already had lots of information about my site when they attempted to crawl, but since it was all lower case it was completely useless.

I'm just curious if they got my URLs from somewhere already in lower case or if someone there slapped a tolower() around a line of code when importing the URLs into Nutch.
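To show why a blanket tolower() would be fatal, here is a small Python sketch. It is purely a guess at the kind of mistake involved, not anything known about Project Rialto's code; the function names and the example.com URL are made up. Only the scheme and host of a URL are case insensitive, so a safe normalizer lowercases those and leaves the path alone. The sketch assumes the URL carries no user:password@ part, which would be case sensitive as well.

```python
from urllib.parse import urlsplit, urlunsplit


def naive_normalize(url):
    """The suspected bug: lowercase the whole URL, path included."""
    return url.lower()


def safe_normalize(url):
    """Lowercase only the scheme and host; keep path, query and fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (
            parts.scheme.lower(),
            parts.netloc.lower(),  # assumes no userinfo in the URL
            parts.path,
            parts.query,
            parts.fragment,
        )
    )


if __name__ == "__main__":
    url = "HTTP://Example.COM/Blog/MyPost.html"  # hypothetical mixed-case URL
    print(naive_normalize(url))  # path is mangled, 404 on a case-sensitive server
    print(safe_normalize(url))   # path survives intact
```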

Don't know, don't care, it's amusing either way.

Good luck with Project Rialto, you're going to need it.

Wednesday, April 09, 2008

Radian6's R6_FeedFetcher Fetching More Than Feeds

For those of you unfamiliar with Radian6, it's a "social media monitoring tool" because apparently everyone with an opinion on the internet needs someone to spy on their ass since we're disruptive.

Well bummer.

Isn't it a shame the good old days are gone where companies told you everything you needed to know about their brand and you had to be a journalist just to get your opinion heard?

Of course those so-called journalists never gave you their real opinion because of fear of losing advertisers so it was all candy coated bullshit that just bordered on the truth because advertisers couldn't handle the truth fearing nobody would buy their shit.

Tough shit and god bless the great equalizer called the Internet that leveled the playing field between consumers and companies so we can find out what's really going on without everything being filtered through the company spin doctor.

Their crawler details are:

142.166.3.122 "R6_FeedFetcher(www.radian6.com/crawler)"
The amusing thing about the R6_FeedFetcher is I never see it fetching the feed. Instead it's trying to fetch pages linked from the feed, which is what we call a crawler and not a fucking feed fetcher.

Does it read robots.txt to see if it's allowed beyond my RSS feed?

Fuck no.
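For contrast, here is roughly what a polite fetcher would do before following links out of a feed, sketched with Python's standard urllib.robotparser. The robots.txt content, the /rss.xml path, and the example.com URLs are all hypothetical and only stand in for a site that permits its feed and nothing else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the feed is allowed, everything else is not.
# Python's parser applies the first matching rule, so Allow comes first.
ROBOTS_TXT = """\
User-agent: *
Allow: /rss.xml
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    ua = "R6_FeedFetcher(www.radian6.com/crawler)"
    print(allowed(ua, "http://example.com/rss.xml"))
    print(allowed(ua, "http://example.com/2008/04/some-post.html"))
```

A fetcher that ran a check like this would stop at the feed instead of wandering into the pages linked from it.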

I looked at all accesses on my RSS feed and didn't see anything obvious so maybe they get RSS feeds from FeedBurner or something similar, who knows.

Anyway, it's blocked now on my other site so I can be as disruptive as I want there.

However, who wants to place bets that this disruptive post will be monitored?


P.S. For first-time readers: the site R6_FeedFetcher is blocked on is not this blog ;)

Update:

After doing some research it appears they also have the following user agent:
"R6_CommentReader(www.radian6.com/crawler)"
Also, read this interesting post about Radian6 on Simon's blog.