Since I whitelist allowed bots, I've had Project Rialto blocked from the beginning, but I was curious what they were doing: they first showed up on my radar on 01/23/2008 and have kept coming back over and over.
From one of their job ads:
We are designing high-performance algorithms and developing reliable, fault-tolerant and scalable real-time systems that can handle massive volume of data for in-depth analysis of user behavior to enable targeted advertising.
Oh boy!
Research and investigate academic and industrial data mining, machine learning and modeling techniques to apply to our specific business case
It appears they want to crawl our sites and use that information to shove more ads in our face.
Somehow, I don't think so...
If you're going to mine data, shouldn't you get the URLs right?
The site they're attempting to "mine" is on a Linux box, where URLs are case sensitive, and my URLs all have mixed upper and lower case in them. Yet PRCrawler only asks for those URLs in all lower case, so even if I let them crawl my site they'd get nothing but 404s.
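To make the failure concrete, here's a minimal Python sketch of a case-sensitive lookup, which is how a Linux filesystem resolves paths. The paths and the `lookup` and `crawl_lowercased` helpers are made up for illustration; they are not the real URLs or server code.

```python
# Hypothetical set of paths that exist on a case-sensitive (Linux) server.
EXISTING_PATHS = {"/Products/WidgetList.html", "/About/ContactUs.html"}


def lookup(path: str) -> int:
    """Return 200 on an exact, case-sensitive match, otherwise 404."""
    return 200 if path in EXISTING_PATHS else 404


def crawl_lowercased(paths):
    """Request every path in all lower case and collect the statuses."""
    return {p: lookup(p.lower()) for p in paths}


if __name__ == "__main__":
    for path, status in sorted(crawl_lowercased(EXISTING_PATHS).items()):
        print(path, "->", status)
```

Every mixed-case path misses once it has been lower-cased, so a crawler working from such a list would never see a real page.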
No wonder their home page says they're a "stealth company" because I'd hide too if I couldn't even get the proper case of the URLs right.
Here's their user agent:
"PRCrawler/Nutch-0.9 (data mining development project; email@example.com)"
They operate from the following IPs:
220.127.116.11
The first two were from masergy.com, the rest are all from compute-1.amazonaws.com.
host-64-47-51-153.masergy.com.
I haven't seen anything from masergy.com since the initial contact but that's only 2 months ago so who knows.
Don't know where they primed the pump for their data mining operation since they already had lots of information about my site when they attempted to crawl, but since it was all lower case it was completely useless.
I'm just curious whether they got my URLs from somewhere already in lower case, or whether someone there slapped a tolower() around a line of code when importing the URLs into Nutch.
Don't know, don't care, it's amusing either way.
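For what it's worth, the second guess is an easy mistake to make. Here's a rough Python sketch of the difference between lower-casing a whole URL and lower-casing only the parts that are actually case insensitive (the scheme and the host). The function names and the example URL are hypothetical, and this is not Nutch's code, just an illustration of the idea.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_everything(url: str) -> str:
    """The suspected mistake: lower-case the entire URL string."""
    return url.lower()


def normalize_safely(url: str) -> str:
    """Lower-case only scheme and host; leave path, query and fragment alone.

    Assumes the URL carries no user:password part, since that would sit
    inside netloc and is case sensitive.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    url = "HTTP://WWW.Example.COM/Products/WidgetList.html?Id=7"
    print(lowercase_everything(url))
    print(normalize_safely(url))
```

The first function mangles the path into something a case-sensitive server won't recognize; the second keeps the path intact while still folding the host name.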
Good luck with Project Rialto, you're going to need it.