Let's start this investigation by noting that Bloglines themselves claim to be a crawler now when you use reverse DNS on their IP address:
220.127.116.11 -> crawler.bloglines.com
This is what Bloglines is supposed to do, read your RSS feed:
18.104.22.168 "GET /rss_feed.xml" "-" "Bloglines/3.1 (http://www.bloglines.com;XXX subscribers)"
However, they've stepped off the RSS path and started coloring outside the lines!
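If you want to repeat the reverse DNS check from the top of this post yourself, here's a rough sketch in Python. The function names and the `lookup` parameter are my own inventions for illustration, and the 192.0.2.x address in the usage note is a documentation placeholder, not a real Bloglines IP. Keep in mind that a PTR record alone only tells you what the owner of the IP claims to be.

```python
import socket


def reverse_dns(ip, lookup=socket.gethostbyaddr):
    """Return the PTR hostname for ip, or None if the lookup fails.

    `lookup` is a hypothetical hook so the function can be exercised
    without network access; by default it is socket.gethostbyaddr.
    """
    try:
        hostname, _aliases, _addresses = lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are both OSError
        return None
    return hostname


def is_bloglines_host(hostname):
    """True if hostname is bloglines.com or one of its subdomains."""
    if hostname is None:
        return False
    host = hostname.lower().rstrip(".")
    return host == "bloglines.com" or host.endswith(".bloglines.com")
```

Calling `is_bloglines_host(reverse_dns("192.0.2.10"))` with a real visitor IP in place of the placeholder tells you whether the address reverse-resolves into bloglines.com.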
The first odd thing I noticed was that it asked for robots.txt without any user agent defined:
22.214.171.124 "GET /robots.txt" "-" "-"
So I dug a little deeper and it appears they are running Firefox Minefield, which was asking for a bunch of images from 3rd party websites where my graphic appears:
126.96.36.199 "GET /myimage.gif" "http://someotherwebsite.com/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
Finally, I found them requesting some web pages that are NOT in any RSS feed, what the fuck?
188.8.131.52 "GET /anyoldpage.html" "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
So, anyone have a clue what they're doing?
Yes, they're making screen shots that appear on ASK.com!
I looked up a few pages from one of my sites in ASK and sure enough, instead of screen shots of the actual web pages there were screen shots of error messages with the Bloglines IP address of 184.108.40.206 in big bold numbers.
The reason I figured that out so easily is that I recently decided to block everything claiming to be coming from Linux, just to see what came up, and that's why they got an error page instead of a screen shot. Sure, I'm probably blocking a few innocent Linux users as well, but they account for an insignificant part of my traffic and overlap with the same tools that servers use, so sacrifices were made.
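For the curious, the Linux block boils down to something like the sketch below. This isn't my actual server config, just the same idea in Python with made-up function names: a plain substring test on the User-Agent header, which by design catches anything with "Linux" in the string, innocent or not.

```python
def claims_linux(user_agent):
    """True when the User-Agent header mentions Linux anywhere."""
    return "linux" in user_agent.lower()


def status_for(user_agent):
    """Pick an HTTP status: 403 for anything claiming Linux, else 200."""
    return 403 if claims_linux(user_agent) else 200
```

With this in place the Minefield requests get the error page, while the regular Bloglines feed fetcher and the blank "-" user agent still get through.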
Anyway, what we've learned is that Ask is using Bloglines' IP to make screenshots and to fetch your robots.txt file, yet they don't disclose what they're even looking for in it.
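If you want to go hunting for the same pattern in your own logs, here's a rough sketch that groups requests by IP and flags any address showing up under more than one user agent. It assumes the trimmed-down log format I've been quoting in this post (IP, request, referer, user agent), not a full Apache log line. The names and the 192.0.2.x sample addresses are placeholders I made up for the example.

```python
import re
from collections import defaultdict

# Matches the abbreviated format used in this post:
#   IP "REQUEST" "REFERER" "USER AGENT"
LINE_RE = re.compile(r'^(\S+) "([^"]*)" "([^"]*)" "([^"]*)"$')

SAMPLE_LOG = [
    '192.0.2.10 "GET /rss_feed.xml" "-" "Bloglines/3.1 (http://www.bloglines.com;XXX subscribers)"',
    '192.0.2.10 "GET /robots.txt" "-" "-"',
    '192.0.2.10 "GET /anyoldpage.html" "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"',
    '192.0.2.20 "GET /index.html" "-" "SomeBrowser/1.0"',
]


def agents_by_ip(lines):
    """Map each IP to the set of user agent strings it sent."""
    seen = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything not in the expected format
        ip, _request, _referer, agent = match.groups()
        seen[ip].add(agent)
    return dict(seen)


def suspicious_ips(lines):
    """IPs that appeared under more than one user agent, sorted."""
    return sorted(ip for ip, agents in agents_by_ip(lines).items()
                  if len(agents) > 1)
```

Running `suspicious_ips(SAMPLE_LOG)` flags only 192.0.2.10, the address that shows up as a feed reader, a blank user agent, and a browser all at once.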
Wasn't that fun?