Saturday, January 05, 2008

Why The Hell Is Bloglines Crawling?

Let's start this investigation by noting that Bloglines themselves claim to be a crawler now when you use reverse DNS on their IP address:

65.214.44.29 -> crawler.bloglines.com
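
If you want to repeat that lookup yourself, here is a minimal Python sketch, not anything Bloglines or Ask publishes. The function name `forward_confirmed_rdns` and its injectable `reverse`/`forward` parameters are my own assumptions for illustration. It does the reverse lookup, then resolves the returned hostname forward again, since anyone who controls their own reverse DNS can claim any name they like.

```python
import socket


def forward_confirmed_rdns(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the PTR hostname for `ip` only if that hostname resolves
    back to the same IP; otherwise return None.

    `reverse` and `forward` are hypothetical injection points so the
    logic can be exercised without touching the network.
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = reverse(ip)[0]
    except OSError:
        # No PTR record, or the lookup failed
        return None
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        addresses = forward(hostname)[2]
    except OSError:
        return None
    # Only trust the name if the forward lookup includes the original IP
    return hostname if ip in addresses else None
```

Calling `forward_confirmed_rdns("65.214.44.29")` performs live DNS queries, so what it returns today depends on whatever records currently exist for that address.
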
This is what Bloglines is supposed to do, read your RSS feed:
65.214.44.29 "GET /rss_feed.xml" "-" "Bloglines/3.1 (http://www.bloglines.com;XXX subscribers)"
However, they've stepped off the RSS path and started coloring outside the lines!

The first odd thing I noticed was that it asked for robots.txt without any user agent defined:
65.214.44.29 "GET /robots.txt" "-" "-"
So I dug a little deeper, and it appears they are running Firefox Minefield, which was loading 3rd party websites where my graphic appears and requesting the image from my server:
65.214.44.29 "GET /myimage.gif" "http://someotherwebsite.com/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
Finally, I found them requesting some web pages that are NOT in any RSS feed, what the fuck?
65.214.44.29 "GET /anyoldpage.html" "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
So, anyone have a clue what they're doing?

SCREENSHOTS!

Yes, they're making screenshots that appear on ASK.com!

I looked up a few pages from one of my sites in ASK and sure enough, instead of screenshots of the actual web pages there were screenshots of error messages with the Bloglines IP address of 65.214.44.29 in big bold numbers.

The reason I figured that out so easily was that I recently decided to block everything claiming to come from Linux, just to see what came up, and that's why they got an error page instead of a screenshot. Sure, I'm probably blocking a few innocent Linux users as well, but they account for an insignificant part of my traffic and their user agents overlap with the same tools that servers use, so sacrifices were made.

Anyway, what we've learned is that Ask is using Bloglines' IP to make screenshots and look at your robots.txt file yet they don't disclose what they're even looking for in your robots.txt file.

Wasn't that fun?

9 comments:

Anonymous said...

Of course that was fun mate :)

While I don't comment much or know much, I, like thousands of others, log in daily to have a read of your brilliance, scanning up and down for any comments from others as well, e.g. Ban Proxies etc.

Go you Good Thing

Anonymous said...

Great. Now this chucklehead is blocking everyone who uses Linux? And some other people seem to think it's actually a good idea?

Time to reconfigure all my browsers on all my Linux boxes to spoof a bog-standard IE-on-Windows UA ... oh wait, I already did that ages ago to get around all the "This site requires Micro$oft Internet Exploder" BS that's out there these days.

What happened to the times when men were men, women wore corsets, and sysadmins adhered to all applicable standards and RFCs on a best-effort basis? These days you can't report a broken link without getting "user unknown" bounces for "webmaster@" that themselves violate email RFCs, surf without using or pretending to use IE, edit half of Wikipedia without registering first, or expect your bookmarks, Back button, or the like to work properly instead of just refreshing the current page or even getting you redirected randomly.

Half this stuff is half-assed responses to abuse that do much more damage to the fabric of the net than the abuse did, and the other half is, as near as I can figure, completely gratuitous. (A fraction, including most instances of "This site requires Javashit", seem to be motivated by greed, usually for advertising dollars.)

I wonder how long before the general backlash against user-hostile web sites happens? An awful lot of web masters seem to be in need of a smack upside the head with a ClooBat(tm)^W^W^W^W^W^W^W reminder that the users come first and your site exists purely to serve its users, without which, after all, your revenue stream will dry up and blow away overnight.

IncrediBILL said...

You sound like a commercial for a bleeding heart soap opera.

Considering 99% of the "browsers" claiming to be of Linux origin come from data centers, just like the one I bounced today from Ask, I'm not terribly worried about the other 10-20 people that were possibly snared breaking down the fabric of the internet.

Besides, I have multiple ways to "block" something, such as drop kick it from the firewall, or Apache, or in software protecting the server which has 2 modes.

One mode just tells them why they were dumped with no recourse while the other mode presents them with some sort of challenge so humans can get in.

I tend to use the latter mode most often so humans can still gain access while the black hats and kiddie scripts sit and spin on their thumbs.
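
For anyone curious what those two modes might look like in code, here is a rough Python sketch. It is not the actual software running on the server. The names `Action` and `decide` and the plain "Linux" substring test are all assumptions made up for illustration.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"  # present a test so humans can still get in
    DENY = "deny"            # explain the block, no recourse


def decide(user_agent, challenge_mode=True):
    """Pick an action for a request based only on its User-Agent string.

    Hypothetical rule: anything claiming to be Linux is suspect.
    A real filter would weigh far more signals than this.
    """
    claims_linux = "linux" in user_agent.lower()
    if not claims_linux:
        return Action.ALLOW
    # Mode 2 challenges the visitor, mode 1 simply dumps them
    return Action.CHALLENGE if challenge_mode else Action.DENY
```

With `challenge_mode=True` the Minefield user agent from the post gets a challenge rather than a flat refusal, which is the trade-off described above: humans on Linux can still get through while unattended scripts cannot.
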

Yup, out of 20K+ visitors blocking 10-20 is going to destroy the internet.

Somehow I don't think so.

Motivation is keeping servers up and running against the overwhelming flood of non-human traffic that can easily overload a dual Xeon box, with the secondary goal of stopping the vermin in the underbelly of the web from profiting off my work.

So far, both have been achieved so piss off.

IncrediBILL said...

BTW, you overlooked that little bit called ACCOUNTABILITY as we have this unwritten agreement with the search engines where we let them in via robots.txt as long as they behave themselves.

Hiding behind IPs not identified as ASK and not letting us know what bot is looking in our robots.txt file kind of breaks that trust between a webmaster and a search engine.

Like I said, a few Linux visitors sacrificed this week to learn the truth was a small price to pay.

Anonymous said...

Linux visitors that had nothing to do with any search engine's misbehavior.

IncrediBILL said...

Did I say my original reason for blocking Linux had anything to do with misbehaving search engines?

Mostly it has to do with scrapers and other nonsense that isn't a search engine at all.

Pay attention.

The fact that Ask and Bloglines got snared in the trap was just an added benefit.

Anonymous said...

Linux usage is probably higher than you think it is. It's not difficult to change your UserAgent and it's quite often necessary to change it in order to use certain parts of the internet.

I had a strange error message yesterday telling me that I had to use Firefox or IE to view the website in question. What made it strange was that I was already using Firefox. It turns out that the JavaScript code was checking the OS as well as the browser, and it required Windows. It also turned out that if you turned JavaScript off, you just got a blank page.

This is a vicious circle in which webmasters think that no one uses Linux and Linux users have to pretend to be Windows users to use the web.

Anonymous said...

Right on, ladadada.

(The innocent Linux users have nothing to do with the scrapers and bots either, of course.)

Anonymous said...

Probably not news to Bill, but may be of value to others reading this.

There are other ways of detecting the bot from Bloglines(/Ask) than blocking Linux UAs:

Grep the access logs for the Bloglines bot's UA, extract the IP, then grep the logs again for that /8 block (or whatever). Once a week, per page that has RSS links, a Minefield browser comes by from the same IPs as the Bloglines bot.

Apply the same tactic for msnbot to find a horde of puppeteered MSIE browsers over at Microplex.
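
The two-pass grep described above can be sketched in Python roughly like this. It assumes the trimmed log format quoted in the post (IP, request, referer, user agent, with an IP literal in the first field). The names `parse` and `disguised_requests` and the default /24 prefix are assumptions picked for illustration, so widen the prefix to /8 if that is the block you want to sweep.

```python
import ipaddress
import re

# Matches lines shaped like: IP "METHOD /path" "referer" "user agent"
LOG_LINE = re.compile(r'^(\S+) "(\S+) (\S+)" "([^"]*)" "([^"]*)"$')


def parse(line):
    """Split one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if not match:
        return None
    ip, _method, path, referer, agent = match.groups()
    return {"ip": ip, "path": path, "referer": referer, "agent": agent}


def disguised_requests(lines, marker="Bloglines", prefix_len=24):
    """Find requests from the declared bot's network that do NOT carry its UA.

    Pass 1 collects the networks of every IP whose user agent contains
    `marker`; pass 2 returns the other requests from those networks.
    """
    entries = [entry for entry in map(parse, lines) if entry]
    networks = {
        ipaddress.ip_network(f"{entry['ip']}/{prefix_len}", strict=False)
        for entry in entries
        if marker in entry["agent"]
    }
    return [
        entry
        for entry in entries
        if marker not in entry["agent"]
        and any(ipaddress.ip_address(entry["ip"]) in net for net in networks)
    ]
```

Feed it the lines of an access log (for example `disguised_requests(open(path))`, where `path` is wherever your log lives) and anything it returns is traffic from the bot's neighborhood that is not announcing itself as the bot. Swapping `marker` to `"msnbot"` applies the same tactic to the other case mentioned above.
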