TopicBlogs hasn't even launched yet, but they've already managed to piss me off by stepping over the boundary.
The RSS feed is fair game, but pulling the linked pages without permission is NOT fair game.
Here's an example:
72.36.205.106 "GET /rss_feed.xml HTTP/1.0" "topicblogs/0.9"
72.36.205.106 "GET /blogpage2.html HTTP/1.0" "topicblogs/0.9"
72.36.205.106 "GET /blogpage3.html HTTP/1.0" "topicblogs/0.9"
72.36.205.106 "GET /blogpage4.html HTTP/1.0" "topicblogs/0.9"
72.36.205.106 "GET /blogpage5.html HTTP/1.0" "topicblogs/0.9"
72.36.205.106 "GET /blogpage6.html HTTP/1.0" "topicblogs/0.9"
72.36.205.106 "GET /blogpage7.html HTTP/1.0" "topicblogs/0.9"
Maybe you people over at TopicBlogs should implement robots.txt to see if we allow you to step off the RSS feed.
Until you fix it, you're just BLOCKED!
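For anyone who wants to spot the same pattern in their own logs, here is a rough sketch in Python. It assumes the abbreviated three-field format shown above (IP, request line, user agent), not a full server log format. The function name off_feed_requests and the FEED_PATHS set are made up for illustration, so adjust both to your own site.

```python
import re

# Matches the abbreviated format from the excerpt: IP "METHOD PATH PROTO" "AGENT".
# A full combined-format server log would need a different pattern.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" "(?P<agent>[^"]*)"$'
)

# Assumption: the feed is the only path a feed-based service needs to fetch.
FEED_PATHS = {"/rss_feed.xml"}


def off_feed_requests(lines, agent_prefix="topicblogs"):
    """Return (ip, path) pairs where the named agent fetched a non-feed path."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the abbreviated format
        if not match["agent"].lower().startswith(agent_prefix):
            continue  # some other client, not the one we are checking
        if match["path"] not in FEED_PATHS:
            hits.append((match["ip"], match["path"]))
    return hits


sample = [
    '72.36.205.106 "GET /rss_feed.xml HTTP/1.0" "topicblogs/0.9"',
    '72.36.205.106 "GET /blogpage2.html HTTP/1.0" "topicblogs/0.9"',
    '72.36.205.106 "GET /blogpage3.html HTTP/1.0" "topicblogs/0.9"',
]

for ip, path in off_feed_requests(sample):
    print(ip, path)
```

The output of a check like this is only a list of offenders. Actually blocking them is a separate step in whatever server or firewall you run.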
3 comments:
Hey Bill -- thanks for reporting these. Have you ever thought of creating a mod_security living ruleset to block out these intrusions? I think it would be a nice idea. :)
Hey Bill,
This is Jeff from topicblogs. I just thought I'd point out that our crawler adheres strictly to the robots exclusion protocol.
Your robots.txt file:
User-agent: *
Disallow:
does not prohibit our crawler (or, in fact, any crawler) from crawling your blog, but I'll be happy to remove it from our crawl list if you so wish.
Feel free to contact me at jeff AT you-know-which-domain.
Jeff
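For reference, the effect of the two-line robots.txt quoted above can be checked with Python's standard urllib.robotparser module. This is only a sketch: the paths are taken from the log excerpt in the post, and the "Disallow: /" variant is added purely for contrast, not something either site is known to serve.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, path):
    """Parse robots.txt lines and report whether the agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)


# The file quoted in the comment: an empty Disallow value excludes nothing.
quoted = ["User-agent: *", "Disallow:"]

# Contrast case (an assumption for illustration): disallow everything.
block_all = ["User-agent: *", "Disallow: /"]

print(allowed(quoted, "topicblogs/0.9", "/blogpage2.html"))
print(allowed(block_all, "topicblogs/0.9", "/blogpage2.html"))
```

An empty Disallow line permits every crawler to fetch every path, while "Disallow: /" turns all of them away.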
Ummm, Jeff, I'm not writing about things crawling my BLOG. I'm on Blogger and have no control over that...
It's a big website elsewhere with 1M visitors, and the point wasn't robots.txt. The point was: why are you grabbing the full text when you should only be taking the RSS feed?