Normally I'm always slamming corporate bots, but when a company like Brandimensions appears to be playing by all the rules, I feel they should get a little praise.
Here's what their access attempts look like:
209.167.50.22 "GET /robots.txt HTTP/1.1" "www.brandimensions.com" "BDFetch"
209.167.50.22 "GET /somepage.html HTTP/1.1" "www.brandimensions.com" "BDFetch"
209.167.50.22 "GET /robots.txt HTTP/1.1" "www.brandimensions.com" "BDFetch"
209.167.50.22 "GET /somepage.html HTTP/1.1" "www.brandimensions.com" "BDFetch"
209.167.50.22 "GET /robots.txt HTTP/1.1" "www.brandimensions.com" "BDFetch"
209.167.50.22 "GET /somepage.html HTTP/1.1" "www.brandimensions.com" "BDFetch"
At least they asked for robots.txt and appear to only go in when allowed.
However, they had a couple of bumps that I'd like to see them fix.
1. Ask for robots.txt once or twice a day, maybe once an hour worst case, not on every access.
2. Set your reverse DNS to say bdfetch.brandimensions.com or something similar so we can verify it's really your company and not someone spoofing you.
3. Include a link to a page about your crawler in the user agent, and a version number, such as "BDFetch/1.0 +http://www.brandimensions.com/crawler.html"
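For what it's worth, suggestions 1 and 3 are cheap to do on the crawler side. Here's a rough Python sketch, not anything Brandimensions actually runs: the RobotsCache class, the one-hour TTL and the fetch callback are all names I made up for illustration, and the user agent string is just the example from item 3.

```python
import time
import urllib.robotparser
from typing import Callable, Dict, Tuple

# Hypothetical values for illustration only; the user agent is the example from item 3.
USER_AGENT = "BDFetch/1.0 +http://www.brandimensions.com/crawler.html"
ROBOTS_TTL_SECONDS = 60 * 60  # re-fetch robots.txt at most once an hour per host


class RobotsCache:
    """Keep parsed robots.txt rules per host so they are not re-fetched on every access."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl: float = ROBOTS_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # assumed to return the robots.txt body for a host
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, urllib.robotparser.RobotFileParser]] = {}

    def allowed(self, host: str, path: str) -> bool:
        """Return True if the cached rules for host let our user agent fetch path."""
        now = self._clock()
        entry = self._entries.get(host)
        if entry is None or now - entry[0] >= self._ttl:
            # Cache miss or stale entry: fetch and parse robots.txt once, then reuse it.
            parser = urllib.robotparser.RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            entry = (now, parser)
            self._entries[host] = entry
        return entry[1].can_fetch(USER_AGENT, path)
```

The crawler would call allowed(host, path) before each page request; only the first call in any hour actually hits robots.txt, which would turn the log above into one robots.txt line followed by page requests.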
Other than those minor glitches, kudos for at least trying to play by the rules and at least giving webmasters the choice to allow you to crawl or not.
Nicely done.
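If they do set up reverse DNS as suggested in item 2, a webmaster could check a visitor claiming to be BDFetch with a forward-confirmed reverse DNS lookup. This is only a sketch under assumptions: the ".brandimensions.com" suffix is my guess at what their hostnames would end in, and is_genuine_crawler is a made-up helper name.

```python
import socket
from typing import Callable, List

# Assumption: the crawler's reverse DNS names would end in this suffix.
EXPECTED_SUFFIX = ".brandimensions.com"


def _reverse(ip: str) -> str:
    """Look up the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Look up the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_genuine_crawler(
    ip: str,
    suffix: str = EXPECTED_SUFFIX,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS: the PTR name must match the suffix and resolve back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS at all
    if not host.endswith(suffix):
        return False  # reverse DNS points somewhere else
    try:
        # Anyone can fake a PTR record for their own IP, so confirm the name maps back.
        return ip in forward(host)
    except OSError:
        return False
```

The forward lookup is the important half: a spoofer controlling their own reverse zone can claim any hostname, but they can't make the real brandimensions.com zone point back at their IP.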
3 comments:
Thanks for your kudos. I'll forward your blog entry to my colleagues at Brandimensions -- it'll make their day.
One of our company's core values is to govern ourselves with integrity and professionalism. Our crawler playing by the rules and being a good Internet "citizen" is a result of this tenet.
Your suggestions make sense. Stay tuned to your web logs over the next few weeks...
Hugh Hyndman
CTO
www.brandimensions.com
Kudos are well deserved; however, their crawler reports "www.brandimensions.com" as the referring URL. Other robots (Yahoo, Google, etc.) simply have a "-" for the referring URL. That would be _much_ preferred. Unfortunately, I've set my site to exclude their robot because of this problem.
I'm seeing two types of traffic coming from the same IP address as BDFetch: one with user agent 'BDFetch' doing a few page views a day, the other with user agent 'Mozilla/4.0' etc. doing 10,000 or so views a day.