Just in the last few weeks I've been seeing some really odd hits on robots.txt from things claiming to be browsers and acting the part: loading images like a browser, the whole nine yards.
Bot #1 - Post-crawl Robots.txt Reader
What I'm seeing is that instead of requesting robots.txt up front, which is a trigger to shut down a bot, these things read robots.txt after one or two pages have been read. That way they can snoop my robots.txt file without doing it first, which avoids being stopped and lets them collect a safe page or two to find out what my pages are for a future crawl.
That's my theory, and I wouldn't have considered it except that I've seen the exact same behavior multiple times.
Time to start setting some new traps and see who crawls with the information gathered from these probes.
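One way such a trap could look, as a rough sketch rather than anything actually running here: walk the requests in time order and flag any client whose first robots.txt read comes after it has already fetched one or two pages. The function name, the `(client, path)` tuple shape, the asset-extension list, and the two-page threshold are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical list of extensions treated as page assets rather than pages.
ASSET_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js")


def is_page(path):
    """Treat anything that is not an obvious static asset as a page."""
    return not path.lower().endswith(ASSET_EXTENSIONS)


def late_robots_readers(requests, max_pages_before=2):
    """Flag clients whose first robots.txt read follows 1..N page fetches.

    `requests` is an iterable of (client, path) tuples in time order.
    Returns {client: pages_fetched_before_robots_txt}.
    """
    pages_seen = defaultdict(int)  # client -> pages fetched so far
    checked = set()                # clients whose first robots.txt read was seen
    flagged = {}
    for client, path in requests:
        if path == "/robots.txt":
            if client not in checked:
                checked.add(client)
                count = pages_seen[client]
                if 1 <= count <= max_pages_before:
                    flagged[client] = count
        elif is_page(path):
            pages_seen[client] += 1
    return flagged
```

Clients that read robots.txt first are left alone by this check, and so are clients that crawl many pages before looking, since those are a different pattern from the "page or two, then peek" behavior described above.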
Bot #2 - 3 Phase Crawler
Next on the list is a stealth bot that looks like it's either taking a screenshot of the first page or downloading images just to try to trick my software into thinking it's human.
This beast does the following:
- Reads robots.txt with a blank user agent string
- Loads the home page as Linux Firefox and downloads all associated images, which appears to be it taking a screenshot
- Crawls the rest of the pages on the site disguised as Internet Explorer
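The three phases above could be spotted with a small per-client state machine. This is only a sketch under stated assumptions: it assumes all three phases arrive from the same IP address (the post doesn't say they do), it classifies user agents with crude substring checks, and the function names and the `(ip, path, user_agent)` tuple shape are hypothetical.

```python
def ua_family(user_agent):
    """Crude user agent classification; substring checks are an assumption."""
    ua = (user_agent or "").strip()
    if ua in ("", "-"):
        return "blank"
    if "MSIE" in ua:
        return "ie"
    if "Firefox" in ua and "Linux" in ua:
        return "linux-firefox"
    return "other"


def three_phase_clients(requests):
    """Return IPs that went blank-UA robots.txt -> Linux Firefox -> IE.

    `requests` is an iterable of (ip, path, user_agent) in time order.
    """
    state = {}  # ip -> number of phases matched so far (0..3)
    for ip, path, user_agent in requests:
        phase = state.get(ip, 0)
        family = ua_family(user_agent)
        if phase == 0 and family == "blank" and path == "/robots.txt":
            state[ip] = 1  # phase 1: robots.txt read with a blank user agent
        elif phase == 1 and family == "linux-firefox":
            state[ip] = 2  # phase 2: home page and images as Linux Firefox
        elif phase == 2 and family == "ie":
            state[ip] = 3  # phase 3: crawling disguised as Internet Explorer
    return {ip for ip, phase in state.items() if phase == 3}
```

If the phases come from different addresses, keying on IP alone will miss them, and some other grouping (a time window, a subnet) would have to stand in.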
Here's something amusing: what appears to be a Ukrainian spider followed a link from another site to an image on my website, downloaded it as Internet Explorer, and 4 seconds later hit robots.txt as an anonymous user agent.
188.8.131.52 - [12:20:47] "GET /banner.gif" "http://www.someotherwebsite.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"
184.108.40.206 - [12:20:51] "GET /robots.txt" "-" "-"
This may be related to Bot #2 above, not sure, but I've seen a few hits like this where they follow the link and peek to see what's allowed and don't go any further.
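A "follow the link, then peek" pair like the one above can be picked out of a log mechanically. The sketch below assumes log lines in exactly the abbreviated shape shown (address, time, request, referrer, user agent), and pairs adjacent lines by time alone, since the two addresses in the sample differ. The regex, function names, and 10-second window are all assumptions for illustration.

```python
import re
from datetime import datetime

# Matches the abbreviated line shape shown above, not a standard log format.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) - \[(?P<time>\d{2}:\d{2}:\d{2})\] '
    r'"(?P<method>\S+) (?P<path>\S+)" "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse(line):
    """Parse one log line into a dict, or return None if it doesn't match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hit = match.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%H:%M:%S")
    return hit


def link_then_peek(lines, window_seconds=10):
    """Find a referred hit followed shortly by a blank-agent robots.txt read.

    Only adjacent parsed lines are compared. Returns a list of
    (first_path, gap_in_seconds) tuples.
    """
    hits = [hit for hit in map(parse, lines) if hit is not None]
    pairs = []
    for first, second in zip(hits, hits[1:]):
        gap = (second["time"] - first["time"]).total_seconds()
        if (first["referrer"] not in ("", "-")
                and second["path"] == "/robots.txt"
                and second["agent"] in ("", "-")
                and 0 <= gap <= window_seconds):
            pairs.append((first["path"], gap))
    return pairs
```

On a real log with other traffic interleaved, comparing only adjacent lines would miss most pairs, so matching on the client address within the time window would be the natural next step.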