Just in the last few weeks I've been seeing some really odd hits on robots.txt from things claiming to be browsers, loading images like a browser, the whole nine yards.
Bot #1 - Post-crawl Robots.txt Reader
What I'm seeing is that instead of robots.txt being read up front, which is a trigger to shut down a bot, robots.txt gets read after one or two pages have been read. That way they can snoop my robots.txt file without asking for it first, which avoids being stopped while they collect a safe page or two and find out what my pages are for a future crawl.
That's my theory and I wouldn't have considered this the case except I've seen the exact same behavior multiple times.
Time to start setting some new traps and see who crawls with the information gathered from these probes.
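For anyone who wants to spot the same pattern, here is a minimal sketch of one way to flag it, not the trap I actually run. It assumes you already have your hits as (ip, path) pairs in time order, with image and other asset requests filtered out. The function name late_robots_readers and the max_pages_before cutoff are made up for illustration.

```python
from collections import defaultdict

def late_robots_readers(requests, max_pages_before=2):
    """Return the IPs whose first robots.txt hit came after 1..max_pages_before pages.

    requests: iterable of (ip, path) tuples in chronological order.
    Assumption: asset requests (images, CSS) were already filtered out.
    """
    pages_before = defaultdict(int)  # pages fetched per IP before robots.txt
    robots_seen = set()              # IPs that have already read robots.txt
    flagged = set()
    for ip, path in requests:
        if ip in robots_seen:
            continue  # only the first robots.txt read matters here
        if path == "/robots.txt":
            robots_seen.add(ip)
            if 1 <= pages_before[ip] <= max_pages_before:
                flagged.add(ip)
        else:
            pages_before[ip] += 1
    return flagged
```

An IP that reads robots.txt first, or never reads it at all, is not flagged. Only the "page or two, then robots.txt" order is.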
Bot #2 - 3 Phase Crawler
Next on the list is a stealth bot that looks like it's either taking a screen shot on the first page or downloading images just to try and trick my software into thinking it's human.
This beast does the following:
- Reads robots.txt with a blank user agent string
- Loads the home page as Linux Firefox and downloads all associated images, which looks like it's taking a screen shot
- Crawls the rest of the pages on the site disguised as Internet Explorer
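The three phases above can be sketched as a simple per-IP check: a blank user agent on robots.txt, followed by two or more different non-blank user agents from the same address. This is only an illustration under the assumption that hits are available as (ip, path, user_agent) tuples and that a blank agent is logged as "" or "-". The name three_phase_suspects is hypothetical.

```python
from collections import defaultdict

def three_phase_suspects(requests):
    """Return IPs that read robots.txt with a blank user agent and
    also showed up with at least two different non-blank user agents.

    requests: iterable of (ip, path, user_agent) tuples.
    Assumption: a blank user agent is logged as "" or "-".
    """
    blank_robots = set()        # IPs that fetched robots.txt anonymously
    agents = defaultdict(set)   # distinct non-blank user agents per IP
    for ip, path, user_agent in requests:
        user_agent = user_agent.strip()
        if user_agent in ("", "-"):
            if path == "/robots.txt":
                blank_robots.add(ip)
        else:
            agents[ip].add(user_agent)
    return {ip for ip in blank_robots if len(agents.get(ip, ())) >= 2}
```

Shared proxies and NAT gateways can legitimately show several browsers behind one address, so a hit from this check is a reason to look closer, not proof of a bot.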
Here's something amusing with what appears to be a Ukrainian spider that downloaded a linked image to my website as Internet Explorer and 4 seconds later hit robots.txt as an anonymous user agent.
82.207.93.90 - [12:20:47] "GET /banner.gif" "http://www.someotherwebsite.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"
82.207.93.90 - [12:20:51] "GET /robots.txt" "-" "-"
This may be related to Bot #2 above, not sure, but I've seen a few hits like this where they follow the link and peek to see what's allowed and don't go any further.
Very odd.
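A rough sketch of how a pair of hits like the two above could be matched automatically. The regular expression assumes the abbreviated layout shown in this post (IP, time of day in brackets, request, referer, user agent), which is not a standard server log format. The names parse and is_link_then_peek and the 10 second window are assumptions for illustration. Timestamps carry no date, so a pair that straddles midnight would be missed.

```python
import re
from datetime import datetime

# Matches the abbreviated lines quoted in this post, not a full server log.
LINE = re.compile(
    r'^(\S+) - \[(\d{2}:\d{2}:\d{2})\] "GET (\S+)" "([^"]*)" "([^"]*)"$'
)

def parse(line):
    """Return (ip, time, path, referer, agent), or None if the line does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    ip, clock, path, referer, agent = m.groups()
    return ip, datetime.strptime(clock, "%H:%M:%S"), path, referer, agent

def is_link_then_peek(first, second, max_gap_seconds=10):
    """True if one IP fetched something and then, within max_gap_seconds,
    fetched robots.txt with no referer and no user agent."""
    a, b = parse(first), parse(second)
    if a is None or b is None:
        return False
    gap = (b[1] - a[1]).total_seconds()
    anonymous = b[3] == "-" and b[4] == "-"
    return (
        a[0] == b[0]
        and a[2] != "/robots.txt"
        and b[2] == "/robots.txt"
        and anonymous
        and 0 <= gap <= max_gap_seconds
    )

image_hit = ('82.207.93.90 - [12:20:47] "GET /banner.gif" '
             '"http://www.someotherwebsite.com" '
             '"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"')
robots_hit = '82.207.93.90 - [12:20:51] "GET /robots.txt" "-" "-"'
print(is_link_then_peek(image_hit, robots_hit))  # prints True
```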
3 comments:
Interesting findings, Bill :-)
In regard to example #3, I noticed this several times in my logfiles too. Only difference was the request came from addresses which turned out to be rented servers. A search for the ip address would then reveal that these were in fact spambots.
My guess is the image download test is done to see which user agents may slip through. Kind of ruleset probing, if you will. Whenever I get hit by suspicious-looking visitors from Russia or Ukraine, I'm very quick to block large chunks of the ISP responsible, before they return and start spamming the weblog...
Olliver
- conspiracy department -
Russia/Ukraine has all sorts of nasties as I track a lot of scrapers back there too since I don't really have anything for them to spam, and even if I did, their bots wouldn't be able to do it! ;)
It's quite difficult to keep a balance here, because I sometimes have legitimate visitors from Russia/Ukraine too, who came from search engines and weren't looking for spam samples of theirs, but for actual subjects I wrote about ;-).