Tuesday, July 31, 2007

Attempted Distributed Scrape from SAIX.net

This is the kind of scrape attack I warn my bot-blocking comrades in arms they would probably miss, because it's distributed over multiple IP addresses. Had the scraper not left the default user agent "Java/1.6.0_02", most anti-scrapers would be helpless against this type of scrape.

Here's a sample of the activity:

198.54.202.246 [ctb-cache7-vif1.saix.net.] requested 3 pages as "Java/1.6.0_02"
198.54.202.194 [ctb-cache4-vif1.saix.net.] requested 1 pages as "Java/1.6.0_02"
196.25.255.210 [rba-cache2-vif0.saix.net.] requested 3 pages as "Java/1.6.0_02"
198.54.202.195 [ctb-cache5-vif1.saix.net.] requested 3 pages as "Java/1.6.0_02"
196.25.255.218 [rrba-ip-pcache-6-vif0.saix.net.] requested 4 pages as "Java/1.6.0_02"
198.54.202.214 [rrba-ip-pcache-5-vif1.saix.net.] requested 4 pages as "Java/1.6.0_02"
196.25.255.195 [ctb-cache5-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
198.54.202.210 [rba-cache2-vif1.saix.net.] requested 2 pages as "Java/1.6.0_02"
198.54.202.218 [rrba-ip-pcache-6-vif1.saix.net.] requested 2 pages as "Java/1.6.0_02"
196.25.255.214 [rrba-ip-pcache-5-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
198.54.202.234 [rba-cache1-vif0.saix.net.] requested 3 pages as "Java/1.6.0_02"
196.25.255.194 [ctb-cache4-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
196.25.255.250 [ctb-cache8-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"

This is a prime example of why standard bot blocking that only tracks a single IP address would fail: these are all proxy servers that claim to be forwarding on behalf of 41.240.133.235 [dsl-240-133-235.telkomadsl.co.za].

Assuming these script kiddies fix the default UA, all that needs to be done to stop them is to track access based on the proxy's forwarded IP, which I do, and that makes stopping this kind of nonsense child's play.
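
For anyone who wants to try this, here's a rough sketch of the idea, not my actual code. It assumes the proxy reports the original client in an "X-Forwarded-For" header and that header names arrive in exactly that capitalization. The names client_key and AccessTracker, and the limit of 5, are made up for illustration.

```python
from collections import defaultdict


def client_key(remote_addr, headers):
    """Pick the IP to count against: the forwarded client when a proxy names one.

    Assumes the header is called X-Forwarded-For and is spelled exactly so.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    # The header can hold a comma-separated chain; the first entry is the claimed origin.
    first = forwarded.split(",")[0].strip()
    return first or remote_addr


class AccessTracker:
    """Counts requests per client key instead of per connecting IP."""

    def __init__(self, limit):
        self.limit = limit  # hypothetical threshold, tune for your own site
        self.counts = defaultdict(int)

    def record(self, remote_addr, headers):
        """Count one request; return True once the client has exceeded the limit."""
        key = client_key(remote_addr, headers)
        self.counts[key] += 1
        return self.counts[key] > self.limit


# Replay the first three proxies from the sample above (3 + 1 + 3 requests).
tracker = AccessTracker(limit=5)
forwarded_for = {"X-Forwarded-For": "41.240.133.235"}
proxy_hits = ["198.54.202.246"] * 3 + ["198.54.202.194"] + ["196.25.255.210"] * 3
flags = [tracker.record(proxy_ip, forwarded_for) for proxy_ip in proxy_hits]
```

Counted per proxy, none of those three IPs makes more than 3 requests. Counted against the forwarded address, all 7 land on one client and the limit trips. Keep in mind that a forwarded header is only a claim made by the proxy, so treat it as a hint rather than proof.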

FYI, before anyone asks stupid questions like "How do you know it was a scraper?", it's because my page names were accessed in sequential alphabetical order. Other than being distributed among many IPs via the SAIX caching proxy, which could be hard to identify in a log file review, the rest looked like amateur hour at the scraping faire.
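
The alphabetical-order tell is easy to check for once requests are grouped by the forwarded client rather than by proxy. Here's a minimal sketch, again not my production code: it takes the page names one client requested, in the order requested, and finds the longest ascending run. The function names and the min_run threshold of 8 are assumptions picked for the example.

```python
def longest_alphabetical_run(paths):
    """Length of the longest run of consecutive requests in ascending name order."""
    if not paths:
        return 0
    best = run = 1
    for prev, cur in zip(paths, paths[1:]):
        # Extend the run while each name sorts strictly after the previous one.
        run = run + 1 if cur.lower() > prev.lower() else 1
        best = max(best, run)
    return best


def looks_like_alphabetical_crawl(paths, min_run=8):
    """Flag a client whose requests walk the page names in sorted order.

    min_run is a hypothetical threshold; humans rarely browse in sorted order.
    """
    return longest_alphabetical_run(paths) >= min_run
```

A normal visitor jumps around, so their runs stay short. A scraper walking a sorted list of pages produces one long run even when the requests are spread across a dozen proxy IPs.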

This is why I tell people post-mortem Apache log file reviews simply don't work: there is insufficient information to identify things that my code easily catches in real time.