This is the kind of scrape attack I warn my fellow bot blockers they would probably miss, because it's distributed over multiple IP addresses. Had the scraper not left the default user agent "Java/1.6.0_02", most anti-scraping tools would be helpless against this type of scrape.
Here's a sample of the activity:
198.54.202.246 [ctb-cache7-vif1.saix.net.] requested 3 pages as "Java/1.6.0_02"
198.54.202.194 [ctb-cache4-vif1.saix.net.] requested 1 pages as "Java/1.6.0_02"
196.25.255.210 [rba-cache2-vif0.saix.net.] requested 3 pages as "Java/1.6.0_02"
198.54.202.195 [ctb-cache5-vif1.saix.net.] requested 3 pages as "Java/1.6.0_02"
196.25.255.218 [rrba-ip-pcache-6-vif0.saix.net.] requested 4 pages as "Java/1.6.0_02"
198.54.202.214 [rrba-ip-pcache-5-vif1.saix.net.] requested 4 pages as "Java/1.6.0_02"
196.25.255.195 [ctb-cache5-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
198.54.202.210 [rba-cache2-vif1.saix.net.] requested 2 pages as "Java/1.6.0_02"
198.54.202.218 [rrba-ip-pcache-6-vif1.saix.net.] requested 2 pages as "Java/1.6.0_02"
196.25.255.214 [rrba-ip-pcache-5-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
198.54.202.234 [rba-cache1-vif0.saix.net.] requested 3 pages as "Java/1.6.0_02"
196.25.255.194 [ctb-cache4-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
196.25.255.250 [ctb-cache8-vif0.saix.net.] requested 1 pages as "Java/1.6.0_02"
This is a prime example of why standard bot blocking that only tracks a single IP address would fail: these are all proxy servers that claim to be forwarding on behalf of 41.240.133.235 [dsl-240-133-235.telkomadsl.co.za].
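For illustration only, here's a minimal sketch of the kind of default-UA check that would have caught this scraper. The prefix list and function name are my own assumptions, not anyone's production bot blocker:

```python
# Hypothetical sketch: flag requests whose User-Agent is an unmodified
# HTTP library default, like the "Java/1.6.0_02" seen in the log above.
# The prefix list below is illustrative, not exhaustive.
DEFAULT_UA_PREFIXES = ("Java/", "Python-urllib", "libwww-perl", "curl/", "Wget/")

def is_default_library_ua(user_agent: str) -> bool:
    """Return True when the UA looks like an out-of-the-box HTTP client."""
    return user_agent.startswith(DEFAULT_UA_PREFIXES)
```

A check this cheap runs per request with no state at all, which is why leaving the default UA is such an amateur mistake.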
Assuming these script kiddies eventually fix the default UA, all that needs to be done to stop them is to track access by the proxy's forwarded IP, which I already do, making this kind of nonsense child's play to stop.
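As a rough sketch of that idea, assuming the SAIX caches announce the client in a standard X-Forwarded-For header (the post only says they "claim to be forwarding" for one address), per-visitor counting could key on the forwarded IP instead of the connecting one. All names here are illustrative:

```python
from collections import defaultdict

# Hypothetical sketch: attribute each request to the originating client
# IP from X-Forwarded-For when a proxy supplies one, so requests fanned
# out across many cache IPs collapse back to one visitor.
hits = defaultdict(int)

def track(remote_addr: str, headers: dict) -> str:
    forwarded = headers.get("X-Forwarded-For", "")
    # First address in the chain is the original client, if present.
    client = forwarded.split(",")[0].strip() or remote_addr
    hits[client] += 1
    return client
```

With this keying, the thirteen proxy IPs in the log above would all roll up to the single Telkom ADSL address behind them.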
FYI, before anyone asks stupid questions like "How do you know it was a scraper?": it accessed my page names in sequential alphabetical order. Other than being distributed among many IPs via the SAIX caching proxies, which could be hard to spot in a log file review, the rest looked like amateur hour at the scraping faire.
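That alphabetical tell is easy to test for mechanically. A toy version, with an invented function name and an arbitrary run length, might look like this:

```python
# Hypothetical sketch: flag a visitor whose recent page requests arrive
# in strictly ascending alphabetical order, a pattern a human clicking
# links is very unlikely to produce. The run length of 5 is arbitrary.
def looks_sequential(pages: list[str], min_run: int = 5) -> bool:
    if len(pages) < min_run:
        return False
    recent = pages[-min_run:]
    return all(a < b for a, b in zip(recent, recent[1:]))
```

A real-time blocker can run a check like this on each visitor's rolling request history, which is exactly the sort of signal a postmortem log review rarely surfaces.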
This is why I tell people post-mortem Apache log file reviews simply don't work: the logs contain insufficient information to identify things my code easily catches in real time.
10 comments:
Interesting stuff. So how exactly do you track access based on the proxy forward IP?
I'm sure there is a market for Bill's weapons-grade anti-scraping software.
Yes, Bill, ship a product!
We're on FreeBSD, and our site is basically a big, uncopyrightable database. We need something like this!
Let us know if you want beta testers.
FreeBSD, Apache 1.3 and MySQL 4
One of these days I'm going to call Bill out... Put up a test site, and challenge us! Come on, it'll be fun! CaptureTheFlag 2.0.... it looks like a great opportunity for you, given how well your stuff works.
I'd be happy to sit on the contest advisory panel.... :-)
"I'm sure there is a market for Bill's weapons-grade anti-scraping software."
Yeah ... in Iran. :P
Honestly Bill, when are you gonna come up with this software, brother... you're busting our balls :) Thieving scrapers and crawlers are making me sick.
I want to be first user and affiliate heavens willing.
My best
mick
Don't worry about the shrink wrapping. All we need is a tarball with a README.txt file :)
I enjoy reading your blog, but sometimes it appears to be more of a training camp for the "bad guys". In this blog post you are giving away our tricks to stop spammers/scrapers etc. "They all know it", you might say, but if they know it, why the h*** are they still using "this method" that is so easily blocked?
Training camp?
Nope, the smart ones were already flying under conventional radar to avoid being blocked long before I showed up, and the stupid ones are still stupid and get blocked all the time.