Tuesday, May 29, 2007

Bot Blocker Tracking More Than 80K Unique IPs

Lately I've been doing some analysis on my database of IPs that I'm tracking for bad behavior, and it now exceeds 80K unique IPs. Many of these are from data centers, botnets, home-based scrapers and then some, but it's a staggering number any way you slice it.

People always wonder why I'm such an anti-scrape nazi, but it's really not hard to see the problem when you multiply those 80K IPs by the 40K+ pages they're trying to scrape, which is a potential for over 3 BILLION pages scraped in the last year.

Here's the number with all the zeroes: 3,200,000,000 pages.

OK, that's really a lot of pages and there's no way I'm paying for that kind of bandwidth.

I seriously doubt they would ever hit the maximum number of pages, but there's no way I'm unlocking the doors and letting them run rampant just to find out how bad it would really get.

Here's a sample of 3 greedy fuckers that paid a visit just today:

82.34.200.237 [82-34-200-237.cable.ubr05.hari.blueyonder.co.uk.] requested 710 pages as "Mozilla/4.0 (compatible; GoogleToolbar 4.0.1020.2544-big; Windows XP 5.1; MSIE 6.0.2900.2180)"

70.80.186.223 [modemcable223.186-80-70.mc.videotron.ca.] requested 1071 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

201.58.219.234 [20158219234.user.veloxzone.com.br.] requested 329 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"

They only got a couple of pages before getting nothing but garbage, but they just keep trying. Based on the location of the IPs, I'm thinking it might be compromised machines in a botnet trying to scrape from stealth locations, but it's hard to say.

The best part is, they're now charter members of my AUTO-QUARANTINE list of IPs meaning they're blocked from accessing any pages on their next trip unless a human is at the controls, and even then, they could get locked out real fast if they aren't careful!
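
For anyone curious about the general shape of the auto-quarantine idea, here's a rough sketch -- this is NOT my actual code, and the quarantine store and human-check flag are just stand-ins for illustration:

```python
# Rough sketch of the auto-quarantine concept -- not the production code.
# The quarantine set and the passed_human_check flag are placeholders.

quarantine = set()   # IPs flagged as bots on a previous visit

def flag_as_bot(ip):
    """Called once behavior analysis decides an IP is a bot."""
    quarantine.add(ip)

def handle_request(ip, passed_human_check=False):
    """Serve garbage to quarantined IPs unless a human is at the controls."""
    if ip in quarantine and not passed_human_check:
        return "garbage"
    return "real page"
```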

11 comments:

Anonymous said...

Hey Bill,

What are you gonna do when your system doesn't work with IPv6? IPv6 will have tons and tons of IPs that people can use, so it won't be feasible to block by IP anymore...

G-Man

IncrediBILL said...

Sorry dude, but IPv6 is child's play.

Nice to see all you nay-sayers, aka spammers and scrapers, would like to see the knights in white satin fail, but it's not gonna happen.

I've got more data processing conceptual knowledge in my little finger than you have in your entire botnet.

Anonymous said...

What got those IPs identified as bots within "a couple of pages"?

IncrediBILL said...

What got them blocked was their behavior, as bots do certain things humans don't do. The fact that they continued asking for hundreds of pages after being detected pretty much makes it a slam dunk they were bots.

If I gave away all my secrets then the bots might make it harder to detect them so quickly.
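
That said, here's a completely generic example of the kind of velocity check anybody could write -- this is NOT my secret sauce, just an illustration of one behavior humans rarely exhibit (hammering dozens of pages inside a minute), with made-up thresholds:

```python
# Generic request-velocity check -- an illustration only, not the real
# detection logic. The window and threshold are made-up numbers.
import time
from collections import defaultdict, deque

WINDOW = 60      # seconds
THRESHOLD = 30   # page requests per window a human is unlikely to hit

hits = defaultdict(deque)

def looks_like_a_bot(ip, now=None):
    now = time.time() if now is None else now
    q = hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:
        q.popleft()           # drop requests that fell out of the window
    return len(q) > THRESHOLD
```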

Anonymous said...

In this particular instance, I'm not looking for the knight in white satin to fall LOL.

I am, however, curious how it's possible to block by IP with IPV6.

Here's my line of thinking...

The current cost of getting another IP is rather high because of the limited availability of IPs.

With IPv6, we'll have billions and billions of IPs available for every person on the planet.

Simple supply and demand would dictate that the price of those IPs would drop to a level where it's easy for anyone to get a new IP at the drop of a hat - even programmatically.

So with that in mind - how can one block by IP?

G-Man

P.S. I don't run botnets - I avoid the illegal stuff :)

IncrediBILL said...

Sorry G-Man, I'm so used to being attacked by every little scraper that I go on the offensive too easily these days.

You are right that IPv6 will open up a boatload of new IPs, but the rules of engagement still haven't changed as best I know. Blocks of IPs that are sold will fall into a couple of realms: either data centers, which can just be dismissed entirely, or ISPs that supply connectivity to individuals.

What I would expect to see with new and hopefully CHEAP IPs available is the big proxy sites giving way to everyone on the planet having a unique IP. That would be perfect, because when someone fucks up they're blocked for good, without that IP address being re-issued to the next customer that dials out.

Theoretically, this could spell the end of scrapers hiding behind shared IPs, but probably not.

Not too concerned at the moment, we'll deal with the ever-growing list of IPs when it happens.
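
To make the range idea concrete, here's a little sketch of blocking by allocated netblock instead of by single IP -- the networks below are reserved documentation ranges, not real offenders:

```python
# Sketch of range-based blocking: dismiss a whole data-center or ISP
# allocation at once instead of chasing individual IPs.
# The netblocks below are reserved example ranges, not real offenders.
import ipaddress

blocked_ranges = [
    ipaddress.ip_network("2001:db8:dead::/48"),   # example IPv6 data-center block
    ipaddress.ip_network("198.51.100.0/24"),      # example IPv4 allocation
]

def is_blocked(ip_str):
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in blocked_ranges)

print(is_blocked("2001:db8:dead:beef::1"))   # True
print(is_blocked("203.0.113.7"))             # False
```

Whether the scraper shows up on an IPv4 or an IPv6 address, the check is the same: does it land inside a range I've already written off.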

Anonymous said...

I block most scrapers and spammers manually!

I have been using pure HTML and CGI, but have moved my comment pages/blog over to PHP and installed Akismet and Bad Behavior.

IncrediBILL, please tell us what you use to bust and block all the "pests" on your site - we need to learn from an expert.

Thanks :-)

IncrediBILL said...

It's my own code, about to be available to all. Been a long time coming but the scale of the information I have is overwhelming.

Took time to sort, collate, organize, and put into some format easily distributable to the masses.

Originally I anticipated it would be launched around Jan/Feb '07, but many things came into play that I didn't anticipate.

Keep your eyes open, announcements coming soon.

FYI, initially it's only Linux/Apache compatible but I'm thinking about opening up the API to other platforms.

Anonymous said...

I have to say, that's really an amazing number. 80,000 different IP addresses being used to steal content.

I'm curious to see the API you come up with to deal with this problem.

Anonymous said...

So where can we get this list of IPs???

IncrediBILL said...

You wouldn't want 80K IPs if I gave them to you.

If you put them all in your firewall the server would grind to a halt.

I use a database and custom code to manage it, but I'm trying to condense the list into something more manageable; it's just taking time, a lot of time...
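
If you're wondering what "condensing" looks like in principle, it's basically collapsing runs of adjacent bad IPs into CIDR ranges so a firewall or .htaccess file isn't choking on 80K individual entries. My real setup is database-driven, but here's the general concept with made-up addresses:

```python
# Collapse adjacent IPs into CIDR ranges -- the concept behind condensing
# a big blocklist. Addresses below are reserved example ranges.
import ipaddress

bad_ips = ["192.0.2.4", "192.0.2.5", "192.0.2.6", "192.0.2.7", "198.51.100.9"]

networks = [ipaddress.ip_network(ip) for ip in bad_ips]
condensed = ipaddress.collapse_addresses(networks)

for net in condensed:
    print(net)
# 192.0.2.4/30
# 198.51.100.9/32
```

Four consecutive addresses become one /30 entry, so the list shrinks without letting anybody off the hook.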