Friday, January 28, 2011

GlueText Crawlers Identified and Blocked

I started noticing some leeched content showing up on a site called GlueText, which made me curious about how they were gathering their content.

Turns out they were initially using the default libwww-perl user agent back in '09:

99.231.221.217 "libwww-perl/5.820"
Looks like they got a little smarter after being bounced by sites and switched to the old Netscape Navigator user agent for Win 98, which they still use today!
99.231.78.89 "Mozilla/4.76 [en] (Win98; U)"
GlueText appears to have historically used the following IPs:
99.231.78.89
CPE0024b2cbf30a-CM0016b536fb82.cpe.net.cable.rogers.com.

173.203.215.230
173-203-215-230.static.cloud-ips.com.

99.231.221.217
CPE0009a30119af-CM0016b536fb82.cpe.net.cable.rogers.com.

99.231.44.115
CPE002436a0fbf3-CM0017ee4740ec.cpe.net.cable.rogers.com.

76.65.207.92
TOROON63-1279381340.sdsl.bell.ca.
My most recent test showed GlueText now crawling from the following IPs, all on cloud-ips.com:
173.203.210.51
173.203.210.95
173.203.215.230
173.203.241.192
Other IPs still involved:
76.65.207.92 -> TOROON63-1279381340.sdsl.bell.ca

99.231.78.89 -> CPE0024b2cbf30a-CM0016b536fb82.cpe.net.cable.rogers.com
It doesn't request robots.txt, fakes a Netscape user agent to gain access without permission, doesn't appear to document how it crawls content, and doesn't appear to give webmasters any way to opt out.

BAD ROBOT!

Blocked.
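
For anyone who wants to check their own logs, here is a rough sketch of the matching logic in Python. It only uses the user agents and IPs listed in this post. The names (is_suspect, SUSPECT_IPS and so on) are made up for illustration, and this is not necessarily how the block on this site is implemented. Matching the exact Netscape 4.76 string is an assumption that nobody legitimate still sends it, so adjust to taste.

```python
import ipaddress

# User agents taken from the log lines quoted in this post.
# libwww-perl is matched by prefix because the version suffix varies.
SUSPECT_UA_PREFIXES = ("libwww-perl",)
SUSPECT_UA_EXACT = ("Mozilla/4.76 [en] (Win98; U)",)

# Every IP listed in this post (historical and current).
SUSPECT_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "99.231.78.89",
        "99.231.221.217",
        "99.231.44.115",
        "76.65.207.92",
        "173.203.210.51",
        "173.203.210.95",
        "173.203.215.230",
        "173.203.241.192",
    )
)


def is_suspect(ip: str, user_agent: str) -> bool:
    """Return True if the request matches a listed IP or user agent."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        addr = None  # unparseable address: fall through to the UA checks

    if addr in SUSPECT_IPS:
        return True
    if user_agent.startswith(SUSPECT_UA_PREFIXES):
        return True
    return user_agent in SUSPECT_UA_EXACT
```

Feed it the client IP and user agent from each log line, for example is_suspect("99.231.78.89", "Mozilla/4.76 [en] (Win98; U)"), and deny or flag anything that comes back True. The IP list will go stale as they move around, so the user agent checks are probably the more durable half.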

2 comments:

Kim Raufort said...

Cool article keep up the good work. This site I will keep my eyes on in the future. Kim Denmark

Doug Wilson said...

So what name do we use for UA?

RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [OR]