Friday, January 28, 2011

GlueText Crawlers Identified and Blocked

I started noticing some leeched content showing up on a site called GlueText, so I got curious to see how they were gathering their content.

Turns out they were initially using the default libwww-perl user agent back in '09: "libwww-perl/5.820"
Looks like they got a little smarter after being bounced by sites and switched to the old Netscape Navigator user agent for Win 98, which they still use today: "Mozilla/4.76 [en] (Win98; U)"
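For anyone who wants to screen for these two user agents, here is a minimal sketch in Python. The function name and the patterns are my own assumptions for illustration, not taken from any real server config. Note that matching every "libwww-perl/" agent will also catch any legitimate script that leaves the library default in place.

```python
import re

# The two user-agent strings quoted in this post. The patterns are an
# assumption about how one might match them, not a rule from a real config.
BLOCKED_UA_PATTERNS = [
    # Any default libwww-perl agent, e.g. "libwww-perl/5.820"
    re.compile(r"^libwww-perl/", re.IGNORECASE),
    # The exact Netscape 4.76 / Win98 string
    re.compile(r"^Mozilla/4\.76 \[en\] \(Win98; U\)$"),
]


def is_blocked_user_agent(user_agent):
    """Return True if the user agent matches one of the patterns above."""
    ua = (user_agent or "").strip()
    return any(pattern.search(ua) for pattern in BLOCKED_UA_PATTERNS)
```

A user agent check like this is only a first filter, since the string is trivially changed, which is exactly what happened here.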
GlueText appears to have historically used the following IPs:
My most current test showed they were now using the following IPs:
These IPs were from, all from GlueText:
Other IPs still involved:
The crawler doesn't request robots.txt, fakes a Netscape user agent to gain access without permission, doesn't appear to document how it crawls content, and doesn't appear to give webmasters any way to opt out.
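The robots.txt behavior is easy to check in your own logs. Below is a rough sketch, assuming access logs in the common/combined log format. The function name, the regex, and the sample lines are hypothetical, and the IPs are reserved documentation addresses, not GlueText's.

```python
import re
from collections import defaultdict

# Start of a common/combined log line:
# IP identity user [time] "METHOD path protocol" status size ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*" \d{3} \S+')


def clients_skipping_robots(lines):
    """Return the sorted IPs that made requests but never fetched /robots.txt."""
    paths_by_ip = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            ip, path = match.groups()
            # Ignore any query string when comparing paths
            paths_by_ip[ip].add(path.split("?", 1)[0])
    return sorted(ip for ip, paths in paths_by_ip.items()
                  if "/robots.txt" not in paths)


# Hypothetical sample data for illustration only
sample_log = [
    '192.0.2.10 - - [28/Jan/2011:10:00:00 +0000] "GET /post.html HTTP/1.0" 200 5120 "-" "Mozilla/4.76 [en] (Win98; U)"',
    '198.51.100.7 - - [28/Jan/2011:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [28/Jan/2011:10:01:05 +0000] "GET /post.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
]
print(clients_skipping_robots(sample_log))
```

Regular browsers don't fetch robots.txt either, so this is only one signal. It becomes more telling when combined with an ancient user agent and a crawl-like request pattern.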