Everyone knows about PicScout, the service used by Getty Images, but nobody seems to know anything about PicScout's crawler: no user agent information, no IPs where they crawl from, nothing. When someone asked me if I knew anything about them I did a little research and found nothing related ANYWHERE, not even anything initially obvious in my bot blocker log files. Based on my initial observations, PicScout actually seemed to be hiding better than all the other corporate crawlers I've researched to date, but maybe we can shed some light on this.
Not that I advocate copyright violation; as a matter of fact, I'm a staunch copyright defender.
However, attempting to crawl under the radar, refusing to honor robots.txt files or identify your bot in any fashion, and bypassing website security measures gets under my skin more than anything, so I picked up the gauntlet and tried to find signs of PicScout activity.
After the usual simple research methods failed, I decided to start by seeing where they were hosted.
host picscout.com
picscout.com has address 126.96.36.199
188.8.131.52.in-addr.arpa domain name pointer bzq-80-254-37.dcenter.bezeqint.net.
Ah ha!
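If you'd rather script the same forward-then-reverse lookup than run `host` by hand, here's a rough Python sketch using only the standard library. The function names `ptr_name` and `forward_and_reverse` are my own inventions for illustration, and the address in the usage note is a reserved documentation address, not anything from the lookups above.

```python
import ipaddress
import socket


def ptr_name(ip: str) -> str:
    """Return the reverse-DNS (PTR) query name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer


def forward_and_reverse(hostname: str):
    """Resolve hostname to an IPv4 address, then look up that address's PTR name.

    Returns (ip, ptr_host); ptr_host is None when no PTR record is found.
    Needs working DNS, so nothing here runs at import time.
    """
    ip = socket.gethostbyname(hostname)
    try:
        ptr_host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        ptr_host = None
    return ip, ptr_host
```

Calling `ptr_name("192.0.2.10")` gives `10.2.0.192.in-addr.arpa`, the same reversed-octet form that `host` prints. `forward_and_reverse("picscout.com")` would do the live lookup, and whatever it returns today may well differ from what I saw.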
I remember a rash of activity I shut down from bezeqint.net a while back so I looked a little deeper into this angle.
inetnum: 184.108.40.206 - 220.127.116.11
Ah yes, they're the guys from Israel that were hammering one of my servers.
I found a high volume of crawling from these IPs that was trapped by the bot blocker automatically and never answered the challenges, so it was definitely bot traffic.
18.104.22.168
These IPs have only been spotted using the following two user agents:
Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)
My theory is that this is PicScout attempting to crawl under the radar.
Check your logs, people, and see if you have any activity in this range. I think it's them.
I would just block this range on principle at this point, as the IPs doing the crawling aren't honoring any internet standards, and if it is PicScout, blocking them could possibly save you a massive chunk of money if some web designer used stolen images when building your website.
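For the log check suggested above, here's a minimal Python sketch. It assumes your server writes the common "combined" access log format with the user agent as the last quoted field, which may not match your setup. `SUSPECT_NETS` is deliberately a placeholder documentation range, so swap in the range you actually want to check. The names `suspect_hits`, `SUSPECT_NETS`, `SUSPECT_AGENTS` and `LINE_RE` are all made up for this example.

```python
import ipaddress
import re

# Placeholder range (RFC 5737 documentation addresses), NOT the range from this post.
SUSPECT_NETS = [ipaddress.ip_network("192.0.2.0/24")]

# The two user agent strings quoted in this post, copied exactly (spacing included).
SUSPECT_AGENTS = {
    "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)",
}

# Assumed combined log format: client IP first, user agent as the last quoted field.
LINE_RE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')


def suspect_hits(lines):
    """Yield (ip, agent, agent_matches) for each request from a suspect network."""
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # line doesn't look like the assumed format
        try:
            ip = ipaddress.ip_address(m.group(1))
        except ValueError:
            continue  # first field was a hostname, not an IP
        if any(ip in net for net in SUSPECT_NETS):
            agent = m.group(2)
            yield str(ip), agent, agent in SUSPECT_AGENTS
```

Feed it an open log file, e.g. `for hit in suspect_hits(open("access.log")): print(hit)`, where `access.log` is whatever your server calls its log. The third value tells you whether the hit also used one of the two user agents, which is a stronger signal than the IP range alone.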
After posting this, the fine people from PicScout visited the blog and revealed more information about their facilities.
The log showed this visit:
Host Name mail.picscout.com
IP Address 22.214.171.124
The information I found from that, including another IP block is here:
inetnum: 126.96.36.199 - 188.8.131.52
status: ASSIGNED PA
source: RIPE # Filtered
So, there's a few more IPs you might want to block, but I doubt they're scanning from the office.
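Most firewalls and server deny rules want CIDR blocks rather than the "first - last" form that whois prints on the inetnum line. Here's a small Python sketch for converting one into the other with the standard `ipaddress` module. The helper name `inetnum_to_cidrs` is hypothetical, it only handles IPv4, and the ranges in the usage note are documentation placeholders rather than the blocks above.

```python
import ipaddress


def inetnum_to_cidrs(inetnum: str):
    """Convert a whois-style 'first - last' IPv4 range into a list of CIDR strings.

    Accepts the value with or without a leading 'inetnum:' label.
    """
    value = inetnum.split(":", 1)[-1]  # drop the 'inetnum:' label if present
    first, last = (ipaddress.ip_address(part.strip()) for part in value.split("-"))
    # summarize_address_range raises ValueError if first is greater than last.
    return [str(net) for net in ipaddress.summarize_address_range(first, last)]
```

For example, `inetnum_to_cidrs("inetnum: 192.0.2.0 - 192.0.2.255")` returns `['192.0.2.0/24']`. A range that doesn't fall on a clean boundary comes back as several blocks, so `"192.0.2.0 - 192.0.2.130"` becomes `['192.0.2.0/25', '192.0.2.128/31', '192.0.2.130/32']`.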
UPDATE: Caught Getty keeping an eye on everyone today.
My blog log showed this:
Time: 12th June 2007 12:24:53 PM
Host Name outbound.gettyimages.com
IP Address 184.108.40.206
Country United States
ISP Getty Images
It appears they were snooping on WebProWorld and followed the link here. The user agent claimed to be MSIE 6.0 but it's possibly an automated crawler, hard to say.
Anyway, we're watching you watch us, it works both ways.