Tuesday, October 28, 2008

Why Does Copyscape/GoogleAlert Hide?

Never really played around with Copyscape/GoogleAlert much, but I noticed it tries to completely hide its presence when accessing a server, which isn't cool.

Not that I'm a fan of plagiarism, as my copy of the DMCA is almost worn out from use, but I'm even less of a fan of sneaky web crawlers that pretend to be shit they aren't.

The IP that Copyscape uses: -> www.googlealert.com
The Copyscape user agent:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
This is located in a Rackspace range, so if you're already blocking Rackspace then you probably won't be bothered by Copyscape in the first place:
inetnum: -
descr: Rackspace Managed Hosting
Of course, you might not want to block this if you actually use Copyscape, as blocking it makes the service quite useless for your own site.
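If you do want to block by range, here is a rough Python sketch of the check using the standard `ipaddress` module. The network below is a placeholder from the reserved documentation block, not the actual Rackspace range, and `BLOCKED_NETWORKS` and `is_blocked` are names made up for illustration, so substitute whatever your own WHOIS lookup returns.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is a reserved documentation range,
# NOT the real Rackspace/Copyscape range. Replace with your own lookup.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(remote_addr, networks=BLOCKED_NETWORKS):
    """Return True if remote_addr falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: don't block here, let other checks decide.
        return False
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "198.51.100.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Matching on the address range rather than the user agent is the point here, since the user agent above is indistinguishable from a real browser.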


Copyscape said...

Hi IncrediBILL,

Just picked up your post. By "hide" I assume you're referring to the user agent Copyscape uses. If so, the explanation is straightforward.

In our experience, many sites deliver strange content when receiving requests from user agents they do not recognize. By using a common user agent for our requests, we do our best to retrieve the content as it would be delivered to ordinary users.

Ian M said...

@copyscape - in that case, keep the existing UA but include "Copyscape" in the User-Agent e.g. like so:

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Copyscape)"

Anonymous said...

I think what copyscape really meant was something along the lines of ... so many webmasters are blocking our asses that we have to use a common user agent to get in the door....

Anonymous said...

Keep in mind that you might want to also block their other IP while you're at it, just for good measure.

James S said...

I personally use the http://www.copygator.com website to find duplicated content. To me it has a number of benefits over copyscape:

1. It's automated and brings me results instead of me searching for duplicated content. All I had to do was submit my feed and it started monitoring it, showing me who's republished my articles on the web.

2. I get notified by email, so it contacts me when it finds copies of my articles online.

3. I use their image badge feature to alert me directly on my website when my content is being lifted.

4. It's a free service, as opposed to the "per page" cost of copyscape/copysentry.

webmaster said...

I also want to let you know that the copyscape bot ignores robots.txt entries, HTML meta-tags, and NOFOLLOW attributes in hyperlinks. So, in essence, copyscape is violating the copyrights of site owners themselves, and deserves to be blocked.

Anonymous said...

> copyscape bot ignores robots.txt entries, HTML meta-tags, and NOFOLLOW attributes in hyperlinks

Well, in that case you need not worry because it'll be blocked by your standard bot trap.
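
For anyone wondering what a "standard bot trap" looks like, here is a minimal Python sketch of the idea: disallow a path in robots.txt, link to it invisibly, and ban any address that requests it anyway. The `/trap/` path, the `BotTrap` class, and the addresses are all hypothetical, and a real setup would persist the ban list rather than keep it in memory.

```python
TRAP_PATH = "/trap/"  # hypothetical path, disallowed for all crawlers

# What robots.txt would need to say for the trap to be fair game.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"


class BotTrap:
    """Ban any client that requests a path robots.txt told it to avoid."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned = set()  # in-memory only; illustrative, not persistent

    def allow(self, remote_addr, path):
        """Return False if the request should be refused."""
        if remote_addr in self.banned:
            return False
        if path.startswith(self.trap_path):
            # Well-behaved crawlers never get here, so ban the address.
            self.banned.add(remote_addr)
            return False
        return True


if __name__ == "__main__":
    trap = BotTrap()
    print(trap.allow("203.0.113.9", "/index.html"))
    print(trap.allow("203.0.113.9", "/trap/honeypot.html"))
    print(trap.allow("203.0.113.9", "/index.html"))
```

A crawler that honors robots.txt never touches the trap path, so only the ones that ignore it end up on the ban list.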