Saturday, July 07, 2007

Too Much FyberSpider In My Site's Diet

Found this FyberSpider thing that used to crawl from a Comcast address and has apparently grown up and is crawling from a real dedicated server now.

The IP was 69.36.5.45, the reverse DNS claims to be server.fybersearch.net, and sure enough there's something called FyberSearch with what appears to be a functional search page. The results actually appear to be populated with data collected from their crawl; trade secret, don't ask.

It asked for robots.txt but doesn't set the user agent properly on that request, so it won't get past many bot blockers, assuming it actually honors robots.txt, until they fix that little bug:
69.36.5.45 "GET /robots.txt HTTP/1.0" "Python-urllib/1.15"
69.36.5.45 "GET / HTTP/1.0" "FyberSpider"
Here's the data center info if you want to block it:
OrgName: JTL Networks Inc.
NetRange: 69.36.0.0 - 69.36.15.255
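If you'd rather test an address against that range programmatically instead of eyeballing it, here's a quick sketch using Python 3's ipaddress module (the NetRange above works out to the 69.36.0.0/20 block):

import ipaddress

# JTL Networks: 69.36.0.0 - 69.36.15.255 is the 69.36.0.0/20 CIDR block
jtl_range = ipaddress.ip_network("69.36.0.0/20")

def is_jtl(ip):
    return ipaddress.ip_address(ip) in jtl_range

print(is_jtl("69.36.5.45"))  # True, the FyberSpider server
print(is_jtl("192.0.2.1"))   # False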
The search page has issues finding words in the one page I allowed to be indexed, so I'm not terribly impressed. NEXT!

3 comments:

Anonymous said...

Hi,

My name is Nathan Enns from the search engine FyberSearch. I would like to address the two issues you raised in your post:

1. I do apologize for the crawler error you experienced. However, I did some tests after reading your post and things appear to be functioning correctly. I would like to solve whatever issue caused the malfunction on your site. If possible, could you post the date and time the violation occurred as well as the URLs FyberSpider crawled (unless the only URL crawled was the front page)?

2. I am sorry to hear you are not impressed with FyberSearch. As you noted, server upgrades have allowed us to move more backend processes to dedicated machines. It should also be pointed out that there is a difference between crawling a page and indexing a page. Just because a crawler recently visited your site does not mean the search engine has updated its index. How often an engine updates its index has a lot to do with its computing resources and the way its programs work. Even the big engines crawl sites more often than they update their indexes.


Anyone who reads this and who has experienced a similar issue is encouraged to visit FyberSearch and send us an email documenting the issue. It is not our goal to violate the robots.txt standard and none of our tests have shown any malfunction, both before and after reading this post.

Thank you for writing this post and reading my response. Keep up the good work with the blog!

-- Nathan Enns

IncrediBILL said...

Nathan, thanks for dropping in, but you didn't pay attention to the post...

1. The crawl bug was that you used "Python-urllib/1.15" as the user agent when reading robots.txt, not "FyberSpider" as it should be set.

I didn't say you violated the robots.txt standard; re-read it a second time and maybe you'll get what I meant. Your user agent setting was the Python default, that's all (see the sketch at the end of this comment).


2. I'm aware a crawl isn't indexed instantly; that wasn't what I posted.

The search bug was that your search couldn't locate terms in pages CURRENTLY INDEXED, not newly crawled pages. I could get a specific result using certain keywords but not others, and those words that failed had already appeared in the search results and were currently indexed.
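Back to point 1, here's a sketch of why the name matters, using Python 3's urllib.robotparser and a hypothetical robots.txt rule: a rule keyed to "FyberSpider" applies to a bot announcing that name, but never to one announcing the Python default.

import urllib.robotparser

# Hypothetical robots.txt that targets the bot by name
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: FyberSpider",
    "Disallow: /",
])

# Matched by name, so the disallow rule applies
print(rp.can_fetch("FyberSpider", "http://example.com/page"))         # False
# The urllib default matches no rule, so access defaults to allowed
print(rp.can_fetch("Python-urllib/1.15", "http://example.com/page"))  # True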

Anonymous said...

You are welcome!

1. Thanks for clarifying... I made the mistake of assuming it was doing something wrong after you said "assuming it actually honors robots.txt" and gave people info to help them block my bot. I see now that I didn't have to read so much into what you said, sorry.

2. You can't blame me for misunderstanding this one ;-) You only said it had a hard time finding terms in a page you allowed to be indexed. You didn't say it was a page you had found in our results. From what you wrote the first time, it seemed like you were referring to a page you simply hadn't blocked the bot from crawling, not a page you had confirmed was indexed. Thanks for clarifying!