Focused crawler

A focused crawler or topical crawler is a web crawler that attempts to download only web pages that are relevant to a pre-defined topic or set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and non-relevant pages are available. Topical crawling was first introduced by Menczer; focused crawling was first introduced by Chakrabarti et al.

Strategies

Ideally, a focused crawler would download only web pages that are relevant to a particular topic and avoid downloading all others.

Therefore, a focused crawler may predict the probability that a link leads to a relevant page before actually downloading it. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. In a review of topical crawling algorithms, Menczer et al. show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation give the best performance over longer crawls. Diligenti et al. propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not yet been visited.
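As a rough illustration of the anchor-text heuristic, each outgoing link can be scored by the lexical similarity between its anchor text and the driving query. The sketch below is a minimal Python version of that idea; the URLs, the query string, and the score_link helper are illustrative assumptions, not taken from any of the cited systems.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        """Lowercase and split text into alphanumeric word tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def cosine_similarity(a, b):
        """Cosine similarity between two token-count vectors."""
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def score_link(anchor_text, topic_query):
        """Estimate how promising a link is from its anchor text alone."""
        return cosine_similarity(Counter(tokenize(anchor_text)),
                                 Counter(tokenize(topic_query)))

    # Example: rank a page's outgoing links by anchor-text relevance.
    links = [("https://example.org/ml", "machine learning tutorials"),
             ("https://example.org/cats", "funny cat pictures")]
    query = "machine learning web crawler"
    ranked = sorted(links, key=lambda l: score_link(l[1], query), reverse=True)

A crawler using this strategy would pop the highest-scoring unvisited link first, which is exactly the kind of simple short-crawl strategy the review above found effective.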

In another approach, the relevance of a page is determined after its content has been downloaded. Relevant pages are sent for content indexing and their contained URLs are added to the crawl frontier; pages that fall below a relevance threshold are discarded.
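A minimal sketch of this download-then-judge loop is given below, assuming caller-supplied fetch, is_relevant, and extract_links functions (all hypothetical placeholders standing in for an HTTP client, a relevance classifier, and an HTML link extractor).

    import heapq

    def focused_crawl(seeds, fetch, is_relevant, extract_links,
                      budget=1000, threshold=0.5):
        """Best-first focused crawl: download, judge relevance, expand frontier."""
        frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
        heapq.heapify(frontier)
        seen = set(seeds)
        indexed = []
        while frontier and budget > 0:
            _, url = heapq.heappop(frontier)
            page = fetch(url)                        # download the page content
            budget -= 1
            score = is_relevant(page)                # relevance judged after download
            if score < threshold:
                continue                             # below threshold: discard
            indexed.append(url)                      # relevant: send to indexing
            for link in extract_links(page):         # add contained URLs to frontier
                if link not in seen:
                    seen.add(link)
                    # Heuristic: a link inherits its parent page's score.
                    heapq.heappush(frontier, (-score, link))
        return indexed

Prioritizing the frontier by the parent page's relevance score is one common design choice; a crawler could equally combine it with the anchor-text score sketched earlier.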

The performance of a focused crawler depends mostly on the richness of links within the specific topic being searched, and focused crawling usually relies on a general web search engine to provide starting points.

BotSeer

BotSeer is a Web-based information system and search tool that provides resources and services for research on Web robots and on trends in Robot Exclusion Protocol deployment and adherence. It was created and designed by Yang Sun, Isaac G. Councill, Ziming Zhuang and C. Lee Giles.

BotSeer provides three major services: robots.txt search, robot bias analysis, and analysis of robot-generated logs. The prototype also allows users to search six thousand documentation files and source code files from 18 open-source crawler projects. BotSeer serves as a resource for studying the regulation and behavior of Web robots, as well as a source of information on creating effective robots.txt files and crawler implementations. It is publicly available on the World Wide Web at the College of Information Sciences and Technology at the Pennsylvania State University. BotSeer has indexed and analyzed 2.2 million robots.txt files obtained from 13.2 million websites, as well as a large Web server log of real-world robot behavior and related analysis. BotSeer's goals are to assist researchers, webmasters, web crawler developers and others with research and information needs related to web robots.
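For context, the Robot Exclusion Protocol files that BotSeer indexes are plain-text robots.txt files that well-behaved crawlers consult before fetching pages. Python's standard library can perform this check; in the sketch below, the URL and user-agent string are placeholders, not values associated with BotSeer.

    import urllib.robotparser

    # A typical robots.txt, of the kind BotSeer indexes, might read:
    #   User-agent: *
    #   Disallow: /private/

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.org/robots.txt")  # placeholder site
    rp.read()                                     # fetch and parse the file

    # A compliant crawler checks permission before fetching each URL.
    if rp.can_fetch("MyCrawler", "https://example.org/private/data.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")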
