Given the vast scale of the Web, crawling prioritisation techniques based on graph traversal, popularity, link analysis, and textual content are frequently applied to surface the documents most likely to be valuable. While these techniques have proven effective for keyword-based search, retrieval methods and user search behaviours are shifting from keyword-based matching to natural language semantic matching. Semantic matching and quality signals have been applied with great success during ranking, and researchers have recently proposed exploiting them to prioritise the frontier of Web crawlers as well. To investigate this further, we propose two novel neural policies that aim to surface content that is semantically rich and valuable for modern search needs, aligning crawler behaviour with the recent shift towards natural language search. Our experiments on the English subset of ClueWeb22-B, using the MS MARCO Web Search and Researchy Questions query sets, show that, compared to existing crawling techniques, neural crawling policies significantly improve harvest rate during the early stages of crawling.
Keywords: Crawling, Web search, Quality estimation
File: https://ceur-ws.org/Vol-4026/paper3.pdf
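
The two notions the abstract relies on, a prioritised frontier and harvest rate, can be illustrated with a minimal sketch. It is not the paper's neural policy: the `score` function, the toy link graph, the `quality` values, and the `relevant` set are all hypothetical, and harvest rate is assumed here to mean the fraction of fetched pages that are judged relevant.

```python
import heapq
from typing import Callable, Dict, List, Set


def crawl(seeds: List[str],
          links: Dict[str, List[str]],
          score: Callable[[str], float],
          budget: int) -> List[str]:
    """Best-first crawl: always fetch the highest-scoring URL in the frontier.

    `score` is a placeholder for any prioritisation policy (hypothetical here).
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    fetched: List[str] = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        fetched.append(url)
        # Enqueue newly discovered outlinks, each at most once.
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), out))
    return fetched


def harvest_rate(fetched: List[str], relevant: Set[str]) -> float:
    """Fraction of fetched pages that are relevant (assumed definition)."""
    if not fetched:
        return 0.0
    return sum(url in relevant for url in fetched) / len(fetched)


# Toy data, invented purely for illustration.
links = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
quality = {"a": 0.5, "b": 0.1, "c": 0.9, "d": 0.2, "e": 0.8}
relevant = {"c", "e"}

fetched = crawl(["a"], links, lambda url: quality[url], budget=3)
rate = harvest_rate(fetched, relevant)
print(fetched, rate)
```

With a small budget, the order in which the frontier is drained determines which pages are fetched at all, which is why a better priority signal would be expected to show up mainly in early-stage harvest rate.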

