Design and Challenges of Web Crawlers and Link Scheduling for Knowledge Graph Construction
This article explains how web crawlers (spiders) collect data for knowledge graphs, covering core tasks, major challenges, crawler features, new‑link expansion, storage design, link‑selection scheduling strategies, and the role of large‑scale data mining and machine learning in optimizing crawl efficiency.
Knowledge graphs rely heavily on data obtained by crawling websites, and the crawler is commonly referred to as a Spider.
Spiders continuously traverse the Internet, discovering links and fetching pages. This sounds simple, but in practice it involves many difficulties.
Spiders have two core tasks: link‑selection scheduling and page fetching.
The main problems faced by spiders include:
Pressure calculation, control, and traffic allocation: how to improve crawl friendliness, avoid bans, evaluate site capacity, and distribute traffic among multiple users and diverse crawl requirements.
Limited crawl bandwidth requires predictive scheduling to maximize benefit, covering link discovery, pending‑link grading, and page update prediction, often using large‑scale data mining and machine learning.
Ingestion control: detecting low‑quality or duplicate pages early to prevent the system from being overwhelmed by spam.
1. Crawler
The simplest crawler is an HTTP client that receives a list of URLs, fetches them, and outputs pages, but real‑world needs go beyond simple HTTP fetching. Requirements include:
Support for multiple protocols (HTTP/HTTPS).
Handling various resource types (documents, images, videos).
Page stitching for embedded frames.
Page redirection handling (static via HTTP headers or meta tags, dynamic via JavaScript).
Page rendering to execute JavaScript and CSS so the output resembles a browser view.
Rendering can be built on Chrome or Firefox engines, but this reduces performance and is unnecessary for many pages; users can choose rendering when needed.
Additional challenges include avoiding site bans by controlling request rates, solving CAPTCHAs, and handling login authentication.
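Controlling request rates per host is the first of these challenges. As a rough sketch (the function name, the injectable `fetch_fn`, and the one-second default delay are assumptions for illustration, not details from the article), a polite fetcher can enforce a minimum gap between consecutive hits to the same site:

```python
import time
from urllib.parse import urlparse

def polite_crawl(urls, fetch_fn, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Fetch URLs while enforcing a minimum delay between hits to the same host.

    fetch_fn(url) -> page body. The fetcher and the clock are injected so the
    throttling logic can be exercised without real network access.
    """
    last_hit = {}          # host -> timestamp of the previous request
    pages = {}
    for url in urls:
        host = urlparse(url).netloc
        now = clock()
        wait = min_delay - (now - last_hit.get(host, -min_delay))
        if wait > 0:       # too soon for this host: back off
            sleep(wait)
        last_hit[host] = clock()
        pages[url] = fetch_fn(url)
    return pages
```

A real spider would layer robots.txt handling, per-site pressure estimation, and retry logic on top of this, but the core idea is the same: track the last request time per host and sleep out the remainder of the delay.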
2. New‑Link Expansion
After a crawler retrieves pages, new‑link expansion continuously supplies fresh URLs for further crawling.
The web is a massive graph; starting from a small seed set (e.g., major portals), new links are extracted and merged back into the seed pool, allowing the spider to gradually reach the whole Internet.
Typical extraction parses <a> tags, but other methods are needed for plain‑text links or JavaScript‑generated URLs, which require a rendering‑capable crawler.
Domain‑specific crawlers often use regular expressions to define expansion rules. For example, to crawl news from NetEase:
Channel‑home URL pattern: http://news.163.com/[a-z]*/
News‑detail URL pattern: http://news.163.com/[0-9]+/[0-9]+/[0-9]+/.*\.html
Thus the expansion rule combines a seed‑page regex with an allowed‑new‑link regex.
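The combination of `<a>`-tag extraction and an allowed-new-link regex can be sketched as follows (the helper names are illustrative, not from the article):

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def expand(html, allow_pattern):
    """Extract all anchor links, then keep only those matching the
    allowed-new-link regex of the expansion rule."""
    parser = LinkExtractor()
    parser.feed(html)
    allow = re.compile(allow_pattern)
    return [u for u in parser.links if allow.match(u)]
```

Plain-text or JavaScript-generated links would not be found this way; as noted above, those require additional regex scanning over raw text or a rendering-capable crawler.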
3. Link Store and Page Store
As crawling proceeds, the number of links and pages grows rapidly, demanding efficient storage solutions.
Link store requirements:
Schema support for fixed‑length and variable‑length fields.
Batch read (select links) and batch write (merge new links).
Random reads for occasional queries.
Small record size (hundreds of bytes, max ~1 KB).
Heavy read‑modify‑write workload for link‑status updates.
Page store requirements:
Schema support for fixed‑length and variable‑length fields.
Batch reads for building indexes.
Batch writes for storing fetched pages.
Random reads (low volume).
Large record size (tens of KB up to several MB).
Page data is usually OLAP‑oriented and stored in HBase, while link data, which needs high‑frequency random writes, fits relational databases with sharding or NoSQL solutions such as Elasticsearch.
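To make the link-store workload concrete, here is a minimal in-memory stand-in (the class and field names are assumptions for the sketch; a production system would back this with the sharded or NoSQL stores mentioned above). It models the three access patterns listed: batch merge of new links, read-modify-write status updates, and batch selection:

```python
from dataclasses import dataclass

@dataclass
class LinkRecord:
    """A small, mostly fixed-length record (hundreds of bytes in practice)."""
    url: str                    # key
    depth: int = 0              # follow depth from the seed set
    status: str = "pending"     # pending / fetched / failed
    fail_count: int = 0

class LinkStore:
    """In-memory model of the link store's access patterns."""
    def __init__(self):
        self._rows = {}

    def merge_new_links(self, records):
        """Batch write: deduplicate by URL, keeping the shallowest depth."""
        for rec in records:
            old = self._rows.get(rec.url)
            if old is None or rec.depth < old.depth:
                self._rows[rec.url] = rec

    def mark(self, url, status):
        """Read-modify-write of link status (the heavy workload in practice)."""
        row = self._rows[url]
        row.status = status
        if status == "failed":
            row.fail_count += 1

    def select(self, status="pending"):
        """Batch read of links in a given state, e.g. for scheduling."""
        return [r for r in self._rows.values() if r.status == status]
```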
4. Link‑Selection Scheduling
After continuous crawling and new‑link expansion, the link store contains billions of seeds, but only a limited number can be fetched each day, requiring a scheduling strategy to decide which links to crawl.
Typical strategies include:
Breadth‑limited (breadth‑first) scheduling: prioritize shallow links (low follow depth) because they are more likely to be valuable to users.
Link grading: assign weights to seeds similar to OS process scheduling, then map weights to levels, select from high to low, ensure low‑level protection, and randomize within the same level.
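A minimal sketch of this grading scheme, assuming a simple linear weight-to-level mapping and a fixed reserved share for the lowest level (both the bucket count and the protection ratio are illustrative parameters, not values from the article):

```python
import random

def schedule(links, weights, quota, levels=4, low_level_share=0.2, rng=None):
    """Pick up to `quota` links: map weights to levels, scan levels from
    high to low, shuffle within each level, and reserve a share of the
    quota for the lowest level so it is never starved."""
    rng = rng or random.Random()
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1
    buckets = [[] for _ in range(levels)]
    for link in links:
        level = min(levels - 1, int((weights[link] - lo) / span * levels))
        buckets[level].append(link)
    reserved = int(quota * low_level_share)   # low-level protection
    picked = []
    for bucket in reversed(buckets[1:]):      # high levels first
        rng.shuffle(bucket)                   # randomize within a level
        picked += bucket[: max(0, quota - reserved - len(picked))]
    rng.shuffle(buckets[0])
    picked += buckets[0][: quota - len(picked)]
    return picked
```

The shuffle within a level prevents lexicographic bias among equally graded seeds, and the reserved slice mirrors the "low-level protection" idea: even when high-weight links could fill the whole quota, some capacity still flows to the bottom level.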
Five traffic types: Coverage‑CHK, Coverage‑GET, Freshness‑CHK, Freshness‑GET, Update‑CHK; priorities are comparable within each type but not across types, and dynamic ratios balance them.
Site‑isolation selection: group links by site (e.g., keyed by the reversed hostname so that a site's links sort together) and schedule each site independently to avoid cross‑site interference.
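Site isolation can be sketched as grouping by reversed hostname and then interleaving one link per site (the round-robin policy here is an illustrative choice; real schedulers weight sites by capacity and priority):

```python
from itertools import zip_longest
from urllib.parse import urlparse

def by_site(urls):
    """Group links by site, keyed by the reversed hostname so that
    subdomains of the same site sort next to each other."""
    groups = {}
    for url in urls:
        key = ".".join(reversed(urlparse(url).netloc.split(".")))
        groups.setdefault(key, []).append(url)
    return groups

def interleave(groups):
    """Emit one link per site per round, so no single site
    monopolises the crawl slots."""
    out = []
    for batch in zip_longest(*groups.values()):
        out += [u for u in batch if u is not None]
    return out
```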
Scheduling relies on numerous link attributes, such as URL patterns, domain, anchor text, clustering results, link depth, page rank, freshness, and feedback from previously crawled pages.
Important scheduling strategies
1. Pending‑link scheduling: use machine‑learning models to predict the post‑crawl value of a link based on its attributes, while handling challenges like reverse‑link computation, granularity balance, and fast feedback loops.
2. Index‑page scheduling: evaluate index pages by the quality and quantity of new links they can discover ("mother‑by‑child" principle), using modules like index‑check and hubdig to grade index and hub pages.
3. Update‑rate scheduling: decide which existing pages to refresh based on change probability (derived from pattern clustering, page attributes, and collective feedback) and update benefit (index presence, click/impression metrics), computing a combined CHK score as (update probability × update benefit).
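The combined CHK score described above is a straightforward product; a minimal sketch (the ranking helper is an assumption for illustration):

```python
def chk_score(update_probability, update_benefit):
    """Combined recheck score: pages that are likely to have changed AND
    are worth refreshing (indexed, clicked, shown) rank highest."""
    return update_probability * update_benefit

def rank_for_recheck(pages):
    """pages: {url: (update_probability, update_benefit)}.
    Returns URLs ordered by descending CHK score."""
    return sorted(pages, key=lambda u: chk_score(*pages[u]), reverse=True)
```

Note how the product penalises both extremes: a page that changes constantly but is never shown scores as low as a heavily clicked page that never changes.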
Overall, link‑selection scheduling is the most complex spider module, involving deep data mining and machine‑learning techniques to achieve optimal crawl efficiency.
Author Introduction
Zheng Zhibin, South China University of Technology, Computer Science and Technology (bilingual class). Previously worked at BAT companies on e‑commerce, open platforms, mobile browsers, search‑advertising back ends, and big‑data AI, specializing in business architecture and platform solutions.