Inside Baidu’s First‑Generation Spider: How a C‑Only Backend Powered Fast Search

The article recounts Xu Haiyang’s hands‑on experience designing Baidu’s early Spider system, describing its pure C procedural architecture, bug‑fixing journey, PageRank processing, team‑management analogies, and his later moves into AI and education entrepreneurship.

21CTO

As a member of Baidu's web search infrastructure team, Xu Haiyang experienced the architectural development of the Spider system firsthand.

Apart from relying on the Linux file system, the entire system was written in procedural C. Object‑oriented techniques were avoided to keep it simple and fast.
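To make the "procedural C on top of the file system" idea concrete, here is a minimal sketch of storing one fetched page as a plain file. The `page_record` layout, the field sizes, and the `page_save` / `page_load` names are assumptions made for illustration; the article does not describe the actual on-disk format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one fetched page; the sizes are illustrative. */
struct page_record {
    char url[256];
    unsigned long length;   /* number of valid bytes in body */
    char body[4096];
};

/* Write one record to `path`. Returns 0 on success, -1 on failure. */
int page_save(const char *path, const struct page_record *rec)
{
    assert(path != NULL && rec != NULL);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(rec, sizeof *rec, 1, fp);
    /* fclose can report a deferred write error, so check it as well. */
    if (fclose(fp) != 0 || written != 1)
        return -1;
    return 0;
}

/* Read one record back from `path`. Returns 0 on success, -1 on failure. */
int page_load(const char *path, struct page_record *rec)
{
    assert(path != NULL && rec != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t got = fread(rec, sizeof *rec, 1, fp);
    fclose(fp);
    return got == 1 ? 0 : -1;
}
```

Plain structs plus free functions that return status codes are the whole design here: there is no class hierarchy to trace when something breaks, which matches the simplicity argument made above.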

He participated in the design and implementation of Baidu's first‑generation Spider. In its first two years the search engine suffered from many bugs and performance problems, a period he describes as the “fill‑the‑holes” phase.

During that period Xu systematically identified and resolved problems, emphasizing that technology comes from practice and that theory must be validated by real work.

Basic Architecture of Baidu Search Engine

The basic architecture includes several modules and subsystems as shown in the diagram below.

Baidu's crawler discovers new sites, fetches their content, and then computes PageRank for use in forward and inverted indexing.
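An inverted index maps each term to the list of documents that contain it, while a forward index maps each document to its terms. The sketch below builds a tiny in-memory inverted index in procedural C. The fixed-size tables, the space-only tokenizer, and every name (`inverted_index`, `index_add_document`, and so on) are assumptions for illustration, not a description of Baidu's implementation.

```c
#include <assert.h>
#include <string.h>

#define MAX_TERMS 64
#define MAX_POSTINGS 16
#define MAX_TERM_LEN 32

/* One term and the ids of the documents that contain it. */
struct posting_list {
    char term[MAX_TERM_LEN];
    int doc_ids[MAX_POSTINGS];
    int count;
};

struct inverted_index {
    struct posting_list terms[MAX_TERMS];
    int term_count;
};

void index_init(struct inverted_index *idx)
{
    idx->term_count = 0;
}

/* Linear lookup; returns NULL when the term is not indexed. */
struct posting_list *index_find(struct inverted_index *idx, const char *term)
{
    for (int i = 0; i < idx->term_count; i++)
        if (strcmp(idx->terms[i].term, term) == 0)
            return &idx->terms[i];
    return NULL;
}

/* Record that doc_id contains term. Returns 0 on success, -1 if a table is full
 * or the term is too long. Assumes documents are indexed one at a time, so a
 * repeated term within one document shows up as the last posting. */
int index_add(struct inverted_index *idx, const char *term, int doc_id)
{
    struct posting_list *pl = index_find(idx, term);
    if (pl == NULL) {
        if (idx->term_count == MAX_TERMS || strlen(term) >= MAX_TERM_LEN)
            return -1;
        pl = &idx->terms[idx->term_count++];
        strcpy(pl->term, term);
        pl->count = 0;
    }
    if (pl->count > 0 && pl->doc_ids[pl->count - 1] == doc_id)
        return 0;               /* already recorded for this document */
    if (pl->count == MAX_POSTINGS)
        return -1;
    pl->doc_ids[pl->count++] = doc_id;
    return 0;
}

/* Split text on spaces and index every word under doc_id. */
int index_add_document(struct inverted_index *idx, int doc_id, const char *text)
{
    char buf[256];
    if (strlen(text) >= sizeof buf)
        return -1;
    strcpy(buf, text);          /* strtok modifies its input, so work on a copy */
    for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " "))
        if (index_add(idx, tok, doc_id) != 0)
            return -1;
    return 0;
}
```

Answering a query for a single term is then one `index_find` call followed by a walk over `doc_ids`. A production index would replace the linear term lookup with a hash table or sorted dictionary and keep the posting lists on disk.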

Many domestic sites contain duplicate or rewritten content; Baidu removes over 90% of such data through deduplication and anti‑spam processes.
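One simple way to catch exact copies is to fingerprint each page's normalized text and skip pages whose fingerprint has been seen before. The sketch below uses the FNV‑1a 64‑bit hash over text with whitespace removed and letters lower‑cased. This is an assumed technique chosen for illustration, and the names `content_fingerprint` and `dedup_check` are hypothetical. It would not catch rewritten content, which needs near‑duplicate methods such as shingling or SimHash.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SEEN 1024

/* FNV-1a 64-bit hash of the text, ignoring whitespace and letter case. */
uint64_t content_fingerprint(const char *text)
{
    uint64_t h = 0xcbf29ce484222325ULL;          /* FNV offset basis */
    for (const unsigned char *p = (const unsigned char *)text; *p; p++) {
        if (isspace(*p))
            continue;                            /* normalization: drop whitespace */
        h ^= (uint64_t)tolower(*p);              /* normalization: fold case */
        h *= 0x100000001b3ULL;                   /* FNV prime */
    }
    return h;
}

/* Fingerprints of pages accepted so far. */
struct seen_set {
    uint64_t hashes[MAX_SEEN];
    size_t count;
};

/* Returns 1 if the page duplicates an earlier one, 0 if it is new (and records
 * it), or -1 if the set is full. */
int dedup_check(struct seen_set *set, const char *text)
{
    uint64_t h = content_fingerprint(text);
    for (size_t i = 0; i < set->count; i++)
        if (set->hashes[i] == h)
            return 1;
    if (set->count == MAX_SEEN)
        return -1;
    set->hashes[set->count++] = h;
    return 0;
}
```

Because whitespace is dropped entirely, "hello world" and "helloworld" hash to the same value. That is acceptable for a sketch, but a real pipeline would choose its normalization rules more carefully and store fingerprints in a hash table rather than scanning an array.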

As described here, the PageRank stage assigns each page a weight based on originality and similarity, producing static weights and inverted index chains. At query time, these static weights are combined with time factors to rank the results.
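The article does not say how the static weight and the time factor are combined. As one possible reading, the sketch below divides the static weight by a term that grows with the page's age, then sorts results by the adjusted score. The formula, the `halving_age_days` parameter, and all names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

struct result {
    int doc_id;
    double static_weight;   /* precomputed offline, e.g. by the PageRank stage */
    double age_days;        /* time since the page was published or updated */
    double score;           /* filled in by rank_results */
};

/* Assumed formula: the score is halved when age_days equals halving_age_days. */
double time_adjusted_score(double static_weight, double age_days,
                           double halving_age_days)
{
    assert(halving_age_days > 0.0);
    return static_weight / (1.0 + age_days / halving_age_days);
}

/* qsort comparator: higher scores first. */
static int by_score_desc(const void *a, const void *b)
{
    const struct result *ra = a;
    const struct result *rb = b;
    if (ra->score < rb->score)
        return 1;
    if (ra->score > rb->score)
        return -1;
    return 0;
}

/* Compute the adjusted score of every result, then sort in descending order. */
void rank_results(struct result *results, size_t n, double halving_age_days)
{
    for (size_t i = 0; i < n; i++)
        results[i].score = time_adjusted_score(results[i].static_weight,
                                               results[i].age_days,
                                               halving_age_days);
    qsort(results, n, sizeof results[0], by_score_desc);
}
```

With `halving_age_days` set to 30, a fresh page with static weight 0.6 outranks a 300‑day‑old page with static weight 0.9, since 0.9 / 11 is roughly 0.08. Keeping the expensive weight computation offline and applying only a cheap adjustment per query is what makes this split attractive.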

The crawler started as a very simple task: it only had to be able to fetch pages, following the principle of “great simplicity”. No open‑source tools were used, because problems are hard to fix without a deep understanding of the code, and complex tools add unnecessary complexity.

Developer Growth

Xu and Zhou Limin discussed what kind of developers to hire. Zhou, who left a UCLA PhD to join Baidu, believes programming requires innate logical talent, while Xu argues that programming skills can be cultivated through continuous learning and practice, given self‑awareness and strong interest.

At that time Xu's responsibilities at Baidu spanned the web search, forum, and product teams.

Team Management Insights

Xu likens team dynamics to characters from *Journey to the West*: Tang Sanzang provides direction, Pigsy offers camaraderie, Sha Monk offers loyalty, and Sun Wukong represents the technically strong but sometimes arrogant engineer.

The analogy illustrates how technically strong people can become isolated if they view others as inferior. Xu compares this to the story of Apple’s Steve Jobs and Steve Wozniak, and to other tech companies where technical and product roles must cooperate.

Even the most powerful engineer must be guided by higher‑level constraints, just as Sun Wukong is bound by the monk’s headband.

Xu left Baidu with stock that gave him financial freedom, then pursued further study in neural networks and AI at Microsoft Research Asia.

In 2015 he co‑founded a K‑12 online education startup, applying AI and recommendation algorithms to assist teachers.

He hopes to contribute to the 21CTO community as a mentor, sharing his experience in search, AI, and software development.

Wishing all readers a pleasant weekend.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
