
Baidu Search Deep Learning Model Architecture and Optimization Practices

Baidu's Search Architecture team details how its deep‑learning models have evolved to deliver direct answer results via semantic embeddings, describes a massive online inference pipeline that rewrites queries, ranks relevance, and classifies types, and outlines optimization techniques—including data I/O, CPU/GPU balancing, pruning, quantization, and distillation—to achieve high‑throughput, low‑latency search.

Baidu Tech Salon

This article introduces the work of the Model Architecture Team in Baidu's Search Architecture Department on deep learning models for search systems. The content covers three main areas: the business and architecture evolution of search deep learning models, the super large-scale online inference system, and deep model optimization practices.

1. Search Deep Learning Model Business and Architecture Evolution

The team explains how deep learning has transformed search from returning lists of web links to directly providing precise answers. They discuss the semantic retrieval pathway, which uses embedding vectors (128 or 256 dimensions) to map user queries and web content into a shared semantic space, where closer vectors indicate more similar meanings. This addresses limitations of traditional inverted-index retrieval, especially in Chinese, where a one-character change can dramatically alter meaning (e.g., "山桃红了", "the mountain peaches have turned red", vs. "山桃花红了", "the mountain peach blossoms have turned red").
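The core idea of the semantic space can be sketched in a few lines: candidates whose embeddings point in nearly the same direction as the query embedding are treated as semantically close. The vectors below are random stand-ins for real model outputs, and the dimension and names are illustrative only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means same direction, near 0.0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Hypothetical 128-dimensional embeddings (the article mentions 128/256 dims).
query = rng.standard_normal(128)
doc_close = query + 0.1 * rng.standard_normal(128)  # near-paraphrase of the query
doc_far = rng.standard_normal(128)                  # unrelated content

score_close = cosine_similarity(query, doc_close)
score_far = cosine_similarity(query, doc_far)
```

In a production retrieval system, the top-k closest vectors are typically found with an approximate nearest-neighbor index rather than brute-force comparison against every document.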

Search models differ from recommendation models: search uses transformer-based structures with vocabulary under 200K, deep models, and high computational requirements, while recommendation models handle massive feature tables (TB-level) with wide but shallow DNN structures.

2. Super Large-Scale Online Inference System

The online inference system consists of three main components: (1) demand analysis / query rewriting, which uses deep learning to map colloquial user expressions to semantically similar, well-formed queries; (2) relevance ranking, which uses coarse- and fine-ranking models to compute relevance scores between a user query and each page's title and content; (3) classification, which identifies the query type so the appropriate result card can be displayed.

The inference pipeline includes caching for repeated requests, dynamic batching to improve hardware utilization, user-defined preprocessing, and a prediction-queue system. The system schedules requests uniformly across multiple machines and data centers, and ensures reliability through fault detection and migration.
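The dynamic-batching step can be sketched as follows: requests are drained from a queue until either the batch is full or a short timeout expires, trading a small amount of latency for much better hardware utilization. The `max_batch` and `timeout_ms` values are illustrative parameters, not Baidu's actual configuration.

```python
import time
from queue import Queue, Empty

def collect_batch(requests: Queue, max_batch: int = 32, timeout_ms: float = 5.0) -> list:
    """Drain up to max_batch requests, waiting at most timeout_ms in total
    so a lone request is never delayed indefinitely."""
    batch = []
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout: ship a partial batch rather than keep waiting
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break  # no more requests arrived within the window
    return batch
```

A batch assembled this way is then padded to a common sequence length and sent to the accelerator in one call, which is where the utilization gain comes from.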

3. Deep Model Optimization Practices

Model optimization addresses three bottleneck types: an IO bottleneck, where data cannot be read fast enough to keep the hardware busy; a CPU bottleneck, where the CPU cannot prepare and dispatch work fast enough to saturate the GPU; and a GPU bottleneck, the desired end state, in which the GPU itself is the limiting factor and its utilization is high.

Optimization work includes: (1) training optimization: data-reading optimization, framework scheduling, kernel fusion and custom kernel development, and equivalence-preserving replacement of model implementations; (2) inference optimization: GPU/CPU load balancing and model-structure pruning; (3) model miniaturization: distillation (running teacher-model inference in parallel on heterogeneous hardware), quantization (FLOAT32 to INT8/INT4), and pruning (attention-head pruning, layer skipping).
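To make the FLOAT32-to-INT8 step concrete, here is a minimal sketch of symmetric per-tensor quantization, one common form of the conversion mentioned above (not necessarily Baidu's exact scheme). Each weight tensor is stored as INT8 values plus a single FLOAT32 scale, shrinking storage roughly 4x.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map FLOAT32 weights to INT8 plus one scale factor.
    Assumes w is not all zeros (scale would be zero otherwise)."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FLOAT32 weights for computation."""
    return q.astype(np.float32) * scale

# Illustrative random weight matrix standing in for a real model layer.
w = np.random.default_rng(1).standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
max_err = float(np.max(np.abs(dequantize_int8(q, scale) - w)))
# Rounding error per weight is bounded by half a quantization step (scale / 2).
```

Per-channel scales and calibration on real activation data usually reduce the accuracy loss further, which is why production INT8/INT4 pipelines are more involved than this sketch.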

Tags: model optimization, deep learning, search engine, Transformer, GPU optimization, model distillation, semantic retrieval, Baidu, inference system, model quantization
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
