Midgard: Adaptive Storage Management for Search – From Simple Tables to Intelligent Layers
This article examines how Baidu's search service evolved its storage architecture—from a basic key‑value table to a hybrid HDD/Redis cache and finally to a sharded, multi‑collection design—culminating in Midgard, an intelligent storage‑layer manager that abstracts and optimizes data access for changing business needs.
Background
Internet businesses are fundamentally data‑centric; Baidu's search engine exemplifies a large‑scale, data‑intensive service that requires efficient, low‑cost storage solutions.
Evolution of Storage Requirements
1.1 Business Start – Solution 1.0
When the "tendu" search service was first launched, its needs were simple: random reads/writes by key and occasional full scans. A Big‑Table‑style storage system supporting Get/Put and Scan operations was sufficient.
1.2 Fast Updates – Solution 2.0
As user volume grew, update latency became a critical issue. The HDD‑based storage used in 1.0 offered poor random I/O performance, and only about 5% of columns required high‑frequency Get operations. To address this, a hybrid approach was adopted: the primary table remained on HDD, while a Redis cache was introduced for the hot columns, handling frequent Gets in memory and writing to both cache and table.
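The read and write paths of this hybrid design can be sketched roughly as follows. This is a minimal illustration, not Baidu's implementation: the class name `HybridStore`, the in-memory dictionaries standing in for the HDD table and for Redis, and the sample column names are all assumptions made for the example.

```python
class HybridStore:
    """Toy model of Solution 2.0: full table on slow storage, hot columns cached."""

    def __init__(self, hot_columns):
        self.hot_columns = set(hot_columns)
        self.table = {}       # stand-in for the HDD-backed table: {(key, column): value}
        self.cache = {}       # stand-in for Redis; holds hot columns only
        self.table_reads = 0  # counts how often the slow path is taken

    def put(self, key, column, value):
        # Every write goes to the table; hot columns are also written to the cache.
        self.table[(key, column)] = value
        if column in self.hot_columns:
            self.cache[(key, column)] = value

    def get(self, key, column):
        # Hot columns are answered from memory when present.
        if column in self.hot_columns and (key, column) in self.cache:
            return self.cache[(key, column)]
        # Everything else falls through to the table.
        self.table_reads += 1
        return self.table.get((key, column))


store = HybridStore(hot_columns={"PAGE_SCORE"})
store.put("url-1", "PAGE_SCORE", 0.9)
store.put("url-1", "BODY", "<html>...</html>")
hot = store.get("url-1", "PAGE_SCORE")  # answered by the cache
cold = store.get("url-1", "BODY")       # answered by the table
```

Because only the small hot fraction of columns is duplicated, the memory cost stays low while the frequent Gets avoid HDD random I/O.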
1.3 Large‑Scale Ingestion – Solution 3.0
New product demands introduced two data sets: a "premium" collection with high quality and frequent updates, and a massive "broad" collection that is large but updates rarely. The system responded by sharding: one database stores the premium collection, another stores the broad collection, allowing independent scaling and optimization.
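A rough sketch of the split is shown below. The article does not say how a record is assigned to a collection, so the explicit set of premium keys, the `CollectionRouter` name, and the dictionary-backed databases are assumptions chosen only to keep the example small.

```python
class CollectionRouter:
    """Routes each key to the database that holds its collection."""

    def __init__(self, premium_keys):
        # Assumption: premium membership is known up front as a set of keys.
        self.premium_keys = set(premium_keys)
        # One independent store per collection, so each can be scaled separately.
        self.databases = {"premium": {}, "broad": {}}

    def collection_for(self, key):
        return "premium" if key in self.premium_keys else "broad"

    def put(self, key, value):
        self.databases[self.collection_for(key)][key] = value

    def get(self, key):
        return self.databases[self.collection_for(key)].get(key)


router = CollectionRouter(premium_keys={"url-hot"})
router.put("url-hot", "fresh content")
router.put("url-archive", "old content")
```

Callers use a single `get`/`put` interface, while each underlying database can be sized and tuned for its own update pattern.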
Introducing the Storage Solution Layer
2.1 Fine‑Grained Storage Challenges
Midgard abstracts storage decisions behind a meta‑module. Users declare what they need from each field (e.g., 1000 QPS for PAGE_SCORE), and business teams can modify the meta‑information to add new data or re‑tune storage without touching application code.
Core Architecture – Access by Name
Midgard consists of three main components:
Meta module: stores configuration such as cache presence and storage location for each field.
Server module: receives user requests, loads the relevant meta‑information, and performs preprocessing.
Operator/Executor layer: compiles requests into a series of operators and executes them.
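How the three components cooperate can be sketched as below. This is a hypothetical illustration, not Midgard's actual code: the `META` layout, the `Op` type, and the functions `compile_request` and `run` are invented for the example, and the executor only records what it would do instead of touching real storage.

```python
from dataclasses import dataclass

# Meta module (assumed shape): per field, where it is stored and whether a cache exists.
META = {
    "sham_score": {"source": "B", "cached": True},
    "title": {"source": "B", "cached": False},
}


@dataclass(frozen=True)
class Op:
    name: str
    arg: str


def compile_request(request, meta):
    """Server step: use meta-information to turn a request into executable operators."""
    plan = []
    for op in request:
        if op.name == "Join" and meta[op.arg]["cached"]:
            # A cached field is read from the cache first, then from its source.
            plan.append(Op("Join", op.arg + ".cache"))
            plan.append(Op("FallbackJoin", op.arg))
        else:
            plan.append(op)
    return plan


def run(plan):
    """Executor stand-in: returns a trace of the operators in execution order."""
    return [f"{op.name} {op.arg}" for op in plan]


request = [
    Op("Scan", "A"),
    Op("Join", "sham_score"),
    Op("Filter", "sham_score > 5"),
    Op("Delete", "A"),
]
trace = run(compile_request(request, META))
```

The caller names only the field it wants; whether a cache is consulted is decided entirely by the meta-information, which is what makes access by name possible.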
Adjustability on Demand
Users register the QPS they require and update the meta‑information as their needs change; Midgard then optimizes storage automatically based on runtime metrics. The system can add or remove cache layers, adjust sharding, or change storage media without code changes.
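One way such a decision could be made is sketched below. The article does not describe Midgard's actual policy, so the rule used here (cache a field when its declared or observed QPS exceeds what the underlying table can serve, and drop the cache otherwise) is an assumption, as are the function name `plan_cache_changes`, the `table_qps_limit` parameter, and the sample numbers.

```python
def plan_cache_changes(meta, observed_qps, table_qps_limit):
    """Return {field: action} for fields whose cache setting should change."""
    changes = {}
    for field, info in meta.items():
        # Demand is the larger of what users declared and what was measured.
        demand = max(info["declared_qps"], observed_qps.get(field, 0))
        if demand > table_qps_limit and not info["cached"]:
            changes[field] = "add_cache"
        elif demand <= table_qps_limit and info["cached"]:
            changes[field] = "remove_cache"
    return changes


meta = {
    "PAGE_SCORE": {"declared_qps": 1000, "cached": False},
    "BODY": {"declared_qps": 10, "cached": True},
    "TITLE": {"declared_qps": 50, "cached": False},
}
changes = plan_cache_changes(meta, observed_qps={"TITLE": 40}, table_qps_limit=200)
```

The output is a change to meta-information only, so applying it would alter how requests are compiled without any edit to application code.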
Compositional Computing Capability
Operators are atomic, enabling complex tasks to be expressed as simple primitive sequences. Example primitive program:
Scan(A).Join(B.sham_score).Filter(sham_score > 5).Delete(A)
Midgard expands this using meta‑data, automatically inserting cache reads, fallback joins, and optional throttling operators.
Scan A
Join sham_score.cache // try cache
FallbackJoin sham_score // fallback to source
Filter sham_score > 5
Limit xx M/s // optional back‑pressure
Delete A
Conclusion
Midgard demonstrates an intelligent, adaptable storage layer for search systems. By decoupling business logic from concrete storage implementations, it enables incremental upgrades, caching, sharding, and future extensions without modifying application code, offering long‑term maintainability and scalability.