How to Scale a Schema‑Free Classification Platform to 100 Billion Records
This article explains how to design a classification‑information system that handles 100 billion rows, ten‑thousand dynamic attributes, and hundreds of thousands of QPS by using vertical partitioning, unified metadata services, and an external search layer for scalable storage and retrieval.
Background
A classification‑information platform hosts many vertical categories (recruitment, real‑estate, second‑hand goods, etc.), each with its own sub‑categories and a core entity: the post. The platform must store on the order of 100 billion rows, manage roughly ten thousand attributes, and support high‑throughput queries.
~10,000 distinct attributes across categories
Data volume reaches 100 billion rows
Every attribute can be queried, often in combination (e.g., salary + location)
Throughput can exceed several hundred thousand QPS
Naïve Approach
Initially a single table was used for a single category (recruitment):
tiezi(tid, uid, c1, c2, c3);
To support multi‑attribute queries, composite indexes such as index_1(c1, c2), index_2(c2, c3), index_3(c1, c3) were created.
When a new category (real‑estate) was added, the table simply grew more columns:
tiezi(tid, uid, c1, c2, c3, c10, c11, c12, c13);
Here c1‑c3 belong to recruitment and c10‑c13 to real‑estate.
Problems with the Naïve Design
As categories increase, the number of required composite indexes explodes, making it impossible to cover all two‑ or three‑attribute queries. Maintenance becomes unmanageable, and cross‑category queries are not supported.
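A back-of-the-envelope count illustrates the explosion. The sketch below assumes one composite index per unordered pair or triple of attributes, which matches the index_1 to index_3 pattern of the naïve design. The function name is hypothetical, and real databases can sometimes reuse an index prefix, so read the numbers as a rough upper bound rather than an exact requirement.

```python
from math import comb


def composite_index_count(num_attributes: int, max_width: int = 3) -> int:
    """Count the indexes needed to cover every combination of 2..max_width attributes.

    Assumption: one index per unordered attribute combination.
    """
    return sum(comb(num_attributes, k) for k in range(2, max_width + 1))


# Recruitment alone (c1..c3): every pair needs its own index.
print(composite_index_count(3, max_width=2))

# Recruitment plus real-estate (7 columns): pairs and triples together.
print(composite_index_count(7))

# Pairs alone across roughly 10,000 attributes.
print(comb(10_000, 2))
```

With three attributes the pair count equals the three indexes listed earlier; at platform scale the count reaches tens of millions for pairs alone, which no single table can carry.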
Vertical Partitioning as a First Step
Separate tables per vertical category:
tiezi_zhaopin(tid, uid, c1, c2, c3);
tiezi_fangchan(tid, uid, c10, c11, c12, c13);
While this isolates schemas, it introduces new issues:
How to standardize tid across tables?
How to query a user's own posts regardless of category?
How to query the latest posts?
How to support cross‑category searches (e.g., homepage search box)?
Inconsistent storage technologies (Mongo, MySQL, custom stores) increase operational complexity.
Duplicate development of components and high maintenance cost.
Industry Best Practice: Three Central Services
To solve the above challenges, the architecture introduces three unified services:
Post Center Service (Info Management Center, IMC)
Category Management Center (CMC)
Search Service
1. Post Center Service
A single tiezi table stores only generic metadata; category‑specific data lives in a JSON ext field.
tiezi(tid, uid, time, title, cate, subcate, xxid, ext);
Example ext values:
{"job":"driver","salary":8000,"location":"bj"}
{"type":"iphone","money":3500}
The table is sharded into 256 databases, cached with Memcached, and serves ~100 billion rows.
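The Post Center's row layout and sharding can be sketched roughly as follows. The column names come from the tiezi table above. Routing by tid modulo 256 is an assumption, since the article states the shard count but not the shard key, and the helper names make_row, read_attrs, and shard_for are hypothetical.

```python
import json

NUM_SHARDS = 256  # the article states the table is split across 256 databases


def shard_for(tid: int) -> int:
    # Assumption: route by tid modulo the shard count (shard key not given in the article).
    return tid % NUM_SHARDS


def make_row(tid, uid, time, title, cate, subcate, xxid, attrs: dict) -> dict:
    """Build a tiezi row: generic columns plus category-specific attributes in ext."""
    return {
        "tid": tid, "uid": uid, "time": time, "title": title,
        "cate": cate, "subcate": subcate, "xxid": xxid,
        # ext is stored as compact JSON text, so the table schema never changes
        # when a category gains or loses an attribute.
        "ext": json.dumps(attrs, separators=(",", ":")),
    }


def read_attrs(row: dict) -> dict:
    """Decode the category-specific attributes back out of ext."""
    return json.loads(row["ext"])


row = make_row(1025, 7, 1700000000, "Driver wanted", "recruitment", "driver", 0,
               {"job": "driver", "salary": 8000, "location": "bj"})
print(shard_for(row["tid"]), read_attrs(row)["salary"])
```

Because ext is opaque to the database, the storage layer can only serve lookups on the generic columns; attribute queries have to be answered elsewhere.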
2. Category Management Service
All attribute definitions, constraints, and hierarchy are stored separately. Each attribute is assigned a numeric key to reduce storage size.
Example mapping:
1 → job (recruitment, must be a lowercase string of length ≤ 32)
4 → type (second‑hand, must be a short integer)
Enum tables validate values when the attribute is enumerated.
Category hierarchy is also recorded (e.g., recruitment → sub‑category → sub‑sub‑category).
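A minimal sketch of such a registry, using the two example mappings above, might look like this. The class and function names are hypothetical. The sketch also assumes "lowercase string" means one to 32 characters from a to z and "short integer" means a signed 16-bit value; the article does not spell out either rule.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class AttrDef:
    key: int                      # numeric key stored in ext instead of the name
    name: str
    category: str
    validate: Callable[[Any], bool]


def is_lower_string_max_32(value: Any) -> bool:
    # Assumption: non-empty, letters a-z only, at most 32 characters.
    return isinstance(value, str) and re.fullmatch(r"[a-z]{1,32}", value) is not None


def is_short_int(value: Any) -> bool:
    # Assumption: "short" means a signed 16-bit integer.
    return isinstance(value, int) and not isinstance(value, bool) and -32768 <= value <= 32767


REGISTRY: Dict[str, AttrDef] = {
    "job": AttrDef(1, "job", "recruitment", is_lower_string_max_32),
    "type": AttrDef(4, "type", "second-hand", is_short_int),
}


def encode_ext(category: str, attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Validate attributes for a category and replace names with numeric keys."""
    encoded = {}
    for name, value in attrs.items():
        definition = REGISTRY.get(name)
        if definition is None or definition.category != category:
            raise ValueError(f"attribute {name!r} is not defined for {category!r}")
        if not definition.validate(value):
            raise ValueError(f"invalid value for {name!r}: {value!r}")
        encoded[str(definition.key)] = value  # JSON object keys are strings
    return encoded


print(encode_ext("recruitment", {"job": "driver"}))
```

Keeping validation here, rather than in each category's code, is what lets a new attribute be added by registering one definition instead of changing a table.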
3. Search Service
Non‑ID queries are routed to an external index built to handle complex attribute combinations, while forward lookups by post ID stay in the Post Center.
Post ID queries hit the Post Center directly.
All other personalized searches go through the external index.
The search architecture includes a stateless proxy layer, a result‑aggregation layer, and a search core that loads index data into memory for sub‑10 ms latency.
Horizontal sharding of index data adds capacity, and data redundancy (replicas of each shard) adds query throughput, so both can grow by adding machines.
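The sharded, in-memory index can be illustrated with a toy inverted index. This is a sketch under stated assumptions, not the platform's implementation: the class names are hypothetical, posts are assigned to shards by tid modulo the shard count, only exact-match conditions are supported, and replication is left out.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


class IndexShard:
    """One in-memory shard: maps (attribute, value) to the set of matching post IDs."""

    def __init__(self) -> None:
        self.postings: Dict[tuple, Set[int]] = defaultdict(set)

    def add(self, tid: int, attrs: Dict[str, Any]) -> None:
        for item in attrs.items():
            self.postings[item].add(tid)

    def search(self, conditions: Dict[str, Any]) -> Set[int]:
        # A combined query is the intersection of the posting sets of its conditions.
        sets = [self.postings.get(item, set()) for item in conditions.items()]
        return set.intersection(*sets) if sets else set()


class SearchCluster:
    """Stands in for the proxy and aggregation layers: fan out, then merge."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[IndexShard] = [IndexShard() for _ in range(num_shards)]

    def add(self, tid: int, attrs: Dict[str, Any]) -> None:
        # Assumption: each post lives on exactly one shard, chosen by tid.
        self.shards[tid % len(self.shards)].add(tid, attrs)

    def search(self, conditions: Dict[str, Any]) -> List[int]:
        hits: Set[int] = set()
        for shard in self.shards:          # scatter to every shard
            hits |= shard.search(conditions)
        return sorted(hits)                # gather and order the merged result


cluster = SearchCluster(num_shards=4)
cluster.add(1, {"job": "driver", "location": "bj"})
cluster.add(2, {"job": "driver", "location": "sh"})
cluster.add(6, {"job": "driver", "location": "bj"})
print(cluster.search({"job": "driver", "location": "bj"}))
```

The search returns post IDs only; the full posts would then be fetched from the Post Center by ID, which keeps the index small enough to hold in memory.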
Conclusion
By separating storage (Post Center), metadata (Category Management), and retrieval (Search Service), the platform can handle 100 billion rows, roughly 10,000 attributes, and hundreds of thousands of QPS. The solution emphasizes incremental design, extensibility, and the importance of a well‑structured metadata layer.
dbaplus Community