
How to Scale a Schema‑Free Classification Platform to 100 Billion Records

This article explains how to design a classification‑information system that handles 100 billion rows, roughly 10,000 dynamic attributes, and hundreds of thousands of QPS. It does so with vertical partitioning, unified metadata services, and an external search layer for scalable storage and retrieval.

dbaplus Community

Sep 6, 2023


Background

A classification‑information (classified‑ads) platform hosts many vertical categories (recruitment, real‑estate, second‑hand goods, etc.), each with its own sub‑categories and a core entity: the post. The platform must store on the order of 100 billion rows, handle roughly ten thousand attributes, and support high‑throughput queries:

~10,000 distinct attributes across categories

Data volume reaches 100 billion rows

Every attribute can be queried, often in combination (e.g., salary + location)

Throughput can exceed several hundred thousand QPS

Naïve Approach

Initially, a single table served a single category (recruitment): tiezi(tid, uid, c1, c2, c3). To support multi‑attribute queries, composite indexes such as index_1(c1, c2), index_2(c2, c3), and index_3(c1, c3) were created.

When a new category (real‑estate) was added, the table simply gained more columns: tiezi(tid, uid, c1, c2, c3, c10, c11, c12, c13). Here c1‑c3 belong to recruitment and c10‑c13 to real‑estate, so every row leaves the other category's columns empty.

Problems with the Naïve Design

As categories are added, the number of composite indexes needed grows combinatorially with the number of attributes, so it becomes impossible to cover every two‑ or three‑attribute query. Maintenance becomes unmanageable, and cross‑category queries are not supported.
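To make the growth concrete, the short sketch below counts the distinct two‑ and three‑attribute combinations for a given number of attributes. It is an estimate of the query shapes to cover, not an exact index count, since one composite index can also serve queries on its leftmost prefix. The function name composite_index_count is hypothetical.

```python
from math import comb


def composite_index_count(num_attributes: int, max_width: int = 3) -> int:
    """Count distinct attribute sets of size 2..max_width.

    Rough estimate only: a real composite index also serves its leftmost
    prefixes, so fewer indexes than this may suffice in practice.
    """
    return sum(comb(num_attributes, k) for k in range(2, max_width + 1))


# Recruitment only (c1..c3): 3 pairs + 1 triple
print(composite_index_count(3))                      # 4
# Recruitment + real-estate (7 columns): 21 pairs + 35 triples
print(composite_index_count(7))                      # 56
# ~10,000 attributes, pairs only
print(composite_index_count(10_000, max_width=2))    # 49995000
```

Even restricted to pairs, ten thousand attributes yield roughly fifty million combinations, which is why index‑per‑combination does not scale.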

Vertical Partitioning as a First Step

Separate tables per vertical category:

tiezi_zhaopin(tid, uid, c1, c2, c3);

tiezi_fangchan(tid, uid, c10, c11, c12, c13);

While this isolates schemas, it introduces new issues:

How to standardize tid across tables?

How to query a user's own posts regardless of category?

How to query the latest posts?

How to support cross‑category searches (e.g., homepage search box)?

Inconsistent storage technologies across categories (MongoDB, MySQL, custom stores) increase operational complexity.

Duplicate development of components and high maintenance cost.

Industry Best Practice: Three Central Services

To solve the above challenges, the architecture introduces three unified services:

Post Center Service (Info Management Center, IMC)

Category Management Center (CMC)

Search Service

1. Post Center Service

A single tiezi table stores only generic metadata; category‑specific data lives in a JSON ext field: tiezi(tid, uid, time, title, cate, subcate, xxid, ext). Example ext values:

{"job":"driver","salary":8000,"location":"bj"}

{"type":"iphone","money":3500}

The table is sharded into 256 databases, cached with Memcached, and serves ~100 billion rows.
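A minimal sketch of how such a row might be built and routed to one of the 256 databases. The article does not state the sharding rule, so routing by tid modulo 256 is an assumption, as are the helper names shard_for and make_row and the column types. The xxid column is omitted because the article does not define it.

```python
import json

NUM_SHARDS = 256  # shard count from the article


def shard_for(tid: int) -> int:
    """Pick a database shard for a post (assumed rule: tid modulo shard count)."""
    return tid % NUM_SHARDS


def make_row(tid, uid, time, title, cate, subcate, ext: dict) -> dict:
    """Build a tiezi row: generic columns plus category data serialized as JSON."""
    return {
        "tid": tid,
        "uid": uid,
        "time": time,
        "title": title,
        "cate": cate,
        "subcate": subcate,
        # Category-specific attributes never become columns; they live in ext.
        "ext": json.dumps(ext, separators=(",", ":")),
    }


row = make_row(
    tid=1000257, uid=42, time=1693958400, title="Driver wanted",
    cate="recruitment", subcate="driver",
    ext={"job": "driver", "salary": 8000, "location": "bj"},
)
print(shard_for(row["tid"]))              # 65
print(json.loads(row["ext"])["salary"])   # 8000
```

Because the table schema never changes when a category adds an attribute, new categories need no DDL. The trade‑off is that the database cannot index fields inside ext, which is what motivates the separate search service.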

2. Category Management Center (CMC)

All attribute definitions, constraints, and hierarchy are stored separately from the post data. Each attribute is assigned a numeric key, so the ext field can carry a short key instead of the full attribute name, which reduces storage size.

Example mapping:

1 → job (recruitment, must be a lowercase string of length ≤ 32)

4 → type (second‑hand, must be a short integer)

Enum tables validate values when the attribute is enumerated.

Category hierarchy is also recorded (e.g., recruitment → sub‑category → sub‑sub‑category).
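The mapping above could be sketched as a small in‑memory registry that validates values and rewrites attribute names to numeric keys before a post is stored. The names Attribute, REGISTRY, and encode_ext are hypothetical. Treating "short integer" as a signed 16‑bit range and assuming attribute names are unique across categories are both assumptions, not details from the article.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Attribute:
    key: int                          # compact numeric key stored in ext
    name: str                         # human-readable attribute name
    category: str                     # owning vertical category
    validate: Callable[[Any], bool]   # constraint check for values


def is_lower_str_max32(value: Any) -> bool:
    """Lowercase string of length <= 32."""
    return isinstance(value, str) and len(value) <= 32 and value == value.lower()


def is_short_int(value: Any) -> bool:
    """Signed 16-bit integer (assumed meaning of 'short integer')."""
    return (isinstance(value, int) and not isinstance(value, bool)
            and -32768 <= value <= 32767)


REGISTRY: Dict[int, Attribute] = {
    1: Attribute(1, "job", "recruitment", is_lower_str_max32),
    4: Attribute(4, "type", "second-hand", is_short_int),
}


def encode_ext(named: Dict[str, Any]) -> Dict[str, Any]:
    """Validate values and replace attribute names with their numeric keys."""
    by_name = {attr.name: attr for attr in REGISTRY.values()}
    encoded = {}
    for name, value in named.items():
        attr = by_name.get(name)
        if attr is None:
            raise KeyError(f"unknown attribute: {name}")
        if not attr.validate(value):
            raise ValueError(f"invalid value for {name}: {value!r}")
        encoded[str(attr.key)] = value  # JSON object keys are strings
    return encoded


print(encode_ext({"job": "driver"}))   # {'1': 'driver'}
print(encode_ext({"type": 7}))         # {'4': 7}
```

Keeping validation in one service means every writer applies the same constraints, and adding an attribute is a registry entry rather than a schema change.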

3. Search Service

Non‑ID queries are routed to an external index, which resolves complex attribute combinations to matching post IDs. Forward lookups by post ID do not need the index.

Post ID queries hit the Post Center directly.

All other personalized searches go through the external index.

The search architecture includes a stateless proxy layer, a result‑aggregation layer, and a search core that loads index data into memory for sub‑10 ms latency.

Horizontal sharding of the index data, combined with data redundancy, lets capacity and throughput scale out by adding nodes.
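As an illustration of the layers described above, here is a minimal in‑memory sketch: each shard holds an inverted index from (attribute, value) pairs to post IDs, and a proxy fans a query out to all shards and merges the hits. The class names IndexShard and SearchProxy and the routing by tid modulo shard count are assumptions. The sketch supports exact‑match criteria only and leaves out replication, ranking, and range queries.

```python
from collections import defaultdict


class IndexShard:
    """One search-core node: an in-memory inverted index."""

    def __init__(self):
        # (attribute, value) -> set of post IDs containing that pair
        self.postings = defaultdict(set)

    def add(self, tid, attrs):
        for item in attrs.items():
            self.postings[item].add(tid)

    def search(self, criteria):
        # Intersect the posting lists of every requested (attribute, value).
        sets = [self.postings.get(item, set()) for item in criteria.items()]
        return set.intersection(*sets) if sets else set()


class SearchProxy:
    """Proxy plus aggregation: routes writes, fans out reads, merges results."""

    def __init__(self, num_shards):
        self.shards = [IndexShard() for _ in range(num_shards)]

    def index(self, tid, attrs):
        # Assumed routing rule: each post lives on exactly one shard.
        self.shards[tid % len(self.shards)].add(tid, attrs)

    def search(self, criteria):
        hits = set()
        for shard in self.shards:   # scatter
            hits |= shard.search(criteria)
        return sorted(hits)         # gather


proxy = SearchProxy(num_shards=4)
proxy.index(1, {"job": "driver", "salary": 8000, "location": "bj"})
proxy.index(2, {"job": "driver", "salary": 9000, "location": "bj"})
proxy.index(3, {"job": "cook", "salary": 8000, "location": "sh"})
proxy.index(6, {"job": "cook", "salary": 8000, "location": "bj"})

print(proxy.search({"salary": 8000, "location": "bj"}))   # [1, 6]
```

The search returns only post IDs; the caller would then fetch the full rows from the Post Center by tid. Any attribute combination is answered by set intersection, so no per‑combination index is needed.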

Conclusion

By separating storage (Post Center), metadata (Category Management), and retrieval (Search Service), the platform can handle 100 billion rows, roughly 10,000 attributes, and hundreds of thousands of QPS. The solution emphasizes incremental design, extensibility, and the importance of a well‑structured metadata layer.
