How We Scaled a Billion‑User System: From Monolith to Microservices

This article recounts how a rapidly growing online platform transformed a tightly coupled, fragile architecture into a scalable, high‑availability system by applying dynamic/static separation, read‑write splitting, caching, load‑balancing, intelligent monitoring, and finally migrating to a micro‑service architecture.

21CTO

In 2016, after a seven‑day effort, the Ultraman team finally defeated the monster and the WeiYing R&D team passed a critical milestone.

Key metrics: daily page views (DPV) reached the billion level, daily active users (DAU) reached ten million, the system could process 10,000 orders per second and handle 30,000 calls per second, 90% of requests completed in under 200 ms, and the SLA was 99%.

The original architecture was tightly coupled, with poorly designed data access, no table sharding, chaotic index usage, missing monitoring, many single points of failure, and fragile operations.

Survival became the primary requirement; business growth felt like a high‑speed car that could not be stopped for repairs.

Phase 1 – Quick Wins (Six‑Pulse Sword)

When the goal is simply to stay alive, focus on basic, immediate improvements:

Dynamic/Static Separation : Serve static resources from a CDN and let the client load dynamic data separately.

Read/Write Separation : Use separate database servers for reads and writes to reduce load and I/O pressure; the cache handles most read operations.

Service Scalability : Add resources (application servers, database servers, cache servers) as needed to support higher concurrency.

Proper Indexing : Choose appropriate clustered or non‑clustered indexes based on workload (OLTP, DSS).

Latency Reduction : Keep request payloads small, use HTTP pipelining and connection reuse over TCP, and set appropriate timeouts, so as to meet 98% of user‑perceived latency goals.

Incident‑Driven Improvement : Treat every incident as a learning opportunity; a dedicated team analyzes failures, proposes fixes, and tracks outcomes.
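The read/write separation item above can be illustrated with a small routing sketch. This is a minimal example under stated assumptions, not the team's implementation: the class name `ReadWriteRouter` and the server names are hypothetical, and plain strings stand in for real database connections.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def pick(self, sql):
        """Return the server that should execute this SQL statement."""
        statement = sql.lstrip().upper()
        # Only plain SELECTs go to a replica. SELECT ... FOR UPDATE takes
        # row locks, so it is treated as a write and sent to the primary.
        if statement.startswith("SELECT") and "FOR UPDATE" not in statement:
            return next(self._replicas)  # round-robin over replicas
        return self.primary


# Hypothetical server names, used only for illustration.
router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
read_target = router.pick("SELECT id FROM orders WHERE user_id = 42")
write_target = router.pick("INSERT INTO orders (user_id) VALUES (42)")
```

Replicas usually lag slightly behind the primary, so a read that must see a write made a moment earlier may still need to go to the primary.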

Intelligent monitoring was also built, because traditional open‑source tools (Zabbix, Nagios) focus on machine‑level metrics and cannot quickly pinpoint service‑level impact. The custom monitoring system raises alerts on key business metrics within 5 minutes, locates faults within 10 minutes, and drives automated remediation.
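To make the difference between machine‑level and business‑level monitoring concrete, here is a minimal sketch of an alert on one business metric, the success rate of recent requests. Everything in it is an assumption made for illustration: the class name `SuccessRateAlert`, the 95% threshold, and the minimum sample count are hypothetical, and the 5‑minute window merely echoes the alerting target mentioned above.

```python
from collections import deque


class SuccessRateAlert:
    """Alert when the success rate over a sliding time window drops too low."""

    def __init__(self, window_seconds=300, min_success_rate=0.95, min_samples=20):
        self.window_seconds = window_seconds
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, ok) pairs, oldest first

    def record(self, timestamp, ok):
        """Record the outcome of one business operation, e.g. an order."""
        self._events.append((timestamp, ok))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_alert(self, now):
        self._evict(now)
        total = len(self._events)
        if total < self.min_samples:
            return False  # too little traffic to judge
        successes = sum(1 for _, ok in self._events if ok)
        return successes / total < self.min_success_rate
```

A CPU or memory graph can look healthy while this kind of check fires, which is the gap the paragraph above describes.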

Phase 2 – Flood Control (Yu the Great)

When traffic surges like a flood, multi‑layered filtering is applied to let only valid requests reach the backend.

Client Caching : Maximize local cache usage; configure Cache‑Control, Expires, Last‑Modified, ETag headers for browsers and define effective cache policies for apps.

CDN Optimization : Add nodes for traffic distribution and combine with intelligent DNS to serve users from the nearest edge.

Load Balancing & Rate Limiting : Control traffic flow, prevent abuse, and, under extreme load, degrade non‑essential services; thresholds are set from observed peak traffic.

Read‑Side Caching : Tiered caches (in‑memory, Redis, etc.) store hot keys; distributed locks protect the backend databases from cache stampedes, and slightly stale data is accepted in exchange for lower latency.

Write‑Side Database : Ensure primary database reliability and strong consistency; apply time‑based sharding and asynchronous processing.
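The cache‑stampede protection mentioned under read‑side caching can be sketched as follows. This is a single‑process illustration only: a per‑key `threading.Lock` stands in for the distributed lock the article refers to, and the class name `StampedeGuardedCache` and the `loader` callback are hypothetical. Expiry (TTL) is left out to keep the sketch short.

```python
import threading


class StampedeGuardedCache:
    """Let only one caller per key hit the backend on a cache miss."""

    def __init__(self, loader):
        self._loader = loader  # slow backend call, e.g. a database query
        self._data = {}
        self._locks = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, key):
        # Create (or reuse) one lock per key, itself guarded by a lock.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._data:  # fast path: cache hit
            return self._data[key]
        with self._lock_for(key):  # concurrent misses queue up here
            if key in self._data:  # re-check: another caller may have filled it
                return self._data[key]
            value = self._loader(key)  # exactly one backend call per key
            self._data[key] = value
            return value
```

When many requests miss on the same hot key at once, the first one loads the value and the rest wait briefly and then read it from the cache, instead of all querying the database together.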

Phase 3 – Service Atomization to Microservices

Having survived, the team aimed for a more respectable architecture by modularizing services into micro‑services.

The overall system was split into two major groups, Trading and UGC, each with its own modules and a shared common component.

Microservice Architecture (MSA)

Benefits:

High availability through isolation, automation, self‑monitoring, and fault recovery.

Flexibility and reusability; services can be combined freely.

Agility: small teams (≈10 people) can own individual services and iterate quickly.

Technology stack agnosticism via standard REST APIs; the team uses PHP, Java, JavaScript, Python, Go, Lua, etc.

Challenges and countermeasures:

Determining service granularity: balance communication cost versus modularity by aligning with business domains.

Each microservice requires its own database, leading to data proliferation; the team strengthens foundational services to handle this.

Operational complexity: adopt Docker, build a private PaaS, and implement automated deployment, service discovery, orchestration, and comprehensive monitoring to maintain SLA and scalability.

Message‑queue inconsistencies: standardize MQ usage and reduce inter‑service call complexity to lower communication overhead.

In summary, the WeiYing team transformed a fragile monolith into a resilient, scalable micro‑service ecosystem through systematic architectural refactoring, performance optimization, and operational automation.

Microservice Architecture Diagram
Author: Xiao Dao, VP of WeiYing R&D Center