Designing a Scalable Architecture for Million‑Level DAU Internet Applications
This article explains how to build a highly available, horizontally scalable architecture for million‑level daily active users by combining DNS routing, L4/L7 load balancing, micro‑service decomposition, caching, sharded databases, hybrid‑cloud deployment, elastic scaling and multi‑level degradation strategies.
Recent incidents such as the Xi'an "One‑Code‑Pass" outage highlight the importance of designing systems with proper scalability and automatic scaling, so they can absorb traffic spikes many times the normal load.
The typical request flow for a million‑level DAU internet application includes several layers:
DNS : Resolves the application's domain name to the IP of a nearby regional IDC based on the user's location, leveraging caching to keep lookups fast.
L4 Load Balancer : Forwards traffic at the transport layer (virtual IP and port), often implemented in software with LVS (capacity above 100k QPS) or with hardware appliances such as F5.
L7 Load Balancer (Gateway) : Usually Nginx clusters that handle application‑level routing, authentication, logging, and monitoring.
Server Layer : Hosts the business logic, either as a monolithic application for simple products and small teams, or as micro‑services once the codebase grows beyond a few hundred interfaces or the team exceeds ten developers.
Cache : Uses systems like Memcached or Redis (single‑node capacity ~100k QPS, latency 1‑2 ms) to offload frequent reads from the database.
Database : Uses master‑slave replication for read‑write splitting, plus sharding (both by time and by user ID), to handle terabyte‑scale data volumes.
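The application‑level routing an Nginx‑style L7 gateway performs can be sketched as longest‑prefix matching on the request path. The paths and upstream names below are hypothetical, not from the article:

```python
# Minimal sketch of L7 (application-level) routing by URL prefix, as an
# Nginx-style gateway cluster would do. Routes and upstreams are illustrative.
ROUTES = [
    ("/api/user/", "user-service"),
    ("/api/order/", "order-service"),
    ("/static/", "cdn-origin"),
]

def route(path: str) -> str:
    """Return the upstream cluster for a request path (longest prefix wins)."""
    matches = [(prefix, upstream) for prefix, upstream in ROUTES
               if path.startswith(prefix)]
    if not matches:
        return "default-backend"
    # Prefer the most specific (longest) matching prefix.
    return max(matches, key=lambda m: len(m[0]))[1]
```

In a real deployment this table lives in Nginx `location` blocks; the gateway also attaches authentication, logging, and monitoring around the same dispatch step.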
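The cache layer typically sits in front of the database using the cache‑aside pattern: read from the cache first, fall back to the database on a miss, then repopulate the cache with a TTL. A minimal sketch, with in‑memory dicts standing in for Redis and the sharded database:

```python
import time

# Cache-aside read path. The dicts below are stand-ins for a Redis node
# (~100k QPS, 1-2 ms latency) and the backing database; any cache client
# with get/set semantics works the same way.
cache: dict = {}
database = {"user:1": {"name": "alice"}}

def get_user(key: str, ttl: float = 60.0):
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                         # cache hit: fast path
    value = database.get(key)                   # cache miss: read the DB
    if value is not None:
        cache[key] = (value, time.time() + ttl)  # repopulate with a TTL
    return value
```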
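The two sharding axes mentioned above can be combined: the user ID selects the database (hash sharding) and the record's timestamp selects the table (time sharding). The shard count and naming scheme here are illustrative assumptions:

```python
from datetime import datetime

# Sharding sketch: pick the database by user ID and the table by month.
# NUM_DB_SHARDS and the "db_NN.orders_YYYYMM" naming are assumptions.
NUM_DB_SHARDS = 16

def shard_for(user_id: int, created_at: datetime) -> str:
    """Return the physical location of one order row."""
    db = user_id % NUM_DB_SHARDS                 # user-ID axis: which database
    table = created_at.strftime("orders_%Y%m")   # time axis: which table
    return f"db_{db:02d}.{table}"
```

Because the shard key is derived deterministically, every service instance routes a given row to the same place without coordination.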
While this architecture suffices for million‑level DAU, supporting tens or hundreds of millions of daily users requires additional improvements:
Hybrid Cloud Architecture : Combines private‑cloud IDC bandwidth with public‑cloud resources. During traffic surges that exceed private‑cloud egress capacity, a portion of the load is shifted to public‑cloud zones, requiring seamless network interconnects (often dedicated lines) and a platform such as BridgX to abstract resource differences.
Full‑Link Elastic Scaling : The L4/L7 layers, server instances, cache clusters, and databases must all be able to scale out on demand. For example, a public‑cloud SLB can handle millions of concurrent connections, while Nginx instances can be added dynamically based on weighted QPS calculations.
Three‑Level Degradation Mechanism :
Level 1: Transparent to users; frees less than 30 % of capacity.
Level 2: User‑visible degradation; frees up to 50 % of capacity.
Level 3: Severe degradation; frees 50‑100 % of capacity, used only as a last resort.
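The hybrid‑cloud overflow decision described above reduces to a simple split: traffic up to the private IDC's egress capacity stays on the private cloud, and only the excess is shifted to public‑cloud zones. The numbers in the sketch are illustrative, not from the article:

```python
# Hybrid-cloud overflow split: keep traffic on the private cloud up to its
# egress capacity, send the remainder to public-cloud zones. A platform like
# BridgX would abstract the resource differences between the two sides.
def split_traffic(total_qps: float, private_capacity_qps: float):
    """Return (private_qps, public_qps) for the current load."""
    private = min(total_qps, private_capacity_qps)
    public = max(0.0, total_qps - private_capacity_qps)
    return private, public
```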
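The "adding Nginx instances based on weighted QPS" calculation is, at its core, capacity planning: divide projected peak QPS by per‑instance capacity discounted by a safety factor, and round up. All numbers here are illustrative assumptions:

```python
import math

# Back-of-envelope sizing for elastic scaling of an Nginx tier. The
# per-instance capacity and safety factor are assumptions for illustration.
def instances_needed(peak_qps: float, per_instance_qps: float,
                     safety_factor: float = 0.7) -> int:
    """How many instances to run so no instance is planned above 70% load."""
    usable = per_instance_qps * safety_factor   # don't plan to run hot
    return math.ceil(peak_qps / usable)
```

An autoscaler would re‑evaluate this against observed QPS and add or remove instances to close the gap.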
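The three degradation levels above can be driven by a load signal. The thresholds in this sketch are illustrative; a real system would take the decision from a decision‑support system rather than a single metric:

```python
# Map system load to the three degradation levels described above.
# load_ratio = current load / total capacity; thresholds are assumptions.
def degradation_level(load_ratio: float) -> int:
    """Return 0 (healthy) through 3 (severe, last resort)."""
    if load_ratio <= 1.0:
        return 0   # healthy: no degradation
    if load_ratio <= 1.3:
        return 1   # level 1: transparent to users, frees < 30% capacity
    if load_ratio <= 1.5:
        return 2   # level 2: user-visible, frees up to 50% capacity
    return 3       # level 3: severe degradation, last resort
```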
Additional supporting mechanisms such as decision‑support systems, on‑call alerting, and automated scaling tools (e.g., CudgX) are essential for maintaining high availability at massive scale.
Published by the High Availability Architecture official account.