How We Scaled a Free Mask Distribution System to 220k+ Concurrent Users on Alibaba Cloud
During the early COVID-19 response, a city launched a free-mask reservation service that faced massive bursts of timed traffic. This article traces the rapid evolution of its architecture, from a single-server Nginx/Tomcat setup through a multi-layer SLB, CDN, and read-write-split database to an ideal CDN-backed design, documenting the performance bottlenecks, scaling limits, and concrete tuning steps that turned a crash-prone launch into a 7-minute sell-out.
Background
During the early pandemic a local government launched a free‑mask reservation system that accepted requests only between 09:00 and 12:00. The limited‑time, high‑traffic scenario caused severe concurrency and bandwidth spikes, prompting several architectural revisions.
Architecture V1 – Initial Design
At 22:00 on 2 Feb the system consisted of:
Clients accessed an ECS instance directly over HTTPS.
Nginx on the ECS listened on port 443.
Nginx reverse‑proxied to Tomcat; static files were served by Nginx, dynamic requests by Tomcat.
Application first queried Redis; on miss it queried MySQL. Data sync between Redis and MySQL was handled by the application.
Advantages: simple management and deployment.
Disadvantages: poor performance, no scalability, single point of failure. The service crashed when the reservation page leaked before the official start time.
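The V1 read path is a classic cache-aside pattern. A minimal sketch in Python (dicts stand in for the real Redis and MySQL clients; the production system was a Java application, so this is illustrative only):

```python
# Cache-aside read path: try Redis first, fall back to MySQL, then backfill.
# The two dicts below are stand-ins for real Redis/MySQL clients.

redis_cache = {}                      # stand-in for Redis
mysql_db = {"user:42": {"masks": 5}}  # stand-in for MySQL

def get_reservation(key):
    value = redis_cache.get(key)      # 1. check the cache
    if value is None:
        value = mysql_db.get(key)     # 2. on miss, hit the database
        if value is not None:
            redis_cache[key] = value  # 3. backfill so the next read is cheap
    return value
```

The write path must keep the two stores consistent; in this system that synchronization was done in application code, which is exactly why a cache/database mismatch is a risk under load.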
Architecture V2 – Emergency Scaling
With the launch deadline imminent and no code changes possible, the team added infrastructure components:
SLB distributing load across horizontally cloned (mirror-image) ECS instances.
Managed read‑write‑split database (Alibaba Cloud RDS).
Adjusted Nginx protocol settings.
Backup cluster enabled via dual A‑record DNS.
Identified failures in SMS delivery and login‑cookie initialization.
Advantages: higher availability and increased capacity.
Disadvantages: static assets were still served from ECS, which pushed SLB outbound bandwidth to ~5 Gbps with concurrency above 220,000. DNS provider limits prevented adding CDN records.
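Back-of-envelope arithmetic shows why serving static assets from ECS could not hold: ~5 Gbps of outbound bandwidth shared across 220,000 concurrent users leaves each user only about 23 kbps (an illustrative estimate, not a measured figure):

```python
# Back-of-envelope bandwidth check (illustrative values from the article).
outbound_bps = 5e9          # ~5 Gbps SLB outbound at peak
concurrent_users = 220_000  # >220k concurrent users

per_user_kbps = outbound_bps / concurrent_users / 1e3
print(round(per_user_kbps, 1))  # prints 22.7  (kbps per user)
```

At that rate even a modest page of images and scripts takes minutes to load, which is why offloading static traffic to a CDN was the decisive fix in V3.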
Architecture V3 – CDN Integration
Further changes introduced:
CDN for massive bandwidth offloading.
Removed Nginx reverse proxy.
Disaster‑recovery switch to fall back to the legacy program.
Virtual server group for switching between the new and old programs; a layer-7 SLB is limited to 200 backend instances.
On 5 Feb the new architecture sold out inventory in 7 minutes with smooth user experience.
Advantages: static traffic offloaded to CDN, reducing SLB load.
Disadvantages: required an extra domain, introduced cross‑origin issues, and an SMS encoding bug forced a fallback.
Ideal Architecture V4 – CDN‑Backed Design
Target design:
Main domain points to CDN.
CDN forwards requests to SLB using different protocols (HTTP/HTTPS) to route traffic to either the new or old program based on listener configuration.
Advantages: static acceleration, dynamic origin, no cross‑origin problems, easy program switching.
Disadvantages: still requires manual configuration; container‑based deployment would further simplify scaling.
Performance Summary
For safety, the final architecture ran on 150 ECS instances, though post-hoc analysis showed that ~50 would have been sufficient. Overall, the three generations of revisions turned a 5-hour outage into a 7-minute sell-out.
Optimization Notes
Parameter Tuning
Network kernel settings:
net.ipv4.tcp_max_tw_buckets = 5000 # → 50000
net.ipv4.tcp_max_syn_backlog = 1024 # → 4096
net.core.somaxconn = 128 # → 4096
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_recycle = 1 # caution: unsafe behind NAT; removed in Linux 4.12
File descriptor limits (/etc/security/limits.conf):
* soft nofile 65535
* hard nofile 65535
Nginx tuning:
worker_connections 1024 # → 10240
worker_processes 1 # → 16 (or auto)
worker_rlimit_nofile 1024 # → 102400
listen 80 backlog 511 # → 65533
Enable keep-alive on Nginx to reduce short-connection overhead.
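The kernel parameters listed above can be applied at runtime and then persisted; a sketch of the procedure (root required; values as tuned above):

```shell
# Apply the tuned kernel settings immediately, without a reboot.
sysctl -w net.ipv4.tcp_max_tw_buckets=50000
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_tw_reuse=1

# To persist across reboots, add the same keys to /etc/sysctl.conf
# and reload with:
#   sysctl -p
```

Note that `net.core.somaxconn` only raises the kernel ceiling; the Nginx `listen ... backlog` value above must also be raised, since the effective accept queue is the smaller of the two.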
Architecture Optimizations
Scale SLB backend ECS count and unify instance specifications.
Remove invalid upstream ports from Nginx configuration.
Use Alibaba Cloud Assistant for bulk ECS operations and parameter adjustments.
Monitor ECS, SLB, DCDN, Redis via CloudMonitor dashboards.
Switch SLB to layer‑7 listener mode and avoid session persistence that breaks login state.
Application Optimizations
Add GC logs and set JVM memory limits, e.g.:
/usr/bin/java -server -Xmx8g -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/app.gc.log -Dserver.port=8080 -jar /home/app/serverboot-0.0.1-SNAPSHOT.jar
Require a Redis-based SSO session before issuing SMS verification codes.
Jedis pool tuning:
maxTotal 8 # → 20
acceptCount tuned to match somaxconn
Upgrade Spring Boot 1.5's bundled Jedis 2.9.1 to 2.10.2 to fix connection leaks.
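Why maxTotal matters is easy to demonstrate in miniature: a bounded pool fails (or blocks) once every connection is checked out, so an undersized pool throttles the entire request path. A language-neutral sketch in Python (the production client was Jedis; this toy pool only mirrors the maxTotal semantics):

```python
import queue

# Miniature connection pool: max_total bounds how many "connections" can be
# checked out at once, analogous to JedisPool's maxTotal. Raising it from 8
# to 20 lets more concurrent requests hold a connection at the same time.

class TinyPool:
    def __init__(self, max_total):
        self._free = queue.Queue()
        for i in range(max_total):
            self._free.put(f"conn-{i}")  # placeholder connection objects

    def borrow(self, timeout=0.01):
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("pool exhausted")  # what callers see under load

    def release(self, conn):
        self._free.put(conn)

pool = TinyPool(max_total=8)
held = [pool.borrow() for _ in range(8)]  # all 8 connections now in use;
# a 9th borrow fails until one is released.
```

A leak (a borrow never followed by a release, as in the Jedis 2.9.1 bug mentioned above) has the same symptom as an undersized pool: borrows start failing even at modest traffic.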
Database Optimizations
Switch Redis public endpoint to internal VPC address.
Shorten Redis session timeout to free connections faster.
Optimize slow SQL queries (use CloudDBA).
Add read‑only replica for automatic read‑write splitting.
Increase TCP backlog settings and the number of read‑write‑split instances.
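The read-write split boils down to routing by statement type: writes go to the primary, reads to a replica. Alibaba Cloud RDS applies this rule transparently at its proxy endpoint; the sketch below only illustrates the routing logic, and the endpoint names are hypothetical:

```python
# Illustrative routing rule behind read/write splitting: SELECTs go to a
# read-only replica, everything else to the primary. Endpoint names are
# hypothetical; RDS performs this routing transparently at the proxy layer.

PRIMARY = "rds-primary.internal"    # hypothetical primary endpoint
REPLICA = "rds-replica.internal"    # hypothetical read-only replica

def route(sql: str) -> str:
    first_word = sql.lstrip().split(None, 1)[0].upper()
    return REPLICA if first_word == "SELECT" else PRIMARY
```

A real proxy is more careful than this: reads inside a transaction that has already written are pinned to the primary, so callers never observe replica lag on their own writes.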
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.