How We Scaled a Free Mask Distribution System to 220k+ Concurrent Users on Alibaba Cloud

During the early COVID-19 response, a city launched a free-mask reservation service that faced massive, tightly timed traffic. Its architecture evolved rapidly, from a single-server Nginx/Tomcat setup to a multi-layer design with SLB, CDN, and a read-write-split database, and finally to an ideal CDN-backed design. This article documents the performance bottlenecks, scaling limits, and concrete tuning steps that turned the launch into a 7-minute sell-out.


Background

During the early pandemic a local government launched a free‑mask reservation system that accepted requests only between 09:00 and 12:00. The limited‑time, high‑traffic scenario caused severe concurrency and bandwidth spikes, prompting several architectural revisions.

Architecture V1 – Initial Design

As of 22:00 on 2 February, the system consisted of:

Clients accessed an ECS instance directly over HTTPS.

Nginx on the ECS listened on port 443.

Nginx reverse‑proxied to Tomcat; static files were served by Nginx, dynamic requests by Tomcat.

Application first queried Redis; on miss it queried MySQL. Data sync between Redis and MySQL was handled by the application.
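
The read path described above is a standard cache-aside pattern. A minimal sketch in Java, assuming Jedis and JDBC; the key layout, table name, and wiring are illustrative rather than taken from the original code:

import redis.clients.jedis.Jedis;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ReservationCache {
    private final Jedis jedis;           // Redis client
    private final DataSource dataSource; // MySQL connection pool

    public ReservationCache(Jedis jedis, DataSource dataSource) {
        this.jedis = jedis;
        this.dataSource = dataSource;
    }

    /** Cache-aside read: try Redis first, fall back to MySQL and backfill the cache. */
    public String findStatus(String userId) throws Exception {
        String key = "reservation:" + userId;        // illustrative key layout
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;                           // cache hit
        }
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT status FROM reservation WHERE user_id = ?")) {
            ps.setString(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;                     // no reservation yet
                }
                String status = rs.getString("status");
                jedis.setex(key, 300, status);       // backfill with a short TTL
                return status;
            }
        }
    }
}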

Advantages: simple management and deployment.

Disadvantages: poor performance, no scalability, single point of failure. The service crashed when the reservation page leaked before the official start time.

[Figure: Architecture V1]

Architecture V2 – Emergency Scaling

With the launch deadline imminent and no code changes possible, the team added infrastructure components:

SLB in front of horizontally cloned (mirrored) ECS instances for load distribution.

Managed read‑write‑split database (Alibaba Cloud RDS).

Adjusted Nginx protocol settings.

Backup cluster enabled via dual A‑record DNS.

Identified failures in SMS delivery and login‑cookie initialization.

Advantages: higher availability and increased capacity.

Disadvantages: static assets were still served from the ECS instances, pushing SLB outbound bandwidth to ~5 Gbps with more than 220 k concurrent users; DNS provider limits prevented adding CDN records.

[Figure: Architecture V2]

Architecture V3 – CDN Integration

Further changes introduced:

CDN for massive bandwidth offloading.

Removed Nginx reverse proxy.

Disaster‑recovery switch to fall back to the legacy program.

Virtual server groups for switching between the new and old programs; a layer-7 SLB is limited to 200 backend instances.

On 5 Feb the new architecture sold out the inventory in 7 minutes with a smooth user experience.

Advantages: static traffic offloaded to CDN, reducing SLB load.

Disadvantages: it required an extra domain and introduced cross-origin issues, and an SMS encoding bug forced a fallback to the legacy program.

[Figure: Architecture V3]

Ideal Architecture V4 – CDN‑Backed Design

Target design:

Main domain points to CDN.

CDN forwards requests to SLB using different protocols (HTTP/HTTPS) to route traffic to either the new or old program based on listener configuration.

Advantages: static acceleration, dynamic origin, no cross‑origin problems, easy program switching.

Disadvantages: still requires manual configuration; container‑based deployment would further simplify scaling.

[Figure: Ideal architecture]

Performance Summary

For safety the final architecture ran on 150 ECS instances, though post-event analysis showed that roughly 50 would have been sufficient. Overall, the three revisions turned the initial 5-hour outage into a 7-minute sell-out.

[Figure: Performance statistics]

Optimization Notes

Parameter Tuning

Network kernel settings:

net.ipv4.tcp_max_tw_buckets = 5000   # → 50000
net.ipv4.tcp_max_syn_backlog = 1024  # → 4096
net.core.somaxconn = 128            # → 4096
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_recycle = 1
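
These values live in /etc/sysctl.conf; they can be applied without a reboot by running sysctl -p (or set individually with sysctl -w).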

File descriptor limits (/etc/security/limits.conf):

* soft nofile 65535
* hard nofile 65535

Nginx tuning:

worker_connections 1024;    # → 10240
worker_processes 1;         # → 16 (or auto)
worker_rlimit_nofile 1024;  # → 102400
listen 80 backlog=511;      # → backlog=65533

Enable keep‑alive on Nginx to reduce short‑connection overhead.
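
A minimal sketch of the relevant directives, with illustrative values and an assumed upstream name (tomcat_backend); upstream keep-alive additionally needs HTTP/1.1 and a cleared Connection header:

keepalive_timeout  65;               # keep client connections open between requests
keepalive_requests 1000;             # requests allowed per client connection

upstream tomcat_backend {
    server 127.0.0.1:8080;
    keepalive 64;                    # idle connections kept open towards Tomcat
}

server {
    location / {
        proxy_pass http://tomcat_backend;
        proxy_http_version 1.1;      # required for upstream keep-alive
        proxy_set_header Connection "";
    }
}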

Architecture Optimizations

Scale SLB backend ECS count and unify instance specifications.

Remove invalid upstream ports from Nginx configuration.

Use Alibaba Cloud Assistant for bulk ECS operations and parameter adjustments.

Monitor ECS, SLB, DCDN, Redis via CloudMonitor dashboards.

Switch SLB to layer‑7 listener mode and avoid session persistence that breaks login state.

Application Optimizations

Add GC logs and set JVM memory limits, e.g.:

/usr/bin/java -server -Xmx8g -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/app.gc.log -Dserver.port=8080 -jar /home/app/serverboot-0.0.1-SNAPSHOT.jar

Require a Redis‑based SSO session before issuing SMS verification codes.
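
A minimal sketch of that guard in Java, assuming Jedis for the session store; the key layout and the SmsClient interface are hypothetical:

import java.security.SecureRandom;
import redis.clients.jedis.Jedis;

public class SmsCodeService {
    private final Jedis jedis;
    private final SmsClient smsClient;   // hypothetical SMS gateway wrapper

    public SmsCodeService(Jedis jedis, SmsClient smsClient) {
        this.jedis = jedis;
        this.smsClient = smsClient;
    }

    /** Send a verification code only if the caller already holds a valid SSO session. */
    public boolean sendVerificationCode(String sessionId, String phone) {
        String userId = jedis.get("sso:session:" + sessionId);   // illustrative key layout
        if (userId == null) {
            return false;    // no session: refuse, which blocks anonymous SMS abuse
        }
        String code = String.format("%06d", new SecureRandom().nextInt(1_000_000));
        jedis.setex("sms:code:" + phone, 300, code);             // code valid for 5 minutes
        return smsClient.send(phone, "Your verification code is " + code);
    }

    /** Hypothetical SMS gateway interface. */
    public interface SmsClient {
        boolean send(String phone, String message);
    }
}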

Jedis pool tuning, plus the matching Tomcat connector backlog:

maxTotal 8        # → 20
acceptCount       # (Tomcat connector) tuned to match net.core.somaxconn
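
A sketch of those pool settings in code, assuming the application builds its own JedisPool; host, port, and timeout values are illustrative:

import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolFactory {
    public static JedisPool create(String host, int port) {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(20);           // raised from the default of 8
        config.setMaxIdle(20);            // keep connections warm under load
        config.setMaxWaitMillis(2000);    // fail fast instead of queueing indefinitely
        config.setTestOnBorrow(true);     // detect stale connections before use
        return new JedisPool(config, host, port, 2000);   // 2 s connect/read timeout
    }
}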

Upgrade Spring Boot 1.5’s bundled Jedis 2.9.1 to 2.10.2 to fix connection leaks.

Database Optimizations

Switch Redis public endpoint to internal VPC address.

Shorten Redis session timeout to free connections faster.

Optimize slow SQL queries (use CloudDBA).

Add read‑only replica for automatic read‑write splitting.

Increase TCP backlog settings and the number of read‑write‑split instances.
