How We Engineered a Million‑User Lottery System to Survive Massive Spikes
This article details the end‑to‑end architecture, rate‑limiting strategies, caching layers, database optimizations, and hardware upgrades that enabled a lottery service to handle daily traffic exceeding one million users during peak promotional events.
1. Server‑side Rate Limiting
We use an A10 hardware load balancer (commercial) instead of Nginx for simplicity, and Tomcat as the web server. Two main server‑side mitigations were applied:
CC protection : to blunt CC (Challenge Collapsar, i.e. HTTP flood) attacks, limit each IP to 200 requests per minute; excess requests are rejected. This can be configured on the A10 or, in an Nginx setup, with its request‑limiting module (limit_req).
Tomcat concurrency tuning : our original setting of maxThreads=500 caused timeouts under heavy load. Performance testing showed degradation beyond 400 concurrent requests, so we reduced maxThreads to 400 to cap Tomcat's processing capacity.
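The per‑IP rule above is normally enforced on the load balancer, but the idea can be sketched in Java as a fixed‑window counter. This is only an illustration of the 200‑requests‑per‑minute rule, not how the A10 or Nginx implement it; the class name IpRateLimiter and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical fixed-window limiter: at most `limitPerWindow` requests per IP per window.
final class IpRateLimiter {
    private static final class Window {
        final long start;                               // window start time in ms
        final AtomicInteger count = new AtomicInteger();
        Window(long start) { this.start = start; }
    }

    private final int limitPerWindow;
    private final long windowMillis;
    // Note: entries are never evicted in this sketch; a real limiter would expire idle IPs.
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    IpRateLimiter(int limitPerWindow, long windowMillis) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request may proceed, false if the IP is over its limit.
    boolean allow(String ip, long nowMillis) {
        // Atomically start a fresh window when none exists or the current one has expired.
        Window w = windows.compute(ip, (key, cur) ->
                (cur == null || nowMillis - cur.start >= windowMillis)
                        ? new Window(nowMillis)
                        : cur);
        return w.count.incrementAndGet() <= limitPerWindow;
    }
}
```

With new IpRateLimiter(200, 60_000L), the 201st call to allow for the same IP inside one minute returns false, while other IPs are unaffected. A fixed window lets a client burst at a window boundary; a sliding window or token bucket would smooth that out at the cost of more state.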
2. Application‑layer Rate Limiting
At the application level we added three mechanisms:
Semaphore control : a Java Semaphore with 350 permits guards the draw logic. Of Tomcat's 400 threads, this leaves 50 free to reject excess requests, so over‑limit traffic gets a quick “no prize” response within ~10 ms instead of queuing.
User‑behavior identification : real‑time human‑bot detection based on click patterns, IP, User‑Agent, device ID, etc. Requests lacking normal interaction are flagged and blocked. A risk‑list of known bots or scalpers is also maintained.
Additional rules : activity‑specific limits stored in cache further trim traffic.
The original figures compare the request flow before and after behavior detection: peak traffic drops from 600 k to 300 k requests per minute, and the time to prize exhaustion improves dramatically.
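The semaphore control described in this section can be sketched as follows. This is a minimal example under stated assumptions: DrawGate, NO_PRIZE, and the drawLogic callback are hypothetical names, and the permit count is whatever the caller passes in (350 in the setup above).

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical gate in front of the draw logic: requests that cannot get a
// permit immediately are answered with "no prize" instead of waiting.
final class DrawGate {
    static final String NO_PRIZE = "NO_PRIZE";

    private final Semaphore permits;

    DrawGate(int maxConcurrentDraws) {
        this.permits = new Semaphore(maxConcurrentDraws);
    }

    String handle(Supplier<String> drawLogic) {
        // tryAcquire() never blocks, so an over-limit request returns right away.
        if (!permits.tryAcquire()) {
            return NO_PRIZE;
        }
        try {
            return drawLogic.get();
        } finally {
            // Always give the permit back, even if the draw logic throws.
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

The non‑blocking tryAcquire is the important design choice: a blocking acquire would tie up the remaining Tomcat threads and bring back the timeouts that the limit was meant to prevent.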
3. Application‑layer Performance Optimization
Performance bottlenecks centered on the database. We applied:
Distributed cache (Ycache) : a Memcached‑based component stores large user‑related data to reduce DB reads.
Local cache : hot, rarely‑updated data (e.g., activity rules) cached in‑process using EhCache or a simple ConcurrentHashMap.
Optimistic locking : update statements include a version column to ensure only one winner decrements the prize count.
update award set award_num=award_num-1, version=version+1 where id=#{id} and version=#{version} and award_num>0
Unique index : a unique constraint on (prize_id, user_id) prevents duplicate winning records.
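The optimistic‑locking update and the unique index can be mimicked in memory to show how they behave under contention. The sketch below is not the service's data‑access code: AwardRow and WinRecords are hypothetical stand‑ins for the award table and the winning‑record table, and it assumes the version is incremented on every successful update.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for one row of the award table.
final class AwardRow {
    private int awardNum;
    private int version;

    AwardRow(int awardNum) {
        this.awardNum = awardNum;
    }

    // Mirrors "UPDATE ... WHERE version = ? AND award_num > 0":
    // returns the number of affected rows (1 on success, 0 otherwise).
    synchronized int decrementIfVersionMatches(int expectedVersion) {
        if (version != expectedVersion || awardNum <= 0) {
            return 0; // stale read or no stock left: the caller did not win
        }
        awardNum--;
        version++;   // assumed: each successful update bumps the version
        return 1;
    }

    synchronized int version() { return version; }

    synchronized int awardNum() { return awardNum; }
}

// Hypothetical stand-in for a table with a unique index on (prize_id, user_id).
final class WinRecords {
    private final Set<String> keys = ConcurrentHashMap.newKeySet();

    // Returns false when the same (prizeId, userId) pair was already recorded,
    // which is where a database would raise a duplicate-key error.
    boolean insert(long prizeId, long userId) {
        return keys.add(prizeId + ":" + userId);
    }
}
```

A caller reads the row's version, attempts the decrement with that version, and treats a result of 0 as "no prize". Two requests that read the same version cannot both succeed, and a second insert for the same user and prize is rejected.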
4. Database and Hardware
Initial load tests with 50 concurrent users yielded average response times >600 ms and peaks >1 s, exposing a database connection pool of only 30‑50 connections. Raising the pool to 100 reduced connection timeouts but did not solve the latency.
VisualVM snapshots identified heavy time spent in database write methods and an RPC call. Further investigation revealed the test server used an old mechanical HDD, causing high log file sync wait times (>60 ms). Switching to SSD reduced average response time to 136 ms at 441 concurrent users, comfortably supporting the estimated 190 k requests per minute.
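As a rough sanity check on those numbers, Little's law relates concurrency, response time, and throughput. The sketch below assumes a closed‑loop test in which each of the 441 users sends the next request as soon as the previous one returns (no think time); the article does not state how the load test was driven, so treat the result as an estimate. The class and method names are hypothetical.

```java
// Hypothetical helper: throughput estimate from Little's law,
// throughput = concurrency / average response time.
final class ThroughputEstimate {
    static double requestsPerMinute(int concurrentUsers, double avgResponseMillis) {
        double perSecond = concurrentUsers / (avgResponseMillis / 1000.0);
        return perSecond * 60.0;
    }
}
```

Under that assumption, requestsPerMinute(441, 136.0) comes to roughly 194,500 requests per minute, which is in line with the 190 k target quoted above.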
5. Other Optimization Ideas (Not Implemented)
Message queue to decouple prize drawing and allow asynchronous processing.
Asynchronous RPC for the long‑running call.
Read‑write separation for databases (discarded due to consistency concerns).
Activity‑level database sharding.
In‑memory databases for ultra‑low latency.
Hardware upgrades beyond SSD.
6. Key Takeaways
High traffic spikes often hide a large proportion of bot traffic; behavior detection is essential to protect real users.
Performance optimization must consider the entire stack—from code and JVM to network and storage—because a single hardware bottleneck can nullify all software improvements.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
