How We Engineered a Million‑User Lottery System to Survive Massive Spikes

This article details the end‑to‑end architecture, rate‑limiting strategies, caching layers, database optimizations, and hardware upgrades that enabled a lottery service to handle daily traffic exceeding one million users during peak promotional events.

Architect

1. Server‑side Rate Limiting

We use a commercial A10 hardware load balancer rather than Nginx, for simplicity, with Tomcat as the web server. We applied two server‑side mitigations:

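CC protection : to mitigate CC (Challenge Collapsar, i.e. HTTP flood) attacks, each IP is limited to 200 requests per minute and excess requests are rejected. This can be configured on the A10 or via Nginx's request‑ and connection‑limit modules (limit_req, limit_conn).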

Tomcat concurrency tuning : our original maxThreads=500 setting caused timeouts under heavy load. Performance testing showed degradation beyond 400 concurrent requests, so we lowered maxThreads to 400 to cap Tomcat at the concurrency it could sustain.
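The per‑IP rule above is enforced on the load balancer, not in application code, but its logic can be illustrated with a minimal fixed‑window counter. This is only a sketch under stated assumptions: the class name `PerIpRateLimiter`, the fixed‑window algorithm, and the caller‑supplied clock are all hypothetical and say nothing about how the A10 or Nginx implement the limit internally.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical fixed-window limiter: at most `limit` requests per IP per window. */
final class PerIpRateLimiter {
    /** Immutable per-IP state: when the current window began and how many hits it has seen. */
    private static final class Window {
        final long startMillis;
        final int count;
        Window(long startMillis, int count) {
            this.startMillis = startMillis;
            this.count = count;
        }
    }

    private final int limit;
    private final long windowMillis;
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    PerIpRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the request is allowed, false if the IP has exceeded its limit. */
    boolean allow(String ip, long nowMillis) {
        // compute() runs atomically per key, so concurrent hits from one IP are counted exactly.
        Window w = windows.compute(ip, (key, old) -> {
            if (old == null || nowMillis - old.startMillis >= windowMillis) {
                return new Window(nowMillis, 1); // first hit of a new window
            }
            // Cap the stored count just above the limit so it cannot overflow.
            return new Window(old.startMillis, Math.min(old.count + 1, limit + 1));
        });
        return w.count <= limit;
    }
}
```

With `new PerIpRateLimiter(200, 60_000L)`, the 201st call for the same IP inside one minute returns false. A production version would also need to evict idle entries, which this sketch omits.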

2. Application‑layer Rate Limiting

At the application level we added three mechanisms:

Semaphore control : a Java Semaphore with 350 permits guards the draw logic. Of Tomcat's 400 threads, that leaves 50 free to reject over‑limit requests, which receive a quick “no prize” response within ~10 ms.

User‑behavior identification : real‑time human‑bot detection based on click patterns, IP, User‑Agent, device ID, etc. Requests lacking normal interaction are flagged and blocked. A risk‑list of known bots or scalpers is also maintained.

Additional rules : activity‑specific limits stored in cache further trim traffic.
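The semaphore control described above might look roughly like the following. It is a minimal sketch: the class name `DrawGate`, the `NO_PRIZE` constant, and the `Supplier`‑based draw callback are illustrative assumptions, not the service's actual code. The key point is the non‑blocking `tryAcquire()`, which lets over‑limit requests fail fast instead of queueing.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical gate that caps how many requests run the draw logic at once. */
final class DrawGate {
    static final String NO_PRIZE = "NO_PRIZE";

    private final Semaphore permits;

    DrawGate(int maxConcurrentDraws) {
        this.permits = new Semaphore(maxConcurrentDraws);
    }

    /** Runs the draw if a permit is free; otherwise answers "no prize" immediately. */
    String handle(Supplier<String> drawLogic) {
        if (!permits.tryAcquire()) {
            return NO_PRIZE; // over the limit: reject without blocking the thread
        }
        try {
            return drawLogic.get();
        } finally {
            permits.release(); // always return the permit, even if the draw throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

In the setup described here the gate would be created once as `new DrawGate(350)` and shared by all request threads.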

The figures show the flow before and after behavior detection: peak traffic dropped from 600k to 300k requests per minute, and prize exhaustion time improved dramatically.

3. Application‑layer Performance Optimization

Performance bottlenecks centered on the database. We applied:

Distributed cache (Ycache) : a Memcached‑based component stores large user‑related data to reduce DB reads.

Local cache : hot, rarely‑updated data (e.g., activity rules) cached in‑process using EhCache or a simple ConcurrentHashMap.

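Optimistic locking : the update statement checks a version column and increments it on success, so of several concurrent requests that read the same row, only one can decrement the prize count; the others update zero rows.

update award set award_num=award_num-1, version=version+1 where id=#{id} and version=#{version} and award_num>0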

Unique index : a unique constraint on (prize_id, user_id) prevents duplicate winning records.
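The optimistic‑locking step can be sketched without a database by simulating the row in memory. Everything here is an assumption for illustration: `AwardTable` stands in for the award row, `decrementIfVersionMatches` mimics the conditional UPDATE (returning the affected row count and bumping the version on success, which the version check needs in order to detect a concurrent change), and the bounded retry loop in `PrizeClaimer` is one possible policy. The article does not say whether a failed update is retried; passing `maxAttempts = 1` treats a lost race as "no prize".

```java
/** Hypothetical in-memory stand-in for one row of the award table. */
final class AwardTable {
    private int awardNum;
    private long version;

    AwardTable(int initialStock) {
        this.awardNum = initialStock;
    }

    /** Snapshot read, like SELECT award_num, version: returns {awardNum, version}. */
    synchronized long[] read() {
        return new long[] { awardNum, version };
    }

    /** Mimics the conditional UPDATE; returns the affected row count (0 or 1). */
    synchronized int decrementIfVersionMatches(long expectedVersion) {
        if (version == expectedVersion && awardNum > 0) {
            awardNum--;
            version++; // bump the version so stale readers no longer match
            return 1;
        }
        return 0;
    }
}

/** Hypothetical caller: read, attempt the conditional update, retry on conflict. */
final class PrizeClaimer {
    static boolean claim(AwardTable table, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long[] snapshot = table.read();
            if (snapshot[0] <= 0) {
                return false; // stock exhausted
            }
            if (table.decrementIfVersionMatches(snapshot[1]) == 1) {
                return true; // this request won the race
            }
            // Another request changed the row first; re-read and try again.
        }
        return false;
    }
}
```

With a real database the two `AwardTable` methods would become a SELECT and the UPDATE statement, and the affected row count would come from the driver.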

4. Database and Hardware

Initial load tests with 50 concurrent users yielded average response times >600 ms and peaks >1 s, exposing a database connection pool of only 30‑50 connections. Raising the pool to 100 reduced connection timeouts but did not solve the latency.

VisualVM snapshots identified heavy time spent in database write methods and an RPC call. Further investigation revealed the test server used an old mechanical HDD, causing high log file sync wait times (>60 ms). Switching to SSD reduced average response time to 136 ms at 441 concurrent users, comfortably supporting the estimated 190 k requests per minute.
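As a rough cross‑check of the final numbers, Little's law relates concurrency, latency, and throughput. The sketch below assumes a closed‑loop test in which each of the 441 concurrent requests is reissued immediately (no think time); the article does not describe the load generator, so the result is an estimate rather than a measured figure. The class name `CapacityEstimate` is hypothetical.

```java
/** Hypothetical back-of-the-envelope check using Little's law. */
final class CapacityEstimate {
    /** Throughput = concurrency / average latency (closed loop, no think time). */
    static double requestsPerSecond(int concurrentRequests, double avgLatencySeconds) {
        return concurrentRequests / avgLatencySeconds;
    }

    public static void main(String[] args) {
        double sustained = requestsPerSecond(441, 0.136); // figures from the SSD test
        double target = 190_000 / 60.0;                   // estimated peak, per second
        System.out.printf("sustained: %.0f req/s (%.0f req/min)%n", sustained, sustained * 60);
        System.out.printf("target:    %.0f req/s%n", target);
    }
}
```

Under that assumption, 441 / 0.136 s comes to roughly 3,240 requests per second, or about 194.5k per minute, which sits above the 190k‑per‑minute estimate.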

5. Other Optimization Ideas (Not Implemented)

Message queue to decouple prize drawing and allow asynchronous processing.

Asynchronous RPC for the long‑running call.

Read‑write separation for databases (discarded due to consistency concerns).

Activity‑level database sharding.

In‑memory databases for ultra‑low latency.

Hardware upgrades beyond SSD.

6. Key Takeaways

High traffic spikes often hide a large proportion of bot traffic; behavior detection is essential to protect real users.

Performance optimization must consider the entire stack, from code and JVM to network and storage, because a single hardware bottleneck can nullify all software improvements.

Figures:
Overall architecture diagram
Application‑layer rate limiting diagram
Traffic before/after behavior detection
Transaction bottleneck illustration
Initial load test results
Database connection timeout
DBA performance metrics
Final SSD test results