How I Boosted a Python Service to 50k QPS: Real‑World Performance Tuning
This article documents a step‑by‑step performance optimization of a Python web module, covering requirement analysis, environment setup, load‑testing results, database and TCP bottleneck identification, caching strategies, kernel tuning, and the final achievement of 50,000 QPS with low latency.
Preface
This article records the performance optimization of a Python program: the problems encountered and how they were solved. It offers one practical optimization mindset; the approach described here is not the only possible solution.
How to Optimize
Optimization must be goal‑driven; vague claims of "million concurrent users" are meaningless without clear objectives. Before optimizing, define target metrics and identify the performance bottleneck rather than making random changes.
Requirement Description
The project was a standalone module split from the main site due to high concurrency. The split required a stress‑test QPS of at least 30,000, database load under 50%, server load under 70%, request latency under 70 ms, and error rate below 5%.
Environment configuration:
Server: 4‑core, 8 GB RAM, CentOS 7, SSD
Database: MySQL 5.7, max connections 800
Cache: Redis, 1 GB
All services were purchased from Tencent Cloud.
Load‑testing tool: Locust with Tencent auto‑scaling for distributed testing.
Requirement details:
When a user visits the homepage, the system queries the database for suitable popup configurations. If none exist, it waits for the next request; otherwise it returns the configuration to the frontend. Various branches handle user clicks, timing, and subsequent returns.
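The branching described above can be sketched as a single handler. The config schema (`segment`, `wait_hours`) and function names below are illustrative assumptions, not the project's actual code:

```python
from datetime import datetime, timedelta

# Hypothetical schema: each config targets a user segment and carries
# a wait interval before the popup may be shown again.
CONFIGS = [
    {"id": 1, "segment": "new_user", "wait_hours": 24},
    {"id": 2, "segment": "returning", "wait_hours": 72},
]

def handle_homepage_visit(user, next_show_at, now):
    """Return a popup config for this user, or None if they must wait."""
    if next_show_at is not None and now < next_show_at:
        return None  # inside the wait window: nothing to show yet
    for cfg in CONFIGS:
        if cfg["segment"] == user["segment"]:
            return cfg  # frontend renders this popup
    return None  # no suitable config: wait for the next request

now = datetime(2024, 1, 1, 12, 0)
user = {"id": 42, "segment": "new_user"}
print(handle_homepage_visit(user, None, now))                      # config 1
print(handle_homepage_visit(user, now + timedelta(hours=1), now))  # None
```

In the real service the `CONFIGS` lookup is a database query (and later a cache read), and `next_show_at` is the "next return time" recorded per user.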
Key Analysis
The three critical points are: 1) locating appropriate popup configurations for users, 2) recording the next return time in the database, and 3) logging user actions on the returned configuration.
Tuning
All three points involve database operations, both reads and writes. Without caching, every request hits the database, exhausting the connection pool and slowing SQL execution. The first step was to separate write operations from the request path and improve database connection handling. The initial architecture is shown below:
Write operations were moved to a FIFO message queue implemented with a Redis list.
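In production this queue is a Redis list (`LPUSH` on the web side, `BRPOP` in a worker). The sketch below mimics that FIFO contract with an in-memory deque so it runs standalone; the payload schema is an assumption for illustration:

```python
import json
from collections import deque

# Stand-in for the Redis list; with redis-py this would be
# r.lpush("write_queue", payload) in the request handler and
# r.brpop("write_queue") in a separate worker process.
write_queue = deque()

def enqueue_write(user_id, next_return_at):
    """Web handler side: queue the DB write instead of executing it inline."""
    write_queue.appendleft(json.dumps(
        {"user_id": user_id, "next_return_at": next_return_at}))  # LPUSH

def drain_one(db):
    """Worker side: pop the oldest item and apply it to the database."""
    if not write_queue:
        return False
    op = json.loads(write_queue.pop())  # BRPOP end: oldest item first (FIFO)
    db[op["user_id"]] = op["next_return_at"]
    return True

db = {}
enqueue_write(42, "2024-01-01T00:00:00Z")
enqueue_write(7, "2024-01-02T00:00:00Z")
while drain_one(db):
    pass
print(list(db))  # [42, 7] -- writes applied in arrival order
```

The point of the design is that the request handler returns as soon as the payload is queued; the worker applies writes to MySQL at a rate the database can sustain.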
Initial load‑test results: QPS peaked around 6,000, the HTTP 502 error rate rose to 30%, CPU fluctuated between 60% and 70%, and database connections were saturated (roughly 6,000 TCP connections), pointing to a database bottleneck caused by repeated reads of user configurations.
After loading all configurations into Redis cache (reading from DB only on cache miss), a second test showed QPS up to ~20,000, CPU 60‑80%, DB connections around 300, and TCP connections about 15,000 per second.
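The read path described above is the classic cache-aside pattern. The sketch below stands in Redis and MySQL with dictionaries so it runs anywhere; with redis-py the cache calls would be `r.get`/`r.set`:

```python
import json

FAKE_DB = {"popup:homepage": {"id": 3, "segment": "all"}}  # stands in for MySQL
cache = {}       # stands in for Redis (r.get / r.set in redis-py)
db_reads = 0

def get_config(key):
    global db_reads
    cached = cache.get(key)
    if cached is not None:           # cache hit: no database round trip
        return json.loads(cached)
    db_reads += 1                    # cache miss: one DB read...
    row = FAKE_DB.get(key)
    cache[key] = json.dumps(row)     # ...then populate the cache
    return row

get_config("popup:homepage")
get_config("popup:homepage")
print(db_reads)  # 1 -- only the first request touched the database
```

This is why DB connections dropped from saturation to around 300: after warm-up, nearly every read is served from Redis.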
Despite ample file‑descriptor limits (`ulimit -n` reported 65,535, later raised to 100,001), QPS plateaued around 22,000. Investigation revealed that closed TCP connections lingered in the TIME_WAIT state after the four‑packet connection teardown, so their ports could not be reused immediately.
A TCP connection stays in TIME_WAIT after termination so that delayed packets from the old connection cannot be misinterpreted as belonging to a new connection on the same port pair.
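One way to confirm this diagnosis is to tally socket states. On Linux, `/proc/net/tcp` lists one connection per line with the state as a hex code in the fourth column (`06` is TIME_WAIT); `ss -s` gives the same summary from the shell. The parser below works on that format, using a synthetic sample so it runs anywhere:

```python
from collections import Counter

# A few kernel TCP state codes (from include/net/tcp_states.h)
STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

def count_tcp_states(proc_net_tcp_text):
    """Tally connection states from /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            code = fields[3]
            counts[STATES.get(code, code)] += 1
    return counts

# Synthetic sample in /proc/net/tcp layout (columns truncated)
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 A29F87C0:D2E4 01 00000000:00000000
   2: 0100007F:1F90 A29F87C0:D2E5 06 00000000:00000000
   3: 0100007F:1F90 A29F87C0:D2E6 06 00000000:00000000
"""
print(count_tcp_states(SAMPLE))  # 2 TIME_WAIT, 1 ESTABLISHED, 1 LISTEN
```

On the server itself, `count_tcp_states(open("/proc/net/tcp").read())` gives live numbers; a large TIME_WAIT count relative to ESTABLISHED matches the plateau observed here.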
Since Linux does not expose a direct kernel parameter to shorten TIME_WAIT, the focus shifted to adjusting related settings:
# maximum number of TIME_WAIT sockets held at once (default 180000)
net.ipv4.tcp_max_tw_buckets = 6000
# widen the ephemeral port range available for new connections
net.ipv4.ip_local_port_range = 1024 65000
# enable fast recycling of TIME_WAIT sockets
net.ipv4.tcp_tw_recycle = 1
# allow reusing TIME_WAIT sockets for new connections
net.ipv4.tcp_tw_reuse = 1

After applying these kernel tweaks, the final load test achieved a QPS of 50,000, CPU at 70%, normal database and TCP connection counts, an average response time of 60 ms, and a 0% error rate.
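To verify that such settings took effect, note that each sysctl key maps to a file under `/proc/sys` with dots replaced by slashes. A small helper, hedged to return a default on non-Linux systems:

```python
import os

def sysctl_path(name):
    """Map a sysctl key like net.ipv4.tcp_tw_reuse to its /proc/sys file."""
    return "/proc/sys/" + name.replace(".", "/")

def read_sysctl(name, default=None):
    """Return the current value as a string, or default if unavailable."""
    path = sysctl_path(name)
    if not os.path.exists(path):
        return default  # e.g. not running on Linux
    with open(path) as fh:
        return fh.read().strip()

# After `sysctl -p`, read_sysctl("net.ipv4.tcp_tw_reuse") should return "1".
print(sysctl_path("net.ipv4.tcp_max_tw_buckets"))
```

The same values can of course be checked with `sysctl net.ipv4.tcp_tw_reuse` directly; the helper is just convenient inside monitoring scripts.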
Conclusion
The entire development, tuning, and testing cycle highlighted that web development is a multidisciplinary engineering practice involving networking, databases, programming languages, and operating systems, requiring a solid foundational knowledge base. Enabling tcp_tw_recycle and tcp_tw_reuse involves trade‑offs: tcp_tw_recycle in particular can break clients behind NAT and was removed entirely in Linux 4.12. The risks were accepted here to achieve the performance goals.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
