How I Cut API Response Time from 500 ms to 100 ms with Linux Tuning
This article recounts a two‑week Linux system tuning project that reduced a high‑traffic API’s P99 response time from over 500 ms to under 100 ms by systematically diagnosing bottlenecks, applying USE‑based analysis, and tuning kernel, network, file‑descriptor, memory, CPU, and I/O parameters.
Introduction
"The interface was too slow, users complained! The product manager’s pressure forced me to act. The system handled tens of millions of daily requests, with P99 response time over 500 ms, hurting user experience. As the operations lead, I launched a comprehensive Linux tuning. After two weeks the P99 dropped to 100 ms (even 80 ms), an >80% improvement. This article fully reproduces the process and shares the methodology."
Technical Background: Multi‑Dimensional Impact of Linux Performance
Essence of Linux Performance Tuning
Linux performance tuning is not just tweaking a few kernel parameters; it is a system‑level engineering effort covering several layers:
CPU scheduling : process priority, CPU affinity, context‑switch frequency
Memory management : page cache, swap usage, memory fragmentation
Disk I/O : filesystem choice, I/O scheduler, read‑ahead strategy
Network stack : TCP parameters, connection queue, network buffers
Kernel parameters : file‑descriptor limits, process limits, semaphore settings
Each layer can become the system's weakest link; by the wooden-barrel principle, overall performance is limited by the shortest stave.
Performance Analysis Methodology
The classic USE method (Utilization, Saturation, Errors) was used to analyse CPU, memory, disk, and network resources; a command-level sketch mapping each dimension to standard tools follows the list below.
Utilization : percentage of resource usage
Saturation : amount of work the resource cannot handle (usually queue length)
Errors : number of error events
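As a rough mapping, assuming the sysstat and iproute2 tooling used throughout this article, each USE dimension can be read from standard commands:
# USE sketch (assumes sysstat and iproute2 are installed)
# Utilization: how busy is each resource?
sar -u 5 3         # CPU %user/%system/%idle
sar -r 5 3         # memory used vs. cached
iostat -x 5 3      # disk %util per device
sar -n DEV 5 3     # NIC throughput vs. link capacity
# Saturation: how much work is queued?
vmstat 5 3         # r column = runnable queue; si/so = swap pressure
iostat -x 5 3      # await and queue size = I/O queueing
ss -lnt            # Recv-Q on LISTEN sockets = accept-queue depth
# Errors: are error counters growing?
netstat -s | grep -iE 'error|drop|overflow|retrans'
dmesg --level=err | tail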
System Architecture
Before optimisation, the stack consisted of:
Application layer : Java micro‑services, Spring Boot, running in Docker containers
Servers : Alibaba Cloud ECS, 8‑core 16 GB, CentOS 7.9
Middleware : Nginx load balancer, Redis cache, MySQL database
Monitoring : Prometheus + Grafana, ELK log collection
Traffic : ~10 million requests per day, peak QPS ≈ 2000
Initial performance indicators:
Average response time: 280 ms
P99 response time: 500‑600 ms
CPU average utilization: 40‑50 %
System load: 5‑8
Network connections: 8000‑12000
Core Content: Systematic Performance Optimisation
Phase 1 – Bottleneck Diagnosis and Localisation
Establish Monitoring Baseline
Before any optimisation, a full monitoring baseline was built using a combination of tools:
# 1. CPU and Load analysis
sar -u 5 60 > cpu_usage.log
sar -q 5 60 > load_avg.log
# Top processes
top -b -n 10 -d 5 > top_output.log
# Context switches
vmstat 5 60 > vmstat.log
# Memory analysis
free -h
sar -r 5 60 > memory.log
cat /proc/meminfo
# Disk I/O analysis
iostat -x 5 60 > iostat.log
iotop -o -b -n 10 > iotop.log
# Network analysis
netstat -s > netstat_stats.log
ss -s > socket_stats.log
sar -n DEV 5 60 > network_dev.log
sar -n TCP 5 60 > network_tcp.log
Key Issues Discovered
After a week of data collection the following critical problems were identified:
Problem 1: Excessive context switches
# vmstat output
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 2048576 86420 5243680 0 0 12 156 18000 42000 25 15 58 2 0
The cs column shows 42 000 context switches per second, far above the normal range of roughly 1 000–5 000, meaning the CPU spends a significant share of its time on process switching.
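To attribute the churn to specific processes, pidstat from the sysstat package breaks context switches down per task; a minimal sketch (the pgrep pattern is illustrative):
# Per-process context switches (cswch/s = voluntary, nvcswch/s = involuntary)
pidstat -w 5 3
# Per-thread view of the Java process
pidstat -wt -p $(pgrep -f java | head -1) 5 3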
Problem 2: TCP listen‑queue overflow
# netstat -s excerpt
TcpExt:
1247856 times the listen queue of a socket overflowed
1247856 SYNs to LISTEN sockets dropped
The listen queue overflowed 1.24 million times, indicating the system could not accept new connections promptly.
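A quick way to see the live accept-queue state is ss: for LISTEN sockets, Recv-Q is the current queue depth and Send-Q is the configured backlog ceiling. A sketch, assuming the backend listens on port 8080 as in the Nginx upstream shown later:
# Accept-queue depth per LISTEN socket
ss -lnt 'sport = :8080'
# Recv-Q approaching Send-Q means the queue is about to overflow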
Problem 3: File‑descriptor limit near ceiling
# lsof -p $(pgrep -f java) | wc -l
58234
# limits
Max open files 65535 65535 files
Usage had reached 89 % of the limit, causing "Too many open files" errors during peaks.
Problem 4: Frequent SWAP activity
Although swap space was small, monitoring showed frequent swap‑in/out, causing performance jitter.
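sar -W quantifies the swap activity directly: pswpin/s and pswpout/s should stay near zero on a healthy host. A minimal sketch:
# Swap pages in/out per second; sustained non-zero values confirm the jitter source
sar -W 5 12
# Which processes have pages in swap (VmSwap is per-process swap usage)
grep VmSwap /proc/[0-9]*/status 2>/dev/null | awk '$2 > 0' | head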
Prioritisation of Bottlenecks
P0 – TCP listen‑queue overflow : directly leads to request failures, highest impact
P0 – File‑descriptor shortage : triggers application errors, must be fixed immediately
P1 – Excessive context switches : consumes CPU resources, degrades overall performance
P2 – SWAP usage : causes jitter, needs optimisation
Phase 2 – Network Stack Optimisation
TCP Connection Queue Tuning
Linux TCP maintains two queues per listening socket: a SYN (half-open) queue and an accept queue. The effective accept-queue depth is the smaller of the application's listen() backlog and net.core.somaxconn, so both must be raised together for high concurrency.
# /etc/sysctl.conf
net.core.somaxconn = 65535 # default 128
net.ipv4.tcp_max_syn_backlog = 16384 # default 1024
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.ipv4.ip_local_port_range = 10000 65535
sysctl -p
Verification:
# Before
ss -s
Total: 12847 (kernel 12965)
TCP: 10234 (estab 8234, closed 1876, orphaned 23, synrecv 0, timewait 1852)
# After
ss -s
Total: 9247 (kernel 9365)
TCP: 7234 (estab 6234, closed 876, orphaned 8, synrecv 0, timewait 452)
TIME_WAIT connections dropped from 1852 to 452, eliminating the queue overflow.
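To confirm the overflow really stopped, the kernel's cumulative counters can be re-checked after the change; flat counters mean the accept queue is keeping up. A minimal sketch:
# Cumulative counters; re-run and diff, or watch for growth
netstat -s | grep -i 'listen'
# Same counters via nstat (iproute2); -az prints even zero-valued ones
nstat -az TcpExtListenOverflows TcpExtListenDrops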
Network Connection Limits
In addition to kernel parameters, Nginx and application settings were tuned.
Nginx configuration:
# /etc/nginx/nginx.conf
worker_processes auto;
worker_rlimit_nofile 65535;
worker_cpu_affinity auto;
events {
use epoll;
worker_connections 20480;
multi_accept on;
}
http {
keepalive_timeout 60;
keepalive_requests 1000;
upstream backend {
server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
keepalive 256;
keepalive_requests 1000;
keepalive_timeout 60s;
}
}
Spring Boot application configuration (application.yml):
# application.yml
server:
tomcat:
threads:
max: 500
min-spare: 50
accept-count: 500
max-connections: 20000
connection-timeout: 20000
Phase 3 – System Resource Limits Optimisation
File‑Descriptor Limits
Default limits are insufficient for high‑concurrency workloads.
# System‑wide limit
fs.file-max = 6553560
# User limits (/etc/security/limits.conf)
* soft nofile 655350
* hard nofile 655350
* soft nproc 102400
* hard nproc 102400
root soft nofile 655350
root hard nofile 655350
# systemd service limit
[Service]
LimitNOFILE=655350
LimitNPROC=102400
Verification after reload shows the limit at 655 350.
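A hedged way to double-check against the live process (the pgrep pattern is illustrative):
# Shell-level soft limit after re-login
ulimit -n
# System-wide ceiling
cat /proc/sys/fs/file-max
# Effective limit of the running Java process, which is what actually matters
grep 'open files' /proc/$(pgrep -f java | head -1)/limits
# Current usage, to track headroom
ls /proc/$(pgrep -f java | head -1)/fd | wc -l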
Memory and SWAP Optimisation
# /etc/sysctl.conf
vm.swappiness = 5
vm.min_free_kbytes = 524288
vm.vfs_cache_pressure = 50
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
sysctl -p
Optionally, swap can be disabled entirely with swapoff -a and by commenting out the swap line in /etc/fstab. In this case we kept a small swap and set swappiness to 5.
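A minimal verification sketch:
cat /proc/sys/vm/swappiness   # expect 5
swapon --show                 # remaining swap devices, if any
sar -W 5 12                   # pswpin/s and pswpout/s should stay near zero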
Phase 4 – CPU and Process Scheduling Optimisation
Reduce Context Switches
Adjust thread pool size and bind processes to CPUs.
# Adjusted Spring Boot thread pool
server:
tomcat:
threads:
max: 200
min-spare: 20
CPU affinity was set via taskset for the application, and via worker_cpu_affinity auto in Nginx.
irqbalance was installed to distribute NIC interrupts across cores; a sketch of both steps follows.
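A minimal sketch of both steps on CentOS 7; the 0-1/2-7 core split and the pgrep pattern are illustrative for this 8-core box:
# Pin the Java process to cores 2-7, leaving 0-1 for interrupts and housekeeping
taskset -cp 2-7 $(pgrep -f java | head -1)
# Install and enable irqbalance to spread NIC interrupts
yum install -y irqbalance
systemctl enable irqbalance
systemctl start irqbalance
# Confirm interrupt distribution across cores
grep -E 'eth|virtio' /proc/interrupts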
Effect Verification
# Before
vmstat 1 5
... cs 42000 us 25 sy 15 ...
# After
vmstat 1 5
... cs 8500 us 20 sy 10 ...
Context switches fell from 42 000/s to 8 500/s, roughly an 80 % reduction.
Phase 5 – Disk I/O Optimisation
I/O Scheduler Selection
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Set noop for SSD
echo noop > /sys/block/sda/queue/scheduler
# Or make permanent via GRUB
GRUB_CMDLINE_LINUX="elevator=noop"
grub2-mkconfig -o /boot/grub2/grub.cfg
Filesystem Mount Options
# /etc/fstab
/dev/sda1 / ext4 defaults,noatime,nodiratime,data=writeback 0 1
mount -o remount /
Log Optimisation (logback.xml snippet)
<!-- logback.xml -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<encoder>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>/var/log/app/app.%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>7</maxHistory>
</rollingPolicy>
</appender>
<appender name="ASYNC-FILE" class="ch.qos.logback.classic.AsyncAppender">
<appender-ref ref="FILE"/>
<queueSize>2048</queueSize>
<discardingThreshold>0</discardingThreshold>
</appender>
<root level="INFO">
<appender-ref ref="ASYNC-FILE"/>
</root>
Phase 6 – End‑to‑End Verification and Stress Testing
After all tweaks, a full stress test was performed.
Stress Test Tool
# Install wrk
git clone https://github.com/wg/wrk.git
cd wrk && make && sudo cp wrk /usr/local/bin/
# Run test
wrk -t 12 -c 1000 -d 300s --latency http://your-api-endpoint
Results before optimisation:
Latency Avg 486.23ms, 99% 1.2s, Errors 1234, Requests/sec 1874
Results after optimisation:
Latency Avg 96.12ms, 99% 234ms, Errors 0, Requests/sec 10152
Summary and Outlook
This Linux tuning effort reduced the API's P99 response time from over 500 ms to about 100 ms, an 80 % reduction; it resolved the immediate bottlenecks and laid a solid foundation for future high-traffic events. The process demonstrates the value of systematic analysis, data-driven optimisation, and thorough verification.