How I Cut API Response Time from 500 ms to 100 ms with Linux Tuning
This article recounts a two‑week Linux system tuning project that reduced a high‑traffic API’s P99 response time from over 500 ms to under 100 ms by systematically diagnosing bottlenecks, applying USE‑based analysis, and tuning kernel, network, file‑descriptor, memory, CPU, and I/O parameters.
Introduction
"The interface was too slow, users complained! The product manager’s pressure forced me to act. The system handled tens of millions of daily requests, with P99 response time over 500 ms, hurting user experience. As the operations lead, I launched a comprehensive Linux tuning. After two weeks the P99 dropped to 100 ms (even 80 ms), an >80% improvement. This article fully reproduces the process and shares the methodology."
Technical Background: Multi‑Dimensional Impact of Linux Performance
Essence of Linux Performance Tuning
Linux performance tuning is not just tweaking a few kernel parameters; it is a system‑level engineering effort covering several layers:
CPU scheduling : process priority, CPU affinity, context‑switch frequency
Memory management : page cache, swap usage, memory fragmentation
Disk I/O : filesystem choice, I/O scheduler, read‑ahead strategy
Network stack : TCP parameters, connection queue, network buffers
Kernel parameters : file‑descriptor limits, process limits, semaphore settings
Each layer can become the system's weakest link; by the wooden-barrel principle, overall performance is limited by the shortest stave.
Performance Analysis Methodology
The classic USE method (Utilization, Saturation, Errors) was used to analyse CPU, memory, disk, and network resources; a command-level sketch mapping each dimension to standard tools follows the list below.
Utilization : percentage of resource usage
Saturation : amount of work the resource cannot handle (usually queue length)
Errors : number of error events
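As a rough mapping, assuming the sysstat and iproute2 tooling used throughout this article, each USE dimension can be read from standard commands:
# USE sketch (assumes sysstat and iproute2 are installed)
# Utilization: how busy is each resource?
sar -u 5 3         # CPU %user/%system/%idle
sar -r 5 3         # memory used vs. cached
iostat -x 5 3      # disk %util per device
sar -n DEV 5 3     # NIC throughput vs. link capacity
# Saturation: how much work is queued?
vmstat 5 3         # r column = runnable queue; si/so = swap pressure
iostat -x 5 3      # await and queue size = I/O queueing
ss -lnt            # Recv-Q on LISTEN sockets = accept-queue depth
# Errors: are error counters growing?
netstat -s | grep -iE 'error|drop|overflow|retrans'
dmesg --level=err | tail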
System Architecture
Before optimisation, the stack consisted of:
Application layer : Java micro‑services, Spring Boot, running in Docker containers
Servers : Alibaba Cloud ECS, 8‑core 16 GB, CentOS 7.9
Middleware : Nginx load balancer, Redis cache, MySQL database
Monitoring : Prometheus + Grafana, ELK log collection
Traffic : ~10 million requests per day, peak QPS ≈ 2000
Initial performance indicators:
Average response time: 280 ms
P99 response time: 500‑600 ms
CPU average utilization: 40‑50 %
System load: 5‑8
Network connections: 8000‑12000
Core Content: Systematic Performance Optimisation
Phase 1 – Bottleneck Diagnosis and Localisation
Establish Monitoring Baseline
Before any optimisation, a full monitoring baseline was built using a combination of tools:
# 1. CPU and Load analysis
sar -u 5 60 > cpu_usage.log
sar -q 5 60 > load_avg.log
# Top processes
top -b -n 10 -d 5 > top_output.log
# Context switches
vmstat 5 60 > vmstat.log
# Memory analysis
free -h
sar -r 5 60 > memory.log
cat /proc/meminfo
# Disk I/O analysis
iostat -x 5 60 > iostat.log
iotop -o -b -n 10 > iotop.log
# Network analysis
netstat -s > netstat_stats.log
ss -s > socket_stats.log
sar -n DEV 5 60 > network_dev.log
sar -n TCP 5 60 > network_tcp.log
Key Issues Discovered
After a week of data collection the following critical problems were identified:
Problem 1: Excessive context switches
# vmstat output
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 2048576 86420 5243680 0 0 12 156 18000 42000 25 15 58 2 0
The cs column shows 42 000 context switches per second, far above the normal range of roughly 1 000–5 000, meaning the CPU spends a significant share of its time on process switching.
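To attribute the churn to specific processes, pidstat from the sysstat package breaks context switches down per task; a minimal sketch (the pgrep pattern is illustrative):
# Per-process context switches (cswch/s = voluntary, nvcswch/s = involuntary)
pidstat -w 5 3
# Per-thread view of the Java process
pidstat -wt -p $(pgrep -f java | head -1) 5 3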
Problem 2: TCP listen‑queue overflow
# netstat -s excerpt
TcpExt:
1247856 times the listen queue of a socket overflowed
1247856 SYNs to LISTEN sockets dropped
The listen queue overflowed 1.24 million times, indicating the system could not accept new connections promptly.
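A quick way to see the live accept-queue state is ss: for LISTEN sockets, Recv-Q is the current queue depth and Send-Q is the configured backlog ceiling. A sketch, assuming the backend listens on port 8080 as in the Nginx upstream shown later:
# Accept-queue depth per LISTEN socket
ss -lnt 'sport = :8080'
# Recv-Q approaching Send-Q means the queue is about to overflow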
Problem 3: File‑descriptor limit near ceiling
# lsof -p $(pgrep -f java) | wc -l
58234
# limits
Max open files 65535 65535 files
Usage had reached 89 % of the limit, causing "Too many open files" errors during peaks.
Problem 4: Frequent SWAP activity
Although swap space was small, monitoring showed frequent swap‑in/out, causing performance jitter.
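sar -W quantifies the swap activity directly: pswpin/s and pswpout/s should stay near zero on a healthy host. A minimal sketch:
# Swap pages in/out per second; sustained non-zero values confirm the jitter source
sar -W 5 12
# Which processes have pages in swap (VmSwap is per-process swap usage)
grep VmSwap /proc/[0-9]*/status 2>/dev/null | awk '$2 > 0' | head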
Prioritisation of Bottlenecks
P0 – TCP listen‑queue overflow : directly leads to request failures, highest impact
P0 – File‑descriptor shortage : triggers application errors, must be fixed immediately
P1 – Excessive context switches : consumes CPU resources, degrades overall performance
P2 – SWAP usage : causes jitter, needs optimisation
Phase 2 – Network Stack Optimisation
TCP Connection Queue Tuning
Linux TCP maintains two queues per listening socket: a SYN (half-open) queue and an accept queue. The effective accept-queue depth is the smaller of the application's listen() backlog and net.core.somaxconn, so both must be raised together for high concurrency.
# /etc/sysctl.conf
net.core.somaxconn = 65535 # default 128
net.ipv4.tcp_max_syn_backlog = 16384 # default 1024
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.ipv4.ip_local_port_range = 10000 65535
sysctl -p
Verification:
# Before
ss -s
Total: 12847 (kernel 12965)
TCP: 10234 (estab 8234, closed 1876, orphaned 23, synrecv 0, timewait 1852)
# After
ss -s
Total: 9247 (kernel 9365)
TCP: 7234 (estab 6234, closed 876, orphaned 8, synrecv 0, timewait 452)
TIME_WAIT connections dropped from 1852 to 452, eliminating the queue overflow.
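To confirm the overflow really stopped, the kernel's cumulative counters can be re-checked after the change; flat counters mean the accept queue is keeping up. A minimal sketch:
# Cumulative counters; re-run and diff, or watch for growth
netstat -s | grep -i 'listen'
# Same counters via nstat (iproute2); -az prints even zero-valued ones
nstat -az TcpExtListenOverflows TcpExtListenDrops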
Network Connection Limits
In addition to kernel parameters, Nginx and application settings were tuned.
Nginx configuration:
# /etc/nginx/nginx.conf
worker_processes auto;
worker_rlimit_nofile 65535;
worker_cpu_affinity auto;
events {
use epoll;
worker_connections 20480;
multi_accept on;
}
http {
keepalive_timeout 60;
keepalive_requests 1000;
upstream backend {
server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
keepalive 256;
keepalive_requests 1000;
keepalive_timeout 60s;
}
}
Spring Boot application configuration (application.yml):
# application.yml
server:
tomcat:
threads:
max: 500
min-spare: 50
accept-count: 500
max-connections: 20000
connection-timeout: 20000
Phase 3 – System Resource Limits Optimisation
File‑Descriptor Limits
Default limits are insufficient for high‑concurrency workloads.
# System‑wide limit
fs.file-max = 6553560
# User limits (/etc/security/limits.conf)
* soft nofile 655350
* hard nofile 655350
* soft nproc 102400
* hard nproc 102400
root soft nofile 655350
root hard nofile 655350
# systemd service limit
[Service]
LimitNOFILE=655350
LimitNPROC=102400
Verification after reload shows the limit at 655 350.
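A hedged way to double-check against the live process (the pgrep pattern is illustrative):
# Shell-level soft limit after re-login
ulimit -n
# System-wide ceiling
cat /proc/sys/fs/file-max
# Effective limit of the running Java process, which is what actually matters
grep 'open files' /proc/$(pgrep -f java | head -1)/limits
# Current usage, to track headroom
ls /proc/$(pgrep -f java | head -1)/fd | wc -l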
Memory and SWAP Optimisation
# /etc/sysctl.conf
vm.swappiness = 5
vm.min_free_kbytes = 524288
vm.vfs_cache_pressure = 50
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
sysctl -p
Optionally, swap can be disabled entirely with swapoff -a and by commenting out the swap line in /etc/fstab. In this case we kept a small swap and set swappiness to 5.
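A minimal verification sketch:
cat /proc/sys/vm/swappiness   # expect 5
swapon --show                 # remaining swap devices, if any
sar -W 5 12                   # pswpin/s and pswpout/s should stay near zero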
Phase 4 – CPU and Process Scheduling Optimisation
Reduce Context Switches
Adjust thread pool size and bind processes to CPUs.
# Adjusted Spring Boot thread pool
server:
tomcat:
threads:
max: 200
min-spare: 20
CPU affinity was set via taskset for the application, and via worker_cpu_affinity auto in Nginx.
irqbalance was installed to distribute NIC interrupts across cores; a sketch of both steps follows.
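A minimal sketch of both steps on CentOS 7; the 0-1/2-7 core split and the pgrep pattern are illustrative for this 8-core box:
# Pin the Java process to cores 2-7, leaving 0-1 for interrupts and housekeeping
taskset -cp 2-7 $(pgrep -f java | head -1)
# Install and enable irqbalance to spread NIC interrupts
yum install -y irqbalance
systemctl enable irqbalance
systemctl start irqbalance
# Confirm interrupt distribution across cores
grep -E 'eth|virtio' /proc/interrupts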
Effect Verification
# Before
vmstat 1 5
... cs 42000 us 25 sy 15 ...
# After
vmstat 1 5
... cs 8500 us 20 sy 10 ...
Context switches fell from 42 000/s to 8 500/s, roughly an 80 % reduction.
Phase 5 – Disk I/O Optimisation
I/O Scheduler Selection
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Set noop for SSD
echo noop > /sys/block/sda/queue/scheduler
# Or make permanent via GRUB
GRUB_CMDLINE_LINUX="elevator=noop"
grub2-mkconfig -o /boot/grub2/grub.cfg
Filesystem Mount Options
# /etc/fstab
/dev/sda1 / ext4 defaults,noatime,nodiratime,data=writeback 0 1
mount -o remount /
Log Optimisation (logback.xml snippet)
<!-- logback.xml -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<encoder>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>/var/log/app/app.%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>7</maxHistory>
</rollingPolicy>
</appender>
<appender name="ASYNC-FILE" class="ch.qos.logback.classic.AsyncAppender">
<appender-ref ref="FILE"/>
<queueSize>2048</queueSize>
<discardingThreshold>0</discardingThreshold>
</appender>
<root level="INFO">
<appender-ref ref="ASYNC-FILE"/>
</root>
Phase 6 – End‑to‑End Verification and Stress Testing
After all tweaks, a full stress test was performed.
Stress Test Tool
# Install wrk
git clone https://github.com/wg/wrk.git
cd wrk && make && sudo cp wrk /usr/local/bin/
# Run test
wrk -t 12 -c 1000 -d 300s --latency http://your-api-endpoint
Results before optimisation:
Latency Avg 486.23ms, 99% 1.2s, Errors 1234, Requests/sec 1874
Results after optimisation:
Latency Avg 96.12ms, 99% 234ms, Errors 0, Requests/sec 10152
Summary and Outlook
This Linux tuning effort reduced the API's P99 response time from over 500 ms to about 100 ms, an 80 % reduction; it resolved the immediate bottlenecks and laid a solid foundation for future high-traffic events. The process demonstrates the value of systematic analysis, data-driven optimisation, and thorough verification.