Git Server CPU Spike After Migration: Insights into SSHD, XFS Locks, and PAM
After a Git server was moved to a new data center, CPU sys time surged: large numbers of sshd processes were contending on XFS read‑write locks while repeatedly reading a very large /var/log/btmp file, a read triggered by PAM postlogin. This analysis shows how perf and strace located the problem and how log rotation resolves it.
Background
After moving the Git server from the Huahai data center to Nansha on May 6, the load became high and user access slowed.
Initial Investigation of High Load
Logging in to the server shows a high load with normal I/O and memory, but CPU idle is at 0% and sys time reaches 80%. The CPU is consumed mainly by sshd processes: at peak there are more than 300 of them, and a single IP can open over 100 connections almost instantly.
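The per-IP connection count mentioned above can be tallied with a few lines of Python. This is a minimal sketch, assuming the input looks like the output of `ss -tn` (state, two queue columns, local address, peer address); the sample lines and addresses are made up for illustration, and the function name is hypothetical.

```python
from collections import Counter

def connections_per_peer(ss_lines, local_port=22):
    """Count established connections to local_port, grouped by peer IP."""
    counts = Counter()
    for line in ss_lines:
        fields = line.split()
        # Skip the header and anything that is not an established connection.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local, peer = fields[3], fields[4]
        # rsplit keeps IPv6 addresses such as [::1]:22 intact.
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        counts[peer.rsplit(":", 1)[0]] += 1
    return counts

# Hypothetical sample data, not taken from the affected server.
sample_lines = [
    "State Recv-Q Send-Q Local Address:Port Peer Address:Port",
    "ESTAB 0 0 10.0.0.5:22 10.0.0.9:51514",
    "ESTAB 0 0 10.0.0.5:22 10.0.0.9:51516",
    "ESTAB 0 0 10.0.0.5:22 10.0.0.7:40022",
    "ESTAB 0 0 10.0.0.5:443 10.0.0.9:51999",
]

for ip, n in connections_per_peer(sample_lines).most_common():
    print(ip, n)
```

On a live host the lines would come from the `ss` command instead of the hard-coded list; sorting by count makes a client that opens a burst of connections stand out immediately.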
Observations from Zabbix
Two‑week Zabbix data shows load and CPU usage gradually increasing. User time (usr) stays stable while system time (sys) rises.
Hypothesis: Git traffic increase
Network traffic does not increase proportionally, suggesting Git request volume is not the cause.
sshd CPU Usage
The first suspicion was that the large number of sshd processes caused excessive context switching, but neither Zabbix nor perf top confirmed this.
Perf Analysis
Running perf top -e cycles shows the functions with the highest CPU share are up_read and down_read.
These functions are the kernel's read‑write semaphore primitives, which XFS uses for its inode locks. down_read and up_read are called by xfs_ilock and xfs_iunlock, which in turn are invoked by xfs_file_aio_read inside the sshd process.
Tracing the Call Stack
Samples were collected with perf record -a -e cpu-clock -g and examined with perf report. The resulting flame graph shows the call chain down to xfs_file_aio_read.
Identifying the File Being Read
Further tracing with perf top -e xfs:xfs_ilock points to device 8:2, which is the /dev/sda2 partition mounted as /. This is not the Git repository directory (/home), so the heavy reads are not from Git data.
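Mapping a major:minor pair such as 8:2 back to a mount point can be done by reading /proc/self/mountinfo, whose third field is the device number and fifth field is the mount point. The sketch below parses text in that format; the sample content is illustrative, and the /dev/sda5 entry for /home is an assumption made only so the example has a second line.

```python
def find_mount(mountinfo_text, dev):
    """Return mount point, filesystem type and source for a major:minor device."""
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount; fields after it describe the filesystem.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) >= 5 and len(right_fields) >= 2 and left_fields[2] == dev:
            return {
                "mount_point": left_fields[4],
                "fstype": right_fields[0],
                "source": right_fields[1],
            }
    return None

# Illustrative mountinfo content; only the 8:2 -> / mapping comes from the article.
SAMPLE_MOUNTINFO = (
    "22 1 8:2 / / rw,relatime shared:1 - xfs /dev/sda2 rw,attr2,inode64,noquota\n"
    "45 22 8:5 / /home rw,relatime shared:28 - xfs /dev/sda5 rw,attr2,inode64,noquota\n"
)

print(find_mount(SAMPLE_MOUNTINFO, "8:2"))
```

On a real system the text would be read from /proc/self/mountinfo. The function returns the first match, so a device that is also bind-mounted elsewhere would need all matches collected instead.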
Strace Investigation
Tracing sshd with strace shows that when PAM postlogin is enabled in /etc/pam.d/sshd, sshd opens /var/log/btmp on every login. With postlogin enabled, a single SSH login generates about 708,342 system calls; with it disabled, the count drops to 6,445.
The most frequent system call is read on file descriptor 5, which corresponds to /var/log/btmp. The file has grown to 258 MiB.
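A strace log of this size is easier to read once it is summarized. The following sketch counts system calls by name, counts read calls per file descriptor, and remembers which path each descriptor was opened on. It assumes plain strace text output (optionally with a PID prefix); the sample lines, including the 384-byte read size, are invented for illustration.

```python
import re
from collections import Counter

# Optional "[pid N] " or "N " prefix, then the syscall name and its arguments.
SYSCALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\((.*)$')
# A successful open/openat: a quoted path, then ") = <fd>" at end of line.
OPEN_RE = re.compile(r'"([^"]*)".*\)\s*=\s*(\d+)\s*$')

def summarize(lines):
    """Return (calls per syscall, read calls per fd, fd -> last opened path)."""
    calls = Counter()
    reads_by_fd = Counter()
    fd_paths = {}
    for line in lines:
        m = SYSCALL_RE.match(line.strip())
        if not m:
            continue  # signal lines, exit notices, and so on
        name, rest = m.groups()
        calls[name] += 1
        if name in ("open", "openat"):
            opened = OPEN_RE.search(rest)
            if opened:
                # Descriptors are reused after close; this keeps only the latest path.
                fd_paths[int(opened.group(2))] = opened.group(1)
        elif name == "read":
            fd = re.match(r'(\d+)', rest)
            if fd:
                reads_by_fd[int(fd.group(1))] += 1
    return calls, reads_by_fd, fd_paths

# Hypothetical sample lines in strace's output style.
sample = [
    r'open("/var/log/btmp", O_RDONLY) = 5',
    r'read(5, "\6\0\0\0"..., 384) = 384',
    r'read(5, "\6\0\0\0"..., 384) = 384',
    r'read(5, "", 384) = 0',
    r'close(5) = 0',
    r'write(3, "ok", 2) = 2',
]

calls, reads_by_fd, fd_paths = summarize(sample)
busiest_fd, n = reads_by_fd.most_common(1)[0]
print(calls.most_common(1), busiest_fd, n, fd_paths.get(busiest_fd))
```

The sketch treats all traced processes as sharing one descriptor table, which is a simplification when strace follows forks; for a single login session it is usually enough to spot the dominant file.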
Root Cause
When PAM postlogin is enabled, each SSH login reads the large /var/log/btmp file. XFS takes and releases the inode read‑write lock (xfs_rw_ilock / xfs_rw_iunlock) around every read call, so many concurrent logins reading the same file produce massive lock contention and high sys time.
Why Load Increased Over Time
As the server runs longer, /var/log/btmp grows, leading to more read calls and more lock contention, which explains the gradual increase in CPU sys time.
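The link between file size and read count can be checked with rough arithmetic. The sketch below assumes that btmp holds fixed-size utmp records of 384 bytes (the size of struct utmp on x86-64 glibc) and that the file is read one record per read call; both are assumptions, not statements from the strace log. Under them, a 258 MiB file implies about 704,512 reads per login, close to the roughly 701,897 extra system calls observed with postlogin enabled (708,342 minus 6,445).

```python
# Assumption: sizeof(struct utmp) on x86-64 glibc; other platforms may differ.
UTMP_RECORD_BYTES = 384

def estimated_reads(file_bytes, record_bytes=UTMP_RECORD_BYTES):
    """Estimate read() calls needed to scan a file one record at a time."""
    return file_bytes // record_bytes

btmp_bytes = 258 * 1024 * 1024              # 258 MiB, as reported in the article
reads_per_login = estimated_reads(btmp_bytes)
observed_extra_calls = 708_342 - 6_445      # strace counts with and without postlogin

print(reads_per_login)        # 704512
print(observed_extra_calls)   # 701897
```

Because the estimate scales linearly with file size, every failed login appended to btmp adds one more locked read to every later login, which matches the gradual rise in sys time.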
Solutions
Disable the PAM postlogin entries (auth include postlogin and session include postlogin) in /etc/pam.d/sshd.
Rotate or truncate /var/log/btmp to keep its size small.
Consider using ext3/ext4 instead of XFS to reduce lock overhead.
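The first two remedies can be sketched in Python. This is an illustration only, with hypothetical function names: one helper comments out the two postlogin include lines in PAM configuration text, and the other empties a file once it exceeds a size limit. Truncating btmp discards the failed-login history, so a regular log rotation policy is normally the better long-term choice, and any real change to /etc/pam.d/sshd should be tested with a second session kept open.

```python
import os

def disable_postlogin(pam_text):
    """Comment out '<type> include postlogin' lines; leave everything else as is."""
    out = []
    for line in pam_text.splitlines():
        fields = line.split()
        is_active = not line.lstrip().startswith("#")
        if (is_active and len(fields) >= 3
                and fields[1] == "include" and fields[2] == "postlogin"):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"

def truncate_if_larger(path, max_bytes):
    """Empty the file at path if it exceeds max_bytes; return True if truncated."""
    if os.path.getsize(path) <= max_bytes:
        return False
    with open(path, "r+b") as f:
        f.truncate(0)  # keeps the same inode, owner and permissions
    return True

# Example PAM text; only the two postlogin lines are taken from the article.
example = (
    "auth       include      postlogin\n"
    "account    required     pam_nologin.so\n"
    "session    include      postlogin\n"
)
print(disable_postlogin(example), end="")
```

Truncating in place rather than deleting the file means processes that already hold it open keep writing to the same inode.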
