Detailed Analysis of a GitLab Runner Performance Bottleneck
This article documents a multi‑stage investigation of intermittent GitLab Runner build timeouts and hangs. It covers the background, the VM configuration, the successive diagnostic steps using strace, iotop, and perf, and the storage‑driver change that resolved the problem, then closes with performance test results and lessons learned.
Background
The issue was reported by users experiencing occasional GitLab Runner build timeouts and hangs. The problem appeared across multiple repositories and CI pipelines, prompting a deep investigation.
Affected Machine Configuration
The affected machine is a KVM virtual machine with the following specifications:
OS version: Debian 9.7
Kernel version: 4.9.0-8-amd64
CPU cores: 16
Memory: 16 GB
Disk: 300 GB SAS
Analysis Details
First Investigation
Initial suspicion fell on docker.sock not responding. strace was used to trace system calls on the socket, confirming that the socket could become blocked, though no conclusive root cause was identified at the time.
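As a complement to strace, a blocked socket can also be detected from the outside with a timed request. The following is a minimal sketch, not the method used in the investigation: it assumes the default socket path /var/run/docker.sock and uses the Docker Engine API's /_ping endpoint; the function name and timeout value are illustrative.

```python
import socket


def docker_socket_responds(path="/var/run/docker.sock", timeout=3.0):
    """Return True if the daemon behind `path` answers GET /_ping in time."""
    request = b"GET /_ping HTTP/1.1\r\nHost: docker\r\nConnection: close\r\n\r\n"
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)  # applies to connect, send and recv
            sock.connect(path)
            sock.sendall(request)
            reply = sock.recv(4096)
    except OSError:
        # Covers a missing socket, a refused connection and a timeout
        return False
    status_line = reply.split(b"\r\n", 1)[0]
    return status_line.startswith(b"HTTP/1.") and b" 200 " in status_line
```

Run periodically, a probe like this would record when the socket stops answering, which makes it easier to line up hangs with other metrics.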
Second Investigation
Further reports revealed high CPU steal time (the st column in top) and elevated I/O. The VM was migrated to a dedicated host to eliminate CPU contention, but the I/O issue persisted, indicating a deeper problem.
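Steal time can be quantified from two snapshots of /proc/stat. The sketch below is an illustration rather than the tooling used here; the function names are hypothetical, and the column order follows the documented layout of the aggregate cpu line.

```python
def cpu_times(proc_stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of jiffies."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Share of CPU time stolen by the hypervisor between two samples."""
    b, a = cpu_times(before), cpu_times(after)
    # Columns: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already included in user and nice, so only
    # the first eight columns are summed for the total.
    total = sum(a[:8]) - sum(b[:8])
    steal = a[7] - b[7]
    return 100.0 * steal / total if total > 0 else 0.0
```

Reading /proc/stat twice a few seconds apart and passing both texts to steal_percent gives the figure that top reports as st.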
Third Investigation
Analysis with iotop identified a loop kernel thread (a child of kthreadd, PID 2) as the primary source of I/O activity.
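Whether a PID reported by iotop belongs to a kernel thread can be checked from /proc/<pid>/stat, since kernel threads on Linux are normally created by kthreadd and therefore have parent PID 2. A minimal sketch, with hypothetical helper names:

```python
def parse_stat(stat_text):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    rest = stat_text[right + 1:].split()  # rest[0] = state, rest[1] = ppid
    return pid, comm, int(rest[1])


def is_kernel_thread(stat_text, kthreadd_pid=2):
    """True for kthreadd itself or for any task whose parent is kthreadd."""
    pid, _comm, ppid = parse_stat(stat_text)
    return pid == kthreadd_pid or ppid == kthreadd_pid
```

Applied to the thread found above, this would confirm that the I/O originates in the kernel's loop device handling rather than in a user-space process.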
Fourth Investigation
The perf tool captured a flame graph, highlighting loop_queue_work as the hotspot function. Examination of the function suggested involvement of disk scheduling, but the VM used Ceph RBD, which does not allow changing the scheduler.
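The article does not give the exact perf invocation. Flame graphs are commonly produced by recording with perf record -a -g, then piping perf script through stackcollapse-perf.pl and flamegraph.pl from the FlameGraph repository. The intermediate "folded" format (one stack per line, frames joined by semicolons, followed by a sample count) is easy to rank programmatically. The sketch below assumes that format; the function names are illustrative.

```python
from collections import Counter


def frame_totals(folded_text):
    """Sum sample counts per function over folded stacks ('a;b;c 42')."""
    totals = Counter()
    for line in folded_text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        # set() so that a recursive frame is counted once per stack
        for frame in set(stack.split(";")):
            totals[frame] += int(count)
    return totals


def hottest(folded_text, n=5):
    """Return the n functions present in the most samples."""
    return frame_totals(folded_text).most_common(n)
```

The totals are inclusive, so they correspond to the width of a frame in the flame graph; a function such as loop_queue_work would rank high if most sampled stacks pass through it.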
Fifth Investigation
A new error screenshot pointed to the devmapper storage driver consuming excessive space. Research showed that devmapper is deprecated and has performance drawbacks. In its default loop-lvm mode the driver stores its data in sparse files attached through loopback devices, which would be consistent with the I/O seen on the loop kernel thread. The underlying issue was traced to the XFS filesystem lacking the ftype=1 option, which forced the use of devmapper. The error read:
could not use snapshotter overlayfs in metadata plugin "error="/home/gitlab/var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.overlayfs does not support d_type. If the backing filesystem is xfs, please reformat with ftype=1 to enable d_type support
Reformatting the disk with mkfs.xfs -n ftype=1 /dev/vdc1 enabled the ftype option, allowing Docker to switch to the overlay2 driver. Subsequent tests showed that the loop thread no longer generated I/O, confirming the root cause.
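Because mkfs.xfs erases the existing filesystem, it is worth checking the current setting first. The xfs_info command prints an ftype field on its naming line; the sketch below parses that output. The function names are hypothetical, and the sample text in the usage note imitates typical xfs_info output rather than coming from the affected machine.

```python
import re


def xfs_ftype(xfs_info_text):
    """Return 1 or 0 for the ftype flag in xfs_info output, None if absent."""
    match = re.search(r"\bftype=(\d)", xfs_info_text)
    return int(match.group(1)) if match else None


def overlay_capable(xfs_info_text):
    """overlay2 needs d_type support, which XFS provides only with ftype=1."""
    return xfs_ftype(xfs_info_text) == 1
```

For a line such as "naming   =version 2   bsize=4096   ascii-ci=0 ftype=0", overlay_capable returns False, which matches the situation described above.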
Self‑Test Results
Using the same repository on a GitLab instance, performance was compared between devmapper and overlay2 storage drivers. The overlay2 driver exhibited significantly lower I/O metrics.
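One way to put numbers on such a comparison is to snapshot /proc/diskstats before and after a build and compute the difference per device. This is a sketch of that idea, not the measurement method used in the test; the function names are illustrative, and the device name passed in (for example vdc, matching /dev/vdc1 above) is an assumption.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def disk_counters(diskstats_text, device):
    """Return (sectors_read, sectors_written) for `device`."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads merged sectors_read ms_reading
        #         writes merged sectors_written ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[5]), int(fields[9])
    raise KeyError(device)


def io_bytes(before, after, device):
    """Bytes read and written on `device` between two snapshots."""
    r0, w0 = disk_counters(before, device)
    r1, w1 = disk_counters(after, device)
    return (r1 - r0) * SECTOR_BYTES, (w1 - w0) * SECTOR_BYTES
```

Running the same pipeline once per storage driver and comparing the two results gives a like-for-like figure for the I/O each driver generates.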
Summary and Reflections
The diagnosis spanned several weeks, progressing from symptom observation to system‑level tracing, kernel‑level profiling, and finally filesystem configuration. The key takeaway is the importance of preserving diagnostic data and systematically narrowing down from high‑level symptoms to low‑level kernel behavior.
References
Dynamic tracing tools: https://zhuanlan.zhihu.com/p/24124082
Flame graph usage: https://github.com/brendangregg/FlameGraph
Perf tool guide: http://www.brendangregg.com/perf.html
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.