
Detailed Analysis of a GitLab Runner Performance Bottleneck

This article documents a multi‑stage investigation of intermittent GitLab Runner build timeouts and hangs, describing the background, VM configuration, successive diagnostic steps using strace, iotop, perf, and storage‑driver adjustments, and concludes with performance test results and lessons learned.

NetEase Game Operations Platform

Background

Users reported occasional GitLab Runner build timeouts and hangs. The problem appeared across multiple repositories and CI pipelines, which prompted a deeper investigation.

Configuration of the Affected Machine

The affected machine is a KVM virtual machine with the following specifications:

OS version: Debian 9.7
Kernel version: 4.9.0-8-amd64
CPU cores: 16
Memory: 16 GB
Disk: 300 GB SAS

Analysis Details

First Investigation

Initial suspicion fell on docker.sock not responding. strace was used to trace system calls on the socket, confirming that the socket could become blocked, though no conclusive root cause was identified at the time.
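A blocked socket can also be detected from the client side. The following is a minimal sketch, not the method used in the original investigation (which relied on strace): it sends a request to the Docker Engine API's /_ping endpoint over the Unix socket and treats a timeout as "not responding". The socket path /var/run/docker.sock is the common default and is an assumption here, as is the 5-second timeout.

```python
import socket


def docker_socket_responds(path="/var/run/docker.sock", timeout=5.0):
    """Return True if the Docker daemon answers /_ping within `timeout` seconds.

    The path and timeout are illustrative defaults, not values from the article.
    """
    request = b"GET /_ping HTTP/1.0\r\nHost: docker\r\n\r\n"
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(path)
            sock.sendall(request)
            reply = sock.recv(1024)
    except OSError:
        # Covers a missing socket, a refused connection, and a timeout.
        return False
    status_line = reply.split(b"\r\n", 1)[0]
    return status_line.startswith(b"HTTP/") and b" 200 " in status_line
```

Running such a probe periodically and logging the failures gives a timeline that can later be lined up with CPU and I/O measurements.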

Second Investigation

Further reports revealed high CPU steal time (the st column in top) and elevated I/O. The VM was migrated to a dedicated host to eliminate CPU contention, but the I/O issue persisted, indicating a deeper problem.
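Steal time can be quantified without top by sampling the aggregate cpu line of /proc/stat twice. The sketch below is an illustration rather than the tooling used in the investigation; the function names are hypothetical.

```python
def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    def fields(line):
        parts = line.split()
        if parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # Columns: user nice system idle iowait irq softirq steal.
        # The guest columns are skipped because they are already part of user/nice.
        return [int(value) for value in parts[1:9]]

    deltas = [b - a for a, b in zip(fields(before), fields(after))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0
```

Calling read_cpu_line() twice a few seconds apart and passing both results to steal_percent() gives the share of time the hypervisor withheld from the guest over that interval.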

Third Investigation

Analysis with iotop identified a loop kernel thread (a child of kthreadd, PID 2) as the primary source of I/O activity.
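Whether a busy process reported by iotop is a kernel thread can be checked from /proc/<pid>/stat, since kernel threads are spawned by kthreadd (PID 2). The following sketch parses that file's contents; the helper names are hypothetical and it only looks at the direct parent.

```python
KTHREADD_PID = 2  # kthreadd, the parent of kernel threads


def parse_stat(stat_text):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    left, _, rest = stat_text.rpartition(")")
    pid_text, _, comm = left.partition("(")
    fields = rest.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def is_kernel_thread(stat_text):
    """True if the process is kthreadd itself or one of its direct children."""
    pid, _, _, ppid = parse_stat(stat_text)
    return pid == KTHREADD_PID or ppid == KTHREADD_PID
```

For a PID taken from iotop, reading /proc/<pid>/stat and passing the text to is_kernel_thread() separates kernel-side I/O from I/O issued by user processes such as dockerd.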

Fourth Investigation

The perf tool captured a flame graph, highlighting loop_queue_work as the hotspot function. Examination of the function suggested involvement of disk scheduling, but the VM used Ceph RBD, which does not allow changing the scheduler.
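A flame graph is rendered from "folded" stacks, the one-line-per-stack format produced by stackcollapse-perf.pl in the FlameGraph repository (frames joined by semicolons, followed by a sample count). As a rough sketch of how a hotspot such as loop_queue_work can be ranked numerically instead of read off the picture, the function below sums inclusive sample counts per frame. It is an illustration, not the procedure the authors describe.

```python
from collections import Counter


def inclusive_samples(folded_lines):
    """Sum sample counts for every frame that appears anywhere in a stack.

    Each input line looks like 'frameA;frameB;frameC 42'.
    """
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        # set() ensures a recursive frame is counted once per stack.
        for frame in set(stack.split(";")):
            totals[frame] += int(count)
    return totals
```

Calling inclusive_samples(lines).most_common(10) lists the frames that are on-stack most often. Generic frames near the stack root will also rank high, so the list still needs to be read with the call hierarchy in mind.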

Fifth Investigation

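A new error screenshot pointed to the devmapper storage driver consuming excessive space. Research showed that devmapper is deprecated and has performance drawbacks. The underlying issue was traced to the XFS filesystem having been created without the ftype=1 option: without d_type support Docker could not use overlay2 and fell back to devmapper. Because devmapper in its default loop-lvm mode stores data in files attached through loopback devices, this would be consistent with the I/O seen on the loop thread. The relevant log line was: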

"could not use snapshotter overlayfs in metadata plugin" error="/home/gitlab/var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.overlayfs does not support d_type. If the backing filesystem is xfs, please reformat with ftype=1 to enable d_type support"

Reformatting the disk with mkfs.xfs -n ftype=1 /dev/vdc1 enabled the ftype option, allowing Docker to switch to the overlay2 driver. Subsequent tests showed that the loop thread no longer generated I/O, confirming the root cause.
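Before reformatting (which erases the data on the partition), the current setting can be confirmed from the output of xfs_info, whose "naming" line reports ftype=0 or ftype=1. The sketch below only parses text that has already been captured; the function name is hypothetical, and how the output is collected (for example with subprocess) is left to the reader.

```python
import re


def xfs_ftype_enabled(xfs_info_output):
    """Return True/False for ftype=1/ftype=0, or None if ftype is not reported."""
    match = re.search(r"\bftype=(\d+)", xfs_info_output)
    if match is None:
        return None
    return match.group(1) == "1"
```

A result of False on the filesystem backing Docker's data directory indicates the same condition reported in the error above, and None suggests the output did not come from xfs_info or predates the ftype field.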

Self‑Test Results

Using the same repository on a GitLab instance, performance was compared between devmapper and overlay2 storage drivers. The overlay2 driver exhibited significantly lower I/O metrics.
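The article does not state how the I/O metrics were collected. One possible way to make such a comparison reproducible, shown here purely as an assumption, is to snapshot /proc/diskstats before and after a build and compute the bytes read and written on the Docker data disk. The function names are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def device_bytes(diskstats_text, device):
    """Return (bytes_read, bytes_written) for `device` from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        parts = line.split()
        # Columns: major minor name reads merged sectors_read ms_read
        #          writes merged sectors_written ...
        if len(parts) >= 10 and parts[2] == device:
            return int(parts[5]) * SECTOR_BYTES, int(parts[9]) * SECTOR_BYTES
    raise KeyError(device)


def io_delta(before, after, device):
    """Bytes read and written on `device` between two snapshots."""
    read_before, written_before = device_bytes(before, device)
    read_after, written_after = device_bytes(after, device)
    return read_after - read_before, written_after - written_before
```

Running the same pipeline once per storage driver and comparing the two deltas gives a like-for-like figure, provided nothing else is writing to the disk during the test.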

Summary and Reflections

The diagnosis spanned several weeks, progressing from symptom observation to system‑level tracing, kernel‑level profiling, and finally filesystem configuration. The key takeaway is the importance of preserving diagnostic data and systematically narrowing down from high‑level symptoms to low‑level kernel behavior.

References

Dynamic tracing tools: https://zhuanlan.zhihu.com/p/24124082

Flame graph usage: https://github.com/brendangregg/FlameGraph

Perf tool guide: http://www.brendangregg.com/perf.html

Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
