
Why Docker Hosts Crashed: Tracing Kernel Null‑Pointer Bugs and the Fix

The article recounts a half‑year investigation of a high‑performance proxy cluster whose Docker hosts repeatedly crashed due to kernel null‑pointer dereferences, detailing log analysis, three faulty hypotheses, extensive web research, kernel and Docker upgrades, and the final operational lessons learned.

dbaplus Community

Background

A high‑performance proxy cluster runs Docker containers on hosts equipped with 10 GbE NICs. Each host uses a Linux bridge so that containers receive their own IP addresses directly, and all configuration data is stored in a scheduler. The environment is uniform: identical hardware, the same kernel (Linux 3.16.0‑4‑amd64), and the same Docker versions (1.12.1/1.12.2).

Symptom

Within weeks of the service going live, hosts began to die one after another, and each crash stopped every service running on that host. A crashed host was completely unresponsive: remote login failed, and the only evidence was the syslog that had already been shipped to ELK.

Log Sample

Nov 12 15:06:31 hello-world kernel: [6373724.634681] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
Nov 12 15:06:31 hello-world kernel: [6373724.634718] IP: [] pick_next_task_fair+0x6b8/0x820
Nov 12 15:06:31 hello-world kernel: [6373724.634749] PGD 10561e4067 PUD ffdb46067 PMD 0
Nov 12 15:06:31 hello-world kernel: [6373724.634780] Oops: 0000 [#1] SMP
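
When many hosts ship syslog to a central store, it helps to reduce each oops to the faulting address and the function the CPU was executing. The following is a minimal sketch of that extraction, not the team's actual tooling; the function name `parse_oops` and the report keys are hypothetical, and the regular expressions assume the line format shown in the sample above.

```python
import re

# Patterns assume the 3.16-era oops format shown in the log sample.
BUG_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-fA-F]+)"
)
# "symbol+0xoffset/0xsize": offset into the function and the function's size.
IP_RE = re.compile(r"IP: \[.*?\] (\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)")


def parse_oops(lines):
    """Return the fault address and faulting symbol found in oops lines, or None."""
    report = {}
    for line in lines:
        match = BUG_RE.search(line)
        if match:
            report["fault_address"] = int(match.group(1), 16)
            continue
        match = IP_RE.search(line)
        if match:
            report["symbol"] = match.group(1)
            report["offset"] = int(match.group(2), 16)
            report["symbol_size"] = int(match.group(3), 16)
    return report or None


sample = [
    "Nov 12 15:06:31 hello-world kernel: [6373724.634681] BUG: unable to handle "
    "kernel NULL pointer dereference at 0000000000000078",
    "Nov 12 15:06:31 hello-world kernel: [6373724.634718] IP: [] "
    "pick_next_task_fair+0x6b8/0x820",
]
report = parse_oops(sample)
print(report)
```

A small fault address such as 0x78 usually means the kernel read a structure member through a NULL base pointer, and the symbol `pick_next_task_fair` places the fault in the scheduler rather than in networking code.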

Initial hypotheses

1. Docker version incompatibility with the host kernel.
2. Linux bridge networking bug triggered by the bridge configuration.
3. Pipework script used for IP assignment contains a bug.

Investigation of hypothesis 1

All hosts showed recurring warnings such as:

time="2016-09-07T20:22:19.450573015+08:00" level=warning msg="Your kernel does not support cgroup memory limit"
time="2016-09-07T20:22:19.450618295+08:00" level=warning msg="Your kernel does not support cgroup cfs period"
time="2016-09-07T20:22:19.450640785+08:00" level=warning msg="Your kernel does not support cgroup cfs quotas"
time="2016-09-07T20:22:19.450769672+08:00" level=warning msg="mountpoint for pids not found"

These messages match known Docker‑1.9‑era bugs that were fixed in Docker 1.12.3 (see https://github.com/docker/docker/issues/24211). Upgrading Docker to 1.12.2 did not stop the crashes, indicating the problem lay elsewhere.
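
To confirm that every host emits the same set of daemon warnings, the log lines can be tallied by message. This is a rough sketch that assumes the `level=... msg="..."` layout shown above; `count_warnings` is a hypothetical helper, not part of Docker.

```python
import re
from collections import Counter

# Matches the level and the quoted message of one daemon log entry.
ENTRY_RE = re.compile(r'level=(\w+) msg="((?:[^"\\]|\\.)*)"')


def count_warnings(lines):
    """Count how often each distinct warning message appears."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.search(line)
        if match and match.group(1) == "warning":
            counts[match.group(2)] += 1
    return counts


daemon_log = [
    'time="2016-09-07T20:22:19.450573015+08:00" level=warning '
    'msg="Your kernel does not support cgroup memory limit"',
    'time="2016-09-07T20:22:19.450618295+08:00" level=warning '
    'msg="Your kernel does not support cgroup cfs period"',
    'time="2016-09-07T20:22:19.450640785+08:00" level=warning '
    'msg="Your kernel does not support cgroup cfs quotas"',
    'time="2016-09-07T20:22:19.450769672+08:00" level=warning '
    'msg="mountpoint for pids not found"',
]
warnings = count_warnings(daemon_log)
for message, seen in warnings.items():
    print(seen, message)
```

Comparing these counters across hosts shows quickly whether a warning is specific to the crashing machines or present everywhere.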

Investigation of hypotheses 2 and 3

With Docker stopped on a problematic host and only the Linux bridge networking configured, the host ran stably for a week, which suggests the bridge by itself was not the trigger. Replacing pipework with manual IP assignment did not prevent the crashes either, so pipework was ruled out as well.

Deeper research

Searches on Server Fault and kernel mailing lists turned up a matching kernel bug that is fixed as of Linux 3.18 (https://lists.gt.net/linux/kernel/2256803, https://lkml.org/lkml/2014/2/15/217). The bug is a race between task_group and sched_task_group that can leave a task with a null cgroup pointer after a fork, which the scheduler then dereferences.
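
Once the fixed version is known, each host's kernel release string can be compared against it before and after the upgrade. The sketch below is illustrative only: `kernel_version` and `has_fix` are hypothetical names, and the 3.18 threshold is taken from the paragraph above. Distribution kernels often backport fixes, so a version comparison like this is a first filter, not proof.

```python
import re


def kernel_version(release):
    """Parse a release string such as '3.16.0-4-amd64' into (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch component (for example '3.18') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def has_fix(release, fixed_in=(3, 18, 0)):
    """Return True if the release is at or above the version carrying the fix."""
    return kernel_version(release) >= fixed_in


for release in ("3.16.0-4-amd64", "3.18", "3.19.0"):
    print(release, has_fix(release))
```

On a live Linux host the release string can be obtained with `platform.release()` from the standard library, which reports the same value as `uname -r`.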

Resolution

The team upgraded the host kernel to 3.19, a release that includes the fix, and reinstalled the latest Docker release available at the time. Installation consisted of removing the old Docker packages and running the official install script:

curl -fsSL https://get.docker.com/ | sh
nohup docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock -s devicemapper &

After the upgrades, the hosts have been running for over two months without further crashes.

Verification

16:44:15 up 28 days, 23:41, 2 users, load average: 0.10, 0.13, 0.15
docker    30320  1  0 Jan11 ?        00:49:56 /usr/bin/docker daemon -p /var/run/docker.pid

Lessons learned

Maintain an emergency plan and document possible failure modes before launch.

Prioritise restoring business services before deep debugging.

Select software versions that are widely tested and supported in production.

Upgrade the kernel only after confirming the exact version that contains the fix.

Implement proper locking and validation when multiple schedulers manipulate the same container resources.
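
The last lesson can be illustrated with a per-container advisory lock. This is a minimal sketch under strong assumptions: the competing schedulers run on the same Linux host and agree on a shared lock directory, and `container_lock` is a hypothetical helper. Schedulers on different hosts would need a distributed lock instead.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def container_lock(container_id, lock_dir):
    """Hold an exclusive, non-blocking advisory lock for one container."""
    path = os.path.join(lock_dir, f"{container_id}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Raises BlockingIOError at once if another holder has the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


lock_dir = tempfile.mkdtemp()
with container_lock("proxy-01", lock_dir):
    try:
        # A second scheduler touching the same container is refused.
        with container_lock("proxy-01", lock_dir):
            second_acquired = True
    except BlockingIOError:
        second_acquired = False
print("second scheduler acquired lock:", second_acquired)
```

Failing fast with `LOCK_NB` lets the losing scheduler skip or retry the operation instead of queueing behind a change it may no longer need to make.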

[Figure: Architecture diagram]