Why Docker Hosts Crashed: Tracing Kernel Null‑Pointer Bugs and the Fix
The article recounts a half‑year investigation of a high‑performance proxy cluster whose Docker hosts repeatedly crashed due to kernel null‑pointer dereferences, detailing log analysis, three faulty hypotheses, extensive web research, kernel and Docker upgrades, and the final operational lessons learned.
Background
A high-performance proxy cluster runs Docker containers on hosts equipped with 10 GbE NICs. Each host uses a Linux bridge to assign IP addresses directly to containers, and all configuration data is stored in a scheduler. The environment is uniform: the same kernel (Linux 3.16.0-4-amd64), the same Docker version (1.12.1/1.12.2), and identical hardware.
Symptom
Within weeks of the service going live, hosts began to crash one after another, taking down every service running on them. A crashed host was completely unresponsive: remote login failed, and the only evidence was the syslog output that had already been collected via ELK.
Log Sample
Nov 12 15:06:31 hello-world kernel: [6373724.634681] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
Nov 12 15:06:31 hello-world kernel: [6373724.634718] IP: [] pick_next_task_fair+0x6b8/0x820
Nov 12 15:06:31 hello-world kernel: [6373724.634749] PGD 10561e4067 PUD ffdb46067 PMD 0
Nov 12 15:06:31 hello-world kernel: [6373724.634780] Oops: 0000 [#1] SMP
Initial hypotheses
1. Docker version incompatibility with the host kernel.
2. Linux bridge networking bug triggered by the bridge configuration.
3. Pipework script used for IP assignment contains a bug.
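When many hosts fail at different times, it can help to reduce each crash to a signature (fault address plus faulting function) before testing any hypothesis. The sketch below is an illustration, not the team's tooling: the names `oops_signatures` and `count_signatures` and both regular expressions are assumptions, written against the two oops lines in the log sample above, including the empty `[]` that the sample shows after `IP:`.

```python
import re
from collections import Counter

# Matches the "BUG:" line from the log sample and captures the fault address.
NULL_DEREF_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-fA-F]+)"
)
# Matches "IP: [...] symbol+0xoffset/0xsize". The bracket content may be empty,
# as it is in the sample, so it is skipped rather than captured.
IP_RE = re.compile(r"IP: \[[^\]]*\]\s+([\w.]+)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+")


def oops_signatures(lines):
    """Yield (fault_address, faulting_symbol) for each NULL-dereference oops."""
    pending_address = None
    for line in lines:
        match = NULL_DEREF_RE.search(line)
        if match:
            pending_address = match.group(1)
            continue
        match = IP_RE.search(line)
        if match and pending_address is not None:
            yield pending_address, match.group(1)
            pending_address = None


def count_signatures(lines):
    """Count how often each crash signature occurs across all log lines."""
    return Counter(oops_signatures(lines))
```

Feeding the syslog lines of every crashed host through `count_signatures` shows whether the hosts share one signature, which would point to a single underlying bug rather than several unrelated ones.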
Investigation of hypothesis 1
All hosts showed recurring warnings such as:
time="2016-09-07T20:22:19.450573015+08:00" level=warning msg="Your kernel does not support cgroup memory limit"
time="2016-09-07T20:22:19.450618295+08:00" level=warning msg="Your kernel does not support cgroup cfs period"
time="2016-09-07T20:22:19.450640785+08:00" level=warning msg="Your kernel does not support cgroup cfs quotas"
time="2016-09-07T20:22:19.450769672+08:00" level=warning msg="mountpoint for pids not found"
These messages match known Docker 1.9-era bugs that were fixed in Docker 1.12.3 (see https://github.com/docker/docker/issues/24211). Upgrading Docker to 1.12.2 did not stop the crashes, indicating the problem lay elsewhere.
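One way to check whether such warnings reflect the kernel rather than Docker is to look at which cgroup controllers the kernel reports in `/proc/cgroups`. The sketch below is only an assumption about how that check could be scripted: the function names and the default list of required controllers are hypothetical. It also covers controller presence only; CFS period and quota support depends on the kernel build configuration and is not visible in this file.

```python
def parse_proc_cgroups(text):
    """Map controller name -> enabled flag from the content of /proc/cgroups.

    Each data line has four whitespace-separated fields:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    controllers = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore lines that do not follow the expected layout
        name, _hierarchy, _num_cgroups, enabled = fields
        controllers[name] = enabled == "1"
    return controllers


def missing_controllers(text, required=("memory", "cpu", "pids")):
    """Return the required controllers that are absent or disabled.

    The default 'required' tuple is an assumption chosen to mirror the
    warnings quoted above; adjust it to the limits actually in use.
    """
    found = parse_proc_cgroups(text)
    return [name for name in required if not found.get(name, False)]
```

On a host, the input would be the result of reading `/proc/cgroups`. A controller that is listed with `enabled` set to 0, or not listed at all, cannot be fixed by upgrading Docker alone.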
Investigation of hypotheses 2 and 3
On a host that had crashed, Docker was disabled and only the network (Linux bridge) was configured. The host then stayed stable for a week, which argued against the bridge configuration being the trigger on its own. Replacing pipework with manual IP assignment did not prevent the crashes either, so pipework was ruled out.
Deeper research
Searches on Server Fault and kernel mailing lists uncovered a kernel bug whose fix requires Linux 3.18 (https://lists.gt.net/linux/kernel/2256803, https://lkml.org/lkml/2014/2/15/217). The bug is a race between task_group and sched_task_group that can leave a task with a null cgroup pointer after a fork.
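Once a minimum fixed version is known, each host's kernel release string can be compared against it. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the comparison looks only at the upstream version numbers. Distribution kernels may carry backported fixes (or lack them) regardless of the number, so the result is a first filter, not proof.

```python
import re


def parse_kernel_release(release):
    """Turn a release string such as '3.16.0-4-amd64' into (3, 16, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_at_least(release, minimum):
    """Return True if the release is at or above the minimum version tuple."""
    # Pad the minimum to three components so (3, 18) compares as (3, 18, 0).
    padded = tuple(minimum) + (0,) * (3 - len(minimum))
    return parse_kernel_release(release) >= padded
```

For example, `is_at_least("3.16.0-4-amd64", (3, 18))` evaluates to `False` for the kernel named in the background section. On a live host the release string can be taken from `platform.release()`.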
Resolution
The team upgraded the host kernel to a version where the bug is fixed (3.19) and reinstalled the latest Docker release (1.19 at the time). Installation steps included removing old Docker packages and running the official install script:
curl -fsSL https://get.docker.com/ | sh
nohup docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock -s devicemapper &
After the upgrades, the hosts have been running for over two months without further crashes.
Verification
16:44:15 up 28 days, 23:41, 2 users, load average: 0.10, 0.13, 0.15
docker 30320 1 0 Jan11 ? 00:49:56 /usr/bin/docker daemon -p /var/run/docker.pid
Lessons learned
Maintain an emergency plan and document possible failure modes before launch.
Prioritise restoring business services before deep debugging.
Select software versions that are widely tested and supported in production.
Upgrade the kernel only after confirming the exact version that contains the fix.
Implement proper locking and validation when multiple schedulers manipulate the same container resources.
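The last lesson can be illustrated with a host-local lock around operations on a single container. This is a sketch only: the source does not describe how the schedulers coordinate, the name `container_lock` and the lock-file approach are assumptions, and `fcntl.flock` protects only processes on the same Unix host. Schedulers running on different machines would need a shared coordination service instead.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def container_lock(lock_path):
    """Hold an exclusive, non-blocking advisory lock on lock_path.

    Raises BlockingIOError if another holder already has the lock, so a
    second scheduler fails fast instead of modifying the same container.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes the call fail immediately when the lock is taken.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Validation, such as confirming that an IP address is not already assigned, would run inside the `with container_lock(...)` block so that the check and the change happen under the same lock.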
dbaplus Community