Avoid These 10 Common Docker Production Pitfalls (Plus 5 Hidden Issues)
This article compiles the ten most frequent Docker problems encountered in production: disk exhaustion, time drift, DNS failures, OOM kills, lost external connectivity, data loss, tag confusion, signal handling, resource‑limit oversights, and exposed daemon ports. For each it gives concrete symptoms, a root‑cause explanation, diagnostic commands, remediation steps, and preventive measures, and it closes with five often‑overlooked traps.
Pitfall 1: Docker storage runs out (Disk Full)
Symptoms
Container fails to start with the error "no space left on device".
docker ps reports "Cannot connect to the Docker daemon".
df -h shows /var/lib/docker at 100% usage.
File writes inside container return "No space left on device"
Root cause
Docker’s storage drivers (overlay2, devicemapper, btrfs, zfs) place image layers, container layers, logs and build cache under /var/lib/docker. If this directory shares a partition with the root filesystem or the partition is small, it quickly fills up.
Diagnostic commands
# Check Docker data directory usage
df -h /var/lib/docker
# Show Docker disk usage summary
docker system df
# Detailed usage per component
docker system df -v
# Inspect container log sizes
ls -lh /var/lib/docker/containers/*/*-json.log
# Examine overlay2 layer consumption
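du -sh /var/lib/docker/overlay2/*
Fixes
# Remove all unused images (omit -a to remove only dangling images)
docker image prune -a
# Clean build cache
docker builder prune -a
# Remove all unused resources (images, containers, networks, caches)
# WARNING: --volumes also deletes unused volumes and the data in them
docker system prune -a --volumes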
# Limit container log size (daemon.json; applies to containers created after the daemon restarts)
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
# Emergency manual log truncation
truncate -s 0 /var/lib/docker/containers/<container-id>/*-json.log
Prevention
Place /var/lib/docker on a dedicated partition or LVM volume.
Configure log rotation with max-size and max-file.
Regularly prune unused images and build cache.
Monitor disk usage and trigger alerts when usage exceeds 80%.
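The 80% alert in the last item can be approximated with a small check run from cron. The sketch below is a minimal POSIX shell example, not a replacement for a monitoring system; the function names, the default path, and the default threshold are illustrative assumptions.

```shell
#!/bin/sh
# Print the used-space percentage (digits only) of the filesystem holding $1.
usage_pct() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed when percentage $1 exceeds limit $2.
exceeds_limit() {
    [ "$1" -gt "$2" ]
}

# Report on a directory; the /var/lib/docker and 80 defaults are assumptions.
# Returns 0 when below the limit, 1 when above it, 2 when usage is unreadable.
check_docker_disk() {
    dir=${1:-/var/lib/docker}
    limit=${2:-80}
    pct=$(usage_pct "$dir")
    [ -n "$pct" ] || { echo "UNKNOWN: cannot read usage of $dir"; return 2; }
    if exceeds_limit "$pct" "$limit"; then
        echo "WARNING: $dir is ${pct}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $dir is ${pct}% full"
}
```

Calling check_docker_disk /var/lib/docker 80 from a cron job and forwarding any non-zero exit status to an alerting channel is one simple way to wire it up.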
Pitfall 2: Container time differs from host
Symptoms
date inside the container is 8 hours behind the host.
Application logs show incorrect timestamps.
Database entries have an 8‑hour offset.
Certificate validity calculations are wrong.
Root cause
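Containers share the host's kernel clock, but the timezone comes from the image, and most base images default to UTC. If the host runs in CST (China Standard Time, UTC+8) and the container neither mounts the host's timezone files nor sets TZ, the container reports UTC, which is 8 hours behind.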
Diagnostic commands
# Host time
date
# Container time
docker exec <container-id> date
# Check if timezone files are mounted
docker inspect <container-id> | grep -A 20 "Mounts"
Fixes
# Option 1: Mount host timezone files at runtime
docker run -v /etc/timezone:/etc/timezone:ro \
-v /etc/localtime:/etc/localtime:ro \
nginx
# Option 2: Set TZ environment variable (if base image supports it)
docker run -e TZ=Asia/Shanghai nginx
# Option 3: docker‑compose timezone configuration
services:
  app:
    image: my-app:latest
    environment:
      TZ: "Asia/Shanghai"
    volumes:
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
# Option 4: Set timezone in Dockerfile
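FROM ubuntu:20.04
# DEBIAN_FRONTEND=noninteractive keeps the tzdata install from prompting during the build
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    echo "Asia/Shanghai" > /etc/timezone
Pitfall 3: Containers cannot resolve internal DNS names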
Symptoms
Host can ping redis-master, container cannot.
curl http://nginx works on host but fails inside container.
Public DNS (e.g., baidu.com) resolves, internal names do not.
Cross‑container communication errors like "could not resolve host".
Root cause
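Docker's embedded DNS server (127.0.0.11) is used on user‑defined networks and resolves container names and aliases there. Everything else is forwarded to upstream servers taken from the host's /etc/resolv.conf, but loopback entries (for example 127.0.0.53 from systemd-resolved or a local dnsmasq) cannot be reached from inside a container, and names defined only in the host's /etc/hosts are never visible to containers. The result is that the host's custom internal DNS is often not what the container ends up querying.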
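A resolver that listens only on a loopback address of the host is not reachable from a container on a bridge network, because loopback inside the container refers to the container itself. The following sketch lists the nameservers in a resolv.conf that a container could actually use; the function name is a hypothetical helper, not part of Docker.

```shell
#!/bin/sh
# Read resolv.conf text on stdin and print every nameserver that is NOT a
# loopback address (127.0.0.0/8 or ::1).
usable_nameservers() {
    awk '$1 == "nameserver" && $2 !~ /^127\./ && $2 != "::1" { print $2 }'
}
```

Running usable_nameservers < /etc/resolv.conf on the host and getting no output suggests that containers need an explicit --dns setting pointing at the internal DNS server.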
Diagnostic commands
# View container's resolv.conf
docker exec <container-id> cat /etc/resolv.conf
# Inspect container network mode
docker inspect <container-id> | grep -A 10 "NetworkSettings"
# Test DNS from inside container
docker exec <container-id> nslookup nginx
docker exec <container-id> dig nginx
# Host DNS configuration
cat /etc/resolv.conf
Fixes
# 1. Use --dns to specify DNS servers
docker run --dns 192.168.1.53 nginx
# 2. docker‑compose DNS configuration
services:
  app:
    image: my-app:latest
    dns:
      - 192.168.1.53
      - 8.8.8.8
    networks:
      - my-net
networks:
  my-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16
# 3. Global daemon DNS (daemon.json)
{
"dns": ["192.168.1.53", "8.8.8.8"]
}
# Remember to restart Docker after editing daemon.json
systemctl restart docker
Pitfall 4: Container OOMKilled
Symptoms
docker ps -a shows the container as exited, typically with exit code 137.
Last log line looks normal; no error.
docker inspect reports OOMKilled: true.
Host logs (dmesg or journalctl) contain OOM records.
Root cause
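Memory limits are enforced by Linux cgroups. When the container's processes exceed the limit, the kernel OOM killer sends SIGKILL to a process inside the container. SIGKILL cannot be caught or handled, so the process gets no chance to clean up or log an error, and if it is the container's main process the container exits abruptly.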
Diagnostic commands
# Check OOM status
docker inspect <container-id> | grep -E "OOMKilled|ExitCode|State"
# View container memory usage peak
docker stats <container-id> --no-stream
# Inspect memory limit configuration
docker inspect <container-id> | grep -A 5 "Memory"
# Host cgroup memory stats (cgroup v1 paths; on cgroup v2 hosts the equivalent files are memory.current and memory.max)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
# Host OOM logs
dmesg | grep -i "out of memory"
journalctl -k | grep -i "out of memory" | tail -20
Fixes
# 1. Increase memory limit and restart container
docker run --memory=1g my-app:latest
# 2. docker‑compose memory limits
services:
  app:
    image: my-app:latest
    mem_limit: 1g
    mem_reservation: 512m
# 3. For Java apps, set JVM heap below container limit (75‑80%)
# (assumes the image's start script passes JAVA_OPTS to the JVM)
docker run -e JAVA_OPTS="-Xmx768m" --memory=1g my-java-app
# 4. Horizontal scaling – add more container instances
Prevention
Set realistic memory limits; avoid overly large or tiny values.
Explicitly configure JVM/Node.js heap sizes to stay within limits.
Enable monitoring and alerts when memory usage exceeds 80% of the limit.
Deploy OOM alert scripts on the host.
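The heap-sizing rule above (keep the JVM heap at roughly 75 to 80% of the container limit) is simple arithmetic. The sketch below shows one way to derive an -Xmx value from a limit in bytes; heap_mb is a hypothetical helper and the 75% default is an assumption taken from the range given in the fixes.

```shell
#!/bin/sh
# Print a heap size in MiB equal to PERCENT (default 75) of LIMIT_BYTES.
# Returns 1 when the limit is 0, which Docker reports for "no limit".
heap_mb() {
    limit_bytes=$1
    percent=${2:-75}
    [ "$limit_bytes" -gt 0 ] || return 1
    echo $(( limit_bytes / 1048576 * percent / 100 ))
}
```

The limit can be read with docker inspect --format '{{.HostConfig.Memory}}' <container-id>; for a 1 GiB limit (1073741824 bytes) heap_mb prints 768, matching the -Xmx768m example above. Recent JVMs can do the same calculation themselves through the -XX:MaxRAMPercentage option.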
Pitfall 5: Containers cannot access the external network while the host can
Symptoms
ping baidu.com works on host but fails inside container.
curl https://google.com times out from container.
Inter‑container communication works (same bridge network).
Container can reach host IP but not other external IPs.
Root cause
Typical reasons are an MTU mismatch, Docker's iptables NAT rules being flushed (for example by a firewall reload), IP forwarding disabled on the host (net.ipv4.ip_forward=0), or a missing proxy configuration.
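For the MTU case, a quick comparison of the bridge MTU against the host uplink MTU often settles the question: trouble is likely when the bridge value is larger than what the uplink can carry. The helpers below are a minimal sketch with hypothetical names; they parse the text printed by ip link show and assume the interface names docker0 and eth0 used in this article.

```shell
#!/bin/sh
# Read `ip link show <dev>` output on stdin and print the value after "mtu".
link_mtu() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# Succeed when the bridge MTU ($1) does not exceed the uplink MTU ($2).
mtu_fits() {
    [ "$1" -le "$2" ]
}
```

A possible use is mtu_fits "$(ip link show docker0 | link_mtu)" "$(ip link show eth0 | link_mtu)" || echo "bridge MTU is larger than uplink MTU".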
Diagnostic commands
# Test IP‑layer connectivity
docker exec <container-id> ping 8.8.8.8
# Test DNS resolution
docker exec <container-id> ping baidu.com
# Test application‑layer connectivity
docker exec <container-id> curl -v https://google.com
# Inspect host iptables NAT rules
iptables -t nat -L -n | grep DOCKER
# Check Docker bridge configuration
ip addr show docker0
ip route show
# Verify MTU settings
ip link show eth0
docker network inspect bridge | grep -i mtu
# Capture packets for analysis
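tcpdump -i docker0 -n host 8.8.8.8
Fixes
# MTU adjustment: docker run has no --mtu flag, so create a network with the desired MTU
# Keep it at or below the host interface MTU (1450 is only an example value)
docker network create --opt com.docker.network.driver.mtu=1450 my-net
docker run --network=my-net my-app
# Global daemon MTU configuration (daemon.json)
{ "mtu": 1450 }
# Restore Docker iptables rules: restarting the daemon recreates them
# WARNING: the two flush commands also delete every non-Docker rule in these tables
iptables -t nat -F
iptables -t filter -F
systemctl restart docker
# Proxy handling – propagate host proxy to container
# (on Linux, host.docker.internal resolves only with the --add-host mapping, Docker 20.10+)
docker run --add-host=host.docker.internal:host-gateway \
  -e HTTP_PROXY=http://host.docker.internal:7890 my-app
Pitfall 6: Data loss after container removal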
Symptoms
Re‑deployed container cannot find previously written data.
Database container starts with an empty database.
Configuration changes disappear after the container is re-created.
Root cause
By default, a container writes to a copy‑on‑write layer that is deleted when the container is removed. Data persists only when it is stored in a named volume or a bind mount; a tmpfs mount lives in memory and is lost as soon as the container stops.
Diagnostic commands
# Show container mounts
docker inspect <container-id> | grep -A 20 "Mounts"
# List volumes
docker volume ls
# Inspect a specific volume
docker volume inspect <volume-name>
# Verify data on host
ls -la /var/lib/docker/volumes/<volume-name>/_data
Fixes
# Use named volume for MySQL persistence
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: "password"
    volumes:
      - mysql_data:/var/lib/mysql
    ports:
      - "3306:3306"
volumes:
  mysql_data:
    driver: local
# Bind‑mount configuration files
services:
  nginx:
    image: nginx:1.24
    volumes:
      - /data/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - /data/nginx/logs:/var/log/nginx
    ports:
      - "80:80"
# Avoid anonymous volumes; always name them explicitly
    volumes:
      - mysql_data:/var/lib/mysql
Pitfall 7: Confusing latest tag
Symptoms
A newer image has been pushed, but docker run my-app:latest behaves exactly as before.
After docker build -t my-app:1.0 ., docker images shows <none> entries.
Deployed version is unclear; cannot trace which image was used.
Root cause
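The latest tag is just a regular tag; it points to whatever image was last built or pushed without an explicit tag, or tagged latest by hand. It does not automatically track the newest build, and because docker run uses a local copy without checking the registry, local and remote latest may diverge.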
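A deploy script can refuse image references that are not pinned. The sketch below is one possible guard in POSIX shell; is_pinned_ref is a hypothetical helper that treats a reference as pinned when it carries a digest or an explicit tag other than latest.

```shell
#!/bin/sh
# Succeed when image reference $1 is pinned: it has a digest, or an explicit
# tag that is not "latest". A reference without a tag implies :latest.
is_pinned_ref() {
    case $1 in
        *@sha256:*) return 0 ;;   # digest pin
    esac
    last=${1##*/}                 # drop registry/path, which may contain ":port"
    case $last in
        *:latest) return 1 ;;
        *:?*)     return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, is_pinned_ref "$IMAGE" || { echo "refusing unpinned image: $IMAGE"; exit 1; } placed at the top of a deploy script stops a rollout that would depend on whatever latest happens to mean on that host.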
Diagnostic commands
# List all tags for an image
docker images my-app
# Show image creation time
docker inspect my-app:latest | grep Created
# Show full image ID
docker images --no-trunc my-app
# Compare local and remote latest
docker pull my-app:latest
docker images my-app:latest
Fixes
# Always use explicit version tags
FROM nginx:1.24.0-alpine
# Build with precise tag
docker build -t my-app:1.2.3 .
docker build -t my-app:release-20240115 .
# Include git commit hash in tag
docker build -t my-app:v1.2.3-$(git rev-parse --short HEAD) .
# Adopt GitOps pipelines that generate unique tags per commit
stages:
  - build
  - push
  - deploy
build:
  stage: build
  script:
    # CI_JOB_ID is the current name of the legacy CI_BUILD_ID variable
    - IMAGE_TAG=${CI_COMMIT_SHORT_SHA}-${CI_JOB_ID}
    - docker build -t registry.example.com/my-app:${IMAGE_TAG} .
    - docker push registry.example.com/my-app:${IMAGE_TAG}
    - echo ${IMAGE_TAG} > image_tag.txt
  # Pass the tag file to later stages
  artifacts:
    paths:
      - image_tag.txt
deploy:
  stage: deploy
  script:
    - IMAGE_TAG=$(cat image_tag.txt)
    - kubectl set image deployment/my-app app=registry.example.com/my-app:${IMAGE_TAG}
Pitfall 8: PID 1 signal handling problems
Symptoms
docker stop times out (10 seconds by default); the container does not stop gracefully and is killed instead.
Container receives SIGTERM but does not exit cleanly.
docker kill sends SIGKILL, leaving zombie processes.
Logs show "main process exited, code 0" while child processes linger.
Root cause
PID 1 in a container has special signal handling: the kernel does not apply default signal actions to it, so a PID 1 process that installs no SIGTERM handler simply ignores the signal. When PID 1 is a shell (e.g., CMD ["/bin/sh","-c","java -jar app.jar"]), the shell typically does not forward signals to the actual application, causing it to miss SIGTERM. Some base images use tini or systemd as an init process to handle signals correctly.
Diagnostic commands
# Inspect the command line of PID 1
docker exec <container-id> cat /proc/1/cmdline | tr '\0' ' '
# Show the process tree
docker exec <container-id> ps aux
# Test stop time
time docker stop <container-id>
Fixes
# Use exec form of CMD so the app is PID 1
CMD ["java","-jar","app.jar"]
# Or wrap with an entrypoint that forwards signals
ENTRYPOINT ["/entrypoint.sh"]
# entrypoint.sh example:
# #!/bin/bash
# trap 'kill -TERM "$PID"' TERM INT
# java -jar app.jar &
# PID=$!
# wait "$PID"; STATUS=$?
# # wait returns early when a trapped signal arrives, so wait again if the app is still shutting down
# kill -0 "$PID" 2>/dev/null && { wait "$PID"; STATUS=$?; }
# exit "$STATUS"
# Enable Docker's built‑in init (tini) – Docker 1.13+
docker run --init my-app:latest
# Dockerfile example with tini
FROM alpine
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini","--"]
CMD ["java","-jar","app.jar"]
# Increase graceful stop timeout (docker‑compose)
services:
  app:
    image: my-app:latest
    stop_grace_period: 30s
    stop_signal: SIGTERM
Pitfall 9: Missing resource limits cause cascade failures (avalanche)
Symptoms
One host runs many containers; memory is exhausted.
A Java container leaks memory and drags down all other containers.
Containers are OOMKilled, restart, and get killed again in a loop.
Host load spikes above 100 %, all services become sluggish.
Root cause
Containers without explicit limits can consume the entire host memory. A single runaway container triggers OOM kills for others, overloads the Docker daemon, and may even cause the host kernel to OOM, leading to a system‑wide avalanche.
Diagnostic commands
# Show memory usage of all containers
docker stats --no-stream
# List running containers (name, image, status, ports)
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}"
# Show memory limit per container in bytes (0 means no limit)
docker ps --format "{{.Names}}" | while read -r name; do
  limit=$(docker inspect "$name" --format '{{.HostConfig.Memory}}')
  echo "$name: $limit"
done
# Host resource overview
top
free -h
df -h
Fixes
# Always set limits in production (docker‑compose example)
services:
  app:
    image: my-app:latest
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
        reservations:
          memory: 256M
          cpus: '0.25'
      restart_policy:
        condition: on-failure
        max_attempts: 3
# Command‑line equivalent (docker run has no CPU reservation flag)
docker run -d \
  --memory=512m \
  --memory-reservation=256m \
  --cpus=0.5 \
  --restart=on-failure:3 \
  my-app:latest
# Set reasonable limits (e.g., 400 M needed → 512 M limit)
--memory=512m
Prevention
Define memory and CPU limits for every production container.
Monitor host and container resource usage; alert on high consumption.
Use orchestration platforms that enforce limits during scheduling.
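To audit the first rule, it helps to list the containers that currently run without a memory limit. The filter below is a minimal sketch with a hypothetical name; it relies on docker inspect reporting .HostConfig.Memory as 0 when no limit is set and .Name with a leading slash.

```shell
#!/bin/sh
# Read "NAME LIMIT_BYTES" lines on stdin and print the names whose limit is 0,
# stripping the leading "/" that docker inspect puts in front of .Name.
unlimited_containers() {
    awk 'NF >= 2 && $2 == 0 { sub("^/", "", $1); print $1 }'
}
```

One way to feed it is docker ps -q | xargs docker inspect --format '{{.Name}} {{.HostConfig.Memory}}' | unlimited_containers, which prints one container name per line.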
Pitfall 10: Docker daemon exposed on TCP ports 2375/2376
Symptoms
Cloud console alerts that the server has Docker port 2375 open.
curl http://server:2375/info returns full daemon info.
docker -H tcp://server:2375 ps can control remote containers.
Server compromised; attacker uses Docker escape to mine cryptocurrency.
Root cause
The Docker daemon does not listen on TCP by default. Administrators sometimes enable -H tcp://0.0.0.0:2375 for convenience, exposing an unauthenticated API that grants root‑equivalent control of the host. Even the TLS‑enabled port 2376 is unsafe unless the daemon verifies client certificates (tlsverify).
Diagnostic commands
# Check daemon listening ports
ps aux | grep dockerd | grep -v grep
ss -tlnp | grep docker
# Inspect systemd ExecStart parameters
systemctl cat docker | grep ExecStart
# Test if ports are open locally
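curl http://localhost:2375/info && echo "2375 is open"
# -k skips server certificate verification; a daemon that enforces client certificates still rejects the request
curl -k https://localhost:2376/info && echo "2376 is open"
# External scan (if permitted)
nmap -p 2375,2376 <server-ip>
Fixes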
# Close exposed API (systemd override)
# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd
# Reload and restart Docker
systemctl daemon-reload
systemctl restart docker
# Verify ports are closed
ss -tlnp | grep docker
# If remote API is required, enable TLS with client certificate verification
# Generate CA, server and client certificates (Docker docs)
# daemon.json example
# (if the systemd unit already passes -H, remove it there: hosts cannot be set in both places;
#  replace 127.0.0.1 with the management interface address when remote clients must connect)
{
  "tls": true,
  "tlsverify": true,
  "tlscert": "/etc/docker/tls/server-cert.pem",
  "tlskey": "/etc/docker/tls/server-key.pem",
  "tlscacert": "/etc/docker/tls/ca.pem",
  "hosts": ["fd://", "tcp://127.0.0.1:2376"]
}
# Client connection with certificates
docker -H tcp://server:2376 --tlsverify \
  --tlscert=client-cert.pem \
  --tlskey=client-key.pem \
  --tlscacert=ca.pem ps
# Firewall restriction (only management subnet can reach API)
iptables -A INPUT -p tcp --dport 2376 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2376 -j DROP
Best‑practice hardening
Never expose Docker API to the public internet.
Run containers with --read-only when possible.
Use --security-opt=no-new-privileges to prevent privilege escalation.
Avoid --privileged unless absolutely required.
Regularly audit container capabilities: docker inspect --format '{{.HostConfig.CapAdd}}' <container-id>.
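The port check from the diagnostic commands can be turned into a filter that flags only non-loopback listeners. This is a minimal sketch with a hypothetical function name; it assumes the column layout printed by ss -tln, where the local address is the fourth field.

```shell
#!/bin/sh
# Read `ss -tln` output on stdin and print listening addresses on port 2375 or
# 2376 that are not bound to loopback. Assumes the local address is column 4.
exposed_docker_ports() {
    awk '$1 == "LISTEN" && $4 ~ /:(2375|2376)$/ && $4 !~ /^(127\.|\[::1\])/ { print $4 }'
}
```

Running ss -tln | exposed_docker_ports on a host should print nothing; any output is an address worth investigating.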
Additional hidden pitfalls (11‑15)
Pitfall 11: Container timezone mismatch (repeat of Pitfall 2)
Same cause and fixes as Pitfall 2 – ensure the host timezone is mounted or set TZ environment variable.
Pitfall 12: Missing --restart policy
Containers exit and are not automatically restarted.
# Recommended restart policies
# no (default) – never restart
# on-failure – restart on non‑zero exit code
# on-failure:3 – max 3 retries
# always – always restart, even after Docker daemon restart
# unless-stopped – restart unless manually stopped
docker run -d \
--restart=unless-stopped \
my-app:latest
Pitfall 13: Volume permission issues
Host directory owned by root may be unreadable by container user (e.g., nginx). Fix by adjusting host permissions, running container as root (not recommended), or creating a user inside the image with proper ownership.
Pitfall 14: Cross‑container networking (bridge vs host)
On the default bridge network, containers can reach each other only by IP address, not by name, and --link is deprecated. Use user‑defined bridge networks for name resolution. The host network mode shares the host's network namespace, which can cause port conflicts.
Pitfall 15: Multi‑stage builds leaking secrets
Copying build‑time files (e.g., .npmrc, credentials) into the final stage exposes them. Ensure only the final artifact is copied.
# Correct multi‑stage Dockerfile
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
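# CGO_ENABLED=0 produces a static binary that does not depend on glibc, so it runs on musl‑based alpine
RUN CGO_ENABLED=0 go build -ldflags="-w -s" -o app .
FROM alpine
COPY --from=builder /app/app /app
CMD ["/app"]
Conclusion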
The ten primary and five secondary Docker pitfalls listed above cover the most common failure sources in production environments. By following the diagnostic commands, applying the remediation steps, and adopting the preventive measures, operators can dramatically reduce downtime, avoid data loss, and harden their container infrastructure.
MaGe Linux Operations
