Avoid These 10 Common Docker Production Pitfalls (Plus 5 Hidden Issues)

This article compiles the ten most frequent Docker problems encountered in production: disk exhaustion, timezone mismatch, DNS failures, OOM kills, lost external connectivity, data loss, tag confusion, PID 1 signal handling, missing resource limits, and exposed daemon ports. For each one it gives concrete symptoms, a root-cause explanation, diagnostic commands, and remediation steps, plus preventive measures where they apply. It closes with five often-overlooked traps.

MaGe Linux Operations

May 10, 2026

Pitfall 1: Docker storage runs out (Disk Full)

Symptoms

Container fails to start with the error "no space left on device".

docker ps reports "Cannot connect to the Docker daemon".

df -h shows /var/lib/docker at 100% usage.

File writes inside the container return "No space left on device".

Root cause

Docker’s storage drivers (overlay2, devicemapper, btrfs, zfs) place image layers, container layers, logs and build cache under /var/lib/docker. If this directory shares a partition with the root filesystem or the partition is small, it quickly fills up.

Diagnostic commands

# Check Docker data directory usage
df -h /var/lib/docker

# Show Docker disk usage summary
docker system df

# Detailed usage per component
docker system df -v

# Inspect container log sizes
ls -lh /var/lib/docker/containers/*/*-json.log

# Examine overlay2 layer consumption
du -sh /var/lib/docker/overlay2/*

Fixes

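# Clean dangling (untagged) images only
docker image prune

# Clean all images not used by any container
docker image prune -a

# Clean build cache
docker builder prune -a

# Remove all unused resources (images, stopped containers, networks, build cache)
# WARNING: --volumes also deletes unused volumes and the data in them
docker system prune -a --volumes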

# Limit container log size (daemon.json); restart Docker afterwards, applies only to newly created containers
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}

# Emergency manual log truncation (run as root; the daemon keeps writing to the same file)
truncate -s 0 /var/lib/docker/containers/<container-id>/*-json.log

Prevention

Place /var/lib/docker on a dedicated partition or LVM volume.

Configure log rotation with max-size and max-file.

Regularly prune unused images and build cache.

Monitor disk usage and trigger alerts when usage exceeds 80%.
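The 80% alert rule can be sketched in a few lines of Python. This is a minimal illustration rather than a monitoring setup: the function names `disk_usage_percent` and `needs_alert` are hypothetical, and the result can differ slightly from `df`, which accounts for reserved blocks.

```python
import shutil

ALERT_THRESHOLD_PERCENT = 80.0  # threshold taken from the prevention list


def disk_usage_percent(path: str) -> float:
    """Return the used share of the filesystem that holds `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100.0


def needs_alert(percent_used: float, threshold: float = ALERT_THRESHOLD_PERCENT) -> bool:
    """True when usage is strictly above the threshold."""
    return percent_used > threshold
```

On a host with the default data root, `needs_alert(disk_usage_percent("/var/lib/docker"))` could be called from a cron job or an exporter; the path is an assumption and should follow the daemon's actual data-root setting.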

Pitfall 2: Container time differs from host

Symptoms

date inside the container is 8 hours behind the host.

Application logs show incorrect timestamps.

Database entries have an 8‑hour offset.

Certificate validity calculations are wrong.

Root cause

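Containers share the host's kernel clock, so the actual time is identical; what differs is the timezone. Timezone data comes from the image, and most base images default to UTC. If the host runs in CST (UTC+8) and the container neither mounts the host's timezone files nor sets TZ, the container reports UTC, which looks like an 8-hour offset.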
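A short Python sketch makes the point that both sides see the same instant and only the rendering differs. It uses a fixed UTC+8 offset instead of the tz database so it runs even where tzdata is missing; the name `wall_clock_gap_hours` and the sample instant are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
CST = timezone(timedelta(hours=8), "CST")  # China Standard Time as a fixed UTC+8 offset


def wall_clock_gap_hours(instant: datetime) -> float:
    """Gap between the host's (UTC+8) and the container's (UTC) wall-clock reading."""
    host_view = instant.astimezone(CST).replace(tzinfo=None)
    container_view = instant.astimezone(UTC).replace(tzinfo=None)
    return (host_view - container_view).total_seconds() / 3600


instant = datetime(2026, 5, 10, 4, 0, tzinfo=UTC)
print(instant.astimezone(UTC).isoformat())  # 2026-05-10T04:00:00+00:00
print(instant.astimezone(CST).isoformat())  # 2026-05-10T12:00:00+08:00
print(wall_clock_gap_hours(instant))        # 8.0
```

Timestamps stored with an explicit offset, as in the ISO strings printed here, stay unambiguous regardless of which timezone the container uses.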

Diagnostic commands

# Host time
date

# Container time
docker exec <container-id> date

# Check if timezone files are mounted
docker inspect <container-id> | grep -A 20 "Mounts"

Fixes

# Option 1: Mount host timezone files at runtime
# (/etc/timezone exists only on Debian/Ubuntu hosts; drop that mount elsewhere)
docker run -v /etc/timezone:/etc/timezone:ro \
           -v /etc/localtime:/etc/localtime:ro \
           nginx

# Option 2: Set TZ environment variable (if base image supports it)
docker run -e TZ=Asia/Shanghai nginx

# Option 3: docker‑compose timezone configuration
services:
  app:
    image: my-app:latest
    environment:
      TZ: "Asia/Shanghai"
    volumes:
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro

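# Option 4: Set timezone in Dockerfile
FROM ubuntu:20.04
# DEBIAN_FRONTEND=noninteractive keeps tzdata from prompting during the build
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    echo "Asia/Shanghai" > /etc/timezone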

Pitfall 3: Containers cannot resolve internal DNS names

Symptoms

Host can ping redis-master, container cannot.

curl http://nginx works on the host but fails inside the container.

Public DNS (e.g., baidu.com) resolves, internal names do not.

Cross‑container communication errors like "could not resolve host".

Root cause

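Containers do not use the host's resolver directly. On the default bridge, Docker copies the host's /etc/resolv.conf into the container but filters out loopback nameservers (such as a local dnsmasq or the systemd-resolved stub); if none remain, it falls back to public DNS servers, which know nothing about internal names. On user-defined networks, the embedded DNS server (127.0.0.11) resolves container names and forwards everything else to the upstream servers configured for the container. Names that exist only in the host's /etc/hosts are never visible inside the container.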

Diagnostic commands

# View container's resolv.conf
docker exec <container-id> cat /etc/resolv.conf

# Inspect container network mode
docker inspect <container-id> | grep -A 10 "NetworkSettings"

# Test DNS from inside container
docker exec <container-id> nslookup nginx
docker exec <container-id> dig nginx

# Host DNS configuration
cat /etc/resolv.conf

Fixes

# 1. Use --dns to specify DNS servers
docker run --dns 192.168.1.53 nginx

# 2. docker‑compose DNS configuration
services:
  app:
    image: my-app:latest
    dns:
      - 192.168.1.53
      - 8.8.8.8
    networks:
      - my-net

networks:
  my-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

# 3. Global daemon DNS (daemon.json)
{
  "dns": ["192.168.1.53", "8.8.8.8"]
}

# Remember to restart Docker after editing daemon.json
systemctl restart docker
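After changing DNS settings, it helps to verify resolution from inside the container with the same resolver the application uses. The sketch below assumes the image ships Python; `can_resolve` and `check_names` are hypothetical helper names built on the standard library's `socket.getaddrinfo`.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can turn `hostname` into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def check_names(hostnames):
    """Map each hostname to whether it resolves from this environment."""
    return {name: can_resolve(name) for name in hostnames}
```

Calling `check_names(["redis-master", "nginx", "baidu.com"])` inside the container would show at a glance whether only the internal names fail, which points at the upstream DNS configuration rather than at general connectivity.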

Pitfall 4: Container OOMKilled

Symptoms

docker ps -a shows the container as exited, typically with exit code 137.

The last log line looks normal, with no error.

docker inspect reports OOMKilled: true.

Host logs (dmesg or journalctl) contain OOM records.

Root cause

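Memory limits are enforced by Linux cgroups. When the container's processes exceed the limit, the kernel's OOM killer sends SIGKILL to a process inside the container. SIGKILL cannot be caught or handled, so the process gets no chance to log or clean up; if it is the container's main process, the container exits abruptly with code 137 (128 + 9).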

Diagnostic commands

# Check OOM status
docker inspect <container-id> | grep -E "OOMKilled|ExitCode|State"

# View current container memory usage (a snapshot, not the peak)
docker stats --no-stream <container-id>

# Inspect memory limit configuration
docker inspect <container-id> | grep -A 5 "Memory"

# Host cgroup memory stats (cgroup v1 paths)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
# On cgroup v2 hosts the files are memory.current and memory.max under the container's cgroup directory

# Host OOM logs
dmesg -T | grep -i "out of memory"
journalctl -k --no-pager | grep -i "out of memory" | tail -20

Fixes

# 1. Increase memory limit and restart container
docker run --memory=1g my-app:latest

# 2. docker‑compose memory limits
services:
  app:
    image: my-app:latest
    mem_limit: 1g
    mem_reservation: 512m

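# 3. For Java apps, set JVM heap below container limit (75-80%)
# (JAVA_OPTS only takes effect if the image's start script passes it to java;
#  JAVA_TOOL_OPTIONS is read by the JVM itself)
docker run -e JAVA_OPTS="-Xmx768m" --memory=1g my-java-app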

# 4. Horizontal scaling – add more container instances

Prevention

Set realistic memory limits; avoid overly large or tiny values.

Explicitly configure JVM/Node.js heap sizes to stay within limits.

Enable monitoring and alerts when memory usage exceeds 80% of the limit.

Deploy OOM alert scripts on the host.
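The 75-80% heap rule is simple arithmetic, and a small helper keeps the heap flag in step with the container limit. This is a sketch under assumptions: `parse_memory` and `jvm_xmx_flag` are hypothetical names, and only the b/k/m/g suffixes that `--memory` accepts are handled.

```python
import re

_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a Docker-style size such as '512m' or '1g' to bytes."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if not match:
        raise ValueError(f"unrecognized memory value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or "b"]


def jvm_xmx_flag(container_limit: str, heap_fraction: float = 0.75) -> str:
    """Return an -Xmx flag sized to a fraction of the container limit, in MiB."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(parse_memory(container_limit) * heap_fraction) // 1024**2
    return f"-Xmx{heap_mib}m"


print(jvm_xmx_flag("1g"))         # -Xmx768m
print(jvm_xmx_flag("512m", 0.8))  # -Xmx409m
```

The remaining 20-25% is headroom for metaspace, thread stacks, and native buffers, which count against the cgroup limit but not against the heap.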

Pitfall 5: Containers cannot access the external network while the host can

Symptoms

ping baidu.com works on the host but fails inside the container.

curl https://google.com times out from the container.

Inter‑container communication works (same bridge network).

Container can reach host IP but not other external IPs.

Root cause

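Typical causes are an MTU mismatch between docker0 and the underlying network, Docker's iptables NAT rules being flushed (for example by a firewall reload), IP forwarding disabled on the host (net.ipv4.ip_forward = 0), or a missing proxy configuration.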

Diagnostic commands

# Test IP‑layer connectivity
docker exec <container-id> ping 8.8.8.8

# Test DNS resolution
docker exec <container-id> ping baidu.com

# Test application‑layer connectivity
docker exec <container-id> curl -v https://google.com

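# Inspect host iptables NAT rules (Docker chains and the MASQUERADE rule)
iptables -t nat -S | grep -E "DOCKER|MASQUERADE"

# Check that IP forwarding is enabled (expect 1)
sysctl net.ipv4.ip_forward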

# Check Docker bridge configuration
ip addr show docker0
ip route show

# Verify MTU settings
ip link show eth0
docker network inspect bridge | grep -i mtu

# Capture packets for analysis
tcpdump -i docker0 -n host 8.8.8.8

Fixes

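# MTU adjustment: create a network whose MTU matches the host interface
# (1450 is an example value for VXLAN-style overlays; docker run has no --mtu flag)
docker network create --opt com.docker.network.driver.mtu=1450 my-net
docker run --network=my-net my-app

# Global daemon MTU configuration for the default bridge (daemon.json)
{ "mtu": 1450 }

# Restore Docker iptables rules: restarting the daemon recreates its chains
# (avoid iptables -F here; it also wipes the host's own firewall rules)
systemctl restart docker

# Proxy handling: propagate host proxy to container
# (on Linux, host.docker.internal needs the --add-host mapping below, Docker 20.10+)
docker run --add-host=host.docker.internal:host-gateway \
  -e HTTP_PROXY=http://host.docker.internal:7890 \
  -e HTTPS_PROXY=http://host.docker.internal:7890 my-app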

Pitfall 6: Data loss after container removal

Symptoms

Re‑deployed container cannot find previously written data.

Database container starts with an empty database.

Configuration changes disappear after the container is recreated.

Root cause

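By default containers write to a copy-on-write layer that survives a stop and start but is deleted when the container is removed. Data outlives the container only when it is stored in a named volume or a bind mount; tmpfs mounts live in memory and are lost as soon as the container stops.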

Diagnostic commands

# Show container mounts
docker inspect <container-id> | grep -A 20 "Mounts"

# List volumes
docker volume ls

# Inspect a specific volume
docker volume inspect <volume-name>

# Verify data on host
ls -la /var/lib/docker/volumes/<volume-name>/_data

Fixes

# Use named volume for MySQL persistence
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: "password"
    volumes:
      - mysql_data:/var/lib/mysql
    ports:
      - "3306:3306"

volumes:
  mysql_data:
    driver: local

# Bind‑mount configuration files
services:
  nginx:
    image: nginx:1.24
    volumes:
      - /data/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - /data/nginx/logs:/var/log/nginx
    ports:
      - "80:80"

# Avoid anonymous volumes; always name them explicitly
volumes:
  - mysql_data:/var/lib/mysql

Pitfall 7: Confusing latest tag

Symptoms

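A newer my-app:latest has been pushed to the registry, but docker run my-app:latest still shows the old behavior, because docker run does not re-pull a tag that already exists locally.

Rebuilding with docker build -t my-app:1.0 . leaves the previous image listed as <none> in docker images.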

Deployed version is unclear; cannot trace which image was used.

Root cause

The latest tag is just a regular tag; it points to whatever image was last built or tagged with latest. It does not automatically track the newest build, and local and remote latest may diverge.

Diagnostic commands

# List all tags for an image
docker images my-app

# Show image creation time
docker inspect my-app:latest | grep Created

# Show full image ID
docker images --no-trunc my-app

# Compare local and remote latest
docker pull my-app:latest
docker images my-app:latest

Fixes

# Always use explicit version tags
FROM nginx:1.24.0-alpine

# Build with precise tag
docker build -t my-app:1.2.3 .
docker build -t my-app:release-20240115 .

# Include git commit hash in tag
docker build -t my-app:v1.2.3-$(git rev-parse --short HEAD) .

# Generate a unique tag per commit in CI (GitLab CI example)
stages:
  - build
  - deploy

build:
  stage: build
  script:
    - IMAGE_TAG=${CI_COMMIT_SHORT_SHA}-${CI_PIPELINE_ID}
    - docker build -t registry.example.com/my-app:${IMAGE_TAG} .
    - docker push registry.example.com/my-app:${IMAGE_TAG}
    - echo ${IMAGE_TAG} > image_tag.txt
  # Pass the tag file on to the deploy job
  artifacts:
    paths:
      - image_tag.txt

deploy:
  stage: deploy
  script:
    - IMAGE_TAG=$(cat image_tag.txt)
    - kubectl set image deployment/my-app app=registry.example.com/my-app:${IMAGE_TAG}
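The version-plus-commit tagging scheme can be wrapped in a helper that rejects untraceable or malformed tags before anything is pushed. A minimal sketch, assuming a hypothetical function `build_image_tag` and a sample commit hash; the validation follows Docker's documented tag rules (at most 128 characters from letters, digits, underscore, period, and dash, not starting with a period or dash).

```python
import re


def build_image_tag(version: str, commit_sha: str, sha_length: int = 7) -> str:
    """Combine a version and a short commit hash into one traceable tag."""
    if version == "latest":
        raise ValueError("'latest' is not a traceable version")
    short_sha = commit_sha[:sha_length].lower()
    if not re.fullmatch(r"[0-9a-f]+", short_sha):
        raise ValueError(f"not a hexadecimal commit hash: {commit_sha!r}")
    tag = f"{version}-{short_sha}"
    # Docker tag rules: max 128 chars, must not start with '.' or '-'
    if not re.fullmatch(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}", tag):
        raise ValueError(f"invalid Docker tag: {tag!r}")
    return tag


print(build_image_tag("v1.2.3", "9fceb02d0ae598e95dc970b74767f19372d61af8"))
# v1.2.3-9fceb02
```

Because the tag embeds the commit, a running container can always be traced back to the exact source revision it was built from.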

Pitfall 8: PID 1 signal handling problems

Symptoms

docker stop waits out its timeout (10 seconds by default), after which the container is killed instead of stopping gracefully.

Container receives SIGTERM but does not exit cleanly.

docker kill sends SIGKILL, leaving zombie processes.

Logs show "main process exited, code 0" while child processes linger.

Root cause

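PID 1 is treated specially by the kernel: signals for which it has not installed a handler are ignored, so an application running as PID 1 must handle SIGTERM explicitly. PID 1 is also responsible for reaping orphaned child processes. When PID 1 is a shell (e.g., CMD ["/bin/sh","-c","java -jar app.jar"]), the shell typically does not forward signals to the actual application, which therefore never sees SIGTERM. Some base images use tini or systemd as an init process to handle signals and reaping correctly.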

Diagnostic commands

# Inspect the command line of PID 1
docker exec <container-id> cat /proc/1/cmdline | tr '\0' ' '

# Show the process tree
docker exec <container-id> ps aux

# Test stop time
time docker stop <container-id>

Fixes

# Use exec form of CMD so the app is PID 1
CMD ["java","-jar","app.jar"]

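# Or, if a wrapper script is needed, end it with exec so the app replaces the shell
ENTRYPOINT ["/entrypoint.sh"]
# entrypoint.sh example:
# #!/bin/sh
# set -e
# # ... setup steps (render config, wait for dependencies) ...
# exec java -jar app.jar   # java takes over PID 1 and receives SIGTERM directly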

# Enable Docker's built-in init (tini), available since Docker 1.13
docker run --init my-app:latest

# Dockerfile example with tini
FROM alpine
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini","--"]
CMD ["java","-jar","app.jar"]

# Increase graceful stop timeout (docker‑compose)
services:
  app:
    image: my-app:latest
    stop_grace_period: 30s
    stop_signal: SIGTERM
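Once the application itself is PID 1, it has to install its own SIGTERM handler, because the kernel does not apply the default "terminate" action to PID 1. The following Python sketch shows the usual pattern; `install_signal_handlers` and `run` are hypothetical names, and the loop body stands in for real application work.

```python
import signal
import threading

shutdown_requested = threading.Event()


def _request_shutdown(signum, frame):
    """Signal handler: only record the request; cleanup happens in the main loop."""
    shutdown_requested.set()


def install_signal_handlers():
    """Register handlers so SIGTERM (docker stop) and SIGINT are not ignored at PID 1."""
    signal.signal(signal.SIGTERM, _request_shutdown)
    signal.signal(signal.SIGINT, _request_shutdown)


def run(poll_seconds: float = 1.0):
    """Hypothetical main loop: keep working until a shutdown signal arrives."""
    install_signal_handlers()
    while not shutdown_requested.wait(timeout=poll_seconds):
        pass  # one unit of application work would go here
    # Flush buffers and close connections here, then return so the process exits with 0.
```

Keeping the handler tiny and doing the cleanup in the main loop avoids running complex code in signal context, and the process should finish well inside stop_grace_period.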

Pitfall 9: Missing resource limits cause cascade failures (avalanche)

Symptoms

One host runs many containers; memory is exhausted.

A Java container leaks memory and drags down all other containers.

Containers are OOMKilled, restart, and get killed again in a loop.

Host CPU is pinned at 100% and the load average spikes; all services become sluggish.

Root cause

Containers without explicit limits can consume the entire host memory. A single runaway container triggers OOM kills for others, overloads the Docker daemon, and may even cause the host kernel to OOM, leading to a system‑wide avalanche.

Diagnostic commands

# Show memory usage of all containers
docker stats --no-stream

# List running containers (name, image, status, ports)
docker ps --format "{{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}"

# Show memory limit per container in bytes (0 means no limit)
docker ps --format "{{.Names}}" | while read -r name; do \
  limit=$(docker inspect "$name" --format '{{.HostConfig.Memory}}'); \
  echo "$name: $limit"; \
done

# Host resource overview
top
free -h
df -h

Fixes

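# Always set limits in production (docker-compose example)
# Note: deploy.* is applied by Swarm and Docker Compose v2; legacy docker-compose v1 needs --compatibility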
services:
  app:
    image: my-app:latest
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
        reservations:
          memory: 256M
          cpus: '0.25'
      restart_policy:
        condition: on-failure
        max_attempts: 3

# Command-line equivalent
# (docker run has no CPU reservation flag; --cpu-shares sets a relative weight instead)
docker run -d \
  --memory=512m \
  --memory-reservation=256m \
  --cpus=0.5 \
  --restart=on-failure:3 \
  my-app:latest

# Set reasonable limits (e.g., 400 M needed → 512 M limit)
--memory=512m

Prevention

Define memory and CPU limits for every production container.

Monitor host and container resource usage; alert on high consumption.

Use orchestration platforms that enforce limits during scheduling.
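To enforce the first rule in this list, an audit can flag every container that runs without a memory limit. The sketch below parses the JSON array that `docker inspect` prints; the function name `containers_without_memory_limit` and the sample data are made up for illustration.

```python
import json


def containers_without_memory_limit(inspect_json: str):
    """Return names of containers whose HostConfig.Memory is 0 (no limit)."""
    names = []
    for container in json.loads(inspect_json):
        limit = container.get("HostConfig", {}).get("Memory", 0)
        if limit == 0:
            names.append(container.get("Name", "").lstrip("/"))
    return names


# Trimmed, made-up sample in the shape `docker inspect` prints.
sample = json.dumps([
    {"Name": "/web", "HostConfig": {"Memory": 536870912}},
    {"Name": "/worker", "HostConfig": {"Memory": 0}},
])
print(containers_without_memory_limit(sample))  # ['worker']
```

In practice the input would be the output of `docker inspect $(docker ps -q)`; a non-empty result is a reasonable condition for failing a deployment check.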

Pitfall 10: Docker daemon exposed on TCP ports 2375/2376

Symptoms

Cloud console alerts that the server has Docker port 2375 open.

curl http://server:2375/info returns full daemon info.

docker -H tcp://server:2375 ps can control remote containers.

Server compromised; attacker uses Docker escape to mine cryptocurrency.

Root cause

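Docker daemon does not listen on TCP by default. Administrators sometimes enable -H tcp://0.0.0.0:2375 for convenience, exposing an unauthenticated API that grants root-equivalent control of the host. Even the TLS port 2376 is unsafe unless the daemon verifies client certificates (tlsverify); TLS alone only encrypts the connection.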

Diagnostic commands

# Check daemon listening ports
ps aux | grep dockerd | grep -v grep
ss -tlnp | grep docker

# Inspect systemd ExecStart parameters
systemctl cat docker | grep ExecStart

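# Test if ports are open locally (-k skips certificate verification for the probe;
# a TLS handshake error on 2376 still means the port is listening)
curl -s http://localhost:2375/info && echo "2375 is open"
curl -sk https://localhost:2376/info && echo "2376 is open"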

# External scan (if permitted)
nmap -p 2375,2376 <server-ip>

Fixes

# Close exposed API (systemd override)
# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd

# Reload and restart Docker
systemctl daemon-reload
systemctl restart docker

# Verify ports are closed
ss -tlnp | grep docker

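# If remote API is required, enable TLS with client verification
# Generate CA, server and client certificates (Docker docs)
# daemon.json example ("tlsverify" makes the daemon reject clients without a CA-signed certificate)
# If the systemd unit already passes -H, remove it there; hosts cannot be set in both places
# For remote clients, replace 127.0.0.1 with the management interface address
{
  "tlsverify": true,
  "tlscert": "/etc/docker/tls/server-cert.pem",
  "tlskey": "/etc/docker/tls/server-key.pem",
  "tlscacert": "/etc/docker/tls/ca.pem",
  "hosts": ["fd://", "tcp://127.0.0.1:2376"]
}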

# Client connection with certificates
docker -H tcp://server:2376 --tlsverify \
  --tlscert=client-cert.pem \
  --tlskey=client-key.pem \
  --tlscacert=ca.pem ps

# Firewall restriction (only management subnet can reach the TLS API port)
iptables -A INPUT -p tcp --dport 2376 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2376 -j DROP

Best‑practice hardening

Never expose Docker API to the public internet.

Run containers with --read-only when possible.

Use --security-opt=no-new-privileges to prevent privilege escalation.

Avoid --privileged unless absolutely required.

Regularly audit container capabilities: docker inspect --format '{{.HostConfig.CapAdd}}' <container-id>.
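The exposure check from the diagnostics can also be scripted for a fleet of hosts. This is a minimal sketch using only the standard library; `tcp_port_open` and `exposed_docker_ports` are hypothetical names, and it only tests whether the ports accept TCP connections, not whether the API behind them is authenticated.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exposed_docker_ports(host: str, ports=(2375, 2376)):
    """List which of the conventional Docker API ports accept connections."""
    return [port for port in ports if tcp_port_open(host, port)]
```

Running `exposed_docker_ports("<server-ip>")` from outside the management subnet should return an empty list; scan only hosts you are permitted to test.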

Additional hidden pitfalls (11‑15)

Pitfall 11: Container timezone mismatch (repeat of Pitfall 2)

Same cause and fixes as Pitfall 2: mount the host's timezone files or set the TZ environment variable.

Pitfall 12: Missing --restart policy

Containers exit and are not automatically restarted.

# Recommended restart policies
# no (default) – never restart
# on-failure – restart on non‑zero exit code
# on-failure:3 – max 3 retries
# always – always restart, even after Docker daemon restart
# unless-stopped – restart unless manually stopped

docker run -d \
  --restart=unless-stopped \
  my-app:latest

Pitfall 13: Volume permission issues

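A bind-mounted host directory owned by root may be unreadable by the container's user (e.g., nginx), because permissions are checked against numeric UIDs and GIDs, not user names. Fix by changing ownership or permissions on the host to match the container user's UID/GID, by creating a user in the image with the matching UID, or, as a last resort, by running the container as root (not recommended).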
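The permission check the kernel applies to a bind mount can be modeled in a few lines, which helps when reasoning about why a given UID is denied. This is a simplified sketch: `can_read` is a hypothetical name, UID 101 is only an illustrative value, and ACLs, capabilities, and user-namespace remapping are ignored.

```python
import stat


def can_read(mode: int, owner_uid: int, owner_gid: int, uid: int, gids=()) -> bool:
    """Apply the classic owner/group/other read-bit check to one file or directory."""
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# A root-owned path accessed by a container user with UID/GID 101
print(can_read(0o700, 0, 0, uid=101, gids=(101,)))  # False
print(can_read(0o755, 0, 0, uid=101, gids=(101,)))  # True
```

The inputs map directly to what `stat` on the host reports for the mounted path and what `id` reports inside the container.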

Pitfall 14: Cross‑container networking (bridge vs host)

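Containers on the default bridge network can reach each other only by IP address; there is no automatic name resolution, and --link is deprecated. Use user-defined bridge networks to get name resolution. The host network mode shares the host's network namespace, which removes network isolation and can cause port conflicts.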

Pitfall 15: Multi‑stage builds leaking secrets

Copying build‑time files (e.g., .npmrc, credentials) into the final stage exposes them. Ensure only the final artifact is copied.

# Correct multi‑stage Dockerfile
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
# CGO_ENABLED=0 produces a static binary that runs on musl-based alpine
RUN CGO_ENABLED=0 go build -ldflags="-w -s" -o app .

FROM alpine
COPY --from=builder /app/app /app
CMD ["/app"]

Conclusion

The ten primary and five secondary Docker pitfalls listed above cover the most common failure sources in production environments. By following the diagnostic commands, applying the remediation steps, and adopting the preventive measures, operators can dramatically reduce downtime, avoid data loss, and harden their container infrastructure.
