How Replit Cut REPL Startup Time from 2 Minutes to 15 Seconds by Fixing Docker Shutdown

Replit engineers found that slow Docker container shutdown on preemptible VMs left REPL sessions hanging for up to a minute. By bypassing docker kill and terminating container PIDs directly, they reduced the error rate from roughly 3% to under 0.5% and cut 99th-percentile startup time from two minutes to fifteen seconds.

Open Source Linux

Apr 9, 2021

Replit Architecture

Replit lets users run code from the browser: opening a REPL connects via WebSocket to a Docker container running on a preemptible virtual machine. Each VM runs a container manager (conman) that ensures there is a single container per REPL and handles request routing.

When a host VM shuts down, every container on it must be destroyed before the VM can be reclaimed. Because that teardown was slow, VM shutdowns caused long delays.

Slow Container Shutdown

During the 30‑second window before a preemptible VM is terminated, Docker’s kill operation takes far longer than expected. Killing 100‑200 containers can take over 20 seconds, whereas a single docker kill normally finishes in milliseconds.

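Docker provides two ways to stop a container: docker stop (a graceful shutdown that sends SIGTERM, followed by SIGKILL if the container has not exited after a grace period) and docker kill (SIGKILL by default). In practice, docker kill was not completing instantly, which pointed to delays somewhere other than signal delivery.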
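The difference between the two signals can be seen without Docker at all. The following is a minimal sketch in which an ordinary background process stands in for a container; the names start_worker and signal_worker are made up for this illustration. A process can trap SIGTERM and exit cleanly, but it cannot catch SIGKILL, so the shell reports exit status 137 (128 + signal 9).

```shell
#!/bin/bash
# Illustration only: a plain background process stands in for a container.

# Start a worker that handles SIGTERM by exiting cleanly (status 0).
start_worker() {
  bash -c 'trap "exit 0" TERM; while :; do sleep 0.1; done' &
  WORKER_PID=$!
  sleep 0.5   # give the worker time to install its trap
}

# Send the given signal to the worker and record its exit status.
signal_worker() {
  kill -s "$1" "$WORKER_PID"
  WORKER_STATUS=0
  wait "$WORKER_PID" 2>/dev/null || WORKER_STATUS=$?
}

start_worker
signal_worker TERM
TERM_STATUS=$WORKER_STATUS

start_worker
signal_worker KILL
KILL_STATUS=$WORKER_STATUS

echo "SIGTERM (handler ran): exit status $TERM_STATUS"
echo "SIGKILL (cannot be trapped): exit status $KILL_STATUS"
```

Since SIGKILL leaves the target no chance to delay its own death, a slow docker kill suggests the time is being spent in Docker's bookkeeping rather than in the container.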

To investigate, a script was created to launch 200 containers and measure the time required to kill them.

#!/bin/bash
# Start COUNT nginx containers, then measure how long it takes to kill them all.
COUNT=200
echo "Starting $COUNT containers..."
for i in $(seq 1 "$COUNT"); do
  printf .
  docker run -d --name "test-$i" nginx > /dev/null 2>&1
done
echo -e "\nKilling $COUNT containers..."
# Time a single docker kill invocation covering every test container.
time docker kill $(docker container ls -a --filter "name=test" --format "{{.ID}}") > /dev/null 2>&1
echo -e "\nCleaning up..."
docker rm $(docker container ls -a --filter "name=test" --format "{{.ID}}") > /dev/null 2>&1

On a production GCE n1-highmem-4 instance, the script reported a total kill time of about 38 seconds, confirming the slowdown.
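To put that figure in perspective, the average cost per container can be worked out from the two numbers above. This is a rough sketch that assumes the 38 seconds were spread evenly across all 200 containers, which the measurement itself does not establish; the helper name per_container_ms is made up for this example.

```shell
#!/bin/bash
# Average kill cost per container in milliseconds (integer division).
# Assumes the total time is spread evenly across containers.
per_container_ms() {
  local total_seconds=$1 count=$2
  echo $(( total_seconds * 1000 / count ))
}

# 38 seconds for 200 containers, the figures reported by the test script.
PER_KILL_MS=$(per_container_ms 38 200)
echo "~${PER_KILL_MS} ms per container"
```

That works out to roughly 190 ms per container, well above the milliseconds a single docker kill normally takes, and the 38-second total is longer than the 30-second window a preemptible VM gets before termination.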

Debug logs from the Docker daemon showed that after sending SIGKILL, Docker spent a lot of time releasing network addresses via netlink, which is a serialized operation and becomes a bottleneck when many containers are terminated.

The root cause was identified in the container shutdown path: after sending SIGKILL, Docker waits for the container to exit while holding a lock, and the cleanup routine releases network resources under that lock, causing long pauses.

To avoid the delay, the team bypassed Docker entirely and killed the container’s PID directly. Because the container runs in its own PID namespace, terminating the init process with SIGKILL also terminates all processes inside the namespace without side effects.
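The approach can be sketched in shell as follows. This is not Replit's implementation, which the article does not show; the function names are hypothetical, and the only Docker calls used are the standard docker container ls and docker inspect --format '{{.State.Pid}}', which prints the host PID of a container's init process. Signaling those PIDs normally requires root.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not Replit's code.

# Print the host PID of a container's init process.
container_init_pid() {
  docker inspect --format '{{.State.Pid}}' "$1"
}

# Send SIGKILL to every PID given. Empty, zero, and non-numeric values are
# skipped (Docker reports Pid 0 for a container that is not running).
kill_init_pids() {
  local pid
  for pid in "$@"; do
    case "$pid" in
      ''|0|*[!0-9]*) continue ;;
    esac
    kill -KILL "$pid" 2>/dev/null || true
  done
}

# Kill all running containers whose name matches the filter,
# signaling their init processes directly instead of calling docker kill.
kill_containers_directly() {
  local ids id pids=()
  ids=$(docker container ls --quiet --filter "name=$1")
  for id in $ids; do
    pids+=("$(container_init_pid "$id")")
  done
  kill_init_pids "${pids[@]}"
}
```

With the test containers from the earlier script running, this would be invoked as kill_containers_directly test. Note that the sketch still asks the Docker daemon for each PID; a container manager would more plausibly record the PID when it starts the container, so that shutdown does not depend on the daemon at all, though that is an assumption rather than something the article states.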

After implementing this change, REPL control was released within a few seconds during VM shutdown, reducing the session‑connection error rate from roughly 3% to below 0.5% and cutting the 99th‑percentile startup time from about two minutes to fifteen seconds.

Original article: https://blog.repl.it/killing-containers-at-scale
