
How OpenAI Scaled Kubernetes to 7,500 Nodes: Challenges, Solutions, and Lessons Learned

OpenAI’s engineering team details how they expanded a Kubernetes cluster to 7,500 nodes to support massive models like GPT‑3, CLIP, and DALL·E, describing workload characteristics, networking redesign, API server pressure, monitoring, health checks, resource quotas, and the remaining open problems.

MaGe Linux Operations

Introduction

OpenAI engineers share the challenges and solutions they encountered while scaling a Kubernetes cluster to 7,500 nodes, enabling large‑scale models such as GPT‑3, CLIP, and DALL·E as well as rapid small‑scale research on neural language model scaling laws.

Workload Characteristics

The workloads are large machine-learning jobs that span many nodes. A single pod often occupies an entire node, which lets GPUs communicate directly over NVLink or reach the NIC through GPUDirect. Because nothing else shares the node, NUMA, CPU, and PCIe contention play no role in scheduling. Scheduler load is low in steady state, with occasional spikes when thousands of pods are created simultaneously.

Networking Redesign

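Flannel could not sustain the required throughput, so the team switched to Azure VMSS native pod networking with the corresponding CNI plugins. This gives pods host-level bandwidth and supports roughly 200,000 IP addresses. To tell internal traffic from Internet traffic, the team adds iptables mangle rules that carry only a descriptive comment and no target, so each rule acts as a packet and byte counter without altering traffic. For example: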

iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"

Metrics from these rules are exported via the open‑source iptables-exporter to Prometheus.
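
The counting idea behind these rules can be sketched in a few lines of Python. This is a minimal illustration, not the iptables-exporter's actual implementation: it assumes the input is the text produced by `iptables-save -c -t mangle`, where every rule line starts with `[packets:bytes]`, and the function name `traffic_totals` is hypothetical.

```python
import re
import shlex
from collections import defaultdict

# Assumed input format: `iptables-save -c` prefixes each rule with "[packets:bytes]".
COUNTER_RE = re.compile(r"^\[(\d+):(\d+)\]\s+(.*)$")
COMMENT_PREFIX = "iptables-exporter"  # first word of the comments shown in the article


def traffic_totals(save_output):
    """Sum packet and byte counters per `traffic=` label found in rule comments."""
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for line in save_output.splitlines():
        match = COUNTER_RE.match(line.strip())
        if not match:
            continue  # table headers, chain policies, COMMIT
        packets, byte_count, rule = match.groups()
        tokens = shlex.split(rule)  # keeps the quoted comment as one token
        if "--comment" not in tokens:
            continue
        idx = tokens.index("--comment")
        if idx + 1 >= len(tokens):
            continue
        words = tokens[idx + 1].split()
        if not words or words[0] != COMMENT_PREFIX:
            continue
        # Remaining words of the form key=value are treated as labels.
        labels = dict(w.split("=", 1) for w in words[1:] if "=" in w)
        traffic = labels.get("traffic")
        if traffic is None:
            continue
        totals[traffic]["packets"] += int(packets)
        totals[traffic]["bytes"] += int(byte_count)
    return dict(totals)
```

Given the four rules above, the INPUT and FORWARD counters roll up under `internet-in` and the OUTPUT and FORWARD counters under `internet-out`, which is the split a dashboard would chart.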

API Server and etcd Stress

The API servers and etcd run on dedicated nodes, five of each in the largest cluster. Monitoring relies on kube-prometheus dashboards and custom Grafana panels, with elevated HTTP 429 and 5xx rates serving as early-warning signals. API server memory grows roughly linearly with node count; at 7,500 nodes each API server uses up to 70 GB of heap. Watches on Endpoints were a major source of load: some services include every node as a member and are watched by every node, so each node addition or removal triggers updates whose total volume grows quadratically with cluster size. EndpointSlices, available in Kubernetes 1.17, reduced this load by up to 1,000×.
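
The two signals in this section, error-rate alerts and heap sizing, come down to simple arithmetic. The sketch below assumes request counters are available as per-status-code totals from two scrapes, and it extrapolates heap use from the single 7,500-node, 70 GB data point as if memory were strictly proportional to node count. Both function names are hypothetical, and the proportionality is an assumption rather than a measured model.

```python
def status_ratios(before, after):
    """Share of requests answered with 429 and with 5xx between two counter snapshots.

    `before` and `after` map status-code strings (for example "200", "429") to
    cumulative request counts.
    """
    # Clamp at zero so a counter reset does not produce a negative delta.
    deltas = {code: max(0, count - before.get(code, 0)) for code, count in after.items()}
    total = sum(deltas.values())
    if total == 0:
        return {"429": 0.0, "5xx": 0.0}
    server_errors = sum(n for code, n in deltas.items() if code.startswith("5"))
    return {"429": deltas.get("429", 0) / total, "5xx": server_errors / total}


def estimated_heap_gb(nodes, ref_nodes=7500, ref_heap_gb=70.0):
    """Rough per-API-server heap estimate, assuming strict proportionality to node count."""
    return ref_heap_gb * nodes / ref_nodes
```

An alert would fire when either ratio stays above a chosen threshold for several scrapes; the threshold itself is a tuning decision the article does not specify.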

Time‑Series Metrics, Prometheus, and Grafana

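Collecting metrics at this scale put Prometheus under memory pressure and caused OOM crashes. The root cause was Grafana calling the /api/v1/series endpoint with a query that matched every histogram metric, and the endpoint was unbounded in both time and memory. The team patched Prometheus to enforce a timeout on these queries. Separately, setting GOMAXPROCS=24 improved WAL replay performance, because letting Prometheus use every core on a many-core server caused contention.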
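
The timeout fix follows a general pattern: give an otherwise unbounded scan a deadline and abort once it passes. The Python sketch below illustrates that pattern only. It is not the Prometheus patch (Prometheus is written in Go), and `bounded_series`, `QueryTimeout`, and `is_histogram_bucket` are hypothetical names. The matcher mimics a selector such as `{le!=""}`, which selects every series carrying a histogram bucket label.

```python
import time


class QueryTimeout(Exception):
    """Raised when a series scan exceeds its time budget."""


def bounded_series(series_iter, matcher, timeout_s, clock=time.monotonic):
    """Collect label sets accepted by `matcher`, aborting once `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    matched = []
    for labels in series_iter:
        # Check the deadline on every step so a huge scan cannot run forever.
        if clock() > deadline:
            raise QueryTimeout(f"series scan exceeded {timeout_s}s")
        if matcher(labels):
            matched.append(labels)
    return matched


def is_histogram_bucket(labels):
    """True for any series with a non-empty `le` label, i.e. a histogram bucket."""
    return labels.get("le", "") != ""
```

A deadline bounds time but not memory; a production fix would likely also cap the number of series returned, which this sketch leaves out.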

Health Checks

Passive health checks monitor network reachability, disk health, and GPU errors such as ECC and Xid errors. GPU metrics come from NVIDIA DCGM, for example DCGM_FI_DEV_XID_ERRORS. Active GPU tests run as a pre-flight step: new nodes join the cluster with a pre-flight taint that keeps regular pods off them, a DaemonSet runs the tests on those nodes, and the taint is removed only after the tests pass.
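
A passive check on the Xid metric can be as simple as scanning the exporter's text output for non-zero values. The sketch below assumes the metrics arrive in Prometheus text exposition format and that each sample carries `Hostname` and `gpu` labels. The label names, the sample data in the usage note, and the function name `gpus_with_xid` are illustrative assumptions, and treating any non-zero value as an error is a simplification.

```python
import math
import re

METRIC = "DCGM_FI_DEV_XID_ERRORS"
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')   # name{labels} value
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs


def gpus_with_xid(exposition_text):
    """Return sorted (hostname, gpu, xid) tuples for GPUs reporting a non-zero Xid value."""
    flagged = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group(1) != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        value = float(match.group(3))
        if math.isnan(value) or value == 0:
            continue
        flagged.append((labels.get("Hostname", "unknown"),
                        labels.get("gpu", "unknown"),
                        int(value)))
    return sorted(flagged)
```

For input containing `DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="node-a"} 79`, the function reports `("node-a", "1", 79)`, which a controller could use to cordon the node for inspection.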

Quota, Resource Management, and Scheduling

A custom team-resource-manager service reads team allocations from ConfigMaps and taints the nodes matching each team's node selector with openai.com/team=teamname:NoSchedule, so that only pods tolerating that taint schedule there. CPU and GPU “balloon” Deployments fill otherwise idle nodes with low-priority pods so that the cluster-autoscaler does not scale those nodes down; real workloads preempt the balloon pods as soon as they need the capacity. Pod anti-affinity spreads the balloon pods evenly across nodes, and the Coscheduling scheduler plugin provides gang scheduling for the StatefulSets used in distributed training.
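
The reservation step can be pictured as a small reconciliation loop. The sketch below is a guess at the shape of such a loop, not the team-resource-manager's real code: it assumes each allocation is a tuple of node selector, team name, and node count, and it only plans taints to add (releasing surplus nodes is left out). The `Node` class, the `sku` label in the usage note, and the function names are hypothetical; only the taint format comes from the article.

```python
from dataclasses import dataclass, field

TEAM_TAINT_PREFIX = "openai.com/team="


@dataclass
class Node:
    name: str
    labels: dict
    taints: set = field(default_factory=set)


def team_taint(team):
    """Build the taint string in the format quoted in the article."""
    return f"{TEAM_TAINT_PREFIX}{team}:NoSchedule"


def reconcile(nodes, allocations):
    """Return {node name: taint to add} so each team reaches its allocated node count."""
    plan = {}
    for selector, team, count in allocations:
        taint = team_taint(team)
        # Nodes whose labels satisfy every key/value pair in the selector.
        candidates = [n for n in nodes
                      if all(n.labels.get(k) == v for k, v in selector.items())]
        needed = count - sum(1 for n in candidates if taint in n.taints)
        for node in sorted(candidates, key=lambda n: n.name):
            if needed <= 0:
                break
            # Skip nodes already reserved for a team or claimed earlier in this pass.
            reserved = any(t.startswith(TEAM_TAINT_PREFIX) for t in node.taints)
            if reserved or node.name in plan:
                continue
            plan[node.name] = taint
            needed -= 1
    return plan
```

With four nodes labeled `sku=gpu`, one of them already tainted for team `alpha`, an allocation of two nodes to `alpha` and one to `beta` yields a plan that taints one more node for `alpha` and one for `beta`, leaving the fourth node unreserved.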

Unresolved Issues

OpenAI still faces two open problems. Prometheus's TSDB is slow to compact and takes a long time to replay the WAL after a restart. Pod network traffic also needs shaping so that many pods downloading large datasets at once do not overwhelm external bandwidth.

Conclusion

Kubernetes proves to be a flexible platform capable of supporting OpenAI’s demanding research workloads at massive scale, though continued engineering effort is needed to address remaining scalability bottlenecks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
