Debugging Random OOM Issues in PyTorch Distributed Training on A100 Clusters
The iQIYI backend team traced random OOM crashes in PyTorch Distributed Data Parallel on an A100 cluster to a malformed DDP message injected by a security scan, which forced a near‑terabyte allocation; using jemalloc for diagnostics, they mitigated the issue by adjusting scan policies and collaborating with PyTorch to harden the protocol.
In modern deep‑learning development, complex software systems are assembled like building blocks, which speeds up development but makes fault isolation difficult due to high coupling and system complexity.
Members of iQIYI's backend technology team recorded a concrete case of solving a memory‑related OOM problem during large‑model training on an A100 GPU cluster, aiming to share insights with peers.
Background
During the past quarter, the A100 cluster exhibited random CPU memory OOM events. With the introduction of large‑model training, the OOMs became intolerable, prompting a focused investigation.
Initial log analysis revealed several patterns:
This OOM issue was unique to the A100 cluster.
The problem was tied to PyTorch's Distributed Data Parallel (DDP) mode; other training modes did not show the issue.
The OOM occurrences were highly random—some appeared after three hours, others after more than a week.
When an OOM occurred, memory usage surged from about 10% to 90% within roughly 1.5 minutes (see image).
Because the issue could not be reliably reproduced, early troubleshooting relied on speculative hypotheses, including potential code leaks, allocator fragmentation, hardware faults, or specific software bugs.
Process
Is it a code issue?
To test the code‑leak hypothesis, debugging code was injected into the problematic training runs to periodically print objects that the Python GC could not reclaim. The resulting logs showed no large number of unreachable objects, and continuous GC did not alleviate the OOM. Thus, the code‑leak hypothesis was discarded.
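The periodic GC probe described above can be sketched roughly as follows. This is a hypothetical reconstruction, not iQIYI's actual debugging code; `gc_probe` is an illustrative name. It forces a collection pass and reports what the collector found unreachable and what it could not free:

```python
import gc


def gc_probe():
    """One probe of Python's cyclic garbage collector.

    Returns (unreachable, uncollectable): the number of objects the
    collector found unreachable in this pass, and the number it could
    not free (these accumulate in gc.garbage).
    """
    unreachable = gc.collect()
    return unreachable, len(gc.garbage)


if __name__ == "__main__":
    # Demo: a self-referential list forms a cycle that only the
    # cyclic collector can reclaim.
    cycle = []
    cycle.append(cycle)
    del cycle
    print(gc_probe())
```

In the incident, probes like this showed no buildup of unreachable objects, and forcing collections did not slow the OOM, which ruled out a Python-level leak.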
Is the memory allocator to blame?
The team switched to jemalloc, which offers more efficient allocation and better debugging support than glibc's ptmalloc. Using ctypes, jemalloc's API was exposed in Python, allowing periodic retrieval of the allocator's allocated and mapped metrics. During OOM events these two values were nearly identical, disproving the fragmentation hypothesis: fragmentation would show mapped far exceeding allocated.
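Reading those counters from Python can be sketched as below. This assumes jemalloc is loaded into the process (e.g. via LD_PRELOAD); it goes through jemalloc's `mallctl()` entry point, bumping the `epoch` control first so the cached statistics are refreshed. The helper name `jemalloc_stats` is illustrative:

```python
import ctypes
import ctypes.util


def jemalloc_stats():
    """Return (allocated, mapped) in bytes from jemalloc's stats,
    or (None, None) if the jemalloc library cannot be found.
    """
    path = ctypes.util.find_library("jemalloc")
    if path is None:
        return None, None
    je = ctypes.CDLL(path)
    # jemalloc caches its stats; writing to "epoch" refreshes them.
    epoch = ctypes.c_uint64(1)
    je.mallctl(b"epoch", None, None,
               ctypes.byref(epoch), ctypes.sizeof(epoch))

    def read_counter(name):
        # mallctl(name, oldp, oldlenp, newp, newlen)
        val = ctypes.c_size_t(0)
        sz = ctypes.c_size_t(ctypes.sizeof(val))
        je.mallctl(name, ctypes.byref(val), ctypes.byref(sz), None, 0)
        return val.value

    return read_counter(b"stats.allocated"), read_counter(b"stats.mapped")


if __name__ == "__main__":
    print(jemalloc_stats())
```

Polling this pair periodically gives a quantitative fragmentation signal: `mapped` much larger than `allocated` would indicate the allocator is holding pages it cannot reuse.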
Further log inspection revealed that multiple machines experienced OOM within 1–2 minutes of each other, suggesting a synchronized external factor. Network traffic monitoring with tcpdump captured a security‑scan flow arriving minutes before an OOM event.
Final Root Cause
Collaboration with the security team reproduced the OOM by repeatedly triggering the scan. Deep code analysis pinpointed the culprit in PyTorch's DDP protocol handling. A specific message (QueryType::ADD) from the scan caused the master node to allocate a buffer based on a malformed length field (0xe0060b0000 ≈ 962 GB). PyTorch attempted to allocate roughly 1 TB of memory, leading the allocator to request massive physical pages. Because the GPU cluster lacked huge‑page support, Linux fulfilled the request page by page, causing a rapid memory surge and the eventual OOM.
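The failure mode can be illustrated with a simplified, hypothetical model of a length-prefixed protocol; this is not PyTorch's actual wire format, and `parse_message` is an invented name. A receiver that trusts the length header will eagerly allocate whatever size the bytes happen to decode to:

```python
import struct


def parse_message(raw: bytes) -> bytes:
    """Naive length-prefixed parse: 8-byte little-endian length
    header, then the payload. The length is trusted unvalidated,
    so a malformed header drives an unbounded allocation.
    """
    (length,) = struct.unpack_from("<Q", raw, 0)
    buf = bytearray(length)  # allocates `length` bytes up front
    buf[: len(raw) - 8] = raw[8:]
    return bytes(buf)


if __name__ == "__main__":
    # The malformed length observed in the incident:
    bad_len = 0xE0060B0000
    print(f"{bad_len} bytes ≈ {bad_len / 1e9:.0f} GB")
```

Here the scan's bytes, reinterpreted as a length field, decode to roughly 962 GB, which is why a single stray message could drive the node from 10% to 90% memory usage in under two minutes.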
Solution
Short‑term: Adjust the security‑scan policy to avoid triggering the malformed DDP message.
Long‑term: Work with the PyTorch community to harden the DDP protocol against malformed inputs (see issue #106294).
Summary
The investigation demonstrated the effectiveness of jemalloc for quantitative memory analysis in mixed Python‑C systems and highlighted the limits of Python‑oriented tools like Memray for debugging complex frameworks such as PyTorch DDP. It also underscored the importance of considering external services (e.g., security scans) when diagnosing seemingly internal resource issues.
iQIYI Technical Product Team