How to Train Deeper TensorFlow Models by Optimizing GPU Memory
This article summarizes a NIPS 2017 paper that introduces two GPU memory-optimization techniques, swap-out/in and a memory-efficient attention layer. Both are integrated into TensorFlow and enable significantly larger batch sizes and deeper models without sacrificing accuracy.
At NIPS 2017 (December 4‑9, Long Beach, CA), Alibaba presented two workshop papers and hosted several technical sessions, showcasing its research in machine learning and artificial intelligence.
Paper: Training Deeper Models by GPU Memory Optimization on TensorFlow (authors: Meng Chen, Sun Minmin, Yang Jun, Qiu Minghui, Gu Yang) – https://github.com/LearningSys/nips17/blob/9ee207c054cf109bc4a068b1064b644d75d0381f/assets/papers/paper_18.pdf
Abstract: With the rise of big data, lower GPGPU costs, and advances in neural network modeling, training deep models on GPUs is increasingly popular. However, model complexity and limited GPU memory make training large models difficult. The paper proposes a generic data‑flow‑graph‑based GPU memory‑optimization strategy called “swap‑out/in” that uses host memory as an extended pool, and a specialized memory‑efficient attention layer for Seq2Seq models. Both are seamlessly integrated into TensorFlow without affecting accuracy, achieving 2‑30× larger batch sizes in experiments.
The core challenge is the gap between limited GPU memory (12‑16 GB on high‑end GPUs) and growing model size (e.g., ResNet‑1001, NMT models with many attention layers). The authors analyze GPU memory usage during training, identifying three main components:
Feature maps: Intermediate outputs of each layer; they dominate memory consumption and depend on batch size and model architecture.
Weights: Persistent memory that is only released after training completes.
Temporary memory: Short‑lived allocations for certain algorithms (e.g., FFT‑based convolutions) that are automatically managed by libraries like cuDNN.
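To see why feature maps dominate, a back-of-the-envelope estimate helps. The sketch below assumes a hypothetical stack of identical 3x3 convolution layers with float32 tensors; the layer count, shapes, and helper names are made up for illustration and do not come from the paper.

```python
BYTES_PER_FLOAT32 = 4


def feature_map_bytes(batch_size, height, width, channels, num_layers):
    """Bytes needed to keep every layer's output alive for the backward pass."""
    per_layer = batch_size * height * width * channels * BYTES_PER_FLOAT32
    return per_layer * num_layers


def weight_bytes(kernel, in_channels, out_channels, num_layers):
    """Bytes for the convolution kernels; independent of batch size."""
    per_layer = kernel * kernel * in_channels * out_channels * BYTES_PER_FLOAT32
    return per_layer * num_layers


def to_gib(num_bytes):
    return num_bytes / 2**30


# Hypothetical network: 100 layers, 64 channels, 56x56 feature maps.
LAYERS, H, W, C, K = 100, 56, 56, 64, 3

weights = weight_bytes(K, C, C, LAYERS)
for batch in (16, 64, 256):
    fmaps = feature_map_bytes(batch, H, W, C, LAYERS)
    print(f"batch={batch:4d}  feature maps={to_gib(fmaps):6.2f} GiB  "
          f"weights={to_gib(weights):6.4f} GiB")
```

Under these assumptions the feature-map total grows linearly with batch size while the weight total stays fixed, which is why both of the paper's techniques target feature maps.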
To address the memory bottleneck, the paper introduces two methods focused on feature maps:
Swap‑out/in: Moves feature maps to host memory, effectively expanding the usable memory pool.
Memory‑efficient attention layer: Reduces memory usage for Seq2Seq models with attention mechanisms.
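The effect of swap-out/in on peak GPU memory can be illustrated with a toy simulation. This is a minimal sketch, not the paper's scheduling strategy: it assumes a simple chain of layers, swaps every feature map out as soon as the next layer has consumed it, and ignores transfer time between GPU and host. The function names are hypothetical.

```python
def peak_without_swap(sizes):
    """All feature maps stay on the GPU until backward consumes them."""
    return sum(sizes)


def peak_with_swap(sizes):
    """Peak GPU usage when idle feature maps are parked in host memory."""
    gpu, host = {}, {}  # feature-map index -> size
    peak = 0

    # Forward pass: layer i reads feature map i-1 and writes feature map i.
    for i, size in enumerate(sizes):
        gpu[i] = size
        peak = max(peak, sum(gpu.values()))
        if i > 0:
            host[i - 1] = gpu.pop(i - 1)  # swap-out: idle until backward

    # Backward pass: layer i needs its output (i) and its input (i-1).
    for i in reversed(range(len(sizes))):
        if i > 0 and (i - 1) in host:
            gpu[i - 1] = host.pop(i - 1)  # swap-in just before use
        peak = max(peak, sum(gpu.values()))
        del gpu[i]  # layer i's gradient is done; free its output

    return peak


# Feature-map sizes per layer, in arbitrary units (illustrative values).
sizes = [4, 4, 2, 2, 1]
print("peak without swap:", peak_without_swap(sizes))
print("peak with swap:   ", peak_with_swap(sizes))
```

For these sizes the peak drops from 13 units to 8, because at most two adjacent feature maps are resident at any time. A real implementation also has to overlap the copies with computation so that the transfers do not stall training.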
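Both techniques are integrated into TensorFlow's built-in memory allocator (best-fit with coalescing) and work transparently, without requiring changes to the model architecture. Swap-out/in is generic and applies to any data-flow graph, while the attention layer targets Seq2Seq models.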
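For readers unfamiliar with the term, "best-fit with coalescing" can be sketched in a few lines. The toy class below is not TensorFlow's implementation (its BFC allocator is written in C++ and organizes chunks into size bins); the class and method names are hypothetical and the arena is just a range of integer offsets.

```python
class BestFitArena:
    """Toy best-fit allocator with coalescing over one contiguous arena."""

    def __init__(self, capacity):
        self.free_chunks = [(0, capacity)]  # (offset, size), sorted by offset
        self.allocated = {}                 # offset -> size

    def alloc(self, size):
        # Best fit: the smallest free chunk that is still large enough.
        candidates = [c for c in self.free_chunks if c[1] >= size]
        if not candidates:
            raise MemoryError(f"no free chunk of {size} bytes")
        offset, chunk_size = min(candidates, key=lambda c: c[1])
        self.free_chunks.remove((offset, chunk_size))
        if chunk_size > size:
            # Split: return the unused tail to the free list.
            self.free_chunks.append((offset + size, chunk_size - size))
            self.free_chunks.sort()
        self.allocated[offset] = size
        return offset

    def free(self, offset):
        size = self.allocated.pop(offset)
        self.free_chunks.append((offset, size))
        self.free_chunks.sort()
        # Coalesce: merge free neighbours that touch into one larger chunk.
        merged = [self.free_chunks[0]]
        for off, sz in self.free_chunks[1:]:
            last_off, last_sz = merged[-1]
            if last_off + last_sz == off:
                merged[-1] = (last_off, last_sz + sz)
            else:
                merged.append((off, sz))
        self.free_chunks = merged


arena = BestFitArena(100)
a = arena.alloc(30)
b = arena.alloc(20)
c = arena.alloc(10)
arena.free(a)
arena.free(b)             # a and b are adjacent, so they merge into one chunk
print(arena.free_chunks)  # free regions after coalescing
print(arena.alloc(45))    # fits only because the two freed chunks were merged
```

Coalescing matters here because neither freed chunk alone (30 or 20 units) could satisfy the final 45-unit request. Best-fit selection keeps large chunks intact for large requests, which reduces fragmentation when tensors of very different sizes are allocated and released repeatedly.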
The authors evaluate the methods on a 12 GB GPU. The results show substantial reductions in memory usage: batch sizes can be increased by up to 30×, which makes it possible to train deeper models that previously did not fit in GPU memory.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
