How to Train Deeper TensorFlow Models by Optimizing GPU Memory

This article summarizes a NIPS 2017 paper that introduces two GPU memory-optimization techniques, swap-out/in and a memory-efficient attention layer, both integrated into TensorFlow. They enable significantly larger batch sizes and deeper models without sacrificing accuracy.

Alibaba Cloud Developer

At NIPS 2017 (December 4‑9, Long Beach, CA), Alibaba presented two workshop papers and hosted several technical sessions, showcasing its research in machine learning and artificial intelligence.

Paper: Training Deeper Models by GPU Memory Optimization on TensorFlow (authors: Meng Chen, Sun Minmin, Yang Jun, Qiu Minghui, Gu Yang) – https://github.com/LearningSys/nips17/blob/9ee207c054cf109bc4a068b1064b644d75d0381f/assets/papers/paper_18.pdf

Abstract: With the rise of big data, lower GPGPU costs, and advances in neural network modeling, training deep models on GPUs is increasingly popular. However, model complexity and limited GPU memory make training large models difficult. The paper proposes a generic data‑flow‑graph‑based GPU memory‑optimization strategy called “swap‑out/in” that uses host memory as an extended pool, and a specialized memory‑efficient attention layer for Seq2Seq models. Both are seamlessly integrated into TensorFlow without affecting accuracy, achieving 2‑30× larger batch sizes in experiments.

The core challenge is the gap between limited GPU memory (12‑16 GB on high‑end GPUs) and growing model size (e.g., ResNet‑1001, NMT models with many attention layers). The authors analyze GPU memory usage during training, identifying three main components:

Feature maps: Intermediate outputs of each layer; they dominate memory consumption and depend on batch size and model architecture.

Weights: Persistent memory that is only released after training completes.

Temporary memory: Short‑lived allocations for certain algorithms (e.g., FFT‑based convolutions) that are automatically managed by libraries like cuDNN.
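
To see why feature maps dominate, it helps to compare how each component scales. The sketch below is a back-of-the-envelope estimate, not a measurement from the paper: the layer shape (a 3x3 convolution with 64 input and 64 output channels producing a 56x56 output), the float32 element size, and the function names are all assumptions chosen for illustration.

```python
def feature_map_bytes(batch_size, height, width, channels, bytes_per_element=4):
    """Bytes needed to hold one layer's output (float32 by default)."""
    return batch_size * height * width * channels * bytes_per_element


def weight_bytes(kernel_h, kernel_w, in_channels, out_channels, bytes_per_element=4):
    """Bytes needed for one convolution's weights (bias ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels * bytes_per_element


# Hypothetical layer: 3x3 convolution, 64 -> 64 channels, 56x56 output.
weights = weight_bytes(3, 3, 64, 64)
for batch_size in (1, 32, 256):
    fmap = feature_map_bytes(batch_size, 56, 56, 64)
    print(f"batch={batch_size:4d}  "
          f"feature map={fmap / 2**20:8.2f} MiB  "
          f"weights={weights / 2**20:.2f} MiB")
```

For this assumed layer the weights take 147,456 bytes regardless of batch size, while the feature map takes 802,816 bytes per sample and grows linearly with the batch. That linear growth is why the methods in the paper target feature maps.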

To address the memory bottleneck, the paper introduces two methods focused on feature maps:

Swap‑out/in: Moves feature maps to host memory, effectively expanding the usable memory pool.

Memory‑efficient attention layer: Reduces memory usage for Seq2Seq models with attention mechanisms.
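
The idea behind swap-out/in can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the authors' implementation: the model is a simple chain of layers, each feature map is needed exactly once more during the backward pass, weights, gradients, and transfer time are ignored, and the function name and sizes are hypothetical.

```python
def peak_device_memory(sizes, swap):
    """Simulate a forward then backward pass over a chain of layers.

    sizes[i] is the size of layer i's feature map. Returns the peak
    amount of feature-map memory resident on the device.
    """
    device = {}  # feature maps currently on the GPU
    host = {}    # feature maps swapped out to host memory
    peak = 0

    # Forward pass: layer i produces feature map i from feature map i-1.
    for i, size in enumerate(sizes):
        device[i] = size
        peak = max(peak, sum(device.values()))
        # Feature map i-1 is now idle until the backward pass: swap it out.
        if swap and i > 0:
            host[i - 1] = device.pop(i - 1)

    # Backward pass: layers are visited in reverse order.
    for i in reversed(range(len(sizes))):
        if i in host:
            device[i] = host.pop(i)  # swap-in just before it is needed
        peak = max(peak, sum(device.values()))
        del device[i]                # gradient computed, feature map freed

    return peak


sizes = [4, 4, 4, 4]  # hypothetical feature-map sizes, arbitrary units
print("without swapping:", peak_device_memory(sizes, swap=False))
print("with swapping:   ", peak_device_memory(sizes, swap=True))
```

In this toy model the peak drops from the sum of all feature maps (16 units) to two feature maps at a time (8 units). A real system also has to hide the cost of the host-device copies, which this sketch does not model.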

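Both techniques are integrated into TensorFlow, whose built-in memory allocator uses a best-fit-with-coalescing algorithm, and neither affects model accuracy. Swap-out/in is generic: it operates on the data-flow graph and applies without architectural changes. The memory-efficient attention layer is specific to Seq2Seq models that use attention.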
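
For readers unfamiliar with the term, "best-fit with coalescing" can be sketched in a few lines. The class below is a toy illustration of the general algorithm only; it is not TensorFlow's allocator code, and the class name, method names, and unit sizes are hypothetical.

```python
class BestFitAllocator:
    """Toy best-fit allocator with coalescing over one memory region."""

    def __init__(self, capacity):
        self.free = [(0, capacity)]  # sorted list of (offset, size) free chunks
        self.used = {}               # offset -> size of live allocations

    def allocate(self, size):
        # Best fit: pick the smallest free chunk that is large enough.
        candidates = [chunk for chunk in self.free if chunk[1] >= size]
        if not candidates:
            raise MemoryError(f"no free chunk of {size} units")
        offset, chunk_size = min(candidates, key=lambda chunk: chunk[1])
        self.free.remove((offset, chunk_size))
        if chunk_size > size:
            # Split the chunk and return the unused tail to the free list.
            self.free.append((offset + size, chunk_size - size))
            self.free.sort()
        self.used[offset] = size
        return offset

    def deallocate(self, offset):
        size = self.used.pop(offset)
        self.free.append((offset, size))
        self.free.sort()
        # Coalesce: merge free chunks that are adjacent in memory.
        merged = [self.free[0]]
        for off, sz in self.free[1:]:
            last_off, last_sz = merged[-1]
            if last_off + last_sz == off:
                merged[-1] = (last_off, last_sz + sz)
            else:
                merged.append((off, sz))
        self.free = merged


alloc = BestFitAllocator(100)
a = alloc.allocate(30)
b = alloc.allocate(20)
c = alloc.allocate(40)
alloc.deallocate(a)
alloc.deallocate(b)      # a and b are adjacent, so they merge into one chunk
d = alloc.allocate(45)   # fits only because the two chunks were coalesced
print(d, alloc.free)
```

Without the coalescing step, the 45-unit request would fail even though 60 units are free in total, because no single free chunk would be large enough.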

The authors evaluate both methods on a GPU with 12 GB of memory. The reported results show substantially lower memory usage and batch sizes up to 30× larger, which makes it possible to train deeper models that previously did not fit.
