Is AI Triggering a Global Memory Shortage? Inside the Emerging Memory Supercycle
The article analyzes how generative AI workloads are reshaping the storage market into a multi‑year Memory Supercycle, detailing demand spikes from Mid‑Training checkpoints, synthetic data, KV‑Cache offload and multimodal video models, while supply is strained by HBM production and geopolitical factors.
Global Memory Shortage and the Rise of a Memory Supercycle
In recent months the storage market has shifted from the familiar inventory‑driven boom‑bust cycles to a structural shortage driven by generative AI workloads. This "Memory Supercycle" is expected to span 2024‑2027, with AI demand outpacing DRAM, NAND and HBM capacity, forcing manufacturers to move from quarterly pricing to daily pricing.
Demand Engine: Generative AI Trends
Mid-Training Checkpoints: The emergence of Mid-Training, which combines reinforcement learning with synthetic data, creates massive checkpoint storage needs: a single checkpoint (optimizer state plus model parameters) can reach ~13 TB, and daily checkpoint volume approaches 2 PB.
Synthetic Data (Warm Data): Front‑line labs treat large‑scale synthetic data as a strategic asset, driving demand for cheap, high‑capacity QLC SSDs.
KV‑Cache Explosion: Inference with long contexts causes KV‑Cache sizes to grow to hundreds of GB or even TB per user session, far exceeding GPU HBM capacity.
Multimodal Video Models: Models such as Sora and Veo 3 introduce bandwidth‑intensive Spacetime Patches, requiring TB‑scale data streams between SSDs and GPUs.
Supply Constraints: The HBM “Crowding-Out” Effect
GPU manufacturers' push for higher HBM capacity consumes roughly three times the wafer area of standard DRAM for the same bit output, diverting capital and fab capacity away from DDR5 and NAND production. SK Hynix, Samsung and Micron have all signalled that their DRAM, NAND and HBM capacity through 2026 is effectively sold out, leaving DRAM inventories at roughly two weeks of supply.
Quantifying the Market Impact
Gartner forecasts a 46% rise in global data-center spending by 2025, reaching roughly $490 billion, with AI servers absorbing around 40% of DRAM and high-capacity SSD output. Seagate estimates that generative AI will generate up to 100 ZB of data over the next four years.
Training‑Side Storage Demands
Scaling laws remain in force: DeepMind's Chinchilla work indicates that compute-optimal training requires model parameters and training tokens to grow roughly in proportion. This pushes datasets toward the PB-to-EB scale, reinforcing the need for cheap, massive storage.
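A back-of-the-envelope sketch of that proportionality, assuming the commonly cited ~20 tokens-per-parameter compute-optimal ratio and roughly 4 bytes of raw text per token (both constants are illustrative, not figures from the article):

```python
# Back-of-the-envelope Chinchilla-style sizing (all constants are illustrative).
TOKENS_PER_PARAM = 20   # commonly cited compute-optimal ratio of ~20 tokens per parameter
BYTES_PER_TOKEN = 4     # rough average for raw UTF-8 text; varies widely by corpus

def chinchilla_dataset(params: float) -> tuple[float, float]:
    """Return (compute-optimal training tokens, raw text bytes) for a model size."""
    tokens = params * TOKENS_PER_PARAM
    return tokens, tokens * BYTES_PER_TOKEN

for params in (70e9, 400e9, 1e12):
    tokens, size_bytes = chinchilla_dataset(params)
    print(f"{params / 1e9:6.0f}B params -> {tokens / 1e12:5.1f}T tokens, "
          f"~{size_bytes / 1e12:4.0f} TB of raw text")
```

Even the text-only footprint reaches tens of TB at the trillion-parameter scale; once multimodal raw data, multiple curated corpus versions and retained synthetic generations sit alongside it, total training storage climbs toward the PB-to-EB range.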
Mid‑Training, positioned between Pre‑Training and Post‑Training, is an I/O‑intensive phase that iteratively refines models using large‑scale synthetic data, creating exponential storage demand.
Checkpointing: Large models generate PB‑level checkpoint data daily; a single checkpoint can be ~13 TB, with frequent checkpointing (e.g., every 10 minutes) leading to ~2 PB per day.
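A minimal sketch of where figures like these come from, assuming mixed-precision training with Adam (about 16 bytes per parameter: fp16 weights plus an fp32 master copy and two fp32 optimizer moments) and a hypothetical ~800B-parameter model:

```python
# Rough checkpoint-volume estimate (parameter count and cadence are illustrative).
PARAMS = 800e9                # hypothetical ~800B-parameter model
BYTES_PER_PARAM = 16          # fp16 weights + fp32 master copy + two fp32 Adam moments (approx.)
CHECKPOINT_EVERY_MIN = 10     # checkpoint cadence used in the article's example

checkpoint_tb = PARAMS * BYTES_PER_PARAM / 1e12
checkpoints_per_day = 24 * 60 / CHECKPOINT_EVERY_MIN
daily_pb = checkpoint_tb * checkpoints_per_day / 1e3

print(f"single checkpoint ~ {checkpoint_tb:.0f} TB")                         # ~13 TB
print(f"{checkpoints_per_day:.0f} checkpoints/day ~ {daily_pb:.1f} PB/day")  # ~1.8 PB
```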
Tensor Offloading: Techniques such as DeepSpeed ZeRO-Infinity and Nvidia TeraIO move inactive tensors out of GPU HBM to CPU DRAM or NVMe SSDs. Doing so without stalling training demands sub-millisecond latency, PCIe 5.0/6.0-class links (roughly 128 GB/s and 256 GB/s of bidirectional x16 bandwidth, respectively) and GPUDirect Storage to bypass the CPU bounce buffer.
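As a concrete illustration of the ZeRO-Infinity path, NVMe offload is enabled through the DeepSpeed configuration; the sketch below shows the relevant keys, with paths, buffer sizes and queue depths as placeholders rather than tuned recommendations:

```python
# Sketch of a DeepSpeed ZeRO-Infinity config that offloads parameters and optimizer
# state to NVMe; paths and I/O tuning values are placeholders, not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
    },
    # asynchronous I/O settings used by DeepSpeed's NVMe swap engine
    "aio": {"block_size": 1048576, "queue_depth": 16, "thread_count": 2},
}
```

In practice this dict (or its JSON equivalent) is handed to deepspeed.initialize together with the model and optimizer.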
Inference‑Side Storage Demands
KV-Cache storage dominates inference cost; its size scales with batch size, sequence length and model dimension, reaching hundreds of GB or even TB per user session for million-token contexts. Offloading the KV-Cache to high-performance NVMe SSDs cuts hardware cost, at the price of microsecond-level access latency instead of HBM's nanoseconds.
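A minimal sizing sketch, assuming dimensions loosely typical of a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16); none of these numbers describe a specific product:

```python
# KV-Cache size per session (model dimensions are illustrative, fp16 values).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    # 2x accounts for the separate K and V tensors cached per layer
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes * batch

per_token = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=1)
print(f"{per_token / 1024:.0f} KiB per token")                                          # ~320 KiB
print(f"{kv_cache_bytes(80, 8, 128, 1_000_000) / 1e9:.0f} GB for a 1M-token context")   # ~328 GB
```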
Algorithmic mitigation strategies include the following (a sizing sketch follows the list):
Sharing (GQA/MQA): Reducing the number of KV heads (e.g., an 8:1 GQA grouping) shrinks the KV-Cache proportionally.
Compression (MLA/Quantization): Compressing KV vectors can lower storage needs by 2‑10×.
Eviction (Sliding Window, Sink Attention): Dropping older context for streaming applications keeps KV‑Cache size constant.
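Applying the same formula to the 1M-token example, the first two levers compose multiplicatively; the 64-head baseline, 8:1 GQA grouping and int8 quantization below are illustrative choices:

```python
# How the levers shrink the 1M-token example (formula as above; numbers illustrative).
def kv_gb(layers, kv_heads, head_dim, seq_len, dtype_bytes):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

print(f"MHA fp16: {kv_gb(80, 64, 128, 1_000_000, 2):.0f} GB")  # full multi-head baseline
print(f"GQA fp16: {kv_gb(80,  8, 128, 1_000_000, 2):.0f} GB")  # 8:1 grouped-query attention
print(f"GQA int8: {kv_gb(80,  8, 128, 1_000_000, 1):.0f} GB")  # plus int8 KV quantization
# Sliding-window eviction then caps seq_len at the window size, holding the figure constant.
```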
Multimodal Storage Challenges
Video generation models (Sora, Veo 3) process MB‑scale Spacetime Patches per timestep, shifting the bottleneck from context memory to bandwidth. Their DiT architecture streams TB‑scale latent patches between SSDs and GPU HBM, demanding storage bandwidth comparable to HBM.
Manufacturers respond with three AI‑focused NAND families:
AIN D (Density): High‑density QLC NAND for PB‑scale “warm” storage.
AIN P (Performance): Optimized controllers delivering 50 M‑100 M random IOPS for KV‑Cache offload.
AIN B (Bandwidth): High‑Bandwidth Flash (HBF) offering near‑HBM TB/s throughput for multimodal inference.
Supply‑Side Outlook and Mitigations
3D NAND stacking (300‑plus layers now, 1000‑layer targets by 2030) is the primary cost‑density lever. SK Hynix’s 321‑layer QLC, Samsung’s planned 280‑layer V‑NAND, and Kioxia’s 218‑layer BiCS illustrate the “stacking race.”
Emerging interconnects like CXL (Compute Express Link) enable unified memory pools across CPU, GPU, HBF and SSD, facilitating offloading and memory expansion.
Strategic Implications
Companies with balanced AI‑NAND portfolios (density, performance, bandwidth) and early HBF deployments are positioned to dominate the market. Geopolitical concentration of NAND fab capacity in South Korea, Japan and the US, with Taiwan’s packaging expertise, adds supply‑chain fragility.
In summary, AI is transforming storage from a peripheral component to a primary performance bottleneck, ushering in a multi‑year Memory Supercycle that will reshape hardware value chains and define the next era of computing.