Tagged articles

GPU memory

28 articles

Jun 28, 2026 · Operations

Why Large‑Model Services Keep Running Out of GPU Memory: An Ops View from KV Cache to Concurrency

The article explains why large‑model inference services frequently hit GPU memory limits, breaks down static vs. dynamic memory consumption, shows how KV‑Cache, request length, and concurrency amplify usage, and provides a step‑by‑step troubleshooting and mitigation workflow for production environments.

GPU memory · Inference Optimization · KV cache

26 min read

AI Engineering

Jun 28, 2026 · Artificial Intelligence

Why Does KV‑Cache Evict 90% of Tokens Without Reducing GPU Memory in LLM Inference?

Although a KV‑cache eviction strategy can discard 90% of tokens, GPU memory usage stays almost unchanged because paged‑attention memory blocks remain occupied and fast attention kernels discard the full score matrix, preventing effective memory release.

FlashAttention · GPU memory · KV cache

7 min read

DeepHub IMBA

Jun 23, 2026 · Artificial Intelligence

Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism

Training 100‑billion‑parameter Transformers is limited by GPU memory rather than compute, requiring a mix of tensor parallelism within nodes and data parallelism across nodes, along with pipeline parallelism, gradient accumulation, and careful framework choices to balance memory, bandwidth, and compute overheads.

GPU memory · Large Language Models · data parallelism

14 min read

MaGe Linux Operations

Jun 20, 2026 · Artificial Intelligence

LoRA vs QLoRA vs Full Fine‑Tuning: Which Method Wins for Large‑Model Adaptation?

This article provides a practical, data‑driven comparison of Full Fine‑Tuning, LoRA, and QLoRA for adapting 7B‑70B open‑source LLMs, detailing memory requirements, training speed, cost, performance trade‑offs, step‑by‑step workflows, code examples, evaluation metrics, common pitfalls, and optimization tips to help engineers choose the most suitable fine‑tuning approach for their data and budget.

Full Fine-tuning · GPU memory · Large Language Models

24 min read

DeepHub IMBA

Jun 7, 2026 · Artificial Intelligence

PyTorch GPU Memory Profiling: Checkpointing, Mixed Precision, Optimizer Choice

The article explains the seven sources of GPU memory usage during PyTorch training, shows how to measure them with built‑in profiling APIs and the memory‑viz tool, and evaluates three effective optimizations—gradient checkpointing, mixed‑precision training, and optimizer selection—detailing their memory savings and performance costs.

GPU memory · PyTorch · gradient checkpointing

8 min read

Raymond Ops

Apr 27, 2026 · Artificial Intelligence

vLLM Production Pitfalls: The Ultimate Fix for PagedAttention Memory Fragmentation and OOM

This article analyzes why vLLM's PagedAttention can cause GPU memory fragmentation and out‑of‑memory errors in production, presents four typical OOM scenarios, and provides concrete diagnostics, configuration tweaks, code examples, and monitoring strategies to eliminate the problem.

CUDA · GPU memory · LLM serving

22 min read

Old Zhang's AI Learning

Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4’s local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make high‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek-V4 · GPU memory · KV cache

15 min read

Qborfy AI

Mar 24, 2026 · Artificial Intelligence

Why Full Fine‑Tuning Beats LoRA: When and How to Update Every Model Parameter

This article explains full fine‑tuning (updating all parameters of a pretrained model to achieve the highest task performance), compares it with LoRA and prompt tuning, shows when it is appropriate, and provides a step‑by‑step Hugging Face implementation along with memory‑saving tricks, common pitfalls, and practical takeaways.

Deep Learning · DeepSpeed · Full Fine-tuning

9 min read

MaGe Linux Operations

Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services, provides a step‑by‑step diagnosis workflow (from static usage and KV‑Cache analysis to concurrency and K8s scheduling), and offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU memory · KV cache · LLM OOM

28 min read

Network Intelligence Research Center (NIRC)

Jan 31, 2026 · Artificial Intelligence

How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

The article dissects DeepSeek’s new Engram architecture, which separates computation from memory by using a large, cheap‑RAM‑based lookup table to store factual knowledge, allowing the transformer’s compute layers to focus on reasoning, dramatically reducing GPU memory demand while improving code, math, and long‑context performance.

Engram · GPU memory · Large Language Model

7 min read

Baidu Geek Talk

Dec 10, 2025 · Artificial Intelligence

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.

Cache offload · GPU memory · LLM Inference

20 min read

Baobao Algorithm Notes

Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU memory · KV cache · LLM

18 min read

Fun with Large Models

Aug 29, 2025 · Artificial Intelligence

How to Estimate Hardware Costs for Large-Model Fine-Tuning and Training (Interview Classic #1)

The article explains how to estimate GPU memory and overall hardware requirements for fine-tuning and training large dense and MoE models, detailing calculations for full-parameter and LoRA approaches, scaling rules, and hidden costs relevant to interview assessments.

GPU memory · LoRA · Mixture of Experts

8 min read

AI Algorithm Path

Jul 13, 2025 · Artificial Intelligence

How to Calculate the Right AI Model Size for Your PC (3B, 7B, 13B)

This article explains how to estimate the GPU memory required for running large language models of 3 B, 7 B, and 13 B parameters, walks through step‑by‑step calculations, shows how hardware limits affect feasibility, and offers practical optimization techniques such as quantization and CPU offloading.

AI model sizing · CPU offloading · FP16

5 min read

Architect

May 18, 2025 · Artificial Intelligence

How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting

This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.

AI training · GPU memory · Memory optimization

14 min read

Alibaba Cloud Infrastructure

May 1, 2025 · Artificial Intelligence

Fine-grained Profiling of Online AI Workloads on Kubernetes Using ACK AI Profiling

This article demonstrates how to use ACK AI Profiling, built on eBPF and dynamic process injection, to perform non-intrusive, low‑overhead profiling of Kubernetes‑deployed large‑language‑model inference services, identify GPU memory growth causes, and apply optimization recommendations to prevent OOM issues.

AI profiling · GPU memory · Kubernetes

10 min read

Alibaba Cloud Developer

Apr 7, 2025 · Artificial Intelligence

Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After deploying the full‑precision DeepSeek‑R1 model on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage continuously rising without release; this article details the investigation, reproduces the behavior, examines vLLM logs, Prometheus metrics, and reveals PyTorch’s caching allocator as the root cause, offering mitigation tips.

DeepSeek · GPU memory · Memory Cache

21 min read

AI Algorithm Path

Mar 16, 2025 · Artificial Intelligence

How to Train PyTorch Models Using Far Less GPU Memory

This article walks through a suite of PyTorch techniques—including automatic mixed precision, BF16, gradient checkpointing, gradient accumulation, tensor sharding, efficient data loading, in‑place ops, lightweight optimizers, memory profiling, TorchScript, and kernel fusion—that together can cut peak GPU memory usage by up to twenty‑fold while preserving model accuracy.

GPU memory · PyTorch · data loading

13 min read

AI Algorithm Path

Mar 10, 2025 · Artificial Intelligence

How Much GPU Memory Does an LLM Service Really Need?

This article explains a simple formula for estimating the GPU VRAM required to serve large language models, demonstrates the calculation with a 7‑billion‑parameter example, clarifies why a 20% safety buffer is needed, and offers practical strategies such as quantization, CPU offload, and multi‑GPU parallelism to reduce memory usage.

Deployment · GPU memory · LLM

6 min read

Infra Learning Club

Feb 12, 2025 · Fundamentals

Why Does Nvidia Report Less GPU Memory Than Specified?

The article investigates why Nvidia L40S and RTX A6000 GPUs report less memory in nvidia-smi than their advertised 48 GB, shows that enabling ECC reserves a few gigabytes, and demonstrates the effect by toggling ECC on a Tesla T4 card.

ECC · GPU memory · L40S

4 min read

Architects' Tech Alliance

Oct 17, 2024 · Industry Insights

GDDR vs HBM: Choosing the Right GPU Memory in 2024

This article explains the technical differences between GDDR and HBM GPU memory, compares their bandwidth, cost, and use‑case scenarios, and helps engineers decide which memory type best fits their performance and efficiency requirements.

GDDR · GPU memory · Graphics

8 min read

DaTaobao Tech

Aug 21, 2024 · Artificial Intelligence

Mastering Custom Large‑Model Training: Data Strategies, LoRA Tricks, and Resource Planning

This article provides a comprehensive, step‑by‑step guide to training customized large language models, covering industry‑specific needs, data privacy, meticulous data cleaning, optimal data‑ratio balancing, token budgeting, GPU memory accounting, LoRA fine‑tuning techniques, and practical evaluation metrics for robust AI deployment.

AI training · Data preprocessing · GPU memory

23 min read

Rare Earth Juejin Tech Community

May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

GPU memory · Pipeline Parallel · ZeRO

9 min read

Baobao Algorithm Notes

Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU memory · LLM Inference · PagedAttention

25 min read

NewBeeNLP

Feb 5, 2024 · Artificial Intelligence

How HiFT Slashes GPU Memory for LLM Fine‑Tuning with Hierarchical Optimization

HiFT introduces a layer‑wise hierarchical fine‑tuning strategy that freezes most parameters per step, reduces optimizer state memory, and adapts mixed‑precision training, enabling 7B and 13B models to be fine‑tuned on 16‑31 GB GPUs while maintaining competitive performance.

GPU memory · HiFT · LLM fine-tuning

12 min read

ByteDance Cloud Native

Jun 13, 2023 · Artificial Intelligence

How Ray and Cloud‑Native Tech Supercharge Large‑Model Offline Inference

This article explains the challenges of large‑model offline (batch) inference, such as GPU memory limits and distributed scheduling, and shows how Ray’s cloud‑native architecture, model partitioning, and Ray Datasets can be used to build efficient, elastic inference frameworks deployed with KubeRay.

Distributed Computing · GPU memory · Ray

18 min read

DataFunSummit

Apr 11, 2023 · Artificial Intelligence

OneFlow Coop: Joint Optimization of Dynamic‑Graph Recomputation and Memory Allocation

This article introduces OneFlow Coop, a memory‑optimization technique that jointly optimizes dynamic‑graph recomputation strategies and GPU memory allocation by analyzing existing DTR limitations, proposing recomputable in‑place, op‑guided tensor allocation, and layout‑aware eviction modules, and demonstrating superior experimental results.

Deep Learning · Dynamic Graph · GPU memory

18 min read

Tencent TDS Service

Aug 20, 2015 · Mobile Development

Unlock Android GPU Memory: Master startTrimMemory to Reduce App Kills

Android apps often get killed due to high memory usage, especially from GPU caches; this article explains the Android drawing system architecture, how bitmap rendering creates GPU memory leaks, and demonstrates using WindowManagerGlobal.startTrimMemory to clear those caches while outlining common pitfalls and best practices.

Android · GPU memory · Mobile Development

12 min read