Tagged articles
23 articles
Old Zhang's AI Learning
Apr 26, 2026 · Artificial Intelligence

Why Deploying DeepSeek‑V4 Locally with vLLM Is So Challenging

The article dissects DeepSeek‑V4’s local deployment using vLLM, explaining the steep hardware requirements, the complex heterogeneous KV‑cache architecture, and the aggressive kernel‑fusion and multi‑stream optimizations that together make high‑context inference both memory‑intensive and engineering‑heavy.

DeepSeek-V4 · GPU Memory · KV cache
0 likes · 15 min read
Qborfy AI
Mar 24, 2026 · Artificial Intelligence

Why Full Fine‑Tuning Beats LoRA: When and How to Update Every Model Parameter

This article explains full fine‑tuning, which updates all parameters of a pretrained model to maximize task performance. It compares the approach with LoRA and prompt tuning, shows when it is appropriate, and provides a step‑by‑step Hugging Face implementation along with memory‑saving tricks, common pitfalls, and practical takeaways.

Deep Learning · DeepSpeed · GPU Memory
0 likes · 9 min read
MaGe Linux Operations
Mar 10, 2026 · Artificial Intelligence

Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues

This guide explains the five common sources of GPU memory consumption in large‑model inference services and provides a step‑by‑step diagnosis workflow, from static usage and KV‑Cache analysis to concurrency and K8s scheduling. It also offers concrete command‑line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.

GPU Memory · KV cache · LLM OOM
0 likes · 28 min read
Network Intelligence Research Center (NIRC)
Jan 31, 2026 · Artificial Intelligence

How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

The article dissects DeepSeek’s new Engram architecture, which separates computation from memory by using a large, cheap‑RAM‑based lookup table to store factual knowledge, allowing the transformer’s compute layers to focus on reasoning, dramatically reducing GPU memory demand while improving code, math, and long‑context performance.

Engram · GPU Memory · Memory-Compute Architecture
0 likes · 7 min read
Baidu Geek Talk
Dec 10, 2025 · Artificial Intelligence

How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.

Cache offload · GPU Memory · LLM inference
0 likes · 20 min read
Baobao Algorithm Notes
Sep 28, 2025 · Artificial Intelligence

How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.

GPU Memory · KV cache · LLM
0 likes · 18 min read
AI Algorithm Path
Jul 13, 2025 · Artificial Intelligence

How to Calculate the Right AI Model Size for Your PC (3B, 7B, 13B)

This article explains how to estimate the GPU memory required for running large language models of 3 B, 7 B, and 13 B parameters, walks through step‑by‑step calculations, shows how hardware limits affect feasibility, and offers practical optimization techniques such as quantization and CPU offloading.

AI model sizing · CPU offloading · FP16
0 likes · 5 min read
Architect
May 18, 2025 · Artificial Intelligence

How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting

This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.

AI training · GPU Memory · Memory Optimization
0 likes · 14 min read
Alibaba Cloud Infrastructure
May 1, 2025 · Artificial Intelligence

Fine-grained Profiling of Online AI Workloads on Kubernetes Using ACK AI Profiling

This article demonstrates how to use ACK AI Profiling, built on eBPF and dynamic process injection, to perform non-intrusive, low‑overhead profiling of Kubernetes‑deployed large‑language‑model inference services, identify GPU memory growth causes, and apply optimization recommendations to prevent OOM issues.

AI profiling · GPU Memory · Kubernetes
0 likes · 10 min read
Alibaba Cloud Developer
Apr 7, 2025 · Artificial Intelligence

Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After the full‑precision DeepSeek‑R1 model was deployed on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage rising continuously without being released. This article details the investigation: it reproduces the behavior, examines vLLM logs and Prometheus metrics, identifies PyTorch’s caching allocator as the root cause, and offers mitigation tips.

DeepSeek · GPU Memory · Memory Cache
0 likes · 21 min read
AI Algorithm Path
Mar 16, 2025 · Artificial Intelligence

How to Train PyTorch Models Using Far Less GPU Memory

This article walks through a suite of PyTorch techniques—including automatic mixed precision, BF16, gradient checkpointing, gradient accumulation, tensor sharding, efficient data loading, in‑place ops, lightweight optimizers, memory profiling, TorchScript, and kernel fusion—that together can cut peak GPU memory usage by up to twenty‑fold while preserving model accuracy.

GPU Memory · PyTorch · data loading
0 likes · 13 min read
AI Algorithm Path
Mar 10, 2025 · Artificial Intelligence

How Much GPU Memory Does an LLM Service Really Need?

This article explains a simple formula for estimating the GPU VRAM required to serve large language models, demonstrates the calculation with a 7‑billion‑parameter example, clarifies why a 20% safety buffer is needed, and offers practical strategies such as quantization, CPU offload, and multi‑GPU parallelism to reduce memory usage.

Deployment · GPU Memory · LLM
0 likes · 6 min read
Infra Learning Club
Feb 12, 2025 · Fundamentals

Why Does Nvidia Report Less GPU Memory Than Specified?

The article investigates why Nvidia L40S and RTX A6000 GPUs report less memory in nvidia‑smi than their advertised 48 GB. It finds that enabling ECC reserves a few gigabytes of memory, and demonstrates the effect by toggling ECC on a Tesla T4 card.

ECC · GPU Memory · L40S
0 likes · 4 min read
Architects' Tech Alliance
Oct 17, 2024 · Industry Insights

GDDR vs HBM: Choosing the Right GPU Memory in 2024

This article explains the technical differences between GDDR and HBM GPU memory, compares their bandwidth, cost, and use‑case scenarios, and helps engineers decide which memory type best fits their performance and efficiency requirements.

GDDR · GPU Memory · Graphics
0 likes · 8 min read
DaTaobao Tech
Aug 21, 2024 · Artificial Intelligence

Mastering Custom Large‑Model Training: Data Strategies, LoRA Tricks, and Resource Planning

This article provides a comprehensive, step‑by‑step guide to training customized large language models, covering industry‑specific needs, data privacy, meticulous data cleaning, optimal data‑ratio balancing, token budgeting, GPU memory accounting, LoRA fine‑tuning techniques, and practical evaluation metrics for robust AI deployment.

AI training · Fine-tuning · GPU Memory
0 likes · 23 min read
Rare Earth Juejin Tech Community
May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

Distributed Training · GPU Memory · Pipeline Parallel
0 likes · 9 min read
Baobao Algorithm Notes
Apr 5, 2024 · Artificial Intelligence

How vLLM’s PagedAttention Revolutionizes GPU Memory Management for LLM Inference

This article explains how vLLM’s PagedAttention, inspired by operating‑system virtual‑memory paging, dynamically allocates KV‑cache memory to dramatically reduce GPU memory fragmentation, improve throughput, and handle scheduling, preemption, and distributed inference for large language models.

GPU Memory · LLM inference · PagedAttention
0 likes · 25 min read
NewBeeNLP
Feb 5, 2024 · Artificial Intelligence

How HiFT Slashes GPU Memory for LLM Fine‑Tuning with Hierarchical Optimization

HiFT introduces a layer‑wise hierarchical fine‑tuning strategy that freezes most parameters per step, reduces optimizer state memory, and adapts mixed‑precision training, enabling 7B and 13B models to be fine‑tuned on 16‑31 GB GPUs while maintaining competitive performance.

GPU Memory · HiFT · LLM fine-tuning
0 likes · 12 min read
ByteDance Cloud Native
Jun 13, 2023 · Artificial Intelligence

How Ray and Cloud‑Native Tech Supercharge Large‑Model Offline Inference

This article explains the challenges of large‑model offline (batch) inference, such as GPU memory limits and distributed scheduling, and shows how Ray’s cloud‑native architecture, model partitioning, and Ray Datasets can be used to build efficient, elastic inference frameworks deployed with KubeRay.

GPU Memory · Large Model · Ray
0 likes · 18 min read
DataFunSummit
Apr 11, 2023 · Artificial Intelligence

OneFlow Coop: Joint Optimization of Dynamic‑Graph Recomputation and Memory Allocation

This article introduces OneFlow Coop, a memory‑optimization technique that jointly optimizes dynamic‑graph recomputation strategies and GPU memory allocation. It analyzes the limitations of existing DTR, proposes three modules (recomputable in‑place, op‑guided tensor allocation, and layout‑aware eviction), and reports superior experimental results.

Deep Learning · Dynamic Graph · GPU Memory
0 likes · 18 min read
Tencent TDS Service
Aug 20, 2015 · Mobile Development

Unlock Android GPU Memory: Master startTrimMemory to Reduce App Kills

Android apps often get killed due to high memory usage, especially from GPU caches; this article explains the Android drawing system architecture, how bitmap rendering creates GPU memory leaks, and demonstrates using WindowManagerGlobal.startTrimMemory to clear those caches while outlining common pitfalls and best practices.

Android · GPU Memory · Mobile Development
0 likes · 12 min read