How Efficient Is DeepSeek V3? Calculating Its MFU Around 37%

This article derives DeepSeek V3's training Model FLOPs Utilization (MFU) from publicly available data, arriving at roughly 37% (about a 60% improvement over V2), and provides detailed formulas, parameter settings, and a reproducible Python script.

AI Performance · DeepSeek · MFU

0 likes · 8 min read

Linux Kernel Journey

Dec 22, 2024 · Artificial Intelligence

Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios

This article systematically reviews GPU monitoring for large‑scale AI training, covering MFU/HFU definitions, key DCGM metrics, NVLink bandwidth, common failure codes such as Xid and SXid, experimental insights on T4 and H100 GPUs, and practical case studies for diagnosing and mitigating performance drops.

DCGM · GPU failures · GPU monitoring

0 likes · 26 min read

Baidu Tech Salon

May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM

Baidu Baige’s AIAK‑LLM suite accelerates large‑model training and inference by boosting Model FLOPs Utilization (MFU) through techniques such as TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, achieving a speedup of over 60% along with seamless Hugging Face integration.

AI Infrastructure · AIAK-LLM · Baidu Baige

0 likes · 26 min read

Baidu Geek Talk

May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations

The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by boosting Model FLOPs Utilization (MFU) through advanced parallelism, memory‑saving recompute, zero‑offload, adaptive scheduling, and cross‑chip orchestration, delivering 30‑60% training and inference speedups and a unified cloud product.

AI Infrastructure · Baidu · Inference Optimization

0 likes · 25 min read

Baidu Intelligent Cloud Tech Hub

May 15, 2024 · Artificial Intelligence

How Baidu’s AIAK‑LLM Supercharges Large‑Model Training and Inference

The article explains the scaling challenges of ever‑larger LLMs, introduces the MFU performance metric, surveys industry parallelism and memory‑saving techniques, and details Baidu’s AIAK‑LLM suite—including resource, component and acceleration layers—as well as concrete training and inference optimizations that raise MFU by 30‑60% and cut deployment costs.

AI Infrastructure · Large Model · MFU

0 likes · 25 min read

Baidu Intelligent Cloud Tech Hub

Apr 24, 2024 · Artificial Intelligence

How to Build and Accelerate Multi‑Chip AI Clusters for Large‑Model Training

With AI training demands outgrowing single‑chip GPU clusters, this article explains how to construct and speed up heterogeneous AI clusters—combining GPUs, Kunlun, and Ascend chips—by addressing interconnect, distributed parallel strategies, and specialized acceleration suites to achieve high MFU and efficient large‑model training.

AI clustering · Distributed Training · GPU Acceleration

0 likes · 15 min read
