Tagged articles
22 articles
Page 1 of 1
Machine Heart
Machine Heart
May 5, 2026 · Artificial Intelligence

Musk’s 550K Nvidia GPUs Achieve Only 11% Utilization – Like Running 60K GPUs

xAI’s massive fleet of roughly 550,000 Nvidia H100 and H200 GPUs in its Memphis and Colossus data centers is operating at a mere 11% model FLOPs utilization, highlighting how scaling to hundreds of thousands of GPUs creates coordination, network, and scheduling bottlenecks that waste most of the hardware’s compute power.

AI InfrastructureGPU utilizationNvidia H100
0 likes · 5 min read
Musk’s 550K Nvidia GPUs Achieve Only 11% Utilization – Like Running 60K GPUs
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Apr 28, 2026 · Artificial Intelligence

Why DeepSeek V4 Insists on Batch Invariance—and What It Costs

DeepSeek V4 achieves ultra‑long context, complex training pipelines, and custom high‑performance kernels by enforcing batch invariance, a design that guarantees bit‑wise identical outputs across varying batch shapes but incurs lower GPU utilization, reduced small‑batch speed, and added engineering complexity.

DeepSeek-V4GPU utilizationLLM engineering
0 likes · 8 min read
Why DeepSeek V4 Insists on Batch Invariance—and What It Costs
Code Mala Tang
Code Mala Tang
Mar 27, 2026 · Industry Insights

Why the Real GPU Shortage Is About Low Utilization, Not Supply

The article reveals that the perceived AI‑GPU shortage stems from misleading utilization metrics and wasted capacity, not actual supply constraints, and argues that better measurement and orchestration—not buying more hardware—will determine competitive advantage in the emerging AI infrastructure market.

GPU utilizationIndustry analysisOrchestration
0 likes · 9 min read
Why the Real GPU Shortage Is About Low Utilization, Not Supply
Huolala Tech
Huolala Tech
Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving 50‑60% GPU cost reduction through systematic resource allocation, model quantization, PD‑separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilizationModel QuantizationPerformance Evaluation
0 likes · 18 min read
How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%
DataFunTalk
DataFunTalk
Feb 2, 2026 · Artificial Intelligence

How Alluxio Boosts GPU Utilization to 99.57% for Embodied AI – Inside the MLPerf Success

This article explains how Alluxio’s distributed caching architecture tackles the massive, multimodal data challenges of embodied AI, delivers near‑zero‑millisecond access, achieves 99.57% GPU utilization in MLPerf Storage v2.0, and validates its value through real‑world enterprise deployments.

AI Data PlatformAlluxioEmbodied Intelligence
0 likes · 21 min read
How Alluxio Boosts GPU Utilization to 99.57% for Embodied AI – Inside the MLPerf Success
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Dec 17, 2025 · Artificial Intelligence

How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%

The article details the Attention‑FFN Disaggregation (AFD) technique used by Baidu Baige to separate self‑attention and feed‑forward network stages in DeepSeek‑V3 models, describing multi‑stage scheduling, three‑batch overlap, communication optimizations, and performance results that achieve up to 19% throughput improvement under a 100 ms SLO.

3BOAFDAttention-FFN Disaggregation
0 likes · 17 min read
How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch schedulingGPU utilizationLLM inference
0 likes · 9 min read
Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap
DataFunTalk
DataFunTalk
Sep 3, 2025 · Artificial Intelligence

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

Alluxio’s distributed caching dramatically accelerates AI training and checkpointing workloads, achieving up to 99.57% GPU utilization and linear scaling across clusters in the MLPerf Storage v2.0 benchmark, while using cost‑effective commodity hardware to eliminate I/O bottlenecks.

AI trainingAlluxioGPU utilization
0 likes · 11 min read
How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization
Tencent Technical Engineering
Tencent Technical Engineering
Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations—including PD separation, multi‑layer MTP, EP and DP parallelism, hardware‑aware kernels, and load‑balancing strategies—that boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI PerformanceDeepSeekGPU utilization
0 likes · 13 min read
How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations
Baidu Geek Talk
Baidu Geek Talk
Feb 5, 2025 · Artificial Intelligence

How to Unlock Full GPU Efficiency for Enterprise AI Platforms

This article analyzes common GPU efficiency problems in enterprise AI compute platforms—such as low utilization, long fault‑resolution times, and limited performance gains—and presents three practical solutions: dynamic resource allocation, systematic fault‑tolerance, and system‑level tuning, illustrated with real‑world case studies.

AI PlatformGPU utilizationlarge model training
0 likes · 11 min read
How to Unlock Full GPU Efficiency for Enterprise AI Platforms
Baidu Geek Talk
Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses via a Prefill stage and an autoregressive Decoder, measured by TTFT and TPOT, and Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and asynchronous execution, cutting token‑interval latency from ~35 ms to ~14 ms and boosting GPU utilization to about 75 % while also leveraging quantization and speculative execution for higher throughput.

AI accelerationGPU utilizationTPOT
0 likes · 10 min read
Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAKGPU utilizationLLM Performance
0 likes · 10 min read
How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Sep 16, 2024 · Artificial Intelligence

How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput

With the growing complexity of LLM architectures like GQA, MLA, and MoE, runtime overhead has become a bottleneck; this article analyzes Python performance, communication costs, and synchronous execution in current inference frameworks, introduces the fully asynchronous TAG architecture, and demonstrates its superior throughput and latency through benchmarks.

GPU utilizationLLM inferenceRuntime Optimization
0 likes · 12 min read
How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput
360 Smart Cloud
360 Smart Cloud
May 15, 2024 · Cloud Native

Polefs: A Cloud‑Native Distributed Cache File System for AI Training Workloads

The article outlines the challenges of massive AI training data, defines storage performance requirements, and presents Polefs—a cloud‑native distributed cache file system with unified storage, metadata acceleration, and read/write caching designed to improve GPU utilization and reduce data redundancy.

AICloud NativeDistributed File System
0 likes · 14 min read
Polefs: A Cloud‑Native Distributed Cache File System for AI Training Workloads
Volcano Engine Developer Services
Volcano Engine Developer Services
Jun 20, 2023 · Artificial Intelligence

Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture

Large-model offline (batch) inference, which processes massive data on billion-parameter models, faces GPU memory and distributed scheduling challenges; this article explains how Ray's cloud-native framework, model parallelism, and Ray Datasets pipelines address these issues, improve throughput, and enable elastic, efficient GPU utilization.

GPU utilizationRaycloud-native
0 likes · 16 min read
Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture
Bilibili Tech
Bilibili Tech
Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inferenceGPU utilizationInferX
0 likes · 10 min read
InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving
Efficient Ops
Efficient Ops
Jun 11, 2023 · Artificial Intelligence

Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It

The article explains how AIGC models demand massive GPU compute, why network bandwidth and latency become the critical limiting factors, and how the Distributed Disaggregated Chassis (DDC) architecture addresses these challenges with scalable, high‑throughput networking solutions.

AI InfrastructureAIGCDDC
0 likes · 13 min read
Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It
Alimama Tech
Alimama Tech
Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services by separating CPU‑heavy preprocessing from GPU inference, employing priority‑based auto‑scaling, multi‑tenant sharing, and TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce waste, and adapt to fluctuating traffic, with future work targeting automated bottleneck detection and further CPU‑GPU offloading.

Auto ScalingGPU utilizationHigh‑performance computing
0 likes · 19 min read
Optimizing GPU Utilization for Multimedia AI Services with high_service
ITPUB
ITPUB
Apr 27, 2022 · Artificial Intelligence

How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

AI PlatformGPU utilizationInference Acceleration
0 likes · 26 min read
How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%
DataFunTalk
DataFunTalk
Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AIGPU utilizationInference
0 likes · 4 min read
Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training