Tagged articles

22 articles

May 5, 2026 · Artificial Intelligence

Musk’s 550K Nvidia GPUs Achieve Only 11% Utilization – Like Running 60K GPUs

xAI’s massive fleet of roughly 550,000 Nvidia H100 and H200 GPUs in its Memphis and Colossus data centers is operating at a mere 11% model FLOPs utilization, highlighting how scaling to hundreds of thousands of GPUs creates coordination, network, and scheduling bottlenecks that waste most of the hardware’s compute power.

AI Infrastructure · GPU utilization · Nvidia H100

0 likes · 5 min read

Machine Learning Algorithms & Natural Language Processing

Apr 28, 2026 · Artificial Intelligence

Why DeepSeek V4 Insists on Batch Invariance—and What It Costs

DeepSeek V4 achieves ultra‑long context, complex training pipelines, and custom high‑performance kernels by enforcing batch invariance, a design that guarantees bit‑wise identical outputs across varying batch shapes but incurs lower GPU utilization, reduced small‑batch speed, and added engineering complexity.

DeepSeek-V4 · GPU utilization · LLM engineering

0 likes · 8 min read

Code Mala Tang

Mar 27, 2026 · Industry Insights

Why the Real GPU Shortage Is About Low Utilization, Not Supply

The article reveals that the perceived AI‑GPU shortage stems from misleading utilization metrics and wasted capacity, not actual supply constraints, and argues that better measurement and orchestration—not buying more hardware—will determine competitive advantage in the emerging AI infrastructure market.

GPU utilization · Industry analysis · Orchestration

0 likes · 9 min read

Huolala Tech

Mar 6, 2026 · Artificial Intelligence

How Huolala’s Dolphin Platform Cuts Large‑Model Inference Costs by Up to 60%

The article details how Huolala’s Dolphin platform engineers large‑model inference for high‑concurrency, long‑context, low‑latency production workloads, achieving a 50‑60% GPU cost reduction through systematic resource allocation, model quantization, prefill/decode (PD) separation, speculative sampling, and kernel‑level optimizations while maintaining service stability.

GPU utilization · Model Quantization · Performance Evaluation

0 likes · 18 min read

DataFunTalk

Feb 2, 2026 · Artificial Intelligence

How Alluxio Boosts GPU Utilization to 99.57% for Embodied AI – Inside the MLPerf Success

This article explains how Alluxio’s distributed caching architecture tackles the massive, multimodal data challenges of embodied AI, delivers near‑zero‑millisecond access, achieves 99.57% GPU utilization in MLPerf Storage v2.0, and validates its value through real‑world enterprise deployments.

AI Data Platform · Alluxio · Embodied Intelligence

0 likes · 21 min read

Baidu Intelligent Cloud Tech Hub

Dec 17, 2025 · Artificial Intelligence

How AFD Splits Attention and FFN to Boost DeepSeek‑V3 Inference by Up to 19%

The article details the Attention‑FFN Disaggregation (AFD) technique used by Baidu Baige to separate self‑attention and feed‑forward network stages in DeepSeek‑V3 models, describing multi‑stage scheduling, three‑batch overlap, communication optimizations, and performance results that achieve up to 19% throughput improvement under a 100 ms SLO.

3BO · AFD · Attention-FFN Disaggregation

0 likes · 17 min read

Baidu Intelligent Cloud Tech Hub

Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap, dynamically splitting sequences into balanced token chunks, enabling near‑equal compute and communication times, improving GPU utilization and achieving up to 30% throughput gains in heterogeneous request workloads, with zero accuracy loss.

Batch scheduling · GPU utilization · LLM inference

0 likes · 9 min read

DataFunTalk

Sep 3, 2025 · Artificial Intelligence

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

Alluxio’s distributed caching dramatically accelerates AI training and checkpointing workloads, achieving up to 99.57% GPU utilization and linear scaling across clusters in the MLPerf Storage v2.0 benchmark, while using cost‑effective commodity hardware to eliminate I/O bottlenecks.

AI training · Alluxio · GPU utilization

0 likes · 11 min read

Tencent Technical Engineering

Jul 11, 2025 · Artificial Intelligence

How DeepSeek Achieved 15,800+ Tokens/s: Full‑Stack Inference Optimizations

This article details the Angel‑HCF team's end‑to‑end DeepSeek inference optimizations, including prefill/decode (PD) separation, multi‑layer multi‑token prediction (MTP), expert and data parallelism (EP and DP), hardware‑aware kernels, and load‑balancing strategies, which boost throughput to over 15,800 tokens per second while keeping per‑token latency under 50 ms.

AI Performance · DeepSeek · GPU utilization

0 likes · 13 min read

Baidu Geek Talk

Feb 5, 2025 · Artificial Intelligence

How to Unlock Full GPU Efficiency for Enterprise AI Platforms

This article analyzes common GPU efficiency problems in enterprise AI compute platforms—such as low utilization, long fault‑resolution times, and limited performance gains—and presents three practical solutions: dynamic resource allocation, systematic fault‑tolerance, and system‑level tuning, illustrated with real‑world case studies.

AI Platform · GPU utilization · large model training

0 likes · 11 min read

Baidu Geek Talk

Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses through a prefill stage and an autoregressive decode stage, with performance measured by TTFT (time to first token) and TPOT (time per output token). Baidu’s AIAK suite improves TPOT by separating tokenization, using static slot scheduling, and executing asynchronously, cutting token‑interval latency from ~35 ms to ~14 ms and raising GPU utilization to about 75%, while also using quantization and speculative execution for higher throughput.

AI acceleration · GPU utilization · TPOT

0 likes · 10 min read

Baidu Intelligent Cloud Tech Hub

Jan 7, 2025 · Artificial Intelligence

How Baidu’s AIAK Boosts LLM Inference Speed by Cutting Token Latency

This article explains the architecture of large‑model inference engines, key performance metrics like TTFT and TPOT, the limitations of popular engines such as vLLM, and Baidu Baige's AIAK solutions—including multi‑process, static slot, and asynchronous execution—that dramatically reduce token‑interval latency and increase GPU utilization.

AIAK · GPU utilization · LLM Performance

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Sep 16, 2024 · Artificial Intelligence

How TAG Makes LLM Inference Fully Asynchronous for Higher Throughput

With the growing complexity of LLM architectures like GQA, MLA, and MoE, runtime overhead has become a bottleneck; this article analyzes Python performance, communication costs, and synchronous execution in current inference frameworks, introduces the fully asynchronous TAG architecture, and demonstrates its superior throughput and latency through benchmarks.

GPU utilization · LLM inference · Runtime Optimization

0 likes · 12 min read

360 Smart Cloud

May 15, 2024 · Cloud Native

Polefs: A Cloud‑Native Distributed Cache File System for AI Training Workloads

The article outlines the challenges of massive AI training data, defines storage performance requirements, and presents Polefs—a cloud‑native distributed cache file system with unified storage, metadata acceleration, and read/write caching designed to improve GPU utilization and reduce data redundancy.

AI · Cloud Native · Distributed File System

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Jun 21, 2023 · Artificial Intelligence

How GoldMiner Boosts Deep Learning Training by Up to 12× with Elastic Data Pre‑Processing

GoldMiner, a new system from Alibaba Cloud’s PAI platform, elastically scales deep learning data pre‑processing pipelines, improving training performance by up to 12.1× and GPU cluster utilization by 2.5×. Its underlying research was accepted at SIGMOD 2023.

Deep Learning · GPU utilization · SIGMOD

0 likes · 5 min read

Volcano Engine Developer Services

Jun 20, 2023 · Artificial Intelligence

Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture

Large-model offline (batch) inference, which processes massive data on billion-parameter models, faces GPU memory and distributed scheduling challenges; this article explains how Ray's cloud-native framework, model parallelism, and Ray Datasets pipelines address these issues, improve throughput, and enable elastic, efficient GPU utilization.

GPU utilization · Ray · cloud-native

0 likes · 16 min read

Bilibili Tech

Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inference · GPU utilization · InferX

0 likes · 10 min read

Efficient Ops

Jun 11, 2023 · Artificial Intelligence

Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It

The article explains how AIGC models demand massive GPU compute, why network bandwidth and latency become the critical limiting factors, and how the Distributed Disaggregated Chassis (DDC) architecture addresses these challenges with scalable, high‑throughput networking solutions.

AI Infrastructure · AIGC · DDC

0 likes · 13 min read

Alimama Tech

Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services by separating CPU‑heavy preprocessing from GPU inference. It employs priority‑based auto‑scaling, multi‑tenant sharing, and TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce waste, and adapt to fluctuating traffic; future work targets automated bottleneck detection and further CPU‑GPU offloading.

Auto Scaling · GPU utilization · High‑performance computing

0 likes · 19 min read

ITPUB

Apr 27, 2022 · Artificial Intelligence

How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

AI Platform · GPU utilization · Inference Acceleration

0 likes · 26 min read

DataFunTalk

Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AI · GPU utilization · Inference

0 likes · 4 min read

Tencent TDS Service

Dec 12, 2019 · Mobile Development

Why Did My Mobile Game’s FPS Crash? Uncovering Thermal Throttling on Galaxy S9+

A detailed case study shows how a mobile game's sudden frame‑rate drop on a Galaxy S9+ was traced to thermal throttling, highlighting the importance of correlating CPU, GPU, memory, and temperature data during performance testing.

CPU Frequency · Frame Rate · GPU utilization

0 likes · 5 min read
