Tag: GPU utilization


Baidu Geek Talk
Jan 15, 2025 · Artificial Intelligence

Understanding Large Model Inference Engines and Reducing Token Interval (TPOT)

Large‑model inference engines convert prompts into responses through a Prefill stage followed by an autoregressive Decode stage, with latency measured by TTFT (time to first token) and TPOT (time per output token). Baidu's AIAK suite improves TPOT by separating tokenization from the main loop, using static slot scheduling, and executing asynchronously, cutting token‑interval latency from roughly 35 ms to 14 ms and raising GPU utilization to about 75%, while quantization and speculative execution deliver additional throughput.
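The two metrics the summary names can be made concrete. Below is a minimal sketch (not AIAK code; the function name and timestamps are illustrative) of computing TTFT and TPOT from the per‑token arrival times of a streamed response:

```python
# Illustrative sketch: TTFT = time to first token after the request starts
# (dominated by Prefill); TPOT = mean interval between subsequent tokens
# (the token interval produced by the autoregressive Decode loop).

def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, 0.0
    intervals = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, sum(intervals) / len(intervals)

# Example: five tokens arriving at 35 ms intervals after a 120 ms prefill.
times = [0.120 + 0.035 * i for i in range(5)]
ttft, tpot = ttft_and_tpot(0.0, times)
print(round(ttft, 3), round(tpot, 3))  # → 0.12 0.035
```

Under this definition, the optimization described above is a drop in TPOT from about 0.035 s to 0.014 s per token.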

AI acceleration · GPU utilization · TPOT
10 min read
360 Smart Cloud
May 15, 2024 · Cloud Native

Polefs: A Cloud‑Native Distributed Cache File System for AI Training Workloads

The article outlines the challenges of massive AI training data, defines storage performance requirements, and presents Polefs—a cloud‑native distributed cache file system with unified storage, metadata acceleration, and read/write caching designed to improve GPU utilization and reduce data redundancy.
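The read‑caching idea the summary mentions follows a standard pattern. A toy sketch of read‑through caching with LRU eviction (unrelated to Polefs internals; the class and backend are hypothetical) shows how hot training samples can be served locally while cold reads fall through to remote storage:

```python
from collections import OrderedDict

class ReadThroughCache:
    """Serve hot blocks from a local LRU cache; miss -> remote storage."""

    def __init__(self, backend_read, capacity=2):
        self.backend_read = backend_read   # slow remote read (placeholder)
        self.cache = OrderedDict()         # insertion order tracks recency
        self.capacity = capacity
        self.hits = self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)    # mark as most recently used
            return self.cache[key]
        self.misses += 1
        value = self.backend_read(key)     # read-through on a miss
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False) # evict least recently used
        return value

store = {"a": 1, "b": 2, "c": 3}           # stands in for remote storage
cache = ReadThroughCache(store.__getitem__)
for k in ["a", "b", "a", "c", "a"]:
    cache.read(k)
print(cache.hits, cache.misses)  # → 2 3
```

The hit rate on repeated reads is what lifts GPU utilization: the accelerator is no longer stalled waiting on remote I/O for data it has already seen.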

AI · GPU utilization · Polefs
14 min read
Bilibili Tech
Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inference · GPU utilization · InferX
10 min read
Efficient Ops
Jun 11, 2023 · Artificial Intelligence

Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It

The article explains how AIGC models demand massive GPU compute, why network bandwidth and latency become the critical limiting factors, and how the Distributed Disaggregated Chassis (DDC) architecture addresses these challenges with scalable, high‑throughput networking solutions.

AI infrastructure · AIGC · DDC
13 min read
Alimama Tech
Nov 2, 2022 · Artificial Intelligence

Optimizing GPU Utilization for Multimedia AI Services with high_service

The article presents high_service, a high‑performance inference framework that boosts GPU utilization in multimedia AI services. It separates CPU‑heavy preprocessing from GPU inference, employs priority‑based auto‑scaling and multi‑tenant sharing, and runs TensorRT‑accelerated models to eliminate GIL bottlenecks, reduce resource waste, and adapt to fluctuating traffic. Future work targets automated bottleneck detection and further CPU‑GPU offloading.
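The decoupling the summary describes can be sketched with a queue between a preprocessing worker and an inference loop. This is a minimal illustration of the pattern, not high_service itself; the `preprocess` and `infer` bodies are placeholders, and a real deployment would use separate processes (not threads) so CPU‑bound preprocessing escapes Python's GIL:

```python
import queue
import threading

def preprocess(raw):
    # placeholder for CPU-heavy work (decode, resize, tokenize, ...)
    return raw * 2

def infer(batch):
    # placeholder for a GPU inference call (e.g. a TensorRT engine)
    return [x + 1 for x in batch]

work_q = queue.Queue(maxsize=64)   # bounded: applies backpressure
results = []

def preprocess_worker(items):
    for item in items:
        work_q.put(preprocess(item))
    work_q.put(None)               # sentinel: no more work

def inference_loop():
    batch = []
    while True:
        item = work_q.get()
        if item is None:
            break
        batch.append(item)
    results.extend(infer(batch))   # GPU stage consumes ready inputs only

t = threading.Thread(target=preprocess_worker, args=([1, 2, 3],))
t.start()
inference_loop()
t.join()
print(results)  # → [3, 5, 7]
```

Because the inference stage only ever sees preprocessed inputs, the GPU is never idle waiting on per‑request CPU work, which is the utilization gain the framework targets.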

GPU utilization · High Performance Computing · TensorRT
19 min read
DataFunTalk
Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AI · GPU utilization · Inference
4 min read