Tagged articles
60 articles
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that lightweight L1 regularization can drive over 99% of FFN activations to zero. Combined with a new tile‑wise ELLPACK (TwELL) format and a hybrid routing scheme, this improves inference speed by up to 30%, cuts memory usage by over 24%, and reduces energy consumption, all with negligible impact on downstream task performance.

CUDA · GPU Optimization · Hybrid Routing
8 min read
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse. With the TwELL storage format and a hybrid routing scheme, this sparsity translates into up to 20.5% faster inference, 21.9% faster training steps, lower energy consumption, and reduced peak memory across models with 0.5B to 2B parameters, while preserving downstream performance.

CUDA · GPU Optimization · Hybrid Routing
9 min read
Old Zhang's AI Learning
May 7, 2026 · Artificial Intelligence

How Unsloth and NVIDIA Boost Consumer‑GPU LLM Training by ~25% with Three Simple Optimizations

Unsloth and NVIDIA identified three low‑level bottlenecks in LLM fine‑tuning on consumer GPUs—repeated packed‑sequence metadata construction, serialized copy‑and‑compute during gradient checkpointing, and per‑expert routing overhead in MoE—and applied targeted patches that together deliver roughly a 25% speedup without changing hardware, code, or frameworks.

GPU Optimization · LLM training · Mixture of Experts
12 min read
Woodpecker Software Testing
Apr 24, 2026 · Artificial Intelligence

Practical Guide to Optimizing Large Model Performance in Production

This guide details how enterprises can move large language models from lab to production by defining specific SLI/SLO metrics, diagnosing hidden bottlenecks such as tokenizer latency, and applying four quantifiable optimization levers that dramatically improve latency, throughput, and cost efficiency.

Continuous Batching · GPU Optimization · Large Language Models
6 min read
Old Zhang's AI Learning
Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantization · GPU Optimization · LLM training
9 min read
Machine Heart
Apr 16, 2026 · Artificial Intelligence

Achieving 4.6× Faster Diffusion Model Training with FP4‑BF16 Dual‑Track Parallelism (Sol‑RL)

Sol‑RL, a framework from NVIDIA, Hong Kong University and MIT, integrates NVFP4 inference for large‑scale rollout exploration and BF16 precision for high‑fidelity regeneration, delivering up to 4.64× faster convergence at equivalent reward levels while preserving BF16 training fidelity across SANA, FLUX.1 and SD3.5‑L models.

BF16 · Diffusion Models · FP4
9 min read
SuanNi
Mar 29, 2026 · Artificial Intelligence

How an AI Agent Outperformed NVIDIA Engineers in 7‑Day GPU Kernel Optimization

This article analyzes the AVO system, an autonomous AI agent that replaces traditional evolutionary search pipelines to iteratively improve CUDA attention kernels on NVIDIA's Blackwell B200 GPU, achieving up to 10.5% higher throughput than hand‑tuned implementations after a week of nonstop optimization.

AI · CUDA · GPU Optimization
13 min read
Machine Learning Algorithms & Natural Language Processing
Mar 3, 2026 · Artificial Intelligence

How CUDA Agent Lets Anyone Write High‑Performance CUDA Kernels, Challenging Nvidia’s AI Moat

CUDA Agent, a large‑scale reinforcement‑learning system from ByteDance and Tsinghua, can automatically generate and optimize CUDA kernels that outperform torch.compile by up to 2× on simple kernels and achieve around 40% higher speed than proprietary models on the hardest benchmarks, while detailing its data‑synthesis pipeline, training workflow, and current limitations.

CUDA · GPU Optimization · KernelBench
10 min read
AI Explorer
Mar 3, 2026 · Artificial Intelligence

ByteDance & Tsinghua Reveal AI‑Powered CUDA Agent for Self‑Evolving Kernels

ByteDance and Tsinghua University have created the CUDA Agent, an AI compiler that automatically writes and optimizes GPU kernels, delivering up to double the performance, and heralding a shift where AI‑generated low‑level code could reshape the hardware‑software competition landscape.

AI compiler · ByteDance · CUDA
6 min read
Bilibili Tech
Feb 13, 2026 · Artificial Intelligence

Self-Forcing: Turning Global Video Diffusion into Causal Streaming for Long-Form Generation

This article examines the Wan2.1 video diffusion model, identifies its scalability bottlenecks for long and real‑time video generation, and introduces the Self‑Forcing causal framework together with sequence‑parallel and RoPE optimizations that achieve sub‑second latency and up to 1.5× speed‑up on modern GPUs.

GPU Optimization · causal inference · large video generation
14 min read
Data Party THU
Jan 21, 2026 · Artificial Intelligence

What DeepSeek’s Secret “Model1” Reveals About the Upcoming V4 LLM

Analyzing recent commits to DeepSeek’s FlashMLA repository, the article uncovers that the mysterious Model1 likely corresponds to DeepSeek‑V4, detailing architectural shifts to a 512‑dimensional head, full support for NVIDIA Blackwell GPUs, token‑level sparse MLA, and new mechanisms such as Value Vector Position Awareness and Engram.

DeepSeek · DeepSeek-V4 · GPU Optimization
6 min read
Ops Community
Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

Continuous Batching · GPU Optimization · OpenAI API Compatibility
61 min read
Fun with Large Models
Jan 18, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Deploying Large Language Models Locally with vLLM and Ollama

This article walks through two mainstream local deployment solutions—high‑performance vLLM for production Linux servers and lightweight Ollama for personal Windows machines—covering environment setup, model download, server launch, API testing, key configuration parameters, and the quantization technique that makes Ollama models compact.

GPU Optimization · Large Language Models · Model Quantization
18 min read
MaGe Linux Operations
Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInfer · GPU Optimization · H800
19 min read
Ops Community
Dec 28, 2025 · Artificial Intelligence

Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops

This guide walks through the complete process of deploying a high‑throughput large language model inference service using vLLM, covering environment preparation, installation, configuration tuning, performance testing, real‑world case studies, monitoring, troubleshooting, and backup strategies for production‑grade deployments.

Deployment · GPU Optimization · LLM inference
44 min read
AI2ML AI to Machine Learning
Dec 19, 2025 · Artificial Intelligence

The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates transformer inference by combining nine techniques—including loss‑less attention, GPU memory‑pyramid optimization, SRAM‑reusing tiling, safe softmax scaling, online buffering, tile‑size constraints, parallel multiplication, reduced KV slicing, and integrated backward‑pass caching—to achieve efficient, high‑throughput computation on modern GPUs.

Attention Mechanism · FlashAttention · GPU Optimization
8 min read
Old Meng AI Explorer
Nov 30, 2025 · Artificial Intelligence

Unlock 1‑Minute AI Video Generation with TTT‑Video‑Dit: Break the 3‑Second Limit

TTT‑Video‑Dit is an open‑source framework that uses test‑time‑training and hierarchical attention to generate coherent 63‑second videos with style‑transfer, dramatically reducing GPU memory requirements so a single RTX 4090 can replace costly H100 clusters, enabling creators and developers to produce long AI videos efficiently.

GPU Optimization · Style Transfer · TTT-Video-Dit
11 min read
Linux Kernel Journey
Sep 24, 2025 · Fundamentals

Fine-Grained GPU Code Modifications: Boosting CUDA Performance

This article explains why certain GPU performance gains require direct CUDA kernel edits and walks through fine‑grained techniques such as data‑layout restructuring, warp‑level primitives, tiled memory accesses, kernel fusion, and dynamic execution paths, backed by code examples and benchmark insights.

CUDA · GPU Optimization · dynamic execution
12 min read
AI Frontier Lectures
Apr 1, 2025 · Artificial Intelligence

Can SpargeAttn Accelerate Any Model Without Training? A Deep Dive

This article reviews the SpargeAttn paper, describing how a training‑free sparse attention mechanism achieves 4‑7× inference speedup across language, video, and image models while preserving end‑to‑end accuracy, and outlines its challenges, algorithmic solutions, implementation details, and experimental results.

GPU Optimization · Quantized Inference · SpargeAttn
7 min read
Architects' Tech Alliance
Mar 5, 2025 · Industry Insights

How DeepSeek’s Open‑Source Tools Are Supercharging AI Model Performance

DeepSeek’s Open‑Source Week unveiled five high‑performance projects—FlashMLA, DeepEP, DeepGEMM, DualPipe/EPLB, and 3FS—each delivering novel GPU optimizations, communication kernels, matrix‑multiplication libraries, parallelism strategies, and a distributed file system that together dramatically accelerate large‑scale AI training and inference workloads.

AI acceleration · DeepSeek · Distributed Training
9 min read
Data Thinking Notes
Mar 2, 2025 · Artificial Intelligence

How DeepSeek’s Open‑Source Week Accelerates AI with Cutting‑Edge GPU and Storage Innovations

During DeepSeek’s Open‑Source Week (Feb 24‑28), five production‑tested projects were released, spanning GPU‑optimized MLA kernels, MoE communication libraries, high‑performance FP8 GEMM, dual‑pipeline parallelism, and an AI‑focused distributed file system, each delivering significant performance and efficiency gains for large‑scale AI workloads.

AI · Distributed Training · GPU Optimization
13 min read
AIWalker
Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · Distributed Training · GPU Optimization
39 min read
NewBeeNLP
Feb 27, 2025 · Industry Insights

How DeepSeek’s Open‑Source Tools Exploit China‑Specific H800 GPUs to Boost AI Performance

The article analyzes DeepSeek’s three open‑source projects—FlashMLA, DeepEP, and DeepGEMM—showing how they optimize for the China‑only NVIDIA H800 GPU, contrast this with the abundant hardware resources of Western AI firms, and highlight the growing demand for talent that masters both AI models and GPU hardware.

AI hardware · DeepEP · DeepGEMM
7 min read
AIWalker
Feb 25, 2025 · Artificial Intelligence

Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×

Sliding Tile Attention (STA) replaces costly full‑3D attention in video DiT models with a block‑wise sliding‑window scheme, achieving up to 10× attention speedup and a 3.53× end‑to‑end generation boost for HunyuanVideo without quality loss, as demonstrated by extensive benchmarks and kernel analyses.

Deep Learning · GPU Optimization · HunyuanVideo
16 min read
AI Algorithm Path
Feb 24, 2025 · Artificial Intelligence

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open‑source GPU kernel optimized for Nvidia Hopper GPUs that compresses the KV cache of multi‑head attention, cutting memory usage by up to 93.3% and delivering 580 TFLOPS compute, thereby dramatically accelerating large‑language‑model inference while lowering cost.

DeepSeek · Flash-MLA · GPU Optimization
8 min read
DataFunSummit
Jan 21, 2025 · Artificial Intelligence

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

This article presents NVIDIA's NeMo technology stack for end‑to‑end large language model (LLM) training, covering the full software pipeline, model alignment with reinforcement learning from human feedback (RLHF), performance optimizations such as model parallelism, FP8, TensorRT‑LLM inference, dynamic load balancing, and future research directions.

Distributed Training · GPU Optimization · LLM
24 min read
JavaEdge
Nov 20, 2024 · Artificial Intelligence

7 Proven Strategies to Simplify Large Language Model Deployment

The article explains why deploying large language models is challenging and presents seven practical techniques—including defining deployment boundaries, model quantization, inference optimization, infrastructure consolidation, model replacement planning, GPU utilization, and using smaller models—to make LLM deployment more efficient and cost‑effective.

GPU Optimization · LLM deployment · Model Scaling
24 min read
Alibaba Cloud Infrastructure
Nov 8, 2024 · Industry Insights

Unlocking Efficient LLM Inference: Insights from China’s Cloud Computing Conference

The 5th China Cloud Computing Infrastructure Developer Conference in Beijing highlighted cutting‑edge AI inference optimization, Knative‑based serverless acceleration, AMD PMU virtualization, and CDI‑driven GPU management, offering detailed technical insights and real‑world case studies that illustrate how cloud providers are tackling performance and cost challenges of modern workloads.

AI inference · AMD virtualization · Cloud Native
9 min read
Sohu Tech Products
Oct 18, 2024 · Artificial Intelligence

Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes

This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.

AI inference · GPU Optimization · Kubernetes
16 min read
DataFunSummit
Oct 5, 2024 · Artificial Intelligence

Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch

This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.

CUDA Graph · Distributed Training · GPU Optimization
11 min read
JD Retail Technology
Aug 30, 2024 · Artificial Intelligence

GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.

Distributed Systems · GPU Optimization · Inference
13 min read
Baidu Geek Talk
Aug 26, 2024 · Artificial Intelligence

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

The article presents three RLHF‑PPO acceleration techniques: TRT‑LLM‑based text generation speedups, selective activation recomputation with sequence parallelism for dynamic memory reduction, and overlapping pipeline stages for system‑level parallelism. Together they deliver a 350% throughput boost on a 10B model using 16 A100 GPUs.

Distributed Training · GPU Optimization · Large Language Models
16 min read
Alibaba Cloud Big Data AI Platform
Aug 22, 2024 · Artificial Intelligence

How RECom Accelerates Recommendation Model Inference on GPUs

The RECom compiler introduces a subgraph‑parallel fusion technique and symbolic shape handling to dramatically speed up GPU inference of deep recommendation models with massive embedding columns, achieving up to 6.61× lower latency and 1.91× higher throughput than TensorFlow baselines, while eliminating redundant computations.

GPU Optimization · Recommendation Systems · compiler
10 min read
Baidu Tech Salon
Aug 20, 2024 · Artificial Intelligence

PaddlePaddle Neural Network Compiler (CINN): Architecture, Optimization Techniques, and Performance

The PaddlePaddle Neural Network Compiler (CINN) combines a PIR‑based frontend and a hardware‑specific backend to apply graph‑level optimizations, operator fusion, schedule transformations and automatic tuning, delivering up to 4× faster kernels and 30‑60% overall speed‑ups for deep‑learning and scientific workloads.

CINN · GPU Optimization · Operator fusion
19 min read
58UXD
Jun 13, 2024 · Artificial Intelligence

Why ComfyUI Is the Fast, Flexible Choice Over WebUI for Stable Diffusion

This article explains what ComfyUI is, how its node‑based workflow mirrors the underlying Stable Diffusion architecture, and why it outperforms WebUI in speed, GPU usage, real‑time preview, and workflow reuse, while also offering practical tips for new users.

AI image generation · ComfyUI · GPU Optimization
9 min read
JD Tech
Mar 18, 2024 · Artificial Intelligence

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch, deep‑learning compiler techniques, and a multi‑stream GPU architecture, achieving significant throughput and latency improvements.

AI inference · Deep Learning Compiler · GPU Optimization
14 min read
Alibaba Cloud Big Data AI Platform
Feb 23, 2024 · Artificial Intelligence

How PAI‑TorchAcc Supercharges Large‑Model Training on Alibaba Cloud

PAI‑TorchAcc, an Alibaba Cloud AI platform accelerator, offers a seamless PyTorch interface that integrates HuggingFace models and employs LazyTensor‑based static graph conversion, multi‑strategy distributed training, and extensive GPU optimizations to dramatically boost throughput for 1B‑175B parameter models, surpassing PyTorch native and Megatron‑LM performance.

AI acceleration · Alibaba Cloud · GPU Optimization
13 min read
Baidu Geek Talk
Dec 19, 2023 · Industry Insights

Inside Baidu Search Innovation Contest: Winning AI Solutions Across Five Tracks

The second Baidu Search Innovation Contest attracted over 2,800 participants from 45 regions, featured five AI‑focused tracks, and highlighted champion teams that employed techniques such as LoRA‑fine‑tuned LLMs, vector‑intersection Top‑K search, GPU‑optimized algorithms, and diffusion‑based image generation to push the boundaries of search technology.

AI competition · Diffusion Models · GPU Optimization
12 min read
DataFunTalk
Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU Optimization · Inference · Model Serving
16 min read
Baidu Tech Salon
Nov 10, 2023 · Artificial Intelligence

Baidu Search Deep Learning Model Architecture and Optimization Practices

Baidu's Search Architecture team details how its deep‑learning models have evolved to deliver direct answer results via semantic embeddings, describes a massive online inference pipeline that rewrites queries, ranks relevance, and classifies types, and outlines optimization techniques—including data I/O, CPU/GPU balancing, pruning, quantization, and distillation—to achieve high‑throughput, low‑latency search.

Baidu · GPU Optimization · Inference System
13 min read
Alimama Tech
Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeed · Distributed Training · GPU Optimization
10 min read
Xiaohongshu Tech REDtech
May 15, 2023 · Artificial Intelligence

GPU-Accelerated Inference Optimization for Large-Scale Machine Learning at Xiaohongshu

Xiaohongshu transformed its recommendation, advertising, and search inference pipeline by migrating to GPU‑centric hardware, deploying a custom TensorFlow‑Core Lambda service, and applying system‑level, virtualization, and compute‑level optimizations—including NUMA binding, kernel fusion, dynamic scaling, and FP16 quantization—achieving roughly 30× compute capacity growth, over 10% user‑metric gains, and more than 50% cluster‑resource savings.

GPU Optimization · Hardware acceleration · Machine Learning Inference
20 min read
Baidu Intelligent Cloud Tech Hub
Feb 23, 2023 · Artificial Intelligence

How Baidu’s Cloud Infrastructure Tackles the Challenges of Training Massive AI Models

This article explains how Baidu's intelligent cloud overcomes the compute and storage walls of large‑scale model training by combining hardware design, network topology, and software optimizations such as pipeline, tensor, and expert parallelism, cost‑model‑driven placement, and future‑proof AI infrastructure evolution.

AI Infrastructure · Baidu Cloud · Cost Model
28 min read
Baidu Intelligent Cloud Tech Hub
Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI Performance · GPU Optimization · Operator fusion
23 min read
Alibaba Cloud Big Data AI Platform
Dec 9, 2022 · Artificial Intelligence

What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization

BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive GPU memory‑intensive compute enhancements, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.

BladeDISC · GPU Optimization · MLIR
11 min read
Alimama Tech
Oct 26, 2022 · Artificial Intelligence

GPU Utilization Analysis and Optimization for Alibaba's Intelligent Creative Video Service

The paper analyzes why Alimama’s intelligent creative video service suffers from low GPU utilization (Python GIL blocking, lack of kernel fusion, and serialized CUDA streams) and details service‑level changes (separate CPU/GPU processes, shared‑memory queues, priority scheduling) and operator‑level kernel‑fusion techniques (channels‑last layouts, custom pooling, TensorRT conversion) that raise utilization from ~30% to nearly 100% and boost throughput by 75%.

Deep Learning · GPU Optimization · Python
20 min read
Alimama Tech
May 11, 2022 · Artificial Intelligence

PICASSO: An Industrial-Scale Sparse Training Engine for Wide-and-Deep Recommender Systems

PICASSO, Alibaba’s GPU‑centric sparse training engine for wide‑and‑deep recommender systems, merges identical embedding tables, interleaves data and kernel operations, and caches hot embeddings on GPU, eliminating the parameter server and delivering up to tenfold speedups over TensorFlow‑PS while maintaining model quality.

Alibaba · GPU Optimization · machine learning
14 min read
Alibaba Cloud Developer
Aug 17, 2021 · Artificial Intelligence

How Alibaba’s Whale Framework Cuts Large‑Model Training Costs by 80%

Alibaba Cloud’s PAI team and the DAMO Academy introduced the low‑carbon M6 trillion‑parameter multimodal model, demonstrating that their self‑developed Whale framework can train such massive models on just 480 V100 GPUs, reducing energy consumption by over 80% and boosting training efficiency nearly eleven‑fold.

AI · Distributed Training · GPU Optimization
12 min read
Tencent Architect
Aug 4, 2021 · Artificial Intelligence

How We Accelerated Feature Hashing for Ad Ranking on GPUs

This article explains how Tencent's Light platform reduced the massive overhead of feature hashing in ad‑ranking by moving integer‑to‑string conversion and hash computation to the GPU, introducing custom contiguous string tensors, and achieving up to 12× speed‑up on V100 GPUs.

GPU Optimization · TensorFlow · ad ranking
14 min read
DataFunTalk
Mar 25, 2021 · Artificial Intelligence

Optimizing MNN Mobile Neural Network Inference on GPU with OpenCL: Memory Objects, Work‑Group Tuning, and Auto‑Tuning

This article explains how the MNN deep‑learning framework leverages OpenCL to achieve high‑performance inference on mobile, PC and embedded GPUs by diversifying memory objects, aligning data, using local‑memory reductions, selecting optimal work‑group sizes, applying pre‑inference auto‑tuning, caching compiled programs, and providing practical GPU‑friendly model design guidelines.

GPU Optimization · MNN · OpenCL
20 min read
360 Smart Cloud
Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU Optimization
12 min read
Sohu Tech Products
Dec 24, 2020 · Mobile Development

Reducing Frame Rate in iOS Animations to Lower GPU Usage

The article explains why lowering the frame rate of iOS animations can trade a slight loss in visual smoothness for significant GPU load reduction, describes the Core Animation rendering pipeline, compares different frame‑rate reduction techniques, and presents test results showing the impact on CPU, GPU, and overall app performance.

CADisplayLink · Frame Rate · GPU Optimization
11 min read
iQIYI Technical Product Team
Jul 3, 2020 · Artificial Intelligence

Optimizing Video Inference Services for High GPU Utilization in AI Applications

By moving decoding, color conversion, preprocessing, inference, and re‑encoding entirely onto the GPU and enabling batch processing with flexible Python scripts, iQIYI’s video‑image enhancement service achieved ten‑fold throughput, over 90% GPU utilization, and dramatically lower resource use, accelerating AI video inference deployment.

AI deployment · DeepStream · GPU Optimization
14 min read