Tagged articles

60 articles

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that applying lightweight L1 regularization can make over 99% of FFN activations zero, and by using a new tile‑wise ELLPACK (TwELL) format together with a hybrid routing scheme, inference speed improves up to 30% while memory usage drops over 24% and energy consumption is reduced, all with negligible impact on downstream task performance.

CUDA · GPU Optimization · Hybrid Routing

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that applying lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse, and with the TwELL storage format and a hybrid routing scheme this sparsity translates into up to 20.5% inference speedup, 21.9% training‑step acceleration, lower energy consumption and reduced peak memory across 0.5‑2 B‑parameter models while preserving downstream performance.

CUDA · GPU Optimization · Hybrid Routing

0 likes · 9 min read

Old Zhang's AI Learning

May 7, 2026 · Artificial Intelligence

How Unsloth and NVIDIA Boost Consumer‑GPU LLM Training by ~25% with Three Simple Optimizations

Unsloth and NVIDIA identified three low‑level bottlenecks in LLM fine‑tuning on consumer GPUs—repeated packed‑sequence metadata construction, serialized copy‑and‑compute during gradient checkpointing, and per‑expert routing overhead in MoE—and applied targeted patches that together deliver roughly a 25% speedup without changing hardware, code, or frameworks.

GPU Optimization · LLM training · Mixture of Experts

0 likes · 12 min read

Woodpecker Software Testing

Apr 24, 2026 · Artificial Intelligence

Practical Guide to Optimizing Large Model Performance in Production

This guide details how enterprises can move large language models from lab to production by defining specific SLI/SLO metrics, diagnosing hidden bottlenecks such as tokenizer latency, and applying four quantifiable optimization levers that dramatically improve latency, throughput, and cost efficiency.

Continuous Batching · GPU Optimization · Large Language Models

0 likes · 6 min read

Old Zhang's AI Learning

Apr 23, 2026 · Artificial Intelligence

DeepSeek Quietly Open‑Sources TileKernels to Push GPU Performance to Its Limits

DeepSeek has released TileKernels, a GPU kernel library written in the TileLang DSL, that targets H100/H200/B200 GPUs and claims to approach hardware limits in compute intensity and memory bandwidth, offering MoE routing, FP8/FP4 quantization, and dual‑language PyTorch references for deep‑learning engineers.

FP8 quantization · GPU Optimization · LLM training

0 likes · 9 min read

Machine Heart

Apr 16, 2026 · Artificial Intelligence

Achieving 4.6× Faster Diffusion Model Training with FP4‑BF16 Dual‑Track Parallelism (Sol‑RL)

Sol‑RL, a framework from NVIDIA, Hong Kong University and MIT, integrates NVFP4 inference for large‑scale rollout exploration and BF16 precision for high‑fidelity regeneration, delivering up to 4.64× faster convergence at equivalent reward levels while preserving BF16 training fidelity across SANA, FLUX.1 and SD3.5‑L models.

BF16 · Diffusion Models · FP4

0 likes · 9 min read

SuanNi

Mar 29, 2026 · Artificial Intelligence

How an AI Agent Outperformed NVIDIA Engineers in 7‑Day GPU Kernel Optimization

This article analyzes the AVO system, an autonomous AI agent that replaces traditional evolutionary search pipelines to iteratively improve CUDA attention kernels on NVIDIA's Blackwell B200 GPU, achieving up to 10.5% higher throughput than hand‑tuned implementations after a week of nonstop optimization.

AI · CUDA · GPU Optimization

0 likes · 13 min read

AI Explorer

Mar 18, 2026 · Artificial Intelligence

Run and Fine‑Tune Hundreds of Open‑Source LLMs Locally with Unsloth

Unsloth offers a unified web UI that accelerates fine‑tuning by up to 2×, cuts VRAM usage by 70% (80% for RL), supports hundreds of open‑source models, and provides simple installation steps for rapid local AI experimentation.

AI workstation · GPU Optimization · LLM

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

Mar 3, 2026 · Artificial Intelligence

How CUDA Agent Lets Anyone Write High‑Performance CUDA Kernels, Challenging Nvidia’s AI Moat

CUDA Agent, a large‑scale reinforcement‑learning system from ByteDance and Tsinghua, can automatically generate and optimize CUDA kernels that outperform torch.compile by up to 2× on simple kernels and achieve around 40% higher speed than proprietary models on the hardest benchmarks, while detailing its data‑synthesis pipeline, training workflow, and current limitations.

CUDA · GPU Optimization · KernelBench

0 likes · 10 min read

AI Explorer

Mar 3, 2026 · Artificial Intelligence

ByteDance & Tsinghua Reveal AI‑Powered CUDA Agent for Self‑Evolving Kernels

ByteDance and Tsinghua University have created the CUDA Agent, an AI compiler that automatically writes and optimizes GPU kernels, delivering up to double the performance, and heralding a shift where AI‑generated low‑level code could reshape the hardware‑software competition landscape.

AI compiler · ByteDance · CUDA

0 likes · 6 min read

Bilibili Tech

Feb 13, 2026 · Artificial Intelligence

Self-Forcing: Turning Global Video Diffusion into Causal Streaming for Long-Form Generation

This article examines the Wan2.1 video diffusion model, identifies its scalability bottlenecks for long and real‑time video generation, and introduces the Self‑Forcing causal framework together with sequence‑parallel and RoPE optimizations that achieve sub‑second latency and up to 1.5× speed‑up on modern GPUs.

GPU Optimization · causal inference · large video generation

0 likes · 14 min read

Data Party THU

Jan 21, 2026 · Artificial Intelligence

What DeepSeek’s Secret “Model1” Reveals About the Upcoming V4 LLM

Analyzing recent DeepSeek flashmla repository commits, the article uncovers that the mysterious Model1 likely corresponds to DeepSeek‑V4, detailing architectural shifts to a 512‑dimensional head, full support for NVIDIA Blackwell GPUs, token‑level sparse MLA, and new mechanisms such as Value Vector Position Awareness and Engram.

DeepSeek · DeepSeek-V4 · GPU Optimization

0 likes · 6 min read

Ops Community

Jan 18, 2026 · Artificial Intelligence

How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.

Continuous Batching · GPU Optimization · OpenAI API Compatibility

0 likes · 61 min read

Fun with Large Models

Jan 18, 2026 · Artificial Intelligence

Step‑by‑Step Guide to Deploying Large Language Models Locally with vLLM and Ollama

This article walks through two mainstream local deployment solutions, high‑performance vLLM for production Linux servers and lightweight Ollama for personal Windows machines, covering environment setup, model download, server launch, API testing, key configuration parameters, and the quantization technique that makes Ollama models compact.

GPU Optimization · Large Language Models · Model Quantization

0 likes · 18 min read

MaGe Linux Operations

Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInfer · GPU Optimization · H800

0 likes · 19 min read

Ops Community

Dec 28, 2025 · Artificial Intelligence

Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops

This guide walks through the complete process of deploying a high‑throughput large language model inference service using vLLM, covering environment preparation, installation, configuration tuning, performance testing, real‑world case studies, monitoring, troubleshooting, and backup strategies for production‑grade deployments.

Deployment · GPU Optimization · LLM inference

0 likes · 44 min read

AI2ML AI to Machine Learning

Dec 19, 2025 · Artificial Intelligence

The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates transformer inference by combining nine techniques—including loss‑less attention, GPU memory‑pyramid optimization, SRAM‑reusing tiling, safe softmax scaling, online buffering, tile‑size constraints, parallel multiplication, reduced KV slicing, and integrated backward‑pass caching—to achieve efficient, high‑throughput computation on modern GPUs.

Attention Mechanism · FlashAttention · GPU Optimization

0 likes · 8 min read

Old Meng AI Explorer

Nov 30, 2025 · Artificial Intelligence

Unlock 1‑Minute AI Video Generation with TTT‑Video‑Dit: Break the 3‑Second Limit

TTT‑Video‑Dit is an open‑source framework that uses test‑time‑training and hierarchical attention to generate coherent 63‑second videos with style‑transfer, dramatically reducing GPU memory requirements so a single RTX 4090 can replace costly H100 clusters, enabling creators and developers to produce long AI videos efficiently.

GPU Optimization · Style Transfer · TTT-Video-Dit

0 likes · 11 min read

Alibaba Cloud Native

Oct 17, 2025 · Artificial Intelligence

How We Boosted Embedding Service Throughput 16× with Cloud‑Native Optimizations

This article details the cost and speed challenges of embedding vectors in large‑scale log scenarios, analyzes inference framework choices, describes GPU utilization, priority queuing, and pipeline redesigns, and reports a 16‑fold throughput increase and dramatically lower per‑request costs.

Embedding · GPU Optimization · Throughput

0 likes · 8 min read

Linux Kernel Journey

Sep 24, 2025 · Fundamentals

Fine-Grained GPU Code Modifications: Boosting CUDA Performance

This article explains why certain GPU performance gains require direct CUDA kernel edits and walks through fine‑grained techniques such as data‑layout restructuring, warp‑level primitives, tiled memory accesses, kernel fusion, and dynamic execution paths, backed by code examples and benchmark insights.

CUDA · GPU Optimization · dynamic execution

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Jul 17, 2025 · Artificial Intelligence

How ChunkFlow Boosts Long-Context Model Training Up to 4.5× Faster

The paper "Efficient Long Context Fine-tuning with Chunk Flow" introduces ChunkFlow, a training framework that reorganizes variable‑length sequences into fixed‑size chunks, achieving up to 4.53× speedup and more stable GPU memory usage for large language models.

Artificial Intelligence · ChunkFlow · GPU Optimization

0 likes · 7 min read

AI Frontier Lectures

Apr 1, 2025 · Artificial Intelligence

Can SpargeAttn Accelerate Any Model Without Training? A Deep Dive

This article reviews the SpargeAttn paper, describing how a training‑free sparse attention mechanism achieves 4‑7× inference speedup across language, video, and image models while preserving end‑to‑end accuracy, and outlines its challenges, algorithmic solutions, implementation details, and experimental results.

GPU Optimization · Quantized Inference · SpargeAttn

0 likes · 7 min read

Architects' Tech Alliance

Mar 5, 2025 · Industry Insights

How DeepSeek’s Open‑Source Tools Are Supercharging AI Model Performance

DeepSeek’s Open‑Source Week unveiled five high‑performance projects—FlashMLA, DeepEP, DeepGEMM, DualPipe/EPLB, and 3FS—each delivering novel GPU optimizations, communication kernels, matrix‑multiplication libraries, parallelism strategies, and a distributed file system that together dramatically accelerate large‑scale AI training and inference workloads.

AI acceleration · DeepSeek · Distributed Training

0 likes · 9 min read

Data Thinking Notes

Mar 2, 2025 · Artificial Intelligence

How DeepSeek’s Open‑Source Week Accelerates AI with Cutting‑Edge GPU and Storage Innovations

During DeepSeek’s Open‑Source Week (Feb 24‑28), five production‑tested projects were released, spanning GPU‑optimized MLA kernels, MoE communication libraries, high‑performance FP8 GEMM, dual‑pipeline parallelism, and an AI‑focused distributed file system, each delivering significant performance and efficiency gains for large‑scale AI workloads.

AI · Distributed Training · GPU Optimization

0 likes · 13 min read

AIWalker

Feb 27, 2025 · Artificial Intelligence

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

DeepSeek-R1 · Distributed Training · GPU Optimization

0 likes · 39 min read

NewBeeNLP

Feb 27, 2025 · Industry Insights

How DeepSeek’s Open‑Source Tools Exploit China‑Specific H800 GPUs to Boost AI Performance

The article analyzes DeepSeek’s three open‑source projects—FlashMLA, DeepEP, and DeepGEMM—showing how they optimize for the China‑only NVIDIA H800 GPU, contrast this with the abundant hardware resources of Western AI firms, and highlight the growing demand for talent that masters both AI models and GPU hardware.

AI hardware · DeepEP · DeepGEMM

0 likes · 7 min read

AIWalker

Feb 25, 2025 · Artificial Intelligence

Sliding Tile Attention speeds up HunyuanVideo DiT generation 3.5×

Sliding Tile Attention (STA) replaces costly full‑3D attention in video DiT models with a block‑wise sliding‑window scheme, achieving up to 10× attention speedup and a 3.53× end‑to‑end generation boost for HunyuanVideo without quality loss, as demonstrated by extensive benchmarks and kernel analyses.

Deep Learning · GPU Optimization · HunyuanVideo

0 likes · 16 min read

AI Algorithm Path

Feb 24, 2025 · Artificial Intelligence

Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open‑source GPU kernel optimized for Nvidia Hopper GPUs that compresses the KV cache of multi‑head attention, cutting memory usage by up to 93.3% and delivering 580 TFLOPS compute, thereby dramatically accelerating large‑language‑model inference while lowering cost.

DeepSeek · Flash-MLA · GPU Optimization

0 likes · 8 min read

Ops Development & AI Practice

Feb 16, 2025 · Artificial Intelligence

Why FlashAttention Supercharges Qwen Models: A Technical Deep Dive

This article explains the FlashAttention algorithm, its memory‑efficient tiling and recomputation techniques, and how enabling the flash_attn flag dramatically speeds up Qwen‑series large models while outlining hardware, software requirements and potential trade‑offs.

FlashAttention · GPU Optimization · PyTorch

0 likes · 8 min read

Architect's Alchemy Furnace

Feb 8, 2025 · Artificial Intelligence

How to Choose the Right Hardware for AI Models from 1.5B to 671B

This guide outlines the hardware requirements for AI models ranging from lightweight 1.5 B parameters to massive 671 B models, detailing CPU cores, memory, GPU recommendations, storage needs, optimization tips, deployment suggestions, and suitable application scenarios.

AI hardware · DeepSeek · Deployment

0 likes · 5 min read

DataFunSummit

Jan 21, 2025 · Artificial Intelligence

NVIDIA NeMo Full Stack: End‑to‑End Large Language Model Training, Alignment, and RLHF

This article presents NVIDIA's NeMo technology stack for end‑to‑end large language model (LLM) training, covering the full software pipeline, model alignment with reinforcement learning from human feedback (RLHF), performance optimizations such as model parallelism, FP8, TensorRT‑LLM inference, dynamic load balancing, and future research directions.

Distributed Training · GPU Optimization · LLM

0 likes · 24 min read

JavaEdge

Nov 20, 2024 · Artificial Intelligence

7 Proven Strategies to Simplify Large Language Model Deployment

The article explains why deploying large language models is challenging and presents seven practical techniques—including defining deployment boundaries, model quantization, inference optimization, infrastructure consolidation, model replacement planning, GPU utilization, and using smaller models—to make LLM deployment more efficient and cost‑effective.

GPU Optimization · LLM deployment · Model Scaling

0 likes · 24 min read

Alibaba Cloud Infrastructure

Nov 8, 2024 · Industry Insights

Unlocking Efficient LLM Inference: Insights from China’s Cloud Computing Conference

The 5th China Cloud Computing Infrastructure Developer Conference in Beijing highlighted cutting‑edge AI inference optimization, Knative‑based serverless acceleration, AMD PMU virtualization, and CDI‑driven GPU management, offering detailed technical insights and real‑world case studies that illustrate how cloud providers are tackling performance and cost challenges of modern workloads.

AI inference · AMD virtualization · Cloud Native

0 likes · 9 min read

Sohu Tech Products

Oct 18, 2024 · Artificial Intelligence

Optimizing AI Inference with TorchServe: Tackling GPU Bottlenecks & Kubernetes

This article details a comprehensive engineering practice for optimizing AI inference services at ZhiZhuan, covering background analysis, selection of TorchServe over alternatives, GPU/CPU performance tuning, custom handlers, Torch‑TRT integration, and deployment on Kubernetes, with measured improvements in throughput and resource utilization.

AI inference · GPU Optimization · Kubernetes

0 likes · 16 min read

DataFunSummit

Oct 5, 2024 · Artificial Intelligence

Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch

This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.

CUDA Graph · Distributed Training · GPU Optimization

0 likes · 11 min read

JD Retail Technology

Aug 30, 2024 · Artificial Intelligence

GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.

Distributed Systems · GPU Optimization · Inference

0 likes · 13 min read

Baidu Geek Talk

Aug 26, 2024 · Artificial Intelligence

RLHF Performance Optimization: PPO Algorithm Acceleration Techniques

The article presents three RLHF‑PPO acceleration techniques—TRT‑LLM‑based text generation speedups, selective activation recomputation with sequence parallelism for dynamic memory reduction, and overlapping pipeline stages for system‑level parallelism—demonstrating a 350 % throughput boost on a 10 B model using 16 A100 GPUs.

Distributed Training · GPU Optimization · Large Language Models

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Aug 22, 2024 · Artificial Intelligence

How RECom Accelerates Recommendation Model Inference on GPUs

The RECom compiler introduces a subgraph‑parallel fusion technique and symbolic shape handling to dramatically speed up GPU inference of deep recommendation models with massive embedding columns, achieving up to 6.61× lower latency and 1.91× higher throughput than TensorFlow baselines, while eliminating redundant computations.

GPU Optimization · Recommendation Systems · compiler

0 likes · 10 min read

Baidu Tech Salon

Aug 20, 2024 · Artificial Intelligence

PaddlePaddle Neural Network Compiler (CINN): Architecture, Optimization Techniques, and Performance

The PaddlePaddle Neural Network Compiler (CINN) combines a PIR‑based frontend and a hardware‑specific backend to apply graph‑level optimizations, operator fusion, schedule transformations and automatic tuning, delivering up to 4× faster kernels and 30‑60% overall speed‑ups for deep‑learning and scientific workloads.

CINN · GPU Optimization · Operator fusion

0 likes · 19 min read

58UXD

Jun 13, 2024 · Artificial Intelligence

Why ComfyUI Is the Fast, Flexible Choice Over WebUI for Stable Diffusion

This article explains what ComfyUI is, how its node‑based workflow mirrors the underlying Stable Diffusion architecture, and why it outperforms WebUI in speed, GPU usage, real‑time preview, and workflow reuse, while also offering practical tips for new users.

AI image generation · ComfyUI · GPU Optimization

0 likes · 9 min read

JD Tech

Mar 18, 2024 · Artificial Intelligence

High‑Performance Inference Architecture: Distributed Graph Heterogeneous Computing Framework and GPU Multi‑Stream Optimization

The article describes how JD’s advertising team tackled the high‑concurrency, low‑latency challenges of online recommendation inference by designing a distributed graph heterogeneous computing framework, optimizing GPU kernel launches with TensorBatch, deep‑learning compiler techniques, and a multi‑stream GPU architecture, achieving significant throughput and latency improvements.

AI inference · Deep Learning Compiler · GPU Optimization

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Feb 23, 2024 · Artificial Intelligence

How PAI‑TorchAcc Supercharges Large‑Model Training on Alibaba Cloud

PAI‑TorchAcc, an Alibaba Cloud AI platform accelerator, offers a seamless PyTorch interface that integrates HuggingFace models and employs LazyTensor‑based static graph conversion, multi‑strategy distributed training, and extensive GPU optimizations to dramatically boost throughput for 1B‑175B parameter models, surpassing PyTorch native and Megatron‑LM performance.

AI acceleration · Alibaba Cloud · GPU Optimization

0 likes · 13 min read

Baidu Geek Talk

Dec 19, 2023 · Industry Insights

Inside Baidu Search Innovation Contest: Winning AI Solutions Across Five Tracks

The second Baidu Search Innovation Contest attracted over 2,800 participants from 45 regions, featured five AI‑focused tracks, and highlighted champion teams that employed techniques such as LoRA‑fine‑tuned LLMs, vector‑intersection Top‑K search, GPU‑optimized algorithms, and diffusion‑based image generation to push the boundaries of search technology.

AI competition · Diffusion Models · GPU Optimization

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Dec 11, 2023 · Artificial Intelligence

How PAI‑Blade Supercharges PyTorch Training with Up to 41% Speedup

This article explains how PAI‑Blade uses compiler optimizations, TorchDynamo, MHLO conversion, and aggressive kernel fusion to accelerate PyTorch training, provides simple two‑line integration code, showcases benchmark results on A10 and A100 GPUs, and details deployment steps on PAI‑DSW.

BladeDISC · GPU Optimization · PAI-Blade

0 likes · 8 min read

DataFunTalk

Dec 1, 2023 · Artificial Intelligence

GPU‑Driven Model Service and Optimization Practices in Xiaohongshu's Search Scenario

This article details Xiaohongshu's end‑to‑end GPU‑centric transformation for search‑related machine‑learning models, covering model characteristics, training and inference frameworks, system‑level GPU and CPU optimizations, multi‑card and compilation techniques, and future directions for scaling large sparse and dense models.

GPU Optimization · Inference · Model Serving

0 likes · 16 min read

Baidu Tech Salon

Nov 10, 2023 · Artificial Intelligence

Baidu Search Deep Learning Model Architecture and Optimization Practices

Baidu's Search Architecture team details how its deep‑learning models have evolved to deliver direct answer results via semantic embeddings, describes a massive online inference pipeline that rewrites queries, ranks relevance, and classifies types, and outlines optimization techniques—including data I/O, CPU/GPU balancing, pruning, quantization, and distillation—to achieve high‑throughput, low‑latency search.

Baidu · GPU Optimization · Inference System

0 likes · 13 min read

Alimama Tech

Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open‑source high‑performance training framework for LLaMA models, offering tensor, pipeline, and sequence parallelism, an overlapped optimizer, and near‑linear scalability, achieving up to 176% speedup on 32 GPUs and robust performance even with limited network bandwidth.

DeepSpeed · Distributed Training · GPU Optimization

0 likes · 10 min read

Xiaohongshu Tech REDtech

May 15, 2023 · Artificial Intelligence

GPU-Accelerated Inference Optimization for Large-Scale Machine Learning at Xiaohongshu

Xiaohongshu transformed its recommendation, advertising, and search inference pipeline by migrating to GPU‑centric hardware, deploying a custom TensorFlow‑Core Lambda service, and applying system‑level, virtualization, and compute‑level optimizations—including NUMA binding, kernel fusion, dynamic scaling, and FP16 quantization—achieving roughly 30× compute capacity growth, over 10% user‑metric gains, and more than 50% cluster‑resource savings.

GPU Optimization · Hardware acceleration · Machine Learning Inference

0 likes · 20 min read

Baidu Intelligent Cloud Tech Hub

Feb 23, 2023 · Artificial Intelligence

How Baidu’s Cloud Infrastructure Tackles the Challenges of Training Massive AI Models

This article explains how Baidu's intelligent cloud overcomes the compute and storage walls of large‑scale model training by combining hardware design, network topology, and software optimizations such as pipeline, tensor, and expert parallelism, cost‑model‑driven placement, and future‑proof AI infrastructure evolution.

AI Infrastructure · Baidu Cloud · Cost Model

0 likes · 28 min read

Baidu Intelligent Cloud Tech Hub

Dec 29, 2022 · Artificial Intelligence

Boost Swin Transformer Speed: Profiling, Mixed Precision, and Operator Fusion Techniques

This article details how to use NVIDIA profiling tools, mixed‑precision training, operator fusion, kernel optimizations, and INT8 quantization to identify and eliminate performance bottlenecks in Swin Transformer models, achieving up to 2.85× training speedup and up to 7.34× inference acceleration on modern GPUs.

AI Performance · GPU Optimization · Operator fusion

0 likes · 23 min read

Alibaba Cloud Big Data AI Platform

Dec 9, 2022 · Artificial Intelligence

What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization

BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive GPU memory‑intensive compute enhancements, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.

BladeDISC · GPU Optimization · MLIR

0 likes · 11 min read

Alimama Tech

Oct 26, 2022 · Artificial Intelligence

GPU Utilization Analysis and Optimization for Alibaba's Intelligent Creative Video Service

The paper analyzes why Alimama’s intelligent creative video service suffers from low GPU utilization (Python GIL blocking, lack of kernel fusion, and serialized CUDA streams) and details service‑level changes (separate CPU/GPU processes, shared‑memory queues, priority scheduling) and operator‑level kernel‑fusion techniques (channels‑last layouts, custom pooling, TensorRT conversion) that raise utilization from ~30 % to near 100 % and boost throughput by 75 %.

Deep Learning · GPU Optimization · Python

0 likes · 20 min read


Alimama Tech

May 11, 2022 · Artificial Intelligence

PICASSO: An Industrial-Scale Sparse Training Engine for Wide-and-Deep Recommender Systems

PICASSO, Alibaba’s GPU‑centric sparse training engine for wide‑and‑deep recommender systems, merges identical embedding tables, interleaves data and kernel operations, and caches hot embeddings on GPU, eliminating the parameter server and delivering up to tenfold speedups over TensorFlow‑PS while maintaining model quality.

Alibaba · GPU Optimization · machine learning

0 likes · 14 min read


Alibaba Cloud Developer

Aug 17, 2021 · Artificial Intelligence

How Alibaba’s Whale Framework Cuts Large‑Model Training Costs by 80%

Alibaba Cloud’s PAI team and the DAMO Academy introduced the low‑carbon M6 trillion‑parameter multimodal model, demonstrating that their self‑developed Whale framework can train such massive models on just 480 V100 GPUs, reducing energy consumption by over 80% and boosting training efficiency nearly eleven‑fold.

AI · Distributed Training · GPU Optimization

0 likes · 12 min read


Tencent Architect

Aug 4, 2021 · Artificial Intelligence

How We Accelerated Feature Hashing for Ad Ranking on GPUs

This article explains how Tencent's Light platform reduced the massive overhead of feature hashing in ad‑ranking by moving integer‑to‑string conversion and hash computation to the GPU, introducing custom contiguous string tensors, and achieving up to 12× speed‑up on V100 GPUs.

GPU Optimization · TensorFlow · ad ranking

0 likes · 14 min read


Kuaishou Large Model

Jul 30, 2021 · Fundamentals

How QuanTaichi Cuts GPU Memory Needs for High‑Fidelity Physics Simulations

QuanTaichi introduces a new language abstraction and compiler system that quantizes simulation data, dramatically reducing memory and bandwidth usage so that high‑precision physical effects—once requiring multiple GPUs—can now run on a single GPU, even on mobile devices.

GPU Optimization · Graphics · Taichi

0 likes · 12 min read


DataFunTalk

Mar 25, 2021 · Artificial Intelligence

Optimizing MNN Mobile Neural Network Inference on GPU with OpenCL: Memory Objects, Work‑Group Tuning, and Auto‑Tuning

This article explains how the MNN deep‑learning framework leverages OpenCL to achieve high‑performance inference on mobile, PC and embedded GPUs by diversifying memory objects, aligning data, using local‑memory reductions, selecting optimal work‑group sizes, applying pre‑inference auto‑tuning, caching compiled programs, and providing practical GPU‑friendly model design guidelines.

GPU Optimization · MNN · OpenCL

0 likes · 20 min read


360 Smart Cloud

Mar 4, 2021 · Artificial Intelligence

Optimizing BERT Online Service Deployment at 360 Search

This article describes the challenges of deploying a large BERT model as an online service for 360 Search and details engineering optimizations—including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling—that dramatically improve latency, throughput, and resource utilization.

BERT · FP16 quantization · GPU Optimization

0 likes · 12 min read


Sohu Tech Products

Dec 24, 2020 · Mobile Development

Reducing Frame Rate in iOS Animations to Lower GPU Usage

The article explains why lowering the frame rate of iOS animations can trade a slight loss in visual smoothness for significant GPU load reduction, describes the Core Animation rendering pipeline, compares different frame‑rate reduction techniques, and presents test results showing the impact on CPU, GPU, and overall app performance.

CADisplayLink · Frame Rate · GPU Optimization

0 likes · 11 min read


iQIYI Technical Product Team

Jul 3, 2020 · Artificial Intelligence

Optimizing Video Inference Services for High GPU Utilization in AI Applications

By moving decoding, color conversion, preprocessing, inference, and re‑encoding entirely onto the GPU and enabling batch processing with flexible Python scripts, iQIYI’s video‑image enhancement service achieved ten‑fold throughput, over 90 % GPU utilization, and dramatically lower resource use, accelerating AI video inference deployment.

AI deployment · DeepStream · GPU Optimization

0 likes · 14 min read
