Tagged articles

12 articles

Dec 17, 2025 · Artificial Intelligence

Can cuTile’s Tile Paradigm Disrupt the GPU Programming Landscape and Challenge Triton?

The article analyzes NVIDIA's newly announced cuTile, a tile‑based Python DSL for GPU kernels, examining its technical differences from CUDA's SIMT model, its potential to reshape the GPU programming ecosystem, community reactions, competition with Triton, and the uncertain future that hinges on ecosystem maturity and migration tools.

AI workloads · CUDA · GPU programming

12 min read

Alibaba Cloud Native

Oct 17, 2025 · Artificial Intelligence

How We Boosted Embedding Service Throughput 16× with Cloud‑Native Optimizations

This article details the cost and speed challenges of generating embedding vectors for large‑scale log data, analyzes inference framework choices, describes improvements to GPU utilization, priority queuing, and the processing pipeline, and reports a 16‑fold throughput increase and dramatically lower per‑request costs.

Embedding · GPU Optimization · Throughput

8 min read

Network Intelligence Research Center (NIRC)

Jul 15, 2025 · Fundamentals

How to Write High‑Performance GPU Code with OpenAI Triton

This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.

CUDA · GPU programming · PyTorch

10 min read

Baobao Algorithm Notes

Feb 25, 2025 · Artificial Intelligence

FlashMLA vs FlashInfer: DeepSeek Inference Performance Benchmarks Revealed

The author benchmarks DeepSeek's FlashMLA against FlashInfer and several Triton-based implementations, detailing setup challenges, decode‑only bandwidth results, and observations that the official DeepSeek version leads while Triton optimizations show mixed performance across different head sizes.

AI · Benchmark · DeepSeek

6 min read

DeWu Technology

Jan 13, 2025 · Artificial Intelligence

Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE

This article introduces Triton—a Python‑based GPU programming language—covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding, compares performance with PyTorch, and provides debugging tips for high‑performance deep‑learning workloads.

CUDA · Deep Learning · GPU programming

22 min read

Alibaba Cloud Infrastructure

Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inference · KServe · Kubernetes

13 min read

DataFunTalk

Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR · GPU · Inference

20 min read

DataFunSummit

Sep 8, 2023 · Artificial Intelligence

AI Compiler Forum at DataFun Summit 2023: Tile-Based Deep Learning Compilation, Graph Scheduling for Domain‑Specific Accelerators, and Triton on Hopper

The DataFun Summit 2023 AI Compiler Forum gathered leading researchers to present cutting‑edge techniques on tile‑based deep learning compilation, efficient graph scheduling for domain‑specific accelerators, large‑model deployment, and the latest advancements of OpenAI Triton on NVIDIA Hopper, offering practical insights for AI system developers.

AI compiler · Graph Scheduling · Hardware acceleration

8 min read

High Availability Architecture

Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with NVIDIA Triton dramatically improves throughput, latency, and GPU utilization.

AI Optimization · CUDA · Inference

10 min read

Bilibili Tech

Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inference · GPU utilization · InferX

10 min read

Meituan Technology Team

Feb 9, 2023 · Backend Development

Efficient Deployment Architecture for Visual Inference Services: GPU Utilization Optimization

Meituan Visual's engineering team tackled the common low‑GPU‑utilization bottleneck in online inference services by splitting model structures and adopting micro‑service deployment, raising GPU usage from 40% to 100% and more than tripling QPS, and then generalized the approach for other GPU‑based services.

GPU · Microservices · Performance Optimization

21 min read

Zuoyebang Tech Team

Nov 17, 2022 · Artificial Intelligence

Scaling Deep Learning Model Serving: High‑Concurrency, Low‑Latency Solutions

This article examines the challenges of deploying dozens of deep‑learning models at Zuoyebang and compares three serving architectures—Gunicorn + Flask + Transformers, Tornado + PyTorch, and Tornado + Triton—highlighting performance trade‑offs and presenting a final high‑concurrency, low‑latency solution in production.

Deep Learning · Low latency · Model Deployment

11 min read
