Tagged articles
12 articles
HyperAI Super Neural
Dec 17, 2025 · Artificial Intelligence

Can cuTile’s Tile Paradigm Disrupt the GPU Programming Landscape and Challenge Triton?

The article analyzes NVIDIA's newly announced cuTile, a tile‑based Python DSL for GPU kernels, examining its technical differences from CUDA's SIMT model, its potential to reshape the GPU programming ecosystem, community reactions, competition with Triton, and the uncertain future that hinges on ecosystem maturity and migration tools.

AI workloads, CUDA, GPU programming
12 min read
DeWu Technology
Jan 13, 2025 · Artificial Intelligence

Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE

This article introduces Triton—a Python‑based GPU programming language—covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding, compares performance with PyTorch, and provides debugging tips for high‑performance deep‑learning workloads.

CUDA, Deep Learning, GPU programming
22 min read
Alibaba Cloud Infrastructure
Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inference, KServe, Kubernetes
13 min read
DataFunTalk
Jan 26, 2024 · Artificial Intelligence

Efficient Deployment of Speech AI Models on GPUs

This article explains how to efficiently deploy speech AI models—including ASR and TTS—on GPUs using NVIDIA's Triton Inference Server and TensorRT, covering background challenges, GPU‑based solutions, decoding optimizations, Whisper acceleration with TensorRT‑LLM, streaming TTS improvements, voice‑cloning pipelines, future plans, and a Q&A session.

ASR, GPU, Inference
20 min read
DataFunSummit
Sep 8, 2023 · Artificial Intelligence

AI Compiler Forum at DataFun Summit 2023: Tile-Based Deep Learning Compilation, Graph Scheduling for Domain‑Specific Accelerators, and Triton on Hopper

The DataFun Summit 2023 AI Compiler Forum gathered leading researchers to present cutting‑edge techniques on tile‑based deep learning compilation, efficient graph scheduling for domain‑specific accelerators, large‑model deployment, and the latest advancements of OpenAI Triton on NVIDIA Hopper, offering practical insights for AI system developers.

AI compiler, Graph Scheduling, Hardware acceleration
8 min read
High Availability Architecture
Jun 15, 2023 · Artificial Intelligence

InferX Inference Framework: Challenges, Architecture, Optimizations, and Triton Integration

The article presents the background, challenges, and objectives of Bilibili's AI services, introduces the self‑developed InferX inference framework with its quantization and sparsity optimizations, details OCR‑specific enhancements, and describes how integrating InferX with NVIDIA Triton Inference Server dramatically improves throughput, latency, and GPU utilization.

AI Optimization, CUDA, Inference
10 min read
Bilibili Tech
Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inference, GPU utilization, InferX
10 min read
Meituan Technology Team
Feb 9, 2023 · Backend Development

Efficient Deployment Architecture for Visual Inference Services: GPU Utilization Optimization

Meituan Visual's engineering team tackled the common low‑GPU‑utilization bottleneck in online inference services by splitting model structures and adopting micro‑service deployment, raising GPU usage from 40% to 100% and more than tripling QPS, and then generalized the approach for other GPU‑based services.

GPU, Microservices, Performance Optimization
21 min read
Zuoyebang Tech Team
Nov 17, 2022 · Artificial Intelligence

Scaling Deep Learning Model Serving: High‑Concurrency, Low‑Latency Solutions

This article examines the challenges of deploying dozens of deep‑learning models at Zuoyebang and compares three serving architectures—Gunicorn + Flask + Transformers, Tornado + PyTorch, and Tornado + Triton—highlighting performance trade‑offs and presenting a final high‑concurrency, low‑latency solution in production.

Deep Learning, Low latency, Model Deployment
11 min read