Tagged articles

CUDA optimization

4 articles · Page 1 of 1
AI Engineering
Jul 4, 2026 · Backend Development

How SGLang Encoded Engineering Experience into Agents and Achieved Up to 2.75× Kernel Speedups

The SGLang team turned its benchmarking, profiling, CUDA kernel tuning, and production‑issue triage know‑how into reusable agent skills. The result was three merged KDA‑Pilot PRs, with up to 2.75× kernel acceleration, a 71.4% throughput boost for Qwen3‑Next, and a TTFT reduction from 456 ms to 168 ms. The article also outlines a repeatable workflow and practical rules for large‑scale performance engineering.

CUDA optimization · LLM serving · SGLang
0 likes · 16 min read
Tech Musings
Mar 6, 2026 · Artificial Intelligence

How to Deploy Qwen3-8B on WSL2 with 4‑Bit Quantization and Resource Limits

A step‑by‑step guide to setting up the Qwen3‑8B large language model on a Windows 11 system using WSL2, covering hardware specs, CUDA configuration, 4‑bit quantization with BitsAndBytes, SDPA attention optimization, CPU offload, and resource‑limiting tricks that keep inference smooth.

4-bit quantization · CUDA optimization · PyTorch
0 likes · 10 min read
Refining Core Development Skills
Aug 26, 2025 · Fundamentals

How NVIDIA’s Fermi Architecture Revolutionized GPU Computing: Key Improvements Explained

Fermi, NVIDIA’s 2010 GPU architecture, introduced major upgrades over the Tesla line—including a 40 nm process, vastly increased transistor count, GDDR5 memory, L2 cache, enhanced FP64 performance, ECC support, and unified CPU‑GPU addressing—making it the first truly complete GPU computing platform.

CUDA optimization · ECC Memory · FP64 performance
0 likes · 12 min read
DataFunSummit
Jul 4, 2023 · Artificial Intelligence

PPL: A Full‑Platform Deep Learning Deployment Framework by SenseTime

The article presents SenseTime's PPL framework, detailing its toolchain, inference engine, multi‑backend operator library, quantization tools, CUDA optimizations, and performance benchmarks across CPUs, GPUs, DSPs and DSAs. It also outlines future plans for broader chip support and AI for Science.

AI inference · CUDA optimization · Cross-Platform
0 likes · 23 min read