Tagged articles

CUDA optimization

Jul 4, 2026 · Backend Development

How SGLang Encoded Engineering Experience into Agents and Achieved Up to 2.75× Kernel Speedups

The SGLang team turned its benchmarking, profiling, CUDA kernel tuning, and production‑issue triage know‑how into reusable agent skills and merged three KDA‑Pilot PRs that delivered up to 2.75× kernel speedups, a 71.4% throughput gain for Qwen3‑Next, and a time‑to‑first‑token (TTFT) reduction from 456 ms to 168 ms. The team also outlines a repeatable workflow and practical rules for large‑scale performance engineering.

CUDA optimization · LLM serving · SGLang

16 min read

Tech Musings

Mar 6, 2026 · Artificial Intelligence

How to Deploy Qwen3-8B on WSL2 with 4‑Bit Quantization and Resource Limits

This step‑by‑step guide covers setting up the Qwen3‑8B large language model on Windows 11 with WSL2: hardware specs, CUDA configuration, 4‑bit quantization with BitsAndBytes, SDPA attention optimization, CPU offload, and resource‑limiting tricks for smooth inference.

4-bit quantization · CUDA optimization · PyTorch

10 min read

Refining Core Development Skills

Aug 26, 2025 · Fundamentals

How NVIDIA’s Fermi Architecture Revolutionized GPU Computing: Key Improvements Explained

Fermi, NVIDIA’s 2010 GPU architecture, introduced major upgrades over the preceding Tesla architecture, including a 40 nm process, a vastly increased transistor count, GDDR5 memory, an L2 cache, enhanced FP64 performance, ECC support, and unified CPU‑GPU addressing. The article argues that these changes made it the first truly complete GPU computing platform.

CUDA optimization · ECC Memory · FP64 performance

12 min read

DataFunSummit

Jul 4, 2023 · Artificial Intelligence

PPL: A Full‑Platform Deep Learning Deployment Framework by SenseTime

The article presents SenseTime's PPL framework, covering its toolchain, inference engine, multi‑backend operator library, quantization tools, CUDA optimizations, and performance benchmarks across CPUs, GPUs, DSPs, and DSAs. It also outlines future plans for broader chip support and AI for Science.

AI inference · CUDA optimization · Cross-Platform

23 min read
