From CPUs to GPUs: How Traditional Backend Skills Power Modern AI Infrastructure
This article traces the evolution of AI infrastructure and compares it with traditional backend systems: the hardware shift to GPU-centric designs, the software shift to deep-learning frameworks, and the engineering challenges of model training and inference, showing how each can be addressed with established backend methodologies.
How does AI Infra differ from traditional infra, and how can programmers reuse their existing tech stack and methodology when designing AI system architectures?
With the explosion of large‑model technology, AI infrastructure has become a core battlefield. Over the past year the QQ algorithm engineering team has deployed several large‑model applications (speech synthesis, multimodal content understanding, generative recommendation), completing the full training‑to‑inference pipeline and accumulating many lessons.
1. Hardware Evolution
Hardware constraints shape software architecture, so understanding modern AI hardware is essential.
1.1 From CPU‑centric to GPU‑centric
Traditional infra is CPU-centric: it handles logical transactions, with bottlenecks in network I/O and the number of CPU cores. AI Infra shifts the focus to high-throughput floating-point computation on GPUs: GPU parallelism takes over the role of CPU multithreading, and VRAM takes over the role of main memory. A single H20 GPU offers 96 GB of VRAM and 44 TFLOPS of FP32 compute, dozens of times more than a mainstream CPU.
Because each token generation in an LLM requires reading the full model parameters, CPU‑memory bandwidth cannot meet the required compute density; GPUs become the primary compute engine while CPUs act as data movers.
For example, generating one token with the DeepSeek‑R1‑671B‑A37B‑FP8 model takes about 9 ms on H20 (4000 GB/s memory bandwidth) versus 578 ms on a CPU (64 GB/s).
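As a rough sanity check of these numbers, here is a back-of-the-envelope calculation. It assumes roughly 37 GB of active weights must be read per token (DeepSeek-R1's ~37B activated parameters stored in FP8, one byte each) and uses the bandwidth figures quoted above; real deployments add overheads this ignores.

```python
# Back-of-the-envelope check: decoding is memory-bandwidth bound, so
# per-token latency ≈ bytes of active weights read / memory bandwidth.
# Assumes ~37B activated parameters at FP8 (1 byte each) per token.
active_weight_bytes = 37e9     # ~37 GB of active expert weights (FP8)
h20_bandwidth = 4000e9         # H20 HBM bandwidth, ~4000 GB/s
cpu_bandwidth = 64e9           # typical server DRAM bandwidth, ~64 GB/s

print(f"GPU: {active_weight_bytes / h20_bandwidth * 1e3:.1f} ms/token")  # ≈ 9.3 ms
print(f"CPU: {active_weight_bytes / cpu_bandwidth * 1e3:.1f} ms/token")  # ≈ 578 ms
```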
1.2 From “de‑IOE” to “AI Mainframe”
Specialized hardware and networks are emerging. Training models with hundreds of billions of parameters, such as DeepSeek-R1 and Qwen3-235B, requires GPU clusters interconnected by dedicated networks, forming an "AI supercomputer" reminiscent of IBM mainframes: centralized hardware in pursuit of extreme performance and reliability.
Some traditional distributed-infra assumptions lose relevance here: AI workloads demand microsecond-level communication latency, so centralized, high-performance hardware will remain essential for at least the next one to three years.
History shows a familiar pattern: the "de-IOE" movement replaced expensive mainframes with clusters of cheap x86 servers; similarly, we anticipate a longer-term "de-NVIDIA" trend.
2. Software Evolution
Hardware is only half the story; the software stack evolves alongside it.
2.1 Deep Learning Frameworks
Just as backend services rely on frameworks like tRPC or Spring, AI applications depend on deep‑learning frameworks. PyTorch has become the de‑facto standard for model training and inference, offering dynamic computation graphs, automatic differentiation, and rich tensor operators.
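A minimal sketch of what the framework provides: PyTorch builds the computation graph dynamically as operations run and then differentiates through it automatically. The shapes here are arbitrary.

```python
import torch

# Minimal PyTorch sketch: dynamic graph + automatic differentiation.
x = torch.randn(4, 3, requires_grad=True)
w = torch.randn(3, 2, requires_grad=True)

y = (x @ w).relu().sum()   # the graph is built on the fly as ops execute
y.backward()               # autograd walks the graph and fills .grad

print(w.grad.shape)        # torch.Size([3, 2])
```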
2.2 GPU Programming
While most AI apps avoid hand‑written GPU kernels, custom kernels can dramatically improve performance. For example, Meta’s HSTU recommendation model reduces complexity from O(N³) to O(N²) with a custom kernel.
GPU kernels run under the SIMT model, where many threads execute the same instruction simultaneously, posing a steep learning curve for developers accustomed to CPU multithreading.
We recommend using the Triton language, which provides Python‑like syntax for writing GPU kernels without deep hardware knowledge.
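For illustration, here is a minimal Triton kernel (an element-wise add, essentially the standard vector-add tutorial): the syntax is Python-like, but each program instance still processes a block of data in SIMT fashion. The block size and shapes are arbitrary choices, not tuned values.

```python
import torch
import triton
import triton.language as tl

# A minimal Triton kernel (element-wise add) illustrating the Python-like
# SIMT programming model; not an optimized production kernel.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # which block of elements we own
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element indices for this block
    mask = offs < n                           # guard the tail
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
```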
2.3 Python Programming
Python is the primary language for AI Infra. Although models could previously be exported for C++ deployment via ONNX or TorchScript, modern optimizations such as KV-Cache, MoE, and custom Triton ops keep Python at the core of the deployment stack.
3. Challenges in Model Training
3.1 Memory Capacity
Even the 768 GB of VRAM on an 8-card server (8 × 96 GB) barely fits a ~670 GB DeepSeek-R1 checkpoint; larger models require distributed GPU clusters.
3.1.1 Activation Memory: The "Memory Assassin"
Intermediate activations, which must be kept for back-propagation, grow quadratically with input length (attention score matrices scale with the square of the sequence length), so long inputs easily trigger OOM during training.
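A rough illustration of why activations become a "memory assassin", assuming naive (non-fused) attention and illustrative hyperparameters (32 heads, fp16) that do not correspond to any specific model:

```python
# Rough estimate of attention-score activations for one layer, one sample:
# scores have shape [heads, seq_len, seq_len], so memory grows with seq_len².
def attn_score_bytes(seq_len, heads=32, bytes_per_elem=2):
    return heads * seq_len * seq_len * bytes_per_elem

for seq_len in (1_024, 8_192, 32_768):
    gb = attn_score_bytes(seq_len) / 1e9
    print(f"seq_len={seq_len:>6}: ~{gb:,.1f} GB per layer")
# 1k → ~0.07 GB, 8k → ~4.3 GB, 32k → ~68.7 GB: a quadratic blow-up
```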
3.1.2 Model Parallelism
Model parallelism splits a large model across multiple GPUs, similar to sharding in traditional services. Frameworks like Megatron-LM and PyTorch support a range of parallelism strategies, as in the sketch below.
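A toy sketch of the idea behind tensor (model) parallelism, with both "shards" kept on one device for clarity; a real setup (e.g. Megatron-style column-parallel linear layers) places each shard on a different GPU and gathers the results with collectives.

```python
import torch

# Minimal tensor-parallel sketch (column-parallel linear layer).
torch.manual_seed(0)
x = torch.randn(8, 1024)          # [batch, hidden]
w = torch.randn(1024, 4096)       # full weight matrix

w0, w1 = w.chunk(2, dim=1)        # each "GPU" holds half of the output columns
y0, y1 = x @ w0, x @ w1           # each shard computes its slice independently
y = torch.cat([y0, y1], dim=1)    # all-gather along the column dimension

assert torch.allclose(y, x @ w, atol=1e-5)
```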
3.2 Speeding Up Training
3.2.1 Overlapping Communication and Computation
By assigning compute and communication to separate CUDA streams, their execution can overlap, reducing idle GPU time. Projects like TorchRec’s training pipeline implement this technique.
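A minimal sketch of the stream-overlap idea in PyTorch, using a device-to-device copy as a stand-in for a collective such as all_reduce; this is not TorchRec's actual pipeline, just the underlying mechanism.

```python
import torch

# Compute/communication overlap with separate CUDA streams.
# A device-to-device copy stands in for a real collective (e.g. all_reduce):
# issued on its own stream, it runs while the default stream keeps computing.
comm_stream = torch.cuda.Stream()

a = torch.randn(4096, 4096, device="cuda")
grads = torch.randn(4096, 4096, device="cuda")
received = torch.empty_like(grads)

with torch.cuda.stream(comm_stream):
    received.copy_(grads, non_blocking=True)          # "communication" on side stream

b = a @ a                                             # compute on the default stream
torch.cuda.current_stream().wait_stream(comm_stream)  # sync before consuming 'received'
torch.cuda.synchronize()
```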
4. Challenges in Model Inference
4.1 Reducing Latency
4.1.1 CUDA Graph
CUDA Graph captures a sequence of GPU operations into a single DAG that can be replayed with one launch, cutting per-operation CPU-GPU interaction overhead.
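A minimal CUDA Graph sketch following the pattern from the PyTorch documentation: warm up on a side stream, capture the forward pass once, then replay it. The tiny model and shapes are placeholders.

```python
import torch

# Capture a small forward pass into a CUDA Graph, then replay the whole DAG
# with a single launch instead of many per-op CPU->GPU submissions.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda().eval()

static_in = torch.randn(8, 1024, device="cuda")        # capture uses fixed buffers

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):                                 # warm-up before capture
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 1024, device="cuda"))   # new input, same buffer
g.replay()                                             # one replay of all captured kernels
print(static_out.shape)
```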
4.1.2 KV‑Cache (Space‑for‑Time)
LLM inference repeats many matrix multiplications across decoding steps; caching the key and value projections of already-generated tokens (KV-Cache) avoids recomputing them, trading VRAM for a dramatic drop in compute cost.
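A toy single-head sketch of the mechanism (dimensions and weights are arbitrary): each decoding step computes projections only for the newest token and reuses the cached keys/values for the prefix.

```python
import torch

# Toy KV-Cache for single-head attention during autoregressive decoding.
d = 64
k_cache, v_cache = [], []

def decode_step(x_t, wq, wk, wv):
    """x_t: [1, d] hidden state of the newest token only."""
    q, k, v = x_t @ wq, x_t @ wk, x_t @ wv
    k_cache.append(k)                      # reuse all previous K/V instead of
    v_cache.append(v)                      # recomputing them for the whole prefix
    K = torch.cat(k_cache, dim=0)          # [t, d]
    V = torch.cat(v_cache, dim=0)
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V                        # [1, d]

wq, wk, wv = (torch.randn(d, d) for _ in range(3))
for _ in range(5):                         # five decoding steps
    out = decode_step(torch.randn(1, d), wq, wk, wv)
print(out.shape)                           # torch.Size([1, 64])
```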
4.1.3 Streaming Response
Instead of waiting for the full output, the system streams the first token or audio frame to the user, improving perceived latency.
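A toy sketch of the idea: a generator yields chunks as soon as they are produced, so the first token reaches the user after one decode step rather than after the whole sequence. fake_generate is a made-up stand-in for a real decode loop.

```python
import time

# Streaming sketch: yield tokens as they are generated instead of waiting
# for the full answer.
def fake_generate(prompt):
    for token in ["AI ", "Infra ", "is ", "backend ", "engineering."]:
        time.sleep(0.05)        # pretend each decode step takes 50 ms
        yield token             # first token arrives in ~50 ms, not ~250 ms

for chunk in fake_generate("what is AI infra?"):
    print(chunk, end="", flush=True)
print()
```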
4.2 Increasing Throughput
4.2.1 Traditional Batching
Batching groups multiple inputs into a single GPU kernel launch, similar to Redis MGet, improving GPU utilization.
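A minimal sketch of the effect, with arbitrary shapes: stacking requests turns many small kernel launches into one larger, better-utilized one, much like collapsing N GETs into a single MGet.

```python
import torch

# Batching sketch: one batched matmul vs. many per-request matmuls.
w = torch.randn(1024, 1024, device="cuda")
requests = [torch.randn(1, 1024, device="cuda") for _ in range(32)]

# Unbatched: 32 separate kernel launches
outs = [r @ w for r in requests]

# Batched: stack inputs and launch once
batch = torch.cat(requests, dim=0)   # [32, 1024]
outs_batched = batch @ w             # single kernel over the whole batch
```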
4.2.2 Continuous Batching
Continuous batching dynamically adds new requests to a running batch, akin to ride‑sharing, preventing GPU idle time caused by long requests.
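A toy scheduling loop that captures the idea (request IDs and token counts are made up): finished sequences leave the batch after each decode step and waiting requests immediately take their slots.

```python
from collections import deque

# Toy continuous-batching loop: admit new requests mid-flight, evict
# finished ones, so short requests never wait behind long ones.
MAX_BATCH = 4
waiting = deque([("req1", 3), ("req2", 8), ("req3", 2), ("req4", 6), ("req5", 1)])
running = {}   # request id -> remaining tokens to generate

step = 0
while waiting or running:
    while waiting and len(running) < MAX_BATCH:   # admit new requests mid-flight
        rid, remaining = waiting.popleft()
        running[rid] = remaining
    for rid in list(running):                     # one decode step for the batch
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]                      # free the slot immediately
    step += 1
print(f"finished in {step} decode steps")
```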
5. Conclusion
The engineering challenges of AI Infra—compute, storage, communication—are essentially modern versions of classic problems solved in traditional infrastructure. The main difference is the shift from CPU to GPU, allowing backend engineers to transfer their existing methodologies to AI systems.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.