Applying GPU Technology for High‑Throughput Image Rendering in Shopee Off‑Platform Ads
The Shopee Off‑Platform Ads team built a GPU‑accelerated Creative Rendering System that uses a four‑layer architecture, CGO‑bridged C/C++ kernels, and template caching to process billions of product images daily. Under high concurrency it achieves a roughly ten‑fold speedup at about half the cost and a tenth of the rack space of the CPU solution.
This article presents the design and practical experience of the Shopee Off‑Platform Ads team in applying GPU technology to a large‑scale image rendering system (Creative Rendering System, CRS). The system processes billions of product images daily to generate advertisement creatives, addressing the high concurrency and heavy computation characteristics of the ad‑material pipeline.
1. GPU Background
A GPU (Graphics Processing Unit) is a massively parallel processor composed of hundreds to thousands of compute cores. Compared with a CPU, a GPU contains many SIMD units that can execute the same instruction on vectorized data, making it ideal for workloads such as image processing, matrix operations, and AI inference.
The article briefly explains GPU‑CPU communication via PCIe, the typical CPU/GPU program flow, and the CUDA software stack that provides driver‑level and application‑level APIs.
2. Business Background
Shopee’s Off‑Platform Ads place product images with promotional information on external platforms (e.g., Facebook, Google). These images must combine the original product picture with price, discount, and style information, requiring a fast, scalable rendering pipeline.
3. System Design
The CRS consists of two core modules: material rendering and template management. Templates are composed of multiple layers (image, shape, text) and are configured by operations staff. Rendering follows a four‑layer architecture:
System Access Layer – parameter validation, traffic control, reporting.
Business Logic Layer – preparation of rendering data and template handling.
C/Go Communication Layer – bridges Go (the main service language) with C/C++ GPU code via CGO.
Image Rendering Layer – performs GPU‑accelerated image processing (decoding, color conversion, scaling, alpha compositing).
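Of the operations in the Image Rendering Layer, alpha compositing is the one that actually merges a template layer with the product picture. The sketch below is a plain‑Go CPU reference for the standard "over" operator on non‑premultiplied 8‑bit RGBA pixels. It is not the team's GPU kernel; the names RGBA and Over are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// RGBA is a non-premultiplied 8-bit pixel (hypothetical type for this sketch).
type RGBA struct{ R, G, B, A uint8 }

// Over composites src on top of dst with the standard "over" operator:
//   outA = srcA + dstA*(1-srcA)
//   outC = (srcC*srcA + dstC*dstA*(1-srcA)) / outA
func Over(src, dst RGBA) RGBA {
	sa := float64(src.A) / 255
	da := float64(dst.A) / 255
	oa := sa + da*(1-sa)
	if oa == 0 {
		return RGBA{} // both pixels fully transparent
	}
	blend := func(s, d uint8) uint8 {
		c := (float64(s)*sa + float64(d)*da*(1-sa)) / oa
		return uint8(c + 0.5) // round to nearest
	}
	return RGBA{
		R: blend(src.R, dst.R),
		G: blend(src.G, dst.G),
		B: blend(src.B, dst.B),
		A: uint8(oa*255 + 0.5),
	}
}

func main() {
	badge := RGBA{R: 255, A: 128}   // half-transparent red overlay
	product := RGBA{B: 255, A: 255} // opaque blue product pixel
	fmt.Println(Over(badge, product))
}
```

A GPU kernel applies the same per‑pixel formula, with one thread per pixel instead of a loop over the image.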
Template management includes caching because the number of distinct templates is far smaller than the number of products. A delayed‑deletion strategy ensures safe removal of unused templates.
3.3.1 C/Go Communication
The team uses CGO to call C functions from Go. A simple example is shown below:
    package main

    /*
    #include <stdio.h>

    // printint prints an int value passed in from Go.
    void printint(int v) {
        printf("printint: %d\n", v);
    }
    */
    import "C"

    func main() {
        v := 42
        // A Go int must be converted explicitly to C.int.
        C.printint(C.int(v))
    }

CGO places all C symbols into a virtual package named C. The article also provides a mapping table between C types, CGO types, and Go types (e.g., char → C.char → byte, int → C.int → int32, etc.).
3.3.2 Rendering Engine
The engine manages template cache, rendering data queues, and a thread group that executes GPU kernels for tasks such as image decoding, channel conversion, rotation/scaling, and alpha compositing.
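The queue‑plus‑thread‑group structure can be sketched in Go with a channel as the rendering data queue and goroutines as the worker group. This is only an analogy under stated assumptions: the real engine is C/C++ and launches GPU kernels, whereas here Job, Result, RenderAll, and fakeRender are hypothetical names and the render step is a placeholder.

```go
package main

import (
	"fmt"
	"sync"
)

// Job and Result are hypothetical shapes for one render request and its output.
type Job struct {
	ProductID  int
	TemplateID string
}

type Result struct {
	ProductID int
	Output    string
}

// RenderAll feeds jobs through a queue to a fixed group of workers and
// collects every result. Result order is not guaranteed.
func RenderAll(jobs []Job, workers int, render func(Job) Result) []Result {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan Job)
	out := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range queue {
				out <- render(j)
			}
		}()
	}

	// Producer: enqueue all jobs, then signal that no more will arrive.
	go func() {
		for _, j := range jobs {
			queue <- j
		}
		close(queue)
	}()

	// Close the result channel once every worker has drained the queue.
	go func() {
		wg.Wait()
		close(out)
	}()

	results := make([]Result, 0, len(jobs))
	for r := range out {
		results = append(results, r)
	}
	return results
}

// fakeRender is a placeholder for decode, convert, scale, and composite.
func fakeRender(j Job) Result {
	return Result{ProductID: j.ProductID, Output: "rendered with " + j.TemplateID}
}

func main() {
	jobs := []Job{{1, "flash-sale"}, {2, "flash-sale"}, {3, "free-shipping"}}
	results := RenderAll(jobs, 2, fakeRender)
	fmt.Println("rendered:", len(results))
}
```

A fixed worker count bounds how many renders are in flight at once, which is the same reason a GPU engine caps its thread group: device memory and streams are finite.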
4. Results
Performance tests show that the GPU‑based system renders ~4,680 images per second on a server with six NVIDIA T4 cards, compared with ~453 images per second on a 64‑core CPU server, roughly a ten‑fold speedup. Cost analysis indicates the GPU solution costs about 50% as much as the CPU solution and occupies only 10% of the rack space.
5. Practical Experience
Several optimisation techniques were applied:
Text handling improvements – disabling HarfBuzz ligature shaping in OpenCV to avoid unexpected character duplication, and using the glyph advanceX metric to correctly position text, fixing both over‑ and under‑flow issues.
Memory allocation – replacing malloc() with cudaMallocHost() to allocate page‑locked (pinned) host memory, which avoids the implicit staging copy from pageable to pinned memory on host‑device transfers.
Memory pool – leveraging OpenCV’s GPU memory pool with a stack‑based management strategy to avoid frequent allocate/free cycles.
Structure conversion – keeping data structures C‑compatible (avoiding std::string, std::vector) to minimise conversion overhead between C and C++.
CUDA Alpha compositing bug – discovered that nppiAlphaComp_8u_AC4R() produced incorrect results; the issue was reported to NVIDIA and fixed in CUDA 11.5.
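The stack‑based memory pool mentioned above can be illustrated with a small allocator that reserves one large buffer up front and hands out slices from it in last‑in, first‑out order. The sketch below is a CPU analogy in Go, with a byte slice standing in for device memory; StackPool and its methods are hypothetical names and do not reproduce OpenCV's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// StackPool carves allocations out of one pre-reserved buffer. Allocations
// must be released in reverse order of allocation (LIFO).
type StackPool struct {
	buf   []byte
	top   int   // offset of the first free byte
	sizes []int // sizes of live allocations, oldest first
}

func NewStackPool(capacity int) *StackPool {
	return &StackPool{buf: make([]byte, capacity)}
}

// Alloc returns the next n bytes of the pool, or an error if it is full.
func (p *StackPool) Alloc(n int) ([]byte, error) {
	if n < 0 || p.top+n > len(p.buf) {
		return nil, errors.New("stack pool: not enough space")
	}
	// The three-index slice caps capacity so a caller cannot grow into
	// a neighbouring allocation.
	b := p.buf[p.top : p.top+n : p.top+n]
	p.top += n
	p.sizes = append(p.sizes, n)
	return b, nil
}

// Release frees the most recent allocation by moving the stack top back.
func (p *StackPool) Release() {
	if len(p.sizes) == 0 {
		return
	}
	last := len(p.sizes) - 1
	p.top -= p.sizes[last]
	p.sizes = p.sizes[:last]
}

// Used reports how many bytes are currently allocated.
func (p *StackPool) Used() int { return p.top }

func main() {
	pool := NewStackPool(1 << 20) // reserve 1 MiB once, at startup
	decoded, _ := pool.Alloc(512 * 1024)
	scaled, _ := pool.Alloc(256 * 1024)
	fmt.Println(len(decoded), len(scaled), pool.Used())
	pool.Release() // frees scaled
	pool.Release() // frees decoded
	fmt.Println(pool.Used())
}
```

Allocation and release reduce to moving an offset, so per‑image temporary buffers no longer pay for a device allocate/free call each time. The trade‑off is the LIFO discipline, which fits a render pipeline where intermediate buffers are created and dropped in nested order.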
Performance before and after memory‑pool optimisation:
          Render Time   Images Rendered   QPS
Before    30 s          19,572            652
After     30 s          28,413            947
6. Conclusion
The GPU‑driven rendering pipeline successfully meets the high‑concurrency, high‑compute demands of Shopee’s ad‑material generation. While GPU utilization still has headroom, further profiling will aim to push the system toward its theoretical limits and extend support to richer media such as video.
Shopee Tech Team
How to innovate and solve technical challenges in diverse, complex overseas scenarios? The Shopee Tech Team will explore cutting‑edge technology concepts and applications with you.