
Applying GPU Technology for High‑Throughput Image Rendering in Shopee Off‑Platform Ads

The Shopee Off‑Platform Ads team built a GPU‑accelerated Creative Rendering System (CRS) that combines a four‑layer architecture, C/C++ GPU code bridged to Go through CGO, and template caching to process billions of product images daily under high concurrency. Compared with a CPU‑only deployment, it delivers roughly a ten‑fold speedup at about half the cost and about one‑tenth of the rack space.

Shopee Tech Team

Jun 2, 2022


This article presents the design and practical experience of the Shopee Off‑Platform Ads team in applying GPU technology to a large‑scale image rendering system (Creative Rendering System, CRS). The system processes billions of product images daily to generate advertisement creatives, addressing the high concurrency and heavy computation characteristics of the ad‑material pipeline.

1. GPU Background

A GPU (Graphics Processing Unit) is a massively parallel processor built from hundreds to thousands of compute cores. Compared with a CPU, a GPU contains many SIMD units that execute the same instruction across vectorized data, which makes it well suited to workloads such as image processing, matrix operations, and AI inference.

The article briefly explains GPU‑CPU communication via PCIe, the typical CPU/GPU program flow, and the CUDA software stack that provides driver‑level and application‑level APIs.

2. Business Background

Shopee’s Off‑Platform Ads place product images with promotional information on external platforms (e.g., Facebook, Google). These images must combine the original product picture with price, discount, and style information, requiring a fast, scalable rendering pipeline.

3. System Design

The CRS consists of two core modules: material rendering and template management. Templates are composed of multiple layers (image, shape, text) and are configured by operations staff. Rendering follows a four‑layer architecture:

System Access Layer – parameter validation, traffic control, reporting.

Business Logic Layer – preparation of rendering data and template handling.

C/Go Communication Layer – bridges Go (the main service language) with C/C++ GPU code via CGO.

Image Rendering Layer – performs GPU‑accelerated image processing (decoding, color conversion, scaling, alpha compositing).
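The four layers above can be pictured as a chain of function calls. The following is a minimal Go sketch under stated assumptions: every type and function name here (RenderRequest, accessLayer, businessLayer, bridgeLayer, renderLayer, Handle) is hypothetical, and the bridge and rendering layers are stubs that stand in for the real CGO call and GPU work.

```go
package main

import (
	"errors"
	"fmt"
)

// RenderRequest is a hypothetical request shape; the field names are assumptions.
type RenderRequest struct {
	TemplateID string
	ProductURL string
	Price      string
}

// renderJob is what the business logic layer hands to the lower layers.
type renderJob struct {
	templateID string
	layers     []string // ordered layer descriptions, bottom layer first
}

// accessLayer validates parameters (traffic control and reporting are omitted).
func accessLayer(req RenderRequest) error {
	if req.TemplateID == "" || req.ProductURL == "" {
		return errors.New("missing template id or product url")
	}
	return nil
}

// businessLayer prepares rendering data from the request.
func businessLayer(req RenderRequest) renderJob {
	return renderJob{
		templateID: req.TemplateID,
		layers:     []string{"image:" + req.ProductURL, "text:" + req.Price},
	}
}

// bridgeLayer stands in for the CGO boundary; a real system would call into C here.
func bridgeLayer(job renderJob) []byte {
	return renderLayer(job)
}

// renderLayer stands in for the GPU work and returns a placeholder payload.
func renderLayer(job renderJob) []byte {
	return []byte(fmt.Sprintf("rendered %s with %d layers", job.templateID, len(job.layers)))
}

// Handle runs one request through the four layers in order.
func Handle(req RenderRequest) ([]byte, error) {
	if err := accessLayer(req); err != nil {
		return nil, err
	}
	return bridgeLayer(businessLayer(req)), nil
}

func main() {
	out, err := Handle(RenderRequest{TemplateID: "t1", ProductURL: "https://example.com/p.jpg", Price: "$9.99"})
	fmt.Println(string(out), err)
}
```

Keeping validation in the outermost layer means malformed requests are rejected before any rendering data is prepared or any call crosses into C.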

Template management includes caching because the number of distinct templates is far smaller than the number of products. A delayed‑deletion strategy ensures safe removal of unused templates.
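One way to picture a delayed‑deletion cache is sketched below in Go. This is an illustration under assumptions, not the team's implementation: the names (Template, TemplateCache, MarkDeleted, Sweep) and the grace‑period mechanism are hypothetical. Deleting a template only hides it from new requests; the entry is physically removed by a later sweep, once renders that were already in flight have had time to finish.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Template is a hypothetical stand-in for a parsed, render-ready template.
type Template struct {
	ID     string
	Layers []string
}

type entry struct {
	tpl       *Template
	deleted   bool
	deletedAt int64 // unix seconds at which the template was marked deleted
}

// TemplateCache keeps templates in memory and removes them only after a grace period.
type TemplateCache struct {
	mu           sync.Mutex
	items        map[string]*entry
	graceSeconds int64
}

func NewTemplateCache(graceSeconds int64) *TemplateCache {
	return &TemplateCache{items: make(map[string]*entry), graceSeconds: graceSeconds}
}

// Put inserts or replaces a template.
func (c *TemplateCache) Put(t *Template) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[t.ID] = &entry{tpl: t}
}

// Get returns a live template; entries marked for deletion are hidden from new requests.
func (c *TemplateCache) Get(id string) (*Template, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[id]
	if !ok || e.deleted {
		return nil, false
	}
	return e.tpl, true
}

// MarkDeleted records the deletion time but keeps the entry for in-flight renders.
func (c *TemplateCache) MarkDeleted(id string, nowUnix int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.items[id]; ok && !e.deleted {
		e.deleted = true
		e.deletedAt = nowUnix
	}
}

// Sweep removes entries whose grace period has elapsed and returns how many were removed.
func (c *TemplateCache) Sweep(nowUnix int64) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for id, e := range c.items {
		if e.deleted && nowUnix-e.deletedAt >= c.graceSeconds {
			delete(c.items, id)
			removed++
		}
	}
	return removed
}

func main() {
	cache := NewTemplateCache(60)
	cache.Put(&Template{ID: "flash-sale", Layers: []string{"image", "shape", "text"}})
	now := time.Now().Unix()
	cache.MarkDeleted("flash-sale", now)
	fmt.Println("removed immediately:", cache.Sweep(now))
	fmt.Println("removed after grace period:", cache.Sweep(now+60))
}
```

In Go the garbage collector would keep a template alive for any render still holding a pointer to it, so the grace period matters most when an entry owns resources that must be released explicitly, such as GPU buffers.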

3.3.1 C/Go Communication

The team uses CGO to call C functions from Go. A simple example is shown below:

package main

/*
#include <stdio.h>

void printint(int v) {
    printf("printint: %d\n", v);
}
*/
import "C"

func main() {
    v := 42
    C.printint(C.int(v))
}

CGO places all C symbols into a virtual package named C. The article also provides a mapping table between C types, CGO types, and Go types (e.g., char → C.char → byte, int → C.int → int32, etc.).

3.3.2 Rendering Engine

The engine manages template cache, rendering data queues, and a thread group that executes GPU kernels for tasks such as image decoding, channel conversion, rotation/scaling, and alpha compositing.
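The queue‑plus‑thread‑group structure can be sketched as a worker pool. The source describes the engine as C/C++ threads driving GPU kernels; the Go version below swaps in goroutines and a channel purely to illustrate the shape, and Task, Result, renderOne, and RunEngine are hypothetical names. The rendering step is a stub.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of rendering work taken from the data queue.
type Task struct {
	ID         int
	TemplateID string
}

// Result pairs a task with its placeholder output.
type Result struct {
	TaskID int
	Output string
}

// renderOne stands in for the kernel sequence: decode, convert, scale, composite.
func renderOne(t Task) Result {
	return Result{TaskID: t.ID, Output: "creative-" + t.TemplateID}
}

// RunEngine starts a fixed group of workers that drain the task queue,
// then collects every result. Result order is not guaranteed.
func RunEngine(tasks []Task, workers int) []Result {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan Task)
	results := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				results <- renderOne(t)
			}
		}()
	}

	// Feed the queue, then close it so the workers exit.
	go func() {
		for _, t := range tasks {
			queue <- t
		}
		close(queue)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make([]Result, 0, len(tasks))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	tasks := []Task{{ID: 1, TemplateID: "a"}, {ID: 2, TemplateID: "b"}, {ID: 3, TemplateID: "a"}}
	fmt.Println(len(RunEngine(tasks, 2)), "tasks rendered")
}
```

A fixed worker count bounds how many renders are in flight at once, which is the property a thread group gives the engine regardless of how many requests are queued.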

4. Results

Performance tests show that the GPU‑based system renders ~4,680 images per second on a server with six NVIDIA T4 cards, compared with ~453 images per second on a 64‑core CPU server, which is roughly a ten‑fold speedup. Cost analysis indicates that the GPU solution costs about 50 % as much as the CPU solution and occupies only 10 % of the rack space.
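The headline ratio can be checked directly from the two reported throughput figures. The short Go program below does only that arithmetic; the per‑card and per‑day numbers are derived here for illustration and assume the reported rate is split evenly across cards and sustained around the clock, which the source does not claim.

```go
package main

import "fmt"

// Figures reported in the text.
const (
	gpuImagesPerSec = 4680.0 // server with six NVIDIA T4 cards
	cpuImagesPerSec = 453.0  // 64-core CPU server
	gpuCards        = 6
	secondsPerDay   = 86400
)

// Speedup is the GPU server's throughput relative to the CPU server.
func Speedup() float64 { return gpuImagesPerSec / cpuImagesPerSec }

// PerCard is the average throughput per T4 card, assuming an even split.
func PerCard() float64 { return gpuImagesPerSec / gpuCards }

// PerDay is an upper bound on images per server per day at the reported rate.
func PerDay() float64 { return gpuImagesPerSec * secondsPerDay }

func main() {
	fmt.Printf("speedup: %.1fx\n", Speedup())
	fmt.Printf("per T4 card: %.0f images/s\n", PerCard())
	fmt.Printf("per server per day: %.0f million images\n", PerDay()/1e6)
}
```

The ratio works out to about 10.3, consistent with the "roughly ten‑fold" description.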

5. Practical Experience

Several optimisation techniques were applied:

Text handling improvements – disabling HarfBuzz ligature shaping in OpenCV to avoid unexpected character duplication, and using the glyph advanceX metric to correctly position text, fixing both over‑ and under‑flow issues.

Memory allocation – replacing malloc() with cudaMallocHost() to allocate page‑locked (pinned) host memory, which removes the implicit staging copy from pageable to pinned memory that the driver otherwise performs before each transfer to the GPU.

Memory pool – leveraging OpenCV’s GPU memory pool with a stack‑based management strategy to avoid frequent allocate/free cycles.

Structure conversion – keeping data structures C‑compatible (avoiding std::string, std::vector) to minimise conversion overhead between C and C++.

CUDA Alpha compositing bug – discovered that nppiAlphaComp_8u_AC4R() produced incorrect results; the issue was reported to NVIDIA and fixed in CUDA 11.5.
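A bug like the alpha‑compositing one is easiest to spot by comparing GPU output against a simple CPU reference. The sketch below implements the standard Porter‑Duff "over" operator for non‑premultiplied 8‑bit RGBA pixels in Go. It is an independent reference written for this example: it does not reproduce NPP's internal arithmetic or rounding, and whether the team validated their output this way is an assumption.

```go
package main

import "fmt"

// RGBA is a non-premultiplied 8-bit pixel.
type RGBA struct{ R, G, B, A uint8 }

// Over composites src over dst using the Porter-Duff "over" operator:
//   outA = srcA + dstA*(1-srcA)
//   outC = (srcC*srcA + dstC*dstA*(1-srcA)) / outA
// with alpha normalised to [0, 1] and results rounded to the nearest integer.
func Over(src, dst RGBA) RGBA {
	sa := float64(src.A) / 255
	da := float64(dst.A) / 255
	oa := sa + da*(1-sa)
	if oa == 0 {
		return RGBA{} // both pixels fully transparent
	}
	blend := func(s, d uint8) uint8 {
		v := (float64(s)*sa + float64(d)*da*(1-sa)) / oa
		return uint8(v + 0.5)
	}
	return RGBA{
		R: blend(src.R, dst.R),
		G: blend(src.G, dst.G),
		B: blend(src.B, dst.B),
		A: uint8(oa*255 + 0.5),
	}
}

func main() {
	// Half-transparent red text pixel over an opaque blue background pixel.
	fmt.Println(Over(RGBA{255, 0, 0, 128}, RGBA{0, 0, 255, 255}))
}
```

Two boundary cases make useful regression checks: a fully opaque source must return the source unchanged, and a fully transparent source must return the destination unchanged.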

Performance before and after memory‑pool optimisation:

          Render Time    Images Rendered    QPS
Before    30 s           19,572             652
After     30 s           28,413             947
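The stack‑based strategy behind the memory pool can be illustrated with a small allocator. In OpenCV the pool is a C++ facility (cv::cuda::BufferPool) that manages device memory; the Go sketch below is only an analogy that uses host memory, and StackPool with its methods is a hypothetical name. One large buffer is reserved up front, allocations carve slices off the top, and frees happen in reverse order, so the steady‑state render loop never calls the underlying allocator.

```go
package main

import (
	"errors"
	"fmt"
)

// StackPool hands out slices from one preallocated buffer in LIFO order.
type StackPool struct {
	buf    []byte
	starts []int // start offset of each live allocation
	top    int   // next free offset
}

// NewStackPool reserves the whole pool once, up front.
func NewStackPool(size int) *StackPool {
	return &StackPool{buf: make([]byte, size)}
}

// Alloc reserves n bytes at the top of the stack without a new system allocation.
func (p *StackPool) Alloc(n int) ([]byte, error) {
	if n < 0 || p.top+n > len(p.buf) {
		return nil, errors.New("stack pool exhausted")
	}
	p.starts = append(p.starts, p.top)
	b := p.buf[p.top : p.top+n : p.top+n] // cap limited so appends cannot overrun
	p.top += n
	return b, nil
}

// Free releases the most recent allocation; frees must mirror allocs in reverse.
func (p *StackPool) Free() {
	if len(p.starts) == 0 {
		return
	}
	p.top = p.starts[len(p.starts)-1]
	p.starts = p.starts[:len(p.starts)-1]
}

// InUse reports how many bytes are currently allocated.
func (p *StackPool) InUse() int { return p.top }

func main() {
	pool := NewStackPool(1 << 20)
	decoded, _ := pool.Alloc(512 * 1024) // e.g. a decoded image
	scaled, _ := pool.Alloc(256 * 1024)  // e.g. its scaled copy
	fmt.Println(len(decoded), len(scaled), pool.InUse())
	pool.Free() // release scaled
	pool.Free() // release decoded
	fmt.Println(pool.InUse())
}
```

The LIFO constraint fits a render pipeline well, because intermediate buffers for one image are naturally created and discarded in nested order.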

6. Conclusion

The GPU‑driven rendering pipeline successfully meets the high‑concurrency, high‑compute demands of Shopee’s ad‑material generation. While GPU utilization still has headroom, further profiling will aim to push the system toward its theoretical limits and extend support to richer media such as video.

