How WeChat’s TFCC Boosts Deep Learning Inference Performance Across Platforms
TFCC, developed by WeChat's backend team, is a high-performance, easy-to-use, and universal deep-learning inference framework. It supports a broad set of ONNX and TensorFlow operations, optimizes model structures, constants, and operators, and provides a versatile runtime and math library for both CPU and GPU platforms.
Introduction
WeChat TFCC is a server‑side deep‑learning inference framework developed by the WeChat Technical Architecture Backend team, open‑sourced together with Tencent’s Oteam Cloud‑Fan.
Key Features
TFCC offers high performance, ease of use, and universality. It supports 81 ONNX operations and 108 TensorFlow operations, covering recommendation, NLP, and speech models, and is widely used in WeChat Video, Official Accounts, User Portrait, and voice services.
Performance Optimizations
TFCC applies three layers of optimization: model-structure optimization (constant folding, pruning, and operator fusion), constant tracking, which aggressively identifies locally constant values so they can be folded as well, and operator optimization in the math library, whose CPU kernels use AVX2/AVX512 instructions and assembly-level tuning. Together these deliver strong QPS on Intel CPUs and NVIDIA GPUs, with BERT-base cited as a representative benchmark.
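Constant folding, the first of the structure optimizations above, can be illustrated with a toy expression graph (a hand-rolled sketch for illustration only; TFCC's real IR and passes are far richer): any node whose inputs are all constants is evaluated once at conversion time and replaced by a constant node.

```cpp
#include <cassert>
#include <memory>

// Toy IR: a node is a constant, a runtime input, or an Add over two children.
struct Node {
    enum Kind { Const, Input, Add } kind = Const;
    double value = 0.0;              // valid when kind == Const
    std::shared_ptr<Node> lhs, rhs;  // valid when kind == Add
};

using NodePtr = std::shared_ptr<Node>;

NodePtr makeConst(double v) {
    auto n = std::make_shared<Node>();
    n->kind = Node::Const;
    n->value = v;
    return n;
}

NodePtr makeInput() {
    auto n = std::make_shared<Node>();
    n->kind = Node::Input;
    return n;
}

NodePtr makeAdd(NodePtr a, NodePtr b) {
    auto n = std::make_shared<Node>();
    n->kind = Node::Add;
    n->lhs = a;
    n->rhs = b;
    return n;
}

// Constant folding: fold children first, then collapse an Add whose
// operands are both constants into a single Const node.
NodePtr fold(NodePtr n) {
    if (n->kind != Node::Add) return n;
    n->lhs = fold(n->lhs);
    n->rhs = fold(n->rhs);
    if (n->lhs->kind == Node::Const && n->rhs->kind == Node::Const)
        return makeConst(n->lhs->value + n->rhs->value);
    return n;
}
```

Folding `(2 + 3) + x` yields `5 + x`: the constant subtree is evaluated once when the model is converted instead of on every inference request.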
Ease of Use
TFCC supplies a complete toolchain: a single command converts an ONNX or TensorFlow model into TFCC format. The runtime then loads the TFCC model, uses JIT compilation to generate the execution graph, and ships with a simple validation program for checking converted models.
Universality
TFCC is universal across both model formats and hardware platforms: it converts ONNX and TensorFlow models into a unified IR, and it supports x86-64 CPUs and NVIDIA GPUs with platform-specific assembly optimizations.
Architecture Overview
The solution consists of three components: Code Generator (converts models to C++ code or tfccrt description), Runtime (loads tfccrt files and provides inference APIs), and Math Library (hardware‑agnostic matrix‑operation APIs backed by TFCC Core, MKL, and CUDA libraries).
Code Generator
The code generator is implemented as a compiler with three stages: a Frontend that translates ONNX or TensorFlow models into the IR; an Optimizer that performs pruning, pre-computation, duplicate removal, node fusion, and equivalent replacement; and a Backend that emits either a runtime description (Runtime Backend) or generated C++ code (C++ Code Backend).
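Node fusion, one of the Optimizer passes listed above, can be sketched on a linearized op sequence (an illustrative toy, not TFCC's actual pass): a MatMul immediately followed by an Add is rewritten into one fused op, which avoids materializing the intermediate tensor in memory between the two kernels.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy fusion pass over a linearized op sequence: rewrite the pattern
// [MatMul, Add] into a single FusedMatMulAdd op, leaving other ops intact.
std::vector<std::string> fusePass(const std::vector<std::string>& ops) {
    std::vector<std::string> out;
    for (size_t i = 0; i < ops.size(); ++i) {
        if (i + 1 < ops.size() && ops[i] == "MatMul" && ops[i + 1] == "Add") {
            out.push_back("FusedMatMulAdd");
            ++i;  // skip the Add that was absorbed into the fused op
        } else {
            out.push_back(ops[i]);
        }
    }
    return out;
}
```

A real pass works on a graph and must also verify that the intermediate result has no other consumers before fusing; this sketch only shows the pattern-rewrite idea.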
Runtime
Loads tfccrt model description, generates machine code on the fly, and executes inference with minimal overhead.
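The flattened-execution idea behind this can be sketched as follows (a simplification: the real runtime JIT-generates machine code, while here plain function objects stand in for the generated kernels). The graph is compiled once into a flat list of steps over a slot table, and inference is then a tight loop with no graph traversal.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// One step of a flattened execution plan: apply a binary kernel to two
// slots of the tensor table and store the result in an output slot.
struct Step {
    std::function<double(double, double)> op;
    int in0, in1, out;  // indices into the slot table
};

// Execute the plan in order and return the value of the requested slot.
double run(std::vector<double> slots, const std::vector<Step>& plan, int resultSlot) {
    for (const Step& s : plan)
        slots[s.out] = s.op(slots[s.in0], slots[s.in1]);
    return slots[resultSlot];
}
```

For example, a plan for `(a + b) * c` is two steps: add slots 0 and 1 into slot 3, then multiply slots 3 and 2 into slot 4. JIT-compiling this plan to machine code removes even the per-step dispatch overhead.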
Math Library
Provides platform‑independent core and platform‑specific MKL/CUDA libraries, exposing constants, variables, views, devices, and sessions.
Conclusion
TFCC focuses on server‑side deep‑learning inference and will continue to add features and performance improvements.
References
oneDNN – https://github.com/oneapi-src/oneDNN
Xbyak – https://github.com/herumi/xbyak
CUDA – https://developer.nvidia.com/
MKL – https://software.intel.com/content/www/us/en/develop/tools/oneapi.html
ONNX – https://github.com/onnx/onnx
TensorFlow – https://www.tensorflow.org/
PyTorch – https://pytorch.org/
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.