How KaiFG Lets Python Feature Engineering Run at C++ Speed
KaiFG, Kuaishou's self‑built AI Feature Generator, unifies fragmented feature extraction frameworks and replaces slow C++ compile cycles with Python‑level development. Through Codon‑based compilation, reference‑counted memory management, and aggressive LLVM optimizations, it achieves near‑C++ performance while dramatically shortening iteration time.
Project Background
In Kuaishou's recommendation, advertising, and search systems, multiple heterogeneous feature extraction frameworks (e.g., Mio, Kuiba, Dark) co‑existed, each with different interfaces and programming paradigms. Algorithm engineers had to write dedicated operators for each framework, leading to duplicated effort, and C++‑based development suffered from difficult debugging and long compile‑to‑deploy cycles (30 min compile + 20 min deployment), stifling innovation.
What is KaiFG?
KaiFG (Kuaishou AI Feature Generator) is a unified feature extraction framework developed by the algorithm engine team. It provides a Python front‑end while leveraging the open‑source Codon compiler and LLVM back‑end to generate code that runs at native C++ speed, allowing developers to write concise, familiar Python logic without a compilation bottleneck.
Key Benefits
Zero learning cost: Write feature logic in Python (or NumPy) without mastering C++.
Minute‑level debugging: Local execution eliminates the 30‑minute compile step.
Seamless deployment: Python code can be deployed directly; no pre‑compilation required.
Performance parity: Runtime performance matches native C++ and far exceeds typical scripting languages.
Accelerated build: Compilation time reduced from 111 min to 12 min (≈10× speedup).
Stable memory management: Deterministic reference counting replaces GC, eliminating performance jitter.
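To make the "write features in plain Python" claim concrete, here is a minimal sketch of the kind of operator an algorithm engineer would write. The function name and bucketing logic are illustrative assumptions, not KaiFG's actual API; the point is that ordinary Python like this is what the framework compiles to native code.

```python
def ctr_bucket(clicks: int, impressions: int, num_buckets: int = 10) -> int:
    """Bucketize click-through rate into a discrete feature ID.

    Hypothetical example of a feature operator: plain Python with type
    hints, runnable locally for minute-level debugging, and compilable
    by a Codon-style backend without modification.
    """
    if impressions <= 0:
        return 0
    ctr = clicks / impressions
    # Clamp to [0, 1) so the bucket index stays within num_buckets.
    ctr = min(max(ctr, 0.0), 1.0 - 1e-9)
    return int(ctr * num_buckets)
```

Because this is just Python, the same file can be executed directly for debugging and then handed to the compiler for deployment, which is what removes the 30‑minute compile step from the inner loop.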
Technical Highlights
IR‑Level Reference Counting
Codon originally relied on the bdwgc garbage collector, which caused global‑lock contention and unpredictable pauses in high‑concurrency scenarios. KaiFG replaces GC with an IR‑level reference‑counting mechanism that tracks object lifetimes per thread, providing deterministic reclamation and a reported 294% performance gain over GC.
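The semantics of this mechanism can be modeled in a few lines of Python. This is a simulation for intuition only: the real implementation operates on Codon IR and an 8‑byte header field, not on Python objects, and the class and function names here are invented for the sketch.

```python
class RCObject:
    """Toy model of an object with an inline reference-count header."""
    def __init__(self, data):
        self.ref_count = 1   # stands in for the 8-byte ref_count slot
        self.data = data
        self.freed = False

def inc_ref(obj: RCObject) -> None:
    obj.ref_count += 1

def dec_ref(obj: RCObject) -> None:
    obj.ref_count -= 1
    if obj.ref_count == 0:
        # Deterministic, immediate reclamation: no global lock, no GC pause.
        obj.freed = True
```

The key property is that reclamation happens at the exact `dec_ref` that drops the count to zero, which is why latency stays flat under high concurrency instead of spiking at collection time.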
# Object memory layout (64‑bit)
+---------------------+
| ref_count (8 bytes) |
+---------------------+
| Object Data |
+---------------------+
Instruction‑Level Instrumentation
KaiFG inserts inc_ref and dec_ref calls at the IR level based on variable liveness analysis, and removes redundant pairs via optimization passes. Example LLVM IR transformation:
; tmp = a  (save the old value of a)
%load_a = load { i64, ptr }, ptr %a
store { i64, ptr } %load_a, ptr %tmp
; a = b
%load_b = load { i64, ptr }, ptr %b
store { i64, ptr } %load_b, ptr %a
; inc_ref(a)
%load1_a = load { i64, ptr }, ptr %a
%unused = call {} @inc_ref({ i64, ptr } %load1_a)
; dec_ref(tmp)
%load_tmp = load { i64, ptr }, ptr %tmp
%unused1 = call {} @dec_ref({ i64, ptr } %load_tmp)
Coroutine Memory Safety
Generators allocate their coroutine frames on the heap. KaiFG modifies LLVM’s CoroElide pass to reserve space for the reference count when a frame is elided onto the stack, ensuring safe reclamation even when a generator is discarded before it is fully consumed.
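The hazard being guarded against is easiest to see at the Python level: a generator may be abandoned mid‑iteration, yet its frame (and, after compilation, its reference count) must still be reclaimed safely. The sliding‑window function below is a hypothetical example of such a generator, not code from the article.

```python
def window_sums(values, size):
    """Yield running sums over a sliding window of `size` elements."""
    total = 0
    for i, v in enumerate(values):
        total += v
        if i >= size:
            total -= values[i - size]
        if i >= size - 1:
            yield total

gen = window_sums([1, 2, 3, 4, 5], 2)
first = next(gen)   # consume a single element...
gen.close()         # ...then discard the generator before exhaustion
```

In compiled form, the early `close()` path is exactly where a stack‑elided frame without room for its reference count would be unsafe, which is what the modified CoroElide pass prevents.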
Universal Data Interface
KaiFG abstracts data access through an IDataAccessor interface. Implementations (e.g., DragonDataAccessor in the internal Dragonfly engine) provide zero‑copy reads and expose capabilities such as sequential access, enabling the same feature code to run across online and offline pipelines and across various storage formats, including a custom ProtoKV reader for Protobuf without deserialization.
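A rough sketch of what such an accessor abstraction looks like in Python terms is given below. Only the interface name IDataAccessor comes from the article; the method names, the toy dictionary backend, and the feature function are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class IDataAccessor(ABC):
    """Sketch of a unified data-access interface (method names assumed)."""
    @abstractmethod
    def get(self, key: str):
        """Return the value for `key`, ideally without copying."""

class DictAccessor(IDataAccessor):
    """Toy in-memory backend. A production backend (e.g. a zero-copy
    Protobuf reader) would differ only in this class, not in feature code."""
    def __init__(self, row: dict):
        self._row = row
    def get(self, key: str):
        return self._row.get(key)

def user_age_bucket(acc: IDataAccessor) -> int:
    """Feature code written once against the interface."""
    age = acc.get("age") or 0
    return min(age // 10, 9)
```

Because the feature function only sees the interface, the same code runs unchanged against online services, offline pipelines, or a deserialization‑free ProtoKV reader.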
Vectorization Enhancements
Python loops like for x in list or for i in range(...) have predictable bounds, but LLVM cannot infer them automatically. KaiFG adds assume constraints for loop bounds and enriches type information with a full TBAA hierarchy tailored to Python data structures, allowing LLVM’s auto‑vectorizer to generate SIMD code. In benchmarked kernels, KaiFG’s vectorization matches or exceeds Clang’s C++ auto‑vectorization.
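The shape of loop the auto‑vectorizer targets looks like the kernel below: unit‑stride accesses over a range whose bound the compiler can be told to assume. This is an illustrative example (run here as plain Python); in KaiFG the same source would be compiled with the added bound assumptions and TBAA metadata.

```python
def dot(a, b):
    """Dot product: a SIMD-friendly kernel with a fixed loop bound."""
    n = len(a)          # a bound KaiFG can assert as a vectorization hint
    acc = 0.0
    for i in range(n):  # contiguous, unit-stride accesses over both arrays
        acc += a[i] * b[i]
    return acc
```

Without the injected assumptions, LLVM must conservatively allow for out‑of‑bounds or aliasing accesses; with them, loops of this shape vectorize into SIMD code.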
Performance Results
Extensive benchmarks show KaiFG’s runtime within ±10% of hand‑written C++ programs, while traditional script‑based solutions lag significantly. Compilation speed improved from 111 min to 12 min, and self‑developed optimizations yielded 40%–80% speedups over vanilla Codon. The framework also reduced memory‑related pauses, achieving stable P99 latency.
Conclusion and Outlook
KaiFG represents a paradigm shift for feature engineering: Python’s expressive syntax combined with C++‑grade performance eliminates the trade‑off between development speed and execution efficiency. Its unified interface, deterministic memory management, and aggressive LLVM optimizations empower algorithm teams to iterate rapidly, deploy seamlessly, and scale across diverse business lines. Future work will deepen compiler optimizations and expand the ecosystem to further accelerate AI engineering.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.