How KaiFG Lets Python Feature Engineering Run at C++ Speed
KaiFG, Kuaishou's self‑built AI Feature Generator, unifies fragmented feature extraction frameworks and replaces slow C++ compile cycles with Python‑level development. Through Codon‑based compilation, reference‑counted memory management, and aggressive LLVM optimizations, it achieves near‑C++ performance while dramatically shortening iteration time.
Project Background
In Kuaishou's recommendation, advertising, and search systems, multiple heterogeneous feature extraction frameworks (e.g., Mio, Kuiba, Dark) co‑existed, each with different interfaces and programming paradigms. Algorithm engineers had to write dedicated operators for each framework, leading to duplicated effort, and C++‑based development suffered from difficult debugging and long compile‑to‑deploy cycles (30 min compile + 20 min deployment), stifling innovation.
What is KaiFG?
KaiFG (Kuaishou AI Feature Generator) is a unified feature extraction framework developed by the algorithm engine team. It provides a Python front‑end while leveraging the open‑source Codon compiler and LLVM back‑end to generate code that runs at native C++ speed, allowing developers to write concise, familiar Python logic without a compilation bottleneck.
Key Benefits
Zero learning cost: Write feature logic in Python (or NumPy) without mastering C++.
Minute‑level debugging: Local execution eliminates the 30‑minute compile step.
Seamless deployment: Python code can be deployed directly; no pre‑compilation required.
Performance parity: Runtime performance matches native C++ and far exceeds typical scripting languages.
Accelerated build: Compilation time reduced from 111 min to 12 min (≈10× speedup).
Stable memory management: Deterministic reference counting replaces GC, eliminating performance jitter.
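To make the "write features in plain Python" claim concrete, here is a minimal sketch of the kind of operator an algorithm engineer would write. The function name and bucketing logic are illustrative assumptions, not KaiFG's actual API; the point is that ordinary Python like this is what the framework compiles to native code.

```python
def ctr_bucket(clicks: int, impressions: int, num_buckets: int = 10) -> int:
    """Bucketize click-through rate into a discrete feature ID.

    Hypothetical example of a feature operator: plain Python with type
    hints, runnable locally for minute-level debugging, and compilable
    by a Codon-style backend without modification.
    """
    if impressions <= 0:
        return 0
    ctr = clicks / impressions
    # Clamp to [0, 1) so the bucket index stays within num_buckets.
    ctr = min(max(ctr, 0.0), 1.0 - 1e-9)
    return int(ctr * num_buckets)
```

Because this is just Python, the same file can be executed directly for debugging and then handed to the compiler for deployment, which is what removes the 30‑minute compile step from the inner loop.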
Technical Highlights
IR‑Level Reference Counting
Codon originally relied on the bdwgc garbage collector, which caused global‑lock contention and unpredictable pauses in high‑concurrency scenarios. KaiFG replaces GC with an IR‑level reference‑counting mechanism that tracks object lifetimes per thread, providing deterministic reclamation and a reported 294% performance gain over GC.
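The semantics of this mechanism can be modeled in a few lines of Python. This is a simulation for intuition only: the real implementation operates on Codon IR and an 8‑byte header field, not on Python objects, and the class and function names here are invented for the sketch.

```python
class RCObject:
    """Toy model of an object with an inline reference-count header."""
    def __init__(self, data):
        self.ref_count = 1   # stands in for the 8-byte ref_count slot
        self.data = data
        self.freed = False

def inc_ref(obj: RCObject) -> None:
    obj.ref_count += 1

def dec_ref(obj: RCObject) -> None:
    obj.ref_count -= 1
    if obj.ref_count == 0:
        # Deterministic, immediate reclamation: no global lock, no GC pause.
        obj.freed = True
```

The key property is that reclamation happens at the exact `dec_ref` that drops the count to zero, which is why latency stays flat under high concurrency instead of spiking at collection time.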
# Object memory layout (64‑bit)
+---------------------+
| ref_count (8 bytes) |
+---------------------+
| Object Data |
+---------------------+
Instruction‑Level Instrumentation
KaiFG inserts inc_ref and dec_ref calls at the IR level based on variable liveness analysis, and removes redundant pairs via optimization passes. Example LLVM IR transformation:
; tmp = a  (save the old value of a)
%load_a = load { i64, ptr }, ptr %a
store { i64, ptr } %load_a, ptr %tmp
; a = b
%load_b = load { i64, ptr }, ptr %b
store { i64, ptr } %load_b, ptr %a
; inc_ref(a)
%load1_a = load { i64, ptr }, ptr %a
%unused = call {} @inc_ref({ i64, ptr } %load1_a)
; dec_ref(tmp)
%load_tmp = load { i64, ptr }, ptr %tmp
%unused1 = call {} @dec_ref({ i64, ptr } %load_tmp)
Coroutine Memory Safety
Generators allocate their coroutine frames on the heap. KaiFG modifies LLVM’s CoroElide pass to reserve space for the reference count when a frame is elided onto the stack, ensuring safe reclamation even when a generator is discarded before it is fully consumed.
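The hazard being guarded against is easiest to see at the Python level: a generator may be abandoned mid‑iteration, yet its frame (and, after compilation, its reference count) must still be reclaimed safely. The sliding‑window function below is a hypothetical example of such a generator, not code from the article.

```python
def window_sums(values, size):
    """Yield running sums over a sliding window of `size` elements."""
    total = 0
    for i, v in enumerate(values):
        total += v
        if i >= size:
            total -= values[i - size]
        if i >= size - 1:
            yield total

gen = window_sums([1, 2, 3, 4, 5], 2)
first = next(gen)   # consume a single element...
gen.close()         # ...then discard the generator before exhaustion
```

In compiled form, the early `close()` path is exactly where a stack‑elided frame without room for its reference count would be unsafe, which is what the modified CoroElide pass prevents.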
Universal Data Interface
KaiFG abstracts data access through an IDataAccessor interface. Implementations (e.g., DragonDataAccessor in the internal Dragonfly engine) provide zero‑copy reads and expose capabilities such as sequential access, enabling the same feature code to run across online and offline pipelines and across various storage formats, including a custom ProtoKV reader for Protobuf without deserialization.
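A rough sketch of what such an accessor abstraction looks like in Python terms is given below. Only the interface name IDataAccessor comes from the article; the method names, the toy dictionary backend, and the feature function are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class IDataAccessor(ABC):
    """Sketch of a unified data-access interface (method names assumed)."""
    @abstractmethod
    def get(self, key: str):
        """Return the value for `key`, ideally without copying."""

class DictAccessor(IDataAccessor):
    """Toy in-memory backend. A production backend (e.g. a zero-copy
    Protobuf reader) would differ only in this class, not in feature code."""
    def __init__(self, row: dict):
        self._row = row
    def get(self, key: str):
        return self._row.get(key)

def user_age_bucket(acc: IDataAccessor) -> int:
    """Feature code written once against the interface."""
    age = acc.get("age") or 0
    return min(age // 10, 9)
```

Because the feature function only sees the interface, the same code runs unchanged against online services, offline pipelines, or a deserialization‑free ProtoKV reader.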
Vectorization Enhancements
Python loops like for x in list or for i in range(...) have predictable bounds, but LLVM cannot infer them automatically. KaiFG adds assume constraints for loop bounds and enriches type information with a full TBAA hierarchy tailored to Python data structures, allowing LLVM’s auto‑vectorizer to generate SIMD code. In benchmarked kernels, KaiFG’s vectorization matches or exceeds Clang’s C++ auto‑vectorization.
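The shape of loop the auto‑vectorizer targets looks like the kernel below: unit‑stride accesses over a range whose bound the compiler can be told to assume. This is an illustrative example (run here as plain Python); in KaiFG the same source would be compiled with the added bound assumptions and TBAA metadata.

```python
def dot(a, b):
    """Dot product: a SIMD-friendly kernel with a fixed loop bound."""
    n = len(a)          # a bound KaiFG can assert as a vectorization hint
    acc = 0.0
    for i in range(n):  # contiguous, unit-stride accesses over both arrays
        acc += a[i] * b[i]
    return acc
```

Without the injected assumptions, LLVM must conservatively allow for out‑of‑bounds or aliasing accesses; with them, loops of this shape vectorize into SIMD code.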
Performance Results
Extensive benchmarks show KaiFG’s runtime within ±10% of hand‑written C++ programs, while traditional script‑based solutions lag significantly. Compilation speed improved from 111 min to 12 min, and self‑developed optimizations yielded 40%–80% speedups over vanilla Codon. The framework also reduced memory‑related pauses, achieving stable P99 latency.
Conclusion and Outlook
KaiFG represents a paradigm shift for feature engineering: Python’s expressive syntax combined with C++‑grade performance eliminates the trade‑off between development speed and execution efficiency. Its unified interface, deterministic memory management, and aggressive LLVM optimizations empower algorithm teams to iterate rapidly, deploy seamlessly, and scale across diverse business lines. Future work will deepen compiler optimizations and expand the ecosystem to further accelerate AI engineering.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.