
Design and Application of the Dragonfly Strategy Engine Framework at Kuaishou

Kuaishou's Dragonfly framework replaces tightly coupled C++ recommendation code with a Python-centric DSL that compiles to a JSON configuration executed by a high-performance C++ runtime. It offers DAG-based operators, schema-free DataFrames, and an ecosystem of debugging, tracing, and governance tools, cutting C++ code volume by 50-80% while speeding up the development of new scenarios.

Sohu Tech Products

Overview – This article describes how Kuaishou designed and applied a strategy-engine framework to manage recommendation systems that are complex and hard to abstract.

Main Parts – The presentation is organized into five sections: 1) Problems and Challenges, 2) Dragonfly Framework Introduction, 3) Ecosystem Construction, 4) Planning Outlook, and 5) Q&A, followed by a conclusion.

1. Problems and Challenges

Kuaishou’s business grew rapidly from 2018, with DAU increasing from 100 million to 376 million. The recommendation scenarios expanded from a few pages to hundreds (e‑commerce, live‑stream, overseas, local life, etc.). The R&D team grew from dozens to over a thousand engineers, creating two main demands: quickly building new recommendation scenarios and rapidly replicating effective strategies.

Early solutions copied existing architecture code, but as scenario count rose this approach became unsustainable. Issues included:

High maintenance cost due to tight coupling between business and architecture code.

Low engineer‑to‑algorithm staff ratio, leading to heavy pressure on architecture engineers.

All online code written in C++, making quality control difficult and causing stability bugs.

Conflicting goals: algorithm teams optimize for fast iteration, while engineering teams optimize for maintainability.

Business and architecture code intertwined like the two strands of a DNA double helix, so a change to one side caused unintended side effects in the other.

These factors forced frequent large‑scale refactorings (every 1‑2 years), consuming massive manpower and hindering other architectural upgrades.

2. Dragonfly Framework Introduction

Dragonfly is a general-purpose graph engine framework for search, advertising, and recommendation (搜广推) services. It provides a unified base engine and flexible workflow orchestration for the business logic built on top of it.

2.1 What is Dragonfly? It offers built‑in efficient data models, a rich set of surrounding tools, and a layered architecture: strategy engines at the top, core services (recall, coarse‑ranking, fine‑ranking) in the middle, and a graph scheduler at the bottom. DSL operators enable a shift from C++‑centric development to Python‑centric scripting.

2.2 Strategy Orchestration – Users define flows (similar to workflows) composed of operators (recall, deduplication, exposure, filtering, etc.). The DSL script is compiled into a JSON configuration executed by the C++ runtime, preserving Python’s expressiveness without sacrificing performance.
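The compile step described above can be illustrated with a minimal sketch. The `Flow` class, its method names, and the JSON layout are all hypothetical, chosen only for illustration; the article does not show Dragonfly's actual DSL syntax or configuration format.

```python
import json


class Flow:
    """Hypothetical flow builder: records operators in the order they are chained."""

    def __init__(self, name):
        self.name = name
        self.ops = []

    def _add(self, op_type, **params):
        # Nothing is executed here; the script only describes the pipeline.
        self.ops.append({"type": op_type, "params": params})
        return self  # returning self allows chaining

    def recall(self, source, limit):
        return self._add("recall", source=source, limit=limit)

    def dedup(self, key):
        return self._add("dedup", key=key)

    def filter(self, rule):
        return self._add("filter", rule=rule)

    def compile(self):
        # Assumed output layout: a flow name plus an ordered operator list.
        return json.dumps({"flow": self.name, "pipeline": self.ops}, indent=2)


flow = (
    Flow("home_feed")
    .recall(source="hot_items", limit=500)
    .dedup(key="item_id")
    .filter(rule="not_exposed")
)
config = flow.compile()
print(config)
```

The point of the sketch is that the Python script never touches request data: it only emits a declarative description, which a separate runtime (C++ in Dragonfly's case) would load and execute.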

2.3 Process Abstraction – Business functions are broken into DAG‑based operators (filter, recall, model inference, etc.). Common operators are maintained by architects and reused across teams; custom operators can be added by business users for specialized logic.
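As a rough illustration of DAG-based operators, the sketch below runs three toy operators in dependency order using Python's standard `graphlib` module (Python 3.9+). The operator names, the shared `ctx` dictionary, and the scheduling strategy are assumptions made for this example; Dragonfly's real scheduler is a C++ graph engine whose internals the article does not describe.

```python
from graphlib import TopologicalSorter


def recall(ctx):
    # Toy recall: produce candidate items with scores.
    ctx["items"] = [
        {"id": 1, "score": 0.2},
        {"id": 2, "score": 0.9},
        {"id": 3, "score": 0.5},
    ]


def filter_low(ctx):
    # Drop candidates below a score threshold.
    ctx["items"] = [it for it in ctx["items"] if it["score"] >= 0.5]


def rank(ctx):
    # Order the remaining candidates by descending score.
    ctx["items"].sort(key=lambda it: it["score"], reverse=True)


OPERATORS = {"recall": recall, "filter": filter_low, "rank": rank}

# Each node maps to the set of nodes that must run before it.
DEPENDENCIES = {"recall": set(), "filter": {"recall"}, "rank": {"filter"}}


def run_dag(operators, dependencies):
    ctx = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        operators[name](ctx)
    return order, ctx


order, ctx = run_dag(OPERATORS, DEPENDENCIES)
print(order, [it["id"] for it in ctx["items"]])
```

Because each operator only reads and writes the shared context, a common operator such as `filter_low` can be reused in any flow that supplies an `items` list.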

2.4 Data Abstraction – Dragonfly provides a high‑performance DataFrame structure (column‑store‑like) for item‑side and common‑side data. The schema‑free design allows online services to add new features without recompilation, similar to a key‑value interface.
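A schema-free, column-oriented structure of this kind might look like the following sketch. The `ItemFrame` class and its methods are hypothetical, and the reading of "common-side" as request-level attributes shared by all items is an assumption; the article does not give the real data layout.

```python
class ItemFrame:
    """Hypothetical column store: one list per attribute, one slot per item."""

    def __init__(self, item_ids):
        self.item_ids = list(item_ids)
        self.columns = {}  # attribute name -> per-item values, created on first write
        self.common = {}   # assumed: request-level attributes shared by all items

    def set_item_attr(self, name, index, value):
        # A new attribute needs no schema change: its column appears on first write.
        column = self.columns.setdefault(name, [None] * len(self.item_ids))
        column[index] = value

    def get_item_attr(self, name, index, default=None):
        column = self.columns.get(name)
        if column is None or column[index] is None:
            return default
        return column[index]


frame = ItemFrame([101, 102])
frame.common["user_id"] = 42
frame.set_item_attr("ctr", 0, 0.12)
frame.set_item_attr("is_live", 1, True)  # a brand-new feature, added at runtime

print(frame.get_item_attr("ctr", 0), frame.get_item_attr("ctr", 1))
```

Attributes are addressed by name, much like keys in a key-value store, which is what lets a new feature be introduced without recompiling the service.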

2.5 DSL Layer – The DSL offers high‑level abstractions such as standard operator wrappers, async/sync handling, parallel decorators (@parallel), and modular components. Users write Python scripts; the system compiles them to JSON for the C++ service.
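The article names a `@parallel` decorator but does not define its semantics. The sketch below shows one plausible interpretation in plain Python: a decorator that fans a function out over several inputs on a thread pool and returns results in input order. The `max_workers` parameter and the `recall_from` function are assumptions for illustration, and in Dragonfly itself the decorator presumably affects the generated configuration rather than running Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import wraps


def parallel(max_workers=4):
    """Hypothetical decorator: run the wrapped function once per input, concurrently."""

    def decorator(func):
        @wraps(func)
        def wrapper(inputs):
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                # Executor.map yields results in the same order as the inputs.
                return list(pool.map(func, inputs))

        return wrapper

    return decorator


@parallel(max_workers=3)
def recall_from(source):
    # Stand-in for an I/O-bound recall call against one source.
    return [f"{source}:{i}" for i in range(2)]


results = recall_from(["hot", "follow", "nearby"])
print(results)
```

The caller writes what looks like a single synchronous call, while the decorator hides the concurrency, which is the kind of sync/async abstraction the DSL layer is described as providing.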

2.6 Layered Decoupling – The DSL isolates algorithm developers (who write orchestration scripts) from architecture developers (who implement operators). This clear separation prevents strong coupling and mutual interference.

3. Ecosystem Construction

Beyond the core framework, a suite of tools supports the entire lifecycle: code generation assistance, debugging playground, white‑box tracing, visualization, and code governance. The Playground allows zero‑deployment online debugging of DSL scripts. White‑box tracing tracks per‑request operator execution, latency, and outputs. Visualization displays end‑to‑end business flows. Code governance automatically detects and removes unused operators or branches, reducing code bloat.
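Per-request white-box tracing can be sketched as a thin wrapper that records each operator's latency and input/output sizes. The `Tracer` class and the span fields are hypothetical; the article does not describe the format of Dragonfly's real traces.

```python
import time


class Tracer:
    """Hypothetical per-request tracer: one span per operator execution."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = []

    def run(self, name, operator, items):
        start = time.perf_counter()
        output = operator(items)
        latency_ms = (time.perf_counter() - start) * 1000.0
        self.spans.append(
            {
                "operator": name,
                "latency_ms": latency_ms,
                "in": len(items),
                "out": len(output),
            }
        )
        return output


tracer = Tracer("req-001")
items = list(range(10))
items = tracer.run("filter_even", lambda xs: [x for x in xs if x % 2 == 0], items)
items = tracer.run("top3", lambda xs: xs[:3], items)

for span in tracer.spans:
    print(tracer.request_id, span)
```

Recording input and output counts per operator makes it easy to see where candidates were dropped for a given request, which is the typical question a white-box trace answers.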

4. Planning Outlook

Future work focuses on three areas:

Performance – Adding NUMA‑aware capabilities and graph‑level optimizations to better exploit modern multi‑NUMA CPUs.

Governance – Full‑link feature management, automatic detection and deletion of dead data, and automated code‑clean‑up.

Productization – Enhancing the ecosystem with AI‑driven tools, offering more intelligent development workflows, and providing B2B solutions.
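The governance goal of detecting dead operators and dead data can be illustrated with a simple liveness pass over operator inputs and outputs. This is a sketch under assumed metadata (`PRODUCES`, `CONSUMES`, and the operator names are invented for the example), not the detection method the article's authors use, which is not described.

```python
def find_dead_operators(produces, consumes, final_outputs):
    """Return operators whose outputs never contribute to a final output."""
    needed = set(final_outputs)  # attributes that something live depends on
    live = set()
    changed = True
    while changed:
        changed = False
        for op, outputs in produces.items():
            if op not in live and needed & set(outputs):
                live.add(op)
                # Whatever a live operator reads becomes needed as well.
                needed |= set(consumes.get(op, ()))
                changed = True
    return sorted(set(produces) - live)


# Hypothetical metadata: which attributes each operator writes and reads.
PRODUCES = {
    "recall": ["candidates"],
    "ctr_model": ["ctr"],
    "legacy_score": ["old_score"],
    "rank": ["ranked"],
}
CONSUMES = {
    "ctr_model": ["candidates"],
    "legacy_score": ["candidates"],
    "rank": ["candidates", "ctr"],
}

dead = find_dead_operators(PRODUCES, CONSUMES, ["ranked"])
print(dead)
```

Here `legacy_score` writes `old_score`, which nothing downstream reads, so both the operator and the attribute are candidates for removal.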

5. Q&A

Questions covered custom operator expressiveness, granularity of custom operators, differences between Dragonfly’s control flow and TensorFlow’s graph, and the relationship between DSL operators and micro‑service partitioning. Answers clarified that custom operators are abstracted at the operator level (not line‑by‑line C++), are typically developed by algorithm teams, and that a DSL script ultimately generates a JSON configuration for a single service node.

Conclusion

The Dragonfly framework enables Kuaishou to break the cycle of massive refactorings, achieve flexible and scalable recommendation pipelines, and significantly reduce C++ code volume (by 50‑80%). It also provides a robust ecosystem for rapid development, debugging, monitoring, and automated code governance.
