BiFang: A Unified Lake‑Stream Storage Engine for Real‑Time and Batch Data Processing
BiFang is a lake‑stream integrated storage engine that combines Apache Pulsar message‑queue capabilities with Iceberg data‑lake features in a single data store. It provides full‑incremental queries, sub‑second visibility, and exactly‑once semantics, and integrates with Flink, Spark, and StarRocks for both real‑time analytics and batch processing.
1. System Overview
BiFang is a lake‑stream integrated storage engine that unifies message‑queue and data‑lake functionalities, supporting full‑incremental queries and end‑to‑end real‑time data visibility. It is built on Tencent Tianqiong Pulsar and integrates with Iceberg for lake storage.
1.1 System Positioning
BiFang provides a single entry point for both streaming and batch data. It is compatible with mainstream batch and stream engines and meets diverse real‑time, consistency, and flexibility requirements.
1.2 Applicable Scenarios
Full‑incremental queries over message‑queue data (a full read of existing data followed by incremental reads of new data), served from Pulsar manifests.
Real‑time visibility of Iceberg lake data, reducing latency from minutes to sub‑second.
Unified storage for stream and batch, reducing cost and operational complexity.
Real‑time multidimensional reporting via StarRocks integration.
Efficient low‑cost multi‑stream stitching with KV/Value support.
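The full‑incremental query scenario can be illustrated with a minimal Python sketch. It is not BiFang's API: the `Record` type, the `full_incremental_scan` function, and the idea of tracking a single `snapshot_offset` are assumptions made for illustration. The sketch returns everything already in a snapshot, then only those log records that the snapshot does not yet cover.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Record:
    offset: int  # position in a hypothetical message log
    value: str


def full_incremental_scan(snapshot: List[Record],
                          snapshot_offset: int,
                          log_tail: List[Record]) -> Iterator[Record]:
    """Yield the full snapshot first, then only log records newer than it."""
    # Full phase: data already materialized in the snapshot.
    yield from snapshot
    # Incremental phase: skip records the snapshot already covers,
    # so that no record is returned twice.
    for rec in log_tail:
        if rec.offset > snapshot_offset:
            yield rec


snapshot = [Record(0, "a"), Record(1, "b")]
log_tail = [Record(1, "b"), Record(2, "c"), Record(3, "d")]
result = [r.value for r in full_incremental_scan(snapshot, 1, log_tail)]
print(result)  # ['a', 'b', 'c', 'd']
```

The boundary offset is what prevents duplicates or gaps at the hand‑over between the full read and the incremental read.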
1.3 Industry Comparison
Compared with Alibaba Fluss and Douyin BTS, BiFang offers a unified storage engine that supports exactly‑once semantics and sub‑second data visibility, and it has been deployed in production for video, gaming, and AI pipelines.
2. Architecture Principles
BiFang consists of three main components: BiFang Client, BiFang Server, and Lakehouse Storage (currently Iceberg). The server extends Pulsar Broker with modules such as Log Writer, Offload Service, Transaction Manager, Manifest Store, Manifest Service, and File Service.
2.1 Overall Architecture
The architecture integrates Pulsar and Iceberg, using a unified metadata catalog to manage both streaming and batch data.
2.2 Core Process
Data is written by Log Writer as row‑format batches, generating Delta Manifests stored in Manifest Store.
Manifest Service consumes Delta Manifests, creates BiFang logical files, and builds Manifest Files for Iceberg.
Auto Optimizer merges Manifest Files and converts logical files to columnar Parquet files.
Offload Service moves data to long‑term HDFS storage, enabling seamless reads from historical files.
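The first three steps of the core process can be sketched in Python as a toy in‑memory pipeline. All names here (`DeltaManifest`, `LogicalFile`, `write_batch`, `build_logical_file`, `to_columnar`) are hypothetical, and a plain dictionary of column lists stands in for Parquet; the real modules are certainly more involved. The offload step is omitted.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List

_batch_ids = itertools.count()  # monotonically increasing batch ids


@dataclass
class DeltaManifest:
    batch_id: int
    rows: List[dict]  # one row-format batch


@dataclass
class LogicalFile:
    file_id: int
    batch_ids: List[int]
    rows: List[dict]


def write_batch(store: List[DeltaManifest], rows: List[dict]) -> DeltaManifest:
    """Step 1: write a row batch and record a delta manifest for it."""
    delta = DeltaManifest(next(_batch_ids), rows)
    store.append(delta)
    return delta


def build_logical_file(store: List[DeltaManifest], file_id: int) -> LogicalFile:
    """Step 2: consume pending delta manifests into one logical file."""
    batch_ids = [d.batch_id for d in store]
    rows = [row for d in store for row in d.rows]
    store.clear()  # the deltas have been consumed
    return LogicalFile(file_id, batch_ids, rows)


def to_columnar(files: List[LogicalFile]) -> Dict[str, list]:
    """Step 3: merge logical files and pivot rows into columns.

    Assumes every row has the same keys.
    """
    columns: Dict[str, list] = {}
    for f in files:
        for row in f.rows:
            for name, value in row.items():
                columns.setdefault(name, []).append(value)
    return columns


store: List[DeltaManifest] = []
write_batch(store, [{"user": "u1", "clicks": 3}])
write_batch(store, [{"user": "u2", "clicks": 5}])
f0 = build_logical_file(store, 0)
write_batch(store, [{"user": "u3", "clicks": 1}])
f1 = build_logical_file(store, 1)
columns = to_columnar([f0, f1])
print(columns)  # {'user': ['u1', 'u2', 'u3'], 'clicks': [3, 5, 1]}
```

In this sketch the row batches are queryable as soon as their delta manifest exists, while the columnar conversion happens later and in bulk, which mirrors the ordering of the steps above.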
2.3 Technical Advantages
Unified table management and metadata governance via Iceberg.
End‑to‑end real‑time data visibility through real‑time Manifest queries.
Hybrid row‑column storage reduces storage redundancy and improves query performance.
Exactly‑once semantics with Pulsar transactions and Read‑Committed isolation.
Broad engine compatibility (Flink, Spark, StarRocks) and ecosystem integration.
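Read‑Committed isolation, mentioned in the list above, can be illustrated with a small Python model. This is a conceptual sketch and not the Pulsar transaction API: the `TxnLog` class and its methods are hypothetical. A reader sees only messages from committed transactions, and, as a simplifying assumption of this sketch, stops at the first message of a still‑open transaction so that log order is preserved.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Message:
    txn_id: int
    payload: str


class TxnLog:
    """Toy transactional log with read-committed visibility."""

    def __init__(self) -> None:
        self._messages: List[Message] = []
        self._committed: Set[int] = set()
        self._aborted: Set[int] = set()

    def append(self, txn_id: int, payload: str) -> None:
        self._messages.append(Message(txn_id, payload))

    def commit(self, txn_id: int) -> None:
        self._committed.add(txn_id)

    def abort(self, txn_id: int) -> None:
        self._aborted.add(txn_id)

    def read_committed(self) -> List[str]:
        visible: List[str] = []
        for msg in self._messages:
            if msg.txn_id in self._aborted:
                continue  # aborted data is never visible
            if msg.txn_id not in self._committed:
                break  # open transaction: stop to keep log order
            visible.append(msg.payload)
        return visible


log = TxnLog()
log.append(1, "a")
log.append(2, "b")
log.append(1, "c")
log.commit(1)
before = log.read_committed()
log.commit(2)
after = log.read_committed()
print(before, after)  # ['a'] ['a', 'b', 'c']
```

A reader that resumes from its last position after each call never observes uncommitted or aborted data, which is the property downstream engines rely on for exactly‑once results.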
3. Business Practice
In Tencent Video, BiFang replaces the traditional Lambda architecture: the message queue, Flink real‑time jobs, and Iceberg ingestion collapse into a single step. The result is sub‑second data visibility and exactly‑once guarantees, with no need for separate reconciliation pipelines.
4. Future Roadmap
Architecture optimization for higher read/write performance and stability.
Enhanced core capabilities: unified lakehouse lifecycle, KV/Changelog support, Arrow columnar format.
Ecosystem enrichment: integration with InLong, StarRocks, Oceanus, and WeData governance platform.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.