Big Data 17 min read

How BIGO Scaled Real‑Time Messaging by Migrating from Kafka to Pulsar

BIGO replaced its Kafka‑based message‑flow platform with Apache Pulsar to overcome scaling, stability, and operational cost challenges, leveraging Pulsar’s storage‑compute separation, seamless horizontal expansion, low latency, and tight integration with Flink for real‑time ETL and AB‑test pipelines, resulting in billions of messages processed daily with half the hardware cost.

ITFLY8 Architecture Home
ITFLY8 Architecture Home
ITFLY8 Architecture Home
How BIGO Scaled Real‑Time Messaging by Migrating from Kafka to Pulsar

Initially BIGO’s message‑flow platform relied on open‑source Kafka, but rapid data growth caused severe scalability and stability issues: data‑storage coupling, costly cluster expansion, frequent broker failures, cross‑region replication problems, PageCache pollution, limited partition scalability, and soaring operational effort (0.5 person‑day per added broker, 1 person‑day per removed broker).

Why Choose Pulsar

In November 2019 BIGO evaluated mainstream streaming platforms and selected Apache Pulsar as a next‑generation cloud‑native solution. Pulsar separates storage (BookKeeper) from compute, supports seamless horizontal scaling, low latency, high throughput, multi‑tenant isolation, and cross‑region replication, directly addressing Kafka’s expansion pain points.

Pulsar’s architecture writes messages from producers to brokers, which forward them to BookKeeper for durable storage, enabling independent scaling of brokers and storage nodes.

Horizontal scaling to hundreds of nodes without service disruption.

High throughput proven in Yahoo! production (millions of msgs/sec).

Low latency (< 5 ms) even under massive load.

Persistent storage built on Apache BookKeeper with read/write separation.

Read/write separation optimizes sequential disk writes, friendly to HDDs and removes per‑topic partition limits.

Apache Pulsar at BIGO: Pub‑Sub Consumption Mode

Since May 2020 Pulsar runs in production at BIGO. Producers (C++, Java, Python, etc.) write to Pulsar topics; Flink jobs and Flink SQL consume these topics, while downstream systems such as Hive, ClickHouse, HDFS, and Redis receive processed data.

Pulsar + Flink Real‑Time Stream Platform

The Pulsar‑Flink connector creates a FlinkPulsarSource with serviceUrl, adminUrl, and topic serialization, and a FlinkPulsarSink that writes back to a target topic. The connector uses Pulsar readers with non‑durable cursors for low‑latency consumption, committing offsets only after Flink checkpoints.

During a checkpoint, the connector snapshots each reader’s position to Flink state (memory, RocksDB, or files). After checkpoint completion, a “NotifyCheckpointComplete” signal triggers the connector to commit the offsets to a durable cursor in Pulsar, persisting them in BookKeeper. This two‑layer recovery (checkpoint + durable cursor) guarantees exactly‑once semantics and resilience to checkpoint corruption.

Topic/partition discovery is handled by independent readers per partition; new partitions are automatically detected based on the partition.discovery.interval-millis setting, allowing Flink jobs to scale with topic growth.

Real‑time ETL Scenario

Hundreds of Pulsar topics (each with its own JSON schema) are consumed by a small number of Flink jobs. Readers subscribe to all topics, parse schemas, and write transformed data to HDFS. To avoid load imbalance, BIGO groups topics into slot groups based on traffic, assigning high‑traffic topics to dedicated groups and low‑traffic topics to shared groups, ensuring even resource utilization.

AB‑test Scenario

For real‑time data‑warehouse use cases, BIGO aggregates multiple raw event topics into wide tables covering 80‑90 % of recommendation and analytics needs. Instead of costly joins, Flink SQL performs UNION operations to produce hourly views, which are then written to ClickHouse. Daily tables are enriched by joining with offline Hive data processed by Spark, with the final result stored in HBase before being exported to ClickHouse for low‑latency queries.

Business Benefits

Since the May 2020 rollout, Pulsar processes hundreds of billions of messages daily at 2‑3 GB/s, delivering high throughput, low latency, and strong reliability while cutting hardware costs by ~50 %. The storage‑compute separation supports millions of topics, solving Kafka’s partition‑induced performance degradation.

Future Outlook

BIGO plans to further expand Pulsar usage, contribute new features (e.g., topic policies), migrate more workloads, adopt Kafka‑on‑Pulsar (KoP) for smoother transitions, and continue optimizing broker load‑balancing, cache hit rates, and BookKeeper I/O performance to meet the demanding reliability requirements of large‑scale streaming.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

FlinkReal-time StreamingETLApache PulsarKafka migration
ITFLY8 Architecture Home
Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.