
Optimizing Apache Pulsar for MySQL Binlog Ingestion and Sorting in Apache InLong

This article explains how Apache Pulsar is used within Apache InLong to collect, sort, and reliably deliver massive MySQL binlog incremental data, covering component architecture, job isolation, client and producer management, consumption strategies, common pitfalls, performance tuning, and practical code examples.

Tencent Cloud Middleware

Background

Apache Pulsar is a high‑performance, multi‑tenant messaging system used in the Apache InLong pipeline for massive MySQL binlog incremental data collection and sorting.

System Architecture

InLong consists of:

InLong Manager – metadata and job configuration.

InLong DBAgent – stateless binlog collector; each job runs its own Pulsar client and producers.

Pulsar clusters – separate data and metric clusters.

InLong Sort – Flink‑based sorting task that consumes Pulsar topics, deserializes, transforms and writes to sinks such as Hive/Thrift, Iceberg, HBase, ClickHouse.

US Runner – scheduling and reconciliation component.

DBAgent (collector) design

Each DBAgent node can run multiple independent jobs. Job metadata is stored in InLong Manager and coordinated via ZooKeeper. For every job, a dedicated Pulsar client is created and a set of producers (one per target topic) is instantiated, which keeps connections and state isolated per job and simplifies HA scheduling.

Sort (consumer) design

InLong Sort runs as a single Flink (Oceanus) job that hosts multiple dataflows. Each dataflow has its own source (Pulsar consumer), deserialization, sink and committer operators. The consumer uses a persistent exclusive subscription to enable reliable offset management and monitoring.

Pulsar client/producer sizing

The maximum number of TCP connections required by a DBAgent node can be estimated by:

maxConnections = MaxJobsNum * PreBrokerConnectNum * PulsarBrokerNum * min(MaxPartitionNum, PulsarBrokerNum)

Example: with MaxJobsNum = 60, PreBrokerConnectNum = 2, PulsarBrokerNum = 90 and MaxPartitionNum = 9, the estimate is 60 × 2 × 90 × 9 = 97 200 connections. Capacity planning must account for OS file-descriptor limits and for broker scaling, since the estimate grows with the broker count.
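The formula is plain arithmetic and worth sanity-checking during capacity planning. A minimal sketch (parameter names are illustrative, not part of any Pulsar API; MaxPartitionNum = 9 is assumed, which makes the stated 97 200 figure consistent):

```java
public class ConnectionSizing {
    // Estimate the worst-case TCP connection count for one DBAgent node,
    // following the formula above. All parameter names are illustrative.
    static long maxConnections(long maxJobsNum, long perBrokerConnectNum,
                               long pulsarBrokerNum, long maxPartitionNum) {
        return maxJobsNum * perBrokerConnectNum * pulsarBrokerNum
                * Math.min(maxPartitionNum, pulsarBrokerNum);
    }

    public static void main(String[] args) {
        // 60 jobs, 2 connections per broker, 90 brokers, 9 partitions per topic
        System.out.println(maxConnections(60, 2, 90, 9)); // prints 97200
    }
}
```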

Key design issues and solutions

1. Client and producer isolation

Sharing a single Pulsar client across jobs creates uneven load, higher latency and complex HA handling. The recommended pattern is:

One Pulsar client per job.

One producer per topic per job.

This eliminates cross‑job interference and simplifies failover.
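A minimal sketch of this per-job pattern, assuming a Pulsar service URL and topic list supplied by job configuration (the class and method names below are illustrative, not InLong's actual classes):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

// One instance per job: its own client, plus one producer per target topic.
public class JobPulsarContext implements AutoCloseable {
    private final PulsarClient client;
    private final Map<String, Producer<byte[]>> producers = new ConcurrentHashMap<>();

    public JobPulsarContext(String serviceUrl, List<String> topics)
            throws PulsarClientException {
        // Dedicated client: connections and internal threads are not shared
        // with other jobs running on the same DBAgent node.
        this.client = PulsarClient.builder().serviceUrl(serviceUrl).build();
        for (String topic : topics) {
            producers.put(topic, client.newProducer().topic(topic).create());
        }
    }

    public Producer<byte[]> producerFor(String topic) {
        return producers.get(topic);
    }

    @Override
    public void close() throws PulsarClientException {
        // Closing the client also closes all producers created from it,
        // so tearing down a failed job cannot affect its neighbours.
        client.close();
    }
}
```

Because each job owns the whole object, failing over a job to another DBAgent node is just a matter of closing this context and recreating it elsewhere.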

2. Multi‑threaded production

Using multiple threads with a shared producer incurs internal locking and degrades throughput. Each thread should own its own producer instance. Example implementation:

import java.util.concurrent.BlockingQueue;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class Sender extends Thread {
    // Each sender thread owns its own producer, so threads never contend
    // on a shared producer's internal locks.
    private final Producer<byte[]> producer;
    private final BlockingQueue<byte[]> msgQueue;

    public Sender(PulsarClient client, String topic,
                  BlockingQueue<byte[]> msgQueue) throws PulsarClientException {
        this.producer = client.newProducer().topic(topic).create();
        this.msgQueue = msgQueue;
    }

    @Override
    public void run() {
        try {
            while (!isInterrupted()) {
                // take() blocks on an empty queue instead of busy-spinning
                byte[] msg = msgQueue.take();
                producer.sendAsync(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

3. Consumer subscription model

Earlier versions used Pulsar Reader with a temporary subscription, which made monitoring difficult and caused data loss when checkpoints were not restored. Switching to a Pulsar Consumer with a persistent exclusive subscription provides:

Stable subscription name for metrics.

Ability to reset offsets programmatically.

Exactly‑once semantics when combined with Flink checkpointing.
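As a sketch, a consumer of this kind can be created as follows (the service URL, topic, and subscription names are placeholders; in InLong Sort the source operator manages this through the Flink Pulsar connector rather than hand-written code):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;
import org.apache.pulsar.client.api.SubscriptionType;

public class SortConsumerSketch {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder URL
                .build();

        // A named, durable subscription: its name shows up in broker metrics,
        // and Exclusive mode guarantees a single active consumer.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://tenant/ns/binlog-topic")   // placeholder
                .subscriptionName("inlong-sort-sub")            // placeholder
                .subscriptionType(SubscriptionType.Exclusive)
                .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                .subscribe();

        Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
        if (msg != null) {
            // Acknowledge only after the downstream write succeeds, so the
            // committed position in Pulsar never runs ahead of the sink.
            consumer.acknowledge(msg);
        }
        consumer.close();
        client.close();
    }
}
```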

Operational considerations

Monitoring: Use the persistent subscription name to query broker metrics for consumption lag.

Data loss avoidance: When not restoring from a checkpoint, the consumer starts from the last committed offset stored in Pulsar, ensuring no messages are skipped.

Dynamic offset reset: The exclusive consumer can be instructed via the admin API to seek to a specific message ID or timestamp.
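There are two ways to perform such a reset, sketched below under placeholder topic and subscription names: the client-side `Consumer.seek(...)`, which an exclusive consumer can call on its own subscription, and the broker-side cursor reset through `PulsarAdmin`, which works without touching the consumer process:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.admin.PulsarAdminException;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClientException;

public class OffsetResetSketch {
    // Client side: an exclusive consumer may seek its own subscription
    // to a publish timestamp (or a message ID).
    static void seekConsumer(Consumer<byte[]> consumer, long epochMillis)
            throws PulsarClientException {
        consumer.seek(epochMillis);              // seek by publish time
        // consumer.seek(MessageId.earliest);    // or seek by message ID
    }

    // Broker side: reset the subscription's cursor via the admin API.
    // Topic and subscription names are placeholders.
    static void resetCursor(PulsarAdmin admin, long epochMillis)
            throws PulsarAdminException {
        admin.topics().resetCursor(
                "persistent://tenant/ns/binlog-topic", "inlong-sort-sub",
                epochMillis);
    }
}
```

The admin-side reset is the safer choice operationally: because the subscription is persistent and named, the cursor can be moved even while the Flink job is stopped.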

Summary

The article demonstrates practical tuning of Pulsar in InLong’s binlog ingestion pipeline: isolate clients and producers per job, carefully calculate connection requirements, employ per‑thread producers for multi‑threaded publishing, and use a persistent exclusive consumer in Flink to guarantee reliable consumption and easy monitoring.

Tags: Java, Streaming, binlog, Apache Pulsar, Data Ingestion, InLong
Written by Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.
