
Design and Implementation of a Lightweight Asynchronous Message Processing Framework for Data Catalog

This article describes the motivation, requirements, design, and implementation of a lightweight asynchronous message processing framework built by ByteDance to handle near‑real‑time metadata changes for DataLeap's Data Catalog, detailing its architecture, thread model, state management, delay handling, monitoring, and operational experiences.

DataFunTalk

The ByteDance Data Platform team needed a more efficient way to consume near‑real‑time metadata change messages for the DataLeap Data Catalog, as Apache Atlas with Kafka could not meet the required throughput and latency.

Initially, a Flink‑based solution improved scalability but introduced maintenance challenges, especially for private‑cloud and Volcano Cloud deployments, prompting the team to build a custom lightweight asynchronous message processing framework.

Key requirements include handling millions of messages daily, supporting at‑least‑once delivery, delayed processing up to one minute, configurable retries, partition‑aware ordering, variable processing times, offset management, and minimal operational overhead.

The framework consists of three core components: MQ Consumer (pulls messages from Kafka, creates Events, handles delayed queues, reports metrics, and commits offsets), Message Processor (asynchronously processes Events, updates State Manager, reports metrics), and State Manager (maintains per‑partition offset state using priority queues).
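As a rough illustration, the three components might be wired together as follows. All class and method names here are illustrative, not the framework's actual API, and this `StateManager` is deliberately simplified (it assumes in‑order completion; the real one tracks offsets with priority queues):

```python
class StateManager:
    """Simplified stand-in: assumes events complete in order.
    The real framework tracks per-partition offsets with priority queues."""
    def __init__(self):
        self.done = {}                        # partition -> last completed offset

    def record_start(self, partition, offset):
        pass                                  # real version tracks in-flight offsets

    def record_done(self, partition, offset):
        self.done[partition] = offset

    def safe_offsets(self):
        return dict(self.done)                # offsets safe to commit back to Kafka


class MessageProcessor:
    """Processes Events and reports completion to the State Manager."""
    def __init__(self, state):
        self.state = state

    def process(self, event):
        # ... business logic on the metadata change message would go here ...
        self.state.record_done(event["partition"], event["offset"])


class MQConsumer:
    """Pulls messages, creates Events, hands them off, commits offsets."""
    def __init__(self, processor, state):
        self.processor, self.state = processor, state

    def run_once(self, messages):             # 'messages' stands in for a Kafka poll
        for msg in messages:
            self.state.record_start(msg["partition"], msg["offset"])
            self.processor.process(msg)       # synchronous here; async in the framework
        return self.state.safe_offsets()
```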

Its thread model separates Consumer Pool and Processor Pool, allowing independent scaling; Consumer threads map to Kafka partitions, while Processor threads handle FIFO queues per Event key, ensuring ordered processing for the same key.
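The per‑key ordering guarantee can be sketched with a hash‑based dispatcher: events sharing a key always land in the same FIFO queue, each drained by one processor thread. This is a minimal stdlib sketch; the pool size and function names are assumptions, not the framework's API:

```python
import queue
import threading

NUM_PROCESSORS = 4                            # illustrative pool size

processor_queues = [queue.Queue() for _ in range(NUM_PROCESSORS)]

def dispatch(event_key, event):
    """Route an event to a FIFO queue chosen by its key's hash, so all
    events sharing a key land in the same queue and keep their order."""
    idx = hash(event_key) % NUM_PROCESSORS
    processor_queues[idx].put(event)

def worker(q, handle):
    """One processor thread draining one FIFO queue."""
    while True:
        event = q.get()
        if event is None:                     # shutdown sentinel
            break
        handle(event)
```

Because each queue has exactly one worker, two events with the same key can never be processed concurrently or out of order.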

State Manager uses two min‑heap queues per partition (in‑process and completed) to determine safe offset commits, handling rebalance scenarios by clearing partition queues.
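The two‑heap bookkeeping might look like the sketch below: an offset is safe to commit only once every completed offset below the earliest still‑in‑process offset has been drained. Names and the O(n) removal are illustrative simplifications, not the framework's implementation:

```python
import heapq

class PartitionState:
    """Per-partition offset state using two min-heaps, as described above."""
    def __init__(self):
        self.in_process = []    # offsets handed to processors, not yet done
        self.completed = []     # offsets whose processing has finished

    def start(self, offset):
        heapq.heappush(self.in_process, offset)

    def complete(self, offset):
        # Move offset from in-process to completed (O(n) here; a sketch only).
        self.in_process.remove(offset)
        heapq.heapify(self.in_process)
        heapq.heappush(self.completed, offset)

    def safe_commit_offset(self):
        """Largest offset safe to commit: every completed offset smaller than
        the earliest still-in-process offset can be acknowledged to Kafka."""
        safe = None
        frontier = self.in_process[0] if self.in_process else float("inf")
        while self.completed and self.completed[0] < frontier:
            safe = heapq.heappop(self.completed)
        return safe
```

On a rebalance, dropping the whole `PartitionState` for a revoked partition corresponds to the queue‑clearing behavior described above.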

KeyBy support groups Events with the same hash key into the same internal queue to preserve order, and a bounded DelayQueue implements delayed processing based on Event time.
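A bounded delay queue keyed on event time can be sketched with a min‑heap: events become visible only once their ready time has passed, and the size bound (`maxsize`, an illustrative parameter) caps memory use:

```python
import heapq
import time

class BoundedDelayQueue:
    """Minimal sketch of a bounded delay queue ordered by event time."""
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.heap = []          # (ready_at, event) tuples, min-heap on ready_at

    def offer(self, event, ready_at):
        if len(self.heap) >= self.maxsize:
            return False        # full: caller must back off
        heapq.heappush(self.heap, (ready_at, event))
        return True

    def poll(self, now=None):
        """Return the next due event, or None if nothing is due yet."""
        now = time.time() if now is None else now
        if self.heap and self.heap[0][0] <= now:
            return heapq.heappop(self.heap)[1]
        return None
```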

Exception handling includes automatic retries, failure reporting, and back‑pressure mechanisms similar to Flink when processors cannot keep up.
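A retry wrapper in this spirit might look as follows; the parameter names and backoff policy are assumptions for illustration. Back‑pressure itself falls out naturally when the internal queues are bounded, since a full queue blocks the consumer side:

```python
import time

def process_with_retry(handle, event, max_retries=3, backoff_s=0.0):
    """Illustrative retry wrapper: retry a failed event up to max_retries
    times with linear backoff, then raise so the failure can be reported
    to monitoring instead of being silently dropped."""
    last = None
    for attempt in range(1, max_retries + 1):
        try:
            return handle(event)
        except Exception as exc:
            last = exc
            time.sleep(backoff_s * attempt)   # linear backoff between attempts
    raise RuntimeError(f"event failed after {max_retries} retries") from last
```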

Built‑in monitoring exposes metrics for consumer lag, rebalance rate, QPS, processing time, and queue length, enabling operators to diagnose issues such as message backlog or slow processors.

Operational case studies illustrate troubleshooting steps for consumer lag, queue buildup, and message replay using a new consumer group and offset reset.

After more than a year in production, the framework reliably synchronizes metadata for Data Catalog and Volcano Engine services, with future plans to add support for other MQs like RocketMQ and further automation of configuration, alerts, and operations.

Tags: system design, Kafka, message queue, asynchronous processing, backend framework, metadata synchronization
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
