Scaling Real‑Time Reconciliation with Dynamic Kafka Consumer Clusters
To ensure fund safety and robust operations, the team built a real-time reconciliation platform on Kafka. After hitting scaling bottlenecks with a static consumer model, they implemented a dynamic, partition-level, weighted load-balancing consumer cluster that supports automatic scaling and high-throughput processing.
To ensure fund safety, the team built and maintains a real-time reconciliation platform (named “算盘”, literally “abacus”) that compares upstream and downstream service data in real time to quickly detect potential fund-loss risks and guarantee business robustness.
The platform uses Kafka as the core messaging middleware, efficiently ingesting reconciliation source data and supporting high-throughput, low-latency processing of massive data. As business scenarios expanded, the number of Kafka topics exceeded 100, the number of partitions exceeded 1,000, and daily message volume reached hundreds of millions, rendering the initial consumption-scheduling approach insufficient.
We faced message backlogs, rising latency, and uneven resource allocation; simply adding more compute resources could not solve these issues, which threatened platform stability and growth. Consequently, the team designed and deployed a dynamically scalable Kafka consumer cluster scheduling solution.
Original Scheme and Challenges
Initially, a static, cluster‑based Kafka consumer management model was used, with the following core design:
Virtual cluster design: All machine resources were divided into multiple virtual clusters; machines registered with a cluster dynamically via a first-come-first-served claim (抢注) algorithm.
Cluster‑machine binding: Each machine belongs to only one cluster at any time.
Topic‑cluster binding: Each Kafka topic is assigned to a specific consumer cluster; a topic is consumed by only one cluster.
Unified consumer configuration: Every machine in a cluster launches consumer instances for every topic bound to that cluster, with the number of consumers per topic centrally controlled.
This design simplified operations and suited early‑stage scenarios with few topics, low throughput, and simple consumption logic.
Problems and Bottlenecks
As the business grew, the number of topics and data volume increased, exposing two major limitations:
Low resource utilization and load imbalance: During traffic peaks, some machines ran at high CPU while others sat idle. The root cause was the conflict between Kafka’s consumer-allocation mechanism and the static management model: when the number of consumer instances exceeds the number of partitions, the extra consumers stay idle, because Kafka’s default assignment sorts consumer instances lexically and hands partitions to lower-ordered instances first, leaving higher-ordered instances with nothing (see the sketch after this list).
Limited scalability and elasticity: Adding new machines inherited the full consumer configuration of existing clusters. If a topic’s consumer count already reached the partition limit, new consumers could not obtain partitions, leading to wasted resources and continued load imbalance.
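To make the idle-consumer effect concrete, here is a minimal sketch (illustrative only, not the platform’s code) of the per-topic arithmetic behind Kafka’s default range-style assignment; with 10 partitions and 12 consumer instances, the last two instances in lexical order receive nothing.

```java
import java.util.*;

// Illustrative only: reproduces the per-topic arithmetic of a range-style
// assignment to show why surplus consumers in a static cluster sit idle.
public class RangeAssignDemo {

    // Assign 'partitions' partitions to consumers sorted in lexical order.
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        Collections.sort(consumers);                      // lexical order, as Kafka's default does
        int perConsumer = partitions / consumers.size();  // base share per consumer
        int extra = partitions % consumers.size();        // first 'extra' consumers get one more
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < consumers.size(); i++) {
            int count = perConsumer + (i < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int p = 0; p < count; p++) owned.add(next++);
            result.put(consumers.get(i), owned);          // may be empty: an idle consumer
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 partitions spread over 12 consumer instances of a static cluster:
        List<String> consumers = new ArrayList<>();
        for (int i = 0; i < 12; i++) consumers.add(String.format("consumer-%02d", i));
        assign(10, consumers).forEach((c, ps) ->
                System.out.println(c + " -> " + (ps.isEmpty() ? "IDLE" : ps)));
        // consumer-10 and consumer-11 own no partitions and contribute nothing.
    }
}
```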
These issues resulted in resource waste, uneven load, and poor scalability, making the original scheme unsuitable for high‑availability and high‑throughput requirements.
New Scheme Design and Implementation
The new design aims to solve the above problems while improving overall performance and stability. The goals are:
Eliminate consumer‑allocation hotspots and achieve balanced CPU load across machines.
Support horizontal scaling without architectural changes.
Automatically adjust node load via intelligent load‑balancing algorithms.
Enhance circuit‑breaker and degradation mechanisms for robustness.
Provide fallback mechanisms to guarantee reliable consumption even under extreme conditions.
Technical Selection and Architecture
We implemented a dynamic, scalable Kafka consumer cluster scheduling scheme with the following core components:
Virtual cluster design: Same as before—stateless machines are grouped into a “consumer cluster,” and each machine belongs to only one cluster at a time.
Topic‑cluster binding: A topic can be manually assigned to one or more clusters. The total number of consumers for a topic equals its partition count (e.g., a topic with 10 partitions requires 10 consumer instances across all assigned clusters).
Dynamic machine‑topic consumer configuration: Based on the cluster‑machine and topic‑cluster bindings, the system calculates the ideal consumer configuration for each machine per topic. Each machine then claims (抢占) a unique consumer slot and creates the corresponding Kafka consumer instance (see the sketch below).
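A minimal sketch of how a node might claim a unique consumer slot, assuming a Redis key per slot is used for coordination; the article only confirms Redis for the schedule plan, so the key layout, TTL, and the claimSlot helper below are hypothetical.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Sketch under assumptions: one Redis key per consumer slot; SET NX decides
// the winner; a TTL lets slots held by dead nodes expire and be re-claimed.
public class SlotClaimer {
    private final Jedis redis;
    private final String nodeId;

    public SlotClaimer(Jedis redis, String nodeId) {
        this.redis = redis;
        this.nodeId = nodeId;
    }

    /** Try to claim one of the topic's consumer slots; returns the slot index, or -1 if none is free. */
    public int claimSlot(String topic, int totalSlots) {
        for (int slot = 0; slot < totalSlots; slot++) {
            String key = "consumer-slot:" + topic + ":" + slot;
            String reply = redis.set(key, nodeId, SetParams.setParams().nx().ex(60));
            if ("OK".equals(reply)) {
                return slot;   // this node owns the slot and can start a KafkaConsumer for it
            }
        }
        return -1;             // every slot is taken; this node starts no consumer for this topic
    }
}
```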
Core Implementation Details
Partition‑level granularity: The smallest scheduling unit is a partition, enabling fine‑grained resource management and load balancing.
Proportional multi‑cluster consumption: Multiple clusters can consume the same topic, with configurable consumer‑count ratios to control resource distribution.
Weighted load balancing: For each topic‑partition, the system computes a weight as the topic’s total CPU time consumed per unit time divided by its partition count. These weights guide partition-to-node allocation, preventing hotspots and uneven load (a worked sketch follows this list).
Automated scaling: The system supports automatic horizontal scaling. New nodes automatically join the load‑balancing pool; when nodes are removed, tasks are smoothly migrated without manual reconfiguration.
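To illustrate the weighted balancing described above, here is a minimal sketch assuming per-topic CPU time is already measured; the greedy heaviest-first placement and every name in it are illustrative rather than the platform’s exact algorithm.

```java
import java.util.*;

// Sketch under assumptions: weight per partition = topic CPU time per unit time
// divided by partition count; partitions are then placed heaviest-first onto the
// currently least-loaded node, which keeps per-node CPU weight roughly even.
public class WeightedPlanner {

    static double partitionWeight(double topicCpuMsPerMinute, int partitionCount) {
        return topicCpuMsPerMinute / partitionCount;
    }

    static Map<String, String> plan(Map<String, Double> weightByPartition, List<String> nodes) {
        Map<String, Double> load = new HashMap<>();
        nodes.forEach(n -> load.put(n, 0.0));
        Map<String, String> assignment = new LinkedHashMap<>();

        weightByPartition.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .forEach(e -> {
                    String target = Collections.min(load.entrySet(),
                            Map.Entry.comparingByValue()).getKey();   // least-loaded node so far
                    assignment.put(e.getKey(), target);
                    load.merge(target, e.getValue(), Double::sum);
                });
        return assignment;   // e.g. {"orders-3" -> "node-b", "orders-0" -> "node-a", ...}
    }
}
```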
Dynamic Refresh Mechanism
The system uses a “central calculation + distributed pull” model:
Schedule plan generation: Every 30 seconds, the scheduler runs a load‑balancing algorithm and writes the global consumption plan to Redis.
Node pull: Each machine runs a periodic pull job that fetches the latest plan from Redis.
Local hot‑update: Nodes dynamically start or stop consumer instances for specific topics based on the new configuration, without restarting services, ensuring continuous consumption.
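A minimal sketch of the pull-and-diff loop a node might run, under the assumption that the plan maps each node to a set of topic-partition slots; PlanStore, ConsumerFactory, and ConsumerHandle are placeholders for the real components, not names from the platform.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of "central calculation + distributed pull": every 30 seconds the node
// fetches its desired slots, stops consumers it no longer owns, and starts
// consumers for newly assigned slots, all without restarting the process.
public class PlanPuller {

    interface ConsumerHandle { void close(); }                        // a running Kafka consumer
    interface PlanStore { Set<String> desiredSlots(String nodeId); }  // latest plan, e.g. from Redis
    interface ConsumerFactory { ConsumerHandle start(String slot); }  // slot key like "topicA:3"

    private final Map<String, ConsumerHandle> running = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    void run(String nodeId, PlanStore store, ConsumerFactory factory) {
        timer.scheduleAtFixedRate(() -> {
            Set<String> desired = store.desiredSlots(nodeId);
            // Stop consumers this node no longer owns.
            for (Iterator<Map.Entry<String, ConsumerHandle>> it = running.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry<String, ConsumerHandle> entry = it.next();
                if (!desired.contains(entry.getKey())) {
                    entry.getValue().close();
                    it.remove();
                }
            }
            // Start consumers for newly assigned slots (local hot-update, no restart).
            for (String slot : desired) {
                running.computeIfAbsent(slot, factory::start);
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```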
Integration with Kubernetes HPA for Elastic Scaling
When traffic triggers scaling, new Pods register their heartbeats; the scheduler detects the change in node count, performs a full load recalculation, and assigns each new node to the consumer cluster with the highest CPU load as a temporary node. The new node then pulls the latest configuration and immediately starts consuming its assigned partitions, achieving minute-level elastic scaling.
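A minimal sketch of the heartbeat-driven detection step, assuming heartbeats are registered with a TTL so the scheduler can compare the live-node set against the last one it planned for; Registry and Scheduler are stand-ins, not the platform’s actual interfaces.

```java
import java.util.Set;

// Sketch under assumptions: when HPA adds or removes Pods, the set of nodes with
// a fresh heartbeat changes; the scheduler then recomputes the full plan.
public class HeartbeatWatcher {

    interface Registry { Set<String> liveNodes(); }              // nodes whose heartbeat has not expired
    interface Scheduler { void recalculate(Set<String> nodes); } // full load re-planning, written to Redis

    private Set<String> lastPlannedFor = Set.of();

    void poll(Registry registry, Scheduler scheduler) {
        Set<String> current = registry.liveNodes();
        if (!current.equals(lastPlannedFor)) {                   // a Pod joined or left
            scheduler.recalculate(current);
            lastPlannedFor = current;
        }
    }
}
```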
Implementation Results and Data Verification
Current operating scale and throughput:
Compute resources: Over 200 active machine nodes, supporting horizontal elasticity.
Message ingestion: More than 360 Kafka topics, over 3,000 partitions covering core financial scenarios.
Consumption throughput: Peak consumption rate exceeds 200,000 QPS, and peak reconciliation-task throughput exceeds 3,000,000 QPS.
These metrics demonstrate the system’s ability to handle ultra‑large message streams, meeting current demands and providing ample capacity for future growth.
Cluster Scheduling and Elastic Capability
Dynamic cluster management: Six independent consumer clusters are configured by business priority; CPU-load variance across machines stays within 10% during peaks.
Real‑time scaling support: Partition‑level scheduling and weighted load balancing enable instantaneous node addition/removal without manual configuration, greatly improving elasticity and operational efficiency.
Stability and Business Value
Since adopting the new scheme, the platform has eliminated message backlog and reconciliation latency caused by insufficient consumption capacity. Real‑time reconciliation SLA is consistently met, providing a solid technical foundation for fund‑loss risk control.