Why We’re Dropping RabbitMQ for Kafka: A Complete Migration Blueprint

Faced with inconsistent usage, maintenance difficulty, poor network-partition tolerance, and performance bottlenecks in RabbitMQ, our middleware team decided to migrate fully to Kafka. This article covers the reasons, a comparison of the two models, the migration strategies, and the verification steps needed for a smooth transition that preserves high availability and high performance.

Su San Talks Tech

For historical reasons, our company has been running multiple MQ systems side by side. Since the second half of last year, the middleware team has wrapped the capabilities of both Kafka and RabbitMQ, and these wrappers now largely cover the needs of business teams.

Having largely implemented a Kafka governance platform and cluster migration controls, we now consider our Kafka expertise mature.

Consequently, we decided to cease maintenance and support for RabbitMQ for the following reasons.

Reasons

Chaotic Usage and Maintenance Difficulty

Data analysis shows that almost no service uses our custom RabbitMQ wrapper; most use spring-amqp or the native RabbitMQ client directly. The result is inconsistent usage patterns that make troubleshooting harder.

Supporting both Kafka and RabbitMQ duplicates effort and wastes resources. We also lack deep RabbitMQ expertise, which makes long-term maintenance a concern.

Partition Fault Tolerance Issues

RabbitMQ clusters have low network partition tolerance. High‑availability setups rely on mirrored queues, but during a network split each partition treats nodes in other partitions as down, making queue, exchange, and binding operations effective only within the current partition.

If a mirrored queue spans nodes across multiple partitions, each partition elects its own master, resulting in independent queues per partition.

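By default, this architecture risks split-brain scenarios. Versions before 3.1 cannot detect or recover from a partition automatically; from 3.1 onward RabbitMQ detects partitions and offers automatic handling modes such as pause_minority and autoheal, but the default mode (ignore) still requires manual intervention, and recovery may cause data loss.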
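The data-loss risk during recovery can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `split_brain` and `heal` are hypothetical helper names, each side of the partition is modeled as a plain list, and recovery is modeled as keeping one side's copy and discarding the rest. It does not reproduce RabbitMQ's actual recovery procedure.

```python
def split_brain(messages, sides):
    """Give every side of the partition its own independent copy of the queue."""
    return {side: list(messages) for side in sides}


def heal(replicas, winner):
    """Keep the winner's copy and report messages that only losing sides held."""
    kept = replicas[winner]
    lost = [
        msg
        for side, msgs in replicas.items()
        if side != winner
        for msg in msgs
        if msg not in kept
    ]
    return kept, lost


# Both sides start from the same mirrored queue contents.
replicas = split_brain(["m1", "m2"], ["side-A", "side-B"])
replicas["side-A"].append("a3")  # published to side A during the split
replicas["side-B"].append("b3")  # published to side B during the split

kept, lost = heal(replicas, winner="side-A")
print(kept)  # ['m1', 'm2', 'a3']
print(lost)  # ['b3']
```

Whichever side is kept, the messages accepted only by the other side are gone, which is why recovery after a split can lose data.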

Performance Bottlenecks

Mirrored queues improve availability but do not increase load capacity. Under high traffic, a single node handling a queue becomes a performance bottleneck.

Although most of our MQ traffic is low, a problem on any single queue can turn into a system-wide bottleneck.

Performance tests on a 4‑node cluster (each with 16 cores) showed:

Without sharding, a single queue reaches ~5K TPS.

On memory‑optimized nodes, the theoretical limit is ~50K/s.

With sharding, a single queue can handle up to ~10K/s.

These numbers assume normal consumption; under high load or backlog, performance degrades sharply.

Operations & Governance

Given the above challenges, we will fully migrate all RabbitMQ‑using services to Kafka to ensure stability, high availability, and performance as the business scales.

Our methodology emphasizes three principles for production: gray‑release, observability, and rollback. For middleware platform operations we aim for three capabilities: operability, observability, and governance, which we have largely achieved with Kafka Manager.

High availability: Kafka provides strong platform reliability.

High performance: Kafka supports extremely high TPS and horizontal scaling.

Feature support: We retain ordered messages, delayed messages, gray-release messages, and message tracing.

Operations & governance: Enhanced on top of Kafka Manager for better developer, ops, and testing experience.

Model Comparison

RabbitMQ

Exchange: Routes messages to queues based on exchange type, binding key, and routing key.

Queue: Stores messages; consumers subscribe directly to queues.

Routing Key: Specified by the producer when sending to an exchange.

Binding Key: Defined when binding an exchange to a queue.

Exchange Types:

Direct – routes to queues where binding key exactly matches routing key.

Fanout – broadcasts to all bound queues.

Topic – routes to queues whose binding key pattern matches the routing key; in a pattern, * matches exactly one word and # matches zero or more words.

Headers – routes based on message headers, ignoring routing key.
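The routing rules above can be sketched in a few lines. This is a minimal in-memory model, not RabbitMQ's implementation: the `Exchange` class and `topic_matches` function are hypothetical names, the headers exchange type is left out, and the topic matcher follows the AMQP convention that `*` matches exactly one dot-separated word and `#` matches zero or more.

```python
from dataclasses import dataclass, field


def topic_matches(pattern: str, routing_key: str) -> bool:
    """Topic match: '*' is exactly one word, '#' is zero or more words."""

    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' absorbs nothing, or absorbs one word and stays active.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])

    return match(pattern.split("."), routing_key.split("."))


@dataclass
class Exchange:
    kind: str  # "direct", "fanout", or "topic"
    bindings: list = field(default_factory=list)  # (binding_key, queue) pairs

    def bind(self, queue: str, binding_key: str = "") -> None:
        self.bindings.append((binding_key, queue))

    def route(self, routing_key: str) -> list:
        """Return the queues that would receive a message with this routing key."""
        queues = []
        for binding_key, queue in self.bindings:
            if self.kind == "fanout":
                hit = True  # routing key is ignored
            elif self.kind == "direct":
                hit = binding_key == routing_key
            elif self.kind == "topic":
                hit = topic_matches(binding_key, routing_key)
            else:
                raise ValueError(f"unsupported exchange type: {self.kind}")
            if hit and queue not in queues:
                queues.append(queue)
        return queues


orders = Exchange("topic")
orders.bind("audit", "order.#")
orders.bind("billing", "order.*")
print(orders.route("order.created"))     # ['audit', 'billing']
print(orders.route("order.created.eu"))  # ['audit']
```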

Kafka

Topic – logical channel for organizing messages.

Broker – Kafka server instance.

Consumer Group – set of consumers sharing the load.

Partition – a topic is split into partitions; each partition preserves order, enabling load‑balanced read/write and scalability.
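The following toy model illustrates how partitions give per-key ordering and how a consumer group shares the load. It is a sketch under stated assumptions: `partition_for`, `produce`, and `assign_partitions` are hypothetical helpers, CRC32 stands in for the hash (Kafka's default partitioner uses murmur2 for keyed records), and the round-robin assignment is a simplification of Kafka's real assignors.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(log, key: str, value: str, num_partitions: int) -> int:
    """Append a record to the partition chosen by its key."""
    partition = partition_for(key, num_partitions)
    log[partition].append((key, value))
    return partition


def assign_partitions(consumers, num_partitions: int) -> dict:
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment


log = defaultdict(list)
for event in ["created", "paid", "shipped"]:
    produce(log, "order-1", event, num_partitions=3)

# All events for the same key land in one partition, in send order.
p = partition_for("order-1", 3)
print([value for _, value in log[p]])      # ['created', 'paid', 'shipped']
print(assign_partitions(["c1", "c2"], 3))  # {'c1': [0, 2], 'c2': [1]}
```

Ordering holds only within a partition, so messages that must stay in order need to share a key.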

Migration Plan

We must ensure three points during migration:

Ease of operation – avoid excessive complexity.

Risk control – minimize impact on business.

No disruption to normal service operation.

Dual Subscription for Consumers

Modify consumers to listen to both RabbitMQ and Kafka.

Refactor producers to send messages to Kafka.

After RabbitMQ messages are drained, decommission it.

Advantages: lossless migration.

Disadvantages:

Maintaining two sets of listener code increases the workload, and the old RabbitMQ listener code must be removed after migration.

Message ordering cannot be guaranteed.
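A small simulation shows why dual subscription is lossless but not ordered. This is a sketch only: `consume_both` is a hypothetical function, both brokers are modeled as in-memory queues, and the strict alternation between sources is an assumption made to keep the output deterministic. Real listeners run concurrently, so the actual interleaving is unpredictable.

```python
from collections import deque


def consume_both(rabbit_backlog, kafka_log):
    """Drain two sources with one handler, alternating between them."""
    rabbit, kafka = deque(rabbit_backlog), deque(kafka_log)
    handled = []
    while rabbit or kafka:
        if kafka:
            handled.append(("kafka", kafka.popleft()))
        if rabbit:
            handled.append(("rabbitmq", rabbit.popleft()))
    return handled


# Messages 1-3 were sent to RabbitMQ before the producer switched to Kafka;
# messages 4-5 were sent to Kafka afterwards.
handled = consume_both(rabbit_backlog=[1, 2, 3], kafka_log=[4, 5])
print([msg for _, msg in handled])  # [4, 1, 5, 2, 3]
```

Every message is handled exactly once, but message 4 is processed before messages 1 to 3 have finished, which is the ordering problem noted above.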

Gray‑Release Single Subscription

Optimizes the dual‑subscription approach by using gray/blue‑green deployment to listen only to Kafka.

Update consumer code to consume from Kafka on gray/blue nodes.

Refactor producers to publish to Kafka.

Once the remaining RabbitMQ messages are drained, decommission RabbitMQ. If the business can tolerate some message loss, switch directly; otherwise pause production, wait for the RabbitMQ backlog to drain, and then switch.

Advantages:

Reduces workload by avoiding dual listeners.

Enables lossless migration.

Disadvantage: ordering still not guaranteed.
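The cutover decision for this approach can be written down as a short checklist generator. This is an illustrative sketch: `cutover_steps` and the step wording are hypothetical, and it only encodes the rule described above (switch directly if some loss is acceptable, otherwise pause, drain, then switch).

```python
def cutover_steps(tolerates_loss: bool) -> list:
    """Return the ordered cutover steps for gray-release single subscription."""
    steps = ["deploy Kafka consumers on gray nodes"]
    if tolerates_loss:
        # Messages still sitting in RabbitMQ may be dropped.
        steps.append("switch producers to Kafka")
    else:
        steps.extend(
            [
                "pause producers",
                "drain remaining RabbitMQ messages",
                "switch producers to Kafka",
            ]
        )
    steps.append("decommission RabbitMQ")
    return steps


for step in cutover_steps(tolerates_loss=False):
    print(step)
```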

Real‑World Scenario Challenges

Complex publish/subscribe topologies (mesh, ring) require mapping each exchange relationship and migrating exchange‑by‑exchange rather than service‑by‑service.

This may increase the number of releases each service needs, and it demands careful handling of services that act as consumers or producers on several exchanges.

Implementation Details

We cataloged all RabbitMQ exchanges and will handle them as follows:

Delete unused exchanges with no producers, consumers, or traffic.

For fanout exchanges, map each exchange to a Kafka topic; its queues become consumer groups. Queues with randomly generated names are handled with a simple wrapper.

For direct exchanges, map routing keys to topics; queues become consumer groups.

For topic exchanges, map routing keys to topics in the same way as direct exchanges; we observed no wildcard bindings in use.

Implement delayed queues and retries via a second‑layer spring‑kafka wrapper.
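The mapping rules above can be sketched as a planning function. Everything here is illustrative: `Binding`, `plan_exchange`, and the `exchange.routing_key` topic naming scheme are assumptions rather than our actual tooling, and delayed queues and retries are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    queue: str
    binding_key: str = ""


def plan_exchange(name: str, kind: str, bindings, has_traffic: bool = True) -> dict:
    """Return {kafka_topic: [consumer_groups]}; an empty dict means delete."""
    if not bindings and not has_traffic:
        return {}  # unused exchange: nothing to migrate
    plan = {}
    for binding in bindings:
        if kind == "fanout":
            topic = name  # one topic per exchange
        elif kind in ("direct", "topic"):
            if "*" in binding.binding_key or "#" in binding.binding_key:
                raise ValueError("wildcard binding needs a manual mapping")
            topic = f"{name}.{binding.binding_key}"  # one topic per routing key
        else:
            raise ValueError(f"unsupported exchange type: {kind}")
        # Each queue becomes a consumer group on the mapped topic.
        plan.setdefault(topic, []).append(binding.queue)
    return plan


print(plan_exchange("user-events", "fanout", [Binding("mail"), Binding("stats")]))
# {'user-events': ['mail', 'stats']}
print(plan_exchange("orders", "direct", [Binding("q1", "created"), Binding("q2", "paid")]))
# {'orders.created': ['q1'], 'orders.paid': ['q2']}
```

Because each queue becomes its own consumer group, every former queue still receives every message published to its topic, which mirrors the fanout and direct semantics.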

Verification, Monitoring, Gray‑Release, and Rollback

Verification

After migration, check RabbitMQ traffic in the management platform or the logs. Because most exchanges carry little traffic, you may need to inject test messages manually.

Verify Kafka traffic via Kafka Manager or logs.
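One way to turn these checks into numbers is to compare consumer lag on the Kafka side with the remaining traffic on the RabbitMQ side. The sketch below is an assumption-laden illustration: `consumer_lag` and `cutover_looks_healthy` are hypothetical helpers, and the inputs (end offsets, committed offsets, publish rate, ready messages) are expected to be read by hand or by script from Kafka Manager and the RabbitMQ management platform.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def cutover_looks_healthy(
    rabbit_publish_rate: float,
    rabbit_messages_ready: int,
    lag: dict,
    max_total_lag: int = 0,
) -> bool:
    """True when RabbitMQ is idle and drained and Kafka consumers keep up."""
    return (
        rabbit_publish_rate == 0
        and rabbit_messages_ready == 0
        and sum(lag.values()) <= max_total_lag
    )


lag = consumer_lag(end_offsets={0: 120, 1: 80}, committed_offsets={0: 120, 1: 75})
print(lag)  # {0: 0, 1: 5}
print(cutover_looks_healthy(0, 0, lag))                    # False
print(cutover_looks_healthy(0, 0, lag, max_total_lag=10))  # True
```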

Monitoring

Monitoring is performed through Kafka Manager and existing monitoring systems.

Gray‑Release

Both consumers and producers support gray‑release for pre‑deployment validation.

Rollback

Services can be rolled back in reverse deployment order.

References:
https://xie.infoq.cn/article/bf3d9cfd01af72b326254aa81
https://developer.aliyun.com/article/772095
《RabbitMQ实战指南》

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
