
Evolution of Kafka‑Based Data Pipeline at Chehaoduo Group: Architecture, Scaling, and Best Practices

This article chronicles the four‑year evolution of Chehaoduo Group’s Kafka ecosystem—from its initial role as a simple data‑ingestion layer to becoming the core of the company’s large‑scale data pipeline—detailing cluster management, upgrade strategies, multi‑cluster deployment, AVRO schema handling, SDK development, and operational lessons learned.

DataFunTalk

In mid‑2016, Chehaoduo Group (Guazi Used Cars & Maodou New Cars) introduced Kafka as the data‑input component of its big‑data system, gradually promoting it to the core of the entire data chain.

1. Kafka Overview – Kafka is an open‑source distributed event‑streaming platform used by thousands of companies for high‑performance data pipelines, streaming analytics, and data integration.

2. Message‑Queue Selection – Kafka was chosen for its rich ecosystem (Kafka Connect, KStream/KSQL), active community, multi‑language SDKs, and tight integration with Hadoop‑related components.

3. Cluster Management Tools – Early clusters were manually deployed on two machines; as usage grew, Cloudera Manager (CM) and AWX (Ansible UI) were adopted to automate installation, monitoring, and scaling.

4. First Major Upgrade – Rapid traffic growth caused high disk I/O on compact+delete topics, leading to ISR shrinkage and latency spikes. Two upgrade paths were evaluated: (A) build a new cluster and migrate services onto it, (B) expand the existing cluster and then roll out the upgrade gradually. After comparison, path A was selected for its lower risk and finer‑grained control.

5. Multi‑Cluster Deployment – To improve availability, three logical clusters were built: Business‑Dedicated, Online (high‑SLA), and Offline (central data lake). Each serves different latency and reliability requirements, with data synchronized across them.
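The article does not name the tool used to synchronize data between the clusters; one common approach is Kafka MirrorMaker 2 (available since Kafka 2.4). A minimal sketch of an MM2 properties file replicating Online‑cluster topics into the Offline data‑lake cluster — the cluster aliases and bootstrap addresses below are hypothetical:

```properties
# Cluster aliases and endpoints (hypothetical names/addresses)
clusters = online, offline
online.bootstrap.servers = online-kafka-1:9092,online-kafka-2:9092
offline.bootstrap.servers = offline-kafka-1:9092

# Replicate business topics from the Online (high-SLA) cluster to Offline
online->offline.enabled = true
online->offline.topics = .*
online->offline.topics.exclude = .*[\-\.]internal, __.*

# Match the target cluster's durability requirements
replication.factor = 3
```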

6. AVRO and Schema Registry – To enforce data format consistency, an AVRO schema registry was introduced. The team adopted BACKWARD compatibility with the restriction of only adding fields with default values, preventing downstream breakage.
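BACKWARD compatibility means a consumer using the new schema can still read records written with the old one; restricting schema changes to "new fields must carry defaults" is what guarantees this. A minimal pure‑Python sketch of that resolution step (illustrative only — not the Confluent Schema Registry implementation):

```python
def resolve_record(record: dict, reader_schema: dict) -> dict:
    """Avro-style schema resolution for added-with-default fields:
    any reader field absent from the writer's record is filled in
    from its default; a missing field with no default means the
    change was not BACKWARD compatible."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return resolved

# The old writer schema had only user_id; the new reader schema adds
# `source` with a default, so old records remain readable.
reader_schema = {
    "type": "record", "name": "CarEvent",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "source", "type": "string", "default": "unknown"},
    ],
}
old_record = {"user_id": 42}
print(resolve_record(old_record, reader_schema))
# {'user_id': 42, 'source': 'unknown'}
```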

7. SDK Development – A language‑agnostic SDK (initially Java, later PHP, Python, Go) was created to simplify producer/consumer integration, reducing onboarding friction for downstream teams.
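The article gives no API details for the in‑house SDK; below is a hypothetical sketch of the kind of thin producer facade such an SDK might expose, with the transport injected so any underlying client library can sit beneath it. The class name, topic‑naming convention, and `send` signature are all assumptions:

```python
import json
from typing import Callable

class PipelineProducer:
    """Hypothetical SDK facade: hides serialization and topic-naming
    conventions so downstream teams only supply a dict payload."""

    def __init__(self, department: str, send: Callable[[str, bytes], None]):
        self._department = department
        self._send = send  # injected transport, e.g. a real Kafka producer

    def emit(self, event_type: str, payload: dict) -> str:
        # Assumed convention: topics are namespaced by department.
        topic = f"{self._department}.{event_type}"
        self._send(topic, json.dumps(payload).encode("utf-8"))
        return topic

# In-memory transport for demonstration.
sent = []
producer = PipelineProducer("used-cars", send=lambda t, v: sent.append((t, v)))
producer.emit("listing-created", {"car_id": 7})
print(sent[0][0])  # used-cars.listing-created
```

Dependency‑injecting the transport keeps the SDK's conventions (naming, serialization, metadata) testable without a live cluster — one reason such wrappers reduce onboarding friction.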

8. Kafka Connect – Source & Sink – MySQL binlog ingestion was implemented via Maxwell (later Debezium considered) as a source connector. Sink connectors were built for Elasticsearch, Neo4j, Redis, HDFS/Hive, HBase, and Kudu, with special handling for small‑file issues in HDFS (5‑minute file flush and nightly merge).
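The 5‑minute flush plus nightly merge addresses HDFS's small‑file problem: frequent flushes keep data fresh, and the nightly job compacts the resulting small files. A minimal sketch of the two decisions involved (the interval comes from the article; the partition‑directory layout is an assumption):

```python
FLUSH_INTERVAL_S = 300  # 5 minutes, per the article

def should_roll(last_flush_ts: float, now: float,
                interval: float = FLUSH_INTERVAL_S) -> bool:
    """Close the current HDFS file and open a new one once the
    flush interval has elapsed."""
    return now - last_flush_ts >= interval

def nightly_merge_groups(files: list[str]) -> dict[str, list[str]]:
    """Nightly-merge sketch: group the day's small files by their
    date partition directory (layout assumed, e.g.
    /warehouse/events/dt=2019-06-01/part-00.avro) so each group
    can be rewritten as one large file."""
    groups: dict[str, list[str]] = {}
    for f in files:
        day = f.split("/")[-2]
        groups.setdefault(day, []).append(f)
    return groups
```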

9. Operational Challenges – Issues encountered included outdated hardware, insufficient replication, mixed workload contention, legacy client incompatibilities, and Kafka Connect version fragmentation. Solutions involved custom offset migration tools, documentation, and on‑site Kafka‑admin support.

10. Reflections & Standards – The team established code‑review standards, Git‑managed configuration, rapid rollback policies, and unified monitoring/alerting to improve reliability.

11. Platformization – An internal Kafka platform was built to expose read‑only topic operations, offset/time‑based consumption, AVRO deserialization, department metadata, and integrated monitoring/alerting.
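Time‑based consumption maps a timestamp to the earliest offset whose record timestamp is at or after the target — the semantics of Kafka's `offsetsForTimes` API. A pure‑Python sketch of that lookup over an in‑memory (timestamp, offset) index (illustrative; real clients delegate this search to the broker):

```python
from bisect import bisect_left

def offset_for_time(index: list[tuple[int, int]], target_ts: int):
    """index: (timestamp_ms, offset) pairs sorted by timestamp.
    Return the offset of the first record at or after target_ts,
    or None if every record is older (mirrors offsetsForTimes)."""
    timestamps = [ts for ts, _ in index]
    pos = bisect_left(timestamps, target_ts)
    if pos == len(index):
        return None
    return index[pos][1]

index = [(1000, 0), (2000, 1), (3500, 2)]
print(offset_for_time(index, 1500))  # 1
print(offset_for_time(index, 9000))  # None
```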

12. Future Outlook – Plans include adopting newer Kafka releases (e.g., removing Zookeeper), exploring Pulsar, building cross‑cluster backup tools, and further automating schema management.

Tags: SDK, Big Data, Data Pipeline, Kafka, Cluster Management, Kafka Connect, AVRO, Schema Registry
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
