
Building and Evolving Zhihu’s Flink‑Based Data Integration Platform

This article details Zhihu’s transition from a Sqoop‑driven data integration system to a Flink‑centric platform, covering business scenarios, historical architecture, design goals, technology choices, performance optimizations, and future plans for unified streaming‑batch processing across diverse storage systems.


Zhihu’s data integration platform connects heterogeneous storage systems such as MySQL, Redis, HBase, Hive, TiDB, and Zetta, handling both batch and streaming workloads. Rapid growth in data volume and diverse source requirements demanded higher throughput, lower latency, and reliable data accuracy.

The first‑generation platform relied on crontab‑based jobs, Sqoop for MySQL‑Hive sync, and limited scheduling, which caused management difficulties, high scheduling latency, and data‑skew issues. A technical evaluation compared Sqoop with Alibaba’s DataX, ultimately retaining Sqoop for quick validation.
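The data-skew issue mentioned above is inherent to Sqoop's default split strategy: it divides the `[min, max]` range of the split column into equal-width slices, one per mapper, so non-uniform key distributions leave some mappers with almost all the rows. A minimal plain-Python sketch of that behavior (not Sqoop itself; the key values are made-up examples):

```python
# Illustrative sketch of Sqoop-style numeric splitting, not Sqoop's code.
# Equal-width key ranges are computed from min/max only, so actual row
# counts per split can diverge badly when keys are sparse or clustered.

def split_key_range(min_key: int, max_key: int, num_splits: int):
    """Return equal-width [lo, hi) key ranges, one per mapper."""
    width = (max_key - min_key + 1) / num_splits
    return [
        (min_key + round(i * width), min_key + round((i + 1) * width))
        for i in range(num_splits)
    ]

def rows_per_split(keys, splits):
    """Count how many actual rows fall into each key range."""
    return [sum(1 for k in keys if lo <= k < hi) for lo, hi in splits]

# Example: 10,000 dense ids plus one stray row at id 1,000,000 (e.g. after
# an auto-increment jump). The first mapper gets nearly every row.
keys = list(range(10_000)) + [1_000_000]
splits = split_key_range(min(keys), max(keys), 4)
# rows_per_split(keys, splits) -> [10000, 0, 0, 1]
```

One outlier key is enough to starve three of four mappers, which is why the slowest mapper, not the average, determines the job's runtime.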

To address emerging streaming needs, Zhihu introduced Flink for Kafka‑to‑HDFS sync and later decided to replace Sqoop entirely with Flink, leveraging its robust API, active community, and native Hive support. A comparative study with Apache NiFi highlighted Flink’s lower integration overhead and better scalability.
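The core idea behind Kafka‑to‑HDFS sync is bucketing: each record is routed to a partitioned HDFS path derived from its event time, the way Flink's file sink assigns time-based buckets. A plain-Python sketch of that routing (not Flink; the base path and partition layout are hypothetical examples):

```python
# Illustrative sketch of time-based bucket assignment, in the spirit of
# Flink's file-sink bucket assigners. Not the platform's actual code.
from collections import defaultdict
from datetime import datetime, timezone

def bucket_path(base: str, ts_ms: int) -> str:
    """Map an event timestamp (epoch millis) to a date/hour-partitioned path."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{base}/dt={dt:%Y-%m-%d}/hour={dt:%H}"

def route(records, base="/warehouse/events"):
    """Group (ts_ms, payload) records by their target bucket directory."""
    buckets = defaultdict(list)
    for ts_ms, payload in records:
        buckets[bucket_path(base, ts_ms)].append(payload)
    return dict(buckets)
```

In the real pipeline, Flink additionally handles in-progress file rolling and checkpoint-coordinated commits so that a bucket's files become visible to Hive atomically; this sketch covers only the path-assignment step.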

Key design goals included supporting multiple storage systems, ensuring high throughput and low scheduling delay, maintaining data reliability, and providing comprehensive metadata management, monitoring, and alerting.

Flink migration introduced mixed scheduling (Per‑Job for streaming, Session for batch) with automatic scaling, private session clusters per business line, and improved resource utilization via Kubernetes. Custom connectors were developed for TiDB (region‑level load balancing, follower reads, distributed commits), Redis (source and sink with master/slave extraction), and Pulsar, expanding the platform’s source ecosystem.
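The region-level load balancing mentioned for the TiDB connector rests on the fact that TiKV stores a table as many contiguous key ranges ("regions"); assigning whole regions to parallel source subtasks gives each reader a similar share of the scan. A minimal plain-Python sketch of one such assignment policy, round-robin (not the actual connector; region boundaries and parallelism here are made-up examples):

```python
# Illustrative sketch of region-to-subtask assignment, not the real
# TiDB connector. Each region is a (start_key, end_key) range; a
# round-robin policy keeps per-subtask region counts within one of
# each other, so no reader scans a disproportionate share.

def assign_regions(regions, parallelism):
    """Distribute region key ranges across source subtasks round-robin."""
    assignment = [[] for _ in range(parallelism)]
    for i, region in enumerate(regions):
        assignment[i % parallelism].append(region)
    return assignment

# Example: 10 regions of a table spread over 4 parallel readers.
regions = [(i * 100, (i + 1) * 100) for i in range(10)]
# assign_regions(regions, 4) -> subtask region counts [3, 3, 2, 2]
```

A production connector would also weight the assignment by region size and route reads to followers where the article notes follower reads are used; round-robin over regions is the simplest form of the balancing idea.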

Performance gains were significant: job scheduling latency dropped from minutes to ~10 seconds, TiDB native connector achieved four‑fold throughput over JDBC, and overall throughput increased while reducing operational costs by consolidating resources under K8s.

Future directions focus on extending Flink to real‑time data warehousing, online machine‑learning platforms, and achieving seamless batch‑stream integration for large‑scale ETL and reporting workloads.

Tags: big data, Flink, Kubernetes, Redis, batch processing, streaming, TiDB, data integration
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
