
Building and Evolving Zhihu’s Flink‑Based Data Integration Platform

This article details Zhihu’s transition from a Sqoop‑driven data integration system to a Flink‑centric platform, covering business scenarios, historical architecture, design goals, technology choices, performance optimizations, and future plans for unified streaming‑batch processing across diverse storage systems.


Zhihu’s data integration platform connects heterogeneous storage systems such as MySQL, Redis, HBase, Hive, TiDB, and Zetta, handling both batch and streaming workloads. Rapid growth in data volume and diverse source requirements demanded higher throughput, lower latency, and reliable data accuracy.

The first‑generation platform relied on crontab‑based jobs, Sqoop for MySQL‑Hive sync, and limited scheduling, which caused management difficulties, high scheduling latency, and data‑skew issues. A technical evaluation compared Sqoop with Alibaba’s DataX, ultimately retaining Sqoop for quick validation.
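The data-skew issue mentioned above is inherent to Sqoop's default split strategy: it divides the `[min, max]` range of the split column into equal-width slices, one per mapper, so non-uniform key distributions leave some mappers with almost all the rows. A minimal plain-Python sketch of that behavior (not Sqoop itself; the key values are made-up examples):

```python
# Illustrative sketch of Sqoop-style numeric splitting, not Sqoop's code.
# Equal-width key ranges are computed from min/max only, so actual row
# counts per split can diverge badly when keys are sparse or clustered.

def split_key_range(min_key: int, max_key: int, num_splits: int):
    """Return equal-width [lo, hi) key ranges, one per mapper."""
    width = (max_key - min_key + 1) / num_splits
    return [
        (min_key + round(i * width), min_key + round((i + 1) * width))
        for i in range(num_splits)
    ]

def rows_per_split(keys, splits):
    """Count how many actual rows fall into each key range."""
    return [sum(1 for k in keys if lo <= k < hi) for lo, hi in splits]

# Example: 10,000 dense ids plus one stray row at id 1,000,000 (e.g. after
# an auto-increment jump). The first mapper gets nearly every row.
keys = list(range(10_000)) + [1_000_000]
splits = split_key_range(min(keys), max(keys), 4)
# rows_per_split(keys, splits) -> [10000, 0, 0, 1]
```

One outlier key is enough to starve three of four mappers, which is why the slowest mapper, not the average, determines the job's runtime.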

To address emerging streaming needs, Zhihu introduced Flink for Kafka‑to‑HDFS sync and later decided to replace Sqoop entirely with Flink, leveraging its robust API, active community, and native Hive support. A comparative study with Apache NiFi highlighted Flink’s lower integration overhead and better scalability.
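The core idea behind Kafka‑to‑HDFS sync is bucketing: each record is routed to a partitioned HDFS path derived from its event time, the way Flink's file sink assigns time-based buckets. A plain-Python sketch of that routing (not Flink; the base path and partition layout are hypothetical examples):

```python
# Illustrative sketch of time-based bucket assignment, in the spirit of
# Flink's file-sink bucket assigners. Not the platform's actual code.
from collections import defaultdict
from datetime import datetime, timezone

def bucket_path(base: str, ts_ms: int) -> str:
    """Map an event timestamp (epoch millis) to a date/hour-partitioned path."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{base}/dt={dt:%Y-%m-%d}/hour={dt:%H}"

def route(records, base="/warehouse/events"):
    """Group (ts_ms, payload) records by their target bucket directory."""
    buckets = defaultdict(list)
    for ts_ms, payload in records:
        buckets[bucket_path(base, ts_ms)].append(payload)
    return dict(buckets)
```

In the real pipeline, Flink additionally handles in-progress file rolling and checkpoint-coordinated commits so that a bucket's files become visible to Hive atomically; this sketch covers only the path-assignment step.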

Key design goals included supporting multiple storage systems, ensuring high throughput and low scheduling delay, maintaining data reliability, and providing comprehensive metadata management, monitoring, and alerting.

Flink migration introduced mixed scheduling (Per‑Job for streaming, Session for batch) with automatic scaling, private session clusters per business line, and improved resource utilization via Kubernetes. Custom connectors were developed for TiDB (region‑level load balancing, follower reads, distributed commits), Redis (source and sink with master/slave extraction), and Pulsar, expanding the platform’s source ecosystem.
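The region-level load balancing mentioned for the TiDB connector rests on the fact that TiKV stores a table as many contiguous key ranges ("regions"); assigning whole regions to parallel source subtasks gives each reader a similar share of the scan. A minimal plain-Python sketch of one such assignment policy, round-robin (not the actual connector; region boundaries and parallelism here are made-up examples):

```python
# Illustrative sketch of region-to-subtask assignment, not the real
# TiDB connector. Each region is a (start_key, end_key) range; a
# round-robin policy keeps per-subtask region counts within one of
# each other, so no reader scans a disproportionate share.

def assign_regions(regions, parallelism):
    """Distribute region key ranges across source subtasks round-robin."""
    assignment = [[] for _ in range(parallelism)]
    for i, region in enumerate(regions):
        assignment[i % parallelism].append(region)
    return assignment

# Example: 10 regions of a table spread over 4 parallel readers.
regions = [(i * 100, (i + 1) * 100) for i in range(10)]
# assign_regions(regions, 4) -> subtask region counts [3, 3, 2, 2]
```

A production connector would also weight the assignment by region size and route reads to followers where the article notes follower reads are used; round-robin over regions is the simplest form of the balancing idea.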

Performance gains were significant: job scheduling latency dropped from minutes to ~10 seconds, TiDB native connector achieved four‑fold throughput over JDBC, and overall throughput increased while reducing operational costs by consolidating resources under K8s.

Future directions focus on extending Flink to real‑time data warehousing, online machine‑learning platforms, and achieving seamless batch‑stream integration for large‑scale ETL and reporting workloads.

Tags: big data, Flink, Kubernetes, Redis, batch processing, streaming, TiDB, data integration
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
