How Vipshop Scales Real‑Time Data with Flink on Kubernetes
This article details Vipshop's real‑time platform architecture, the migration from Storm and Spark to Flink, Flink's deployment on Kubernetes, and the latest Unified Data Management system that unifies data access across Kafka, Redis, Tair and HDFS.
1. Vipshop Real‑Time Platform Overview
Vipshop operates a heterogeneous real‑time platform composed of Storm, Spark and Flink, with Flink becoming the primary engine this year. The platform supports eight core business areas such as real‑time recommendation (RTRS), Dataeye dashboards, real‑time data cleaning, internet finance, security risk control, price comparison, internal monitoring services (Logview, Mercury, Titan) and VDRC data sync, running on roughly 1,400 machines and over 600 applications.
The platform provides two main services: a real‑time compute layer built on the three frameworks, and a real‑time base data layer that normalizes upstream event data (Kafka) and MySQL binlog, performing cleaning and enrichment for downstream consumption.
Data sources include application‑level event streams sent to Kafka and MySQL binlog logs, which are processed by Flink jobs and output to downstream systems such as HBase.
2. Flink Practice at Vipshop
Scenario 1 – Dataeye Real‑Time Dashboard
Dataeye aggregates massive event and order data in real time, calculating dozens of metrics per second across multiple dimensions (site, platform, category, period, audience, activity, time). UV calculation involves cleaning Kafka events, joining with Redis, writing back to Kafka, and then Flink consumes the enriched stream.
Results are written to Kafka buffers and finally synchronized to HBase for display, with output volumes reaching tens of millions of records per second. Flink’s native state handling eliminated the need for external Redis storage, reducing resource consumption to one‑third of the previous Storm solution.
Scenario 2 – Kafka Data Landing on HDFS
Vipshop is migrating Spark Streaming jobs to Flink, using OrcBucketingTableSink to write event data into Hive tables on HDFS. A single Flink task can write ~3.5K rows/s, cutting resource usage by 90% and reducing latency from 30 s to under 3 s.
Scenario 3 – Real‑Time ETL with Changing Dictionary Tables
Dictionary tables stored in HDFS change frequently and must be joined with streaming data. Vipshop uses ContinuousFileMonitoringFunction and ContinuousFileReaderOperator to monitor HDFS for updates and refresh the join data in real time, planning a generic Hive‑Stream join solution.
3. Flink on Kubernetes
To unify management of various compute frameworks (real‑time, ML, batch), Vipshop migrated Flink to Kubernetes, deploying each Flink job as a StatefulSet‑based mini‑cluster with high availability.
StatefulSet provides stable hostnames (e.g., -0, -1) that can be directly assigned as JobManager pods, allowing a single StatefulSet to launch a full cluster, unlike Deployments which require multiple replicas.
Pod failures trigger automatic recreation with the same hostname, enabling faster recovery.
The Docker entrypoint script sets required environment variables (shown in the original slide). Configuration files for HDFS and other services are managed via Kubernetes ConfigMaps, e.g.:
kubectl create configmap hdfs-conf --from-file=hdfs-site.xml --from-file=core-site.xml4. Latest Project – Unified Data Management (UDM)
Vipshop faces challenges accessing binary data (PB/Avro) in Kafka, Redis, Tair, and HDFS, lacking a unified schema service, audit, and permission controls. UDM addresses these by providing a Location Manager, Schema Metastore, and Client Proxy.
Maps logical names to physical addresses for data access.
Web GUI for schema inspection.
APIs for audit, monitoring, and lineage.
Integrations with Spark, Flink, Storm via custom TableSource factories.
In Flink, the UDMExternalCatalog bridges Flink and UDM by implementing the ExternalCatalog interface and providing TableSourceFactory implementations, enabling schema registration and SQL/Table API development with reduced code.
Q & A
Q: What does “automatic push of Hive table changes” mean?
A: After a Hive table is updated, the cached data in Flink is also refreshed so that joins use the latest data. Currently this requires custom development as the official framework does not support it out of the box.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
