Big Data 12 min read

How Vipshop Scales Real‑Time Data with Flink on Kubernetes

This article details Vipshop's real‑time platform architecture, the migration from Storm and Spark to Flink, Flink's deployment on Kubernetes, and the latest Unified Data Management system that unifies data access across Kafka, Redis, Tair and HDFS.

dbaplus Community

Nov 1, 2018

How Vipshop Scales Real‑Time Data with Flink on Kubernetes

1. Vipshop Real‑Time Platform Overview

Vipshop operates a heterogeneous real‑time platform composed of Storm, Spark and Flink, with Flink becoming the primary engine this year. The platform supports eight core business areas such as real‑time recommendation (RTRS), Dataeye dashboards, real‑time data cleaning, internet finance, security risk control, price comparison, internal monitoring services (Logview, Mercury, Titan) and VDRC data sync, running on roughly 1,400 machines and over 600 applications.

The platform provides two main services: a real‑time compute layer built on the three frameworks, and a real‑time base data layer that normalizes upstream event data (Kafka) and MySQL binlog, performing cleaning and enrichment for downstream consumption.

Data sources include application‑level event streams sent to Kafka and MySQL binlog logs, which are processed by Flink jobs and output to downstream systems such as HBase.

2. Flink Practice at Vipshop

Scenario 1 – Dataeye Real‑Time Dashboard

Dataeye aggregates massive event and order data in real time, calculating dozens of metrics per second across multiple dimensions (site, platform, category, period, audience, activity, time). UV calculation involves cleaning Kafka events, joining with Redis, writing back to Kafka, and then Flink consumes the enriched stream.

Results are written to Kafka buffers and finally synchronized to HBase for display, with output volumes reaching tens of millions of records per second. Flink’s native state handling eliminated the need for external Redis storage, reducing resource consumption to one‑third of the previous Storm solution.

Scenario 2 – Kafka Data Landing on HDFS

Vipshop is migrating Spark Streaming jobs to Flink, using OrcBucketingTableSink to write event data into Hive tables on HDFS. A single Flink task can write ~3.5K rows/s, cutting resource usage by 90% and reducing latency from 30 s to under 3 s.

Scenario 3 – Real‑Time ETL with Changing Dictionary Tables

Dictionary tables stored in HDFS change frequently and must be joined with streaming data. Vipshop uses ContinuousFileMonitoringFunction and ContinuousFileReaderOperator to monitor HDFS for updates and refresh the join data in real time, planning a generic Hive‑Stream join solution.

3. Flink on Kubernetes

To unify management of various compute frameworks (real‑time, ML, batch), Vipshop migrated Flink to Kubernetes, deploying each Flink job as a StatefulSet‑based mini‑cluster with high availability.

StatefulSet provides stable hostnames (e.g., -0, -1) that can be directly assigned as JobManager pods, allowing a single StatefulSet to launch a full cluster, unlike Deployments which require multiple replicas.

Pod failures trigger automatic recreation with the same hostname, enabling faster recovery.

The Docker entrypoint script sets required environment variables (shown in the original slide). Configuration files for HDFS and other services are managed via Kubernetes ConfigMaps, e.g.:

kubectl create configmap hdfs-conf --from-file=hdfs-site.xml --from-file=core-site.xml

4. Latest Project – Unified Data Management (UDM)

Vipshop faces challenges accessing binary data (PB/Avro) in Kafka, Redis, Tair, and HDFS, lacking a unified schema service, audit, and permission controls. UDM addresses these by providing a Location Manager, Schema Metastore, and Client Proxy.

Maps logical names to physical addresses for data access.

Web GUI for schema inspection.

APIs for audit, monitoring, and lineage.

Integrations with Spark, Flink, Storm via custom TableSource factories.

In Flink, the UDMExternalCatalog bridges Flink and UDM by implementing the ExternalCatalog interface and providing TableSourceFactory implementations, enabling schema registration and SQL/Table API development with reduced code.

Q & A

Q: What does “automatic push of Hive table changes” mean?

A: After a Hive table is updated, the cached data in Flink is also refreshed so that joins use the latest data. Currently this requires custom development as the official framework does not support it out of the box.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Real-time Processing Flink kubernetes UDM

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.