Evolution of 58.com Real-Time Computing Platform and the One-Stop Streaming Data Processing System Wstream
The article details the technical evolution of 58.com’s real-time computing platform—from Storm and Spark Streaming to a Flink‑based one‑stop solution called Wstream—covering use cases, architecture, stability measures, migration from Storm, operational diagnostics, and future development plans.
58.com operates a massive service platform covering recruitment, real estate, automotive, finance, and local services, generating huge volumes of user data that require real‑time analysis. The real‑time computing platform provides efficient, stable, distributed processing for the group’s data.
Background
To meet diverse real‑time needs, 58.com built a platform that supports ETL, real‑time data warehouses, monitoring, analysis, and personalized recommendation scenarios.
Platform Evolution
The platform progressed from Storm to Spark Streaming and finally to Apache Flink, forming the one‑stop solution Wstream that improves development, deployment, management, and monitoring efficiency.
Platform Scale
The cluster comprises over 500 machines processing more than 600 billion records daily; after a year of development, Flink now handles about 50% of all tasks.
Flink Stability
High availability is achieved through task isolation and a robust cluster architecture, including YARN deployment, HDFS federation for checkpoint storage, Node Labels for logical isolation, and Cgroup‑based CPU isolation.
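The isolation measures above correspond to standard Hadoop YARN configuration. A minimal yarn-site.xml sketch (the property values are illustrative; the article does not give 58.com's actual settings):

```xml
<!-- Enable node labels for logical partitioning of the cluster -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs:///yarn/node-labels</value>
</property>

<!-- Use the Linux container executor with cgroups for CPU isolation -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
```

With node labels enabled, Flink jobs can be pinned to a labeled partition of the cluster so that batch and streaming workloads do not contend for the same NodeManagers.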
Platform Management (Wstream)
Wstream is a Flink‑based, high‑performance real‑time big‑data platform offering SQL‑style stream analytics, DDL for sources/sinks and dimension tables, and support for UDF/UDAF/UDTF, enabling developers to build applications via FlinkJar, Stream SQL, or Flink‑Storm.
Streaming SQL Capabilities
Custom DDL syntax, user‑defined functions, and joins between streams and dimension tables are supported, along with an interactive SQL client featuring syntax highlighting, validation, and a wizard‑guided configuration to simplify real‑time job development.
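The article does not reproduce Wstream's exact DDL grammar; the sketch below is modeled on open-source Flink SQL conventions (table names, columns, and connector options are all illustrative) to show the shape of a source DDL plus a stream-to-dimension-table join:

```sql
-- Hypothetical Kafka source; Wstream's custom DDL may differ in syntax
CREATE TABLE click_stream (
  user_id   BIGINT,
  url       STRING,
  proc_time AS PROCTIME()
) WITH (
  'connector' = 'kafka',
  'topic'     = 'clicks',
  'properties.bootstrap.servers' = 'broker:9092',
  'format'    = 'json'
);

-- Enrich the stream by joining against a dimension table
SELECT c.user_id, u.city, c.url
FROM click_stream AS c
JOIN user_dim FOR SYSTEM_TIME AS OF c.proc_time AS u
  ON c.user_id = u.user_id;
```

The `FOR SYSTEM_TIME AS OF` clause is the Flink SQL idiom for a processing-time lookup join, which matches the stream/dimension-table join capability the platform describes.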
Storm Task Migration to Flink
Using the Flink‑Storm compatibility layer, Storm jobs were migrated to Flink while preserving at‑least‑once semantics and tick‑tuple support, achieving over 40% resource savings and eliminating the need for a separate Storm cluster.
Task Diagnostics
Metrics are collected via Prometheus (pushed through Pushgateway) and visualized in Grafana; latency is monitored using Kafka topic lag; logs are aggregated with Log4j daily rotation, collected by agents, sent to Kafka, and indexed in Elasticsearch for searchable access.
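The latency signal described above reduces to simple offset arithmetic: for each partition, lag is the log-end offset minus the consumer group's committed offset. A minimal sketch (function names are illustrative, not part of Wstream):

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus the group's committed offset.

    A partition with no committed offset is treated as fully lagging,
    i.e. its lag equals the log-end offset.
    """
    return {p: end - committed_offsets.get(p, 0)
            for p, end in log_end_offsets.items()}


def total_lag(log_end_offsets, committed_offsets):
    """Sum of per-partition lags for a topic; a natural alerting metric."""
    return sum(partition_lag(log_end_offsets, committed_offsets).values())
```

A monitoring agent would poll these offsets from Kafka, compute the lag, and push it to Pushgateway so Grafana can graph and alert on it.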
Flink Optimizations
Customizations include replacing Storm‑style ack mechanisms with checkpoint‑based exactly‑once semantics, aligning ResourceManager slot allocation with actual resource usage, improving Kafka connector client IDs, adding external JAR support on YARN, and extending BucketingSink to write LZO and Parquet formats.
Future Plans
The platform will continue to enhance Flink capabilities, expanding into real‑time rule engines and machine learning, while completing the migration of remaining Storm tasks.
Authors: Feng Haitao and Wan Shikang, responsible for building 58.com's real‑time computing platform.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.