
WingPay's Big Data Platform Construction and Development Experience

This article presents a comprehensive case study of WingPay's big data platform, covering company background, data development and governance platform design, task development workflow, architectural choices, scheduling engine selection, data bus implementation, resource isolation, quality monitoring, cloud‑native practices, future challenges, and a Q&A session.

DataFunTalk

Guide – The article shares WingPay's experience in building and evolving a big data platform.

Four parts – (1) Company overview and business scenarios, goals, and challenges; (2) Data development and governance platform, including task development and dual‑environment deployment; (3) Platform architecture, covering system design, scheduler selection, data bus, quality monitoring, resource isolation, and compute optimization; (4) Future outlook and discussion.

Company and Business Introduction – WingPay, a fintech subsidiary of China Telecom, serves 70 million monthly active users with payment, shopping, and financial services, leveraging blockchain, cloud computing, big data, and AI to empower over 1 million offline merchants and 170 online e‑commerce platforms.

Platform Business Scenario – The platform supports data warehouse, rapid development for business units, offline and real‑time computation, data integration, and data services to improve development and governance efficiency.

Goals – Build an integrated data development and governance platform that unifies data integration, offline/real‑time computation, and data services, providing a one‑stop solution for developers.

Challenges – Massive data volume, high concurrency, low‑latency requirements, diverse business scenarios, and complex use cases.

Task Development Process – Create a business flow (Flow) to manage a group of tasks, develop SparkSQL offline tasks, set core parameters (priority, dependencies, runtime configs), test, submit for review, and publish to production where the scheduler executes them.
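As a rough illustration of the workflow above (names and fields are hypothetical, not WingPay's actual API), a Flow grouping tasks with their core scheduling parameters might be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Hypothetical model of one schedulable task inside a Flow
    name: str
    task_type: str                       # e.g. "sync", "sparksql"
    priority: int = 3                    # lower value = more urgent
    depends_on: list = field(default_factory=list)
    runtime_conf: dict = field(default_factory=dict)

@dataclass
class Flow:
    # A Flow groups related tasks and is reviewed/published as one unit
    name: str
    tasks: list = field(default_factory=list)

    def add(self, task: Task) -> "Flow":
        self.tasks.append(task)
        return self

flow = Flow("dw_daily")
flow.add(Task("ods_load", "sync"))
flow.add(Task("dws_agg", "sparksql",
              priority=1,
              depends_on=["ods_load"],
              runtime_conf={"spark.executor.memory": "4g"}))
```

On publish, a model like this would be handed to the scheduler, which resolves `depends_on` into an execution order.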

Task Types – (1) Data synchronization (Oracle, OceanBase, MySQL, SFTP, HBase); (2) Spark tasks (SparkSQL offline); (3) Machine learning (AI model tasks); (4) Kylin tasks (data‑warehouse jobs); (5) Trigger tasks (external platform‑initiated jobs).
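One simple way to route the five task types to their executors (a hypothetical sketch, not the platform's real code) is a registry keyed by type name:

```python
# Hypothetical registry mapping task-type names to handler functions
HANDLERS = {}

def handler(task_type):
    """Decorator registering a handler for one task type."""
    def wrap(fn):
        HANDLERS[task_type] = fn
        return fn
    return wrap

@handler("sync")
def run_sync(task):
    # Data synchronization between stores (Oracle, MySQL, HBase, ...)
    return f"sync {task['source']} -> {task['target']}"

@handler("sparksql")
def run_sparksql(task):
    # Offline SparkSQL job
    return f"spark-sql -f {task['script']}"

def dispatch(task):
    try:
        return HANDLERS[task["type"]](task)
    except KeyError:
        raise ValueError(f"unknown task type: {task['type']}")
```

New task types (Kylin, AI models, triggers) then only need a new registered handler, not changes to the dispatcher.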

SparkSQL Task Editor Features – New task creation, syntax checking, single‑run execution, automatic dependency parsing, manual dependency addition, permission parsing, and lineage visualization.
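Automatic dependency parsing can be approximated by scanning the SQL text for source tables; a naive regex sketch (a production implementation would use a real SQL parser to handle subqueries, CTEs, and comments):

```python
import re

def extract_source_tables(sql: str) -> set:
    """Naively collect table names following FROM/JOIN keywords.
    Only an approximation: a proper SQL grammar is needed to
    handle aliases, CTEs, and nested queries correctly."""
    pattern = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)
    return set(pattern.findall(sql))

sql = """
INSERT OVERWRITE TABLE dws.user_stats
SELECT u.id, count(*) FROM ods.users u
JOIN ods.orders o ON u.id = o.user_id
GROUP BY u.id
"""
# extract_source_tables(sql) -> {"ods.users", "ods.orders"}
```

The extracted source tables drive both automatic dependency wiring (upstream tasks producing those tables) and lineage visualization.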

Dual‑Environment Deployment – Separate development and production Flows with mirrored task code; a dev prefix marks objects in the development environment so dev and production runs execute in parallel without interfering with each other.
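The mirroring can be implemented by rewriting object names with the dev prefix when a task runs in the development environment; a hypothetical sketch (prefix and naming scheme assumed for illustration):

```python
def mirror_table(table: str, env: str, prefix: str = "dev_") -> str:
    """Map a production object name to its dev-environment mirror,
    e.g. 'dws.user_stats' -> 'dws.dev_user_stats' when env == 'dev'.
    Production names pass through unchanged."""
    if env != "dev":
        return table
    db, sep, name = table.partition(".")
    return f"{db}.{prefix}{name}" if sep else f"{prefix}{table}"
```

Applying this rewrite uniformly to every read and write keeps the dev Flow's inputs and outputs fully disjoint from production data.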

Platform Architecture Practice – The system consists of an application layer (configuration management, data development, task management, external services) and a scheduling layer built on Airflow, extended with custom operators for SparkSQL and data exchange.

Scheduler Engine Selection – Compared Zeus, Airflow, and Azkaban; chose Airflow 1.10 for its Python extensibility, stability, and community support. Airflow runs DAG files stored in a directory, uses a metadata database, and supports Celery with Redis for task queuing.
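A custom SparkSQL operator ultimately has to turn task parameters into a `spark-sql` submission; a simplified, hypothetical command builder (the real operator would subclass Airflow's `BaseOperator` and invoke something like this from its `execute()` method):

```python
def build_spark_sql_cmd(script: str, queue: str, conf: dict) -> list:
    """Assemble a spark-sql CLI invocation from task parameters,
    returned as an argv list suitable for subprocess.run().
    --queue and --conf are standard spark-submit options on YARN."""
    cmd = ["spark-sql", "-f", script, "--queue", queue]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    return cmd

cmd = build_spark_sql_cmd(
    "dws_agg.sql",
    queue="core",
    conf={"spark.executor.memory": "4g"},
)
```

Keeping the command assembly separate from execution makes the operator easy to dry-run and unit-test without a cluster.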

Data Bus – Uses DataX as the core module; templates are generated from user‑provided parameters, submitted to YARN for offline batch processing.
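Template generation for the data bus amounts to filling a DataX job JSON from user-supplied parameters; a minimal sketch assuming a MySQL-to-HDFS sync (the `job`/`content`/reader-writer layout and plugin names follow DataX's convention, but the helper itself is hypothetical):

```python
import json

def build_datax_job(src: dict, dst: dict, channels: int = 3) -> str:
    """Render a DataX job config from user-supplied parameters.
    'channels' controls DataX's parallel transfer speed setting."""
    job = {
        "job": {
            "setting": {"speed": {"channel": channels}},
            "content": [{
                "reader": {"name": "mysqlreader", "parameter": src},
                "writer": {"name": "hdfswriter", "parameter": dst},
            }],
        }
    }
    return json.dumps(job, indent=2)
```

The rendered JSON is what gets submitted to YARN for offline batch transfer; swapping reader/writer names covers the other supported sources (Oracle, OceanBase, SFTP, HBase).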

Resource Isolation & Compute Optimization – Implements multi‑level queueing (core, important, normal) and dynamic throttling to protect high‑priority tasks; Spark optimizations include small‑file handling, data skew mitigation, join optimization, and task splitting.
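Multi-level queueing with dynamic throttling can be sketched as a priority queue in which normal-tier tasks are held back once cluster load crosses a threshold (the tier names come from the article; the mechanics here are a hypothetical illustration):

```python
import heapq

TIER_RANK = {"core": 0, "important": 1, "normal": 2}

class TieredQueue:
    """Dispatch core > important > normal; under high load,
    throttle by leaving normal-tier tasks queued."""
    def __init__(self):
        self._heap = []
        self._seq = 0  # FIFO tiebreaker within a tier

    def push(self, tier: str, task: str):
        heapq.heappush(self._heap, (TIER_RANK[tier], self._seq, task))
        self._seq += 1

    def pop(self, load: float, threshold: float = 0.8):
        """Return the next task to run, or None if the queue is empty
        or only throttled normal-tier work remains."""
        if not self._heap:
            return None
        rank, _, task = self._heap[0]
        if load >= threshold and rank == TIER_RANK["normal"]:
            return None  # protect high-priority capacity
        return heapq.heappop(self._heap)[2]
```

Core-tier tasks are thus never starved by lower tiers, and normal-tier work automatically resumes once load drops.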

Data Quality Monitoring – Monitors timeliness, accuracy, completeness, consistency, and validity; strong rules trigger task circuit‑breakers, while weak rules guide post‑analysis improvements via SparkSQL jobs.
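Strong and weak rules differ only in what happens on failure; a hypothetical evaluator in which a strong-rule failure circuit-breaks the task while weak failures are merely recorded for post-analysis:

```python
class QualityCheckFailed(Exception):
    """Raised when a strong rule fails: downstream tasks are blocked."""

def evaluate(rules, metrics):
    """Each rule is (name, predicate, strength). A failing strong
    rule raises immediately (circuit-breaker); failing weak rules
    are collected and returned as warnings."""
    warnings = []
    for name, predicate, strength in rules:
        if predicate(metrics):
            continue
        if strength == "strong":
            raise QualityCheckFailed(name)
        warnings.append(name)
    return warnings

rules = [
    ("row_count_positive", lambda m: m["rows"] > 0, "strong"),
    ("null_rate_low", lambda m: m["null_rate"] < 0.05, "weak"),
]
```

The metrics themselves (row counts, null rates, cross-table consistency checks) would be computed by SparkSQL jobs over the monitored tables, as the article describes.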

Cloud‑Native Practice – Micro‑service architecture separates services for real‑time, offline, data bus, quality jobs, monitoring, and AI models; CI/CD platform provides rapid testing, iteration, and deployment with unified monitoring and auto‑scaling.

Results – Overall compute cost was reduced by 87%, model feature computation now completes 7.5 hours earlier, and dashboard query latency improved 40‑fold.

Future Outlook – Emphasizes observability, further compute efficiency for SparkSQL workloads, and multi‑site disaster recovery for both batch and real‑time clusters.

Q&A – Discusses data permission management (organization‑based and metadata‑based) and data governance improvements that lower compute cost and increase performance.

Tags: cloud native, Big Data, data platform, data governance, Resource Isolation, Airflow
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
