
How a FinTech Scaled Its Data Platform with Alibaba Cloud EMR Serverless Spark

Weifin, a fintech innovator, tackled massive data‑scale challenges by adopting Alibaba Cloud EMR Serverless Spark, building a unified Spark‑based platform that supports data collection, lake ingestion, distributed machine‑learning training, and intelligent risk‑control applications, while achieving performance gains, cost reduction, and scalable automation.

Alibaba Cloud Big Data AI Platform

Company Overview

Weifin (微财) is a fintech company that provides consumer installment services and other financial information services, drawing on years of fintech experience and data‑processing expertise. Its brands include Haofenqi, which offers credit installment information, technology, and support to high‑growth users.

Business Challenges

Data assets are central to a fintech company's value, and Weifin relies on big data to assess loan risk. Rapid business growth has produced a large volume of user data, which has made large‑scale model training a bottleneck. Building a mature, stable, and efficient big‑data model training platform in‑house would have required substantial manpower and time.

Choosing the Spark Stack

The team researched compute engines for the data platform extensively. The goal was a mature, unified platform that supports data processing, analytics, and data‑science scenarios. The team's strong experience with Python and Spark, together with Spark's mature machine‑learning ecosystem, led to the decision to adopt the Spark stack.

Why Alibaba Cloud EMR Serverless Spark

Key problems in machine‑learning scenarios include breaking the single‑node data‑scale limit and improving training efficiency. EMR Serverless Spark’s fully managed service and elastic scaling meet these needs while guaranteeing isolated resources per user.

After technical exchanges and a proof of concept with the Alibaba Cloud EMR team, Weifin selected EMR Serverless Spark for its self‑developed Fusion engine, high‑performance vectorized computation, Remote Shuffle Service (RSS) capability, and unified support for data engineering and data science.

Core Advantages

Engine performance boost: The Fusion engine, with vectorized computation and RSS, delivers more than 3× the performance of open‑source Spark.

Complete Spark stack integration: Supports DataFrame, SQL, and PySpark for batch, streaming, interactive analytics, and ML; is compatible with Spark Submit, Livy, and Spark Thrift Server; and provides a built‑in SQL Editor and Notebook for ETL and data‑science development.

Serverless fully managed service: There are no underlying resources to manage, which lowers operating expenses (OPEX). Resources are provisioned within seconds and scale elastically at the task level.

High‑quality support and SLA: Alibaba Cloud offers technical support, a commercial SLA, and 24/7 expert service for Serverless Spark.

Total cost reduction: The Fusion engine's performance, together with storage‑compute separation on OSS, lowers overall cluster cost.

Technical Data Platform Architecture

Figure: Technical data platform architecture

Data Collection

In the early stage of building the Weifin data warehouse, the team developed the dw‑shell tool, which provides comprehensive data collection capabilities and abstracts differences between storage and compute engines. This tool enabled a complete migration of all big‑data tasks to the cloud within one month.

Data Lake Ingestion

The lake uses Apache Paimon as the storage framework and integrates Apache Spark, Flink, and Hive as compute engines, forming a complete data‑lake ecosystem that supports real‑time monitoring and analysis, significantly improving processing capability and business efficiency.
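
As a rough illustration of the Spark side of such a lake, the sketch below assembles the Spark properties that register a Paimon catalog and flattens them into spark-submit style arguments. The property keys follow the Apache Paimon Spark integration; the catalog name `paimon`, the warehouse path, and the helper `to_submit_args` are placeholders assumed for this example and are not taken from Weifin's configuration.

```python
# Spark properties that register an Apache Paimon catalog.
# The catalog name ("paimon") and the warehouse path are hypothetical placeholders.
PAIMON_CONF = {
    "spark.sql.catalog.paimon": "org.apache.paimon.spark.SparkCatalog",
    "spark.sql.catalog.paimon.warehouse": "oss://example-bucket/warehouse",
    "spark.sql.extensions": (
        "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions"
    ),
}


def to_submit_args(conf):
    """Flatten a dict of Spark properties into spark-submit style arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Print the arguments one pair per line for readability.
    pairs = to_submit_args(PAIMON_CONF)
    for flag, value in zip(pairs[0::2], pairs[1::2]):
        print(flag, value)
```

The same key/value pairs could equally be set in a job definition or on a session builder; only the property names matter here.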

Data Science

Machine‑learning training was moved from a single node to a big‑data cluster by combining a local Python environment with cloud‑based Serverless Spark. The self‑developed vulcan‑x framework makes writing distributed training code and tuning hyper‑parameters as easy as local development, which greatly shortens the learning curve for data scientists.

Typical Application Scenarios

Intelligent Risk Control

The MX Flow platform provides risk‑control capabilities, including feature mining, distributed training, and automated project management.

Feature Mining Support

Serverless Spark implements common binning methods (equal‑frequency, decision‑tree, chi‑square) and feature‑evaluation functions, enabling users to discretize features and assess risk discrimination in a distributed manner.
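
To make the binning and evaluation step concrete, here is a minimal single‑node sketch of equal‑frequency binning followed by an information value (IV) calculation, a common measure of how well a binned feature separates good and bad outcomes. It is plain Python, not the platform's code: the function names, the choice of IV as the evaluation metric, and the additive smoothing constant are all assumptions made for illustration. In a distributed job, the per‑bin good/bad counts are the quantities that would be aggregated across partitions.

```python
import math


def equal_frequency_edges(values, n_bins):
    """Return cut points that split values into roughly equal-sized bins."""
    ordered = sorted(values)
    edges = [ordered[i * len(ordered) // n_bins] for i in range(1, n_bins)]
    # Ties can produce duplicate cut points; keep each one once.
    return sorted(set(edges))


def assign_bin(value, edges):
    """Index of the bin holding value; bin i covers [edges[i-1], edges[i])."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def information_value(values, labels, edges, smoothing=0.5):
    """IV of a binned feature against a binary label (1 = bad, 0 = good)."""
    n_bins = len(edges) + 1
    bad = [0] * n_bins
    good = [0] * n_bins
    for value, label in zip(values, labels):
        idx = assign_bin(value, edges)
        if label == 1:
            bad[idx] += 1
        else:
            good[idx] += 1
    # Additive smoothing avoids log(0) when a bin has no goods or no bads.
    total_bad = sum(bad) + smoothing * n_bins
    total_good = sum(good) + smoothing * n_bins
    iv = 0.0
    for b, g in zip(bad, good):
        p_bad = (b + smoothing) / total_bad
        p_good = (g + smoothing) / total_good
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


if __name__ == "__main__":
    feature = list(range(100))
    label = [1 if v < 20 else 0 for v in feature]  # synthetic labels
    cuts = equal_frequency_edges(feature, 5)
    print(cuts, information_value(feature, label, cuts))
```

A higher IV indicates stronger discrimination; a feature whose bad rate is the same in every bin scores close to zero.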

Distributed Training

Building on Spark MLlib and open‑source implementations such as SynapseML's LightGBM, the platform supports Random Forest, Logistic Regression, LightGBM, CatBoost, XGBoost, and other algorithms. Tests show that training time scales linearly with dataset size; a dataset of 5 × 10⁷ rows (50 million) trains in about 20 minutes on Serverless Spark.

Figure: Training performance chart
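
The reported data point (about 5 × 10⁷ rows in about 20 minutes) can be turned into a back‑of‑the‑envelope estimate. The sketch below assumes strictly proportional scaling through the origin on the same resources, which ignores fixed startup overhead and any change in cluster size, so treat the results as rough guidance only. The function names are illustrative.

```python
# Data point reported in the article: about 5e7 rows train in about 20 minutes.
REFERENCE_ROWS = 5 * 10**7
REFERENCE_MINUTES = 20.0


def estimate_minutes(rows):
    """Extrapolate training time, assuming time is proportional to row count."""
    return REFERENCE_MINUTES * rows / REFERENCE_ROWS


def rows_per_second():
    """Implied average throughput at the reference data point."""
    return REFERENCE_ROWS / (REFERENCE_MINUTES * 60)


if __name__ == "__main__":
    for rows in (10**7, 5 * 10**7, 10**8):
        print(rows, "rows ->", estimate_minutes(rows), "minutes (estimated)")
    print("implied throughput:", round(rows_per_second()), "rows/second")
```

Under this assumption, doubling the dataset to 10⁸ rows would take roughly 40 minutes, and the implied throughput is on the order of 40,000 rows per second.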

Automated Project Management

MX Flow and vulcan‑x together form a client‑server interaction model. Data generated by code running locally is automatically organized into dataset services on the server, which provides visual process reports, model management, full Optuna‑compatible hyper‑parameter tuning, and a dashboard for monitoring tuning runs.

Figure: Project management UI
Figure: Model management UI
Figure: Hyperparameter tuning UI
Figure: Dashboard UI

Summary and Outlook

As Weifin’s data volume continues to grow, the team plans to further scale distributed training and introduce deep‑learning capabilities in intelligent risk control, using multi‑GPU, multi‑node frameworks such as Horovod and PyTorch Distributed. Research will focus on optimizing data‑parallel and model‑parallel strategies in distributed environments to improve training efficiency and scalability.
