How a FinTech Scaled Its Data Platform with Alibaba Cloud EMR Serverless Spark
Weifin, a fintech company, addressed its data-scale challenges by adopting Alibaba Cloud EMR Serverless Spark. It built a unified Spark-based platform that covers data collection, data lake ingestion, distributed machine-learning training, and intelligent risk-control applications, gaining performance, reducing cost, and automating workflows at scale.
Company Overview
Weifin (微财) is a fintech company that provides consumer installment services and other financial information services, drawing on years of fintech experience and data processing expertise. Its brands include Haofenqi, which offers high-growth users comprehensive credit installment borrowing information, technology, and support.
Business Challenges
Data assets are the core value of a fintech business, and Weifin relies on big data to assess loan risk. Rapid business growth has produced a massive volume of user data, and large-scale model training has become a bottleneck. Building a mature, stable, and efficient big-data model training platform in-house would require substantial manpower and time.
Choosing the Spark Stack
The team researched compute engines for the data platform extensively. The goal was a mature, unified platform that supports data processing, analytics, and data-science scenarios. The team's strong experience with Python and Spark, together with Spark's mature machine-learning ecosystem, led to the decision to adopt the Spark stack.
Why Alibaba Cloud EMR Serverless Spark
Key problems in machine‑learning scenarios include breaking the single‑node data‑scale limit and improving training efficiency. EMR Serverless Spark’s fully managed service and elastic scaling meet these needs while guaranteeing isolated resources per user.
After technical exchanges and a proof of concept with the Alibaba Cloud EMR team, Weifin selected EMR Serverless Spark for Alibaba Cloud's self-developed Fusion engine, high-performance vectorized computation, Remote Shuffle Service (RSS) capability, and unified support for data engineering and data science.
Core Advantages
Engine performance boost: The Fusion engine, with vectorized computation and RSS, delivers more than 3× the performance of open-source Spark.
Complete Spark stack integration: Supports DataFrame, SQL, and PySpark for batch, streaming, interactive analytics, and ML; is compatible with Spark Submit, Livy, and Spark Thrift Server; and provides a built-in SQL Editor and Notebook for ETL and data-science development.
Serverless, fully managed service: Zero operations work and no underlying resources to manage, which reduces OPEX. Resources are provisioned within seconds and scale elastically at the task level.
High-quality support and SLA: Alibaba Cloud offers technical support, a commercial SLA, and 24/7 expert service for Serverless Spark.
Total cost reduction: The Fusion engine's performance, together with storage-compute separation on OSS, lowers overall cluster cost.
Technical Data Platform Architecture
Data Collection
In the early stage of building the Weifin data warehouse, the team developed the dw‑shell tool, which provides comprehensive data collection capabilities and abstracts differences between storage and compute engines. This tool enabled a complete migration of all big‑data tasks to the cloud within one month.
Data Lake Ingestion
The lake uses Apache Paimon as the storage framework and integrates Apache Spark, Flink, and Hive as compute engines, forming a complete data‑lake ecosystem that supports real‑time monitoring and analysis, significantly improving processing capability and business efficiency.
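As a minimal sketch of how Spark can be pointed at a Paimon lake, the helper below assembles the Spark properties that register a Paimon catalog and renders them as `--conf` arguments. The property names and class names follow Apache Paimon's Spark integration; the catalog name `paimon` and the warehouse path `oss://example-bucket/warehouse` are illustrative assumptions, not values from Weifin's deployment.

```python
def paimon_spark_conf(warehouse, catalog="paimon"):
    """Spark properties that register a Paimon catalog under the given name."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.paimon.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        # Enables Paimon's SQL extensions (for example CALL procedures).
        "spark.sql.extensions": (
            "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions"
        ),
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical warehouse location, used only for illustration.
conf = paimon_spark_conf("oss://example-bucket/warehouse")
submit_args = to_submit_args(conf)
print(" ".join(submit_args))
```

With these properties set, tables in the lake would be addressed from Spark SQL as `paimon.<database>.<table>`, assuming the Paimon Spark connector is available on the classpath.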
Data Science
Machine-learning training moved from a single node to a big-data cluster, combining a local Python environment with cloud-based Serverless Spark. The self-developed vulcan-x framework makes writing distributed training code and tuning hyper-parameters as easy as local development, which greatly flattens the learning curve for data scientists.
Typical Application Scenarios
Intelligent Risk Control
The MX Flow platform provides risk‑control capabilities, including feature mining, distributed training, and automated project management.
Feature Mining Support
Common binning methods (equal-frequency, decision-tree, chi-square) and feature-evaluation functions are implemented on Serverless Spark, so users can discretize features and assess how well each one discriminates risk in a distributed manner.
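To make the idea concrete, here is a small single-machine sketch of equal-frequency binning followed by one common feature-evaluation measure, Information Value (IV). The choice of IV, the smoothing constant `eps`, and the toy data are assumptions for illustration; the article does not say which evaluation functions the platform uses, and the real implementation runs on Spark rather than on Python lists.

```python
import math
from bisect import bisect_right


def equal_frequency_edges(values, n_bins):
    """Cut points that split the values into bins of roughly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[(k * n) // n_bins]
        if not edges or edge > edges[-1]:  # skip duplicate cut points
            edges.append(edge)
    return edges


def information_value(values, labels, edges, eps=0.5):
    """IV = sum over bins of (p_good - p_bad) * ln(p_good / p_bad).

    labels: 1 marks a bad (defaulted) row, 0 a good one.
    eps: additive smoothing so that empty bins do not produce log(0).
    """
    n_bins = len(edges) + 1
    good = [0] * n_bins
    bad = [0] * n_bins
    for value, label in zip(values, labels):
        idx = bisect_right(edges, value)  # bin index for this value
        if label == 1:
            bad[idx] += 1
        else:
            good[idx] += 1
    total_good, total_bad = sum(good), sum(bad)
    iv = 0.0
    for g, b in zip(good, bad):
        p_good = (g + eps) / (total_good + eps * n_bins)
        p_bad = (b + eps) / (total_bad + eps * n_bins)
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


# Toy data: one feature that separates bad rows well, one that does not.
scores = list(range(100))
strong_labels = [1 if s < 25 else 0 for s in scores]
weak_labels = [1 if s % 4 == 0 else 0 for s in scores]

edges = equal_frequency_edges(scores, n_bins=4)
iv_strong = information_value(scores, strong_labels, edges)
iv_weak = information_value(scores, weak_labels, edges)
print(edges, round(iv_strong, 3), round(iv_weak, 3))
```

A higher IV means the binned feature separates good and bad rows more clearly. In a distributed setting the cut points would typically come from approximate quantiles (for example PySpark's `DataFrame.approxQuantile` or `QuantileDiscretizer`) and the per-bin counts from a group-by aggregation, rather than from sorting a list in memory.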
Distributed Training
The platform builds on Spark MLlib and open-source algorithm libraries such as SynapseML's LightGBM, and supports Random Forest, Logistic Regression, LightGBM, CatBoost, XGBoost, and others. In tests, training time scaled linearly with dataset size; a dataset of 5 × 10⁷ rows (50 million) trained in about 20 minutes on Serverless Spark.
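If the reported linear relationship holds, the single published data point (5 × 10⁷ rows in about 20 minutes) gives a rough way to estimate training time for other dataset sizes. The sketch below is back-of-the-envelope arithmetic only: it assumes the same model, the same resources, and that linearity continues beyond the tested range, none of which the article confirms.

```python
REFERENCE_ROWS = 50_000_000   # 5 x 10^7 rows, from the reported test
REFERENCE_MINUTES = 20.0      # approximate training time for that test


def estimated_training_minutes(rows):
    """Linear extrapolation from the single reported data point."""
    return REFERENCE_MINUTES * rows / REFERENCE_ROWS


for rows in (10_000_000, 50_000_000, 200_000_000):
    minutes = estimated_training_minutes(rows)
    print(f"{rows:>12,d} rows -> ~{minutes:.0f} min")
```

Under these assumptions the implied throughput is about 2.5 million rows per minute, so 2 × 10⁸ rows would take roughly 80 minutes.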
Automated Project Management
MX Flow and vulcan-x together form a client-server interaction model. Data generated by code running locally is automatically organized into dataset services on the server. The server provides visual process reports, model management, and hyper-parameter tuning that is fully compatible with Optuna, along with a dashboard for following tuning runs.
Summary and Outlook
As Weifin’s data volume continues to grow, the team plans to further scale distributed training and introduce deep‑learning capabilities in intelligent risk control, using multi‑GPU, multi‑node frameworks such as Horovod and PyTorch Distributed. Research will focus on optimizing data‑parallel and model‑parallel strategies in distributed environments to improve training efficiency and scalability.
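For readers unfamiliar with the data-parallel strategy mentioned here, the following toy sketch shows its core step: each worker computes a gradient on its own shard, and the gradients are averaged before every update. It is plain Python with three simulated workers and a one-variable linear model, all of which are illustrative assumptions; frameworks such as Horovod and PyTorch Distributed perform the averaging with an all-reduce across real devices.

```python
def gradient(w, b, rows):
    """Mean-squared-error gradient of y ~ w * x + b over (x, y) rows."""
    n = len(rows)
    gw = sum(2 * (w * x + b - y) * x for x, y in rows) / n
    gb = sum(2 * (w * x + b - y) for x, y in rows) / n
    return gw, gb


def all_reduce_mean(grads, sizes):
    """Average per-worker gradients, weighted by shard size."""
    total = sum(sizes)
    gw = sum(g[0] * s for g, s in zip(grads, sizes)) / total
    gb = sum(g[1] * s for g, s in zip(grads, sizes)) / total
    return gw, gb


# Toy dataset generated from y = 3x + 1, split across three simulated workers.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
shards = [data[0:3], data[3:6], data[6:8]]
sizes = [len(s) for s in shards]

# The averaged shard gradients match the gradient over the full dataset.
full_grad = gradient(0.0, 0.0, data)
merged_grad = all_reduce_mean([gradient(0.0, 0.0, s) for s in shards], sizes)

w, b = 0.0, 0.0
learning_rate = 0.02
for _ in range(2000):
    local_grads = [gradient(w, b, s) for s in shards]  # one per worker
    gw, gb = all_reduce_mean(local_grads, sizes)       # synchronization step
    w -= learning_rate * gw
    b -= learning_rate * gb

print(round(w, 4), round(b, 4))
```

Because the weighted average reproduces the full-data gradient, every worker applies the same update and the model replicas stay in sync. Model parallelism, by contrast, splits the model itself across devices and is not shown here.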
