Big Data Architecture and Solutions at Du Xiaoman Financial: MMR, Honghu Data Lake, and Yichuang Model Training Platform
This article presents Du Xiaoman Financial's big‑data architecture challenges and three integrated solutions—MMR cloud‑based data framework, the Honghu data‑lake management platform, and the Yichuang model‑training monitoring system—detailing their design, governance, low‑threshold usage, and future outlook.
Guest Speaker: Zhao Hui, Architect at Du Xiaoman Financial
Editor: Jiang Wenjuan, Xiamen University Jiageng College
Platform: DataFunTalk
Introduction: Big‑data architecture in financial scenarios faces challenges such as fine‑grained control over data processing, storage, and usage, low‑threshold user access, and the need to preserve operational experience.
Solution Overview: The talk shares three solutions: (1) MMR – a Baidu Cloud‑based big‑data architecture for governance; (2) Honghu – Du Xiaoman's data‑lake management and analysis platform for lowering barriers; (3) Yichuang – a model‑training monitoring and evaluation system for experience inheritance.
01 Big Data Cloud Architecture – MMR
Du Xiaoman's cloud‑native big‑data architecture builds on Baidu Cloud's standard products, extending the architecture to meet finer governance needs across the full data lifecycle.
The architecture is divided into five layers: a user layer, a table‑control management layer, a compute layer, a virtual storage layer, and a physical storage layer.
1. User Layer
The user layer implements identity‑based control by integrating with Du Xiaoman's employee management system, tagging each operation with the user's identity to enable precise responsibility tracking.
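The identity-tagging idea above can be sketched as a thin audit wrapper: every operation is stamped with the submitting employee's identity before it runs, so any action can be traced back to a person. This is a minimal illustration with hypothetical names, not Du Xiaoman's actual implementation.

```python
import datetime

class AuditedSession:
    """Hypothetical sketch: tag every operation with the submitting
    employee's identity so responsibility can be traced precisely."""

    def __init__(self, employee_id):
        self.employee_id = employee_id
        self.audit_log = []

    def run(self, operation, target):
        # Stamp the operation with identity and timestamp before execution.
        record = {
            "employee": self.employee_id,
            "operation": operation,
            "target": target,
            "at": datetime.datetime.now().isoformat(),
        }
        self.audit_log.append(record)
        return record

session = AuditedSession("zhang.san")
entry = session.run("SELECT", "dw.user_profile")
```

In a real deployment the identity would come from the employee management system's authentication token rather than a constructor argument.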
2. Table‑Control Management Layer
To support field‑level permission control for Hive tables, a permission‑control center allows users to label field sensitivity and set sharing permissions, with the service layer validating field‑level access during task submission.
3. Compute Layer
The compute layer leverages Baidu Cloud resources and introduces a virtual management layer for non‑structured data sharing and isolation, applying directory‑level and IP‑based permissions to achieve controlled data sharing and auditability.
To bridge the gap between Baidu's legacy architecture and the open‑source cloud architecture, a unified client smooths user experience, and a virtual storage layer presents object storage as a familiar file system.
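The virtual-storage idea can be illustrated with a toy wrapper that exposes flat object-store keys through file-system style calls, so users keep a familiar directory view. This is a sketch under assumed names; the real layer sits over Baidu Cloud object storage.

```python
class VirtualFS:
    """Toy virtual storage layer: present a flat key/value object store
    as a hierarchical file system. All names are illustrative."""

    def __init__(self):
        self._objects = {}  # key -> bytes; stands in for object storage

    def write(self, path, data):
        self._objects[path.lstrip("/")] = data

    def read(self, path):
        return self._objects[path.lstrip("/")]

    def listdir(self, prefix):
        # Derive immediate "children" of a directory from key prefixes.
        prefix = prefix.strip("/") + "/"
        seen = set()
        for key in self._objects:
            if key.startswith(prefix):
                rest = key[len(prefix):]
                seen.add(rest.split("/", 1)[0])
        return sorted(seen)

fs = VirtualFS()
fs.write("/data/a.csv", b"1,2,3")
fs.write("/data/logs/x.log", b"ok")
```

Listing `/data` then shows `a.csv` and a `logs` subdirectory, even though the backend stores only flat keys.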
Intelligent scheduling and high availability are achieved through windowed scheduling, dynamic task timing, and heartbeat‑based agent monitoring to quickly locate failures.
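The heartbeat-based monitoring mentioned above can be sketched simply: agents report heartbeats, and any agent silent longer than a timeout is flagged so the failure can be located quickly. The class and timeout value are illustrative assumptions.

```python
import time

class AgentMonitor:
    """Sketch of heartbeat-based agent monitoring: agents check in
    periodically; silence beyond `timeout` seconds flags a failure."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, agent_id, now=None):
        # Record the latest heartbeat time for this agent.
        self.last_seen[agent_id] = now if now is not None else time.time()

    def dead_agents(self, now=None):
        # Any agent whose last heartbeat is older than the timeout.
        now = now if now is not None else time.time()
        return [a for a, t in self.last_seen.items() if now - t > self.timeout]

monitor = AgentMonitor(timeout=30.0)
monitor.heartbeat("agent-a", now=100.0)
monitor.heartbeat("agent-b", now=125.0)
```

At `now=140.0`, `agent-a` (last seen 40 s ago) is flagged while `agent-b` (15 s ago) is healthy.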
02 Data‑Lake Management and Analysis Platform – Honghu
Targeted at strategy analysts, Honghu provides an agile, intelligent, and user‑friendly data‑lake platform.
Unified metadata management aggregates metadata from various storage systems, eliminating silos.
Domain construction and intelligent recommendation reduce data discovery costs.
Data‑quality control produces quality reports to assure data usability.
Permission, desensitization, and encryption ensure data security throughout the pipeline.
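The unified-metadata and domain-tagging points above can be sketched as one catalog that registers tables from heterogeneous storage systems and supports tag-based discovery. Table names, storage names, and tags here are hypothetical examples.

```python
class MetadataCatalog:
    """Sketch of unified metadata management: one catalog aggregates
    tables from different storage systems and attaches domain tags,
    so discovery no longer depends on knowing where data lives."""

    def __init__(self):
        self._entries = {}

    def register(self, name, storage, domain_tags):
        # Register a table regardless of its backing storage system.
        self._entries[name] = {"storage": storage, "tags": set(domain_tags)}

    def search_by_tag(self, tag):
        # Cross-system discovery by business domain tag.
        return sorted(n for n, e in self._entries.items() if tag in e["tags"])

catalog = MetadataCatalog()
catalog.register("dw.loans", "Hive", ["risk", "lending"])
catalog.register("ods.click_stream", "Kafka", ["growth", "risk"])
```

Searching by the `risk` tag then surfaces both tables even though one lives in Hive and the other in Kafka.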
1. Agile, Intelligent Global Data Management
By unifying metadata, building domain tags, and offering quality reports, the platform lowers the cognitive barrier for data discovery and usage.
2. Multi‑Engine Visual Drag‑and‑Drop Batch & Stream Development Platform
The data‑exchange platform enables code‑free data acquisition.
The visual IDE supports syntax checking, highlighting, formatting, and one‑click deployment for Hive, Spark, Flink, GP, Shell, and more.
Drag‑and‑drop scheduling visualizes task dependencies and provides multi‑dimensional monitoring.
Data analysis supports MPP analytical (OLAP) engines such as Greenplum, along with ad‑hoc Presto queries.
The Data API allows one‑click activation of analysis results into online systems.
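Under the hood, the dependencies a user wires up by drag and drop form a directed acyclic graph that the scheduler resolves into an execution order. A minimal sketch with Python's standard-library `graphlib` and a hypothetical four-task pipeline:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it
# depends on, mirroring edges drawn in a drag-and-drop canvas.
tasks = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "report": {"features"},
}

# Resolve the DAG into a valid execution order; a cycle would raise
# graphlib.CycleError, which the platform could surface to the user.
order = list(TopologicalSorter(tasks).static_order())
```

For this linear chain the resolved order is `extract → clean → features → report`; a real scheduler would additionally run independent branches in parallel and attach per-task monitoring.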
3. One‑Stop Data Lifecycle Analysis Platform
The platform dramatically reduces the entry barrier for analysts by integrating the entire data lifecycle.
03 Model Training Monitoring and Evaluation System – Yichuang
Yichuang offers a comprehensive, one‑click model training and deployment workflow.
Unified code and environment management standardizes model and feature engineering.
Standardized sample and feature libraries resolve data consistency issues.
Dedicated training clusters ensure efficient, standardized model training.
Online and offline feature stores maintain consistency through mirroring and multi‑dimensional validation.
Plugin‑Based Model Evaluation Framework
The modular evaluation framework expands evaluation dimensions while unifying assessment criteria.
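One common way to build such a plugin framework is a registry: each metric registers itself under a name, and the framework runs whatever plugins are installed, so new evaluation dimensions slot in without touching core code. A minimal sketch with two hypothetical metrics:

```python
# Plug-in registry: metric name -> evaluation function.
EVALUATORS = {}

def evaluator(name):
    """Decorator that registers a metric as an evaluation plug-in."""
    def register(fn):
        EVALUATORS[name] = fn
        return fn
    return register

@evaluator("accuracy")
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

@evaluator("positive_rate")
def positive_rate(y_true, y_pred):
    return sum(y_pred) / len(y_pred)

def evaluate(y_true, y_pred):
    # Core framework code: runs every installed plug-in uniformly,
    # so all models are judged by the same criteria.
    return {name: fn(y_true, y_pred) for name, fn in EVALUATORS.items()}

report = evaluate([1, 0, 1, 1], [1, 0, 0, 1])
```

Adding a new dimension (say, a stability metric) is then just another decorated function, with no change to `evaluate` itself.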
04 Outlook & Q&A
Future directions include embracing cloud‑native to break resource silos, integrating lake‑warehouse architectures for faster business iteration, and optimizing the one‑stop big‑data platform to unlock full data value.
Q: How are online and offline features kept consistent?
A: Offline feature stores manage quality; online stores mirror the latest offline version. Model training uses offline features, while online scoring uses the online store, with multi‑dimensional validation during synchronization.
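The mirror-then-validate flow described in the answer can be sketched as follows, treating both stores as simple key/value mappings; the function name and sampling scheme are assumptions for illustration.

```python
def sync_features(offline_store, online_store, sample_keys):
    """Sketch of offline-to-online feature sync: mirror the latest
    offline snapshot into the online store, then spot-check a sample
    of keys for exact agreement before declaring the sync healthy."""
    # Mirror: the online store becomes a copy of the offline version.
    online_store.clear()
    online_store.update(offline_store)
    # Validate: sampled keys must match exactly after the copy.
    mismatches = [
        k for k in sample_keys
        if online_store.get(k) != offline_store.get(k)
    ]
    if mismatches:
        raise ValueError(f"feature sync validation failed for {mismatches}")
    return len(online_store)

offline = {"u1:credit_score": 0.82, "u2:credit_score": 0.31}
online = {}
synced = sync_features(offline, online, sample_keys=["u1:credit_score"])
```

Training then reads `offline` while online scoring reads `online`, and because one is a validated mirror of the other, both sides see the same feature values.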
Thank you for listening.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.