
Machine Learning and Deep Learning Engineering Practices at Ping An Life

This article summarizes senior AI expert Wu Jianjun's presentation on machine-learning and deep-learning engineering at Ping An Life. It details the company's big-data platform, data-processing pipelines, model-training frameworks, distributed-computing strategies, and production model-serving architecture for financial applications.


This article is based on the talk "Machine Learning/Deep Learning Engineering Practice," given by senior AI expert Wu Jianjun at the Ping An Life & DataFunTalk algorithm salon. It has been edited for clarity without changing the original meaning.

The presentation opened with an overview of Ping An Life's AI technology stack. The big-data platform is split into platform-level development (offline, real-time, and multidimensional analysis engines) and application-level development (data collection, cleaning, reporting, and profiling). Algorithm research spans statistical analysis, machine learning, deep learning, and emerging reinforcement learning, while backend systems comprise component development (service framework, training platform, container platform) and service development.

The platform architecture relies on Kafka for data ingestion, after which data is stored in Hadoop and relational databases. Data cleaning uses Hive (HQL) and Spark for complex processing. Real‑time analytics employ Druid and Elasticsearch for single‑table analysis, while Presto and Impala handle multi‑table joins. Additional tools such as MATLAB, SAS, TensorFlow, HBase, and Redis support actuarial modeling, deep learning, and profile storage.

AI is widely used in finance at Ping An Life, powering agent management (recruitment, sales, promotion), intelligent customer service, claim processing, and other scenarios, leveraging the massive amount of data generated by millions of agents.

The insurance domain faces data challenges: long decision cycles, low‑frequency interactions, complex and unstable data from many business lines, and high modeling costs. To address these, the workflow includes profile generation, quality inspection, and data embedding to create standardized representations.

Data quality validation focuses on stability, importance (IV, chi‑square, variable importance), and correlation (correlation coefficient, PCA/RUFS, VIF). A Spark‑Python tool was built to flexibly configure and output validation results with a single click.
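As an illustration of the importance metrics above, the following is a minimal pure-Python sketch of Information Value (IV) for a binned variable. The function name and the two-bin toy data are hypothetical, not from the talk:

```python
from math import log

def information_value(goods, bads):
    """Information Value (IV) of a binned variable.

    goods[i] / bads[i] are the counts of good / bad outcomes in bin i.
    IV = sum over bins of (good% - bad%) * ln(good% / bad%).
    """
    total_good, total_bad = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        good_pct = g / total_good
        bad_pct = b / total_bad
        iv += (good_pct - bad_pct) * log(good_pct / bad_pct)
    return iv

# Identical good/bad distributions carry no information (IV = 0);
# bins that separate the classes push IV up.
print(information_value([50, 50], [50, 50]))  # 0.0
print(information_value([80, 20], [20, 80]))
```

In practice each such metric would be one configurable check inside the Spark-Python validation tool described above.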

Embedding techniques include GBDT‑based feature combination, FM encoding, low‑rank factorization, KB encoding, TF/IDF, word2vec for text, and CNN‑based image embeddings. These representations are then merged with other data for unified modeling.
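To make the GBDT-based feature combination concrete: each sample is mapped to the leaf it reaches in every tree, and those leaf indices are one-hot encoded into a sparse binary vector that a downstream linear or FM model consumes. A toy sketch, assuming every tree has the same number of leaves and the leaf indices are already computed:

```python
def gbdt_leaf_embedding(leaf_indices, leaves_per_tree):
    """One-hot encode the leaf each sample lands in, per tree.

    leaf_indices: for one sample, the leaf index reached in each tree.
    leaves_per_tree: number of leaves per tree (assumed uniform here).
    Returns a binary vector of length n_trees * leaves_per_tree.
    """
    vec = [0] * (len(leaf_indices) * leaves_per_tree)
    for tree, leaf in enumerate(leaf_indices):
        vec[tree * leaves_per_tree + leaf] = 1
    return vec

# One sample lands in leaf 2 of tree 0 and leaf 0 of tree 1 (4 leaves each).
print(gbdt_leaf_embedding([2, 0], 4))  # [0, 0, 1, 0, 1, 0, 0, 0]
```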

Distributed training relies on data parallelism: data is split across workers, each computes gradients, which are aggregated and broadcast back for model updates. Parameter‑update strategies discussed include BSP (synchronous), ASP (asynchronous), SSP (stale‑synchronous), and the PS‑Lite framework supporting all three.
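The BSP variant of this loop can be sketched in a few lines. This is a single-process simulation of the synchronous pattern (the function names and toy scalar model are illustrative, not the PS-Lite API): each "worker" computes a gradient on its shard, a barrier waits for all of them, and the averaged gradient produces one broadcast update.

```python
def grad_shard(w, shard):
    """Mean-squared-error gradient on one data shard for the model y = w*x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def bsp_step(w, shards, lr=0.1):
    # Synchronous barrier: collect every worker's gradient before updating.
    grads = [grad_shard(w, s) for s in shards]
    avg = sum(grads) / len(grads)
    return w - lr * avg  # broadcast the updated weight back to all workers

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by w = 2
shards = [data[:2], data[2:]]                             # two "workers"
w = 0.0
for _ in range(50):
    w = bsp_step(w, shards)
print(round(w, 3))  # converges to 2.0
```

ASP would apply each worker's gradient as soon as it arrives, and SSP bounds how stale the slowest worker may be before the barrier is enforced.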

On Spark, several ML libraries are used: MLlib (decision trees, SVM, LR), Splash (MCMC, Gibbs Sampling, LDA, 20× faster than MLlib), DeepLearning4j for deep learning (GPU support but less flexible than TensorFlow), and the in‑house PAMLkit (NB, AdaGrad+FM, FTRL+LR).

Best coding practices emphasize modular design (separate Gradient, Updater, Optimizer classes) and sparse‑vector operations to avoid performance degradation.
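The sparse-vector point can be illustrated with dictionary-backed vectors: the key idea is that every operation should cost time proportional to the non-zeros, never the full dimensionality. A minimal sketch (hypothetical helper names, not from the talk's codebase):

```python
def sparse_dot(u, v):
    """Dot product of sparse vectors stored as {index: value} dicts.

    Iterating over the smaller dict keeps the cost proportional to its
    non-zero count instead of the dense dimensionality.
    """
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[i] for i, val in u.items() if i in v)

def sparse_axpy(alpha, x, y):
    """y += alpha * x, touching only x's non-zero coordinates."""
    for i, val in x.items():
        y[i] = y.get(i, 0.0) + alpha * val
    return y

w = {0: 1.0, 7: 2.0}
x = {7: 3.0, 9: 5.0}
print(sparse_dot(w, x))  # 6.0
```

Structuring Gradient/Updater/Optimizer as separate classes then lets each optimizer reuse these primitives without densifying high-dimensional feature vectors.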

TensorFlow is used for deep learning on structured, visual, and textual data, employing DNN, CNN, AE, and emerging reinforcement learning. Distributed TensorFlow training involves manual process launch, data sharding, and limited fault tolerance; training evolved from single‑GPU to multi‑GPU synchronous mode and finally to multi‑node setups using pdsh for data distribution.

Joint modeling combines Spark’s parallelism with TensorFlow’s deep models (e.g., GBDT+FM on Spark, DNN on TensorFlow). Data output from Spark is stored in HDFS, then read by TensorFlow workers launched via pdsh; research is ongoing on co‑existing Spark‑TensorFlow clusters with per‑RDD computation graphs.

Model serving faces challenges such as hundreds of models, diverse platforms (MATLAB, Java, Python, SAS, R, Spark, TensorFlow), complex algorithm combinations, and heterogeneous data processing. The solution uses Thrift for cross‑language RPC, Zookeeper for coordination, Redis for online storage, Netty for communication, Docker for containers, and Nginx for load balancing.

The serving architecture consists of three layers: model processing (parsers outputting PMML, protobuf, or custom formats), data computation (feature engineering operators), and interface (HTTP APIs with load balancers). A management and monitoring platform oversees the whole pipeline.

In summary, the talk provides a comprehensive view of how Ping An Life integrates big‑data platforms, distributed machine‑learning frameworks, and robust model‑serving infrastructure to enable AI applications across its financial services.

—END

Tags: Big Data, Machine Learning, Deep Learning, Distributed Computing, Model Serving, Financial AI
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
