
Real-Time Customer Data Platform (RT‑CDP) Architecture and Implementation at iFanFan

This article explains the concept, challenges, and key business goals of a real‑time Customer Data Platform, details the technology stack selection—including Nebula Graph, Apache Flink, Apache Beam, Kudu, and Doris—and describes the modular architecture, data model, identity service, streaming computation, storage layers, rule engine, operational results, and future directions.

DataFunTalk

1. What is CDP?

CDP (Customer Data Platform) emerged to solve data silos in the Marketing 3.0 era, offering a unified, persistent customer database for marketing rather than just sales (CRM) or anonymous advertising (DMP).

1.2 CDP Definition

The CDP Institute defines a CDP as "packaged software that creates a persistent, unified customer database accessible to other systems." The definition highlights three elements: it is packaged software, it maintains a persistent and unified customer database, and that database is accessible to other systems.

1.3 CDP Classification

Data CDPs: core data management, multi‑source ingestion, identity resolution, unified storage.

Analytics CDPs: add segmentation, machine learning, and predictive modeling.

Campaign CDPs: add cross‑channel customer treatments and real‑time interactions.

Delivery CDPs: add message delivery (email, push, ads).

The article focuses on Analytics CDPs.

2. Challenges and Goals

2.1 Challenges

Multiple heterogeneous data channels make integration difficult.

Cross‑channel identity gaps prevent 360° customer view.

Complex segmentation rules.

Supporting both B2B2C and B2C tenants.

2.2 RT‑CDP Construction Goals

Flexible data ingestion.

Support B2C and B2B data models.

Unified user and enterprise profiles.

Real‑time cross‑channel identity management.

Powerful real‑time segmentation.

Secure, long‑term data storage.

3. Technical Selection

3.1 Identity Relationship Storage

Evaluated relational databases and Spark GraphX before settling on a graph database; later migrated from Dgraph to Nebula Graph for better scalability.

3.2 Streaming Engine

Selected Apache Flink for true streaming, Apache Beam for unified programming, and Apache Doris for analytical queries.

3.3 Massive Storage Engine

Adopted Impala + Kudu for hot data, Parquet for cold data, and Doris for OLAP analytics.

3.4 Rule Engine

Built a custom real‑time rule engine on Flink to handle complex, multi‑tenant segmentation.

4. Platform Architecture

4.1 Overall Architecture

Divided into five layers: data sources, data collection, real‑time warehouse, data applications, and shared components.

4.2 Core Modules

Data Source & Collection

Identity Service (graph‑based ID mapping)

Real‑time Computation (Flink + Beam)

Unified Profile Store (Kudu, Parquet)

Unified Query Service (Impala, Doris, Presto, ES)

Real‑time Rule Engine

4.3 Key Implementations

Data Definition Model

Introduced Schema, Field, and Behavior concepts to allow flexible tenant‑specific data structures.
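As a rough illustration of how these three concepts might compose, the sketch below models a tenant-owned Schema as a set of typed Fields and a Behavior as an event bound to a payload schema. All class and field names here are hypothetical, not the platform's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Field:
    name: str
    dtype: str           # e.g. "string", "long", "timestamp"
    required: bool = False

@dataclass
class Schema:
    tenant_id: str       # schemas are defined per tenant
    name: str
    fields: List[Field] = field(default_factory=list)

    def validate(self, record: Dict[str, Any]) -> bool:
        # A record conforms if every required field is present
        # and it contains no keys outside the schema.
        known = {f.name for f in self.fields}
        if any(f.required and f.name not in record for f in self.fields):
            return False
        return all(k in known for k in record)

@dataclass
class Behavior:
    event_name: str      # e.g. "page_view", "order_paid"
    payload: Schema      # structure of the event's properties
```

Because each tenant defines its own Schema objects, two tenants can ingest structurally different customer records through the same pipeline.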

Identity Service

Implemented on cloud‑native Nebula Graph with multi‑tenant isolation, real‑time read/write, and custom identity weighting.
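The core of graph-based ID mapping is merging channel identifiers (cookie, email, phone, …) that are observed together into one unified profile. A minimal in-memory stand-in for that merge logic is a tenant-namespaced union-find; the real service runs on Nebula Graph with weighted edges, so this class and its method names are purely illustrative:

```python
class IdentityService:
    """Toy stand-in for the graph-backed ID mapping service.
    Production uses Nebula Graph; this only shows the merge semantics."""

    def __init__(self):
        self.parent = {}

    def _find(self, key):
        # Union-find root lookup with path halving.
        self.parent.setdefault(key, key)
        while self.parent[key] != key:
            self.parent[key] = self.parent[self.parent[key]]
            key = self.parent[key]
        return key

    def link(self, tenant, id_a, id_b):
        # IDs are namespaced by tenant, so tenants stay isolated.
        a = self._find((tenant, id_a))
        b = self._find((tenant, id_b))
        if a != b:
            self.parent[b] = a

    def unified_id(self, tenant, raw_id):
        # All identifiers of one customer resolve to the same root.
        return self._find((tenant, raw_id))
```

Linking `cookie → email → phone` within one tenant collapses all three into a single unified ID, while the same cookie value under another tenant remains a separate identity.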

Real‑time Compute

Ingested data via Kafka, processed with a stateless Entrance Job, then distributed to tenant‑specific jobs for enrichment, persistence, and downstream delivery.
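The key property of the Entrance Job is that it is stateless: it only parses each incoming message and fans it out by tenant, so all tenant-specific state lives in downstream jobs. A simplified sketch of that fan-out step (plain Python in place of the actual Flink/Beam pipeline; field names are assumed):

```python
import json
from collections import defaultdict

def entrance_job(raw_messages):
    """Stateless fan-out: parse each Kafka-style JSON message and
    route it to its tenant's queue. Downstream tenant jobs then
    handle enrichment, persistence, and delivery."""
    tenant_queues = defaultdict(list)
    for raw in raw_messages:
        event = json.loads(raw)
        tenant_queues[event["tenant_id"]].append(event)
    return tenant_queues
```

Keeping this stage stateless means it can be scaled horizontally without coordination, while per-tenant jobs isolate heavy or misbehaving tenants from each other.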

Storage Layer

Hot data stored in Kudu, warm/cold data in Parquet, with unified views for transparent queries.
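One way to picture the "unified view" is a query router that splits a time range at the hot/cold boundary, reads each side from the matching store, and merges the results so callers never see the tiering. The sketch below assumes a 30-day hot window and list-backed stores purely for illustration:

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # assumed hot/cold boundary

def query_profiles(hot_store, cold_store, since, now=None):
    """Unified view over a hot (Kudu-like) and a cold (Parquet-like)
    store: split the requested range at the boundary, read each
    partition, and merge -- the tiering is transparent to callers."""
    now = now or datetime.utcnow()
    boundary = now - HOT_WINDOW
    rows = []
    if since < boundary:
        rows += [r for r in cold_store if since <= r["ts"] < boundary]
    rows += [r for r in hot_store if r["ts"] >= max(since, boundary)]
    return sorted(rows, key=lambda r: r["ts"])
```

A query entirely inside the hot window never touches cold storage, which keeps recent-data reads fast while historical scans still work through the same call.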

Rule Engine

Supports AND/OR rule trees and both windowed and non‑windowed behavior checks, leveraging Flink state management for high‑throughput segmentation.
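A segmentation rule of this shape can be sketched as a recursive walk over an AND/OR tree whose leaves are either profile-attribute checks or windowed behavior counts. The dict-based rule encoding below is an assumption for illustration; the production engine keeps the windowed counts in Flink state rather than rescanning events:

```python
from datetime import datetime, timedelta

def evaluate(rule, profile, events, now):
    """Recursively evaluate an AND/OR rule tree. Leaves are either
    attribute checks or windowed behavior-count checks."""
    kind = rule["type"]
    if kind == "and":
        return all(evaluate(r, profile, events, now) for r in rule["rules"])
    if kind == "or":
        return any(evaluate(r, profile, events, now) for r in rule["rules"])
    if kind == "attr":                 # e.g. city == "Beijing"
        return profile.get(rule["field"]) == rule["value"]
    if kind == "behavior":             # e.g. >= 3 "order_paid" in 7 days
        window = timedelta(days=rule["days"])
        hits = sum(1 for e in events
                   if e["name"] == rule["event"] and now - e["ts"] <= window)
        return hits >= rule["min_count"]
    raise ValueError(f"unknown rule type: {kind}")
```

For example, "customers in Beijing who paid at least 3 orders in the last 7 days" is an AND node over one attribute leaf and one behavior leaf.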

4.4 Extensions

Dynamic cluster scaling (master, core, task, client nodes).

Full‑chain monitoring with SkyWalking, Prometheus, Grafana.

5. Outcomes

5.1 Data Assetization

Achieved multi‑party, digital, secure, and intelligent customer data management.

5.2 Business Enablement

Provided flexible schema APIs, industry‑specific models, and served thousands of enterprises across dozens of sectors.

5.3 Technical Excellence

The Identity Service sustains QPS in the hundreds of thousands.

The streaming pipeline processes hundreds of thousands of TPS with millisecond‑level latency.

Elastic, cloud‑native architecture with high availability (>99.99%).

6. Future Outlook

More industry‑specific middle‑platform capabilities.

Enrich AI models for scoring and prediction.

Intelligent governance and K8s‑native Flink orchestration.

Lake‑house integration with Iceberg/Hudi.

Tags: architecture · Big Data · Streaming · real-time data · data integration · CDP
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
