How Baidu iFanFan Built a Real-Time Big Data Platform: Challenges & Lessons

Facing rapid business iteration, Baidu’s iFanFan data team designed a unified real‑time and offline big‑data platform, tackling business, technical, and organizational challenges through Lambda/Kappa architectures, data integration, storage, computation, governance, and scalable analytics to deliver timely, accurate, and valuable data products.

21CTO

Nov 8, 2021

Guide: This article narrates the journey of Baidu iFanFan’s data team in building a real‑time and offline big‑data platform to better empower the business, efficiently delivering valuable data products and services, and discusses the challenges in business, technology, and organization along with practical solutions.

1. Preface

iFanFan is a one‑stop intelligent marketing and sales accelerator that handles lead data from various internal promotion platforms (search, feed, Baijiu self‑built sites, etc.) and external advertising platforms, providing capabilities such as lead ingestion, control, follow‑up, and conversion. The platform aggregates business data, user data, and event data to form the core of a big‑data analytics system.

Continuously delivering timely, accurate, and stable data analysis services in an agile iteration environment is a long‑term goal for the data team. Therefore, a high‑level architectural design is crucial.

1.1 Glossary

Watt : Data flow open platform for integrating MySQL, BaikalDB binlog logs, supporting multi‑table joins, UDF extensions, etc.

Fengge Platform : A data asset center and governance middle‑platform built by the commercial platform R&D department, managing core data and scenarios for the business, leveraging warehouse engine, resource scheduling, and data transmission capabilities.

Bigpipe (BP) : Distributed data transmission system that supports real‑time transmission of messages, commands, and log data, enabling decoupled module communication and unified traffic and operations.

AFS : Large‑scale file storage system similar to open‑source HDFS.

Palo : An MPP data warehouse built on Apache Doris (Baidu's analytical database engine), supporting high-concurrency, low-latency queries over petabyte-scale datasets for online real-time analytics.

2. Challenges and Pain Points

2.1 Business

Customers need clear, multi-view, multi-dimensional metrics on the lead creation, allocation, follow-up, and conversion funnel, as well as on staff follow-up status. The statistics behind those metrics have to meet three requirements.

Value: Providing valuable data products is a key consideration during requirement review.

Timeliness: Customers expect near-real-time (second-level) display of lead assignments and follow-up details.

Richness: Simple count and sum metrics are easy, but delivering guidance-oriented analytics for lead management and marketing activities is challenging.

2.2 Technical

Business data, user behavior events, and internal/external data need systematic ingestion, unified management, processing, and output.

1) Rapid business growth leads to large data volumes and sharding of MySQL databases, making OLAP analysis on lead-related core tables impossible directly from the source.

2) Metadata is scattered across code, making unified management difficult; migrations or password changes affect data extraction.

3) Data duplication across multiple stores increases maintenance costs.

4) Limited R&D personnel must balance operations, internal support, and product development.

5) Hundreds of extraction, transformation, and loading tasks run online, raising stability concerns.

2.3 Organizational

iFanFan’s business unit divides product‑research‑test, market, commercialization, and customer success into about 15 agile squads, each with clear business goals. The data team must support these squads agilely.

1) Each squad's OKRs, and its monitoring of customer growth, business growth, and operations, require a rapid response from the data team.

2) One-off data extraction requests are common but offer low ROI.

3) iFanFan's core indicators need consistent metric definitions, which calls for unified management and platform-based data services.

3. Practice and Experience Sharing

Before the architecture described below, the team relied on point‑solution engineering, which could not comprehensively address data application needs. The team later adopted a systematic approach, leveraging Baidu’s internal private cloud and Baidu Intelligent Cloud platforms for big‑data components.

3.1 Data Architecture

3.1.1 What Is Data Architecture

Google's early-2000s troika of papers, GFS (distributed file system), MapReduce (large-scale data processing), and BigTable (distributed structured storage), inspired two mainstream data architectures: Lambda and Kappa. A Unified approach also exists but is less widely applicable.

3.1.1.1 Lambda Architecture

The Lambda architecture combines batch and stream processing to balance latency, throughput, and fault tolerance. Batch processing provides stable, aggregated views, while real‑time layers offer online analysis. The batch and speed layers are merged before serving.
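The merge step can be illustrated with a minimal sketch. It assumes a simple per-key count as the metric; the event names, the `ts` field, and the cutoff value are hypothetical and do not come from iFanFan's pipeline.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer: recompute counts from all events up to the cutoff."""
    return Counter(e["key"] for e in events if e["ts"] <= cutoff)

def speed_view(events, cutoff):
    """Speed layer: count only events newer than the last batch cutoff."""
    return Counter(e["key"] for e in events if e["ts"] > cutoff)

def serve(batch, speed):
    """Serving layer: merge the two views at query time."""
    return dict(batch + speed)

# Hypothetical events; ts is a logical timestamp.
events = [
    {"key": "lead_created", "ts": 1},
    {"key": "lead_created", "ts": 2},
    {"key": "lead_assigned", "ts": 2},
    {"key": "lead_created", "ts": 5},
]
CUTOFF = 2  # the batch job has processed everything up to ts == 2
merged = serve(batch_view(events, CUTOFF), speed_view(events, CUTOFF))
```

The batch view is stable but stale after the cutoff; the speed view covers only the gap, so the merged result stays both complete and fresh.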

3.1.1.2 Kappa Architecture

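Proposed by Jay Kreps, Kappa keeps only the stream-processing path: all data is treated as a stream, and a view is rebuilt by replaying the log when the processing logic changes. It simplifies Lambda rather than replacing it outright, and fits only when the use case can be served entirely by streaming.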
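A rough sketch of the Kappa idea follows: a single append-only log is the source of truth, and any view is derived by replaying it through a processing function. The log contents and both processing functions are hypothetical examples, not the team's code.

```python
def build_view(log, apply, from_offset=0):
    """Derive a view by replaying the log through one processing function."""
    state = {}
    for event in log[from_offset:]:
        apply(state, event)
    return state

def count_by_status_v1(state, event):
    """Version 1 of the logic: count events per lead status."""
    state[event["status"]] = state.get(event["status"], 0) + 1

def count_by_owner_v2(state, event):
    """Version 2 of the logic: count events per owner instead."""
    state[event["owner"]] = state.get(event["owner"], 0) + 1

# Hypothetical append-only log of lead changes.
log = [
    {"lead_id": 1, "status": "new", "owner": "alice"},
    {"lead_id": 2, "status": "new", "owner": "bob"},
    {"lead_id": 1, "status": "followed_up", "owner": "alice"},
]

view_v1 = build_view(log, count_by_status_v1)
# When the logic changes, replay the same log with the new function
# instead of maintaining a separate batch code path.
view_v2 = build_view(log, count_by_owner_v2)
```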

3.1.2 Architecture Selection

3.1.2.1 Comprehensive System Review

Data forms include private and public domain user behavior events, user attributes, lead management, IM communication, marketing activities, account management, and more.

Data Integration: The team does not generate data but must ingest data from various business lines, terminals, and internal/external channels into an OLAP system.

Data Storage: Offline T+1 data is stored in AFS; high‑timeliness data resides in an MPP analytical database.

Data Computation: Real‑time uses Spark Streaming or Flink; offline uses MapReduce or SparkSQL.

Data Governance & Monitoring: Includes platform stability, metadata management, lineage, scheduling, source management, and exception handling.

Data Development: Considers manpower, data reuse, operation standards, and modeling from business to logical to physical layers.

Data Business Scenarios: Online analysis, user activity statistics, ID linking for precise marketing, and ad‑hoc queries via OpenAPI.

3.1.2.2 Fast and Slow in Parallel

Given iFanFan’s data volume and business needs, the team decided on a dual‑track approach: a “quick‑win” path for urgent customer needs and a long‑term systematic data architecture.

In September 2020, the sales domain split tables, requiring migration of multiple systems into a unified database. After several technical reviews, the team chose the Watt platform for data extraction because it supports sharding, high‑timeliness binlog, load balancing, multi‑table joins, rich UDFs, and dedicated operations support.

Three developers spent over two months implementing version 1.0. It met the urgent needs at first but suffered from stability issues, which led to frequent complaints and eventually to night-time phone alerts for the engineers on call.

By January 2021, version 1.0 had stabilized. Its data flow, built around CDC (change data capture) files, resembled a variant of the Kappa architecture. The team then researched industry best practices.

3.2 Business Demands and Architecture Evolution

3.2.1 Pursuing Timeliness

Customer feedback indicated severe latency (up to 18 minutes) in lead analysis. The team implemented three measures:

Move Spark Streaming jobs to an isolated cluster to avoid resource contention.

Deploy Bigpipe’s cross‑region disaster recovery, using the Suzhou data center as primary and Beijing as backup with data compensation.

Leverage Watt’s multi‑binlog join capability to pre‑process complex calculations, reducing real‑time load.

These actions reduced OLAP query latency to 10‑15 seconds.
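The third measure, joining several binlog streams before the data reaches the real-time job, can be sketched generically as follows. Watt's actual interface is not described in this article, so the change-record shape, table names, and column names below are all assumptions made for illustration.

```python
def join_binlogs(changes):
    """Consume row changes from two tables and emit one widened row per update.

    Keeps the latest row per lead_id for each table and emits a joined
    record once the lead side is known.
    """
    leads, follow_ups = {}, {}
    for change in changes:
        row = change["row"]
        if change["table"] == "lead":
            leads[row["lead_id"]] = row
        elif change["table"] == "follow_up":
            follow_ups[row["lead_id"]] = row
        else:
            continue  # ignore tables that are not part of the join
        lead_id = row["lead_id"]
        if lead_id in leads:
            yield {
                "lead_id": lead_id,
                "source": leads[lead_id]["source"],
                "last_follow_up": follow_ups.get(lead_id, {}).get("note"),
            }

# Hypothetical change records from two binlog streams.
sample_changes = [
    {"table": "lead", "row": {"lead_id": 1, "source": "search"}},
    {"table": "follow_up", "row": {"lead_id": 1, "note": "called"}},
    {"table": "follow_up", "row": {"lead_id": 2, "note": "emailed"}},
]
widened = list(join_binlogs(sample_changes))
```

Because the join happens upstream, the downstream streaming job receives rows that are already widened and only has to aggregate them.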

3.2.2 BI Scenario Requirements

Marketing, operations, commercial sales, and customer success teams generate abundant ad‑hoc data requests. The data team productized common needs.

Periodic data needs are delivered via scheduled email tasks.

Ad‑hoc queries are supported through self‑service platforms.

3.2.3 Public Data Warehouse

By March 2021, many data‑driven scenarios (unified data sources, reusable data models, self‑service platforms) were still unsupported. Integrating Watt with the Fengge platform enabled seamless data transfer, reducing development effort and supporting ad‑hoc, data modeling, and governance needs.

After a one‑and‑a‑half month migration to Fengge, the 2.0 version supported ad‑hoc queries, layered data warehouse management, metadata, lineage, and monitoring.

With the 2.0 architecture, ordinary developers can, after brief training, build customized monthly, quarterly, and yearly OKR reports, freeing data R&D resources.

3.3 Data Warehouse Modeling Process

Using Fengge, the team applied Kimball dimensional modeling: define conformed dimensions and business processes, produce a bus matrix, then determine facts and grain for each subject.
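A bus matrix is simply a grid of business processes against dimensions. The sketch below uses the four lead processes named earlier in this article; the dimension names are illustrative assumptions, not the team's actual model.

```python
# Rows: business processes. Values: the dimensions each process uses.
# Dimension names are hypothetical.
BUS_MATRIX = {
    "lead_creation": {"date", "customer", "channel", "campaign"},
    "lead_allocation": {"date", "customer", "staff"},
    "lead_follow_up": {"date", "customer", "staff"},
    "lead_conversion": {"date", "customer", "channel", "staff"},
}

def conformed_dimensions(matrix):
    """Return dimensions shared by more than one business process."""
    counts = {}
    for dims in matrix.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n > 1)

shared = conformed_dimensions(BUS_MATRIX)
```

Dimensions that appear in several rows are the ones worth defining once and reusing, which is what makes metrics comparable across subjects.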

3.3.1 Layered ETL

The warehouse is layered to avoid siloed "chimney" pipelines in which each report builds its own end-to-end flow. Layering reduces resource cost and task sprawl, and improves query performance and usability.

3.3.2 Model Selection

The team primarily used star schemas, exemplified by lead‑follow‑up facts and dimensions, progressing from logical to physical models.
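As an illustration of what such a physical model might look like, the sketch below builds a small star schema in an in-memory SQLite database. The table names, columns, grain (one row per follow-up action), and sample rows are assumptions for demonstration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE dim_staff (staff_id INTEGER PRIMARY KEY, staff_name TEXT, team TEXT);
CREATE TABLE dim_channel (channel_id INTEGER PRIMARY KEY, channel_name TEXT);

-- Fact table: one row per follow-up action (the grain), keyed to dimensions.
CREATE TABLE fact_lead_follow_up (
    follow_up_id INTEGER PRIMARY KEY,
    lead_id INTEGER,
    staff_id INTEGER REFERENCES dim_staff(staff_id),
    channel_id INTEGER REFERENCES dim_channel(channel_id),
    follow_up_date TEXT,
    converted INTEGER
);

INSERT INTO dim_staff VALUES (1, 'alice', 'east'), (2, 'bob', 'west');
INSERT INTO dim_channel VALUES (1, 'search'), (2, 'feed');
INSERT INTO fact_lead_follow_up VALUES
    (1, 101, 1, 1, '2021-03-01', 1),
    (2, 102, 1, 2, '2021-03-01', 0),
    (3, 103, 2, 1, '2021-03-02', 1);
""")

# A typical star-schema query: join the fact to one dimension and aggregate.
rows = conn.execute("""
    SELECT s.staff_name, COUNT(*) AS follow_ups, SUM(f.converted) AS conversions
    FROM fact_lead_follow_up AS f
    JOIN dim_staff AS s ON s.staff_id = f.staff_id
    GROUP BY s.staff_name
    ORDER BY s.staff_name
""").fetchall()
```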

3.4 Data Governance

Data governance covers the entire data lifecycle, including acquisition, storage, cleaning, transformation, metadata, standards, quality, security, development, value, and services.

3.4.1 Data Asset Governance

Establishes standards, permissions, and data sharing to treat data as a valuable organizational asset.

3.4.1.1 Topic Management

Classifies data into subjects, enabling users to quickly locate needed data.

3.4.1.2 Metadata and Lineage

Shows data ownership, source, and relationships, providing traceability.
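Lineage is essentially a dependency graph over tables. A minimal sketch of upstream tracing is shown below; the table names and layer prefixes are hypothetical, and Fengge's real lineage model is not described in this article.

```python
# table -> tables it is directly derived from (hypothetical names)
LINEAGE = {
    "ads_lead_funnel_report": ["dws_lead_daily"],
    "dws_lead_daily": ["dwd_lead", "dwd_follow_up"],
    "dwd_lead": ["ods_lead_binlog"],
    "dwd_follow_up": ["ods_follow_up_binlog"],
}

def upstream(table, lineage):
    """Return every table that the given table depends on, directly or not."""
    seen, stack = set(), [table]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

sources = upstream("ads_lead_funnel_report", LINEAGE)
```

Running the same traversal over the reversed graph gives the impact analysis: which reports break if a source table changes.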

3.4.1.3 Permission Control and Self‑Service

Once permissions are granted, users can query and download data through the platform and build ad-hoc queries by drag and drop.

3.4.2 Data Quality Governance

Post‑architecture upgrade, the team focused on daily incremental data diff monitoring, anomaly handling, cluster stability, network/component jitter, and data loss compensation.
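One simple form of daily incremental diff monitoring compares row counts between the source and the warehouse. The sketch below assumes the counts have already been collected; the table names, numbers, and tolerance are hypothetical.

```python
def diff_counts(source_counts, warehouse_counts, tolerance=0.0):
    """Return (table, source, warehouse) for tables whose daily counts drift.

    Drift is the absolute difference relative to the source count.
    """
    alerts = []
    for table in sorted(set(source_counts) | set(warehouse_counts)):
        src = source_counts.get(table, 0)
        dst = warehouse_counts.get(table, 0)
        drift = abs(src - dst) / max(src, 1)
        if drift > tolerance:
            alerts.append((table, src, dst))
    return alerts

# Hypothetical daily incremental row counts.
source = {"lead": 1000, "follow_up": 500}
warehouse = {"lead": 1000, "follow_up": 480}
alerts = diff_counts(source, warehouse, tolerance=0.01)
```

An alert from a check like this is what would trigger the compensation step, for example re-extracting the affected day.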

3.5 Easy Scalability

3.5.1 Marketing Effect Analysis

Private-domain marketing relies on CDP data stored in Impala and Kudu. Because that stack offered limited concurrency and high query latency, the team evaluated MPP solutions and ultimately migrated the analysis workload to Palo (Doris).

3.5.2 Real‑Time Capability Enhancement

Building on the 2.0 real-time pipeline, the team ran a proof of concept and stress tests that integrated Kafka, Flink, and Palo, loading data through Stream Load (HTTP) and Routine Load (persistent tasks) to achieve sub-2-second latency.
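For orientation, here is a sketch of how a Stream Load request could be assembled, following the HTTP interface documented for Apache Doris (a PUT to `/api/{db}/{table}/_stream_load` on the frontend with `label`, `format`, and `strip_outer_array` headers). Whether the team's Palo deployment uses these exact settings is an assumption, and the host, database, table, and credentials are placeholders. The function only builds the request; sending it is left to an HTTP client that handles `Expect: 100-continue` and the frontend's redirect to a backend node.

```python
import base64
import json

def build_stream_load_request(fe_host, db, table, rows, user, password,
                              label, http_port=8030):
    """Build (method, url, headers, body) for a Doris-style Stream Load."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "label": label,               # a unique label lets a retried batch be deduplicated
        "format": "json",
        "strip_outer_array": "true",  # body is a JSON array of row objects
    }
    body = json.dumps(rows).encode("utf-8")
    return "PUT", url, headers, body

# Placeholder values for illustration only.
method, url, headers, body = build_stream_load_request(
    fe_host="palo-fe.example.internal",
    db="crm",
    table="lead_events",
    rows=[{"lead_id": 1, "status": "new"}],
    user="etl",
    password="secret",
    label="lead_events_20210301_0001",
)
```

Stream Load suits micro-batches pushed by a job such as a Flink sink, while Routine Load is configured once inside the database and keeps pulling from Kafka on its own.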

3.5.2.1 Palo Principles

Palo uses an LSM-tree for writes: incoming data is written out as sorted files, and background compaction merge-sorts those files on disk. It supports wide tables, multi-table joins, and cost-based optimization (CBO) for complex analytics.
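The write path of an LSM-tree can be shown with a toy in-memory model. This is a teaching sketch of the general technique, not Palo's storage engine; the class name, the memtable limit, and the linear scan in `get` are simplifications.

```python
import heapq

class ToyLSM:
    """Toy LSM-tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # list of sorted (key, value) lists, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        """Writes go to memory first; a full memtable is flushed as a sorted run."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Read the memtable first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:  # a real engine would use an index or binary search
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge-sort all runs into one, keeping the newest value per key."""
        tagged = [[(k, -age, v) for k, v in run]
                  for age, run in enumerate(self.runs)]
        merged = {}
        for k, _, v in heapq.merge(*tagged):
            merged.setdefault(k, v)  # the newest run sorts first for equal keys
        self.runs = [list(merged.items())] if merged else []
```

Writes stay cheap because they never modify existing files; the cost is paid later by compaction, which also keeps the number of runs a read has to check small.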

3.5.3 Building Metric System

For private‑domain marketing (B‑to‑B‑to‑C), the product team defines multi‑dimensional metrics (e.g., acquisition, operation, nurturing) and builds visual reports and real‑time services. Core product usage metrics such as DAU, MAU, PV, UV are also tracked to guide feature optimization.
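The usage metrics mentioned here have simple definitions that are worth pinning down in code. The sketch below assumes PV counts page-view events, UV counts distinct users with a page view, and DAU and MAU count distinct users with any event in the day or month; the event shape and sample data are hypothetical.

```python
from datetime import date

def pv(events, day):
    """Page views: number of page_view events on the day."""
    return sum(1 for e in events if e["day"] == day and e["type"] == "page_view")

def uv(events, day):
    """Unique visitors: distinct users with at least one page view on the day."""
    return len({e["user"] for e in events
                if e["day"] == day and e["type"] == "page_view"})

def dau(events, day):
    """Daily active users: distinct users with any event on the day."""
    return len({e["user"] for e in events if e["day"] == day})

def mau(events, year, month):
    """Monthly active users: distinct users with any event in the month."""
    return len({e["user"] for e in events
                if (e["day"].year, e["day"].month) == (year, month)})

# Hypothetical event log.
events = [
    {"user": "u1", "day": date(2021, 3, 1), "type": "page_view"},
    {"user": "u1", "day": date(2021, 3, 1), "type": "page_view"},
    {"user": "u2", "day": date(2021, 3, 1), "type": "action"},
    {"user": "u3", "day": date(2021, 3, 2), "type": "page_view"},
    {"user": "u1", "day": date(2021, 4, 1), "type": "page_view"},
]
```

Fixing definitions like these in one shared place is what keeps the same metric from being computed differently by different squads.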

3.6 Full‑View of the Data Analysis System

3.7 Benefits of the Data Analysis System

Business Benefits

Data product managers and analysts work with customers to model core processes, define consistent dimensions, and compute rich, multi‑view metrics, enabling proactive alerts, guided actions, and data‑driven growth.

Technical Benefits

The architecture resolves previous issues: untimely data, inaccurate results, sharding limitations, scattered data, backlog of ad‑hoc requests, and inconsistent logic.

Organizational Benefits

Fengge’s visual development tools allow ordinary developers to create customized OKR reports after brief training, freeing data R&D resources. Self‑service queries and data services improve efficiency and clarify responsibilities.

4. Conclusion and Outlook

Integrate real‑time analysis with CDP and ID‑mapping to achieve fine‑grained operations.

Explore lake‑warehouse integration on private or Baidu Intelligent Cloud.

Design platform‑based data processing to improve developer productivity.

Simplify data pipelines to boost efficiency and timeliness.

Introduce middle‑platform concepts, data standards, health scoring, and reuse to achieve cost reduction and efficiency gains.


