
Challenges and Approaches for Real‑Time Data Aggregation Analysis

The article examines the key challenges of real‑time data aggregation—data freshness, timely processing, and result visibility—and surveys common solutions such as timestamp‑based sync, CDC, full and incremental computation, storage formats, and trigger mechanisms.


Author: Data Monkey, Ctrip Data Analysis Director, focusing on distributed data storage and real‑time data analysis.

Real‑time data analysis is a hot topic: scenarios such as financial risk control, operational monitoring and alerting, and AI model consumption increasingly require up‑to‑date aggregated results.

In real‑time contexts the dominant constraint is time, which leads to three sub‑problems: how to observe changed data, how to process those changes efficiently and merge them into existing aggregates, and how to deliver the updated results to consumers promptly.

Data freshness and processing timeliness are fundamentally in tension in real‑time processing: the fresher the results must be, the more often they have to be recomputed.

Different scenarios tolerate different latencies; the article uses Uber's latency tiers to illustrate acceptable delay ranges.

1. Data Freshness

Data can be divided into two categories: transactional data stored in relational databases and log‑type data stored in message queues such as Kafka.

For log data, a pull‑based consumption model works well; because the data is append‑only, engines such as ClickHouse and TimescaleDB already provide good real‑time aggregation, so this case needs little further discussion.

For transactional data, two main approaches exist:

Approach 1 – Timestamp‑based synchronization: Each table carries a datachange_lasttime column; a sync program periodically scans for rows whose timestamp is newer than the last sync checkpoint and pulls them. This method cannot detect deletions unless rows are soft‑deleted with logical flags, which pushes extra complexity into application logic.
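The polling loop above can be sketched as follows. This is a minimal illustration, not the author's actual sync program: the `orders` table, its columns, and the in‑memory SQLite database are hypothetical stand‑ins.

```python
import sqlite3

def poll_changes(conn, last_sync):
    """Pull rows modified since the previous sync checkpoint."""
    rows = conn.execute(
        "SELECT id, amount, datachange_lasttime FROM orders "
        "WHERE datachange_lasttime > ? ORDER BY datachange_lasttime",
        (last_sync,),
    ).fetchall()
    # Advance the checkpoint to the newest timestamp seen.
    new_checkpoint = rows[-1][2] if rows else last_sync
    return rows, new_checkpoint

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, datachange_lasttime TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 99.0, '2024-01-01T10:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 42.0, '2024-01-01T10:05:00')")

# Only row 2 is newer than the checkpoint; a deleted row would simply
# stop appearing, which is exactly the blind spot described above.
rows, ckpt = poll_changes(conn, "2024-01-01T10:00:00")
```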

Approach 2 – CDC (Change Data Capture): Minimal intrusion and performance impact; open‑source tools such as Canal (MySQL) and Debezium (PostgreSQL) capture changes and push them to Kafka for downstream real‑time analysis.

2. Data Association

After obtaining fresh data, enrichment often requires joining with historical tables (e.g., linking a changed flight order with its segment information). Directly joining only the incremental stream can miss required historical rows, leading to a classic real‑time‑historical join problem.

One solution is to build a wide table on the database side using row‑to‑column transformation, keep it up‑to‑date via CDC, and thus provide a self‑contained source for enrichment.
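The enrichment step can be sketched as a lookup against a CDC‑maintained local copy of the historical table. The order and segment fields here are invented for illustration; the point is the handling of the miss case, where the needed historical row has not yet arrived.

```python
# order_id -> segment info, kept fresh by the CDC feed (hypothetical schema).
segments = {
    101: {"from": "SHA", "to": "PEK"},
}

def enrich(order_change, segments):
    """Widen a changed order with its segment; flag missing history."""
    seg = segments.get(order_change["order_id"])
    if seg is None:
        # Historical row not yet synced: buffer the change or side-load
        # the row, rather than silently dropping it.
        return None
    return {**order_change, **seg}

wide = enrich({"order_id": 101, "price": 850.0}, segments)
```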

3. Computation Timeliness

Balancing result accuracy and processing speed can be achieved via two strategies:

3.1 Full Computation (1 min < latency < 5 min)

Re‑process the entire dataset for the latest window, yielding the highest accuracy. Frameworks such as Apache Spark and Apache Flink are typical choices. Columnar storage formats like Parquet and ORC reduce storage and network overhead, while hybrid row/column solutions (Apache Hudi, Delta Lake) address the need for updates.
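The essence of full computation is that every trigger re‑aggregates the whole dataset from scratch. A toy version, with invented route/amount records standing in for a real Spark or Flink batch job:

```python
from collections import defaultdict

def full_recompute(records):
    """Re-aggregate the entire dataset; accurate but O(n) on every run."""
    totals = defaultdict(float)
    for r in records:
        totals[r["route"]] += r["amount"]
    return dict(totals)

dataset = [
    {"route": "SHA-PEK", "amount": 100.0},
    {"route": "SHA-PEK", "amount": 50.0},
    {"route": "PEK-CAN", "amount": 80.0},
]
totals = full_recompute(dataset)  # rerun in full on every trigger
```

The cost of rereading everything is what makes columnar formats and update‑capable table layouts (Hudi, Delta Lake) so valuable at this latency tier.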

3.2 Incremental Computation

When only a small fraction of records change, recompute only the affected aggregates. Three cases are identified: (1) new aggregates added without affecting existing ones, (2) new aggregates added and partially invalidating existing results, (3) all existing results become invalid, requiring full recomputation. Cases 1 and 2 benefit real‑time latency; case 3 falls back to full computation.
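Cases 1 and 2 can be handled by folding deltas into the existing aggregates instead of rescanning. A minimal sketch, assuming each change event carries the old and new amount so the delta is self‑contained:

```python
def incremental_update(totals, changes):
    """Fold only the changed rows into existing aggregates.

    New rows (case 1) arrive with old=0; updates (case 2) retract the
    old value and add the new one. Case 3 is not handled here: it
    requires a full recomputation.
    """
    for c in changes:
        totals[c["route"]] = totals.get(c["route"], 0.0) - c["old"] + c["new"]
    return totals

totals = {"SHA-PEK": 150.0}
totals = incremental_update(totals, [
    {"route": "SHA-PEK", "old": 50.0, "new": 70.0},  # case 2: amended order
    {"route": "PEK-CAN", "old": 0.0, "new": 80.0},   # case 1: brand-new route
])
```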

Beyond Flink, the Naiad model (Microsoft) offers a highly efficient incremental processing engine, influencing later systems such as TensorFlow.

4. Computation Trigger Mechanism

Two common trigger mechanisms determine when the aggregation runs:

- Timed triggers: fire at fixed intervals, amortising computation cost at the price of results that lag by up to one interval.

- Per‑element triggers: fire on every newly arrived element, giving the freshest results at the highest computation cost.

(The original article includes a figure comparing the computation cost of the two mechanisms.)
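The two trigger styles can be sketched as small policy objects; the interval value and the injected clock are illustrative choices, not anything prescribed by the article.

```python
import time

class TimedTrigger:
    """Fire at most once per interval, batching elements in between."""
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.last_fire = 0.0
        self.buffer = []

    def offer(self, element, now=None):
        # `now` is injectable for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        self.buffer.append(element)
        if now - self.last_fire >= self.interval_s:
            batch, self.buffer = self.buffer, []
            self.last_fire = now
            return batch   # caller runs the aggregation once per batch
        return None        # cost amortised; results lag up to interval_s

class PerElementTrigger:
    """Fire on every element: freshest results, highest compute cost."""
    def offer(self, element, now=None):
        return [element]
```

With a 5‑second `TimedTrigger`, elements arriving 2 seconds apart are coalesced into one batch, whereas `PerElementTrigger` would recompute three times.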

5. Real‑Time Visibility of Aggregation Results

Result storage must support upsert semantics, allow consumers to detect updates instantly, and scale horizontally. NoSQL databases such as MongoDB are recommended for this purpose.
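One way to combine upsert semantics with instant update detection is to version every write, so consumers poll only for versions newer than the last one they saw. The sketch below is a toy in‑memory stand‑in; a production system would use something like MongoDB's `update_one(..., upsert=True)` instead.

```python
import itertools

class ResultStore:
    """Minimal upsert store: each write bumps a global version so
    consumers can detect updates by asking for versions newer than
    the last one they have seen."""
    def __init__(self):
        self._docs = {}
        self._version = itertools.count(1)

    def upsert(self, key, value):
        # Insert-or-overwrite: the same key keeps exactly one document.
        self._docs[key] = {"value": value, "version": next(self._version)}

    def changed_since(self, version):
        return {k: d for k, d in self._docs.items() if d["version"] > version}

store = ResultStore()
store.upsert("SHA-PEK", 150.0)
store.upsert("SHA-PEK", 170.0)  # overwrite in place (upsert semantics)
fresh = store.changed_since(1)   # only the rewritten aggregate shows up
```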

6. Conclusion

This article attempts to outline the problems and common approaches in real‑time data aggregation analysis; readers are encouraged to provide feedback on any omissions or inaccuracies.

Recommended Reading

CrateDB in Ctrip Flight BI Practice

Common Methods and Thoughts on Time‑Series Forecasting

How Ctrip Uses ARIMA for Business Volume Prediction

Performance Boost 400%: ClickHouse in Ctrip Hotel Data Warehouse
