Nov 18, 2024 · Big Data

Optimizing Multi-Dimensional User Count Statistics in Big Data Computing: A Data Tagging Approach

By replacing exponential row expansion with a data‑tagging strategy that encodes dimension combinations and aggregates at the user level, the authors cut Baidu Feed’s multi‑dimensional user‑count computation time from 49 to 14 minutes and shuffle size from 16 TB to 800 GB, enabling scalable analysis across dozens of dimensions for billions of daily users.

Big Data OptimizationHive SQLdata tagging

0 likes · 12 min read

Optimizing Multi-Dimensional User Count Statistics in Big Data Computing: A Data Tagging Approach

Big Data Technology & Architecture

Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big DataCDCCheckpoint

0 likes · 8 min read

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store