Tagged articles
2 articles
Page 1 of 1
Baidu Geek Talk
Baidu Geek Talk
Nov 18, 2024 · Big Data

Optimizing Multi-Dimensional User Count Statistics in Big Data Computing: A Data Tagging Approach

By replacing exponential row expansion with a data‑tagging strategy that encodes dimension combinations and aggregates at the user level, the authors cut Baidu Feed’s multi‑dimensional user‑count computation time from 49 to 14 minutes and shuffle size from 16 TB to 800 GB, enabling scalable analysis across dozens of dimensions for billions of daily users.

Big Data OptimizationHive SQLdata tagging
0 likes · 12 min read
Optimizing Multi-Dimensional User Count Statistics in Big Data Computing: A Data Tagging Approach
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big DataCDCCheckpoint
0 likes · 8 min read
Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store