Tagged articles
15 articles
Page 1 of 1
DataFunSummit
DataFunSummit
Sep 26, 2024 · Big Data

Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

Apache HudiBig DataChange Data Capture
0 likes · 9 min read
Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Jun 27, 2023 · Big Data

How MaxCompute’s Lakehouse Architecture Enables Near‑Real‑Time Incremental Processing

This article details Alibaba Cloud MaxCompute’s lakehouse evolution, describing its unified storage‑metadata‑compute design, the Transactional Table 2.0 format, near‑real‑time incremental ingestion, clustering and compaction services, transaction handling, TimeTravel and incremental queries, and future roadmap for big‑data workloads.

Big DataIncremental ProcessingLakehouse
0 likes · 23 min read
How MaxCompute’s Lakehouse Architecture Enables Near‑Real‑Time Incremental Processing
DataFunTalk
DataFunTalk
Jun 24, 2023 · Big Data

Design and Architecture of MaxCompute Lakehouse Near‑Real‑Time Incremental Processing

This article explains the evolution of Alibaba Cloud's MaxCompute platform into a lakehouse architecture that supports near‑real‑time incremental processing, detailing its development history, core design of transactional tables, five‑module technical stack, data ingestion methods, optimization services, transaction management, query capabilities, ecosystem integration, practical applications, future roadmap, and common user questions.

Big DataData LakeIncremental Processing
0 likes · 24 min read
Design and Architecture of MaxCompute Lakehouse Near‑Real‑Time Incremental Processing
Bilibili Tech
Bilibili Tech
Apr 4, 2023 · Big Data

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

This article details Bilibili’s migration from a Spark‑based offline ODS‑to‑DWD sharding process to a Flink real‑time incremental pipeline, explaining the background challenges, the design of multi‑level partitioning, small‑file optimizations, stability enhancements, and the measurable performance gains achieved.

Big DataFlinkIncremental Processing
0 likes · 19 min read
How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency
Shopee Tech Team
Shopee Tech Team
Sep 2, 2022 · Big Data

Shopee Data System Challenges and Apache Hudi Practices

Shopee tackled its data‑system bottlenecks by customizing Apache Hudi to provide unified stream‑batch integration, efficient state‑detail snapshots, and low‑latency wide‑table generation, using CDC‑based bootstrapping, COW/MOR tables, savepoints and partial updates, which cut latency to ten minutes, lowered resource use, and yielded several community‑backed enhancements.

Apache HudiBig DataData Integration
0 likes · 18 min read
Shopee Data System Challenges and Apache Hudi Practices
Big Data Technology Architecture
Big Data Technology Architecture
Aug 23, 2022 · Big Data

Comparative Analysis of Apache Hudi, Delta Lake, and Apache Iceberg for Lakehouse Architectures

This article examines the technical differences and feature sets of Apache Hudi, Delta Lake, and Apache Iceberg, highlighting incremental pipelines, concurrency control, merge‑on‑read storage, partition evolution, multi‑mode indexing, and real‑world use cases to help practitioners choose the most suitable lakehouse solution for their workloads.

Apache HudiApache IcebergConcurrency Control
0 likes · 18 min read
Comparative Analysis of Apache Hudi, Delta Lake, and Apache Iceberg for Lakehouse Architectures
Big Data Technology & Architecture
Big Data Technology & Architecture
May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache HudiBig DataCopy-on-Write
0 likes · 43 min read
Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management
DataFunTalk
DataFunTalk
May 14, 2021 · Big Data

Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili

This article presents a technical deep‑dive into Bilibili’s evolution from offline to real‑time data processing, describing the challenges of timeliness, ETL, AI feature engineering, and the design of a Flink‑on‑YARN incremental pipeline that supports trillion‑scale message throughput and AI‑driven real‑time applications.

Big DataFlinkIncremental Processing
0 likes · 27 min read
Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili
Big Data Technology & Architecture
Big Data Technology & Architecture
Sep 2, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Apache HudiBig DataData Lake
0 likes · 12 min read
An Overview of Apache Hudi: Architecture, Features, and Query Types
Big Data Technology & Architecture
Big Data Technology & Architecture
Aug 23, 2020 · Big Data

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.

Apache HudiBig DataData Lake
0 likes · 18 min read
Apache Hudi Overview, Core Concepts, and Quick‑Start Guide
Architect
Architect
May 12, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Concepts, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark and Hadoop‑compatible storage to provide efficient ingestion, incremental processing, and multiple query modes such as snapshot, incremental, and read‑optimized for large analytical datasets.

Apache HudiBig DataData Lake
0 likes · 11 min read
An Overview of Apache Hudi: Architecture, Concepts, and Query Types
Big Data Technology & Architecture
Big Data Technology & Architecture
Jan 23, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.

Apache HudiHadoopIncremental Processing
0 likes · 17 min read
Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop