Tagged articles

Data Lake

356 articles · Page 3 of 4

Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink

0 likes · 20 min read

Data Thinking Notes

Dec 23, 2022 · Big Data

How Real-Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explains why real‑time data warehouses are becoming essential, outlines their goals, compares them with traditional offline warehouses, and presents detailed design patterns, naming conventions, and case studies from Didi, Kuaishou, Tencent, Youzan and other enterprises, highlighting challenges and solutions for streaming, storage, and query layers.

Big Data Architecture · Data Lake · ETL

0 likes · 49 min read

Big Data Technology & Architecture

Dec 19, 2022 · Big Data

Near Real-Time Data Lake Practices in TikTok E-commerce: Architecture, Techniques, and Case Studies

This article presents a comprehensive overview of TikTok e-commerce's near‑real‑time data lake implementation, detailing data lake characteristics, architecture choices, practical use cases across analysis and operations, and future challenges and plans.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

Big Data Technology & Architecture

Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch Processing · Big Data · Data Lake

0 likes · 13 min read

DataFunTalk

Dec 8, 2022 · Big Data

Arctic: NetEase’s Real-Time Lakehouse System Built on Apache Iceberg

This article introduces NetEase’s Arctic, a real‑time lakehouse system built on Apache Iceberg that unifies streaming and batch processing. It explains the challenges of the Lambda architecture, details Arctic’s features such as change/base stores, hidden queue, and transaction handling, and shares internal practice cases and the future roadmap.

Apache Iceberg · Arctic · Data Lake

0 likes · 12 min read

Architects' Tech Alliance

Dec 7, 2022 · Big Data

Why Data Lakes and Data Warehouses Must Converge: The Rise of Lakehouse Architecture

This article traces 20 years of big‑data evolution, defines data lakes and data warehouses, compares their trade‑offs, and explains how lakehouse solutions, exemplified by Alibaba Cloud MaxCompute, merge flexibility with enterprise‑grade performance to lower total cost of ownership.

Big Data Architecture · Cloud Data Platform · Data Lake

0 likes · 32 min read

StarRocks

Dec 1, 2022 · Big Data

How Alibaba Cloud EMR StarRocks Supercharges Data Lake Analytics with Advanced Optimizations

This article explains how Alibaba Cloud EMR StarRocks extends data lake analytics to support Hive, Iceberg, and Hudi, detailing its architecture, Iceberg integration, performance gains over Trino, IO merging, lazy materialization, intelligent caching, and elastic compute capabilities for faster, unified, and cost‑effective queries.

Data Lake · EMR · Elastic Compute

0 likes · 16 min read

DataFunSummit

Nov 23, 2022 · Big Data

Lakehouse Analysis Service (LAS): Architecture, Challenges, and Service Design

The article introduces the Lakehouse Analysis Service (LAS), explains its layered architecture that unifies data lake and warehouse capabilities, discusses challenges with Apache Hudi metadata and consistency, and details the design of the unified MetaServer, Table Management Service, concurrency control, async compaction, event bus, and future roadmap.

Apache Hudi · Data Lake

0 likes · 18 min read

Architects Research Society

Nov 19, 2022 · Big Data

Data Warehouse vs Data Lake: Definitions, Differences, and Best Practices

This article explains the fundamental concepts of data warehouses and data lakes, compares their architectures and use cases, discusses common misconceptions, highlights real‑world examples such as Facebook, and outlines the challenges and strategic considerations for organizations adopting both technologies.

Analytics · Cloud · Data Lake

0 likes · 13 min read

ITPUB

Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch Processing · Big Data · Data Lake

0 likes · 23 min read

ByteDance Data Platform

Nov 16, 2022 · Big Data

How ByteDance’s Data Lake Powers Near‑Real‑Time E‑Commerce Analytics

This article explains ByteDance’s data lake technology, its Apache Hudi‑based features, near‑real‑time architecture, and practical e‑commerce use cases such as marketing promotion, traffic diagnosis, logistics monitoring, risk governance, and operational monitoring, while outlining future challenges and plans.

Apache Hudi · Big Data Architecture · Data Lake

0 likes · 15 min read

DataFunTalk

Nov 13, 2022 · Big Data

Iceberg Data Lake: Technology Overview, Xiaomi Practices, and Stream‑Batch Integration

This article presents an overview of the Iceberg table format, its core architecture and advantages, details Xiaomi’s large‑scale deployment and use cases, explores stream‑batch integration with Spark and Flink, outlines data correction methods, future plans, and answers common technical questions.

Data Lake · Flink · Iceberg

0 likes · 20 min read

DataFunTalk

Nov 5, 2022 · Big Data

Evolution of ByteDance Data Lake Indexing: Hudi Index Enhancements and Future Directions

This article presents ByteDance's evolution of data lake indexing built on Apache Hudi, detailing traditional update challenges, Hudi's index mechanisms, the introduction of bucket and extensible hash indexes, query optimizations, and upcoming multi‑modal and range index innovations.

Bucket Index · Data Lake · Extensible Hash

0 likes · 12 min read

Baidu Geek Talk

Nov 3, 2022 · Cloud Native

Challenges and Solutions for AI Storage Systems in Cloud‑Native Training

The talk outlines how AI training’s growing data and compute demands create storage bottlenecks across four evolutionary stages, identifies four core problems—massive data, data‑flow, resource scheduling, and compute acceleration—and proposes hardware, software (parallel file systems, caching), and cloud‑native orchestration (Fluid, Baidu Canghai) solutions that combine object‑storage lakes with high‑performance acceleration layers to achieve near‑full GPU utilization.

AI · Caching · Cloud Native

0 likes · 37 min read

DataFunSummit

Oct 29, 2022 · Big Data

Apache Iceberg in Tencent: Architecture, Spark Read/Write, Production Practices, and Data Governance

This article presents an in‑depth overview of Apache Iceberg as used at Tencent, covering its table format architecture, Spark read/write mechanisms, production challenges and optimizations such as schema evolution, file filtering, upsert strategies, and the surrounding data‑governance services.

Apache Iceberg · Big Data · Data Governance

0 likes · 19 min read

NetEase Cloud Music Tech Team

Oct 26, 2022 · Big Data

Arctic: NetEase's Streaming Lakehouse Service and Hive-Based Stream-Batch Integration Practice

Arctic, NetEase’s streaming lakehouse built on Apache Iceberg, unifies streaming and batch workloads with millisecond‑level latency, Hive compatibility, and built‑in message‑queue support, delivering CDC, upserts and OLAP without a Lambda architecture, as demonstrated by real‑time processing of 2 PB of Hive data for Cloud Music.

Apache Iceberg · Arctic · Big Data Architecture

0 likes · 15 min read

Xingsheng Youxuan Technology Community

Oct 21, 2022 · Big Data

How We Cut Hudi Data Lake Write Costs by Over 85% with Custom Architecture

This article examines the challenges of using Apache Hudi for real‑time data lake writes, analyzes the COW and MOR write models, and presents a custom master‑worker architecture with index optimization and repartitioning that reduces write resource consumption by over 85% while boosting throughput up to 300‑fold.

COW · Data Lake · Hudi

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

Oct 19, 2022 · Artificial Intelligence

Why Storage Systems Bottleneck AI Training and How to Accelerate Them

This article examines the comprehensive challenges AI applications face from storage to compute, traces the evolution of AI training infrastructure, analyzes key bottlenecks such as compute acceleration, resource scheduling, massive data handling and data flow, and presents Baidu Cloud's storage acceleration solutions—including parallel file systems, caching, and the Fluid scheduler—to dramatically improve AI training performance.

AI training · Cloud Native · Data Lake

0 likes · 38 min read

ITPUB

Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read

Xingsheng Youxuan Technology Community

Oct 14, 2022 · Big Data

How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Data Lake · Data Warehouse · ETL

0 likes · 16 min read

DataFunTalk

Oct 14, 2022 · Big Data

Exploring Flink and Apache Hudi for Streaming Data Lakes: Design, Practices, and Roadmap

This article presents a comprehensive overview of using Flink with Apache Hudi to build streaming data lake solutions, covering Hudi's background, core features, Flink‑Hudi integration design, practical use cases, recent roadmap updates, and a Q&A session.

Apache Hudi · Data Lake · Flink

0 likes · 19 min read

Big Data Technology & Architecture

Oct 13, 2022 · Big Data

Hudi Clustering After Batch Processing: Merging Small Files Before Streaming

This guide details how to execute Apache Hudi file clustering after a batch job and before streaming, using Spark commands to merge numerous small HDFS files into larger ones, configure clustering and cleaning policies, and verify the results with HDFS counts.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

DataFunSummit

Oct 5, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article explains how Amazon EMR Serverless leverages serverless architecture to simplify, scale, and reduce the cost of big data analytics by providing managed Hadoop‑based services, flexible resource allocation, built‑in security, and seamless integration with the AWS data lake ecosystem.

AWS · Amazon EMR Serverless · Big Data

0 likes · 16 min read

DataFunTalk

Oct 4, 2022 · Big Data

Near‑Real‑Time Data Lake Practices in TikTok E‑commerce Data Warehouse

The presentation by TikTok e‑commerce data‑warehouse engineer Ma Wenyuan explains data‑lake characteristics, near‑real‑time architecture, and practical e‑commerce use cases, highlighting Apache Hudi features, hybrid batch‑stream processing, and future challenges for scaling and integration.

Data Lake · Hudi · Streaming

0 likes · 13 min read

Tencent Cloud Developer

Sep 27, 2022 · Big Data

GooseFS: Accelerating Cloud Storage for Big Data and Data Lake Platforms

GooseFS, Tencent Cloud’s Hadoop‑compatible storage accelerator, adds a local NVMe‑SSD cache layer to cloud‑native data lakes, letting users boost query speeds by up to 46 % and cut backend bandwidth by 200 Gbps without code changes, as demonstrated by a music‑industry customer’s 200‑node deployment caching ten million files.

Data Lake · GooseFS · High Availability

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Sep 20, 2022 · Big Data

How Alibaba Cloud’s Data Lake Metadata Warehouse Transforms Big Data Management

This article explains the challenges of data lake adoption and details Alibaba Cloud’s metadata warehouse architecture, construction, search capabilities, asset analysis, fine‑grained profiling, and lifecycle management that together enable efficient, cloud‑native big data management.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 13 min read

DataFunTalk

Sep 17, 2022 · Big Data

Real-Time Data Warehouse Practices with Hudi at ByteDance

This presentation details ByteDance's real‑time data‑warehouse implementations using Apache Hudi, covering scenario classifications, challenges of traditional offline warehouses, practical solutions for ingestion, upsert, validation, indexing, query optimization, and future plans for extensible indexing and unified batch‑stream processing.

Data Lake · Hudi · Optimization

0 likes · 16 min read

DataFunSummit

Sep 15, 2022 · Big Data

Amazon Real-Time Data Warehouse Architecture and Services Overview

This article reviews the evolution of data warehouse architectures, explains Amazon's serverless real-time data lake design and its key services, and details Amazon Redshift's cloud-native real-time data warehouse features, streaming ingestion, and integrated machine learning capabilities.

AWS · Amazon Redshift · Big Data

0 likes · 10 min read

dbaplus Community

Sep 14, 2022 · Databases

How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

Apache Doris · Apache Hudi · Big Data

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Sep 13, 2022 · Big Data

From Hadoop to Cloud‑Native: The Evolution of Data Lakes and Modern Architecture

This article traces the history of data lakes from their 2010 inception with Hadoop through cloud‑native object storage, lakehouse formats like Delta Lake, and Alibaba Cloud's multi‑layer solution, outlining key architectural stages and practical construction challenges for enterprise‑grade implementations.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 9 min read

Tencent Cloud Developer

Sep 9, 2022 · Big Data

Data Lake, Data Warehouse, and Lakehouse: Concepts, Architectures, and Industry Practices

The article explains how data lakes excel at ingesting massive, varied data, data warehouses optimize storage and query performance, and lake‑house architectures combine both strengths—offering scalable, low‑cost storage with high‑speed analytics—highlighting industry solutions from Snowflake, Databricks, and major cloud providers.

Analytics · Big Data · Data Lake

0 likes · 8 min read

DataFunSummit

Sep 7, 2022 · Big Data

Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

Apache Doris · Big Data · Data Lake

0 likes · 10 min read

DataFunSummit

Sep 5, 2022 · Big Data

DataFun Summit 2022 – Modern Data Stack Forum: Speaker Lineup and Session Overviews

The DataFun Summit 2022 featured a Data Lake & Warehouse forum with expert talks on PALO, ByteDance LAS, Iceberg at Huawei, and Presto‑Alluxio acceleration, providing detailed technical outlines, speaker backgrounds, and audience takeaways for modern big‑data architectures.

Apache Iceberg · Big Data · Data Lake

0 likes · 7 min read

DataFunTalk

Aug 29, 2022 · Big Data

Migrating from Lambda Architecture to an Iceberg‑Based Unified Batch‑Stream Architecture at NetEase Yanxuan

This article details how NetEase Yanxuan upgraded its legacy Lambda data pipeline to a unified batch‑stream architecture built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and DeltaLake, implementation specifics, table‑governance techniques, and future roadmap.

Batch-Stream · Data Lake · Flink

0 likes · 14 min read

Past Memory Big Data

Aug 11, 2022 · Big Data

What Kind of Data Lake Do Enterprises Really Need? Lessons from Delta 2.0

The article examines the open‑source release of Delta 2.0, compares its features and benchmark results with Iceberg and Hudi, discusses the core capabilities required by enterprises for a lakehouse architecture, and introduces the Arctic project as a multi‑engine streaming lake service.

Arctic · Data Lake · Delta Lake

0 likes · 25 min read

DataFunTalk

Aug 10, 2022 · Big Data

Delta Lake 2.0, Iceberg, Hudi: A Comparative Study and the Arctic Lakehouse Service

The article reviews recent developments in data‑lake table formats—Delta Lake 2.0, Iceberg, and Hudi—examining their features, benchmark results, and ecosystem impact, and then introduces Arctic, an open‑source streaming lakehouse service built on Iceberg that aims to bridge batch‑stream gaps for enterprises.

Data Lake · Delta Lake · Hudi

0 likes · 24 min read

Baidu Geek Talk

Aug 5, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article analyzes Baidu Intelligent Cloud's data‑lake acceleration strategy, covering the evolution of big‑data architectures, the advantages and challenges of compute‑storage separation, the native hierarchical namespace and RapidFS cache solutions, performance test results, and recommended deployment patterns.

BOS · Compute-Storage Separation · Data Lake

0 likes · 17 min read

DataFunTalk

Aug 5, 2022 · Big Data

Delta Lake Principles, eBay Migration, and Practical Enhancements

This talk by eBay software engineer Zhu Feng explains the fundamentals of Delta Lake and Lakehouse architecture, outlines eBay’s migration from Teradata to a Spark‑based platform, and details the custom enhancements, performance optimizations, and operational improvements implemented to support large‑scale update and delete workloads.

Data Lake · Delta Lake · Lakehouse

0 likes · 16 min read

High Availability Architecture

Aug 5, 2022 · Big Data

Innovative Marketing Practices on the Cloud: How an Intelligent Data Lake Enables Flexible and Efficient Marketing Capabilities

The presentation details how Amazon Web Services’ intelligent data lake architecture integrates big data and machine learning to overcome marketing challenges, improve data governance, and provide scalable, real‑time analytics for personalized, data‑driven marketing across enterprises.

AWS · Big Data · Cloud Computing

0 likes · 13 min read

Architecture Digest

Aug 1, 2022 · Big Data

Understanding Data Lakes: Concepts, Features, Architectures, and Vendor Solutions

This article provides a comprehensive overview of data lakes, explaining their definition, key characteristics, architectural evolution, and detailed comparisons of major cloud providers' solutions, while also presenting typical use cases, construction processes, and future development directions for this emerging big‑data infrastructure.

AWS · Alibaba Cloud · Azure

0 likes · 52 min read

Programmer DD

Jul 28, 2022 · Databases

Why MongoDB Is Adding Native Analytics and What It Means for Developers

MongoDB is evolving from a purely operational document store to a hybrid system that embeds native analytics, cloud‑native features, and SQL access, aiming to boost developer productivity, support real‑time insights, and complement rather than replace traditional data warehouses.

Analytics · Cloud · Data Lake

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Jul 28, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article explains Baidu Intelligent Cloud’s data lake acceleration solution, covering the evolution of big‑data technologies, the benefits and challenges of compute‑storage separation, the architecture of BOS object storage, and the native hierarchical namespace and RapidFS cache mechanisms that boost performance and reduce costs.

BOS · Big Data · Compute-Storage Separation

0 likes · 18 min read

Big Data Technology & Architecture

Jul 27, 2022 · Big Data

Step-by-Step Guide to Installing and Using Flink with Iceberg for Real-Time Data Lake

This article provides a comprehensive tutorial on setting up Flink 1.11 with Iceberg 0.11.1, creating Hive catalogs, building databases and tables, inserting data, and exploring Iceberg components, file structures, partitioned tables, execution plans, and programmatic access via Scala.

Big Data · Data Lake · Flink

0 likes · 10 min read

ITPUB

Jul 24, 2022 · Databases

How Apache Doris Enables Real‑Time Queries on Hudi Data Lakes

This article explains Apache Doris’s architecture, introduces the Hudi data‑lake format, compares Lambda and Kappa approaches, and details the design and implementation of Doris’s Hudi external table support, including practical steps, code examples, and future roadmap.

Apache Doris · Big Data · Data Lake

0 likes · 10 min read

Past Memory Big Data

Jul 22, 2022 · Big Data

Choosing Modern Data Architecture: Data Fabric vs. Data Mesh

The article compares Data Fabric and Data Mesh as modern data‑architecture approaches, explains their technical and organizational differences, discusses the ongoing debate between data lakes, warehouses, and lakehouses, and highlights how each option fits varying data‑type and usage scenarios.

Data Architecture · Data Fabric · Data Lake

0 likes · 4 min read

Baidu Intelligent Cloud Tech Hub

Jul 21, 2022 · Cloud Computing

How Baidu’s Cloud Storage Powers High‑Performance Computing and AI Workloads

This article explains the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and HPDA—then details Baidu’s unified storage platform, object storage BOS, and runtime solutions PFS and RapidFS, illustrating their architecture, features, and a real‑world autonomous‑driving customer case.

AI training · Data Lake · cloud storage

0 likes · 29 min read

DataFunTalk

Jul 18, 2022 · Big Data

Integrating Apache Doris with Hudi: Design, Implementation, and Future Plans

This article introduces Apache Doris, an MPP analytical database, and explains how it integrates with the Hudi data lake format, covering architectural features, design choices, implementation steps including external table creation and query processing, and outlines future enhancements for supporting MOR snapshots and incremental queries.

Apache Doris · Data Lake · Hudi

0 likes · 12 min read

DataFunTalk

Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

Bilibili Tech

Jul 15, 2022 · Big Data

Lakehouse Architecture Practice at Bilibili: Query Acceleration and Index Enhancement

Bilibili’s lakehouse architecture merges Iceberg‑based data lake flexibility with data‑warehouse efficiency, using Kafka‑Flink real‑time ingestion, Spark offline loads, Trino queries, Alluxio caching, Z‑Order/Hilbert sorting, and enhanced BloomFilter and bitmap indexes to boost query speed up to tenfold while drastically cutting file reads.

Big Data Architecture · Bitmap Index · Data Lake

0 likes · 17 min read

DataFunSummit

Jul 12, 2022 · Big Data

Practical Use of Apache Iceberg in Microvision's Data Warehouse: Architecture, Real‑time Integration, and Table Maintenance

This article details why Microvision adopted Apache Iceberg, how it replaces parts of their Lambda‑architecture data pipeline, the real‑time and offline use cases, table‑maintenance practices such as snapshot cleanup and small‑file merging, and lessons learned from the implementation.

Big Data · Data Lake · Flink

0 likes · 17 min read

Big Data Technology & Architecture

Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data · Data Lake · Iceberg

0 likes · 17 min read

DataFunTalk

Jul 10, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article presents a comprehensive overview of how Amazon EMR Serverless leverages serverless technology to simplify, scale, and cost‑optimize big data analytics, covering the evolution of serverless services, the intelligent lakehouse architecture, core concepts, key benefits, common use cases, and available documentation.

Amazon EMR · Analytics · Big Data

0 likes · 17 min read

Big Data Technology & Architecture

Jul 7, 2022 · Big Data

Deep Dive into Apache Iceberg Core Features and Flink Integration

This article explains Apache Iceberg’s architecture, core capabilities such as time‑travel, fast scans, delete handling, and schema evolution, and provides a step‑by‑step guide for configuring Flink to use Iceberg with Hive and Hadoop catalogs, including DDL commands and streaming queries.

Apache Iceberg · Big Data · Data Lake

0 likes · 22 min read

Big Data Technology & Architecture

Jul 6, 2022 · Big Data

Understanding Apache Iceberg File Storage Format and Write Processes in Spark and Flink

This article explains the Apache Iceberg file storage format, its metadata hierarchy, and demonstrates how Spark and Flink write data to Iceberg tables, including detailed code examples, manifest handling, snapshot management, and commit processes for efficient data lake operations.

Apache Iceberg · Big Data · Data Lake

0 likes · 31 min read

Baidu Intelligent Cloud Tech Hub

Jun 30, 2022 · Big Data

Why Data Lakes Need Data Warehouses: Evolution of Modern Data Platforms

This article traces the evolution of enterprise data platforms—from early data warehouses to modern data lakes and the emerging lakehouse—detailing key technologies, challenges, and best practices for storage, compute engines, metadata, and integration, while highlighting how cloud-native object storage reshapes scalability and cost.

Big Data · Data Lake · Data Warehouse

0 likes · 27 min read

Volcano Engine Developer Services

Jun 20, 2022 · Big Data

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Big Data · Data Lake · Iceberg

0 likes · 13 min read

Top Architect

Jun 18, 2022 · Big Data

Overview of Data Lakes and the Open SPL Compute Engine

This article explains the concept and challenges of data lakes, describes the “impossible triangle” of storage, compute, and cost, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance computing to overcome those limitations.

Data Lake · SPL · compute engine

0 likes · 13 min read

DataFunSummit

May 30, 2022 · Big Data

Lakehouse Architecture at Bilibili: Query Acceleration and Index Enhancement Practices

This article explains Bilibili's lake‑warehouse integrated architecture, describing how Iceberg, Z‑Order sorting, and advanced indexing techniques such as BloomFilter and BitMap are used to accelerate queries and improve data organization in large‑scale analytics workloads.

Big Data · Data Lake · Data Warehouse

0 likes · 18 min read

Architect's Tech Stack

May 28, 2022 · Big Data

Data Lake Challenges and the Open SPL Computing Engine

The article examines the inherent trade‑offs of data lakes—maintaining raw data, enabling efficient computation, and keeping costs low—explains why traditional data‑warehouse approaches fall short, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance analytics to overcome these limitations.

Big Data · Data Lake · ETL

0 likes · 12 min read

DataFunTalk

May 24, 2022 · Big Data

Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

Big Data · CDC · Data Lake

0 likes · 18 min read

Alibaba Cloud Developer

May 18, 2022 · Big Data

Why Delta Lake Is Revolutionizing Data Lakes with ACID Guarantees

This article explains how Delta Lake adds reliability to data lakes by offering ACID transactions, scalable metadata, and unified batch‑and‑stream processing, outlines the challenges it solves, details its implementation principles, and demonstrates a practical demo for building an integrated data warehouse.

ACID · Big Data · Data Engineering

0 likes · 9 min read

Big Data Technology & Architecture

May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 43 min read

DataFunTalk

May 17, 2022 · Big Data

Exploring JuiceFS in Data Lake Storage Architecture

This presentation provides a comprehensive overview of JuiceFS, an open‑source cloud‑native distributed file system, detailing its role in modern data lake and lakehouse architectures, comparing it with HDFS and object storage, and highlighting its performance, integration, and community ecosystem.

Big Data · Data Lake · Distributed File System

0 likes · 19 min read

ITPUB

Apr 26, 2022 · Big Data

Mastering Delta Lake: From Data Lake Basics to Hands‑On Implementation

This article explains the fundamentals of data lakes and data warehouses, compares their architectures, outlines the challenges of data lakes, and then dives deep into Delta Lake's core features, storage model, ACID guarantees, concurrency handling, and provides step‑by‑step Spark code examples for practical use.

ACID · Copy-on-Write · Data Lake

0 likes · 18 min read

StarRocks

Apr 13, 2022 · Big Data

How StarRocks Achieves Lightning‑Fast Data Lake Analytics

This article explains StarRocks' streamlined architecture, cost‑based optimizer, massively parallel processing and vectorized engine, and how they enable high‑performance queries over data stored in Hive, Iceberg, Hudi and other lake formats, backed by benchmark results and future roadmap details.

Big Data · CBO · Data Lake

0 likes · 19 min read

NetEase Yanxuan Technology Product Team

Mar 30, 2022 · Big Data

Data Lake Construction and Practice at NetEase Yanxuan

NetEase Yanxuan replaced its cumbersome data‑warehouse with a flexible Delta‑Lake/Iceberg data lake, creating a unified metadata layer and real‑time ingestion pipelines that cut latency from nightly batches to seconds, slashed compute and storage costs, supported diverse business scenarios and machine‑learning feature engineering, and set the stage for broader future expansion.

Data Integration · Data Lake · Delta Lake

0 likes · 16 min read

Yanxuan Tech Team

Mar 29, 2022 · Big Data

How NetEase Yanxuan Built a Real‑Time Data Lake to Boost Efficiency

This article explains how NetEase Yanxuan evolved from a traditional data‑warehouse pipeline to a cloud‑native data‑lake architecture, detailing the business challenges, design choices, technology stack (Delta, Iceberg, Hudi), implementation steps, and the resulting gains in real‑time data access, cost reduction, and feature‑engineering support.

Data Lake · Delta Lake · Hudi

0 likes · 18 min read

DataFunTalk

Mar 29, 2022 · Big Data

FlinkX Multi-Source Heterogeneous Data Synchronization Framework: Architecture, Features, and Cloud‑Native Enhancements

This article introduces the FlinkX framework for multi‑source heterogeneous data synchronization, detailing its background, core functions such as checkpoint‑based resume, metric monitoring, rate limiting, plugin architecture, cloud‑native K8s deployment, Hudi integration, and future roadmap, while also addressing common Q&A topics.

Batch · Big Data · Data Lake

0 likes · 14 min read

DataFunTalk

Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data · Data Lake · Flink

0 likes · 12 min read

Alibaba Cloud Developer

Mar 15, 2022 · Big Data

How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture

This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.

Analytics Engine · Big Data · Data Lake

0 likes · 17 min read

DataFunTalk

Mar 13, 2022 · Big Data

Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.

Big Data · Cloud Computing · Data Lake

0 likes · 18 min read

StarRocks

Mar 4, 2022 · Big Data

How StarRocks Powers Ultra‑Fast Data Lake Analytics: Architecture and Core Techniques

This article explains the fundamentals of data lake analytics, compares optimization strategies such as rule‑based vs cost‑based and record‑oriented vs block‑oriented processing, describes StarRocks' lightweight frontend/backend architecture, and presents benchmark results that demonstrate its performance advantages over competing engines.

Analytics Engine · Data Lake · StarRocks

0 likes · 17 min read

Big Data Technology & Architecture

Mar 4, 2022 · Big Data

Managing Small Files in Apache Hudi and Spark Optimization Guide

The article explains how Apache Hudi automatically manages file sizes to avoid small‑file issues, details key configuration parameters, provides a step‑by‑step example of file merging, and offers practical Spark tuning recommendations for optimal performance in data‑lake workloads.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read

DataFunTalk

Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg · Big Data · Data Lake

0 likes · 16 min read

Big Data Technology & Architecture

Feb 28, 2022 · Big Data

Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples

This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

DataFunTalk

Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg · Big Data · Compaction

0 likes · 24 min read

Bilibili Tech

Feb 17, 2022 · Big Data

Bilibili's Lakehouse Architecture: Building a Unified Data Lake and Data Warehouse

Bilibili replaced its Hive‑Spark‑Presto ETL pipeline with a lakehouse built on Iceberg, using Magnus, Trino and Alluxio to unify a PB‑scale data lake and warehouse, adding Z‑Order sorting and indexing for fast multi‑dimensional queries while planning further schema and pre‑computation optimizations.

Data Lake · Data Warehouse · Iceberg

0 likes · 14 min read

DataFunTalk

Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg · Arctic · Big Data

0 likes · 10 min read

Architects Research Society

Feb 9, 2022 · Cloud Computing

Four Open‑Source Object Storage Platforms for Managing Large Unstructured Data

This article introduces object storage as a cost‑effective solution for massive unstructured data and reviews five open‑source platforms—LakeFS, Ceph, MinIO, OpenIO, and Apache Ozone—highlighting their features, scalability, and suitability for modern data‑lake and cloud‑native environments.

Data Lake · cloud storage · object-storage

0 likes · 7 min read

Big Data Technology & Architecture

Feb 8, 2022 · Big Data

Apache Hudi Overview: Design Principles, Table Architecture, and Read/Write Processes

This article provides a comprehensive overview of Apache Hudi, covering its storage reliance on HDFS, core design principles, table architecture, timeline management, file and index structures, as well as detailed read and write workflows for both Copy‑On‑Write and Merge‑On‑Read table types.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 16 min read

DataFunTalk

Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink

0 likes · 13 min read

Alibaba Cloud Native

Jan 26, 2022 · Big Data

How to Build a Lakehouse with RocketMQ and Apache Hudi: A Step‑by‑Step Guide

This article explains the Lakehouse architecture, its required features, the evolution of big‑data stacks, and provides a detailed, hands‑on guide for constructing a Lakehouse using RocketMQ (Connector & Stream) and Apache Hudi, including configuration, deployment, and sample code.

Apache Hudi · Big Data · Cloud Native

0 likes · 18 min read

IT Architects Alliance

Jan 26, 2022 · Big Data

Why Combine Data Lakes and Warehouses? Understanding Lakehouse Architecture

This article explains the concepts of data warehouses, data marts, and data lakes, illustrates why the lakehouse model emerged to bridge storage and compute, and outlines its key benefits such as flexibility, scalability, reduced redundancy, and unified analytics for modern enterprises.

Analytics · Big Data · Data Architecture

0 likes · 12 min read

21CTO

Jan 8, 2022 · Big Data

How Amazon’s Intelligent Lakehouse Redefines Big Data Architecture

The article examines Amazon’s Intelligent Lakehouse architecture, tracing its evolution from early data‑lake‑warehouse integrations to a modern, serverless, secure, and AI‑enhanced platform that unifies data storage, governance, and analytics to lower big‑data costs and boost agility.

Big Data · Data Governance · Data Lake

0 likes · 12 min read

DataFunTalk

Jan 8, 2022 · Big Data

Lakehouse: Concepts, Architecture, Implementation, and Cloud Practices

This article provides a comprehensive overview of the Lakehouse paradigm, tracing its origins from traditional data warehouses and data lakes, comparing architectures, detailing core components such as Delta Lake and Iceberg, and illustrating practical cloud implementations and future directions.

Apache Iceberg · Big Data · Cloud Data Platform

0 likes · 14 min read

Architects' Tech Alliance

Dec 26, 2021 · Big Data

Understanding Data Lakes: Concepts, Architecture, Vendor Solutions, and Implementation Practices

This comprehensive article explains what a data lake is, its core characteristics, reference architecture, major cloud vendor implementations, typical use cases such as advertising and gaming, step‑by‑step construction guidance, and future trends in cloud‑native big‑data platforms.

Data Architecture · Data Lake · Data Management

0 likes · 51 min read

DataFunSummit

Nov 28, 2021 · Big Data

Understanding Data Lakes: Definition, Architecture, Core Capabilities, and Comparison with Data Warehouses

The article explains what a data lake is, its architecture and core capabilities, compares it with data warehouses, discusses its value and challenges, and reviews major open‑source platforms such as Delta Lake, Iceberg, and Hudi.

Analytics · Big Data · Data Architecture

0 likes · 11 min read

Architects' Tech Alliance

Nov 12, 2021 · Big Data

Understanding Data Lakes: Definitions, Evolution, and Architectural Patterns

The article explains what a data lake is, compares various vendor definitions, outlines its four essential components, describes three evolutionary architecture stages from self‑hosted Hadoop to cloud‑native storage‑compute separation, and discusses the benefits and challenges of adopting data lake solutions in modern big‑data platforms.

AWS · Data Lake · Hadoop

0 likes · 8 min read

Big Data Technology & Architecture

Nov 8, 2021 · Big Data

Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices

This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.

Apache Iceberg · Data Lake · Flink

0 likes · 17 min read

Kuaishou Big Data

Oct 21, 2021 · Big Data

How Kuaishou Boosted Data Efficiency with Apache Hudi: Real‑Time + Offline Solutions

This article explains how Kuaishou tackled late data scheduling, costly synchronization, and inefficient back‑fills by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, and step‑by‑step implementation to achieve fast, fresh, and scalable data processing.

Data Lake · Flink · Hudi

0 likes · 13 min read

Big Data Technology & Architecture

Oct 12, 2021 · Big Data

Data Lake Evolution and a Practical Flink + Iceberg Implementation Guide

This article explores the evolution of data lakes, compares major cloud providers' lake architectures, introduces the emerging lakehouse concept, and provides a step‑by‑step Flink‑Iceberg implementation—including dependencies, catalog setup, table creation, checkpointing, and Kafka ingestion—demonstrating practical big‑data streaming solutions.

Data Lake · Flink · Iceberg

0 likes · 14 min read

ITPUB

Sep 9, 2021 · Big Data

Why Data Lakes Are Essential for Modern Data Platforms: Goals, Architecture, and Governance

This article explains the origins and purpose of data lakes, outlines four key construction goals, details common ingestion methods and storage technologies, and describes essential governance practices such as cataloging, data quality, and regulatory compliance.

Data Governance · Data Lake · ETL

0 likes · 18 min read

DataFunTalk

Sep 3, 2021 · Big Data

Building an Exabyte‑Scale Data Lake with Apache Hudi at ByteDance: Architecture, Design Choices, and Performance Optimizations

This article details ByteDance's implementation of an exabyte‑scale data lake using Apache Hudi, covering scenario requirements, engine selection, functional support, schema management, extensive performance tuning, and future directions, while also noting recruitment opportunities within the team.

Apache Hudi · Big Data · ByteDance

0 likes · 9 min read

Big Data Technology & Architecture

Aug 24, 2021 · Big Data

Comprehensive Overview of Data Lake Technologies: Iceberg, Hudi, and Delta Lake

This article provides an in-depth overview of data lake concepts, definitions, and essential features, followed by detailed case studies of enterprise data lake implementations and comparative analysis of leading data lake table formats—Iceberg, Hudi, and Delta Lake—highlighting their architectures, capabilities, and trade‑offs.

Data Lake · Delta Lake · Flink

0 likes · 19 min read

dbaplus Community

Aug 17, 2021 · Big Data

How JD Transformed Its Data Warehouse with Delta Lake for Real‑Time Analytics

This article examines JD's shift from a traditional Lambda‑based data warehouse to a Delta Lake‑powered real‑time data lake, detailing the challenges of legacy architectures, the evaluation of open‑source table formats, Delta Lake's core mechanisms, and the resulting simplified batch‑stream development workflow.

Batch-Stream · Big Data · Data Lake

0 likes · 11 min read

DataFunTalk

Aug 11, 2021 · Big Data

OPPO CBFS: Architecture and Key Technologies of a Scalable Data Lake Storage System

This article introduces OPPO's self‑developed data lake storage system CBFS, covering the fundamentals of data lake storage, the multi‑layer CBFS architecture, its core technologies such as metadata management and erasure coding, and future directions for large‑scale, low‑cost data analytics.

CBFS · Cloud Native · Data Lake

0 likes · 14 min read

Architects Research Society

Jun 23, 2021 · Big Data

Understanding Data Lakes: Concepts, Benefits, and Comparison with Data Warehouses

The article explains what a data lake is, its origins, key characteristics such as storing all raw data, flexible access, and low‑cost storage, compares it with traditional data warehouses, discusses advantages, common criticisms, and the types of users who can benefit from it.

Data Lake · Data Management · Data Warehouse

0 likes · 10 min read

DataFunTalk

Jun 21, 2021 · Big Data

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

Big Data · Data Lake · Flink

0 likes · 13 min read

Qunar Tech Salon

Jun 21, 2021 · Big Data

Using Apache Iceberg 0.11 with Flink for Real‑time Data Lake: Architecture, Pain Points, and Solutions

This article examines the challenges of using Kafka, Flink, and Hive for real‑time data warehousing, introduces Apache Iceberg 0.11 as a solution, details its architecture, query planning, Flink integration, code examples, optimization techniques, and summarizes the benefits for large‑scale data processing.

Big Data · Data Lake · Flink

0 likes · 12 min read

Sohu Tech Products

Jun 16, 2021 · Big Data

Understanding Databases, Data Warehouses, Data Lakes, and the Emerging Lake House Architecture

This article explains the fundamental differences between databases, data warehouses, and data lakes, describes how they complement each other, and introduces the Lake House concept that integrates transactional and analytical workloads using cloud services such as Amazon S3, Redshift Spectrum, and Athena.

AWS · Big Data · Data Lake

0 likes · 11 min read
