Tagged articles
343 articles
Baidu Geek Talk
Nov 3, 2022 · Cloud Native

Challenges and Solutions for AI Storage Systems in Cloud‑Native Training

The talk outlines how AI training’s growing data and compute demands create storage bottlenecks across four evolutionary stages, identifies four core problems—massive data, data‑flow, resource scheduling, and compute acceleration—and proposes hardware, software (parallel file systems, caching), and cloud‑native orchestration (Fluid, Baidu Canghai) solutions that combine object‑storage lakes with high‑performance acceleration layers to achieve near‑full GPU utilization.

AI, Cloud Native, Data Lake
0 likes · 37 min read
NetEase Cloud Music Tech Team
Oct 26, 2022 · Big Data

Arctic: NetEase's Streaming Lakehouse Service and Hive-Based Stream-Batch Integration Practice

Arctic, NetEase’s streaming lakehouse built on Apache Iceberg, unifies streaming and batch workloads with millisecond‑level latency, Hive compatibility, and built‑in message‑queue support, delivering CDC, upserts and OLAP without a Lambda architecture, as demonstrated by real‑time processing of 2 PB of Hive data for Cloud Music.

Apache Iceberg, Arctic, Big Data Architecture
0 likes · 15 min read
Baidu Intelligent Cloud Tech Hub
Oct 19, 2022 · Artificial Intelligence

Why Storage Systems Bottleneck AI Training and How to Accelerate Them

This article examines the comprehensive challenges AI applications face from storage to compute, traces the evolution of AI training infrastructure, analyzes key bottlenecks such as compute acceleration, resource scheduling, massive data handling and data flow, and presents Baidu Cloud's storage acceleration solutions—including parallel file systems, caching, and the Fluid scheduler—to dramatically improve AI training performance.

AI training, Cloud Native, Data Lake
0 likes · 38 min read
ITPUB
Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi, Big Data, Data Lake
0 likes · 21 min read

How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Data Lake, Data Warehouse, ETL
0 likes · 16 min read
DataFunTalk
Oct 4, 2022 · Big Data

Near‑Real‑Time Data Lake Practices in TikTok E‑commerce Data Warehouse

The presentation by TikTok e‑commerce data‑warehouse engineer Ma Wenyuan explains data‑lake characteristics, near‑real‑time architecture, and practical e‑commerce use cases, highlighting Apache Hudi features, hybrid batch‑stream processing, and future challenges for scaling and integration.

Data Lake, Hudi, Streaming
0 likes · 13 min read
Tencent Cloud Developer
Sep 27, 2022 · Big Data

GooseFS: Accelerating Cloud Storage for Big Data and Data Lake Platforms

GooseFS, Tencent Cloud’s Hadoop‑compatible storage accelerator, adds a local NVMe‑SSD cache layer to cloud‑native data lakes, letting users boost query speeds by up to 46 % and cut backend bandwidth by 200 Gbps without code changes, as demonstrated by a music‑industry customer’s 200‑node deployment caching ten million files.

Cost reduction, Data Lake, GooseFS
0 likes · 16 min read
DataFunTalk
Sep 17, 2022 · Big Data

Real-Time Data Warehouse Practices with Hudi at ByteDance

This presentation details ByteDance's real‑time data‑warehouse implementations using Apache Hudi, covering scenario classifications, challenges of traditional offline warehouses, practical solutions for ingestion, upsert, validation, indexing, query optimization, and future plans for extensible indexing and unified batch‑stream processing.

Data Lake, Hudi, Streaming
0 likes · 16 min read
DataFunSummit
Sep 15, 2022 · Big Data

Amazon Real-Time Data Warehouse Architecture and Services Overview

This article reviews the evolution of data warehouse architectures, explains Amazon's serverless real-time data lake design and its key services, and details Amazon Redshift's cloud-native real-time data warehouse features, streaming ingestion, and integrated machine learning capabilities.

AWS, Amazon Redshift, Big Data
0 likes · 10 min read
dbaplus Community
Sep 14, 2022 · Databases

How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

Apache Doris, Apache Hudi, Big Data
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
Sep 13, 2022 · Big Data

From Hadoop to Cloud‑Native: The Evolution of Data Lakes and Modern Architecture

This article traces the history of data lakes from their 2010 inception with Hadoop through cloud‑native object storage, lakehouse formats like Delta Lake, and Alibaba Cloud's multi‑layer solution, outlining key architectural stages and practical construction challenges for enterprise‑grade implementations.

Alibaba Cloud, Big Data, Cloud Native
0 likes · 9 min read
Tencent Cloud Developer
Sep 9, 2022 · Big Data

Data Lake, Data Warehouse, and Lakehouse: Concepts, Architectures, and Industry Practices

The article explains how data lakes excel at ingesting massive, varied data, data warehouses optimize storage and query performance, and lake‑house architectures combine both strengths—offering scalable, low‑cost storage with high‑speed analytics—highlighting industry solutions from Snowflake, Databricks, and major cloud providers.

Analytics, Big Data, Data Lake
0 likes · 8 min read
DataFunSummit
Sep 7, 2022 · Big Data

Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

Apache Doris, Big Data, Data Lake
0 likes · 10 min read
DataFunTalk
Aug 29, 2022 · Big Data

Migrating from Lambda Architecture to an Iceberg‑Based Unified Batch‑Stream Architecture at NetEase Yanxuan

This article details how NetEase Yanxuan upgraded its legacy Lambda data pipeline to a unified batch‑stream architecture built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation specifics, table‑governance techniques, and future roadmap.

Batch-Stream, Data Lake, Flink
0 likes · 14 min read
DataFunTalk
Aug 10, 2022 · Big Data

Delta Lake 2.0, Iceberg, Hudi: A Comparative Study and the Arctic Lakehouse Service

The article reviews recent developments in data‑lake table formats—Delta Lake 2.0, Iceberg, and Hudi—examining their features, benchmark results, and ecosystem impact, and then introduces Arctic, an open‑source streaming lakehouse service built on Iceberg that aims to bridge batch‑stream gaps for enterprises.

Benchmark, Data Lake, Delta Lake
0 likes · 24 min read
Baidu Geek Talk
Aug 5, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article analyzes Baidu Intelligent Cloud's data‑lake acceleration strategy, covering the evolution of big‑data architectures, the advantages and challenges of compute‑storage separation, the native hierarchical namespace and RapidFS cache solutions, performance test results, and recommended deployment patterns.

BOS, Compute-Storage Separation, Data Lake
0 likes · 17 min read
DataFunTalk
Aug 5, 2022 · Big Data

Delta Lake Principles, eBay Migration, and Practical Enhancements

This talk by eBay software engineer Zhu Feng explains the fundamentals of Delta Lake and Lakehouse architecture, outlines eBay’s migration from Teradata to a Spark‑based platform, and details the custom enhancements, performance optimizations, and operational improvements implemented to support large‑scale update and delete workloads.

Data Lake, Delta Lake, Lakehouse
0 likes · 16 min read
High Availability Architecture
Aug 5, 2022 · Big Data

Innovative Marketing Practices on the Cloud: How an Intelligent Data Lake Enables Flexible and Efficient Marketing Capabilities

The presentation details how Amazon Web Services’ intelligent data lake architecture integrates big data and machine learning to overcome marketing challenges, improve data governance, and provide scalable, real‑time analytics for personalized, data‑driven marketing across enterprises.

AWS, Big Data, Data Governance
0 likes · 13 min read
Architecture Digest
Aug 1, 2022 · Big Data

Understanding Data Lakes: Concepts, Features, Architectures, and Vendor Solutions

This article provides a comprehensive overview of data lakes, explaining their definition, key characteristics, architectural evolution, and detailed comparisons of major cloud providers' solutions, while also presenting typical use cases, construction processes, and future development directions for this emerging big‑data infrastructure.

AWS, Alibaba Cloud, Azure
0 likes · 52 min read
Programmer DD
Jul 28, 2022 · Databases

Why MongoDB Is Adding Native Analytics and What It Means for Developers

MongoDB is evolving from a purely operational document store to a hybrid system that embeds native analytics, cloud‑native features, and SQL access, aiming to boost developer productivity, support real‑time insights, and complement rather than replace traditional data warehouses.

Analytics, Data Lake, MongoDB
0 likes · 12 min read
Baidu Intelligent Cloud Tech Hub
Jul 28, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article explains Baidu Intelligent Cloud’s data lake acceleration solution, covering the evolution of big‑data technologies, the benefits and challenges of compute‑storage separation, the architecture of BOS object storage, and the native hierarchical namespace and RapidFS cache mechanisms that boost performance and reduce costs.

BOS, Big Data, Compute-Storage Separation
0 likes · 18 min read
ITPUB
Jul 24, 2022 · Databases

How Apache Doris Enables Real‑Time Queries on Hudi Data Lakes

This article explains Apache Doris’s architecture, introduces the Hudi data‑lake format, compares Lambda and Kappa approaches, and details the design and implementation of Doris’s Hudi external table support, including practical steps, code examples, and future roadmap.

Apache Doris, Big Data, Data Lake
0 likes · 10 min read
Baidu Intelligent Cloud Tech Hub
Jul 21, 2022 · Cloud Computing

How Baidu’s Cloud Storage Powers High‑Performance Computing and AI Workloads

This article explains the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and HPDA—then details Baidu’s unified storage platform, object storage BOS, and runtime solutions PFS and RapidFS, illustrating their architecture, features, and a real‑world autonomous‑driving customer case.

AI training, Data Lake, cloud storage
0 likes · 29 min read
DataFunTalk
Jul 18, 2022 · Big Data

Integrating Apache Doris with Hudi: Design, Implementation, and Future Plans

This article introduces Apache Doris, an MPP analytical database, and explains how it integrates with the Hudi data lake format, covering architectural features, design choices, implementation steps including external table creation and query processing, and outlines future enhancements for supporting MOR snapshots and incremental queries.

Apache Doris, Data Lake, Hudi
0 likes · 12 min read
DataFunTalk
Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi, Big Data, Data Lake
0 likes · 15 min read
Bilibili Tech
Jul 15, 2022 · Big Data

Lakehouse Architecture Practice at Bilibili: Query Acceleration and Index Enhancement

Bilibili’s lakehouse architecture merges Iceberg‑based data lake flexibility with data‑warehouse efficiency, using Kafka‑Flink real‑time ingestion, Spark offline loads, Trino queries, Alluxio caching, Z‑Order/Hilbert sorting, and enhanced BloomFilter and bitmap indexes to boost query speed up to tenfold while drastically cutting file reads.

Big Data Architecture, Bitmap Index, Data Lake
0 likes · 17 min read
Big Data Technology & Architecture
Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data, Data Lake, Iceberg
0 likes · 17 min read
DataFunTalk
Jul 10, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article presents a comprehensive overview of how Amazon EMR Serverless leverages serverless technology to simplify, scale, and cost‑optimize big data analytics, covering the evolution of serverless services, the intelligent lakehouse architecture, core concepts, key benefits, common use cases, and available documentation.

Amazon EMR, Analytics, Big Data
0 likes · 17 min read
Baidu Intelligent Cloud Tech Hub
Jun 30, 2022 · Big Data

Why Data Lakes Need Data Warehouses: Evolution of Modern Data Platforms

This article traces the evolution of enterprise data platforms—from early data warehouses to modern data lakes and the emerging lakehouse—detailing key technologies, challenges, and best practices for storage, compute engines, metadata, and integration, while highlighting how cloud-native object storage reshapes scalability and cost.

Big Data, Data Lake, Data Warehouse
0 likes · 27 min read
Volcano Engine Developer Services
Jun 20, 2022 · Big Data

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Big Data, Data Lake, Iceberg
0 likes · 13 min read
Top Architect
Jun 18, 2022 · Big Data

Overview of Data Lakes and the Open SPL Compute Engine

This article explains the concept and challenges of data lakes, describes the “impossible triangle” of storage, compute, and cost, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance computing to overcome those limitations.

Data Lake, SPL, compute engine
0 likes · 13 min read
Architect's Tech Stack
May 28, 2022 · Big Data

Data Lake Challenges and the Open SPL Computing Engine

The article examines the inherent trade‑offs of data lakes—maintaining raw data, enabling efficient computation, and keeping costs low—explains why traditional data‑warehouse approaches fall short, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance analytics to overcome these limitations.

Big Data, Data Lake, ETL
0 likes · 12 min read
DataFunTalk
May 24, 2022 · Big Data

Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

Big Data, CDC, Data Lake
0 likes · 18 min read
Alibaba Cloud Developer
May 18, 2022 · Big Data

Why Delta Lake Is Revolutionizing Data Lakes with ACID Guarantees

This article explains how Delta Lake adds reliability to data lakes by offering ACID transactions, scalable metadata, and unified batch‑and‑stream processing, outlines the challenges it solves, details its implementation principles, and demonstrates a practical demo for building an integrated data warehouse.

ACID, Big Data, Data Lake
0 likes · 9 min read
Big Data Technology & Architecture
May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi, Big Data, Copy-on-Write
0 likes · 43 min read
DataFunTalk
May 17, 2022 · Big Data

Exploring JuiceFS in Data Lake Storage Architecture

This presentation provides a comprehensive overview of JuiceFS, an open‑source cloud‑native distributed file system, detailing its role in modern data lake and lakehouse architectures, comparing it with HDFS and object storage, and highlighting its performance, integration, and community ecosystem.

Big Data, Data Lake, Distributed File System
0 likes · 19 min read
ITPUB
Apr 26, 2022 · Big Data

Mastering Delta Lake: From Data Lake Basics to Hands‑On Implementation

This article explains the fundamentals of data lakes and data warehouses, compares their architectures, outlines the challenges of data lakes, and then dives deep into Delta Lake's core features, storage model, ACID guarantees, concurrency handling, and provides step‑by‑step Spark code examples for practical use.

ACID, Copy-on-Write, Data Lake
0 likes · 18 min read
StarRocks
Apr 13, 2022 · Big Data

How StarRocks Achieves Lightning‑Fast Data Lake Analytics

This article explains StarRocks' streamlined architecture, cost‑based optimizer, massively parallel processing and vectorized engine, and how they enable high‑performance queries over data stored in Hive, Iceberg, Hudi and other lake formats, backed by benchmark results and future roadmap details.

Big Data, CBO, Data Lake
0 likes · 19 min read

Data Lake Construction and Practice at NetEase Yanxuan

NetEase Yanxuan replaced its cumbersome data‑warehouse with a flexible Delta‑Lake/Iceberg data lake, creating a unified metadata layer and real‑time ingestion pipelines that cut latency from nightly batches to seconds, slashed compute and storage costs, supported diverse business scenarios and machine‑learning feature engineering, and set the stage for broader future expansion.

Data Integration, Data Lake, Delta Lake
0 likes · 16 min read
Yanxuan Tech Team
Mar 29, 2022 · Big Data

How NetEase Yanxuan Built a Real‑Time Data Lake to Boost Efficiency

This article explains how NetEase Yanxuan evolved from a traditional data‑warehouse pipeline to a cloud‑native data‑lake architecture, detailing the business challenges, design choices, technology stack (Delta, Iceberg, Hudi), implementation steps, and the resulting gains in real‑time data access, cost reduction, and feature‑engineering support.

Data Lake, Delta Lake, Hudi
0 likes · 18 min read
DataFunTalk
Mar 29, 2022 · Big Data

FlinkX Multi-Source Heterogeneous Data Synchronization Framework: Architecture, Features, and Cloud‑Native Enhancements

This article introduces the FlinkX framework for multi‑source heterogeneous data synchronization, detailing its background, core functions such as checkpoint‑based resume, metric monitoring, rate limiting, plugin architecture, cloud‑native K8s deployment, Hudi integration, and future roadmap, while also addressing common Q&A topics.

Batch, Big Data, Data Lake
0 likes · 14 min read
DataFunTalk
Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data, Data Lake, Flink
0 likes · 12 min read
Alibaba Cloud Developer
Mar 15, 2022 · Big Data

How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture

This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.

Analytics Engine, Big Data, Data Lake
0 likes · 17 min read
DataFunTalk
Mar 13, 2022 · Big Data

Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.

Big Data, Data Lake, Hive Metastore
0 likes · 18 min read
StarRocks
Mar 4, 2022 · Big Data

How StarRocks Powers Ultra‑Fast Data Lake Analytics: Architecture and Core Techniques

This article explains the fundamentals of data lake analytics, compares optimization strategies such as rule‑based vs cost‑based and record‑oriented vs block‑oriented processing, describes StarRocks' lightweight frontend/backend architecture, and presents benchmark results that demonstrate its performance advantages over competing engines.

Analytics Engine, Data Lake, StarRocks
0 likes · 17 min read
DataFunTalk
Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg, Big Data, Data Lake
0 likes · 16 min read
DataFunTalk
Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg, Big Data, Data Lake
0 likes · 24 min read
Bilibili Tech
Feb 17, 2022 · Big Data

Bilibili's Lakehouse Architecture: Building a Unified Data Lake and Data Warehouse

Bilibili replaced its Hive‑Spark‑Presto ETL pipeline with a lakehouse built on Iceberg, using Magnus, Trino and Alluxio to unify a PB‑scale data lake and warehouse, adding Z‑Order sorting and indexing for fast multi‑dimensional queries while planning further schema and pre‑computation optimizations.

Data Lake, Data Warehouse, Iceberg
0 likes · 14 min read
DataFunTalk
Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg, Arctic, Big Data
0 likes · 10 min read
DataFunTalk
Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data, Data Lake, Flink
0 likes · 13 min read
21CTO
Jan 8, 2022 · Big Data

How Amazon’s Intelligent Lakehouse Redefines Big Data Architecture

The article examines Amazon’s Intelligent Lakehouse architecture, tracing its evolution from early data‑lake‑warehouse integrations to a modern, serverless, secure, and AI‑enhanced platform that unifies data storage, governance, and analytics to lower big‑data costs and boost agility.

Big Data, Data Governance, Data Lake
0 likes · 12 min read
DataFunTalk
Jan 8, 2022 · Big Data

Lakehouse: Concepts, Architecture, Implementation, and Cloud Practices

This article provides a comprehensive overview of the Lakehouse paradigm, tracing its origins from traditional data warehouses and data lakes, comparing architectures, detailing core components such as Delta Lake and Iceberg, and illustrating practical cloud implementations and future directions.

Apache Iceberg, Big Data, Cloud Data Platform
0 likes · 14 min read
Architects' Tech Alliance
Nov 12, 2021 · Big Data

Understanding Data Lakes: Definitions, Evolution, and Architectural Patterns

The article explains what a data lake is, compares various vendor definitions, outlines its four essential components, describes three evolutionary architecture stages from self‑hosted Hadoop to cloud‑native storage‑compute separation, and discusses the benefits and challenges of adopting data lake solutions in modern big‑data platforms.

AWS, Data Lake, Hadoop
0 likes · 8 min read
Big Data Technology & Architecture
Nov 8, 2021 · Big Data

Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices

This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.

Apache Iceberg, Data Lake, Flink
0 likes · 17 min read
Big Data Technology & Architecture
Oct 12, 2021 · Big Data

Data Lake Evolution and a Practical Flink + Iceberg Implementation Guide

This article explores the evolution of data lakes, compares major cloud providers' lake architectures, introduces the emerging lakehouse concept, and provides a step‑by‑step Flink‑Iceberg implementation—including dependencies, catalog setup, table creation, checkpointing, and Kafka ingestion—demonstrating practical big‑data streaming solutions.

Data Lake · Flink · Iceberg
0 likes · 14 min read
DataFunTalk
Sep 3, 2021 · Big Data

Building an Exabyte‑Scale Data Lake with Apache Hudi at ByteDance: Architecture, Design Choices, and Performance Optimizations

This article details ByteDance's implementation of an exabyte‑scale data lake using Apache Hudi, covering scenario requirements, engine selection, functional support, schema management, extensive performance tuning, and future directions, while also noting recruitment opportunities within the team.

Apache Hudi · Big Data · ByteDance
0 likes · 9 min read
Big Data Technology & Architecture
Aug 24, 2021 · Big Data

Comprehensive Overview of Data Lake Technologies: Iceberg, Hudi, and Delta Lake

This article provides an in-depth overview of data lake concepts, definitions, and essential features, followed by detailed case studies of enterprise data lake implementations and comparative analysis of leading data lake table formats—Iceberg, Hudi, and Delta Lake—highlighting their architectures, capabilities, and trade‑offs.

Data Lake · Delta Lake · Flink
0 likes · 19 min read
dbaplus Community
Aug 17, 2021 · Big Data

How JD Transformed Its Data Warehouse with Delta Lake for Real‑Time Analytics

This article examines JD's shift from a traditional Lambda‑based data warehouse to a Delta Lake‑powered real‑time data lake, detailing the challenges of legacy architectures, the evaluation of open‑source table formats, Delta Lake's core mechanisms, and the resulting simplified batch‑stream development workflow.

Batch-Stream · Big Data · Data Lake
0 likes · 11 min read
DataFunTalk
Jun 21, 2021 · Big Data

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

Big Data · Data Lake · Flink
0 likes · 13 min read
Big Data Technology & Architecture
Jun 16, 2021 · Big Data

Practical Experience and Optimizations of Apache Iceberg in Tencent’s Big Data Ecosystem

This article reviews the advantages of Apache Iceberg for data lake storage, details Tencent’s custom optimizations and integration with Flink and Spark, and shares multiple real‑world implementations that demonstrate how Iceberg improves data consistency, reduces small‑file overhead, and enables near‑real‑time analytics in large‑scale big‑data environments.

Apache Iceberg · Data Lake · Flink
0 likes · 18 min read
dbaplus Community
Jun 5, 2021 · Big Data

How Flink + Iceberg Transform Data Lakes for Real‑Time Streaming

This article explains the concept of data lakes, outlines a four‑layer open‑source architecture, presents several classic Flink‑Iceberg use cases, details why Iceberg was chosen, and describes the design of Flink’s streaming sink and upcoming community roadmap.

Apache Flink · Apache Iceberg · Big Data
0 likes · 14 min read
Tencent Cloud Developer
May 26, 2021 · Big Data

Big Data Trends and Future Directions – Insights from the Techo TVP Developer Summit Roundtable

At the Techo TVP Developer Summit, leaders discussed how big‑data tools are evolving beyond perceived bottlenecks toward cloud‑native, specialized platforms and data lakes, emphasized open‑source collaboration, highlighted China’s capacity to spawn a Snowflake‑like service, and offered guidance on emerging real‑time, GPU‑accelerated analytics and multidisciplinary data‑career paths.

Data Lake · career advice · industry trends
0 likes · 24 min read
Tencent Cloud Developer
May 25, 2021 · Cloud Native

Next‑Generation Cloud‑Native Data Lake Architecture: Value, Principles, Challenges, and Tencent Solutions

The talk outlines a next‑generation cloud‑native data lake that leverages elastic Kubernetes compute, object‑storage, and Apache Iceberg to cut costs 3‑10× while boosting performance, and presents Tencent’s Data Lake Compute and Data Lake Fabric solutions that address scalability, reliability, and operational challenges through serverless, unified, multi‑engine architecture.

Cost Optimization · Data Lake · Iceberg
0 likes · 13 min read
Programmer DD
May 22, 2021 · Big Data

What Is a Data Lake? Origins, Architecture, and How It Powers Modern Big Data

This article explains the concept of a data lake—its origin in 2011, how it differs from traditional databases and data warehouses, its core characteristics such as raw data storage, on‑demand computing, and schema‑on‑read, as well as its advantages, challenges, architectural components, and future outlook within the big‑data ecosystem.

Big Data · Data Architecture · Data Governance
0 likes · 20 min read
DataFunTalk
Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi · Big Data · CDC
0 likes · 21 min read
Big Data Technology & Architecture
Mar 30, 2021 · Big Data

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how Soul's data engineering team replaced nightly batch ETL with real-time Delta Lake ingestion on EMR, detailing the motivations; a comparative analysis of Delta Lake, Hudi, and Iceberg; the implementation architecture; issues encountered, such as data skew and schema evolution; and the solutions adopted to improve performance and reliability.

Data Lake · Data Skew · Delta Lake
0 likes · 13 min read
Tencent Cloud Developer
Mar 29, 2021 · Cloud Native

How Tencent Cloud’s Native Data Lake Redefines Big Data Analytics

This article examines the evolution of data lakes, outlines the challenges enterprises face with massive, heterogeneous data, and details Tencent Cloud’s native data lake architecture and its serverless Data Lake Compute service, highlighting performance, cost‑efficiency, and future development directions.

Analytics · Cloud Native · Data Lake
0 likes · 10 min read
Big Data Technology & Architecture
Mar 23, 2021 · Big Data

Practical Implementations of Data Lakes: Huawei Production Scenario, Real-Time Financial Data Lake, and Soul's Delta Lake

This article presents a comprehensive overview of data lake implementations, detailing Huawei's production‑scene platform, a real‑time financial data lake architecture using Kafka, Flink and Iceberg, and Soul's Delta Lake practice with Spark, Hive, and custom ETL tools, highlighting design choices, processing flows, and operational considerations.

Data Lake · Delta Lake · Flink
0 likes · 8 min read