Tagged articles
343 articles
Page 2 of 4
DataFunSummit
DataFunSummit
Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI trainingApache IcebergBig Data
0 likes · 22 min read
Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions
DataFunTalk
DataFunTalk
Jun 18, 2024 · Big Data

Real-time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of the evolution from traditional Lambda‑based real‑time data warehouse solutions to a data‑lake‑integrated architecture, detailing the shortcomings of legacy designs, the iterative improvements made at JD Technology, and the technical and operational challenges encountered during implementation.

Data LakeLambda architectureStreaming
0 likes · 24 min read
Real-time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions
21CTO
21CTO
Jun 7, 2024 · Artificial Intelligence

10 Essential Tools for Building a Modern AI Data Lake Architecture

This article outlines ten critical components of a modern data lake reference architecture for AI/ML, detailing each function, the supporting vendor tools and open‑source libraries, and how they enable scalable storage, MLOps, distributed training, model hubs, vector search, and data visualization.

AIData LakeDistributed Training
0 likes · 14 min read
10 Essential Tools for Building a Modern AI Data Lake Architecture
StarRocks
StarRocks
May 22, 2024 · Big Data

Unlocking Data Lake Power: Iceberg Architecture & StarRocks Acceleration

Apache Iceberg offers a modern, ACID‑compliant table format for data lakes with features like hidden partitions and schema evolution, while StarRocks provides high‑performance query acceleration, metadata caching, and distributed planning to address Iceberg’s latency challenges, enabling seamless lake‑warehouse integration and real‑time analytics.

Apache IcebergData LakeMetadata Caching
0 likes · 19 min read
Unlocking Data Lake Power: Iceberg Architecture & StarRocks Acceleration
DataFunSummit
DataFunSummit
May 17, 2024 · Big Data

Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

Batch ProcessingBig DataData Lake
0 likes · 12 min read
Comprehensive Hudi Real-Time Data Lake Ingestion Solutions
DataFunTalk
DataFunTalk
May 16, 2024 · Big Data

Streaming Data Lake Warehouse Solution Based on USDP with Flink and Paimon

This article presents UCloud's USDP‑based streaming data lake warehouse solution that leverages Flink for real‑time processing and Paimon for lake storage, detailing its architecture, advantages, practical scenarios, and providing complete SQL and Flink CDC code snippets for end‑to‑end implementation.

CDCData LakeFlink
0 likes · 27 min read
Streaming Data Lake Warehouse Solution Based on USDP with Flink and Paimon
iQIYI Technical Product Team
iQIYI Technical Product Team
Apr 26, 2024 · Big Data

iQIYI Real-time Lakehouse: Stream‑Batch Unified Architecture

iQIYI replaced its costly Lambda architecture with a unified Iceberg‑based lakehouse that combines Flink streaming and batch processing, cutting data latency from hours to minutes, supporting thousands of tables via a multi‑table sink, guaranteeing completeness, and saving millions of RMB in operational costs.

Data LakeFlinkIceberg
0 likes · 18 min read
iQIYI Real-time Lakehouse: Stream‑Batch Unified Architecture
StarRocks
StarRocks
Apr 25, 2024 · Big Data

How StarRocks Beats Trino: 4.3× Faster Queries on Apache Paimon Lakehouse

This article explains how to build a high‑performance data‑lake analytics stack by combining StarRocks with Apache Paimon, covering direct queries, Data Cache acceleration, and asynchronous materialized views, and presents benchmark results that show StarRocks achieving up to 4.3× faster query speeds than Trino and significant latency reductions with caching and materialized views.

Apache PaimonData CacheData Lake
0 likes · 12 min read
How StarRocks Beats Trino: 4.3× Faster Queries on Apache Paimon Lakehouse
iQIYI Technical Product Team
iQIYI Technical Product Team
Mar 8, 2024 · Big Data

Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation

iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.

Data LakeHiveIceberg
0 likes · 20 min read
Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation
Xiaohongshu Tech REDtech
Xiaohongshu Tech REDtech
Mar 4, 2024 · Big Data

Integrating Data Lake Technologies with Data Warehouse Architecture at Xiaohongshu: Practices and Performance Optimizations

Xiaohongshu’s data‑warehouse team integrated Apache Iceberg‑based data‑lake techniques into its existing warehouse, replacing the legacy Hive/Spark stack with global sorting, Z‑order, and upsert‑enabled tables, which cut query latency by up to 90 %, boosted data freshness by 50 %, slashed storage costs by 83 % and saved tens of thousands of GB‑hours of compute daily.

Apache IcebergData LakeData Warehouse
0 likes · 19 min read
Integrating Data Lake Technologies with Data Warehouse Architecture at Xiaohongshu: Practices and Performance Optimizations
DataFunSummit
DataFunSummit
Feb 26, 2024 · Big Data

Building a New Lakehouse Analytics Paradigm with StarRocks and Paimon

This article introduces a new lakehouse analytics paradigm by combining StarRocks and Paimon, covering the evolution of data lake technologies, key integration scenarios, core technical mechanisms such as JNI connectors, materialized views, and future roadmap for enhanced lakehouse capabilities.

AnalyticsBig DataData Lake
0 likes · 16 min read
Building a New Lakehouse Analytics Paradigm with StarRocks and Paimon
StarRocks
StarRocks
Jan 10, 2024 · Big Data

How Tencent Built the ABetterChoice SaaS A/B Testing Platform for Global Games

In 2022 Tencent's A/B Test team created the overseas SaaS product ABetterChoice, abstracting internal experiment capabilities, adapting to multi‑cloud compliance, and unifying computation with StarRocks, enabling game titles like Honor of Kings, PUBG Mobile, and Ubisoft to run scalable, compliant A/B experiments worldwide.

A/B testingData LakeExperiment Platform
0 likes · 14 min read
How Tencent Built the ABetterChoice SaaS A/B Testing Platform for Global Games
Architects Research Society
Architects Research Society
Jan 2, 2024 · Big Data

Understanding Data Lakes: Concepts, Benefits, Challenges, and Comparison with Data Warehouses

This article explains what a data lake is, its origins, key characteristics such as collecting all data, enabling diverse user access, and flexible processing, compares it with traditional data warehouses, discusses cost advantages, potential pitfalls like data swamps, and outlines best‑practice considerations for enterprise adoption.

AnalyticsData ArchitectureData Lake
0 likes · 10 min read
Understanding Data Lakes: Concepts, Benefits, Challenges, and Comparison with Data Warehouses
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 8, 2023 · Big Data

Comprehensive Guide to Apache Paimon and Advanced Flink Integration

This article provides an in‑depth overview of Apache Paimon as a streaming lakehouse, explains its core features, file layout, consistency guarantees, and offers detailed guidance on integrating and tuning Paimon with Apache Flink for both write and read performance, multi‑writer concurrency, table management, and bucket rescaling.

Apache PaimonBig DataData Lake
0 likes · 23 min read
Comprehensive Guide to Apache Paimon and Advanced Flink Integration
DataFunTalk
DataFunTalk
Nov 30, 2023 · Big Data

Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference

The 2023 Yunqi Conference in Hangzhou showcased the latest advances in cloud computing and big‑data technologies, examined the evolution from big‑data 1.0 to 3.0, discussed the key difficulties of making big data cloud‑native, and presented a practical case study of MiHoYo’s cloud‑native transformation.

Alibaba CloudBig DataCloud Native
0 likes · 12 min read
Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference
Big Data Technology Architecture
Big Data Technology Architecture
Nov 29, 2023 · Big Data

Building Real-Time Wide Tables with Partial-Update Using Apache Paimon for NetEase News Recommendation

The article describes how NetEase News' recommendation team replaced a slow, batch‑oriented data‑warehouse pipeline with a Flink‑based, Apache Paimon real‑time wide‑table solution that supports partial updates, reduces latency from hours to minutes, and lowers processing costs while handling both deduplication and non‑deduplication recommendation scenarios.

Apache PaimonData LakeFlink
0 likes · 8 min read
Building Real-Time Wide Tables with Partial-Update Using Apache Paimon for NetEase News Recommendation
Big Data Technology & Architecture
Big Data Technology & Architecture
Nov 28, 2023 · Big Data

Apache Paimon for CDC: Low‑Cost, Low‑Latency Data Lake Ingestion and Performance Comparison with Hive and Hudi

This article explains how Apache Paimon simplifies CDC data lake ingestion with one‑click, low‑cost, low‑latency pipelines, details its architecture and tag‑based Hive compatibility, provides best‑practice configurations, and presents benchmark results showing Paimon outperforming Hive and Hudi in both write and query performance.

Apache PaimonCDCData Lake
0 likes · 14 min read
Apache Paimon for CDC: Low‑Cost, Low‑Latency Data Lake Ingestion and Performance Comparison with Hive and Hudi
Architects Research Society
Architects Research Society
Nov 26, 2023 · Big Data

Data Lake vs Data Warehouse: Key Differences and How to Choose

Data lakes and data warehouses serve different purposes in big‑data architectures; this article explains their definitions, core attributes, five major distinctions—including data retention, type support, user coverage, adaptability, and insight speed—and offers guidance on selecting or combining the two approaches.

AnalyticsData ArchitectureData Lake
0 likes · 12 min read
Data Lake vs Data Warehouse: Key Differences and How to Choose
StarRocks
StarRocks
Nov 22, 2023 · Big Data

How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance

This article details a Chinese tech company's migration of its internal big‑data analytics platform to StarRocks’ compute‑storage separation architecture, describing the original multi‑component setup, the pain points encountered, the evaluation methodology, performance and cost benchmarks, operational optimizations, migration steps, and future roadmap.

Big DataCompute-Storage SeparationCost reduction
0 likes · 17 min read
How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance
dbaplus Community
dbaplus Community
Nov 8, 2023 · Big Data

Choosing Between Data Warehouse, Data Lake, and Lakehouse: When to Use Each

This article compares traditional data warehouses, modern data lakes, and emerging lakehouse architectures, explaining their design patterns, advantages, disadvantages, and suitable use cases, while detailing implementation considerations such as schema design, ETL/ELT processes, file formats like Delta, Iceberg, and Hudi, and factors influencing platform selection.

Apache SparkData LakeData Warehouse
0 likes · 20 min read
Choosing Between Data Warehouse, Data Lake, and Lakehouse: When to Use Each
DataFunTalk
DataFunTalk
Oct 28, 2023 · Big Data

Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices

This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.

Big DataData LakeFlink
0 likes · 14 min read
Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices
DataFunSummit
DataFunSummit
Oct 18, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, outlines the shortcomings of its previous Lambda architecture, describes the adoption of Apache Hudi for unified batch‑stream processing, and details the five major technical challenges and the corresponding solutions implemented to improve performance, consistency, and operational reliability.

Apache HudiBig DataData Architecture
0 likes · 17 min read
Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions
Data Thinking Notes
Data Thinking Notes
Oct 11, 2023 · Big Data

How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy

ByteDance revamped its traditional Lambda architecture for e‑commerce traffic data by introducing a new lake ingestion solution that reduces development and operational costs, ensures timely and stable data, and outlines future plans covering business background, ODS lake design, archiving tags, delayed data handling, and real‑time stability.

Big DataData LakeFlink
0 likes · 7 min read
How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy
Sohu Tech Products
Sohu Tech Products
Oct 11, 2023 · Industry Insights

How StarRocks Materialized Views Power Real‑Time Lakehouse Analytics

The article provides a deep technical overview of StarRocks 3.0’s data‑lake analysis capabilities, its unified Lakehouse architecture, Catalog integration, Trino compatibility, extensive I/O optimizations, materialized view features, resource isolation techniques, real‑world use cases, and future development directions.

AnalyticsData LakeLakehouse
0 likes · 22 min read
How StarRocks Materialized Views Power Real‑Time Lakehouse Analytics
DataFunSummit
DataFunSummit
Oct 1, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation introduces Iceberg's core capabilities, details Xiaomi's practical applications—including log ingestion, near‑real‑time warehousing, offline challenges, column‑level encryption, and Hive migration—and outlines future development directions such as materialized views and cloud migration, providing a comprehensive view of modern data‑lake engineering.

Big DataData LakeFlink
0 likes · 22 min read
Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans
Architects Research Society
Architects Research Society
Sep 26, 2023 · Big Data

From a Single Data Lake to a Decentralized Data Mesh: A Step‑by‑Step Migration Guide

This article explains why traditional centralized data lakes hinder modern software development, introduces the data‑mesh concept as a decentralized alternative, and walks through an e‑commerce microservice example with concrete steps, data‑API designs, and migration tactics to transition from a monolithic lake to a distributed data mesh.

Data LakeData MeshData Platform
0 likes · 22 min read
From a Single Data Lake to a Decentralized Data Mesh: A Step‑by‑Step Migration Guide
iQIYI Technical Product Team
iQIYI Technical Product Team
Sep 22, 2023 · Big Data

Data Lake: Concepts, Architecture, and Application in iQIYI's Data Platform

iQIYI’s data‑middle‑platform team built a four‑zone data lake—raw, product, work, and sensitive—integrated with unified ODS/DWD/MID layers, a metadata catalog, and self‑service tools, leveraging HDFS, Hive/Iceberg, Spark/Trino, and Flink, migrated to Apache Iceberg for real‑time freshness, and now aims to further streamline modules and adopt new technologies.

Apache IcebergData GovernanceData Lake
0 likes · 13 min read
Data Lake: Concepts, Architecture, and Application in iQIYI's Data Platform
Didi Tech
Didi Tech
Sep 19, 2023 · Cloud Native

OrangeFS: A Cloud‑Native Multi‑Protocol Distributed Data Lake Storage System

OrangeFS is Didi’s cloud‑native, multi‑protocol distributed data‑lake storage system that unifies POSIX, S3 and HDFS access on a single logical hierarchy, integrates with Kubernetes via a CSI plugin, supports on‑premise and public‑cloud backends, provides multi‑tenant isolation, and dramatically improves elasticity, utilization and latency for petabyte‑scale workloads such as ride‑hailing logs, machine‑learning training, finance and analytics.

CSICloud Native StorageData Lake
0 likes · 17 min read
OrangeFS: A Cloud‑Native Multi‑Protocol Distributed Data Lake Storage System
DataFunTalk
DataFunTalk
Sep 16, 2023 · Big Data

StarRocks Data Lake Analysis, Materialized Views, and Lakehouse Architecture

This article explains how StarRocks 3.0 extends real‑time data‑warehouse capabilities to support data‑lake analysis, external catalog integration, Trino compatibility, extensive I/O optimizations, and powerful materialized‑view features that together enable a unified, cloud‑native Lakehouse solution with high performance and flexible resource isolation.

Big DataData LakeLakehouse
0 likes · 20 min read
StarRocks Data Lake Analysis, Materialized Views, and Lakehouse Architecture
iQIYI Technical Product Team
iQIYI Technical Product Team
Sep 15, 2023 · Big Data

Apache Spark at iQIYI: Current Status and Optimization

iQIYI now relies on Apache Spark as its main offline engine, processing over 200 000 daily tasks for ETL, data synchronization and analytics, while recent optimizations—dynamic resource allocation, adaptive query execution, compression, rebalance, Z‑order and resource‑governance—have cut compute usage by ~27 %, storage by up to 76 % and improved query speed, completing a large‑scale migration from Hive and paving the way for Spark 3.4 and Iceberg support.

Apache SparkData LakePerformance Optimization
0 likes · 21 min read
Apache Spark at iQIYI: Current Status and Optimization
DataFunTalk
DataFunTalk
Aug 20, 2023 · Databases

Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.

AnalyticDBBig DataData Lake
0 likes · 19 min read
Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark
StarRocks
StarRocks
Aug 9, 2023 · Databases

StarRocks 3.1 Highlights: Faster Lakehouse Analytics and Advanced Materialized Views

StarRocks 3.1 introduces a cloud‑native, lakehouse‑oriented architecture with enhanced storage‑compute separation, up to 3‑6× faster data‑lake queries than Trino/Presto, expanded Iceberg and Paimon support, richer materialized view capabilities, new random bucketing, expression partitioning, generated columns, and spill‑to‑disk stability, all backed by extensive performance optimizations and open‑source contributions.

Data LakeLakehouseMaterialized Views
0 likes · 17 min read
StarRocks 3.1 Highlights: Faster Lakehouse Analytics and Advanced Materialized Views
Baidu Geek Talk
Baidu Geek Talk
Aug 7, 2023 · Artificial Intelligence

Storage Acceleration Solutions for Large AI Model Workflows

To tackle the massive data, high‑throughput and low‑latency demands of large‑model training and inference, the talk proposes a unified data‑lake built on scalable object storage combined with an acceleration layer—either a parallel file system or cloud‑native RapidFS cache—demonstrating multi‑fold training speedups, faster checkpoint uploads, and linear inference scaling.

AI Model StorageAccelerated FilesystemData Lake
0 likes · 18 min read
Storage Acceleration Solutions for Large AI Model Workflows
DataFunSummit
DataFunSummit
Jul 18, 2023 · Databases

Apache Doris Data Lake Federation Features Overview

This article introduces Apache Doris’s data lake federation capabilities, detailing its lake‑warehouse integration design, supported data sources such as Hive, Iceberg, Hudi, and Elasticsearch, performance optimizations for metadata and file access, case studies, community roadmap, and Q&A on replacing Presto.

Apache DorisData LakeSQL Engine
0 likes · 21 min read
Apache Doris Data Lake Federation Features Overview
Data Thinking Notes
Data Thinking Notes
Jul 12, 2023 · Fundamentals

Why Metadata Governance Is the Backbone of Modern Data Platforms

This article explains how metadata serves as essential infrastructure for data platforms, detailing Huawei's classification framework, governance challenges, management architecture, integrated modeling, data lake handling, service management, and data map construction to bridge business and IT domains.

Data GovernanceData LakeData Management
0 likes · 24 min read
Why Metadata Governance Is the Backbone of Modern Data Platforms
DataFunTalk
DataFunTalk
Jul 10, 2023 · Big Data

Practical Experience of In‑Lake Warehouse Implementation Based on Lakehouse Architecture

This article presents a comprehensive overview of Lakehouse‑based in‑lake warehousing, covering common data‑lake misconceptions, the evolution from databases to data warehouses and lakes, the advantages of Lakehouse over traditional architectures, a reference multi‑layer architecture, typical use cases, challenges, future plans, and a brief Q&A.

Big Data ArchitectureData LakeData Warehouse
0 likes · 20 min read
Practical Experience of In‑Lake Warehouse Implementation Based on Lakehouse Architecture
DataFunTalk
DataFunTalk
Jun 29, 2023 · Big Data

Practical Deployment of Delta Lake in BI and AI Products

This article summarizes a technical presentation on how Delta Lake is integrated into a BI+AI platform, covering the product background, data‑lake architecture, Delta Lake features such as ACID transactions, schema management, multi‑engine support, performance optimizations, and future development directions.

AIBIBig Data
0 likes · 12 min read
Practical Deployment of Delta Lake in BI and AI Products
DataFunTalk
DataFunTalk
Jun 26, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation details Iceberg's core capabilities—transactional writes, schema evolution, implicit partitioning, and row‑level updates—while showcasing Xiaomi's real‑world applications such as log ingestion redesign, near‑real‑time warehousing, offline optimizations, column‑level encryption, Hive migration strategies, and outlining upcoming enhancements like materialized views and cloud migration.

Big DataColumn EncryptionData Lake
0 likes · 20 min read
Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans
DataFunTalk
DataFunTalk
Jun 24, 2023 · Big Data

Design and Architecture of MaxCompute Lakehouse Near‑Real‑Time Incremental Processing

This article explains the evolution of Alibaba Cloud's MaxCompute platform into a lakehouse architecture that supports near‑real‑time incremental processing, detailing its development history, core design of transactional tables, five‑module technical stack, data ingestion methods, optimization services, transaction management, query capabilities, ecosystem integration, practical applications, future roadmap, and common user questions.

Big DataData LakeIncremental Processing
0 likes · 24 min read
Design and Architecture of MaxCompute Lakehouse Near‑Real‑Time Incremental Processing
Data Thinking Notes
Data Thinking Notes
Jun 18, 2023 · Big Data

Data Lake vs Data Warehouse: Uncover the Real Differences

This article explores the evolving concept of data lakes, compares them with traditional data warehouses across storage, modeling, tooling, and user roles, and examines the emerging lake‑warehouse integration, highlighting why both remain essential in modern big‑data architectures.

Big DataData ArchitectureData Lake
0 likes · 12 min read
Data Lake vs Data Warehouse: Uncover the Real Differences
Big Data Technology & Architecture
Big Data Technology & Architecture
Jun 13, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of Iceberg for its data lake, covering the OLAP architecture, reasons for a data lake, Iceberg's table format advantages over Hive, platform construction, streaming ingestion, query and performance optimizations, real‑world business deployments, and future plans.

Big DataData LakeFlink
0 likes · 21 min read
Iceberg Data Lake Implementation and Optimization at iQIYI
DataFunSummit
DataFunSummit
Jun 10, 2023 · Big Data

Performance Optimization of Iceberg Real‑time Data Warehouse and Arctic Enhancements

This article presents a comprehensive overview of Iceberg MOR principles, Arctic‑based performance optimizations, benchmark evaluations using CH‑benchmark, and future roadmap items, highlighting how various file‑type strategies, self‑optimizing mechanisms, and task balancing improve real‑time data lake query efficiency.

ArcticData LakeIceberg
0 likes · 14 min read
Performance Optimization of Iceberg Real‑time Data Warehouse and Arctic Enhancements
Data Thinking Notes
Data Thinking Notes
Jun 4, 2023 · Big Data

How Distributed Lakehouse Architecture Solves Data Swamp Challenges

This article examines the explosion of heterogeneous data sources, the limitations of traditional data lakes and warehouses, and proposes a distributed lakehouse architecture that integrates advanced management layers to improve data governance, reliability, and support both SQL and advanced analytics workloads.

Data GovernanceData LakeData Warehouse
0 likes · 29 min read
How Distributed Lakehouse Architecture Solves Data Swamp Challenges
DataFunTalk
DataFunTalk
Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, reasons for a lake, Iceberg table format advantages over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big DataData LakeFlink
0 likes · 18 min read
Iceberg Data Lake Implementation and Optimization at iQIYI
DataFunSummit
DataFunSummit
May 28, 2023 · Big Data

Apache Hudi: Capabilities, Architecture, Use Cases, and Future Outlook

This article introduces Apache Hudi as a next‑generation streaming data‑lake platform, explains its core concepts, architecture, and table types, and showcases real‑world use cases at Tencent such as CDC ingestion, minute‑level real‑time warehousing, streaming analytics, multi‑stream joins, ad attribution, and stream‑to‑batch processing, while also outlining future directions.

Apache HudiCDCData Lake
0 likes · 16 min read
Apache Hudi: Capabilities, Architecture, Use Cases, and Future Outlook
DataFunTalk
DataFunTalk
May 22, 2023 · Big Data

Alibaba Cloud Data Lake: Unified Metadata and Storage Management Practices

This article explains Alibaba Cloud's data lake architecture, unified metadata services, storage management optimizations, and format handling techniques, illustrating how lakehouse concepts, multi‑engine support, and lifecycle policies enable efficient, secure, and cost‑effective big data processing in the cloud.

Big DataCloud ServicesData Lake
0 likes · 22 min read
Alibaba Cloud Data Lake: Unified Metadata and Storage Management Practices
DataFunTalk
DataFunTalk
May 11, 2023 · Big Data

Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap

This article describes how ByteDance tackled petabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as lake‑cold‑layering and materialized views.

Apache IcebergBig DataData Lake
0 likes · 16 min read
Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap
dbaplus Community
dbaplus Community
May 9, 2023 · Big Data

How a Bank Built a Near‑Real‑Time Data Platform with Kafka, Flink & Hudi

An in‑depth case study of a Chinese bank’s near‑real‑time data platform reveals its evolution from a monolithic CDC pipeline to a split architecture featuring a real‑time data lake and a data‑service bus, detailing component choices, schema‑registry integration, SDK development, observability, and future roadmap.

Big Data ArchitectureData LakeFlink
0 likes · 18 min read
How a Bank Built a Near‑Real‑Time Data Platform with Kafka, Flink & Hudi
DataFunTalk
DataFunTalk
May 6, 2023 · Databases

Apache Doris: Overview, Data Lake Analysis Architecture, Community Development and Future Roadmap

This article provides a comprehensive overview of Apache Doris, detailing its origins, MPP‑based analytical capabilities, data‑lake integration techniques, recent architectural enhancements, performance optimizations, community growth, and upcoming development plans, while also addressing common user questions.

Analytical DatabaseApache DorisBig Data
0 likes · 20 min read
Apache Doris: Overview, Data Lake Analysis Architecture, Community Development and Future Roadmap
DataFunTalk
DataFunTalk
Apr 13, 2023 · Big Data

Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0

This article explains why lake‑warehouse integration is needed, outlines its challenges, describes StarRocks' four integration paradigms—including query acceleration, layered modeling, real‑time warehouse‑lake fusion, and the cloud‑native 3.0 solution—and previews the upcoming StarRocks 3.0 release.

Big DataCloud NativeData Lake
0 likes · 18 min read
Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0
Data Thinking Notes
Data Thinking Notes
Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big DataData GovernanceData Lake
0 likes · 13 min read
Mastering Data Governance: From Challenges to End‑to‑End Solutions
Big Data Technology & Architecture
Big Data Technology & Architecture
Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache PaimonBig DataData Lake
0 likes · 7 min read
Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Mar 13, 2023 · Big Data

Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

AnalyticsBig DataCloud Native
0 likes · 11 min read
Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution
DataFunSummit
DataFunSummit
Feb 28, 2023 · Big Data

Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans

This article introduces the Iceberg table format, explains its core architecture and advantages such as transactionality, implicit partitioning and row‑level updates, details Xiaomi's practical deployments—including CDC pipelines, partition strategies, compaction services, and stream‑batch integration—and outlines future development directions.

Data LakeFlinkIceberg
0 likes · 20 min read
Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans

How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration

This article details how NetEase Yanxuan transformed its data platform from a dual Lambda architecture to a unified batch‑stream solution built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation of stream‑batch pipelines, message ordering fixes, snapshot generation, and extensive table‑governance optimizations.

Apache FlinkApache SparkBatch-Stream Integration
0 likes · 14 min read
How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration
DataFunTalk
DataFunTalk
Feb 25, 2023 · Big Data

T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices

This article details T3 Travel’s exploration of the Modern Data Stack, describing its four‑point overview, business scenarios, the initial MDS implementation using Apache Hudi and Kyuubi, and the design of a feature platform that integrates Metricflow, Feast, and other components to support data processing, analytics, and machine‑learning workflows.

Apache HudiBig DataData Lake
0 likes · 22 min read
T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices
DataFunTalk
DataFunTalk
Feb 20, 2023 · Big Data

Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation

This article explains the definition of data lakes (public‑cloud and non‑public‑cloud), outlines their key characteristics, presents three typical business scenarios—real‑time event analysis, change‑data analysis, and stream‑batch integration—summarizes required product features, evaluates open‑source lake formats, and details iQIYI's adoption of Apache Iceberg across multiple services to achieve low‑latency, large‑scale, cost‑effective analytics.

Big DataData LakeIceberg
0 likes · 23 min read
Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Feb 8, 2023 · Big Data

How Alibaba Cloud EMR 2.0 Redefines Open‑Source Big Data Platforms

This article summarizes Alibaba Cloud senior product expert He Yuan's presentation on EMR 2.0, outlining the challenges of open‑source big data, the evolution of EMR, and the new features—including cloud‑native architecture, enhanced performance, diverse resource models, and expanded analysis scenarios—aimed at reducing cost and complexity.

Alibaba CloudBig DataCloud Native
0 likes · 11 min read
How Alibaba Cloud EMR 2.0 Redefines Open‑Source Big Data Platforms
Big Data Technology & Architecture
Big Data Technology & Architecture
Feb 6, 2023 · Big Data

Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big DataData LakeFlink
0 likes · 17 min read
Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations
iQIYI Technical Product Team
iQIYI Technical Product Team
Feb 3, 2023 · Big Data

Data Lake Concepts, Benefits, and Iceberg‑Based Implementations at iQIYI

iQIYI’s data lake combines public‑cloud and private storage with Apache Iceberg’s snapshot‑based table format to enable near‑real‑time, unified batch‑and‑stream analytics, reducing costs, simplifying architecture, and improving data freshness across use cases such as log collection, audit, pingback, and member order processing.

Apache IcebergData ArchitectureData Lake
0 likes · 25 min read
Data Lake Concepts, Benefits, and Iceberg‑Based Implementations at iQIYI
DataFunTalk
DataFunTalk
Jan 28, 2023 · Big Data

Data Lake vs Data Warehouse: Differences, Evolution, and Integrated Lakehouse Design

This article explores the ongoing debate between data lakes and data warehouses, clarifies their distinct purposes and technologies, discusses how they can coexist or complement each other, and introduces the concept of an integrated lakehouse architecture while promoting a comprehensive data intelligence knowledge map.

Big DataData LakeData Warehouse
0 likes · 5 min read
Data Lake vs Data Warehouse: Differences, Evolution, and Integrated Lakehouse Design
DataFunSummit
DataFunSummit
Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article presents a comprehensive overview of Iceberg's adoption in Huawei Terminal Cloud, covering its architectural overview, key features such as Git‑style data management, real‑time processing, acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.

Big DataData LakeFlink
0 likes · 15 min read
Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans
Data Thinking Notes
Data Thinking Notes
Jan 5, 2023 · Big Data

Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive

This comprehensive guide explains the evolution from traditional data warehouses to modern data lakes, detailing concepts, architectures, differences, implementation steps, and real‑world case studies, while also comparing major cloud providers' solutions and highlighting how data platforms support digital transformation and analytics.

AnalyticsBig DataData Lake
0 likes · 97 min read
Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive
DataFunTalk
DataFunTalk
Dec 27, 2022 · Big Data

Multi‑Stream Join and Concurrency Control in Apache Hudi: Design, Implementation, and Usage

This article presents a comprehensive solution for multi‑stream joins in Apache Hudi, detailing the challenges of dimension and multi‑stream joins, the novel storage‑layer join approach, timeline‑based concurrency control, marker mechanisms, early conflict detection, payload customization, and practical usage with Flink and Spark, along with performance benefits and future directions.

Apache HudiData LakeFlink
0 likes · 31 min read
Multi‑Stream Join and Concurrency Control in Apache Hudi: Design, Implementation, and Usage
Tencent Advertising Technology
Tencent Advertising Technology
Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big DataData LakeFlink
0 likes · 20 min read
Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink
Data Thinking Notes
Data Thinking Notes
Dec 23, 2022 · Big Data

How Real-Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices

This article explains why real‑time data warehouses are becoming essential, outlines their goals, compares them with traditional offline warehouses, and presents detailed design patterns, naming conventions, and case studies from Didi, Kuaishou, Tencent, Youzan and other enterprises, highlighting challenges and solutions for streaming, storage, and query layers.

Big Data ArchitectureData LakeETL
0 likes · 49 min read
How Real-Time Data Warehouses Power Modern Business: Architecture, Cases, and Best Practices
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch ProcessingBig DataData Lake
0 likes · 13 min read
Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans
DataFunTalk
DataFunTalk
Dec 8, 2022 · Big Data

Arctic: NetEase’s Real-Time Lakehouse System Built on Apache Iceberg

This article introduces NetEase’s Arctic, a real‑time lakehouse system built on Apache Iceberg that unifies streaming and batch processing, explains the challenges of Lambda architecture, details Arctic’s features such as change/base stores, hidden queue, transaction handling, and shares internal practice cases and future roadmap.

Apache IcebergArcticData Lake
0 likes · 12 min read
Arctic: NetEase’s Real-Time Lakehouse System Built on Apache Iceberg
StarRocks
StarRocks
Dec 1, 2022 · Big Data

How Alibaba Cloud EMR StarRocks Supercharges Data Lake Analytics with Advanced Optimizations

This article explains how Alibaba Cloud EMR StarRocks extends data lake analytics to support Hive, Iceberg, and Hudi, detailing its architecture, Iceberg integration, performance gains over Trino, IO merging, lazy materialization, intelligent caching, and elastic compute capabilities for faster, unified, and cost‑effective queries.

Data LakeEMRElastic Compute
0 likes · 16 min read
How Alibaba Cloud EMR StarRocks Supercharges Data Lake Analytics with Advanced Optimizations
DataFunSummit
DataFunSummit
Nov 23, 2022 · Big Data

Lakehouse Analysis Service (LAS): Architecture, Challenges, and Service Design

The article introduces the Lakehouse Analysis Service (LAS), explains its layered architecture that unifies data lake and warehouse capabilities, discusses challenges with Apache Hudi metadata and consistency, and details the design of the unified MetaServer, Table Management Service, concurrency control, async compaction, event bus, and future roadmap.

Apache HudiData Lake
0 likes · 18 min read
Lakehouse Analysis Service (LAS): Architecture, Challenges, and Service Design
ITPUB
ITPUB
Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch ProcessingBig DataData Lake
0 likes · 23 min read
How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes
ByteDance Data Platform
ByteDance Data Platform
Nov 16, 2022 · Big Data

How ByteDance’s Data Lake Powers Near‑Real‑Time E‑Commerce Analytics

This article explains ByteDance’s data lake technology, its Apache Hudi‑based features, near‑real‑time architecture, and practical e‑commerce use cases such as marketing promotion, traffic diagnosis, logistics monitoring, risk governance, and operational monitoring, while outlining future challenges and plans.

Apache HudiBig Data ArchitectureData Lake
0 likes · 15 min read
How ByteDance’s Data Lake Powers Near‑Real‑Time E‑Commerce Analytics