Tagged articles

Data Lake

356 articles · Page 2 of 4

Sep 4, 2024 · Artificial Intelligence

Data+AI Data Lake Technologies: Challenges, Apache Iceberg Overview, and Vector Table Implementations with PyIceberg

This article explores the evolution of data lakes for AI, discusses the challenges of AI-era data management, introduces Apache Iceberg and its architecture, demonstrates PyIceberg-based AI training and inference pipelines, and presents vector table designs with LSH indexing and performance optimizations.

AI · Apache Iceberg · Big Data

0 likes · 22 min read

DataFunSummit

Aug 30, 2024 · Big Data

Kuaishou's Data Lake Journey with Apache Hudi: Architecture Evolution, Use Cases, and Lessons Learned

The article details Kuaishou's adoption of a data lake powered by Apache Hudi, covering the challenges of growing data warehouses, the migration from Hive to Hudi, concrete business case studies, promotion strategies, and key takeaways for large‑scale data engineering.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read

AsiaInfo Technology: New Tech Exploration

Aug 12, 2024 · Big Data

How Hudi MetaServer Transforms Metadata Management and Performance in Data Lakes

This article examines the challenges of Hudi metadata stored on HDFS, introduces the independently developed Hudi MetaServer for centralized metadata, visual management, unified permission control, TTL, expression payloads, and multi‑active scaling, and outlines future enhancements such as LLS, multi‑table fusion, and JDBC support.

Big Data · Data Lake · Hudi

0 likes · 11 min read

DataFunSummit

Aug 4, 2024 · Big Data

Apache Hudi from Zero to One: Comprehensive Guide to Write Indexing (Part 4)

This article explains Apache Hudi’s write‑side indexing, detailing the indexing API, various index types—including simple, Bloom, bucket, HBase, and record‑level indexes—and their mechanisms, helping readers understand how Hudi validates record existence and optimizes updates and deletions.

Apache Hudi · Big Data · Data Lake

0 likes · 9 min read

DataFunSummit

Aug 3, 2024 · Big Data

Apache Hudi Write Process: From Zero to One – Part 3 (Understanding Write Flow and Operations)

This article explains the complete Apache Hudi write pipeline, detailing each step from client creation to commit, and describes the various write operations such as Upsert, Insert, Bulk Insert, Delete, Delete Partition, and Insert‑Overwrite, providing a comprehensive overview for data‑lake practitioners.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read

DataFunSummit

Jul 31, 2024 · Big Data

Tencent Big Data Processing Suite and Gravitino: Unified Metadata and Permission Management

This article introduces Tencent's Big Data Processing Suite (TBDS) and the open‑source Gravitino project, explaining how they provide a unified metadata service and a comprehensive, extensible permission model to address data and permission islands across heterogeneous Hadoop and MPP ecosystems.

Big Data · Data Lake · Gravitino

0 likes · 12 min read

DataFunSummit

Jul 12, 2024 · Big Data

Data Lake Development Trends, Architecture, Integration, Lakehouse Core Capabilities, and Open Design

This article examines the current evolution of data lakes, detailing their overall architecture, batch and real‑time integration methods, Lakehouse core functionalities such as enhanced DML, schema evolution, ACID support, and open‑design principles that enable multi‑cloud deployment and seamless interaction with diverse compute engines.

Batch Processing · Big Data Architecture · Data Lake

0 likes · 12 min read

Sohu Tech Products

Jul 10, 2024 · Industry Insights

How StarRocks and Apache Paimon Transform Data Lake Analytics and Migration

This article provides a practical deep‑dive into StarRocks and Apache Paimon, covering data‑lake fundamentals, the technical advantages of both platforms, performance gains over traditional engines, step‑by‑step migration strategies, deployment options on Alibaba Cloud EMR, and future roadmap plans.

Apache Paimon · Data Lake · Query Optimization

0 likes · 15 min read

DataFunTalk

Jul 6, 2024 · Big Data

StarRocks and Paimon Data Lake Capabilities, Migration Solutions, and Future Roadmap

This article presents a practical overview of StarRocks and Apache Paimon data‑lake capabilities, explains their performance advantages, details migration strategies from Trino/Presto and other engines, describes cluster‑to‑cluster migration, and outlines the future roadmap for integration and optimization.

Big Data · Cloud Computing · Data Lake

0 likes · 13 min read

DataFunSummit

Jul 6, 2024 · Artificial Intelligence

Highlights of DataFunCon 2024 Beijing: Big Data, AI, and Large‑Model Trends

The two‑day DataFunCon 2024 Beijing conference gathered hundreds of big‑data and AI experts to discuss the evolution from data lakes to lake‑warehouses, large‑model development, practical applications, and future strategies for enterprises, while showcasing partner exhibitions and a vibrant community spirit.

Big Data · China · Data Lake

0 likes · 9 min read

DataFunSummit

Jun 28, 2024 · Big Data

Apache Hudi from Zero to One – Part 2: Reading Process and Query Types (Spark Example)

This article explains how Apache Hudi integrates with Spark to read data, detailing the Spark‑SQL planning stages, the Spark‑Hudi read workflow, and the four main Hudi query types—snapshot, read‑optimized, time‑travel, and incremental—along with example SQL commands and code snippets.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Jun 25, 2024 · Big Data

Build Real-Time Data Lake Analytics with Flink, Paimon, and EMR Serverless Spark

This guide demonstrates how to use Alibaba Cloud's EMR Serverless Spark and Flink Serverless services together with Apache Paimon to ingest streaming data, perform interactive queries, and schedule offline compaction jobs, creating a unified real‑time and batch data lake solution.

Big Data · Data Lake · EMR Serverless

0 likes · 6 min read

DataFunSummit

Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data

0 likes · 22 min read

DataFunSummit

Jun 19, 2024 · Big Data

Apache Hudi from Zero to One: Introduction to Hudi’s Storage Format (Part 1)

This article introduces Apache Hudi’s storage format, explaining the table layout, metadata and data file organization, the naming conventions of timeline actions, and the trade‑offs between Copy‑on‑Write and Merge‑on‑Read table types for transactional data lakes.

Apache Hudi · Big Data · Data Lake

0 likes · 8 min read

DataFunTalk

Jun 18, 2024 · Big Data

Real-time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of the evolution from traditional Lambda‑based real‑time data warehouse solutions to a data‑lake‑integrated architecture, detailing the shortcomings of legacy designs, the iterative improvements made at JD Technology, and the technical and operational challenges encountered during implementation.

Data Lake · Lambda architecture · Real-Time Data Warehouse

0 likes · 24 min read

21CTO

Jun 7, 2024 · Artificial Intelligence

10 Essential Tools for Building a Modern AI Data Lake Architecture

This article outlines ten critical components of a modern data lake reference architecture for AI/ML, detailing each function, the supporting vendor tools and open‑source libraries, and how they enable scalable storage, MLOps, distributed training, model hubs, vector search, and data visualization.

AI · Data Lake · MLOps

0 likes · 14 min read

StarRocks

May 22, 2024 · Big Data

Unlocking Data Lake Power: Iceberg Architecture & StarRocks Acceleration

Apache Iceberg offers a modern, ACID‑compliant table format for data lakes with features like hidden partitions and schema evolution, while StarRocks provides high‑performance query acceleration, metadata caching, and distributed planning to address Iceberg’s latency challenges, enabling seamless lake‑warehouse integration and real‑time analytics.

Apache Iceberg · Data Lake · Metadata Caching

0 likes · 19 min read

DataFunSummit

May 17, 2024 · Big Data

Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

Batch Processing · Big Data · Data Lake

0 likes · 12 min read

DataFunTalk

May 16, 2024 · Big Data

Streaming Data Lake Warehouse Solution Based on USDP with Flink and Paimon

This article presents UCloud's USDP‑based streaming data lake warehouse solution that leverages Flink for real‑time processing and Paimon for lake storage, detailing its architecture, advantages, practical scenarios, and providing complete SQL and Flink CDC code snippets for end‑to‑end implementation.

CDC · Data Lake · Flink

0 likes · 27 min read

DataFunSummit

Apr 27, 2024 · Big Data

Delta Lake 3.1: New Features, Metadata Optimization, and Universal Format Overview

This article introduces Delta Lake 3.1, detailing its release background, the addition of Deletion Vector to Update and Merge commands, metadata‑driven count/min/max optimizations, the Universal Format for cross‑engine compatibility, and a comparative evaluation with Iceberg and Hudi.

Big Data · Data Lake · Deletion Vector

0 likes · 8 min read

iQIYI Technical Product Team

Apr 26, 2024 · Big Data

iQIYI Real-time Lakehouse: Stream‑Batch Unified Architecture

iQIYI replaced its costly Lambda architecture with a unified Iceberg‑based lakehouse that combines Flink streaming and batch processing, cutting data latency from hours to minutes, supporting thousands of tables via a multi‑table sink, guaranteeing completeness, and saving millions of RMB in operational costs.

Data Lake · Flink · Iceberg

0 likes · 18 min read

StarRocks

Apr 25, 2024 · Big Data

How StarRocks Beats Trino: 4.3× Faster Queries on Apache Paimon Lakehouse

This article explains how to build a high‑performance data‑lake analytics stack by combining StarRocks with Apache Paimon, covering direct queries, Data Cache acceleration, and asynchronous materialized views, and presents benchmark results that show StarRocks achieving up to 4.3× faster query speeds than Trino and significant latency reductions with caching and materialized views.

Apache Paimon · Data Cache · Data Lake

0 likes · 12 min read

DataFunSummit

Apr 18, 2024 · Big Data

Real‑time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of JD Tech's real‑time data warehouse evolution, detailing the legacy Lambda‑based design, its shortcomings, the transition to a data‑lake‑integrated architecture, iterative improvements, the technical and non‑technical issues encountered, and the future outlook.

ClickHouse · Data Lake · Flink

0 likes · 24 min read

DataFunSummit

Mar 30, 2024 · Big Data

Alluxio in Data & AI Lakehouse: Architecture, Performance Optimizations, and Cloud Practices at OPPO

OPPO's data architects combined their self‑developed Shuttle service with Alluxio to double performance, halve system pressure, and double throughput, while building a unified Data & AI lakehouse that integrates structured and unstructured data, metadata management, real‑time ingestion, and cloud cost reductions.

AI · Alluxio · Big Data

0 likes · 11 min read

iQIYI Technical Product Team

Mar 8, 2024 · Big Data

Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation

iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.

Data Lake · Hive · Iceberg

0 likes · 20 min read

Xiaohongshu Tech REDtech

Mar 4, 2024 · Big Data

Integrating Data Lake Technologies with Data Warehouse Architecture at Xiaohongshu: Practices and Performance Optimizations

Xiaohongshu’s data‑warehouse team integrated Apache Iceberg‑based data‑lake techniques into its existing warehouse, replacing the legacy Hive/Spark stack with global sorting, Z‑order, and upsert‑enabled tables, which cut query latency by up to 90 %, boosted data freshness by 50 %, slashed storage costs by 83 % and saved tens of thousands of GB‑hours of compute daily.

Apache Iceberg · Data Lake · Data Warehouse

0 likes · 19 min read

DataFunSummit

Feb 26, 2024 · Big Data

Building a New Lakehouse Analytics Paradigm with StarRocks and Paimon

This article introduces a new lakehouse analytics paradigm by combining StarRocks and Paimon, covering the evolution of data lake technologies, key integration scenarios, core technical mechanisms such as JNI connectors, materialized views, and future roadmap for enhanced lakehouse capabilities.

Analytics · Big Data · Data Lake

0 likes · 16 min read

StarRocks

Jan 10, 2024 · Big Data

How Tencent Built the ABetterChoice SaaS A/B Testing Platform for Global Games

In 2022, Tencent's A/B Test team created ABetterChoice, a SaaS product for overseas markets, by abstracting internal experiment capabilities, adapting to multi‑cloud compliance requirements, and unifying computation on StarRocks. It lets games such as Honor of Kings and PUBG Mobile, and publishers such as Ubisoft, run scalable, compliant A/B experiments worldwide.

A/B testing · Data Lake · Experiment Platform

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Jan 9, 2024 · Databases

Boost Real‑Time Lakehouse Queries with Hologres in 5 Minutes

This guide walks you through a 5‑minute challenge that shows how to enable Hologres real‑time lakehouse capabilities, configure OSS, DLF, and Hologres services, create external and internal tables, run TPC-H Q11 queries, and submit results for prizes.

Data Lake · Hologres · SQL

0 likes · 10 min read

Architects Research Society

Jan 2, 2024 · Big Data

Understanding Data Lakes: Concepts, Benefits, Challenges, and Comparison with Data Warehouses

This article explains what a data lake is, its origins, key characteristics such as collecting all data, enabling diverse user access, and flexible processing, compares it with traditional data warehouses, discusses cost advantages, potential pitfalls like data swamps, and outlines best‑practice considerations for enterprise adoption.

Analytics · Data Architecture · Data Lake

0 likes · 10 min read

Big Data Technology & Architecture

Dec 8, 2023 · Big Data

Comprehensive Guide to Apache Paimon and Advanced Flink Integration

This article provides an in‑depth overview of Apache Paimon as a streaming lakehouse, explains its core features, file layout, consistency guarantees, and offers detailed guidance on integrating and tuning Paimon with Apache Flink for both write and read performance, multi‑writer concurrency, table management, and bucket rescaling.

Apache Paimon · Big Data · Data Lake

0 likes · 23 min read

DataFunTalk

Nov 30, 2023 · Big Data

Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference

The 2023 Yunqi Conference in Hangzhou showcased the latest advances in cloud computing and big‑data technologies, examined the evolution from big‑data 1.0 to 3.0, discussed the key difficulties of making big data cloud‑native, and presented a practical case study of MiHoYo’s cloud‑native transformation.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 12 min read

Big Data Technology Architecture

Nov 29, 2023 · Big Data

Building Real-Time Wide Tables with Partial-Update Using Apache Paimon for NetEase News Recommendation

The article describes how NetEase News' recommendation team replaced a slow, batch‑oriented data‑warehouse pipeline with a Flink‑based, Apache Paimon real‑time wide‑table solution that supports partial updates, reduces latency from hours to minutes, and lowers processing costs while handling both deduplication and non‑deduplication recommendation scenarios.

Apache Paimon · Data Lake · Flink

0 likes · 8 min read

Big Data Technology & Architecture

Nov 28, 2023 · Big Data

Apache Paimon for CDC: Low‑Cost, Low‑Latency Data Lake Ingestion and Performance Comparison with Hive and Hudi

This article explains how Apache Paimon simplifies CDC data lake ingestion with one‑click, low‑cost, low‑latency pipelines, details its architecture and tag‑based Hive compatibility, provides best‑practice configurations, and presents benchmark results showing Paimon outperforming Hive and Hudi in both write and query performance.

Apache Paimon · CDC · Data Lake

0 likes · 14 min read

Architects Research Society

Nov 26, 2023 · Big Data

Data Lake vs Data Warehouse: Key Differences and How to Choose

Data lakes and data warehouses serve different purposes in big‑data architectures; this article explains their definitions, core attributes, five major distinctions—including data retention, type support, user coverage, adaptability, and insight speed—and offers guidance on selecting or combining the two approaches.

Analytics · Data Architecture · Data Lake

0 likes · 12 min read

DataFunTalk

Nov 24, 2023 · Big Data

Amoro Lakehouse Management System: Deployment Practices and AWS Integration for Apache Iceberg

This article introduces Amoro, a lakehouse management platform built on Apache Iceberg, explains why Webex adopted it to overcome Hive limitations, details its AWS GlueCatalog and S3 integration with DynamoDB lock management, and provides step‑by‑step Helm‑based deployment instructions on Kubernetes.

AWS · Amoro · Apache Iceberg

0 likes · 19 min read

StarRocks

Nov 22, 2023 · Big Data

How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance

This article details a Chinese tech company's migration of its internal big‑data analytics platform to StarRocks’ compute‑storage separation architecture, describing the original multi‑component setup, the pain points encountered, the evaluation methodology, performance and cost benchmarks, operational optimizations, migration steps, and future roadmap.

Big Data · Compute-Storage Separation · Data Lake

0 likes · 17 min read

dbaplus Community

Nov 8, 2023 · Big Data

Choosing Between Data Warehouse, Data Lake, and Lakehouse: When to Use Each

This article compares traditional data warehouses, modern data lakes, and emerging lakehouse architectures, explaining their design patterns, advantages, disadvantages, and suitable use cases, while detailing implementation considerations such as schema design, ETL/ELT processes, file formats like Delta, Iceberg, and Hudi, and factors influencing platform selection.

Apache Spark · Data Lake · Data Warehouse

0 likes · 20 min read

DataFunTalk

Oct 28, 2023 · Big Data

Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices

This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.

Big Data · Data Lake · Flink

0 likes · 14 min read

DataFunSummit

Oct 18, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, outlines the shortcomings of its previous Lambda architecture, describes the adoption of Apache Hudi for unified batch‑stream processing, and details the five major technical challenges and the corresponding solutions implemented to improve performance, consistency, and operational reliability.

Apache Hudi · Big Data · Data Architecture

0 likes · 17 min read

Data Thinking Notes

Oct 11, 2023 · Big Data

How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy

ByteDance revamped its traditional Lambda architecture for e‑commerce traffic data by introducing a new lake ingestion solution that reduces development and operational costs, ensures timely and stable data, and outlines future plans covering business background, ODS lake design, archiving tags, delayed data handling, and real‑time stability.

Big Data · Data Lake · Flink

0 likes · 7 min read

Sohu Tech Products

Oct 11, 2023 · Industry Insights

How StarRocks Materialized Views Power Real‑Time Lakehouse Analytics

The article provides a deep technical overview of StarRocks 3.0’s data‑lake analysis capabilities, its unified Lakehouse architecture, Catalog integration, Trino compatibility, extensive I/O optimizations, materialized view features, resource isolation techniques, real‑world use cases, and future development directions.

Analytics · Data Lake · Lakehouse

0 likes · 22 min read

DataFunSummit

Oct 1, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation introduces Iceberg's core capabilities, details Xiaomi's practical applications—including log ingestion, near‑real‑time warehousing, offline challenges, column‑level encryption, and Hive migration—and outlines future development directions such as materialized views and cloud migration, providing a comprehensive view of modern data‑lake engineering.

Big Data · Data Lake · Flink

0 likes · 22 min read

Architects Research Society

Sep 26, 2023 · Big Data

From a Single Data Lake to a Decentralized Data Mesh: A Step‑by‑Step Migration Guide

This article explains why traditional centralized data lakes hinder modern software development, introduces the data‑mesh concept as a decentralized alternative, and walks through an e‑commerce microservice example with concrete steps, data‑API designs, and migration tactics to transition from a monolithic lake to a distributed data mesh.

Data Lake · Data Mesh · Data Platform

0 likes · 22 min read

iQIYI Technical Product Team

Sep 22, 2023 · Big Data

Data Lake: Concepts, Architecture, and Application in iQIYI's Data Platform

iQIYI’s data‑middle‑platform team built a four‑zone data lake—raw, product, work, and sensitive—integrated with unified ODS/DWD/MID layers, a metadata catalog, and self‑service tools, leveraging HDFS, Hive/Iceberg, Spark/Trino, and Flink, migrated to Apache Iceberg for real‑time freshness, and now aims to further streamline modules and adopt new technologies.

Apache Iceberg · Data Governance · Data Lake

0 likes · 13 min read

Didi Tech

Sep 19, 2023 · Cloud Native

OrangeFS: A Cloud‑Native Multi‑Protocol Distributed Data Lake Storage System

OrangeFS is Didi’s cloud‑native, multi‑protocol distributed data‑lake storage system that unifies POSIX, S3 and HDFS access on a single logical hierarchy, integrates with Kubernetes via a CSI plugin, supports on‑premise and public‑cloud backends, provides multi‑tenant isolation, and dramatically improves elasticity, utilization and latency for petabyte‑scale workloads such as ride‑hailing logs, machine‑learning training, finance and analytics.

CSI · Data Lake · Distributed File System

0 likes · 17 min read

DataFunTalk

Sep 16, 2023 · Big Data

StarRocks Data Lake Analysis, Materialized Views, and Lakehouse Architecture

This article explains how StarRocks 3.0 extends real‑time data‑warehouse capabilities to support data‑lake analysis, external catalog integration, Trino compatibility, extensive I/O optimizations, and powerful materialized‑view features that together enable a unified, cloud‑native Lakehouse solution with high performance and flexible resource isolation.

Big Data · Data Lake · Lakehouse

0 likes · 20 min read

iQIYI Technical Product Team

Sep 15, 2023 · Big Data

Apache Spark at iQIYI: Current Status and Optimization

iQIYI now relies on Apache Spark as its main offline engine, processing over 200 000 daily tasks for ETL, data synchronization and analytics, while recent optimizations—dynamic resource allocation, adaptive query execution, compression, rebalance, Z‑order and resource‑governance—have cut compute usage by ~27 %, storage by up to 76 % and improved query speed, completing a large‑scale migration from Hive and paving the way for Spark 3.4 and Iceberg support.

Apache Spark · Data Lake · Performance Optimization

0 likes · 21 min read

DataFunSummit

Aug 26, 2023 · Big Data

Bilibili's Practice of Building a Streaming Data Lake with Hudi and Flink

This article details Bilibili's implementation of a streaming data lake using Hudi and Flink, covering background challenges, four case studies, batch‑stream integration optimizations, infrastructure and kernel enhancements, and future work directions.

Batch-Stream Integration · Big Data · Data Lake

0 likes · 14 min read

DataFunTalk

Aug 20, 2023 · Databases

Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.

AnalyticDB · Big Data · Data Lake

0 likes · 19 min read

StarRocks

Aug 9, 2023 · Databases

StarRocks 3.1 Highlights: Faster Lakehouse Analytics and Advanced Materialized Views

StarRocks 3.1 introduces a cloud‑native, lakehouse‑oriented architecture with enhanced storage‑compute separation, up to 3‑6× faster data‑lake queries than Trino/Presto, expanded Iceberg and Paimon support, richer materialized view capabilities, new random bucketing, expression partitioning, generated columns, and spill‑to‑disk stability, all backed by extensive performance optimizations and open‑source contributions.

Data Lake · Lakehouse · Materialized Views

0 likes · 17 min read

DataFunSummit

Aug 7, 2023 · Big Data

Performance Optimizations in Impala for Data Lake Queries: Iceberg and Codegen Enhancements

This article presents a comprehensive overview of Impala's high‑performance MPP query engine, its architecture for data‑lake workloads, and detailed performance optimizations including Iceberg table format improvements, manifest caching, and various Codegen techniques such as asynchronous compilation and caching.

Big Data · Codegen · Data Lake

0 likes · 17 min read

Baidu Geek Talk

Aug 7, 2023 · Artificial Intelligence

Storage Acceleration Solutions for Large AI Model Workflows

To tackle the massive data, high‑throughput and low‑latency demands of large‑model training and inference, the talk proposes a unified data‑lake built on scalable object storage combined with an acceleration layer—either a parallel file system or cloud‑native RapidFS cache—demonstrating multi‑fold training speedups, faster checkpoint uploads, and linear inference scaling.

AI Model Storage · Accelerated Filesystem · Data Lake

0 likes · 18 min read

GuanYuan Data Tech Team

Jul 27, 2023 · Big Data

How Delta Lake Powers Scalable BI & AI: Real-World Practices and Optimizations

Guandata’s R&D leader outlines how their analytics platform leverages Delta Lake and Spark to deliver fast, ACID‑compliant BI and AI workloads, detailing architecture, key features like schema evolution and time travel, and practical performance tricks such as compaction, vacuuming, and multi‑engine integration.

AI · BI · Big Data

0 likes · 14 min read

DataFunSummit

Jul 18, 2023 · Databases

Apache Doris Data Lake Federation Features Overview

This article introduces Apache Doris’s data lake federation capabilities, detailing its lake‑warehouse integration design, supported data sources such as Hive, Iceberg, Hudi, and Elasticsearch, performance optimizations for metadata and file access, case studies, community roadmap, and Q&A on replacing Presto.

Apache Doris · Data Lake · Federated Query

0 likes · 21 min read

Data Thinking Notes

Jul 12, 2023 · Fundamentals

Why Metadata Governance Is the Backbone of Modern Data Platforms

This article explains how metadata serves as essential infrastructure for data platforms, detailing Huawei's classification framework, governance challenges, management architecture, integrated modeling, data lake handling, service management, and data map construction to bridge business and IT domains.

Data Governance · Data Lake · Data Management

0 likes · 24 min read

DataFunTalk

Jul 10, 2023 · Big Data

Practical Experience of In‑Lake Warehouse Implementation Based on Lakehouse Architecture

This article presents a comprehensive overview of Lakehouse‑based in‑lake warehousing, covering common data‑lake misconceptions, the evolution from databases to data warehouses and lakes, the advantages of Lakehouse over traditional architectures, a reference multi‑layer architecture, typical use cases, challenges, future plans, and a brief Q&A.

Big Data Architecture · Data Lake · Data Warehouse

0 likes · 20 min read

DataFunTalk

Jun 29, 2023 · Big Data

Practical Deployment of Delta Lake in BI and AI Products

This article summarizes a technical presentation on how Delta Lake is integrated into a BI+AI platform, covering the product background, data‑lake architecture, Delta Lake features such as ACID transactions, schema management, multi‑engine support, performance optimizations, and future development directions.

AI · BI · Big Data

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Jun 27, 2023 · Cloud Native

How Hierarchical Namespace Boosts Cloud‑Native Data Lake Performance

This article examines the performance challenges of cloud‑native data lakes built on flat object storage and explains how a hierarchical‑namespace design improves directory operations, reduces request amplification, and delivers significant speedups for big‑data and AI workloads.

Big Data · Data Lake · cloud-native

0 likes · 21 min read

DataFunTalk

Jun 26, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation details Iceberg's core capabilities—transactional writes, schema evolution, implicit partitioning, and row‑level updates—while showcasing Xiaomi's real‑world applications such as log ingestion redesign, near‑real‑time warehousing, offline optimizations, column‑level encryption, Hive migration strategies, and outlining upcoming enhancements like materialized views and cloud migration.

Big Data · Column Encryption · Data Lake

0 likes · 20 min read

DataFunTalk

Jun 24, 2023 · Big Data

Design and Architecture of MaxCompute Lakehouse Near‑Real‑Time Incremental Processing

This article explains the evolution of Alibaba Cloud's MaxCompute platform into a lakehouse architecture that supports near‑real‑time incremental processing, detailing its development history, core design of transactional tables, five‑module technical stack, data ingestion methods, optimization services, transaction management, query capabilities, ecosystem integration, practical applications, future roadmap, and common user questions.

Big Data · Data Lake · Incremental Processing

0 likes · 24 min read

Data Thinking Notes

Jun 18, 2023 · Big Data

Data Lake vs Data Warehouse: Uncover the Real Differences

This article explores the evolving concept of data lakes, compares them with traditional data warehouses across storage, modeling, tooling, and user roles, and examines the emerging lake‑warehouse integration, highlighting why both remain essential in modern big‑data architectures.

Big Data · Data Architecture · Data Lake

0 likes · 12 min read

Big Data Technology & Architecture

Jun 13, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of Iceberg for its data lake, covering the OLAP architecture, reasons for a data lake, Iceberg's table format advantages over Hive, platform construction, streaming ingestion, query and performance optimizations, real‑world business deployments, and future plans.

Big Data · Data Lake · Flink

0 likes · 21 min read

DataFunSummit

Jun 10, 2023 · Big Data

Performance Optimization of Iceberg Real‑time Data Warehouse and Arctic Enhancements

This article presents a comprehensive overview of Iceberg MOR principles, Arctic‑based performance optimizations, benchmark evaluations using CH‑benchmark, and future roadmap items, highlighting how various file‑type strategies, self‑optimizing mechanisms, and task balancing improve real‑time data lake query efficiency.

Arctic · Data Lake · Iceberg

0 likes · 14 min read

DataFunSummit

Jun 6, 2023 · Big Data

Optimizing Real-Time Data Lake Queries on Huawei Cloud with Apache Hudi: Architecture, Indexing, and Performance Enhancements

This article introduces Huawei Cloud's real-time data lake query optimizations using Apache Hudi, covering Hudi's query capabilities, clustering and MDT optimizations, various index types (Min‑max, Lucene, bitmap), caching strategies, and future plans for performance improvements.

Apache Hudi · Data Lake · Huawei Cloud

0 likes · 18 min read

Data Thinking Notes

Jun 4, 2023 · Big Data

How Distributed Lakehouse Architecture Solves Data Swamp Challenges

This article examines the explosion of heterogeneous data sources, the limitations of traditional data lakes and warehouses, and proposes a distributed lakehouse architecture that integrates advanced management layers to improve data governance, reliability, and support both SQL and advanced analytics workloads.

Data Governance · Data Lake · Data Warehouse

0 likes · 29 min read

DataFunSummit

Jun 3, 2023 · Big Data

Kuaishou’s Data Lake Architecture with Apache Hudi: Design, Challenges, Solutions, and Future Plans

This article presents Kuaishou’s journey in building a data lake using Apache Hudi, detailing the lake architecture, key challenges such as ingestion bottlenecks and update inefficiencies, the solutions implemented, practical case studies, and the roadmap for future enhancements.

Apache Hudi · Data Lake · Flink

0 likes · 20 min read

DataFunTalk

Jun 2, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Iceberg data lake, covering its OLAP architecture, the reasons for adopting a data lake, the advantages of the Iceberg table format over Hive, platform construction, extensive performance optimizations, and real‑world business use cases such as ad‑flow unification, log analysis, audit, and CDC pipelines.

Big Data · Data Lake · Flink

0 likes · 18 min read

DataFunSummit

May 28, 2023 · Big Data

Apache Hudi: Capabilities, Architecture, Use Cases, and Future Outlook

This article introduces Apache Hudi as a next‑generation streaming data‑lake platform, explains its core concepts, architecture, and table types, and showcases real‑world use cases at Tencent such as CDC ingestion, minute‑level real‑time warehousing, streaming analytics, multi‑stream joins, ad attribution, and stream‑to‑batch processing, while also outlining future directions.

Apache Hudi · CDC · Data Lake

0 likes · 16 min read

DataFunTalk

May 22, 2023 · Big Data

Alibaba Cloud Data Lake: Unified Metadata and Storage Management Practices

This article explains Alibaba Cloud's data lake architecture, unified metadata services, storage management optimizations, and format handling techniques, illustrating how lakehouse concepts, multi‑engine support, and lifecycle policies enable efficient, secure, and cost‑effective big data processing in the cloud.

Big Data · Cloud Services · Data Lake

0 likes · 22 min read

DataFunSummit

May 20, 2023 · Big Data

Arctic on Flink: Streaming Features, Core Principles, Benchmark Results, and Future Roadmap

This article presents a comprehensive overview of Arctic's streaming capabilities on Flink, detailing its mixed‑format architecture, core principles, benchmark comparisons with Iceberg, future development plans, and a Q&A session covering implementation nuances and performance considerations.

Arctic · Data Lake · Flink

0 likes · 18 min read

DataFunTalk

May 15, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, describes its Hudi‑based architecture, outlines five major challenges encountered during implementation, and presents the solutions and future development plans, illustrating performance improvements and practical use cases across various business scenarios.

Apache Hudi · Big Data · Data Lake

0 likes · 19 min read

DataFunTalk

May 11, 2023 · Big Data

Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap

This article describes how ByteDance tackled petabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as lake‑cold‑layering and materialized views.

Apache Iceberg · Big Data · Data Lake

0 likes · 16 min read

dbaplus Community

May 9, 2023 · Big Data

How a Bank Built a Near‑Real‑Time Data Platform with Kafka, Flink & Hudi

An in‑depth case study of a Chinese bank’s near‑real‑time data platform reveals its evolution from a monolithic CDC pipeline to a split architecture featuring a real‑time data lake and a data‑service bus, detailing component choices, schema‑registry integration, SDK development, observability, and future roadmap.

Big Data Architecture · Data Lake · Flink

0 likes · 18 min read

DataFunTalk

May 6, 2023 · Databases

Apache Doris: Overview, Data Lake Analysis Architecture, Community Development and Future Roadmap

This article provides a comprehensive overview of Apache Doris, detailing its origins, MPP‑based analytical capabilities, data‑lake integration techniques, recent architectural enhancements, performance optimizations, community growth, and upcoming development plans, while also addressing common user questions.

Apache Doris · Big Data · Data Lake

0 likes · 20 min read

DataFunSummit

Apr 25, 2023 · Big Data

Building a Real-Time Data Lake with Hudi: Architecture, Challenges, and Practices

This article presents Huawei's end‑to‑end solution for constructing a real‑time data lake on Hudi, covering requirement analysis, technology selection, architectural design, ingestion and processing challenges, practical optimizations, and future improvement directions.

Data Lake · ETL/ELT · Flink

0 likes · 14 min read

DataFunTalk

Apr 13, 2023 · Big Data

Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0

This article explains why lake‑warehouse integration is needed, outlines its challenges, describes StarRocks' four integration paradigms—including query acceleration, layered modeling, real‑time warehouse‑lake fusion, and the cloud‑native 3.0 solution—and previews the upcoming StarRocks 3.0 release.

Big Data · Cloud Native · Data Lake

0 likes · 18 min read

Data Thinking Notes

Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big Data · Data Governance · Data Lake

0 likes · 13 min read

StarRing Big Data Open Lab

Mar 31, 2023 · Big Data

Why Disaggregated Storage‑Compute Architecture Is Revolutionizing Big Data Platforms

The article explains how separating storage and compute layers—through disaggregated architectures, containerized services, and cloud‑native scheduling—enhances data openness, independent scaling, resource isolation, and performance for modern big data platforms like StarRing and Cloudera.

Data Lake · disaggregated storage · kubernetes

0 likes · 14 min read

Big Data Technology & Architecture

Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

StarRing Big Data Open Lab

Mar 24, 2023 · Big Data

How Inceptor and Delta Lake Power a Unified Lake‑Warehouse Architecture

This article explains how Inceptor and Delta Lake combine distributed transactions, MVCC, snapshot isolation, and high‑performance SQL to support both data lake and data warehouse workloads, compares them with Hudi and Iceberg, and outlines their strengths and limitations for modern big‑data analytics.

Data Lake · Data Warehouse · Delta Lake

0 likes · 12 min read

StarRing Big Data Open Lab

Mar 22, 2023 · Big Data

Why Lakehouse Architecture Is Revolutionizing Data Analytics: Hudi vs Iceberg

This article explains how the lakehouse integrated architecture combines data lake and data warehouse capabilities, outlines its key features, compares three implementation paths, and provides an in‑depth technical overview of Apache Hudi and Apache Iceberg for modern big‑data analytics.

Apache Hudi · Apache Iceberg · Data Lake

0 likes · 15 min read

Big Data Technology & Architecture

Mar 20, 2023 · Big Data

Using SparkSQL to Connect and Operate with Apache Hudi: Configuration, Table Creation, Data Manipulation, and Deletion

This guide demonstrates how to configure Hive metastore, connect SparkSQL to Apache Hudi, create COW and MOR tables, perform insert, update, merge, delete, and insert‑overwrite operations, and illustrates each step with executable code snippets and sample results.

Apache Hudi · Big Data · Data Lake

0 likes · 14 min read

Open Source Linux

Mar 14, 2023 · Big Data

Can Data Lakes and Data Warehouses Coexist? Exploring the Lake‑Warehouse Fusion

This article traces 20 years of big‑data evolution, compares data lakes and data warehouses, defines both concepts, examines their technical trade‑offs, and presents Alibaba Cloud’s lake‑warehouse (lakehouse) solution that unifies flexible storage with enterprise‑grade performance and governance.

Big Data · Cloud Computing · Data Lake

0 likes · 32 min read

Alibaba Cloud Big Data AI Platform

Mar 13, 2023 · Big Data

Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

Analytics · Big Data · Cloud Native

0 likes · 11 min read

DataFunSummit

Mar 10, 2023 · Big Data

Interview on Data Lake and Lakehouse: Current Applications, Challenges, and Evolution

This interview with NetEase’s data‑lake technology manager explores the distinction between data lakes and lakehouses, the evolution of table‑format technologies such as Iceberg, Hudi and Delta Lake, their maturity across key capabilities, and the practical adoption challenges faced by enterprises.

Data Lake · Delta Lake · Hudi

0 likes · 14 min read

DataFunSummit

Feb 28, 2023 · Big Data

Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans

This article introduces the Iceberg table format, explains its core architecture and advantages such as transactionality, implicit partitioning and row‑level updates, details Xiaomi's practical deployments—including CDC pipelines, partition strategies, compaction services, and stream‑batch integration—and outlines future development directions.

Compaction · Data Lake · Flink

0 likes · 20 min read

NetEase Yanxuan Technology Product Team

Feb 27, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration

This article details how NetEase Yanxuan transformed its data platform from a dual Lambda architecture to a unified batch‑stream solution built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation of stream‑batch pipelines, message ordering fixes, snapshot generation, and extensive table‑governance optimizations.

Apache Flink · Apache Spark · Batch-Stream Integration

0 likes · 14 min read

DataFunTalk

Feb 25, 2023 · Big Data

T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices

This article details T3 Travel’s exploration of the Modern Data Stack, describing its four‑point overview, business scenarios, the initial MDS implementation using Apache Hudi and Kyuubi, and the design of a feature platform that integrates Metricflow, Feast, and other components to support data processing, analytics, and machine‑learning workflows.

Apache Hudi · Big Data · Data Lake

0 likes · 22 min read

DataFunTalk

Feb 24, 2023 · Big Data

Presto and Alluxio Integration for Iceberg: Architecture, Best Practices, and Future Work

This article explains how Presto and Alluxio work together to query Iceberg tables, describes their architectures, deployment options, best‑practice recommendations such as using Iceberg native catalogs and local caches, and outlines future research directions for improving CPU usage and off‑heap caching.

Alluxio · Big Data · Cache

0 likes · 14 min read

DataFunTalk

Feb 20, 2023 · Big Data

Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation

This article explains the definition of data lakes (public‑cloud and non‑public‑cloud), outlines their key characteristics, presents three typical business scenarios—real‑time event analysis, change‑data analysis, and stream‑batch integration—summarizes required product features, evaluates open‑source lake formats, and details iQIYI's adoption of Apache Iceberg across multiple services to achieve low‑latency, large‑scale, cost‑effective analytics.

Big Data · Data Lake · Iceberg

0 likes · 23 min read

Alibaba Cloud Big Data AI Platform

Feb 8, 2023 · Big Data

How Alibaba Cloud EMR 2.0 Redefines Open‑Source Big Data Platforms

This article summarizes Alibaba Cloud senior product expert He Yuan's presentation on EMR 2.0, outlining the challenges of open‑source big data, the evolution of EMR, and the new features—including cloud‑native architecture, enhanced performance, diverse resource models, and expanded analysis scenarios—aimed at reducing cost and complexity.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 11 min read

Big Data Technology & Architecture

Feb 6, 2023 · Big Data

Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big Data · Data Lake · Flink

0 likes · 17 min read

iQIYI Technical Product Team

Feb 3, 2023 · Big Data

Data Lake Concepts, Benefits, and Iceberg‑Based Implementations at iQIYI

iQIYI’s data lake combines public‑cloud and private storage with Apache Iceberg’s snapshot‑based table format to enable near‑real‑time, unified batch‑and‑stream analytics, reducing costs, simplifying architecture, and improving data freshness across use cases such as log collection, audit, pingback, and member order processing.

Apache Iceberg · Data Architecture · Data Lake

0 likes · 25 min read

dbaplus Community

Jan 31, 2023 · Big Data

Building ByteDance’s Real‑Time Data Warehouse with Hudi: Architecture & Solutions

This article explains how ByteDance designed and deployed a real‑time data warehouse on a data lake using Hudi, detailing three business scenarios, the challenges of latency, consistency and resource usage, and the engineering solutions—including upserts, compaction services, indexing, and future unified storage plans.

Data Lake · Flink · Hudi

0 likes · 14 min read

DataFunTalk

Jan 28, 2023 · Big Data

Data Lake vs Data Warehouse: Differences, Evolution, and Integrated Lakehouse Design

This article explores the ongoing debate between data lakes and data warehouses, clarifies their distinct purposes and technologies, discusses how they can coexist or complement each other, and introduces the concept of an integrated lakehouse architecture while promoting a comprehensive data intelligence knowledge map.

Big Data · Data Lake · Data Warehouse

0 likes · 5 min read

DataFunSummit

Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article presents a comprehensive overview of Iceberg's adoption in Huawei Terminal Cloud, covering its architecture, key features such as Git‑style data management, real‑time processing, and acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.

Big Data · Data Lake · Flink

0 likes · 15 min read

Data Thinking Notes

Jan 5, 2023 · Big Data

Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive

This comprehensive guide explains the evolution from traditional data warehouses to modern data lakes, detailing concepts, architectures, differences, implementation steps, and real‑world case studies, while also comparing major cloud providers' solutions and highlighting how data platforms support digital transformation and analytics.

Analytics · Big Data · Data Lake

0 likes · 97 min read

DataFunTalk

Dec 31, 2022 · Big Data

Glacier: An Intelligent Data Lake Architecture for Real‑Time Analytics and Machine Learning

This article presents Glacier, OPPO's intelligent data lake solution that builds on Iceberg Table Format to provide real‑time data ingestion, low‑latency queries, advanced indexing, and robust multi‑version management for both structured and unstructured data, tightly integrating with machine‑learning workflows.

Data Lake · Glacier · Iceberg

0 likes · 20 min read

DataFunTalk

Dec 27, 2022 · Big Data

Multi‑Stream Join and Concurrency Control in Apache Hudi: Design, Implementation, and Usage

This article presents a comprehensive solution for multi‑stream joins in Apache Hudi, detailing the challenges of dimension and multi‑stream joins, the novel storage‑layer join approach, timeline‑based concurrency control, marker mechanisms, early conflict detection, payload customization, and practical usage with Flink and Spark, along with performance benefits and future directions.

Apache Hudi · Data Lake · Flink

0 likes · 31 min read
