Tagged articles

Data Lake

356 articles · Page 1 of 4

Jul 2, 2026 · Industry Insights

How ColdFront Sets pgEdge Apart in the OLTP‑OLAP‑AI Showdown

The article compares four emerging data‑lake‑for‑PostgreSQL solutions—Databricks LTAP, EDB Fusion Analytics, Snowflake pg_lake, and pgEdge's ColdFront—highlighting ColdFront's unique transparent Iceberg layer, writable cold data, DuckDB integration, and the strategic trade‑offs developers must weigh when choosing a modern OLTP/OLAP/AI architecture.

Agentic AI · ColdFront · Data Lake

0 likes · 9 min read


Alibaba Cloud Native

Jun 26, 2026 · Cloud Native

One-Click Real-Time Stream Ingestion: Alibaba Cloud Kafka’s Native Data Lake Integration

Alibaba Cloud Message Queue for Kafka introduces a native message‑to‑lake capability that integrates Apache Iceberg with OSS Table Bucket, eliminating Spark/Flink/Kafka Connect, providing exactly‑once semantics, automatic schema management, dual write modes, smart partitioning, and up to ten‑fold performance gains across diverse real‑time analytics scenarios.

Apache Iceberg · Cloud Native · Data Lake

0 likes · 12 min read


Alibaba Cloud Native

Jun 19, 2026 · Big Data

Why Real-Time Data Lake Ingestion Is Dropping ETL in the AI Era: Architecture Simplification from Kafka to Iceberg

In the AI‑driven era, enterprises need a data foundation that supports both real‑time consumption and long‑term historical analysis, and the emerging "zero‑ETL" trend moves generic ingestion capabilities from external Flink/Spark jobs into a streamlined Kafka‑to‑Iceberg pipeline, reducing complexity while preserving low latency, consistency, schema evolution, CDC semantics and open‑ecosystem compatibility.

Data Lake · Iceberg · Streaming

0 likes · 25 min read


Alibaba Cloud Developer

Jun 18, 2026 · Big Data

How AI-Driven Real-Time Data Lakes Are Ditching ETL: A Kafka‑to‑Iceberg Architecture Simplification

In the AI era, enterprises need a data foundation that supports both low‑latency streaming and long‑term analytics, and the combination of Kafka, Iceberg and object storage is emerging as a preferred solution; by moving ingestion capabilities closer to the message layer and eliminating external ETL jobs, a "zero‑ETL" approach reduces architectural complexity, improves consistency, and streamlines schema evolution and small‑file management.

CDC · Data Lake · Iceberg

0 likes · 27 min read


StarRocks

Jun 4, 2026 · Databases

How StarRocks and Iceberg Enable Federated Queries: A Practical Walkthrough

This article details Fresha's real‑world integration of StarRocks with Apache Iceberg, covering metadata planning, distributed execution, adaptive metadata retrieval, hot‑cold data layering, missing statistics handling, catalog configuration, and performance optimizations that together demonstrate how federated queries can be efficiently executed over data‑lake tables.

Apache Iceberg · Data Lake · Federated Query

0 likes · 14 min read


DataFunSummit

May 25, 2026 · Big Data

How Hisense Built an AI‑Ready Multimodal Data Platform: Storage, Governance, and Development

This article details Hisense's journey to create an AI‑ready multimodal data platform, covering the challenges of integrating diverse business systems, the shift from a Hadoop‑based architecture to a cloud‑native data lake, the JuData governance and development platform, and six practical scenarios that demonstrate unified ingestion, metadata management, rule‑based quality control, intelligent asset retrieval, and future AI‑driven DataOps capabilities.

AI platform · Cloud Native · Data Governance

0 likes · 23 min read


DataFunSummit

May 21, 2026 · Big Data

Alibaba Cloud’s Agent-Ready Big Data AI Infrastructure: Boosting Data Development from Hours to Minutes

With a projected 85% of enterprises expected to deploy internal agents within two years, Alibaba Cloud proposes an Agent-Ready big‑data AI infrastructure (a unified data lake, real‑time processing, high‑dimensional vector retrieval, elastic model serving, and comprehensive security governance) that has already cut data‑development cycles from hours to 5‑10 minutes in internal model‑training and Taobao flash‑sale scenarios.

AI · Agent-Ready · Big Data

0 likes · 15 min read


DataFunSummit

May 20, 2026 · Big Data

How Kuaishou’s Real‑Time Data Lake Boosts AI and BI Architecture

The article explains how Kuaishou partnered with Apache Hudi to overhaul its ODS‑based data lake, addressing latency, storage cost, and complexity for AI and BI workloads, detailing the evolution from MySQL‑to‑Hive to MySQL‑to‑Hudi 1.0 and 2.0, the resulting performance gains, cost savings, and future roadmap.

AI · BI · Big Data

0 likes · 20 min read


DataFunSummit

May 11, 2026 · Artificial Intelligence

How Lance Powers Enterprise Multimodal AI Data Lakes

The article analyzes why 74% of AI projects fail due to feedback gaps and data silos, explains how the open‑source Lance format addresses these issues with unified multimodal storage, outlines a layered Lance‑on‑Ray architecture, and details three real‑world practices—implicit feedback loops, GPU‑accelerated self‑evolution, and semantic knowledge‑graph evolution—to boost R&D efficiency.

CAGRA · Daft · Data Lake

0 likes · 13 min read


DataFunSummit

May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format, detailing why traditional lakes fall short for multimodal data, the engineering enhancements such as Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, Row‑ID shuffle optimization, and real‑world case studies that demonstrate significant performance and cost gains.

AI · Binary Copy Compaction · Data Lake

0 likes · 18 min read


DataFunSummit

Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Data Lake · Multimodal

0 likes · 11 min read


DataFunTalk

Apr 18, 2026 · Databases

How Will Apache Doris Evolve in 2026 to Power AI‑Driven Data Workloads?

The article outlines Apache Doris's 2026 roadmap, detailing how the database will shift from pure analytics to a unified AI‑enabled platform with enhanced semi‑structured data support, vector and hybrid search, agent‑focused capabilities, and expanded storage and lakehouse integrations to meet emerging AI workloads.

AI integration · Apache Doris · Data Lake

0 likes · 14 min read


Past Memory Big Data

Mar 27, 2026 · Big Data

Why AI Workloads Require Rebuilding Parquet: A Deep Dive into Lance

The article explains how traditional Parquet‑based lakehouse architectures, optimized for large‑scale scans, struggle with AI workloads that need ultra‑low‑latency random access, and how Lance redesigns the storage format, indexing and write path to provide O(1) addressing, native vector support, and seamless integration with native execution engines.

AI workloads · Data Lake · Lance

0 likes · 12 min read


Alibaba Cloud Infrastructure

Mar 26, 2026 · Cloud Computing

Secure Multi‑Tenant Data Lakes on Alibaba Cloud: OSS Access Points + VPC Gateway Endpoints

This guide explains how to build a secure, multi‑tenant data lake on Alibaba Cloud by combining OSS Access Points with VPC Gateway Endpoints, covering architecture overview, step‑by‑step configuration, policy examples, and best‑practice considerations for private‑network access.

Alibaba Cloud · Cloud Computing · Data Lake

0 likes · 10 min read


DataFunTalk

Mar 3, 2026 · Big Data

Exploring Tencent Cloud’s Iceberg Batch‑Stream Integration and AI‑Driven Data Governance

This article presents a series of seven technical case studies—including Tencent Cloud’s Iceberg‑based batch‑stream integration, AI‑driven data governance with Apache Gravitino, Xiaohongshu’s lakehouse evolution, and a multimodal data‑lake solution—detailing challenges, architectural designs, implementation steps, performance results, and future directions.

AI · Big Data · Data Lake

0 likes · 8 min read


StarRocks

Feb 11, 2026 · Big Data

How StarRocks and Apache Paimon Build a True Lakehouse Native Engine

This article details the deep integration of StarRocks with Apache Paimon, describing the unified architecture, version evolution, performance enhancements, time‑travel queries, native readers/writers, distributed planning, and future roadmap for achieving lakehouse‑native analytics at scale.

Apache Paimon · Data Lake · Lakehouse

0 likes · 10 min read


DataFunSummit

Feb 8, 2026 · Big Data

Kuaishou’s Data Lake Upgrade with Hudi: Solving AI & BI Challenges

The article explains how Kuaishou modernized its data lake by partnering with Apache Hudi to address latency, storage cost, and consistency issues in both AI and BI pipelines, detailing architectural changes, new ingestion tools, partitioning strategies, compaction mechanisms, performance gains and future plans.

AI · BI · Big Data

0 likes · 20 min read


Alibaba Cloud Big Data AI Platform

Feb 4, 2026 · Big Data

How Paimon + StarRocks Power Real‑Time OLAP for Double‑11 Mega‑Sales

During Double‑11 mega‑sales, Taobao Group faced exploding OLAP query traffic, costly data sync pipelines, and slow near‑real‑time analytics, so they unified real‑time and batch data in Paimon, leveraged StarRocks for high‑performance lake queries, tuned cluster settings, and saved nearly ten‑million yuan annually while cutting refresh latency by 80%.

Big Data · Data Lake · OLAP

0 likes · 22 min read


JD Retail Technology

Jan 5, 2026 · Big Data

How JD’s Data Lake Uses Hudi LSM‑Tree to Power Near‑Real‑Time Data Assets

The article details JD’s data lake architecture, its 500 PB scale, self‑developed Hudi extensions—including LSM‑Tree‑based MoR tables, custom indexing, IO optimizations, Flink stream scheduling, and NativeIO SDK—along with benchmarks, community contributions, and future roadmap for real‑time big‑data processing.

Big Data · Data Lake · Hudi

0 likes · 19 min read


Big Data Tech Team

Dec 29, 2025 · Big Data

Data Warehouse vs Data Mart vs Data Lake: Which Should Your Enterprise Choose?

The article explains the distinct roles of data warehouses, data marts, and data lakes, illustrates their differences with analogies and real‑world cases, outlines a three‑step strategy for enterprises, highlights common pitfalls, and offers a decision guide to help organizations choose the right architecture for their data needs.

Data Lake · Data Mart · Data Warehouse

0 likes · 11 min read


DataFunTalk

Dec 26, 2025 · Cloud Native

How Haier Built a Cloud‑Native Multi‑Modal Data Lake for AI‑Ready Manufacturing

Haier’s digital transformation leverages a cloud‑native, open‑source‑based multi‑modal data lake that unifies structured and unstructured industrial data, uses metadata models and knowledge graphs for governance, and provides AI‑ready services that balance performance, cost, and real‑time requirements.

AI · Data Lake · Metadata

0 likes · 12 min read


dbaplus Community

Dec 20, 2025 · Big Data

From Data Lakes to DataOps: Unveiling the Hidden Challenges of Data Governance

The article walks through the evolution of data management—from idealistic visions and messy “shit mountains” to the realities of data lakes, metadata layers, governance challenges, trust breakdowns, and finally the promise of DataOps as a hopeful path forward.

Big Data · Data Governance · Data Lake

0 likes · 3 min read


JD Tech Talk

Dec 12, 2025 · Big Data

Understanding Hudi Core Concepts: Timeline, Indexes, and Table Types Explained

This article explains Apache Hudi’s core concepts, including its timeline architecture, file layout, indexing mechanisms, and the two primary table types—Copy on Write and Merge on Read—along with their trade‑offs and the various query modes such as snapshot, time‑travel, and incremental queries.

Apache Hudi · Big Data · Data Lake

0 likes · 9 min read


JD Cloud Developers

Dec 12, 2025 · Big Data

Apache Hudi Core Concepts: Timeline, Indexes, Table Types & Queries

This article explains Apache Hudi’s core architecture, detailing the timeline mechanism, file layout, indexing strategies, the two main table types (Copy‑On‑Write and Merge‑On‑Read), and various query modes such as snapshot, time‑travel, read‑optimized and incremental queries.

Apache Hudi · Big Data · Data Lake

0 likes · 9 min read


Past Memory Big Data

Dec 12, 2025 · Big Data

How Uber Reduced Data Freshness from Hours to Minutes Using Flink Streaming

Uber rebuilt its data‑lake ingestion pipeline with Apache Flink, replacing batch jobs with a streaming architecture that cuts data freshness from hours to minutes, lowers compute usage by 25%, and solves challenges like small‑file proliferation, partition skew, and checkpoint‑commit synchronization at petabyte scale.

Apache Flink · Apache Hudi · Data Freshness

0 likes · 10 min read


DataFunSummit

Dec 10, 2025 · Big Data

How Apache Hudi Powers the Next‑Gen AI‑Native Lakehouse: Insights from the Asia Meetup

The article recaps the Apache Hudi Asia Meetup hosted by JD, covering community updates, JD's data‑lake challenges, the upcoming Hudi 1.1 release, JD's architectural redesign, Kuaishou's real‑time lake adoption, and Huawei Cloud's deep optimizations, all aimed at building an AI‑native, real‑time lakehouse.

AI-native · Apache Hudi · Data Lake

0 likes · 13 min read


DataFunSummit

Dec 1, 2025 · Big Data

7 Cutting-Edge Data Engineering Practices Shaping AI-Driven Data Lakes

This article collection showcases seven advanced data engineering solutions—from Tencent Cloud's Iceberg batch‑stream integration and Apache Gravitino metadata lineage to Xiaohongshu's Lakehouse evolution and multimodal AI data lake implementations—highlighting architectural innovations, performance optimizations, and real‑world deployment insights for modern big‑data platforms.

Apache Gravitino · Apache Iceberg · Batch-Stream Integration

0 likes · 7 min read


Past Memory Big Data

Dec 1, 2025 · Big Data

Apache XTable: A Universal Translator for Data Lake Format Interoperability

Apache XTable introduces a lightweight metadata translation layer that decouples data storage from format metadata, enabling zero‑copy, omni‑directional conversion among Hudi, Iceberg, and Delta Lake, allowing organizations to write with one format and read with any engine without duplicating Parquet files.

Apache XTable · Data Lake · Delta Lake

0 likes · 7 min read


DataFunSummit

Nov 24, 2025 · Big Data

How Tencent Cloud Uses Iceberg, Gravitino and Multimodal Lakes for Unified Data Processing

This article series explores Tencent Cloud's Iceberg‑based batch‑stream integration, Apache Gravitino's unified metadata and lineage solution, Xiaohongshu's data‑architecture evolution for the Big AI Data era, and a practical Data+AI multimodal data‑lake implementation, highlighting challenges, architectural designs, and performance gains.

Big Data · Data Lake · Iceberg

0 likes · 7 min read


DataFunTalk

Nov 22, 2025 · Big Data

How Modern Data Lakes and AI Governance Transform Enterprise Analytics

This article collection examines Tencent Cloud’s Iceberg batch‑stream integration, AI‑driven game data governance, Apache Gravitino unified metadata and lineage, Xiaohongshu’s multimodal data‑lake evolution, and Volcano Engine’s Data+AI multimodal lake, highlighting architectures, techniques, performance gains, and practical implementations.

AI Governance · Data Lake · Gravitino

0 likes · 7 min read


Alibaba Cloud Big Data AI Platform

Nov 15, 2025 · Big Data

From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse

This article chronicles a ten‑year evolution of a self‑built big data platform—detailing early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, the operational challenges encountered, and the comprehensive shift to Alibaba Cloud EMR Serverless that delivered significant cost, performance, and stability gains while outlining future intelligent‑ecosystem plans.

Big Data · Data Lake · EMR Serverless

0 likes · 17 min read


Big Data Technology & Architecture

Nov 3, 2025 · Big Data

Taming Small Files in Paimon: Proven Tuning Strategies for Better Performance

This article explains how small‑file issues in Paimon's streaming data lake architecture degrade system stability and query speed, and presents practical parameter‑tuning, table‑level settings, asynchronous compaction, and monitoring techniques to mitigate those problems.

Big Data · Data Lake · Flink

0 likes · 7 min read


Big Data Technology & Architecture

Oct 20, 2025 · Big Data

Unlocking Lakehouse Power: Paimon and Doris Integrated Solutions

This article reviews how Paimon and Doris combine to solve unified storage, data visibility, and performance challenges in modern lakehouse architectures, detailing their complementary features, integration capabilities, and real‑world use cases from leading companies.

Analytics · Big Data · Data Lake

0 likes · 8 min read


JD Tech Talk

Oct 16, 2025 · Big Data

Understanding Apache Hudi Core Concepts: Timeline, File Layout, and Table Types

This article explains Apache Hudi's architecture, including its timeline mechanism, file layout, indexing strategies, table types (COW and MOR), query options, storage format versioning, backward compatibility, and key configuration settings for managing data lake tables.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 8 min read


DataFunSummit

Oct 6, 2025 · Artificial Intelligence

Why Vector Lakes Are the Next Frontier for AI Data Management

This article explains how Zilliz's Vector Lake extends traditional data lakes with a unified storage‑compute architecture optimized for massive unstructured and vector data, detailing its background, key data types, autonomous‑driving use case, data flow, architecture, and deployment options.

AI data management · Data Lake · Vector Lake

0 likes · 13 min read


IT Architects Alliance

Sep 21, 2025 · Big Data

From Data Warehouses to Lakehouses: Why Data Architecture Keeps Evolving

This article traces the three‑generation evolution of data architecture—from the structured‑data era of data warehouses, through the flexible, multi‑format data lake, to the unified lakehouse model—explaining the drivers, benefits, challenges, and future trends shaping modern data platforms.

Data Architecture · Data Lake · Data Warehouse

0 likes · 11 min read


360 Zhihui Cloud Developer

Sep 11, 2025 · Big Data

How Paimon Transforms Membership Data Warehousing: From Legacy Lambda to Real‑Time Lakehouse

This article examines the challenges of a legacy Lambda‑based membership data warehouse, introduces Apache Paimon’s lakehouse architecture and its key features, and showcases three real‑world implementations—partial‑update order wide tables, Bitmap‑based UV counting, and branch‑based data correction—while discussing benefits, remaining challenges, and future directions.

Big Data · Data Lake · Data Warehouse

0 likes · 29 min read


DataFunTalk

Sep 6, 2025 · Big Data

How Xiaomi Cuts Costs and Boosts Efficiency with a Cloud‑Native Lakehouse Architecture

Xiaomi’s data‑lake team explains how they tackled small‑file issues, unified metadata with Gravitino, migrated Hive to Iceberg and Fileset, leveraged JuiceFS for multi‑cloud storage, and combined Iceberg and Paimon to achieve cost‑effective, high‑performance batch and real‑time analytics.

Big Data · Cloud Native · Data Lake

0 likes · 13 min read


DataFunSummit

Sep 2, 2025 · Big Data

How Xiaomi Cuts Costs and Boosts Performance with Cloud‑Native Data Lake Architecture

Xiaomi’s engineers explain how they tackled data‑lake challenges—small files, metadata latency, and multi‑cloud costs—by combining compact storage, Gravitino‑based metadata governance, Iceberg and Paimon formats, and JuiceFS abstraction, achieving lower storage expenses, faster queries, and a roadmap toward intelligent, real‑time, multimodal lakehouses.

Big Data · Data Lake · Multi-Cloud

0 likes · 14 min read


Big Data Technology Tribe

Aug 12, 2025 · Databases

Why Lakehouse Architecture Is Redefining Modern Data Platforms

This article explains the evolution from traditional data warehouses and data lakes to the unified Lakehouse architecture, detailing its design, benefits, challenges, and research directions for delivering high‑performance SQL and advanced analytics on open‑format storage.

Big Data · Data Lake · Data Warehouse

0 likes · 20 min read


Past Memory Big Data

Jul 30, 2025 · Big Data

Why Iceberg Is Dropping Positional Deletes in Merge‑on‑Read Tables

The article explains how Apache Iceberg v3 replaces positional deletes in Merge‑on‑Read tables, which scale poorly, with compact Deletion Vectors, detailing the performance, I/O, and metadata drawbacks of positional deletes and showing how the new bitmap‑based approach resolves them.

Apache Iceberg · Data Lake · Deletion Vector

0 likes · 20 min read


DataFunSummit

Jul 23, 2025 · Big Data

Explore Cutting-Edge Lakehouse & Real-Time Data Solutions: A Curated Resource Guide

This guide presents a curated list of cutting‑edge lakehouse, real‑time analytics, and big‑data implementations from industry leaders, followed by a QR code that lets you download the full eBook for deeper insights.

Cloud Computing · Data Lake · Lakehouse

0 likes · 2 min read


Big Data Technology & Architecture

Jul 21, 2025 · Big Data

Essential Data Lake Interview Questions: Flink, Hudi, Row_Number, and Best Practices

This article reviews common data lake interview questions—covering problem definition, Flink-to-Hudi row_number deduplication, retract streams, pipeline architecture optimizations, and read/write best practices—providing concise explanations and practical insights for candidates.

Big Data Interview · Data Lake · Flink

0 likes · 7 min read


ITFLY8 Architecture Home

Jul 20, 2025 · Big Data

Exploring the Architecture of a Data Lake and Application Platform

This article outlines the overall architecture, data architecture, logical project structure, and the construction of a data resource center for a data lake and application platform, illustrated through a series of diagrams that depict each component and their interconnections.

Big Data · Data Lake · Data Platform

0 likes · 1 min read


DataFunSummit

Jul 18, 2025 · Big Data

Data Lake & Lakehouse Innovations: Real-Time Analytics and Industry Case Studies

This article presents a curated collection of cutting‑edge data lake and lakehouse case studies—including real‑time analytics, cloud‑native architectures, industry implementations from sales platforms to automotive IoT, and the latest advancements in open‑source projects—offering insights into modern big‑data strategies and governance.

Big Data · Data Lake · Lakehouse

0 likes · 2 min read


DataFunSummit

Jul 12, 2025 · Big Data

How Fluss Unifies Stream and Lake to Power AI Data Pipelines

In the era of rapid AI growth, Fluss offers a unified lake‑stream architecture that tackles data quality, timeliness, scale, and multimodal challenges by tightly integrating Flink streaming with a high‑performance data lake, enabling seamless real‑time and batch analytics for AI workloads.

AI · Data Lake · Flink

0 likes · 12 min read


Big Data Technology & Architecture

Jul 8, 2025 · Big Data

Flink’s AI Agents and Disaggregated State: Transforming Big Data

The article reviews key topics from the FFA2025 Singapore conference, highlighting Flink’s new AI‑focused Agents framework, the breakthrough Flink 2.0 disaggregated state architecture, emerging lake storage solutions like Paimon, and the Fluss streaming table store, illustrating how big‑data platforms are evolving for AI workloads.

AI agents · Big Data · Data Lake

0 likes · 6 min read


DataFunTalk

Jul 4, 2025 · Big Data

How Flink Agents and Flink 2.0 Are Powering Real‑Time AI at Scale

The Flink Forward Asia 2025 conference in Singapore showcased Apache Flink’s latest advances—including Flink Agents for system‑triggered AI, the cloud‑native Flink 2.0 with disaggregated state management, the multi‑modal lakehouse Paimon, and the Fluss table storage system—highlighting the ecosystem’s shift toward real‑time AI integration.

Apache Flink · Data Lake · Flink 2.0

0 likes · 9 min read


Baidu Geek Talk

Jun 30, 2025 · Big Data

How Baidu’s Turing 3.0 Leverages Apache Iceberg to Boost Data Lake Performance

This article explains how Baidu’s next‑generation data platform Turing 3.0 integrates Apache Iceberg to solve the inefficiencies of the legacy MEG stack, detailing ecosystem components, migration strategies from Hive, table‑level optimizations, and future roadmap for high‑frequency, low‑latency analytics.

Apache Iceberg · Data Lake · Hive Migration

0 likes · 17 min read


StarRocks

Jun 26, 2025 · Databases

What’s New in StarRocks 3.5? Snapshot Backup, Bulk Load, Partition & Transaction Enhancements

StarRocks 3.5 introduces a cluster‑level Snapshot backup for fast recovery, a bulk‑load optimization that reduces small files and compaction cost, smarter partition management with time‑based merging and TTL, multi‑statement transactions with full ACID guarantees, low‑cardinality dictionary support for lake tables, and several security and performance upgrades.

ACID Transactions · Data Lake · Low Cardinality Dictionary

0 likes · 17 min read


DataFunSummit

Jun 19, 2025 · Big Data

How Shopee Leverages Paimon for Real‑Time Data Warehousing and Task Diagnosis

This article details how Shopee's Data Infra team uses the Paimon data lake to build near‑real‑time warehouses, accelerate ODS layers, implement a task‑diagnosis system, and create a reconciliation platform, and it closes with future plans and a Q&A session.

Data Lake · Flink · Paimon

0 likes · 12 min read


DataFunSummit

Jun 18, 2025 · Big Data

How Real‑Time Lakehouse and Apache Paimon Transform Modern Data Architecture

This article explains the concept of a real‑time lakehouse, compares it with traditional batch warehouses, introduces Apache Paimon and its innovations such as native upserts, LSM storage, tags and branches, and showcases multiple enterprise use cases that demonstrate its low‑cost, low‑latency stream‑batch integration.

Apache Paimon · Data Lake · real-time lakehouse

0 likes · 17 min read


DataFunSummit

Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Big Data · Data Lake

0 likes · 22 min read


Alibaba Cloud Big Data AI Platform

Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · Data Lake

0 likes · 12 min read


DataFunTalk

Jun 4, 2025 · Artificial Intelligence

Coupang’s Distributed Cache Architecture Accelerates AI/ML Model Training

Coupang’s AI platform replaces costly data‑copy steps with a distributed cache that automatically pulls data from a central lake, boosts GPU utilization across regions, cuts storage and operational expenses, and speeds up model training by up to 40% while simplifying deployment via Kubernetes.

AI · Data Lake · GPU

0 likes · 9 min read


Xiaohongshu Tech REDtech

May 19, 2025 · Industry Insights

How Xiaohongshu Built a Minute‑Level Near‑Real‑Time Data Warehouse with Incremental Computing

Facing billions of daily logs and the need for minute‑level experiment metrics, Xiaohongshu partnered with Yunqi Tech to design a generic incremental‑compute solution that delivers near‑real‑time data warehousing with lower cost, higher accuracy, simplified pipelines, and improved query performance.

Big Data · Data Lake · Flink

0 likes · 24 min read

Big Data Technology & Architecture

May 16, 2025 · Big Data

Apache Gravitino: An Open‑Source Metadata Lake for Unified Data and AI Asset Management

Apache Gravitino is an open‑source metadata service platform that provides a unified, high‑performance, geographically distributed metadata lake, enabling end‑to‑end data governance, multi‑engine access, and direct management of both structured and unstructured data assets across diverse systems.

Apache Gravitino · Data Governance · Data Lake

0 likes · 9 min read

Big Data Tech Team

May 11, 2025 · Industry Insights

What Is Data Architecture? A Complete Guide to Its Evolution, Frameworks, and Benefits

This article explains data architecture, its purpose, historical development, major enterprise frameworks, various data management system types, and the key advantages it brings to organizations, helping readers understand how to design and implement effective data solutions in modern environments.

Data Lake · Data Mesh · Data Warehouse

0 likes · 15 min read

Tencent Cloud Developer

May 8, 2025 · Big Data

How Setats Unifies Stream, Batch, and Incremental Processing for Real‑Time Data Lakes

At the 2025 DA Data+AI Conference in Shanghai, Tencent Cloud unveiled Setats—a unified stream‑batch‑incremental engine that cuts system costs, delivers second‑level data visibility and real‑time changelog generation, and demonstrates measurable performance gains in automotive IoT analytics while integrating tightly with the WeData platform.

Batch Processing · Big Data Architecture · Data Lake

0 likes · 5 min read

DataFunSummit

May 4, 2025 · Big Data

Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, A/B testing support, catalog enhancements, and future roadmap for full lifecycle data governance.

Big Data · Data Lake · Huawei Cloud

0 likes · 13 min read

DataFunTalk

Apr 9, 2025 · Big Data

Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, TikTok, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.

AI · Apache Hudi · Batch Processing

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Apr 9, 2025 · Big Data

How We Built an Intelligent Data Warehouse on Alibaba Cloud MaxCompute

This article details the business background, technical challenges, and the step‑by‑step implementation of an intelligent data warehouse on Alibaba Cloud MaxCompute, covering offline data pipelines, metric calculation, data analysis, and future plans for data lake and AI‑driven analytics.

Analytics · Big Data · Data Lake

0 likes · 10 min read

DataFunSummit

Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake

0 likes · 13 min read

Kuaishou Tech

Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit, held in Beijing, drew over 230 attendees and featured technical discussions on data lake optimization and case studies from companies such as Kuaishou and Meituan.

Apache Hudi · Big Data · Data Engineering

0 likes · 12 min read

dbaplus Community

Mar 22, 2025 · Big Data

Why Data Lakes Are Crucial for Observability—and When They’re Not the Answer

The article explains how data lakes serve as a foundational component for observability by aggregating raw, diverse data for advanced analysis, while also outlining the technical, cost, and scalability challenges that make them unsuitable for every organization.

Analytics · Big Data · Data Lake

0 likes · 10 min read

AntData

Mar 20, 2025 · Big Data

Design and Optimization of Real‑time Data Lake Tables with Paimon and Flink for Advertising Diagnostics

This article presents a comprehensive exploration of using Apache Paimon and Flink to design lake tables that support minute‑level latency, low cost, and unified batch‑stream processing for advertising data, covering schema design, partitioning strategies, performance trade‑offs, cost analysis, and operational best practices.

Big Data · Data Lake · Flink

0 likes · 34 min read

Alimama Tech

Mar 12, 2025 · Big Data

Design and Evolution of Alibaba Advertising Real-Time Data Warehouse

Alimama’s advertising platform migrated from a monolithic Flink‑Kafka pipeline to a layered Paimon lakehouse, adding DWS upsert support and multi‑layer storage. The migration delivers minute‑level data freshness, cuts latency by 2.5 hours, reduces resource use by over 40%, halves development effort, and achieves ≥99.9% availability.

Advertising · Alibaba · Data Lake

0 likes · 18 min read

Alibaba Cloud Infrastructure

Mar 6, 2025 · Big Data

Leveraging Apache Iceberg and AutoMQ for Real-Time Data Lake Ingestion: Architecture, Best Practices, and Cost Optimization

This article examines how Apache Iceberg’s snapshot‑based ACID transactions, logical‑physical partition evolution, and COW/MOR update modes enable efficient real‑time data lake ingestion, and demonstrates AutoMQ’s Kafka‑to‑Iceberg Table Topic solution that simplifies schema management, reduces latency, and cuts operational costs.

Apache Iceberg · AutoMQ · Big Data

0 likes · 14 min read

Volcano Engine Developer Services

Mar 5, 2025 · Artificial Intelligence

How DeepSeek Smallpond Powers AI Data Processing with Ray and DuckDB

This article introduces DeepSeek Smallpond, a lightweight yet high‑performance AI data‑processing engine built on Ray and DuckDB, explains its dual DataFrame and LogicalPlan APIs, showcases integration with Volcano Engine's AI Data Lake LAS, and provides practical code examples for distributed processing, multimodal storage, and RAG pipelines.

AI data processing · Data Lake · Distributed Computing

0 likes · 18 min read

DataFunSummit

Feb 23, 2025 · Big Data

Douyin Group’s ByteLake Data Lake Table Optimization and Management Practices

This article presents Douyin Group’s ByteLake, a heavily customized Apache Hudi‑based data lake table framework, detailing its core concepts, metadata services, write and read optimizations, operational challenges, a fully managed table management service, and its integration with the Amoro open‑source platform.

Amoro · Apache Hudi · Big Data

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Feb 21, 2025 · Big Data

Building a Scalable IoT Data Platform with Alibaba EMR Serverless Spark

Midea Building Technology shares how its IoT data platform leverages Alibaba Cloud EMR Serverless Spark, Hudi Lakehouse, and Serverless StarRocks to achieve real‑time ingestion, massive scale processing, AI‑driven analytics, and significant performance and cost improvements for building‑system management.

Big Data · Data Lake · EMR Serverless Spark

0 likes · 12 min read

JD Tech

Feb 11, 2025 · Big Data

Cold‑Hot Data Tiering and Performance Optimization in Apache Doris for JD Advertising

This article presents JD Advertising's engineering experience with Apache Doris, describing the evolution from a data‑lake cold‑data solution to a native cold‑hot tiering approach, detailing performance regressions after upgrading to Doris 2.0, and outlining a series of optimizations for query speed, CPU and memory usage, schema‑change efficiency, and automated data migration and restoration.

Apache Doris · Big Data · Data Lake

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Jan 23, 2025 · Big Data

How Alibaba Cloud DataWorks Leverages Flink CDC for Scalable Data Lake Integration

Alibaba Cloud DataWorks’ Data Integration platform, built on Flink CDC, offers a comprehensive, serverless solution for real‑time and batch data lake ingestion, detailing its architecture, elastic scaling, productized use cases, and future roadmap, including AI‑driven diagnostics and expanded source support.

Big Data · Data Integration · Data Lake

0 likes · 12 min read

JD Cloud Developers

Jan 16, 2025 · Artificial Intelligence

JD Retail’s 2024 Tech Innovations: AI, Supply Chain, and Immersive Shopping

In 2024, JD Retail Technology rolled out a series of breakthroughs—including a major JD APP redesign, data‑driven inventory algorithms, an AIGC content platform, a low‑code national‑subsidy system, a high‑performance data lake, cross‑platform Taro on Harmony, AI‑powered merchant assistants, and immersive XR shopping—showcasing how AI and advanced engineering drive faster fulfillment, richer user experiences, and scalable innovation.

AI · AIGC · Cross‑Platform Development

0 likes · 18 min read

JD Retail Technology

Jan 15, 2025 · Industry Insights

JD Retail’s 2024 Tech Innovations: AI, Supply‑Chain Algorithms, and Development

In 2024 JD Retail Technology delivered a series of breakthroughs—including a major JD APP redesign, a data‑driven inventory selection and allocation algorithm that cut stockouts, an AIGC platform for marketing content, a low‑code national‑subsidy system, a large‑scale Apache Hudi data lake, the Taro‑on‑Harmony cross‑platform framework, immersive XR shopping experiences, and a domestic‑chip AI engine—showcasing how advanced AI, cloud, and operations engineering are reshaping e‑commerce.

AI · AIGC · Cross‑Platform Development

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Jan 14, 2025 · Big Data

How Fluss Unifies Lake and Stream for Real‑Time Analytics: Architecture, Benefits, and Future Roadmap

This article summarizes a talk by Alibaba Cloud senior engineer and Flink Committer Luo Yuxia on the challenges of separating lake and stream storage, introduces the Fluss lake‑stream unified architecture, explains its technical benefits such as second‑level data freshness, unified metadata, efficient changelog generation, and outlines future plans for broader ecosystem integration.

Data Lake · Flink · Fluss

0 likes · 13 min read

Alibaba Cloud Developer

Dec 25, 2024 · Big Data

Build a Low‑Cost, High‑Performance Game Player Profiling Platform with Alibaba Cloud EMR StarRocks

This tutorial walks you through using Alibaba Cloud EMR Serverless StarRocks and Apache Paimon to create a cost‑effective, high‑performance game player profiling and behavior analysis platform, covering data import, materialized view creation, DWD/ADS layer construction, and lakehouse integration.

Alibaba Cloud · Data Lake · Game Analytics

0 likes · 12 min read

Tencent Advertising Technology

Dec 6, 2024 · Big Data

Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent

Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark SPJ optimizations to achieve minute‑level feature update latency, 10× back‑fill speed, and up to 60% storage savings.

Big Data · CDC · Compaction

0 likes · 25 min read

Tongcheng Travel Technology Center

Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

Bilibili Tech

Nov 26, 2024 · Big Data

Bilibili’s Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practices

Bilibili migrated its massive user‑behavior, commercial AI training, and database synchronization pipelines from Hive and Kafka to an Iceberg‑based streaming‑batch architecture, using Flink and the Magnus optimizer to achieve minute‑level freshness, reduce CPU and memory usage by about 20‑22 %, save roughly 3.55 M CNY annually, and dramatically improve query latency and join performance.

Batch · Data Integration · Data Lake

0 likes · 20 min read

DataFunSummit

Nov 23, 2024 · Big Data

Bilibili's Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practice

This article presents Bilibili's end‑to‑end exploration of a streaming‑batch unified data pipeline built on Apache Iceberg, detailing the original and iterated architectures for massive user behavior transmission, online AI training, DB synchronization, and dimension‑join, along with performance gains, cost savings, and future plans.

Batch Processing · Data Lake · Flink

0 likes · 20 min read

iQIYI Technical Product Team

Nov 21, 2024 · Big Data

Alluxio Integration and Optimization for Multi‑AZ Big Data Analytics at iQIYI

iQIYI integrates Alluxio with its QBFS multi‑AZ unified scheduling system, automatically caching hot tables, applying table‑level policies, page‑level storage and AZ‑aware worker selection, which together cut cross‑zone traffic, halve query latency, achieve up to 20× I/O speedup and a three‑fold overall performance boost.

Alluxio · Cache Optimization · Data Lake

0 likes · 23 min read

DataFunSummit

Nov 20, 2024 · Artificial Intelligence

How Data Lakes Empower AI: Expert Insights on Feature Management, Columnar Storage, and Vector Formats

In a panel discussion, experts explain how data‑lake‑warehouse integration, columnar formats like Apache Iceberg, and emerging variant types enable efficient feature engineering, support large‑language‑model workloads, and provide flexible vector storage, thereby driving the evolution of AI from traditional ML to the GenAI era.

Apache Iceberg · Data Lake · GenAI

0 likes · 6 min read

Baidu Geek Talk

Nov 13, 2024 · Industry Insights

Why Cloud‑Native Data Lakes Are the New Standard for Storage Acceleration

This article analyzes the evolution of data‑lake storage acceleration, compares traditional parallel file systems, object‑storage‑based solutions and modern cache‑enabled architectures, and explains how cloud‑native data lakes address scalability, cost, and performance challenges for AI and big‑data workloads.

AI · Big Data · Cloud Native

0 likes · 24 min read

Baidu Intelligent Cloud Tech Hub

Nov 12, 2024 · Big Data

Why Data Lake Storage Acceleration Is the New Standard in Cloud‑Native AI

The article examines the evolution of data lake storage acceleration, compares various solutions, and explains how metadata, read/write, and end‑to‑end optimizations enable scalable, cost‑effective AI and big‑data workloads in cloud‑native environments.

AI training · Big Data · Data Lake

0 likes · 24 min read

DataFunSummit

Nov 8, 2024 · Big Data

Roundtable Discussion on Data Lake Technology Maturity and Governance Practices

Experts from Kuaishou and Ping An Insurance, a former Tencent expert, and others discuss data lake maturity, column‑level governance, resource management of unstructured data, and automated optimization techniques such as Iceberg small‑file merging, highlighting how these advances improve data quality and business decision‑making.

Big Data · Column-level Governance · Data Lake

0 likes · 6 min read

DataFunTalk

Nov 6, 2024 · Big Data

How Data Lakes Empower AI: Insights from Industry Experts

In a panel discussion, experts from Kuaishou, Ping An, and Datastrato explain how data lake architectures, columnar storage formats like Apache Iceberg, and vector‑enabled lake formats are enhancing feature management, supporting generative AI workloads, and accelerating machine‑learning pipelines.

AI · Apache Iceberg · Big Data

0 likes · 6 min read

Baidu Tech Salon

Nov 5, 2024 · Big Data

Accelerating Data Lake Storage for Big Data and AI: Baidu's Solutions

Baidu’s Data Lake Storage Acceleration 2.0 replaces traditional HDFS with a scalable object‑storage foundation, introducing an adaptive hierarchical namespace, high‑throughput streaming engine, RapidFS caching, and fully compatible BOS‑HDFS APIs, thereby delivering up to 70 % higher throughput, lower costs, and seamless migration for big‑data and AI workloads.

AI · BOS-HDFS · Big Data

0 likes · 11 min read

Baidu Geek Talk

Nov 4, 2024 · Big Data

Why Object Storage Is Replacing HDFS for Modern Data Lakes: Baidu’s 2.0 Acceleration

Data lakes have evolved from HDFS to object storage, addressing resource inefficiency, scalability limits, and operational burdens; Baidu’s Data Lake Storage Acceleration 2.0 introduces hierarchical Namespace 2.0, a streaming storage engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer to boost performance and support massive AI workloads.

AI · Baidu · Big Data

0 likes · 12 min read

Bilibili Tech

Nov 1, 2024 · Big Data

Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform

Magnus is Bilibili’s self‑developed intelligent service that continuously optimizes Iceberg tables by scheduling snapshot expiration, orphan‑file cleanup, manifest rewriting, and multi‑dimensional data optimizations—including small‑file merging, sorting, distribution, and index creation—while automatically recommending configurations from real‑time query logs, delivering over 99.9% task success and up to 30% scan‑data reduction.

Data Lake · Iceberg · Intelligent Recommendation

0 likes · 15 min read

Baidu Intelligent Cloud Tech Hub

Oct 28, 2024 · Cloud Native

How Baidu Smart Cloud Reinvents Cloud‑Native Infrastructure for the AI‑Native Era

The talk outlines Baidu Smart Cloud's comprehensive cloud‑native redesign—including ultra‑elastic compute, AI‑focused storage, high‑performance networking, AI‑driven operations, and edge‑distributed services—illustrated with automotive and fintech case studies that demonstrate how enterprises can accelerate digital transformation in the AI‑native age.

AI Infrastructure · Data Lake · MLOps

0 likes · 12 min read

Shopee Tech Team

Oct 25, 2024 · Big Data

StarRocks at Shopee: Practical Use Cases and Performance Analysis

Shopee’s deployment of StarRocks across DataService, DataGo, and DataStudio demonstrates that its vectorized engine, cost‑based optimizer, and materialized‑view caching can query Hive, Iceberg, Delta Lake and Hudi up to 20,000× faster than Presto, cutting CPU usage and delivering consistently lower latency for complex analytics.

Data Lake · Hive · MPP

0 likes · 11 min read

JD Retail Technology

Oct 11, 2024 · Big Data

JD Retail Data Lake Architecture: Challenges, Optimizations, and Future Plans

This article presents JD Retail's data lake architecture overhaul, detailing the shortcomings of the Lambda model, the migration to Flink‑Hudi‑Spark pipelines, performance gains, storage savings, unified APIs, and upcoming improvements for resilience and automation.

Big Data · Data Lake · Flink

0 likes · 11 min read

DataFunTalk

Oct 3, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture, Design Principles, Core Functions, and Open‑Source Solutions

Amid growing data demands, this article explains the data lake technology maturity curve, detailing lake‑warehouse architectural patterns, design principles, core functionalities, and the four leading open‑source solutions (Hudi, Iceberg, Delta Lake, Paimon) to guide enterprises in building flexible, scalable, and governed data platforms.

Big Data · Data Architecture · Data Lake

0 likes · 10 min read

DataFunSummit

Oct 1, 2024 · Big Data

Apache Hudi from Zero to One: Highlighting Key Features of Version 1.0 (Part 10)

The article explains Apache Hudi’s three‑layer architecture and details four major 1.0 enhancements—LSM‑tree timeline, non‑blocking concurrency control, file‑group reader/writer APIs, and function indexes—while providing a brief review and links to the Hudi 1.x RFC.

Apache Hudi · Big Data · Concurrency Control

0 likes · 9 min read

DataFunSummit

Sep 27, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture Modes, Design Principles, Core Functions, and Applications

This article explains the data lake technology maturity curve, covering lake‑warehouse architecture patterns, design principles, core capabilities of major open‑source lake engines (Hudi, Iceberg, Delta Lake, Paimon), and practical application scenarios for modern data‑driven enterprises.

Big Data · Data Lake · Delta Lake

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Sep 27, 2024 · Big Data

How Alibaba Cloud’s New Vectorized Engines Are Revolutionizing Real‑Time Big Data Processing

At the 2024 Apsara Conference (Yunqi Conference), Alibaba Cloud unveiled a suite of vectorized big‑data solutions, including the Flash engine for Flink, EMR Serverless Spark with a 300% speed boost, and an upgraded lakehouse architecture, alongside real‑world case studies showing massive performance gains, cost reductions, and broader serverless adoption.

Big Data · Data Lake · Flink

0 likes · 8 min read

DataFunSummit

Sep 26, 2024 · Big Data

Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

Apache Hudi · Big Data · Change Data Capture

0 likes · 9 min read

DataFunTalk

Sep 24, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture Modes, Design Principles, Core Functions, and Applications

This article explains the rapid growth of data-driven businesses, the challenges of traditional data warehouses, and how modern data lake technologies such as Delta Lake, Hudi, Iceberg, and Paimon form a maturity curve that guides enterprises in architecture choices, design principles, core capabilities, and practical applications.

Big Data · Data Lake · Delta Lake

0 likes · 12 min read

DataFunSummit

Sep 14, 2024 · Big Data

Apache Hudi Concurrency Control: Overview, MVCC, and OCC

This article provides a comprehensive overview of concurrency control in Apache Hudi, explaining ACID properties, the role of MVCC and OCC, and how Hudi coordinates multiple writers and table services to achieve serializable scheduling while maintaining high performance.

Apache Hudi · Big Data · Concurrency Control

0 likes · 8 min read
