Tagged articles
343 articles
DataFunSummit
May 20, 2026 · Big Data

How Kuaishou’s Real‑Time Data Lake Boosts AI and BI Architecture

The article explains how Kuaishou partnered with Apache Hudi to overhaul its ODS‑based data lake, addressing latency, storage cost, and complexity for AI and BI workloads. It details the evolution from MySQL‑to‑Hive to MySQL‑to‑Hudi 1.0 and 2.0, the resulting performance gains and cost savings, and the future roadmap.

AI · BI · Big Data
0 likes · 20 min read
DataFunSummit
May 11, 2026 · Artificial Intelligence

How Lance Powers Enterprise Multimodal AI Data Lakes

The article analyzes why 74% of AI projects fail due to feedback gaps and data silos, explains how the open‑source Lance format addresses these issues with unified multimodal storage, outlines a layered Lance‑on‑Ray architecture, and details three real‑world practices—implicit feedback loops, GPU‑accelerated self‑evolution, and semantic knowledge‑graph evolution—to boost R&D efficiency.

CAGRA · Daft · Data Lake
0 likes · 13 min read
DataFunSummit
May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format, detailing why traditional lakes fall short for multimodal data, the engineering enhancements such as Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, Row‑ID shuffle optimization, and real‑world case studies that demonstrate significant performance and cost gains.

AI · Binary Copy Compaction · Data Lake
0 likes · 18 min read
DataFunSummit
Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Data Lake · distributed cache
0 likes · 11 min read
DataFunTalk
Apr 18, 2026 · Databases

How Will Apache Doris Evolve in 2026 to Power AI‑Driven Data Workloads?

The article outlines Apache Doris's 2026 roadmap, detailing how the database will shift from pure analytics to a unified AI‑enabled platform with enhanced semi‑structured data support, vector and hybrid search, agent‑focused capabilities, and expanded storage and lakehouse integrations to meet emerging AI workloads.

AI integration · Apache Doris · Data Lake
0 likes · 14 min read
DataFunTalk
Mar 3, 2026 · Big Data

Exploring Tencent Cloud’s Iceberg Batch‑Stream Integration and AI‑Driven Data Governance

This article presents a series of seven technical case studies—including Tencent Cloud’s Iceberg‑based batch‑stream integration, AI‑driven data governance with Apache Gravitino, Xiaohongshu’s lakehouse evolution, and a multimodal data‑lake solution—detailing challenges, architectural designs, implementation steps, performance results, and future directions.

AI · Big Data · Data Lake
0 likes · 8 min read
StarRocks
Feb 11, 2026 · Big Data

How StarRocks and Apache Paimon Build a True Lakehouse Native Engine

This article details the deep integration of StarRocks with Apache Paimon, describing the unified architecture, version evolution, performance enhancements, time‑travel queries, native readers/writers, distributed planning, and future roadmap for achieving lakehouse‑native analytics at scale.

Apache Paimon · Data Lake · Lakehouse
0 likes · 10 min read
DataFunSummit
Feb 8, 2026 · Big Data

Kuaishou’s Data Lake Upgrade with Hudi: Solving AI & BI Challenges

The article explains how Kuaishou modernized its data lake by partnering with Apache Hudi to address latency, storage cost, and consistency issues in both AI and BI pipelines, detailing architectural changes, new ingestion tools, partitioning strategies, compaction mechanisms, performance gains and future plans.

AI · BI · Big Data
0 likes · 20 min read
Alibaba Cloud Big Data AI Platform
Feb 4, 2026 · Big Data

How Paimon + StarRocks Power Real‑Time OLAP for Double‑11 Mega‑Sales

During Double‑11 mega‑sales, Taobao Group faced exploding OLAP query traffic, costly data sync pipelines, and slow near‑real‑time analytics, so they unified real‑time and batch data in Paimon, leveraged StarRocks for high‑performance lake queries, tuned cluster settings, and saved nearly ten‑million yuan annually while cutting refresh latency by 80%.

Big Data · Data Lake · OLAP
0 likes · 22 min read
Big Data Tech Team
Dec 29, 2025 · Big Data

Data Warehouse vs Data Mart vs Data Lake: Which Should Your Enterprise Choose?

The article explains the distinct roles of data warehouses, data marts, and data lakes, illustrates their differences with analogies and real‑world cases, outlines a three‑step strategy for enterprises, highlights common pitfalls, and offers a decision guide to help organizations choose the right architecture for their data needs.

Data Lake · Data Mart · Data Warehouse
0 likes · 11 min read
DataFunTalk
Dec 26, 2025 · Cloud Native

How Haier Built a Cloud‑Native Multi‑Modal Data Lake for AI‑Ready Manufacturing

Haier’s digital transformation leverages a cloud‑native, open‑source‑based multi‑modal data lake that unifies structured and unstructured industrial data, uses metadata models and knowledge graphs for governance, and provides AI‑ready services that balance performance, cost, and real‑time requirements.

AI · Data Lake · Multimodal Data
0 likes · 12 min read
JD Tech Talk
Dec 12, 2025 · Big Data

Understanding Hudi Core Concepts: Timeline, Indexes, and Table Types Explained

This article explains Apache Hudi’s core concepts, including its timeline architecture, file layout, indexing mechanisms, and the two primary table types—Copy on Write and Merge on Read—along with their trade‑offs and the various query modes such as snapshot, time‑travel, and incremental queries.

Apache Hudi · Big Data · Data Lake
0 likes · 9 min read
JD Cloud Developers
Dec 12, 2025 · Big Data

Apache Hudi Core Concepts: Timeline, Indexes, Table Types & Queries

This article explains Apache Hudi’s core architecture, detailing the timeline mechanism, file layout, indexing strategies, the two main table types (Copy‑On‑Write and Merge‑On‑Read), and various query modes such as snapshot, time‑travel, read‑optimized and incremental queries.

Apache Hudi · Big Data · Data Lake
0 likes · 9 min read
DataFunSummit
Dec 1, 2025 · Big Data

7 Cutting-Edge Data Engineering Practices Shaping AI-Driven Data Lakes

This article collection showcases seven advanced data engineering solutions—from Tencent Cloud's Iceberg batch‑stream integration and Apache Gravitino metadata lineage to Xiaohongshu's Lakehouse evolution and multimodal AI data lake implementations—highlighting architectural innovations, performance optimizations, and real‑world deployment insights for modern big‑data platforms.

Apache Gravitino · Apache Iceberg · Batch-Stream Integration
0 likes · 7 min read
DataFunSummit
Nov 24, 2025 · Big Data

How Tencent Cloud Uses Iceberg, Gravitino and Multimodal Lakes for Unified Data Processing

This article series explores Tencent Cloud's Iceberg‑based batch‑stream integration, Apache Gravitino's unified metadata and lineage solution, Xiaohongshu's data‑architecture evolution for the Big AI Data era, and a practical Data+AI multimodal data‑lake implementation, highlighting challenges, architectural designs, and performance gains.

Big Data · Data Lake · Iceberg
0 likes · 7 min read
DataFunTalk
Nov 22, 2025 · Big Data

How Modern Data Lakes and AI Governance Transform Enterprise Analytics

This article collection examines Tencent Cloud’s Iceberg batch‑stream integration, AI‑driven game data governance, Apache Gravitino unified metadata and lineage, Xiaohongshu’s multimodal data‑lake evolution, and Volcano Engine’s Data+AI multimodal lake, highlighting architectures, techniques, performance gains, and practical implementations.

AI Governance · Data Lake · Gravitino
0 likes · 7 min read
Alibaba Cloud Big Data AI Platform
Nov 15, 2025 · Big Data

From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse

This article chronicles a ten‑year evolution of a self‑built big data platform—detailing early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, the operational challenges encountered, and the comprehensive shift to Alibaba Cloud EMR Serverless that delivered significant cost, performance, and stability gains while outlining future intelligent‑ecosystem plans.

Big Data · Data Lake · Spark
0 likes · 17 min read
DataFunSummit
Oct 6, 2025 · Artificial Intelligence

Why Vector Lakes Are the Next Frontier for AI Data Management

This article explains how Zilliz's Vector Lake extends traditional data lakes with a unified storage‑compute architecture optimized for massive unstructured and vector data, detailing its background, key data types, autonomous‑driving use case, data flow, architecture, and deployment options.

AI data management · Data Lake · Vector Lake
0 likes · 13 min read
IT Architects Alliance
Sep 21, 2025 · Big Data

From Data Warehouses to Lakehouses: Why Data Architecture Keeps Evolving

This article traces the three‑generation evolution of data architecture—from the structured‑data era of data warehouses, through the flexible, multi‑format data lake, to the unified lakehouse model—explaining the drivers, benefits, challenges, and future trends shaping modern data platforms.

Data Architecture · Data Lake · Data Warehouse
0 likes · 11 min read
360 Zhihui Cloud Developer
Sep 11, 2025 · Big Data

How Paimon Transforms Membership Data Warehousing: From Legacy Lambda to Real‑Time Lakehouse

This article examines the challenges of a legacy Lambda‑based membership data warehouse, introduces Apache Paimon’s lakehouse architecture and its key features, and showcases three real‑world implementations—partial‑update order wide tables, Bitmap‑based UV counting, and branch‑based data correction—while discussing benefits, remaining challenges, and future directions.

Big Data · Data Lake · Data Warehouse
0 likes · 29 min read
DataFunSummit
Sep 2, 2025 · Big Data

How Xiaomi Cuts Costs and Boosts Performance with Cloud‑Native Data Lake Architecture

Xiaomi’s engineers explain how they tackled data‑lake challenges—small files, metadata latency, and multi‑cloud costs—by combining compact storage, Gravitino‑based metadata governance, Iceberg and Paimon formats, and JuiceFS abstraction, achieving lower storage expenses, faster queries, and a roadmap toward intelligent, real‑time, multimodal lakehouses.

Big Data · Data Lake · Storage Optimization
0 likes · 14 min read
Big Data Technology Tribe
Aug 12, 2025 · Databases

Why Lakehouse Architecture Is Redefining Modern Data Platforms

This article explains the evolution from traditional data warehouses and data lakes to the unified Lakehouse architecture, detailing its design, benefits, challenges, and research directions for delivering high‑performance SQL and advanced analytics on open‑format storage.

Big Data · Data Lake · Data Warehouse
0 likes · 20 min read
DataFunSummit
Jul 18, 2025 · Big Data

Data Lake & Lakehouse Innovations: Real-Time Analytics and Industry Case Studies

This article presents a curated collection of cutting‑edge data lake and lakehouse case studies—including real‑time analytics, cloud‑native architectures, industry implementations from sales platforms to automotive IoT, and the latest advancements in open‑source projects—offering insights into modern big‑data strategies and governance.

Big Data · Data Lake · Lakehouse
0 likes · 2 min read
DataFunSummit
Jul 12, 2025 · Big Data

How Fluss Unifies Stream and Lake to Power AI Data Pipelines

In the era of rapid AI growth, Fluss offers a unified lake‑stream architecture that tackles data quality, timeliness, scale, and multimodal challenges by tightly integrating Flink streaming with a high‑performance data lake, enabling seamless real‑time and batch analytics for AI workloads.

AI · Data Lake · Flink
0 likes · 12 min read
Big Data Technology & Architecture
Jul 8, 2025 · Big Data

Flink’s AI Agents and Disaggregated State: Transforming Big Data

The article reviews key topics from the Flink Forward Asia (FFA) 2025 conference in Singapore, highlighting Flink’s new AI‑focused Agents framework, the breakthrough Flink 2.0 disaggregated state architecture, emerging lake storage solutions like Paimon, and the Fluss streaming table store, illustrating how big‑data platforms are evolving for AI workloads.

AI agents · Big Data · Data Lake
0 likes · 6 min read
DataFunTalk
Jul 4, 2025 · Big Data

How Flink Agents and Flink 2.0 Are Powering Real‑Time AI at Scale

The Flink Forward Asia 2025 conference in Singapore showcased Apache Flink’s latest advances—including Flink Agents for system‑triggered AI, the cloud‑native Flink 2.0 with disaggregated state management, the multi‑modal lakehouse Paimon, and the Fluss table storage system—highlighting the ecosystem’s shift toward real‑time AI integration.

Apache Flink · Data Lake · Flink 2.0
0 likes · 9 min read
Baidu Geek Talk
Jun 30, 2025 · Big Data

How Baidu’s Turing 3.0 Leverages Apache Iceberg to Boost Data Lake Performance

This article explains how Baidu’s next‑generation data platform Turing 3.0 integrates Apache Iceberg to solve the inefficiencies of the legacy MEG stack, detailing ecosystem components, migration strategies from Hive, table‑level optimizations, and future roadmap for high‑frequency, low‑latency analytics.

Apache Iceberg · Data Lake · Hive Migration
0 likes · 17 min read
StarRocks
Jun 26, 2025 · Databases

What’s New in StarRocks 3.5? Snapshot Backup, Bulk Load, Partition & Transaction Enhancements

StarRocks 3.5 introduces a cluster‑level Snapshot backup for fast recovery, a bulk‑load optimization that reduces small files and compaction cost, smarter partition management with time‑based merging and TTL, multi‑statement transactions with full ACID guarantees, low‑cardinality dictionary support for lake tables, and several security and performance upgrades.

ACID Transactions · Data Lake · Low Cardinality Dictionary
0 likes · 17 min read
DataFunSummit
Jun 18, 2025 · Big Data

How Real‑Time Lakehouse and Apache Paimon Transform Modern Data Architecture

This article explains the concept of a real‑time lakehouse, compares it with traditional batch warehouses, introduces Apache Paimon and its innovations such as native upserts, LSM storage, tags and branches, and showcases multiple enterprise use cases that demonstrate its low‑cost, low‑latency stream‑batch integration.

Apache Paimon · Data Lake · real-time lakehouse
0 likes · 17 min read
DataFunSummit
Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Big Data · Data Lake
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · Data Lake
0 likes · 12 min read
DataFunTalk
Jun 4, 2025 · Artificial Intelligence

Coupang’s Distributed Cache Architecture Accelerates AI/ML Model Training

Coupang’s AI platform replaces costly data‑copy steps with a distributed cache that automatically pulls data from a central lake, boosts GPU utilization across regions, cuts storage and operational expenses, and speeds up model training by up to 40% while simplifying deployment via Kubernetes.

AI · Data Lake · GPU
0 likes · 9 min read
Xiaohongshu Tech REDtech
May 19, 2025 · Industry Insights

How Xiaohongshu Built a Minute‑Level Near‑Real‑Time Data Warehouse with Incremental Computing

Facing billions of daily logs and the need for minute‑level experiment metrics, Xiaohongshu partnered with Yunqi Tech to design a generic incremental‑compute solution that delivers near‑real‑time data warehousing with lower cost, higher accuracy, simplified pipelines, and improved query performance.

Big Data · Data Lake · Flink
0 likes · 24 min read
Big Data Technology & Architecture
May 16, 2025 · Big Data

Apache Gravitino: An Open‑Source Metadata Lake for Unified Data and AI Asset Management

Apache Gravitino is an open‑source metadata service platform that provides a unified, high‑performance, geographically distributed metadata lake, enabling end‑to‑end data governance, multi‑engine access, and direct management of both structured and unstructured data assets across diverse systems.

Apache Gravitino · Data Governance · Data Lake
0 likes · 9 min read
Tencent Cloud Developer
May 8, 2025 · Big Data

How Setats Unifies Stream, Batch, and Incremental Processing for Real‑Time Data Lakes

At the 2025 DA Data+AI Conference in Shanghai, Tencent Cloud unveiled Setats—a unified stream‑batch‑incremental engine that cuts system costs, delivers second‑level data visibility and real‑time changelog generation, and demonstrates measurable performance gains in automotive IoT analytics while integrating tightly with the WeData platform.

Batch Processing · Big Data Architecture · Data Lake
0 likes · 5 min read
DataFunSummit
May 4, 2025 · Big Data

Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, A/B testing support, catalog enhancements, and future roadmap for full lifecycle data governance.

Big Data · Data Lake · Huawei Cloud
0 likes · 13 min read
DataFunTalk
Apr 9, 2025 · Big Data

Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, TikTok, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.

AI · Apache Hudi · Batch Processing
0 likes · 14 min read
DataFunSummit
Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake
0 likes · 13 min read
Kuaishou Tech
Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit in Beijing attracted over 230 attendees, featuring technical discussions on data lake optimization and case studies from companies like Kuaishou and Meituan.

Apache Hudi · Big Data · Data Lake
0 likes · 12 min read
AntData
Mar 20, 2025 · Big Data

Design and Optimization of Real‑time Data Lake Tables with Paimon and Flink for Advertising Diagnostics

This article presents a comprehensive exploration of using Apache Paimon and Flink to design lake tables that support minute‑level latency, low cost, and unified batch‑stream processing for advertising data, covering schema design, partitioning strategies, performance trade‑offs, cost analysis, and operational best practices.

Big Data · Data Lake · Flink
0 likes · 34 min read
Alimama Tech
Mar 12, 2025 · Big Data

Design and Evolution of Alibaba Advertising Real-Time Data Warehouse

Alimama’s advertising platform migrated from a monolithic Flink‑Kafka pipeline to a layered Paimon lakehouse, adding DWS upsert support and multi‑layer storage, which delivers minute‑level data freshness, cuts latency by 2.5 hours, reduces resource use by over 40%, halves development effort, and achieves ≥99.9% availability.

Advertising · Alibaba · Data Lake
0 likes · 18 min read
Alibaba Cloud Infrastructure
Mar 6, 2025 · Big Data

Leveraging Apache Iceberg and AutoMQ for Real-Time Data Lake Ingestion: Architecture, Best Practices, and Cost Optimization

This article examines how Apache Iceberg’s snapshot‑based ACID transactions, logical‑physical partition evolution, and COW/MOR update modes enable efficient real‑time data lake ingestion, and demonstrates AutoMQ’s Kafka‑to‑Iceberg Table Topic solution that simplifies schema management, reduces latency, and cuts operational costs.

Apache Iceberg · AutoMQ · Big Data
0 likes · 14 min read
Volcano Engine Developer Services
Mar 5, 2025 · Artificial Intelligence

How DeepSeek Smallpond Powers AI Data Processing with Ray and DuckDB

This article introduces DeepSeek Smallpond, a lightweight yet high‑performance AI data‑processing engine built on Ray and DuckDB, explains its dual DataFrame and LogicalPlan APIs, showcases integration with Volcano Engine's AI Data Lake LAS, and provides practical code examples for distributed processing, multimodal storage, and RAG pipelines.

AI data processing · Data Lake · DuckDB
0 likes · 18 min read
DataFunSummit
Feb 23, 2025 · Big Data

Douyin Group’s ByteLake Data Lake Table Optimization and Management Practices

This article presents Douyin Group’s ByteLake, a heavily customized Apache Hudi‑based data lake table framework, detailing its core concepts, metadata services, write and read optimizations, operational challenges, a fully managed table management service, and its integration with the Amoro open‑source platform.

Amoro · Apache Hudi · Big Data
0 likes · 11 min read
JD Tech
Feb 11, 2025 · Big Data

Cold‑Hot Data Tiering and Performance Optimization in Apache Doris for JD Advertising

This article presents JD Advertising's engineering experience with Apache Doris, describing the evolution from a data‑lake cold‑data solution to a native cold‑hot tiering approach, detailing performance regressions after upgrading to Doris 2.0, and outlining a series of optimizations for query speed, CPU and memory usage, schema‑change efficiency, and automated data migration and restoration.

Apache Doris · Big Data · Data Lake
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Jan 23, 2025 · Big Data

How Alibaba Cloud DataWorks Leverages Flink CDC for Scalable Data Lake Integration

Alibaba Cloud DataWorks’ Data Integration platform, built on Flink CDC, offers a comprehensive, serverless solution for real‑time and batch data lake ingestion, detailing its architecture, elastic scaling, productized use cases, and future roadmap, including AI‑driven diagnostics and expanded source support.

Big Data · Data Integration · Data Lake
0 likes · 12 min read
JD Cloud Developers
Jan 16, 2025 · Artificial Intelligence

JD Retail’s 2024 Tech Innovations: AI, Supply Chain, and Immersive Shopping

In 2024, JD Retail Technology rolled out a series of breakthroughs—including a major JD APP redesign, data‑driven inventory algorithms, an AIGC content platform, a low‑code national‑subsidy system, a high‑performance data lake, cross‑platform Taro on Harmony, AI‑powered merchant assistants, and immersive XR shopping—showcasing how AI and advanced engineering drive faster fulfillment, richer user experiences, and scalable innovation.

AI · AIGC · Cross‑platform development
0 likes · 18 min read
JD Retail Technology
Jan 15, 2025 · Industry Insights

JD Retail’s 2024 Tech Innovations: AI, Supply‑Chain Algorithms, and Development

In 2024 JD Retail Technology delivered a series of breakthroughs—including a major JD APP redesign, a data‑driven inventory selection and allocation algorithm that cut stockouts, an AIGC platform for marketing content, a low‑code national‑subsidy system, a large‑scale Apache Hudi data lake, the Taro‑on‑Harmony cross‑platform framework, immersive XR shopping experiences, and a domestic‑chip AI engine—showcasing how advanced AI, cloud, and operations engineering are reshaping e‑commerce.

AI · AIGC · Cross‑platform development
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Jan 14, 2025 · Big Data

How Fluss Unifies Lake and Stream for Real‑Time Analytics: Architecture, Benefits, and Future Roadmap

This article summarizes a talk by Alibaba Cloud senior engineer and Flink Committer Luo Yuxia on the challenges of separating lake and stream storage. It introduces the Fluss lake‑stream unified architecture, explains technical benefits such as second‑level data freshness, unified metadata, and efficient changelog generation, and outlines future plans for broader ecosystem integration.

Data Lake · Flink · Fluss
0 likes · 13 min read
Alibaba Cloud Developer
Dec 25, 2024 · Big Data

Build a Low‑Cost, High‑Performance Game Player Profiling Platform with Alibaba Cloud EMR StarRocks

This tutorial walks you through using Alibaba Cloud EMR Serverless StarRocks and Apache Paimon to create a cost‑effective, high‑performance game player profiling and behavior analysis platform, covering data import, materialized view creation, DWD/ADS layer construction, and lakehouse integration.

Alibaba Cloud · Data Lake · Game Analytics
0 likes · 12 min read
Tencent Advertising Technology
Dec 6, 2024 · Big Data

Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent

Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark SPJ optimizations to achieve minute‑level feature update latency, 10× back‑fill speed, and up to 60% storage savings.

Big Data · CDC · Data Lake
0 likes · 25 min read
Tongcheng Travel Technology Center
Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake
0 likes · 7 min read
Bilibili Tech
Nov 26, 2024 · Big Data

Bilibili’s Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practices

Bilibili migrated its massive user‑behavior, commercial AI training, and database synchronization pipelines from Hive and Kafka to an Iceberg‑based streaming‑batch architecture, using Flink and the Magnus optimizer to achieve minute‑level freshness, reduce CPU and memory usage by about 20‑22%, save roughly 3.55 M CNY annually, and dramatically improve query latency and join performance.

Batch · Data Integration · Data Lake
0 likes · 20 min read
DataFunSummit
Nov 23, 2024 · Big Data

Bilibili's Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practice

This article presents Bilibili's end‑to‑end exploration of a streaming‑batch unified data pipeline built on Apache Iceberg, detailing the original and iterated architectures for massive user behavior transmission, online AI training, DB synchronization, and dimension‑join, along with performance gains, cost savings, and future plans.

Batch Processing · Data Lake · Flink
0 likes · 20 min read
iQIYI Technical Product Team
Nov 21, 2024 · Big Data

Alluxio Integration and Optimization for Multi‑AZ Big Data Analytics at iQIYI

iQIYI integrates Alluxio with its QBFS multi‑AZ unified scheduling system, automatically caching hot tables, applying table‑level policies, page‑level storage and AZ‑aware worker selection, which together cut cross‑zone traffic, halve query latency, achieve up to 20× I/O speedup and a three‑fold overall performance boost.

Alluxio · Data Lake · Multi‑AZ
0 likes · 23 min read
DataFunSummit
Nov 20, 2024 · Artificial Intelligence

How Data Lakes Empower AI: Expert Insights on Feature Management, Columnar Storage, and Vector Formats

In a panel discussion, experts explain how data‑lake‑warehouse integration, columnar storage and table formats like Apache Iceberg, and emerging variant types enable efficient feature engineering, support large‑language‑model workloads, and provide flexible vector storage, thereby driving the evolution of AI from traditional ML to the GenAI era.

Apache Iceberg · Data Lake · GenAI
0 likes · 6 min read
Baidu Geek Talk
Nov 13, 2024 · Industry Insights

Why Cloud‑Native Data Lakes Are the New Standard for Storage Acceleration

This article analyzes the evolution of data‑lake storage acceleration, compares traditional parallel file systems, object‑storage‑based solutions and modern cache‑enabled architectures, and explains how cloud‑native data lakes address scalability, cost, and performance challenges for AI and big‑data workloads.

AI · Big Data · Cloud Native
0 likes · 24 min read
DataFunSummit
Nov 8, 2024 · Big Data

Roundtable Discussion on Data Lake Technology Maturity and Governance Practices

Experts from Kuaishou, former Tencent, Ping An Insurance and others discuss data lake maturity, column‑level governance, resource management of unstructured data, and automated optimization techniques such as Iceberg small‑file merging, highlighting how these advances improve data quality and business decision‑making.

Big Data · Column-level Governance · Data Lake
0 likes · 6 min read
DataFunTalk
Nov 6, 2024 · Big Data

How Data Lakes Empower AI: Insights from Industry Experts

In a panel discussion, experts from Kuaishou, Ping An, and Datastrato explain how data lake architectures, columnar storage and table formats like Apache Iceberg, and vector‑enabled lake formats are enhancing feature management, supporting generative AI workloads, and accelerating machine‑learning pipelines.

AI · Apache Iceberg · Big Data
0 likes · 6 min read
Baidu Tech Salon
Nov 5, 2024 · Big Data

Accelerating Data Lake Storage for Big Data and AI: Baidu's Solutions

Baidu’s Data Lake Storage Acceleration 2.0 replaces traditional HDFS with a scalable object‑storage foundation, introducing an adaptive hierarchical namespace, high‑throughput streaming engine, RapidFS caching, and fully compatible BOS‑HDFS APIs, thereby delivering up to 70 % higher throughput, lower costs, and seamless migration for big‑data and AI workloads.

AI · BOS-HDFS · Big Data
0 likes · 11 min read
Baidu Geek Talk
Nov 4, 2024 · Big Data

Why Object Storage Is Replacing HDFS for Modern Data Lakes: Baidu’s 2.0 Acceleration

Data lakes have evolved from HDFS to object storage, addressing resource inefficiency, scalability limits, and operational burdens; Baidu’s Data Lake Storage Acceleration 2.0 introduces hierarchical Namespace 2.0, a streaming storage engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer to boost performance and support massive AI workloads.

AI · Baidu · Big Data
0 likes · 12 min read
Bilibili Tech
Nov 1, 2024 · Big Data

Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform

Magnus is Bilibili’s self‑developed intelligent service that continuously optimizes Iceberg tables by scheduling snapshot expiration, orphan‑file cleanup, manifest rewriting, and multi‑dimensional data optimizations—including small‑file merging, sorting, distribution, and index creation—while automatically recommending configurations from real‑time query logs, delivering over 99.9% task success and up to 30% scan‑data reduction.

Data Lake · Iceberg · Intelligent Recommendation
0 likes · 15 min read
Baidu Intelligent Cloud Tech Hub
Oct 28, 2024 · Cloud Native

How Baidu Smart Cloud Reinvents Cloud‑Native Infrastructure for the AI‑Native Era

The talk outlines Baidu Smart Cloud's comprehensive cloud‑native redesign—including ultra‑elastic compute, AI‑focused storage, high‑performance networking, AI‑driven operations, and edge‑distributed services—illustrated with automotive and fintech case studies that demonstrate how enterprises can accelerate digital transformation in the AI‑native age.

AI Infrastructure · Data Lake · Edge Computing
0 likes · 12 min read
Shopee Tech Team
Oct 25, 2024 · Big Data

StarRocks at Shopee: Practical Use Cases and Performance Analysis

Shopee’s deployment of StarRocks across DataService, DataGo, and DataStudio demonstrates that its vectorized engine, cost‑based optimizer, and materialized‑view caching can query Hive, Iceberg, Delta Lake and Hudi up to 20,000× faster than Presto, cutting CPU usage and delivering consistently lower latency for complex analytics.

Data Lake · Hive · MPP
0 likes · 11 min read
DataFunTalk
Oct 3, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture, Design Principles, Core Functions, and Open‑Source Solutions

Amid growing data demands, this article explains the data lake technology maturity curve, detailing lake‑warehouse architectural patterns, design principles, core functionalities, and the four leading open‑source solutions (Hudi, Iceberg, Delta Lake, Paimon) to guide enterprises in building flexible, scalable, and governed data platforms.

Big Data · Data Architecture · Data Lake
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
Sep 27, 2024 · Big Data

How Alibaba Cloud’s New Vectorized Engines Are Revolutionizing Real‑Time Big Data Processing

At the 2024 Apsara Conference (Yunqi Conference), Alibaba Cloud unveiled a suite of vectorized big‑data solutions—including the Flash engine for Flink, EMR Serverless Spark with a 300% speed boost, upgraded lakehouse architecture, and real‑world case studies—showcasing massive performance gains, cost reductions, and broader serverless adoption.

Big Data · Data Lake · Flink
0 likes · 8 min read
DataFunSummit
Sep 26, 2024 · Big Data

Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

Apache Hudi · Big Data · Change Data Capture
0 likes · 9 min read
DataFunTalk
Sep 24, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture Modes, Design Principles, Core Functions, and Applications

This article explains the rapid growth of data-driven businesses, the challenges of traditional data warehouses, and how modern data lake technologies such as Delta Lake, Hudi, Iceberg, and Paimon form a maturity curve that guides enterprises in architecture choices, design principles, core capabilities, and practical applications.

Big Data · Data Lake · Delta Lake
0 likes · 12 min read
DataFunSummit
Sep 14, 2024 · Big Data

Apache Hudi Concurrency Control: Overview, MVCC, and OCC

This article provides a comprehensive overview of concurrency control in Apache Hudi, explaining ACID properties, the role of MVCC and OCC, and how Hudi coordinates multiple writers and table services to achieve serializable scheduling while maintaining high performance.

Apache Hudi · Big Data · Concurrency Control
0 likes · 8 min read
DataFunTalk
Sep 4, 2024 · Artificial Intelligence

Data+AI Data Lake Technologies: Challenges, Apache Iceberg Overview, and Vector Table Implementations with PyIceberg

This article explores the evolution of data lakes for AI, discusses the challenges of AI-era data management, introduces Apache Iceberg and its architecture, demonstrates PyIceberg-based AI training and inference pipelines, and presents vector table designs with LSH indexing and performance optimizations.

AI · Apache Iceberg · Big Data
0 likes · 22 min read

How Hudi MetaServer Transforms Metadata Management and Performance in Data Lakes

This article examines the challenges of Hudi metadata stored on HDFS, introduces the independently developed Hudi MetaServer for centralized metadata, visual management, unified permission control, TTL, expression payloads, and multi‑active scaling, and outlines future enhancements such as LLS, multi‑table fusion, and JDBC support.

Big Data · Data Lake · Hudi
0 likes · 11 min read
DataFunSummit
Aug 4, 2024 · Big Data

Apache Hudi from Zero to One: Comprehensive Guide to Write Indexing (Part 4)

This article explains Apache Hudi’s write‑side indexing, detailing the indexing API, various index types—including simple, Bloom, bucket, HBase, and record‑level indexes—and their mechanisms, helping readers understand how Hudi validates record existence and optimizes updates and deletions.

Apache Hudi · Big Data · Data Lake
0 likes · 9 min read
DataFunSummit
Jul 12, 2024 · Big Data

Data Lake Development Trends, Architecture, Integration, Lakehouse Core Capabilities, and Open Design

This article examines the current evolution of data lakes, detailing their overall architecture, batch and real‑time integration methods, Lakehouse core functionalities such as enhanced DML, schema evolution, ACID support, and open‑design principles that enable multi‑cloud deployment and seamless interaction with diverse compute engines.

Batch Processing · Big Data Architecture · Data Lake
0 likes · 12 min read
Sohu Tech Products
Jul 10, 2024 · Industry Insights

How StarRocks and Apache Paimon Transform Data Lake Analytics and Migration

This article provides a practical deep‑dive into StarRocks and Apache Paimon, covering data‑lake fundamentals, the technical advantages of both platforms, performance gains over traditional engines, step‑by‑step migration strategies, deployment options on Alibaba Cloud EMR, and future roadmap plans.

Apache Paimon · Data Lake · Real-time analytics
0 likes · 15 min read
DataFunSummit
Jul 6, 2024 · Artificial Intelligence

Highlights of DataFunCon 2024 Beijing: Big Data, AI, and Large‑Model Trends

The two‑day DataFunCon 2024 Beijing conference gathered hundreds of big‑data and AI experts to discuss the evolution from data lakes to lake‑warehouses, large‑model development, practical applications, and future strategies for enterprises, while showcasing partner exhibitions and a vibrant community spirit.

Big Data · China · Data Lake
0 likes · 9 min read