Tag

Data Lake


DataFunSummit
Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Big Data · Data Lake
0 likes · 22 min read
DataFunTalk
Jun 4, 2025 · Artificial Intelligence

Coupang’s Distributed Cache Architecture Accelerates AI/ML Model Training

Coupang’s AI platform replaces costly data‑copy steps with a distributed cache that automatically pulls data from a central lake, boosts GPU utilization across regions, cuts storage and operational expenses, and speeds up model training by up to 40% while simplifying deployment via Kubernetes.

AI · Data Lake · GPU
0 likes · 9 min read
DataFunTalk
May 29, 2025 · Databases

Introducing DuckLake: An Integrated Data Lake and Catalog Format Powered by SQL

DuckDB's DuckLake is an open‑standard, SQL‑driven data lake and catalog format that simplifies lakehouse architecture by managing metadata in a database while storing data in scalable Parquet files, offering multi‑user collaboration, time‑travel queries, and MIT licensing.

Data Lake · SQL · databases
0 likes · 4 min read
DataFunSummit
May 4, 2025 · Big Data

Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei Terminal Cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, A/B testing support, catalog enhancements, and the roadmap toward full-lifecycle data governance.

Big Data · Data Lake · Huawei Cloud
0 likes · 13 min read
DataFunTalk
Apr 9, 2025 · Big Data

Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, TikTok, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.

AI · Apache Hudi · Big Data
0 likes · 14 min read
DataFunSummit
Apr 8, 2025 · Big Data

Huolala’s Real‑Time Data Synchronization with Flink CDC: Architecture, Practices, and Future Outlook

This article presents Huolala’s end‑to‑end implementation of Flink CDC for real‑time data capture, detailing the business background, reasons for selecting Flink CDC over Canal, component comparisons, production‑level platform enhancements, data‑lake integration, validation methods, and future directions for unified data ingestion.

Big Data · Data Lake · Data Synchronization
0 likes · 13 min read
DataFunSummit
Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and the future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake
0 likes · 13 min read
Kuaishou Tech
Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit in Beijing drew over 230 attendees and featured technical discussions on data lake optimization and case studies from companies like Kuaishou and Meituan.

Apache Hudi · Big Data · Data Engineering
0 likes · 12 min read
AntData
Mar 20, 2025 · Big Data

Design and Optimization of Real‑time Data Lake Tables with Paimon and Flink for Advertising Diagnostics

This article presents a comprehensive exploration of using Apache Paimon and Flink to design lake tables that support minute‑level latency, low cost, and unified batch‑stream processing for advertising data, covering schema design, partitioning strategies, performance trade‑offs, cost analysis, and operational best practices.

Advertising Analytics · Big Data · Data Lake
0 likes · 34 min read
Alimama Tech
Mar 12, 2025 · Big Data

Design and Evolution of Alibaba Advertising Real-Time Data Warehouse

Alimama’s advertising platform migrated from a monolithic Flink‑Kafka pipeline to a layered Paimon lakehouse, adding DWS upsert support and multi‑layer storage. The migration delivers minute‑level data freshness, cuts latency by 2.5 hours, reduces resource use by over 40%, halves development effort, and achieves ≥99.9% availability.

Alibaba · Data Lake · Paimon
0 likes · 18 min read
Alibaba Cloud Infrastructure
Mar 6, 2025 · Big Data

Leveraging Apache Iceberg and AutoMQ for Real-Time Data Lake Ingestion: Architecture, Best Practices, and Cost Optimization

This article examines how Apache Iceberg’s snapshot‑based ACID transactions, logical‑physical partition evolution, and COW/MOR update modes enable efficient real‑time data lake ingestion, and demonstrates AutoMQ’s Kafka‑to‑Iceberg Table Topic solution that simplifies schema management, reduces latency, and cuts operational costs.

Apache Iceberg · AutoMQ · Big Data
0 likes · 14 min read
DataFunSummit
Feb 23, 2025 · Big Data

Douyin Group’s ByteLake Data Lake Table Optimization and Management Practices

This article presents Douyin Group’s ByteLake, a heavily customized Apache Hudi‑based data lake table framework, detailing its core concepts, metadata services, write and read optimizations, operational challenges, a fully managed table management service, and its integration with the Amoro open‑source platform.

Amoro · Apache Hudi · Big Data
0 likes · 11 min read
JD Tech
Feb 11, 2025 · Big Data

Cold‑Hot Data Tiering and Performance Optimization in Apache Doris for JD Advertising

This article presents JD Advertising's engineering experience with Apache Doris, describing the evolution from a data‑lake cold‑data solution to a native cold‑hot tiering approach, detailing performance regressions after upgrading to Doris 2.0, and outlining a series of optimizations for query speed, CPU and memory usage, schema‑change efficiency, and automated data migration and restoration.

Apache Doris · Big Data · Data Lake
0 likes · 17 min read
Tencent Advertising Technology
Dec 6, 2024 · Big Data

Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent

Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark storage‑partitioned join (SPJ) optimizations to achieve minute‑level feature update latency, 10× faster back‑fill, and up to 60% storage savings.

Big Data · CDC · Compaction
0 likes · 25 min read
Tongcheng Travel Technology Center
Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake
0 likes · 7 min read
Bilibili Tech
Nov 26, 2024 · Big Data

Bilibili’s Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practices

Bilibili migrated its massive user‑behavior, commercial AI training, and database synchronization pipelines from Hive and Kafka to an Iceberg‑based streaming‑batch architecture, using Flink and the Magnus optimizer to achieve minute‑level freshness, reduce CPU and memory usage by about 20–22%, save roughly CNY 3.55 million annually, and dramatically improve query latency and join performance.

Data Integration · Data Lake · Iceberg
0 likes · 20 min read
DataFunSummit
Nov 23, 2024 · Big Data

Bilibili's Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practice

This article presents Bilibili's end‑to‑end exploration of a streaming‑batch unified data pipeline built on Apache Iceberg, detailing the original and iterated architectures for massive user behavior transmission, online AI training, DB synchronization, and dimension‑join, along with performance gains, cost savings, and future plans.

Data Lake · Iceberg · Optimization
0 likes · 20 min read
iQIYI Technical Product Team
Nov 21, 2024 · Big Data

Alluxio Integration and Optimization for Multi‑AZ Big Data Analytics at iQIYI

iQIYI integrates Alluxio with its QBFS multi‑AZ unified scheduling system, combining automatic caching of hot tables, table‑level policies, page‑level storage, and AZ‑aware worker selection. Together these cut cross‑zone traffic, halve query latency, and deliver up to a 20× I/O speedup and a three‑fold overall performance boost.

Alluxio · Big Data · Cache Optimization
0 likes · 23 min read
DataFunSummit
Nov 20, 2024 · Artificial Intelligence

How Data Lakes Empower AI: Expert Insights on Feature Management, Columnar Storage, and Vector Formats

In a panel discussion, experts explain how data‑lake‑warehouse integration, table formats like Apache Iceberg built on columnar storage, and emerging variant types enable efficient feature engineering, support large‑language‑model workloads, and provide flexible vector storage, thereby driving the evolution of AI from traditional ML to the GenAI era.

Apache Iceberg · Artificial Intelligence · Data Lake
0 likes · 6 min read
DataFunSummit
Nov 16, 2024 · Big Data

Data Lake Storage Acceleration: Evolution, Challenges, and Solutions for AI and Big Data Workloads

This article surveys the evolution of data‑lake storage acceleration, compares different architectural stages, analyzes why acceleration is needed for AI and big‑data scenarios, and details the key techniques—metadata acceleration, read/write speedup, and end‑to‑end workflow optimization—used to overcome performance and cost challenges.

AI · Big Data · Caching
0 likes · 23 min read