Apache Hudi Asia Summit Successfully Held
The first Apache Hudi Asia Summit in Beijing attracted over 230 attendees, featuring technical discussions on data lake optimization and case studies from companies like Fastly and Meituan.
Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.
This presentation covers Tencent's real‑time lakehouse architecture in four parts: lakehouse design, intelligent optimization services, scenario‑driven capabilities, and future outlook. Components discussed include Spark, Flink, Iceberg, the Auto‑Optimize Service, indexing, clustering, AutoEngine, and PyIceberg.
This article presents Tencent's end‑to‑end real‑time lakehouse architecture, detailing its three‑layer design, the Auto Optimize Service modules such as compaction, indexing, clustering and engine acceleration, as well as scenario‑driven capabilities like multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support, and concludes with future optimization directions.
This presentation describes Tencent's real-time lakehouse architecture, including data lake compute, management, and storage layers, and details the intelligent optimization services—such as compaction, indexing, clustering, and auto-engine—designed to improve query performance, storage cost, and operational efficiency for large-scale data processing.
This article presents Tencent's real‑time lakehouse architecture, detailing its three‑layer design of compute, management and storage, and explains the six components of the Intelligent Optimization Service—including Compaction, Index, Clustering, and AutoEngine—along with scenario‑based capabilities, migration strategies, and future optimization directions.
This article discusses the implementation of Apache Kylin as an OLAP engine for logistics data, focusing on optimizing cube building and query performance to handle large-scale, high-dimensional data analytics.
The article explores how early 1980s game programmers managed extremely limited memory for graphics, audio, and code by using techniques such as tile-based rendering, simple audio synthesis, and data size estimation, highlighting the stark contrast with modern development resources.
This article examines the causes and impacts of small‑file proliferation in Hive and Spark environments and presents several mitigation techniques (Spark 3 adaptive query execution, lowering the number of reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling) to improve performance and storage efficiency.
The article surveys unconventional offline data‑task optimizations—such as distribution‑by, seeded random shuffling, explode‑based skew mitigation, hash bucketing, task‑parallelism tuning, and multi‑insert materialization—organized by point, line, and surface perspectives, and stresses that effective performance gains require both technical tricks and business‑driven pipeline adjustments.
This article examines how Zhuanzhuan's product service, a core component of its e‑commerce platform, was optimized by reducing unnecessary data transmission, applying cache‑aside patterns, redesigning Redis storage, and introducing a field‑marking approach, resulting in dramatically lower GC overhead, network traffic, and response times.
This article details Bilibili's lakehouse implementation using Apache Iceberg and Alluxio, covering background challenges, architectural components, data organization techniques like Z‑order and bitmap indexes, performance benchmarks, and future optimization plans for large‑scale analytics.
This article examines performance bottlenecks in a high‑traffic e‑commerce product service and proposes data‑centric optimizations—including read‑only focus, field‑level selection via bit‑masking, and Redis hash storage—to reduce payload size, lower GC pressure, and improve latency while maintaining scalability.
This case study details how JD.com’s ranking page for the 618 promotion leveraged data‑driven navigation, visual redesign, pixel‑perfect front‑end techniques, and component‑based development to boost click‑through rates, conversion, and overall GMV while outlining future optimization directions.
Didi’s engineering team analyzed a severe write bottleneck in their 3000‑node Elasticsearch cluster, identified long‑tail latency caused by refresh, translog locks, write queues and GC, and applied routing‑aware bulk writes, JVM and Lucene tweaks, and data cleaning to more than double write throughput while slashing server costs.
This article examines the challenges of mobile‑based trajectory tracking in city management and presents a comprehensive set of optimizations—including adaptive GPS sampling, keep‑alive strategies, accuracy enhancements, algorithmic fitting, and cinematic animation effects—to produce smooth, accurate, and visually appealing trajectory displays at scale.
This article details a data‑science team’s end‑to‑end approach to the TalkingData ad‑fraud Kaggle competition, covering dataset quirks, performance‑critical optimizations, a multi‑stage cross‑validation workflow, feature‑engineering tactics, model experiments with LightGBM and neural nets, and key lessons learned.
This article explains Meizu's cloud synchronization system, detailing its custom MZ‑SyncML, Semi‑Sync, File‑Sync and One‑Sync protocols, the multi‑IDC deployment, routing components, data format optimizations, and modular backend architecture that together support millions of users with high availability and efficient data transfer.