Tagged articles
8 articles
Bilibili Tech
Sep 30, 2022 · Big Data

Bilibili's Efficient Lakehouse Platform Built on Trino and Iceberg

Bilibili's new lakehouse platform, built on Trino and Iceberg, replaces Hive-based pipelines by ingesting logs and database data into Iceberg tables. It applies advanced sorting, Z-order/Hilbert clustering, bitmap and Bloom filter indexes, virtual join columns, and pre-aggregation, and serves 70,000 daily queries over 2 PB of data with an average scan of 2 GB per query and sub-2-second response times.

Big Data · Data Skipping · Iceberg
15 min read
DataFunTalk
Sep 15, 2022 · Big Data

Bilibili Offline Platform: Migration from Hive to Spark and Large‑Scale Optimizations

This article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing the migration process, automated SQL conversion, result verification, stability and performance enhancements, meta‑store optimizations, and future work on remote shuffle and vectorized execution.

Data Skipping · MetaStore · Shuffle
28 min read
ITPUB
Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · MetaStore
31 min read
Big Data Technology & Architecture
May 4, 2022 · Big Data

Apache Hudi 0.11.0 Release Highlights: Multi‑Mode Index, Data Skipping, Async Index, Spark & Flink Integration, and New Utilities

The Apache Hudi 0.11.0 release introduces multi‑mode metadata indexing, enhanced data‑skipping, asynchronous indexing, extensive Spark and Flink integration improvements, new bundle utilities, and expanded metadata synchronization with BigQuery, AWS Glue, and DataHub, while also adding bucket indexing and encryption support.

Apache Hudi · Async Index · Big Data
13 min read
DataFunSummit
Apr 29, 2022 · Big Data

Optimizing Query Performance in Apache Iceberg with Z‑Order Data Organization

This article explains how data skipping in Apache Iceberg can lose efficiency when queries filter on many columns. It presents a data-organization optimization based on space-filling curves and Z-order to reduce query I/O, details the OPTIMIZE implementation, and shares benchmark results and future plans.

Apache Iceberg · Big Data · Data Skipping
12 min read
DataFunTalk
Apr 9, 2022 · Big Data

Optimizing Apache Iceberg Query Performance with Z‑Order Data Organization

This talk explains how data skipping in Apache Iceberg can lose efficiency when queries filter on many columns. It presents a data-organization redesign based on space-filling curves and Z-order to reduce query I/O, and covers the OPTIMIZE syntax, implementation steps, performance benchmarks, and future roadmap.

Apache Iceberg · Big Data · Data Skipping
12 min read
Big Data Technology Architecture
Mar 4, 2021 · Big Data

Improving Interactive Analysis on Massive Datasets with Data Clustering and Data Skipping Using Spark and Iceberg

This article explores how data clustering techniques such as linear order, Z‑order, and Hilbert‑curve ordering can be applied in Apache Spark and Apache Iceberg to achieve efficient data skipping on terabyte‑scale tables, dramatically reducing file scans and enabling sub‑second interactive analytics for multi‑dimensional queries.

Big Data · Data Clustering · Data Skipping
20 min read
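
Several of the articles listed here rely on Z-order clustering combined with min/max data skipping. As a rough illustration only, and not the implementation used by Spark, Iceberg, or any of the listed articles, the toy sketch below compares linear ordering with Z-ordering on a 16 x 16 grid of (x, y) rows. The helper names (`interleave_bits`, `split_into_files`, `files_scanned`) and the "file" dictionaries are hypothetical and exist only for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


def split_into_files(rows, rows_per_file):
    """Chunk already-sorted rows into toy 'files' with per-column min/max stats."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "x_min": min(r[0] for r in chunk), "x_max": max(r[0] for r in chunk),
            "y_min": min(r[1] for r in chunk), "y_max": max(r[1] for r in chunk),
        })
    return files


def files_scanned(files, column, value):
    """Count files whose [min, max] range for `column` may contain `value`."""
    return sum(
        1 for f in files
        if f[f"{column}_min"] <= value <= f[f"{column}_max"]
    )


# A 16 x 16 grid of (x, y) rows, written as 16 files of 16 rows each.
rows = [(x, y) for x in range(16) for y in range(16)]
linear = split_into_files(sorted(rows), 16)
zorder = split_into_files(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

for name, files in (("linear", linear), ("z-order", zorder)):
    print(name,
          "x=5 ->", files_scanned(files, "x", 5), "files;",
          "y=5 ->", files_scanned(files, "y", 5), "files")
```

In this toy setup, linear ordering skips well only for a filter on the leading sort column, while Z-ordering gives up some skipping on that column in exchange for comparable skipping on both.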