
Optimizing Point‑Query Performance in Presto with Apache Hudi Data Skipping and Layout Techniques

This article explains how Huawei Cloud leverages Apache Hudi and HetuEngine (Presto) to improve point‑query performance on Lakehouse architectures through data layout optimization, file‑skipping techniques, metadata tables, and extensive benchmark results demonstrating multi‑fold speedups.

DataFunSummit

Background: Lakehouse is an open architecture that merges the best of data lakes and data warehouses. Huawei Cloud has incorporated this concept into its FusionInsight MRS solution, using Apache Hudi as the core storage layer and HetuEngine (an enhanced Presto) as the unified SQL analysis engine.

Data layout optimization: Point-query workloads typically carry highly selective filters. By arranging data with appropriate partition fields and sort orders, the query engine can skip large portions of irrelevant data (DataSkipping). Combining DataSkipping with FileSkipping techniques such as min-max statistics, Bloom filters, and bitmap indexes further reduces I/O.
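The min-max FileSkipping idea can be sketched in a few lines. This is an illustrative toy, not Hudi's implementation: each file carries per-column (min, max) statistics, and any file whose range cannot contain the predicate value is never opened.

```python
# Illustrative sketch of min-max file skipping (not Hudi's actual code):
# prune every file whose per-column [min, max] range excludes the value.
def prune_files(file_stats, column, value):
    """file_stats: list of (file_name, {column: (min, max)}) pairs."""
    survivors = []
    for name, stats in file_stats:
        lo, hi = stats[column]
        if lo <= value <= hi:          # file *may* contain the value
            survivors.append(name)
    return survivors                   # every other file is skipped unread

files = [
    ("f0.parquet", {"c1": (0, 99)}),
    ("f1.parquet", {"c1": (100, 199)}),
    ("f2.parquet", {"c1": (200, 299)}),
]
print(prune_files(files, "c1", 150))   # only f1.parquet must be read
```

This also shows why clustering matters: min-max pruning is only effective when sorting keeps each file's value ranges narrow and non-overlapping.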

Apache Hudi core capabilities: Clustering offers three algorithms (Order, Z-Order, Hilbert) that reorder data to improve locality for point queries. The Metadata Table (MDT) is a self-managed Hudi MoR table stored under .hoodie that maintains column statistics and Bloom filters. The MDT also provides a high-performance FileList, eliminating costly large-scale file listings on object storage.

Integration with Presto: The team chose the Hudi connector and added support for column-statistics and Bloom-filter push-down. Column statistics are filtered on the Coordinator (memory-friendly), while Bloom-filter checks are pushed to the Worker side to avoid loading massive indexes.
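The Worker-side check relies on the defining property of a Bloom filter: a negative answer is definitive, so a file whose filter does not match the key can be skipped without reading it. A minimal sketch of that check, assuming a hypothetical bit array and SHA-256-based hashing rather than Hudi's on-disk filter format:

```python
import hashlib

# Minimal Bloom-filter sketch of the Worker-side membership check
# (illustrative only; Hudi stores real Bloom filters in the MDT).
class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded hashes of the key.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the file definitely lacks the key and can be skipped;
        # True only means the file *might* contain it (false positives exist).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for key in ("k1", "k2", "k3"):
    bf.add(key)
assert bf.might_contain("k2")   # no false negatives, ever
```

Because the filter can answer "definitely absent" from a compact bit array, shipping the check to the Workers trades a little per-split CPU for never centralizing the full index on the Coordinator.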

Point-query testing: Using the 1.5 TB SSB benchmark (12 billion rows) on a 1CN+3WN container cluster, data were pre-processed with Hudi's Hilbert clustering on S_CITY, C_CITY, P_BRAND, and LO_DISCOUNT. MDT, Bloom filter, and DataSkipping were enabled. Results showed a 2-11× overall query speedup, a 2-200× reduction in files read, and up to a 30× improvement for multi-column scans.

Sample table creation:

// Create a Hudi table with MDT column statistics, a Bloom-filter index,
// and data skipping enabled. Note that stripMargin must be applied to the
// string itself, before it is passed to spark.sql.
spark.sql(
  """
    |create table prestoc (
    |  c1 int,
    |  c11 int,
    |  c12 int,
    |  c2 string,
    |  c3 decimal(38, 10),
    |  c4 timestamp,
    |  c5 int,
    |  c6 date,
    |  c7 binary,
    |  c8 int
    |) using hudi
    |tblproperties (
    |  primaryKey = 'c1',
    |  preCombineField = 'c11',
    |  hoodie.upsert.shuffle.parallelism = 8,
    |  hoodie.table.keygenerator.class = 'org.apache.hudi.keygen.SimpleKeyGenerator',
    |  hoodie.metadata.enable = 'true',
    |  hoodie.metadata.index.column.stats.enable = 'true',
    |  hoodie.metadata.index.column.stats.file.group.count = '2',
    |  hoodie.metadata.index.column.stats.column.list = 'c1,c2',
    |  hoodie.metadata.index.bloom.filter.enable = 'true',
    |  hoodie.metadata.index.bloom.filter.column.list = 'c1',
    |  hoodie.enable.data.skipping = 'true',
    |  hoodie.cleaner.policy.failed.writes = 'LAZY',
    |  hoodie.clean.automatic = 'false',
    |  hoodie.metadata.compact.max.delta.commits = '1'
    |)
    |""".stripMargin)

Future work: The next steps focus on adding Bitmap and secondary-index support, and improving MDT caching to further accelerate metadata access.

Tags: big data, query optimization, Presto, lakehouse, Data Skipping, Apache Hudi
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
