MaxCompute Bloomfilter Index: Faster Emergency Tracing Queries, Reduced Storage

The article explains how MaxCompute’s newly introduced Bloomfilter index dramatically improves emergency data tracing by cutting query time and resource consumption, replacing costly secondary indexes, reducing storage by over 45%, and providing a lightweight, high‑efficiency solution for large‑scale point‑lookup scenarios.

Alibaba Cloud Big Data AI Platform

Business Background

Emergency tracing is the last line of defense for data security. When a potential data leak occurs, a rapid and accurate investigation is required, and the results must be quickly organized and synchronized to allow proper handling and reporting.

To verify whether a suspected sample really is leaked data, investigators must correlate it with various internal data sources, identify the leak source (e.g., ecosystem merchants), and respond promptly. The challenge is that tracing requires scanning massive datasets such as gateway logs and traffic data, often at petabyte scale across dozens of partitions, which leads to long query times.

Business Pain Points

Investigations frequently involve querying the traffic tracing detail table and OSS log table, which often suffer from slow SQL and high resource consumption. Existing optimization approaches include:

Hash/Range clustering: for equality queries, rows are distributed by key into 256 to 4,096 buckets so that only the matching bucket has to be scanned.

Secondary index tables: for tables that are queried on more than one field (e.g., the OSS log table is queried by both Access_key and IP), a secondary table maps IP to the hash key, and both tables are clustered to accelerate queries.

Both methods have drawbacks: clustering works well only for the clustered fields, while secondary indexes consume large additional storage (tens to hundreds of terabytes).

MaxCompute Bloomfilter Index Introduction

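A Bloom filter is a space‑efficient probabilistic data structure for membership tests: it may report a value as present when it is not (a false positive), but it never reports a stored value as absent. MaxCompute introduced the Bloomfilter index in its November release to support finer‑grained data pruning for large‑scale point‑lookup scenarios, reducing unnecessary data scans and improving query performance.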
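
To make the data structure concrete, here is a minimal textbook Bloom filter in Python. It is only a sketch of the general technique, not MaxCompute's implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: a bit array plus k derived hash positions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing). Real engines use faster non-cryptographic hashes.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


# Hypothetical sensitive values used only for illustration.
bf = BloomFilter(num_bits=10_000, num_hashes=7)
for phone in ("13800000001", "13800000002", "13800000003"):
    bf.add(phone)

print(bf.might_contain("13800000002"))  # a stored value is always reported
print(bf.might_contain("13900000000"))  # an unseen value is almost always rejected
```

The asymmetry is what makes the structure useful for pruning: a negative answer is always safe to act on, so data whose filter rejects the lookup value can be skipped without reading it.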

Large‑Scale Point‑Lookup Scenarios

Typical use cases include:

Querying a user's food‑delivery records for the past week.

Querying newly registered users' activity in the maternity category.

Querying user information by phone number.

Even when the result set contains only a few rows, MaxCompute may still need to scan massive amounts of data: predicate push‑down can prune only with file‑level min/max statistics, and when those statistics cannot exclude a file the query degrades toward a full‑table scan.
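
The weakness of min/max statistics for point lookups can be shown with a small Python sketch. The layout below is hypothetical (8 files, 1,000 integer IDs each, written interleaved so that every file spans nearly the whole value range); it stands in for high-cardinality values that arrive in no particular order.

```python
# Hypothetical layout: file i holds the IDs i, i + 8, i + 16, ...
NUM_FILES = 8
files = [[i + NUM_FILES * j for j in range(1000)] for i in range(NUM_FILES)]


def minmax_candidates(value: int) -> list:
    """Files that file-level min/max statistics cannot rule out."""
    return [i for i, rows in enumerate(files) if min(rows) <= value <= max(rows)]


def true_matches(value: int) -> list:
    """Files that actually contain the value."""
    return [i for i, rows in enumerate(files) if value in rows]


target = 4003
print(minmax_candidates(target))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(true_matches(target))       # [3]
```

Every file's range covers the target, so min/max pruning keeps all eight files even though only one of them holds the row.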

Current Clustering Pain Points

MaxCompute supports Hash and Range clustering, which distribute data into buckets based on a clustering key. Queries can quickly eliminate irrelevant buckets, but clustering has limitations:

Hash clustering filters data only when the query includes all clustering keys.

Range clustering works well only with prefix filters on clustering keys in left‑to‑right order.

If the clustering key is absent from the query, no filtering occurs, making clustering ineffective for ad‑hoc queries.

Data writes require shuffling by the clustering key, increasing cost and causing skew‑related tail latency.
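
The following Python sketch illustrates the first and third limitations under simplified assumptions: four buckets instead of hundreds, CRC32 as a stand-in hash function, and made-up rows of (access_id, ip) clustered by access_id only. None of these details come from MaxCompute itself.

```python
import zlib

NUM_BUCKETS = 4  # kept tiny for the example


def bucket_of(key: str) -> int:
    # Deterministic hash so the example behaves the same on every run.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


# Hypothetical log rows: (access_id, ip), clustered by access_id only.
rows = [(f"ak-{i % 50}", f"10.0.{i % 7}.{i % 250}") for i in range(1000)]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_of(row[0])].append(row)


def scan(access_id=None, ip=None):
    """Return (matching rows, number of rows read)."""
    if access_id is not None:
        candidates = [bucket_of(access_id)]    # prune to a single bucket
    else:
        candidates = list(range(NUM_BUCKETS))  # no clustering key: read all
    read = [r for b in candidates for r in buckets[b]]
    hits = [
        r for r in read
        if (access_id is None or r[0] == access_id)
        and (ip is None or r[1] == ip)
    ]
    return hits, len(read)


by_key, read_by_key = scan(access_id="ak-7")
by_ip, read_by_ip = scan(ip="10.0.3.3")
print(len(by_key), read_by_key)  # matches found after reading one bucket
print(len(by_ip), read_by_ip)    # one match, but every row was read
```

A lookup on the clustering key touches one bucket, while a lookup on ip, which is not a clustering key, reads the whole table to return a single row. This is the gap that the secondary index table, and later the Bloomfilter index, is meant to close.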

Advantages of Bloomfilter Index

A Bloomfilter index builds a Bloom filter for the specified columns and stores it for fast membership checks. Compared with clustering, its benefits are:

Highly efficient: insertion and query consume fewer resources, filtering invalid data at minimal cost.

Effective in high‑cardinality, tightly distributed data scenarios.

Highly extensible: can be built on one or multiple columns, even on non‑clustering keys, and can be combined with clustering indexes.
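
As a rough illustration of how such an index prunes data on a column that is not a clustering key, the Python sketch below keeps one small Bloom filter per file and consults it before reading. The granularity (one filter per file), the filter size, and the hashing scheme are assumptions made for the example; the source does not describe how MaxCompute lays out its index internally.

```python
import hashlib

NUM_BITS = 1 << 16   # assumed filter size per file
NUM_HASHES = 7       # assumed number of hash functions


def positions(value: str) -> list:
    # Double hashing over one SHA-256 digest (an illustrative choice).
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(values) -> int:
    bits = 0  # a Python int used as a bit set
    for value in values:
        for pos in positions(value):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, value: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(value))


# Hypothetical layout: 8 files, 1,000 interleaved values each, so min/max
# statistics on this column could not exclude any file.
files = [[f"id-{i + 8 * j}" for j in range(1000)] for i in range(8)]
filters = [build_filter(rows) for rows in files]

target = "id-4003"  # stored in file 3
files_to_read = [i for i, bits in enumerate(filters) if might_contain(bits, target)]
print(files_to_read)
```

The file that holds the value is always kept, because a Bloom filter has no false negatives. Other files are skipped unless their filter happens to return a false positive, in which case the only cost is one unnecessary read.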

Ant Security Tracing Best Practice

Test Environment

Two business tables are used:

Large‑scale equality test (tracing detail table): column si_value stores sensitive values (phone numbers, IDs). Tracing this column reveals all access records for the sensitive value.

Hot‑key query (OSS log table): column Access_id is the application access key. In AK leakage scenarios, all OSS logs for a specific Access_id are extracted for analysis.

Usage Example

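-- 1. Create table
CREATE TABLE test_oss_backend_hi_1 LIKE dwd_sec_evt_oss_backend_hi LIFECYCLE 180;
DESC EXTENDED test_oss_backend_hi_1;
-- Cleanup only, once the test is finished (running it here would drop the table before step 2):
-- DROP TABLE ap_asec_ahunt_sys_dev.test_oss_backend_hi_1;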

-- 2. Create Bloomfilter index
CREATE BLOOMFILTER INDEX access_id_idx
ON test_oss_backend_hi_1
FOR COLUMNS(access_id)
IDXPROPERTIES('fpp' = '0.00005', 'numitems'= '100000000')
COMMENT 'access_id index';
SHOW INDEXES ON test_oss_backend_hi_1;

-- 3. Load data (run after index creation)
INSERT OVERWRITE TABLE test_oss_backend_hi_1 PARTITION (dt = '20230424', hour = '10')
SELECT * FROM dwd_sec_evt_oss_backend_hi
WHERE dt = '20230424' AND hour = '10';

-- 4. Query a hot key (e.g., 1024)
SET odps.sql.enable.bloom.filter.index=true;
SELECT * FROM test_oss_backend_hi_1
WHERE access_id = 'LTAIIQ3X1Mr1JAFd' AND dt='20230424' AND hour='10';
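
To get a feel for the 'fpp' and 'numitems' values used in step 2, the standard Bloom filter sizing formulas can be evaluated in Python. This is textbook arithmetic only: the source does not say how MaxCompute sizes its filters internally, or whether numitems applies per file, per partition, or per table, and the function name below is hypothetical.

```python
import math


def bloom_size(num_items: int, fpp: float):
    """Textbook optimum: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2."""
    bits = math.ceil(-num_items * math.log(fpp) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_items * math.log(2)))
    return bits, hashes


num_items = 100_000_000  # 'numitems' from the CREATE BLOOMFILTER INDEX example
fpp = 0.00005            # 'fpp' from the same statement

bits, hashes = bloom_size(num_items, fpp)
print(f"{bits / num_items:.1f} bits per item")
print(f"{bits / 8 / 2**20:.0f} MiB for one filter of this size")
print(f"{hashes} hash functions")
```

Under these formulas the settings work out to roughly 20.6 bits per distinct value and 14 hash functions. A lower fpp or a higher numitems increases the filter size, which is the trade-off between index storage and the number of needlessly scanned files.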

Test Results Comparison

Two solutions were compared:

Solution 1: Original table + secondary index.

Solution 2: Original table + Bloomfilter index.

Results show that the Bloomfilter index reduces overall storage by more than 45%, achieves the best computation time for single hot‑key queries, and eliminates 99% of secondary index construction effort.

Key metrics:

Storage reduction: approximately 2 PB per month, saving about ¥83,000 per month.

Computation reduction: negligible additional cost, while query time is consistently lower than the secondary‑index approach.

Conclusion

MaxCompute, Alibaba’s leading distributed big‑data processing platform, continuously enhances SQL usability and performance. In large‑scale emergency tracing scenarios, traditional clustering and secondary indexes fail to meet efficiency and cost requirements. The newly introduced lightweight Bloomfilter index provides higher space efficiency and query speed, reducing both query latency and storage overhead, thereby lowering overall business costs.

For more details, refer to the official product documentation.
