Databases 20 min read

Applying PostgreSQL GIN Index for Real‑Time Deduplication, Precise Advertising, and TOB Profiling at Alibaba

The article presents several large‑scale Alibaba use cases—real‑time deduplication of shopping‑guide articles, precision ad targeting, TOB real‑time profiling, arbitrary field combination queries, and fuzzy matching—demonstrating how PostgreSQL’s array types, GIN indexes and pg_trgm extensions enable millisecond‑level responses on billions of records.

Qunar Tech Salon

Jul 11, 2017

Applying PostgreSQL GIN Index for Real‑Time Deduplication, Precise Advertising, and TOB Profiling at Alibaba

Introduction

PostgreSQL has grown rapidly in China, creating a wave of interest but a shortage of skilled operators; consequently many cloud providers, including Alibaba Cloud, now offer PostgreSQL as a managed service. This article shares several internal Alibaba scenarios that leverage PostgreSQL and invites readers to think about alternative solutions.

1. Real‑time Deduplication of Massive Shopping‑Guide Articles

1.1 Business Overview

Users frequently receive shopping‑guide pushes (e.g., daily "白菜价" articles). Each article may recommend dozens of products, and billions of articles accumulate, leading to massive redundancy that harms user experience.

1.2 Review Process Evolution

Initially, article review was labor‑intensive. The second generation introduced automated checks, but real‑time deduplication remained impossible due to the sheer volume of data.

1.2.1 Data Structure Issue

Each article stores about 50 product IDs in an array. When a new article arrives, the system must compare its IDs against existing records to calculate overlap percentages.

1.2.2 PostgreSQL GIN Index Application

PostgreSQL’s GIN (Generalized Inverted Index) treats each array element as a separate entry, effectively creating a reverse index that maps product IDs to the rows containing them, enabling fast overlap calculations.

1.2.3 Multi‑stage Filtering for Efficiency

First‑stage filtering discards articles whose overlap exceeds a configurable threshold (e.g., >4). The second stage examines only the relevant data blocks, reducing the number of rows to scan to a few hundred.

1.2.4 Simulation Test Results

Simulated data with over ten million distinct products and 11‑50 products per article showed that the GIN‑based solution could reject duplicates in ~15 ms while handling a throughput of 10 k QPS.

1.2.5 PostgreSQL Optimization Timing Test

Real‑world tests confirmed that the system can provide millisecond‑level feedback, turning a previously day‑level review process into a real‑time service.

2. Precise Advertising Delivery

2.1 Business Overview

Advertising platforms need to target users based on recent browsing behavior (e.g., cosmetics). The data volume reaches hundreds of billions of events, requiring both high precision and low latency.

2.2 Data Modeling and Scaling

User IDs, shop visit counts, product view counts, purchase quantities, and geographic information are stored as arrays, enabling efficient multi‑dimensional queries with GIN indexes.

2.3 Stair‑case Bucketing

Continuous metrics (e.g., number of views) are bucketed into discrete tiers, then encoded into a single integer per user. This reduces the dimensionality and allows fast overlap queries.

2.4 Targeted Audience Extraction

With 3.2 billion users, 64 partitions, and 4 k tags per user, the system can retrieve over ten thousand qualified users within a millisecond, a task that would otherwise require terabytes of rows.

3. TOB Real‑time Profiling

3.1 Business Overview

Similar to the advertising case but with only 10 k tags per enterprise user. Storing each tag as a separate column is infeasible; the solution uses array storage with GIN indexing.

3.2 Tag‑based Audience Queries

Queries combine inclusion and exclusion of tags. Using arrays and GIN indexes reduces storage from petabytes to a few hundred gigabytes while maintaining millisecond response times.

3.3 Solution Comparison

Three approaches were evaluated: (1) wide tables with many columns, (2) bitmap‑style BTI without indexes, and (3) PostgreSQL’s varbitx extension with array‑based storage. Approach 3 achieved the best balance of space (≈100 GB) and performance.

3.4 Space and Performance Evaluation

Approach 1 required ~8 TB, while approaches 2 and 3 needed only ~100 GB. All approaches met query latency requirements, but approach 3 also satisfied update‑frequency constraints.

3.5 Before‑and‑After Comparison

After migration, update latency improved from days to minutes and query latency dropped from minutes to milliseconds, with a reduction in required physical machines.

4. Arbitrary Field Combination Queries

4.1 Business Overview

Users can select dozens of options (e.g., free‑shipping, second‑hand goods). Storing each option as a separate column would require many indexes, harming write performance.

4.2 GIN Composite Index Solution

By storing selected options in an array and building a GIN index on the array, the system can answer any combination of AND/OR queries across six fields in sub‑millisecond response times.

5. Arbitrary Field Fuzzy Matching

5.1 Business Overview

Customer‑relationship systems need to support fuzzy searches on URLs, company names, and other non‑tokenizable strings, which traditional full‑text search cannot handle efficiently.

5.2 pg_trgm Solution

PostgreSQL’s pg_trgm extension breaks strings into trigrams, enabling fast similarity searches that scale to billions of rows with millisecond latency.

Conclusion

The presented cases demonstrate how PostgreSQL’s advanced indexing features—GIN for array data, bitmap extensions, and pg_trgm for fuzzy matching—allow Alibaba to solve large‑scale, low‑latency data problems that would be difficult or impossible with other technologies.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

database optimization PostgreSQL GIN Index precision advertising real-time deduplication

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.