Applying PostgreSQL GIN Index for Real‑Time Deduplication, Precise Advertising, and TOB Profiling at Alibaba
The article presents several large‑scale Alibaba use cases—real‑time deduplication of shopping‑guide articles, precision ad targeting, TOB real‑time profiling, arbitrary field combination queries, and fuzzy matching—demonstrating how PostgreSQL’s array types, GIN indexes and pg_trgm extensions enable millisecond‑level responses on billions of records.
Introduction
PostgreSQL has grown rapidly in China, creating a wave of interest but a shortage of skilled operators; consequently many cloud providers, including Alibaba Cloud, now offer PostgreSQL as a managed service. This article shares several internal Alibaba scenarios that leverage PostgreSQL and invites readers to think about alternative solutions.
1. Real‑time Deduplication of Massive Shopping‑Guide Articles
1.1 Business Overview
Users frequently receive shopping‑guide pushes (e.g., daily "白菜价" articles). Each article may recommend dozens of products, and billions of articles accumulate, leading to massive redundancy that harms user experience.
1.2 Review Process Evolution
Initially, article review was labor‑intensive. The second generation introduced automated checks, but real‑time deduplication remained impossible due to the sheer volume of data.
1.2.1 Data Structure Issue
Each article stores about 50 product IDs in an array. When a new article arrives, the system must compare its IDs against existing records to calculate overlap percentages.
1.2.2 PostgreSQL GIN Index Application
PostgreSQL’s GIN (Generalized Inverted Index) treats each array element as a separate entry, effectively creating a reverse index that maps product IDs to the rows containing them, enabling fast overlap calculations.
1.2.3 Multi‑stage Filtering for Efficiency
First‑stage filtering discards articles whose overlap exceeds a configurable threshold (e.g., >4). The second stage examines only the relevant data blocks, reducing the number of rows to scan to a few hundred.
1.2.4 Simulation Test Results
Simulated data with over ten million distinct products and 11‑50 products per article showed that the GIN‑based solution could reject duplicates in ~15 ms while handling a throughput of 10 k QPS.
1.2.5 PostgreSQL Optimization Timing Test
Real‑world tests confirmed that the system can provide millisecond‑level feedback, turning a previously day‑level review process into a real‑time service.
2. Precise Advertising Delivery
2.1 Business Overview
Advertising platforms need to target users based on recent browsing behavior (e.g., cosmetics). The data volume reaches hundreds of billions of events, requiring both high precision and low latency.
2.2 Data Modeling and Scaling
User IDs, shop visit counts, product view counts, purchase quantities, and geographic information are stored as arrays, enabling efficient multi‑dimensional queries with GIN indexes.
2.3 Stair‑case Bucketing
Continuous metrics (e.g., number of views) are bucketed into discrete tiers, then encoded into a single integer per user. This reduces the dimensionality and allows fast overlap queries.
2.4 Targeted Audience Extraction
With 3.2 billion users, 64 partitions, and 4 k tags per user, the system can retrieve over ten thousand qualified users within a millisecond, a task that would otherwise require terabytes of rows.
3. TOB Real‑time Profiling
3.1 Business Overview
Similar to the advertising case but with only 10 k tags per enterprise user. Storing each tag as a separate column is infeasible; the solution uses array storage with GIN indexing.
3.2 Tag‑based Audience Queries
Queries combine inclusion and exclusion of tags. Using arrays and GIN indexes reduces storage from petabytes to a few hundred gigabytes while maintaining millisecond response times.
3.3 Solution Comparison
Three approaches were evaluated: (1) wide tables with many columns, (2) bitmap‑style BTI without indexes, and (3) PostgreSQL’s varbitx extension with array‑based storage. Approach 3 achieved the best balance of space (≈100 GB) and performance.
3.4 Space and Performance Evaluation
Approach 1 required ~8 TB, while approaches 2 and 3 needed only ~100 GB. All approaches met query latency requirements, but approach 3 also satisfied update‑frequency constraints.
3.5 Before‑and‑After Comparison
After migration, update latency improved from days to minutes and query latency dropped from minutes to milliseconds, with a reduction in required physical machines.
4. Arbitrary Field Combination Queries
4.1 Business Overview
Users can select dozens of options (e.g., free‑shipping, second‑hand goods). Storing each option as a separate column would require many indexes, harming write performance.
4.2 GIN Composite Index Solution
By storing selected options in an array and building a GIN index on the array, the system can answer any combination of AND/OR queries across six fields in sub‑millisecond response times.
5. Arbitrary Field Fuzzy Matching
5.1 Business Overview
Customer‑relationship systems need to support fuzzy searches on URLs, company names, and other non‑tokenizable strings, which traditional full‑text search cannot handle efficiently.
5.2 pg_trgm Solution
PostgreSQL’s pg_trgm extension breaks strings into trigrams, enabling fast similarity searches that scale to billions of rows with millisecond latency.
Conclusion
The presented cases demonstrate how PostgreSQL’s advanced indexing features—GIN for array data, bitmap extensions, and pg_trgm for fuzzy matching—allow Alibaba to solve large‑scale, low‑latency data problems that would be difficult or impossible with other technologies.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.