Databases 18 min read

How Alibaba Scales Real‑Time Deduplication and Targeted Ads with PostgreSQL

This article explains how Alibaba Cloud leverages PostgreSQL’s GIN indexes, array types, and pg_trgm extension to achieve real‑time deduplication of massive guide articles, precise ad targeting, TOB profiling, arbitrary field combination, and fuzzy matching, delivering millisecond‑level latency while drastically reducing hardware requirements.

Efficient Ops

Jul 9, 2017

How Alibaba Scales Real‑Time Deduplication and Targeted Ads with PostgreSQL

Introduction

PostgreSQL has grown rapidly in recent years, creating a surge of interest in China. However, skilled operations personnel remain scarce, limiting large‑scale adoption. Alibaba Cloud offers PostgreSQL as both a public cloud service (RDS PostgreSQL) and an internal service for the Alibaba Group.

1. Massive Guide Article Real‑Time Deduplication

1.1 Business Overview

Guide articles push many products to users. Repeating the same products across articles leads to a poor user experience, so the platform must filter already‑recommended items.

With billions of articles and tens of millions of products, the system must enforce a configurable overlap threshold (e.g., reject an article if more than 80% of its products duplicate a previous one).

1.2 Review Process Evolution

Early stages relied on manual labor to review submissions, which was slow and costly. Later, a computer‑assisted approach compared new submissions against billions of existing records, but real‑time verification remained infeasible.

1.3 Data Structure Challenge

Each article stores about 50 product IDs in an array. When a new article arrives, the system must compare its array against every stored record to count overlapping IDs.

1.4 PostgreSQL GIN Index Application

PostgreSQL’s GIN (Generalized Inverted Index) treats each array element as a separate entry, enabling fast lookup of which rows contain a given product ID. This inverted index flips the relationship: instead of rows → values, it maps values → rows.

During deduplication, the system queries the GIN index for each product ID in the new article, aggregates the matching row identifiers, and quickly determines overlap counts.

1.5 Multi‑Stage Filtering for Efficiency

First‑stage filtering discards articles that exceed a simple overlap threshold (e.g., >4 matching products). If the threshold is lower (e.g., ≥3), the system retrieves candidate data blocks (e.g., block 49 and block 101) for a second‑stage, more detailed check.

1.6 Simulation Test Results

Using synthetic data that mirrors real‑world characteristics, the system handled over 10 million products and 6 × 10⁷ guide articles. Hot products (≈2% of total) were identified and their recommendation frequency measured.

1.7 PostgreSQL Optimization Timing Test

Real‑time deduplication achieved sub‑15 ms latency for hot‑product‑heavy articles and sustained a throughput of 10 k requests per second, turning a previously day‑level verification into an instantaneous response.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

PostgreSQL Ad Targeting GIN Index real-time deduplication array indexing pg_trgm TOB profiling

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.