Boost Address Keyword Search with PostgreSQL: GIN, pg_trgm, and pg_bigm
This article explains how PostgreSQL's native full‑text search, GIN indexes, and extensions like pg_trgm, pg_bigm, and pg_jieba can dramatically improve the speed and accuracy of Chinese address keyword searches, offering a cost‑effective alternative to external search engines.
In map or GIS scenarios, address keyword search is crucial; users often type a keyword such as "China Agricultural University" instead of the full address.
PostgreSQL is praised as the most advanced open‑source database, offering strong text‑search capabilities, including full‑text, fuzzy, and regex queries, as well as expression and GIN indexes, making it advantageous for address keyword retrieval.
1. Application Background
In address search scenarios, user input is tokenized and matched against a full‑text index of address corpora to obtain normalized address information. Queries are split into keywords, matched to historical address data, and results returned. Matching can take seconds to tens of seconds depending on tokenization and matching methods.
Common approaches include:
Using dedicated search engines such as Elasticsearch or Lucene.
Leveraging built‑in database search capabilities.
While Elasticsearch provides flexible search, it increases development and operational costs. This article explores how to achieve efficient Chinese address search using PostgreSQL’s native features.
2. Technical Solution
GIN (Generalized Inverted Index) stores (key, posting list) pairs, where each key maps to the tuple IDs where it appears, enabling fast lookup of rows containing specific keywords.
The
pg_trgmextension implements a trigram (3‑gram) tokenization; it slides a window of three characters over the text, creates tokens, and builds a GIN index for efficient fuzzy matching.
The
pg_bigmextension is similar but uses bigrams (2‑gram), which performs better for one‑ or two‑character queries.
We built a dataset of about 100 million Beijing address records and benchmarked PostgreSQL. Results show that both
pg_trgm+ginand
pg_bigm+ginoutperform traditional B‑tree indexes for fuzzy searches. Because logistics keywords are usually three characters or longer,
pg_trgm+ginprovides millisecond‑level response times.
For Chinese text segmentation, the
pg_jiebaextension integrates the Jieba library, offering customizable dictionaries, keyword extraction, and part‑of‑speech tagging, further improving search relevance.
3. Conclusion
PostgreSQL’s rich indexing options, powerful full‑text search, and extensive plugin ecosystem allow developers to implement high‑performance address keyword search without synchronizing data to external search engines, simplifying architecture, reducing development cost, and ensuring real‑time query results.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.