How WeChat Supercharged Mobile Full‑Text Search with SQLite FTS Optimizations
This article explains the principles of SQLite’s Full‑Text Search extension, details WeChat’s architecture for mobile search, and shares practical performance‑tuning techniques—including custom tokenizers, offset function improvements, and SQL‑level ranking—to achieve sub‑50 ms query times on billions of records.
Introduction
Full‑text search (FTS) on mobile devices is critical for local data retrieval, but limited hardware makes performance a challenge. This article uses SQLite FTS Extension, the de‑facto standard on Android and iOS, and shares WeChat’s practical optimization experience.
SQLite FTS Extension
SQLite FTS Extension is a built‑in plugin that provides fast, stable, easy‑to‑integrate full‑text search using inverted indexes. It is compatible across platforms and offers three main versions:
FTS3 : basic version with full features and custom tokenizers.
FTS4 : performance‑enhanced version with relevance functions (MatchInfo).
FTS5 : major redesign with segmented instance‑list storage, larger capacity, and extensible auxiliary functions.
Storage Architecture
WeChat’s full‑text search, launched in 2014, isolates its search database from the main business database, uses read‑only connections for indexing, reduces database operations by batching inserts, and employs delayed updates for hot data.
The schema separates an IndexTable (pure search index) from a MetaTable (business filtering), linking them via DocId and business type fields.
Search Optimization
From version 5.4 to 6.5.7, average query time grew tenfold as data volume increased, prompting a deep analysis of the search pipeline. Data retrieval accounted for over 80 % of total latency.
Reducing Single‑Loop Cost
The Offsets function in FTS4 consumed >70 % of each loop’s time, especially with large result sets. Offsets converts term positions to byte offsets for highlighting.
Input: Query, hit document, target term offsets. Output: target byte offsets.
Example: Query=我, Doc=我和我弟弟去逛街, term offsets=0,2 → byte offsets=0,6.
Optimizing the tokenizer reduced Offsets processing of 100 KB data to 21 ms, but further gains were needed for larger workloads.
Tokenizer Improvements
WeChat replaced the ICU tokenizer with a custom Simple tokenizer that treats English letters and digits as a single token and Chinese characters individually, eliminating unnecessary UTF‑8→Unicode conversion and dictionary lookup.
Reducing Overall Loop Count
Moving sorting logic from Java to SQL required a 64‑bit ranking value ( MMRank ) that encodes multiple ranking factors. The SQL then orders by this single column.
Special Optimization – Chat Record Search
Chat search must handle statistical attributes and massive result sets (up to 200 k hits per keyword). Three optimization strategies were evaluated.
Solution 1: GROUP BY
SQL‑level aggregation counts hits per conversation, but cannot use indexes efficiently and performs full scans.
Solution 2: Stepwise Calculation
First select the three most recent active conversations, then run separate count queries. This reduces work but requires multiple SQL statements.
Solution 3: MessageCount Aggregation
A custom aggregate function computes hit counts for the top conversations in a single query.
Conclusion
After these optimizations, average query latency for all users fell below 50 ms and below 200 ms for heavy users—a five‑fold improvement. Further gains are possible, such as embedding byte offsets directly in the DocList for faster highlighting.
The author hopes the sharing provides value to readers.
WeChat Client Technology Team
Official account of the WeChat mobile client development team, sharing development experience, cutting‑edge tech, and little‑known stories across Android, iOS, macOS, Windows Phone, and Windows.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
