How iOS WeChat Supercharged Search with SQLite FTS5 and Custom Tokenizers
This article details the 2021 overhaul of iOS WeChat's full‑text search, covering engine selection, segment‑merge optimization, a new VerbatimTokenizer, multi‑level separator support, table schema choices, asynchronous index updates, and extensive performance gains across chat, contacts, and favorites.
1. Current State of iOS WeChat Full‑Text Search
Full‑text search relies on an inverted index, which maps each Token to the positions where it occurs in the indexed content. Of WeChat's search scenarios, chat history has used SQLite FTS3 since 2014, while favorites relied on simple LIKE queries and contacts on in‑memory scans, which prompted the modernization.
2. Engine Selection and Optimization
2.1 Engine Choice
Available engines for iOS include SQLite FTS3/4/5, CLucene, and Lucy. After comparing transaction support, technical risk, search capability, and read/write performance, SQLite FTS5 was chosen for its mature transaction handling and low risk.
Performance tests on 1 million random Chinese sentences showed that Lucene returned hits faster, but overall read/write differences were minor; FTS5's index generation time was higher but could be optimized.
2.2 Automatic Segment Merge
FTS5 writes the content of each transaction as a separate B‑tree segment. A large number of segments degrades query speed, so FTS5 provides a merge mechanism. New segments start at level 0, and a merge combines level‑i segments into a single level‑(i+1) segment. Two merge strategies exist by default: automerge, triggered when a level reaches 4 segments, and crisismerge, triggered when a level reaches 16 segments. To avoid blocking business logic, WeChat offloads merges to a dedicated thread, limits them to one merge per level, and throttles them to keep write latency low.
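The throttled merge described above can be sketched with FTS5's documented 'automerge' and 'merge' special commands. This is a minimal sketch, assuming Python's built-in sqlite3 module is compiled with FTS5; the table name msg, the helper merge_step, and the 16-page budget are hypothetical, and the source does not show WeChat's actual merge thread.

```python
import sqlite3


def merge_step(conn, table, pages=16):
    """Run one bounded merge increment; return True if merge work was done."""
    before = conn.total_changes
    # A positive value asks FTS5 to merge roughly `pages` pages and then stop.
    conn.execute(f"INSERT INTO {table}({table}, rank) VALUES('merge', ?)", (pages,))
    conn.commit()
    # Per the FTS5 docs, a change count below 2 means the command was a no-op.
    return conn.total_changes - before >= 2


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE msg USING fts5(body)")
# Turn off automerge so that writers do not pay for merges inline.
conn.execute("INSERT INTO msg(msg, rank) VALUES('automerge', 0)")
conn.commit()

for i in range(40):
    conn.execute("INSERT INTO msg(body) VALUES (?)", (f"message number {i}",))
    conn.commit()  # each committed transaction adds a new level-0 segment

# In an app this loop would run on a dedicated thread with its own connection,
# pausing between steps so foreground writes stay responsive.
steps = 0
while steps < 1000 and merge_step(conn, "msg"):
    steps += 1
```

Keeping each step small is what bounds the time any single merge call can hold the write lock.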
2.3 Tokenizer Optimization
The tokenizer is crucial for breaking text into Tokens. The legacy FTS3 OneOrBinaryTokenizer used a hybrid character‑pair approach, inflating index size. A new VerbatimTokenizer performs simple character tokenization and builds quoted two‑character Phrases for exact adjacency matching, achieving the same precision with far less index bloat.
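The idea of character tokens plus a quoted Phrase can be imitated without writing a real FTS5 tokenizer. The sketch below is a stand-in, not the VerbatimTokenizer itself: it pre-splits text in Python and relies on the stock unicode61 tokenizer, assuming sqlite3 is built with FTS5. The helpers verbatim_tokens and phrase_query are hypothetical, and treating ASCII alphanumeric runs as whole tokens is an assumption.

```python
import sqlite3


def verbatim_tokens(text):
    """Split text into tokens: ASCII alphanumeric runs stay whole (assumption),
    every other letter or digit becomes its own token, symbols are dropped."""
    tokens, word = [], ""
    for ch in text:
        if ch.isascii() and ch.isalnum():
            word += ch.lower()
            continue
        if word:
            tokens.append(word)
            word = ""
        if ch.isalnum():
            tokens.append(ch)
    if word:
        tokens.append(word)
    return tokens


def phrase_query(query):
    """Wrap the query tokens in quotes so FTS5 requires them to be adjacent."""
    return '"' + " ".join(verbatim_tokens(query)) + '"'


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE doc USING fts5(body)")
for text in ["北京欢迎你", "欢乐迎新年"]:
    # Storing space-separated characters emulates a per-character tokenizer.
    conn.execute("INSERT INTO doc(body) VALUES (?)", (" ".join(verbatim_tokens(text)),))

rows = conn.execute(
    "SELECT rowid FROM doc WHERE doc MATCH ?", (phrase_query("欢迎"),)
).fetchall()
```

Only the first row matches, because the second contains both characters but not next to each other.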
Five extensions were added: traditional‑simplified Chinese conversion, Unicode normalization, symbol filtering (required for contacts), Porter stemming for English (disabled for contacts), and case‑folding.
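Three of these extensions (Unicode normalization, symbol filtering, and case-folding) can be approximated with Python's standard library. This is a rough sketch only: the function name normalize_for_index is hypothetical, NFKC is an assumed choice of normalization form, and traditional-simplified conversion and Porter stemming are omitted because they need a mapping table and a stemmer.

```python
import unicodedata


def normalize_for_index(text, strip_symbols=True):
    """Normalize text before tokenization (illustrative pipeline only)."""
    # NFKC folds compatibility forms, e.g. full-width Latin letters, to plain ones.
    text = unicodedata.normalize("NFKC", text).casefold()
    if strip_symbols:
        # Replace anything that is not a letter, digit, or space with a space.
        text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    # Collapse runs of whitespace left behind by removed symbols.
    return " ".join(text.split())
```

Making symbol filtering a parameter mirrors the point that different business scenarios enable different extensions.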
2.4 Multi‑Level Separator Support
To index multiple searchable attributes without cross‑attribute false matches, a custom FTS5 auxiliary function SubstringMatchInfo was created. It leverages token position data to infer separators and hierarchy, enabling precise attribute‑level matching.
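The source does not show SubstringMatchInfo, and Python's sqlite3 module cannot register FTS5 auxiliary functions, so the following is a pure-Python model of the idea rather than the real function. It assumes two separator levels (";" between entries and "," between attributes of an entry), both hypothetical, and tags each character token with the segment it belongs to so that a match spanning a separator can be rejected.

```python
def index_tokens(text, separators=(";", ",")):
    """Return (char, path) pairs; path holds the segment index at each level."""
    tokens, path = [], [0] * len(separators)
    for ch in text:
        if ch in separators:
            level = separators.index(ch)
            path[level] += 1
            # Entering a new outer segment restarts the inner counters.
            for deeper in range(level + 1, len(path)):
                path[deeper] = 0
        else:
            tokens.append((ch, tuple(path)))
    return tokens


def substring_matches(text, query, separators=(";", ",")):
    """Return the segment path of every hit that stays inside one attribute."""
    tokens = index_tokens(text, separators)
    chars = [ch for ch, _ in tokens]
    wanted = list(query)
    hits = []
    for i in range(len(chars) - len(wanted) + 1):
        if chars[i:i + len(wanted)] == wanted:
            paths = {p for _, p in tokens[i:i + len(wanted)]}
            if len(paths) == 1:  # all matched tokens share one segment
                hits.append(paths.pop())
    return hits
```

For the text "Alice,Ali;Bob", the query "Ali" is found in attributes 0 and 1 of entry 0, while "eA" is rejected because it would join the end of one attribute to the start of the next.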
3. Full‑Text Search Application Logic Optimizations
3.1 Table Schema Choices
Two schema patterns were evaluated: (1) a separate ordinary table mapping rowid to non‑text data, and (2) embedding non‑text columns directly in the FTS table. The second pattern was retained for speed, with UNINDEXED constraints on non‑searchable columns to avoid redundant indexing.
Column ordering was adjusted to place the largest searchable column first, reducing column‑separator bytes and shrinking index files.
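A minimal sketch of the second schema pattern, assuming sqlite3 is built with FTS5. The table and column names (chat_index, body, chat_id, create_time) are hypothetical; UNINDEXED is FTS5's documented column option for storing a value without adding it to the full-text index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The searchable column comes first; the non-text columns are stored but not indexed.
conn.execute(
    "CREATE VIRTUAL TABLE chat_index USING fts5("
    "body, chat_id UNINDEXED, create_time UNINDEXED)"
)
conn.executemany(
    "INSERT INTO chat_index(body, chat_id, create_time) VALUES (?, ?, ?)",
    [
        ("lunch at noon", "room42", 1610000000),
        ("noon meeting moved", "room7", 1610000500),
    ],
)

# A match on the text column returns the stored metadata without a join.
hits = conn.execute(
    "SELECT chat_id, create_time FROM chat_index WHERE chat_index MATCH ?",
    ("noon",),
).fetchall()

# Values in UNINDEXED columns are not searchable.
unindexed_hits = conn.execute(
    "SELECT count(*) FROM chat_index WHERE chat_index MATCH ?", ("room42",)
).fetchone()[0]
```

The trade-off is that filtering or deleting by an UNINDEXED column needs a scan unless a separate index exists for it.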
3.2 Index Update Logic
Indexes are stored in a dedicated search database, decoupled from business databases. Updates are driven by per‑business progress markers (e.g., chat rowid, favorites updateSequence) stored in the same WAL‑enabled database to guarantee atomicity. A lazy‑validation step discards stale indexes only when results are displayed.
Batch indexing is triggered when 100 pending items accumulate, when the search UI opens, or on app launch. Deletion optimizations include optional auxiliary indexes on business IDs and selective UNINDEXED columns.
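The update flow in this section can be sketched as follows, assuming sqlite3 is built with FTS5. The schema (chat_index, progress), the helper names, and the in-memory source list are hypothetical. The point illustrated is that the index rows and the progress marker are written in one transaction, so a crash cannot leave them out of step.

```python
import sqlite3

BATCH_SIZE = 100  # batch threshold mentioned in the text


def open_search_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # has no effect on in-memory databases
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chat_index "
        "USING fts5(body, msg_id UNINDEXED)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress("
        "business TEXT PRIMARY KEY, marker INTEGER NOT NULL)"
    )
    return conn


def last_marker(conn, business):
    row = conn.execute(
        "SELECT marker FROM progress WHERE business = ?", (business,)
    ).fetchone()
    return row[0] if row else 0


def index_batch(conn, business, rows):
    """Index (msg_id, body) rows, sorted by msg_id, and advance the marker."""
    if not rows:
        return
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO chat_index(body, msg_id) VALUES (?, ?)",
            [(body, msg_id) for msg_id, body in rows],
        )
        conn.execute(
            "INSERT OR REPLACE INTO progress(business, marker) VALUES (?, ?)",
            (business, rows[-1][0]),
        )


conn = open_search_db()
source = [(i, f"message {i}") for i in range(1, 251)]  # stand-in business data
while True:
    marker = last_marker(conn, "chat")
    batch = [row for row in source if row[0] > marker][:BATCH_SIZE]
    if not batch:
        break
    index_batch(conn, "chat", batch)
```

Because the marker lives next to the index, resuming after a restart only requires reading it and continuing from the next row.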
3.3 Search Logic Enhancements
Search tasks now run in parallel across business domains and within a single domain by sharding FTS tables (e.g., four chat tables instead of ten). Tasks support interruption via a CancelFlag to avoid overlapping work during rapid user input.
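A simplified model of sharded, interruptible search is shown below. It is not WeChat's implementation: each shard is a plain list of (row_id, text) pairs standing in for an FTS table, threading.Event plays the role of the CancelFlag, and the names search_shard, parallel_search, and SearchCancelled are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SearchCancelled(Exception):
    """Raised when a search is abandoned because newer input arrived."""


def search_shard(rows, query, cancel):
    hits = []
    for row_id, text in rows:
        # Checking the flag per row lets a stale search stop early.
        if cancel.is_set():
            raise SearchCancelled()
        if query in text:
            hits.append(row_id)
    return hits


def parallel_search(shards, query, cancel):
    """Search every shard on its own worker and concatenate the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(search_shard, rows, query, cancel) for rows in shards]
        return [hit for future in futures for hit in future.result()]


shards = [
    [(1, "lunch at noon"), (2, "see you later")],
    [(3, "noon meeting moved"), (4, "good night")],
]
cancel = threading.Event()
hits = parallel_search(shards, "noon", cancel)
```

When the user types another character, the caller sets the old task's flag and starts a new search with a fresh Event, so the two never compete for long.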
Result ordering is performed client‑side after fetching required id and sort fields, eliminating costly ORDER BY in SQLite. Highlighting is deferred to the UI layer by re‑tokenizing the query and locating token offsets, avoiding the expensive highlight function during search.
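Both steps are simple enough to sketch in a few lines. The helpers below are hypothetical: order_hits assumes the search returned only (msg_id, create_time) pairs, and highlight_ranges uses a plain lowercase substring match in place of real re-tokenization, which assumes lowercasing does not change the string's length.

```python
def order_hits(hits):
    """Sort (msg_id, create_time) pairs newest first, outside SQLite."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def highlight_ranges(text, query):
    """Return (start, end) offsets of each non-overlapping query match."""
    if not query:
        return []
    folded_text, folded_query = text.lower(), query.lower()
    ranges, start = [], folded_text.find(folded_query)
    while start != -1:
        end = start + len(folded_query)
        ranges.append((start, end))
        start = folded_text.find(folded_query, end)
    return ranges
```

Since highlighting runs only for rows that are actually displayed, its cost scales with what is on screen rather than with the total number of hits.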
4. Performance Results
After the upgrade, per‑user index file sizes dropped dramatically, index build times decreased, and query latencies improved (e.g., 2.9 ms for three‑term queries on 1 M rows after the index had been optimized). Search UI latency across chat, contacts, and favorites showed consistent reductions.
5. Conclusion
The revamped iOS WeChat full‑text search, built on SQLite FTS5 with custom tokenizers, automatic segment merging, and extensive schema and workflow optimizations, delivers smaller index footprints, faster updates, and markedly quicker search responses across all supported business scenarios.
WeChat Client Technology Team
Official account of the WeChat mobile client development team, sharing development experience, cutting‑edge tech, and little‑known stories across Android, iOS, macOS, Windows Phone, and Windows.
