
How iOS WeChat Supercharged Search with SQLite FTS5 and Custom Tokenizers

This article details the 2021 overhaul of iOS WeChat's full‑text search, covering engine selection, segment‑merge optimization, a new VerbatimTokenizer, multi‑level separator support, table schema choices, asynchronous index updates, and extensive performance gains across chat, contacts, and favorites.

WeChat Client Technology Team

Feb 22, 2022


1. Current State of iOS WeChat Full‑Text Search

Full‑text search relies on an inverted index, in which each Token records the positions where it occurs in the content. Before the overhaul, WeChat's search scenarios were implemented inconsistently: chat history had used SQLite FTS3 since 2014, favorites relied on simple LIKE queries, and contacts were searched with in‑memory scans. This patchwork prompted the modernization.
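As a minimal illustration of the inverted‑index idea, the following Python sketch maps each token to the documents and positions where it occurs. It is a toy model, not SQLite's on‑disk format; the whitespace tokenization and the name build_inverted_index are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map token -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index

docs = {1: "send the report today", 2: "the report is late"}
index = build_inverted_index(docs)
print(dict(index["report"]))  # {1: [2], 2: [1]}
```

Looking up a token returns every document that contains it without scanning the text, which is what makes this structure faster than a LIKE query or an in‑memory scan.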

2. Engine Selection and Optimization

2.1 Engine Choice

Available engines for iOS include SQLite FTS3/4/5, CLucene, and Lucy. After comparing transaction support, technical risk, search capability, and read/write performance, SQLite FTS5 was chosen for its mature transaction handling and low risk.

Performance tests on 1 million random Chinese sentences showed that Lucene was faster at reading hits, but the overall read/write differences were minor; FTS5's index generation time was higher but could be optimized.

2.2 Automatic Segment Merge

FTS5 writes each committed transaction as a separate B‑tree segment. A large number of segments degrades query speed, so FTS5 provides a merge mechanism. New segments start at level 0, and a merge combines the segments at level i into one segment at level i+1. Two merge strategies exist: an automerge, triggered by default when a level reaches 4 segments, and a crisismerge, triggered by default when a level reaches 16 segments. To avoid blocking business logic, WeChat offloads merges to a dedicated thread, limits each run to one merge per level, and throttles the work to keep write latency low.
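The idea of moving merge work off the write path can be sketched with FTS5's documented 'automerge' and 'merge' commands. This is only an approximation of the approach described above, not WeChat's implementation: the table name docs, the helper merge_until_idle, and the step size are assumptions, and the sketch runs on a single thread for brevity.

```python
import sqlite3

def merge_until_idle(conn, table, pages_per_step=16, max_steps=1000):
    """Run incremental FTS5 'merge' steps until one of them does no work."""
    steps = 0
    for _ in range(max_steps):
        before = conn.total_changes
        conn.execute(
            f"INSERT INTO {table}({table}, rank) VALUES('merge', ?)",
            (pages_per_step,),
        )
        conn.commit()
        # Per the FTS5 docs, a difference below 2 means the step was a no-op.
        if conn.total_changes - before < 2:
            break
        steps += 1
    return steps

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
# Turn off automatic merging so writers never pay for it inline.
conn.execute("INSERT INTO docs(docs, rank) VALUES('automerge', 0)")
conn.commit()

# Each committed transaction adds one new level-0 segment.
for i in range(8):
    with conn:
        conn.execute("INSERT INTO docs(body) VALUES (?)", (f"message number {i}",))

merge_until_idle(conn, "docs")  # in an app, call this from a background thread
hits = conn.execute(
    "SELECT count(*) FROM docs WHERE docs MATCH 'message'"
).fetchone()[0]
print(hits)  # 8
```

Keeping each 'merge' step small bounds how long the merging connection holds the write lock, which is one way to throttle the work.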

2.3 Tokenizer Optimization

The tokenizer is crucial for breaking text into Tokens. The legacy FTS3 OneOrBinaryTokenizer used a hybrid character‑pair approach, inflating index size. A new VerbatimTokenizer performs simple character tokenization and builds quoted two‑character Phrases for exact adjacency matching, achieving the same precision with far less index bloat.

Five extensions were added: traditional‑simplified Chinese conversion, Unicode normalization, symbol filtering (required for contacts), Porter stemming for English (disabled for contacts), and case‑folding.
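Several of these extensions can be approximated with Python's standard library. The sketch below covers Unicode normalization, case folding, symbol filtering, and a two‑entry traditional‑to‑simplified table; the table and the function name normalize are placeholders, and stemming is omitted because the standard library has no Porter stemmer.

```python
import unicodedata

# Tiny illustrative table; a real converter needs a full traditional-to-simplified mapping.
TRAD_TO_SIMP = {"體": "体", "訊": "讯"}

def normalize(text, filter_symbols=True):
    """Apply NFKC normalization, case folding, script conversion, and symbol filtering."""
    text = unicodedata.normalize("NFKC", text).casefold()
    out = []
    for ch in text:
        ch = TRAD_TO_SIMP.get(ch, ch)
        # Unicode categories starting with P (punctuation) or S (symbol) are dropped.
        if filter_symbols and unicodedata.category(ch)[0] in "PS":
            continue
        out.append(ch)
    return "".join(out)

print(normalize("ＷｅＣｈａｔ！"))                  # wechat
print(normalize("a+b", filter_symbols=False))  # a+b
```

The same function has to run on both indexed text and query text, otherwise the two sides will not produce matching tokens.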

2.4 Multi‑Level Separator Support

To index multiple searchable attributes without cross‑attribute false matches, a custom FTS5 auxiliary function SubstringMatchInfo was created. It leverages token position data to infer separators and hierarchy, enabling precise attribute‑level matching.
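The position logic can be illustrated on plain strings. The sketch below is not SubstringMatchInfo itself, which is a native FTS5 auxiliary function; it only shows how counting separators before a match yields a two‑level index. The separator characters and the name locate are hypothetical.

```python
SEP1 = "\u001f"  # assumed top-level separator, e.g. between group members
SEP2 = "\u001e"  # assumed second-level separator, e.g. between a member's attributes

def locate(content, query):
    """Return (outer, inner) indexes for every occurrence of query in content."""
    results = []
    start = content.find(query)
    while start != -1:
        prefix = content[:start]
        outer = prefix.count(SEP1)
        # Count second-level separators only since the last top-level separator.
        inner = prefix[prefix.rfind(SEP1) + 1:].count(SEP2)
        results.append((outer, inner))
        start = content.find(query, start + 1)
    return results

members = [["alice", "ali"], ["bob", "bobby"]]
content = SEP1.join(SEP2.join(attrs) for attrs in members)
print(locate(content, "bob"))   # [(1, 0), (1, 1)]
print(locate(content, "ebob"))  # []
```

Since a query never contains a separator, a match cannot straddle two attributes, which rules out cross‑attribute false positives such as "ebob" above.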

3. Full‑Text Search Application Logic Optimizations

3.1 Table Schema Choices

Two schema patterns were evaluated: (1) a separate ordinary table mapping rowid to non‑text data, and (2) embedding non‑text columns directly in the FTS table. The second pattern was retained for speed, with UNINDEXED constraints on non‑searchable columns to avoid redundant indexing.

Column ordering was adjusted to place the largest searchable column first, reducing column‑separator bytes and shrinking index files.
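A minimal sketch of the second pattern, using FTS5's UNINDEXED column option, might look like this. The table and column names are assumptions; only the searchable column, placed first, contributes to the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# content is searchable and comes first; the other columns are stored but not indexed.
conn.execute(
    "CREATE VIRTUAL TABLE chat_index USING fts5("
    "content, msg_id UNINDEXED, create_time UNINDEXED)"
)
rows = [
    ("lunch at noon tomorrow", 101, 1600000000),
    ("noon meeting moved", 102, 1600000500),
]
conn.executemany(
    "INSERT INTO chat_index(content, msg_id, create_time) VALUES (?, ?, ?)", rows
)

# One query returns the business fields directly, with no join to a side table.
hits = conn.execute(
    "SELECT msg_id, create_time FROM chat_index WHERE chat_index MATCH ?", ("noon",)
).fetchall()
# UNINDEXED values are not searchable, so matching on an id finds nothing.
by_id = conn.execute(
    "SELECT count(*) FROM chat_index WHERE chat_index MATCH ?", ("101",)
).fetchone()[0]
print(sorted(hits), by_id)
```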

3.2 Index Update Logic

Indexes are stored in a dedicated search database, decoupled from business databases. Updates are driven by per‑business progress markers (e.g., chat rowid, favorites updateSequence) stored in the same WAL‑enabled database to guarantee atomicity. A lazy‑validation step discards stale indexes only when results are displayed.
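The atomicity argument can be sketched as follows: index rows and the progress marker are written in one transaction, so a failure leaves both untouched. The table names, the progress schema, and index_batch are assumptions; an in‑memory database is used for brevity, whereas a file‑backed database would also enable WAL with PRAGMA journal_mode=WAL.

```python
import sqlite3

def index_batch(conn, business, messages, last_rowid):
    """Write index rows and the progress marker in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        for rowid, text in messages:
            conn.execute(
                "INSERT INTO chat_index(rowid, content) VALUES (?, ?)", (rowid, text)
            )
        conn.execute(
            "INSERT OR REPLACE INTO progress(business, last_rowid) VALUES (?, ?)",
            (business, last_rowid),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chat_index USING fts5(content)")
conn.execute(
    "CREATE TABLE progress(business TEXT PRIMARY KEY, last_rowid INTEGER NOT NULL)"
)

index_batch(conn, "chat", [(1, "hello"), (2, "see you at noon")], last_rowid=2)
try:
    # Simulated failure: the marker write violates NOT NULL after row 3 was inserted.
    index_batch(conn, "chat", [(3, "orphan message")], last_rowid=None)
except sqlite3.IntegrityError:
    pass  # the whole batch, including row 3, was rolled back

indexed = conn.execute(
    "SELECT count(*) FROM chat_index WHERE chat_index MATCH 'orphan'"
).fetchone()[0]
marker = conn.execute(
    "SELECT last_rowid FROM progress WHERE business = 'chat'"
).fetchone()[0]
print(indexed, marker)  # 0 2
```

Because the marker never advances past data that was not indexed, the next run simply resumes from row 3.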

Batch indexing is triggered when 100 pending items accumulate, when the search UI opens, or on app launch. Deletion optimizations include optional auxiliary indexes on business IDs and selective UNINDEXED columns.
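The batching triggers can be sketched with a small queue that flushes either when the threshold is reached or when asked to. The class name PendingIndexQueue and the callback interface are assumptions; the threshold of 100 is the figure quoted above.

```python
class PendingIndexQueue:
    """Collects items awaiting indexing and flushes them in batches."""

    def __init__(self, flush, threshold=100):
        self._flush = flush          # callback that writes one batch to the index
        self._threshold = threshold
        self._pending = []

    def add(self, item):
        self._pending.append(item)
        if len(self._pending) >= self._threshold:
            self.flush_now()

    def flush_now(self):
        """Also meant to be called when the search UI opens or the app launches."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)

batches = []
queue = PendingIndexQueue(batches.append)
for i in range(250):
    queue.add(i)
queue.flush_now()  # e.g. the user opened the search screen
print([len(b) for b in batches])  # [100, 100, 50]
```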

3.3 Search Logic Enhancements

Search tasks now run in parallel across business domains and within a single domain by sharding FTS tables (e.g., four chat tables instead of ten). Tasks support interruption via a CancelFlag to avoid overlapping work during rapid user input.
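The combination of sharded parallel search and cooperative cancellation can be sketched as below. Plain lists stand in for the sharded FTS tables, threading.Event plays the role of the CancelFlag, and the function names are assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def search_shard(rows, keyword, cancel_flag):
    """Scan one shard, checking the cancel flag between rows."""
    hits = []
    for row in rows:
        if cancel_flag.is_set():
            return None  # abandoned: a newer query superseded this one
        if keyword in row:
            hits.append(row)
    return hits

def search_all(shards, keyword, cancel_flag):
    """Search every shard in parallel; return None if the task was cancelled."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = list(pool.map(lambda s: search_shard(s, keyword, cancel_flag), shards))
    if any(p is None for p in parts):
        return None
    return [hit for part in parts for hit in part]

shards = [["noon lunch", "hi"], ["see you at noon"], ["ok"], ["good night"]]
print(search_all(shards, "noon", threading.Event()))  # ['noon lunch', 'see you at noon']

cancelled = threading.Event()
cancelled.set()  # e.g. the user typed another character
print(search_all(shards, "noon", cancelled))  # None
```

Each keystroke would create a fresh flag for the new task and set the flag of the previous one, so stale work stops at its next check instead of running to completion.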

Result ordering is performed client‑side after fetching required id and sort fields, eliminating costly ORDER BY in SQLite. Highlighting is deferred to the UI layer by re‑tokenizing the query and locating token offsets, avoiding the expensive highlight function during search.
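Both ideas can be sketched in a few lines: sort the fetched (id, sort key) rows in application code, and compute highlight ranges only for the rows that are actually displayed. This simplified version matches the whole query as a substring instead of re‑tokenizing it, and the row layout and function name are assumptions.

```python
def highlight_ranges(text, query):
    """Return (start, end) offsets of each case-insensitive occurrence of query."""
    # Assumes lower() preserves string length, which holds for ASCII and CJK text.
    ranges, haystack, needle = [], text.lower(), query.lower()
    if not needle:
        return ranges
    start = haystack.find(needle)
    while start != -1:
        ranges.append((start, start + len(needle)))
        start = haystack.find(needle, start + len(needle))
    return ranges

# Rows as fetched from the index: (msg_id, create_time, content).
rows = [(7, 1600000300, "Noon works"), (3, 1600000900, "see you at noon")]
rows.sort(key=lambda r: r[1], reverse=True)  # newest first, sorted outside SQLite

for msg_id, _, content in rows:
    print(msg_id, highlight_ranges(content, "noon"))
# 3 [(11, 15)]
# 7 [(0, 4)]
```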

4. Performance Results

After the upgrade, index file sizes per user dropped dramatically, index build times decreased, and query latencies improved (e.g., 2.9 ms for three‑term queries on 1 M rows with the index in its optimized, fully merged state). Search UI latency across chat, contacts, and favorites showed consistent reductions.

5. Conclusion

The revamped iOS WeChat full‑text search, built on SQLite FTS5 with custom tokenizers, automatic segment merging, and extensive schema and workflow optimizations, delivers smaller index footprints, faster updates, and markedly quicker search responses across all supported business scenarios.
