
Solving Chinese Homograph Search in WeChat Android with SQLite FTS5

This article explains how WeChat Android tackles the challenge of Chinese homograph full‑text search by using SQLite FTS5, comparing word‑based and character‑based indexing schemes, introducing a custom tokenizer, and detailing preprocessing and performance results.

WeChat Client Technology Team

Introduction

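Chinese full-text search on mobile clients handles homographs poorly. A homograph here is a character with more than one pronunciation, so a pinyin query can miss a name whose characters are read differently from what the index assumed, which hurts the user experience. WeChat’s Android client has received many complaints about this, so this article presents a solution based on SQLite FTS5.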

Requirements

Search type: pinyin prefix search. Chinese characters and pinyin cannot be mixed in one query, and the input must be the full pinyin or the short pinyin (initial letters, e.g., “sw”) of consecutive characters.

Search content: contacts, group chats, and public account remarks/nicknames (max 16 Chinese characters).

Word‑Table Scheme

Supporting pinyin normally requires a single‑character pinyin table, but homographs need a word‑level pinyin mapping because a character can have different pronunciations in different words.

Two approaches are considered:

Exhaustive word list: enumerate every word together with its pinyin, at a very high storage cost.

Probabilistic model: train a classifier to predict pinyin. The resulting model is about 1 GB, so it has to run on the backend.

The word‑table scheme is discarded due to its resource consumption.

Character‑Table Scheme

A table of the 20,777 most common Chinese characters (~200 KB) can be shipped to the client, offering O(1) lookup.

Advantages: immediate index updates after nickname/remark changes and off‑loading heavy computation from the server.

Disadvantage: a homograph in a word can be matched by any of its pronunciations, which may affect precision.

WeChat ultimately chose this character‑table approach.
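
The trade-off of the character table can be illustrated with a minimal sketch. The two-entry table and the function name below are hypothetical, not the WeChat implementation; the point is only that every reading of a homograph is accepted, so “长城” is found by both “changcheng” and “zhangcheng”.

```python
from typing import Dict, List

# Hypothetical excerpt of a character -> readings table (toneless pinyin).
# The shipped table described above covers 20,777 characters.
CHAR_PINYIN: Dict[str, List[str]] = {
    "长": ["chang", "zhang"],  # homograph: two readings
    "城": ["cheng"],
}


def matches_full_pinyin(text: str, query: str) -> bool:
    """Return True if `query` is a prefix of some per-character reading sequence."""

    def walk(i: int, rest: str) -> bool:
        if not rest:
            return True  # whole query consumed
        if i == len(text):
            return False  # query is longer than the name
        for reading in CHAR_PINYIN.get(text[i], []):
            # The query may stop in the middle of a syllable (prefix search).
            if reading.startswith(rest):
                return True
            if rest.startswith(reading) and walk(i + 1, rest[len(reading):]):
                return True
        return False

    return walk(0, query)


print(matches_full_pinyin("长城", "changcheng"))  # the usual reading
print(matches_full_pinyin("长城", "zhangch"))     # also accepted: the precision cost
```

Accepting “zhangch” for “长城” is exactly the precision loss listed as the disadvantage of this scheme.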

Client Index Scheme

Using SQLite FTS5, a prefix index is built on the client. There are two ways to serve prefix queries:

Path 1: Build the index with the FTS5 prefix option, allowing direct hash lookup of prefix terms.

Path 2: Build the index without the prefix option; FTS5 then constructs a temporary prefix tree at query time.

Path 1 is preferred for its lower time complexity.
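
The two paths differ only in how the table is declared. The sketch below uses Python's sqlite3 module and assumes the underlying SQLite library was compiled with FTS5; the table name, column name, sample rows, and the prefix lengths '1 2 3' are illustrative assumptions, not values taken from WeChat.

```python
import sqlite3
from typing import List


def build_index(use_prefix_index: bool) -> sqlite3.Connection:
    """Create an in-memory FTS5 table, with or without dedicated prefix indexes."""
    conn = sqlite3.connect(":memory:")
    # Path 1 adds prefix indexes for 1-, 2- and 3-character prefixes (assumed sizes).
    options = ", prefix='1 2 3'" if use_prefix_index else ""
    conn.execute(f"CREATE VIRTUAL TABLE contacts USING fts5(pinyin{options})")
    conn.executemany(
        "INSERT INTO contacts(rowid, pinyin) VALUES (?, ?)",
        [(1, "shi wei shu ji"), (2, "zhang san")],  # hypothetical rows
    )
    return conn


def search(conn: sqlite3.Connection, query: str) -> List[int]:
    """Run an FTS5 MATCH query and return the matching rowids."""
    rows = conn.execute(
        "SELECT rowid FROM contacts WHERE contacts MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]


with_prefix = build_index(use_prefix_index=True)     # Path 1
without_prefix = build_index(use_prefix_index=False)  # Path 2
print(search(with_prefix, "sh*"), search(without_prefix, "sh*"))
```

Both tables return the same rows for a prefix query such as “sh*”; the prefix option changes how the lookup is served, not what it finds.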

Index Variants

Four index designs are evaluated:

Scheme 1

Assumes users type continuous pinyin from the start of a word (e.g., “shi”, “shiweishuj”, “sw”).

Scheme 2

Handles middle‑of‑word pinyin input (e.g., “shuji”, “sj”) by indexing each character’s pinyin as a prefix.

Scheme 3

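Enumerates every pinyin combination for names that contain homographs. In the worst case the number of combinations, and with it the index size, grows exponentially with the number of homographs in the name, which makes this scheme impractical.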
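
A rough sketch of why enumeration blows up: the number of strings to index is the product of the reading counts of each character. The three-character name and its readings below are a made-up example.

```python
from itertools import product
from math import prod
from typing import List


def enumerate_readings(per_char: List[List[str]]) -> List[str]:
    """Return every full-pinyin string obtainable by picking one reading per character."""
    return ["".join(choice) for choice in product(*per_char)]


# Hypothetical name of three homographs with two readings each.
per_char = [["chang", "zhang"], ["le", "yue"], ["xing", "hang"]]

combos = enumerate_readings(per_char)
print(len(combos), prod(len(readings) for readings in per_char))

# A 16-character name made only of two-reading characters would need 2**16 strings.
print(2 ** 16)
```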

Scheme 4

Uses a synonym‑like approach: different pinyin strings that map to the same character share the same DocId and TermOffset, dramatically reducing index size while keeping lookup fast.
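
The idea can be sketched with a toy in-memory inverted index, assuming postings are (DocId, TermOffset) pairs. The data and names are hypothetical and the real FTS5 on-disk format is different; the sketch only shows that alternative readings share one position, so the entry count is the sum of the reading counts rather than their product.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (doc_id, term_offset)


def build_postings(docs: Dict[int, List[List[str]]]) -> Dict[str, List[Posting]]:
    """Index each reading of each character at that character's offset."""
    postings: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, per_char in docs.items():
        for offset, readings in enumerate(per_char):
            for reading in readings:
                # All readings of one character share the same (doc_id, offset).
                postings[reading].append((doc_id, offset))
    return postings


# Hypothetical document 1: three homographs with two readings each.
docs = {1: [["chang", "zhang"], ["le", "yue"], ["xing", "hang"]]}
postings = build_postings(docs)

entries = sum(len(p) for p in postings.values())
print(postings["chang"], postings["zhang"], entries)
```

For this name the index holds 6 entries, compared with the 8 full strings that exhaustive enumeration would produce, and the gap widens quickly as names get longer.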

Homograph Tokenizer

SQLite FTS5’s default tokenizer treats pinyin as ordinary letters. A custom tokenizer introduces a secondary delimiter (“;”) between characters and a comma (“,”) between multiple pronunciations of the same character.

The tokenizer splits its input on these two delimiters and emits terms that match the index structure.
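
The delimiter format can be parsed as in the sketch below. It is written in Python for illustration only: a real FTS5 tokenizer is implemented against the C API, where a token that shares a position with the previous one is reported with the FTS5_TOKEN_COLOCATED flag. Whether WeChat's tokenizer uses that flag is an assumption, and the function name and tuple layout are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, int, bool]  # (term, position, colocated_with_previous)


def tokenize(indexed_text: str) -> List[Token]:
    """Split 'a,b;c' into terms: ';' advances the position, ',' keeps it."""
    tokens: List[Token] = []
    for position, group in enumerate(indexed_text.split(";")):
        readings = [r for r in group.split(",") if r]
        for i, reading in enumerate(readings):
            # Every reading after the first sits at the same position as the first.
            tokens.append((reading, position, i > 0))
    return tokens


for token in tokenize("chang,zhang;cheng"):
    print(token)
```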

User Input Pre‑processing

When a user enters continuous pinyin, the query is split into possible terms that exist in the index. For example, the input “zhuang” yields seven search combinations, considering both full and short pinyin and allowing the last token to be a prefix.

The algorithm builds a prefix tree of all pinyin strings, achieving O(n log n) decomposition time, and the resulting SQL statements are embedded directly in the client.
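
A simplified sketch of this decomposition follows. It assumes a tiny hypothetical syllable table, treats a single letter that starts some syllable as short pinyin, and marks the last token as a prefix with “*”. It is not the WeChat algorithm and makes no claim about matching its complexity or its seven combinations for “zhuang”.

```python
from typing import Dict, List

END = ""  # marker key for "a complete syllable ends here"; never equals an input character

# Hypothetical syllable table; a real client would load every valid pinyin syllable.
SYLLABLES = ["shi", "wei", "shu", "ji"]


def build_trie(words: List[str]) -> Dict:
    """Build a nested-dict prefix tree over the syllable table."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def segment(query: str, trie: Dict) -> List[List[str]]:
    """Split continuous pinyin into token lists; the last token is a prefix ('*')."""
    results: List[List[str]] = []

    def walk(start: int, tokens: List[str]) -> None:
        node = trie
        for end in range(start, len(query)):
            node = node.get(query[end])
            if node is None:
                return  # no syllable continues this way
            piece = query[start:end + 1]
            if end + 1 == len(query):
                results.append(tokens + [piece + "*"])  # final token: prefix match
            elif END in node or len(piece) == 1:
                walk(end + 1, tokens + [piece])  # full syllable or single initial

    if query:
        walk(0, [])
    return results


def to_match(tokens: List[str]) -> str:
    """Join tokens into an FTS5 phrase expression such as 'shi + w*'."""
    return " + ".join(tokens)


trie = build_trie(SYLLABLES)
print([to_match(t) for t in segment("shiw", trie)])
```

Each resulting expression can be passed to an FTS5 MATCH clause; a query that admits several decompositions produces several expressions to try.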

Results

Compared with Scheme 3, Scheme 4 reduces pinyin data size by about 50%. Index creation time drops by about 30%, and query latency improves by roughly 15% despite a slight increase in hash-table lookups.

