Graph-Based Chinese Word Embedding (AlphaEmbedding) for Improved Text Matching
AlphaEmbedding builds a weighted graph linking Chinese words, sub‑words, characters and pinyin, then uses random‑walk‑based node2vec training to produce embeddings that capture orthographic and phonetic similarity, markedly improving recall and ranking for homophones, typos and OOV terms in enterprise search.
In the field of natural language processing, text representation learning transforms real‑world text into data that can be processed by computers. In Chinese search scenarios, issues such as homophones, easily confused words, and typographical errors make recall and similarity matching challenging. This article proposes training Chinese word vectors from a graph‑computing perspective and reports positive results.
Technical Background
Traditional retrieval relied on tokenization and inverted indexes, while modern approaches use vector retrieval (e.g., Faiss, NMSLIB). Existing embedding methods rarely account for Chinese-specific problems such as input-method errors or phonetic similarity.
Evolution of Word Embedding
Early methods (one-hot, TF-IDF, bag-of-words) suffer from high dimensionality and capture no semantic similarity. Neural language models (Bengio et al., 2003) introduced distributed representations, followed by word2vec (Mikolov et al.) with its CBOW and skip-gram objectives, ELMo (contextual bidirectional LSTM), and BERT (masked language modeling and next-sentence prediction). However, these models were designed primarily with Latin-script languages in mind and do not exploit the internal structure of Chinese characters.
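The skip-gram objective mentioned above predicts context words from a center word within a sliding window. A minimal sketch of how (center, context) training pairs are extracted from a token sequence (the window size of 1 and the example tokens are illustrative, not from the article):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs as consumed by skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy example; in this article the "sentences" are random-walk node sequences.
pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
```

The same pair-extraction step applies unchanged whether the input sequences are natural-language sentences (word2vec) or graph random walks (node2vec).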
Problems and Proposed Solution
Most Chinese embedding research focuses on semantic context, ignoring orthographic and phonetic cues. The proposed solution builds an undirected weighted graph whose nodes represent words, sub-words, characters, and pinyin. Edges connect "word-subword-character-pinyin" nodes, with weights derived from edit distance or learned models. Random walks (node2vec or metapath2vec) then generate node sequences for skip-gram training, yielding embeddings that capture Chinese morphological and phonetic similarity.
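A minimal pure-Python sketch of such a "word-subword-character-pinyin" graph (the article builds it at scale with PySpark; the node naming scheme and unit edge weights here are illustrative assumptions, not the production weighting):

```python
from collections import defaultdict

def add_edge(graph, u, v, w):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph[u][v] = w
    graph[v][u] = w

graph = defaultdict(dict)

# word -> character edges (sub-word granularity)
add_edge(graph, "word:腾讯", "char:腾", 1.0)
add_edge(graph, "word:腾讯", "char:讯", 1.0)

# character -> pinyin edges (phonetic layer); homophones share a pinyin node
add_edge(graph, "char:讯", "pinyin:xun4", 1.0)
add_edge(graph, "char:迅", "pinyin:xun4", 1.0)  # homophone of 讯
```

A random walk starting at 腾讯 can now reach 迅 through the shared pinyin node, so the resulting embeddings place homophones and easily confused characters close together.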
Experimental Setup
The dataset consists of roughly 2.2 billion registered Chinese company names. Graph construction is implemented in PySpark, producing tens of millions of nodes and billions of edges. Random walks run on a Tencent Cloud EMR cluster (about 1 minute for the walks, about 2 hours for embedding training). Two graph-construction styles are compared: a combinatorial style (only same-pronunciation and sub-word edges) and a fastText-style (the full "word-subword-character-pinyin" graph). Several walk strategies (node2vec depth-first, node2vec breadth-first, metapath2vec) are evaluated.
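The depth-first versus breadth-first behavior compared above comes from node2vec's second-order walk bias, controlled by the return parameter p and the in-out parameter q. A single-machine sketch of that biased step (the article runs this distributed on EMR; the toy graph and parameter values are illustrative):

```python
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    """One second-order biased random walk (node2vec).

    graph: adjacency dict {node: {neighbor: weight}}.
    Small p favors returning to the previous node; small q favors moving
    outward (DFS-like exploration), large q keeps the walk local (BFS-like).
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(graph[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step: unbiased
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)        # step back to previous node
            elif x in graph[prev]:
                weights.append(1.0)            # distance 1 from prev
            else:
                weights.append(1.0 / q)        # move outward from prev
        walk.append(rng.choices(nbrs, weights=weights)[0])
    return walk

toy = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
walk = node2vec_walk(toy, "A", length=10, p=0.25, q=4.0)
```

Setting q > 1 (as in the breadth-first configuration that wins in the results below) keeps walks near the source node, producing the finer-grained neighborhoods the article reports for downstream ranking.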
Results
The fastText-style graph combined with node2vec breadth-first walks yields the best trade-off between OOV recall and ranking smoothness. Depth-first walks retrieve more distant sub-words, while breadth-first walks provide finer granularity for downstream tasks. Metapath2vec shows lower discrimination because its meta-path sampling is biased. Overall, node2vec on the fastText-style graph outperforms the combinatorial approach.
Conclusion
By incorporating Chinese orthographic and phonetic information into a graph-based embedding pipeline, AlphaEmbedding improves similarity matching for homophones, typos, and mixed characters. The method is already deployed in Tencent Cloud's enterprise profile search. Future work includes stroke-level modeling and graph neural networks (GCN, GraphSAGE, GAT) to further improve embedding quality.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.