cw2vec: Learning Chinese Word Embeddings with Stroke n-grams
The cw2vec paper, presented at AAAI 2018, introduces a Chinese word embedding method that uses stroke n‑grams to capture character semantics and proposes a novel loss function. It reports consistent improvements over existing models on similarity, analogy, classification and NER tasks, and discusses real‑world AI applications.
Word‑vector algorithms are fundamental to natural language processing, but most existing methods, such as word2vec, are designed for Latin‑script languages and ignore the rich semantic information inherent in Chinese characters. The cw2vec model, a collaboration between Ant Financial AI Lab and Singapore University of Technology and Design, addresses this gap by representing Chinese words through n‑gram sequences of strokes.
The authors define “stroke n‑grams” as contiguous sequences of n strokes taken from a word's stroke sequence, and treat each n‑gram as a semantic unit. They introduce a new loss function that combines a sigmoid‑based similarity term with negative sampling, which allows efficient training without the computational cost of a full softmax.
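To make the loss concrete, here is a minimal sketch in plain Python. It assumes, as an illustration rather than a statement from this summary, that the similarity between a word and a context word is the sum of dot products between the word's stroke n‑gram vectors and the context vector, and that the loss takes the usual negative‑sampling form. The names `similarity` and `pair_loss` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def similarity(ngram_vecs: Sequence[Vector], context_vec: Vector) -> float:
    """Assumed similarity: sum of n-gram/context dot products."""
    return sum(dot(q, context_vec) for q in ngram_vecs)


def pair_loss(
    ngram_vecs: Sequence[Vector],
    context_vec: Vector,
    negative_vecs: Sequence[Vector],
) -> float:
    """Negative-sampling loss for one (word, context) pair.

    The true context is pushed toward high similarity, and each sampled
    negative context toward low similarity. Only the sampled negatives are
    scored, so no softmax over the whole vocabulary is needed.
    """
    positive = log_sigmoid(similarity(ngram_vecs, context_vec))
    negative = sum(
        log_sigmoid(-similarity(ngram_vecs, c)) for c in negative_vecs
    )
    return -(positive + negative)
```

The cost per training pair grows with the number of n‑grams in the word and the number of sampled negatives, not with the vocabulary size, which is the efficiency gain the paragraph above refers to.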
During preprocessing, each word is decomposed into its constituent characters, each character is split into strokes, strokes are mapped to numeric IDs, and sliding windows generate the stroke n‑grams. Each n‑gram receives its own embedding vector, initialized randomly with the same dimensionality as traditional word vectors.
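The preprocessing steps above can be sketched as follows. The stroke table is a toy, hypothetical mapping with stroke IDs 1 to 5 standing for broad stroke types. The n‑gram range and the choice to slide the window over the concatenated stroke sequence of the whole word are also illustrative assumptions. A real pipeline would need a full character‑to‑stroke dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical toy stroke table: character -> list of stroke IDs.
# The ID scheme and entries are illustrative assumptions, not paper data.
TOY_STROKES: Dict[str, List[int]] = {
    "大": [1, 3, 4],
    "人": [3, 4],
}


def word_to_strokes(word: str, table: Dict[str, List[int]]) -> List[int]:
    """Split a word into characters and concatenate their stroke IDs."""
    strokes: List[int] = []
    for ch in word:
        strokes.extend(table[ch])
    return strokes


def stroke_ngrams(
    strokes: List[int], n_min: int = 3, n_max: int = 5
) -> List[Tuple[int, ...]]:
    """Slide windows of each size n_min..n_max over the stroke sequence."""
    grams: List[Tuple[int, ...]] = []
    for n in range(n_min, n_max + 1):
        for i in range(len(strokes) - n + 1):
            grams.append(tuple(strokes[i:i + n]))
    return grams


if __name__ == "__main__":
    seq = word_to_strokes("大人", TOY_STROKES)
    for gram in stroke_ngrams(seq):
        print(gram)
```

Each distinct tuple would then index one row of an n‑gram embedding table, so words that share stroke patterns also share parameters.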
Experiments on public datasets compare cw2vec with word2vec (skip‑gram and CBOW), GloVe, CWE, and recent pixel‑based and radical‑based Chinese embedding methods. Across word similarity, word analogy, text classification, and named‑entity recognition tasks, cw2vec consistently outperforms the baselines. Additional experiments that vary the embedding dimensionality and train on only 20% of Chinese Wikipedia further confirm its robustness, especially on small corpora.
Case studies on the terms “water pollution” and “Sun Wukong” illustrate cw2vec’s ability to capture fine‑grained semantic relations that other models miss, thanks to the combined influence of stroke information and contextual word vectors.
Beyond research, the cw2vec technique has been deployed in Ant Group’s intelligent customer service, text risk control, and recommendation systems, and similar approaches have been explored for Japanese and Korean, resulting in nearly twenty related patent applications.
The paper is available at https://github.com/ShelsonCao/cw2vec/blob/master/cw2vec.pdf
