How cw2vec Beats Word2Vec: Leveraging Chinese Stroke N‑grams for Superior Word Embeddings

This article introduces cw2vec, a novel Chinese word‑embedding algorithm that exploits stroke‑level subword information, outlines its theoretical foundations, compares it with word2vec, GloVe, CWE and other models on multiple benchmarks, and demonstrates its superior performance across word similarity, analogy, text classification and named‑entity recognition tasks.

Alibaba Cloud Developer

Background

Chinese characters carry rich semantic information in their strokes, yet most existing word‑embedding methods ignore these linguistic features, limiting performance on Chinese NLP tasks such as intelligent customer service, machine translation, text summarization, and sentiment analysis.

Word Embedding Basics

Word vectors map words into a semantic vector space using unsupervised learning. Traditional models like word2vec (skip‑gram, CBOW) and GloVe were designed for alphabetic languages and do not fully exploit Chinese character structure.
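As a rough illustration of how skip‑gram sees text, the sketch below enumerates (target, context) pairs from a sliding window over a tokenized sentence. The function name `skipgram_pairs`, the `window` parameter, and the three‑word sentence are illustrative assumptions, not part of any specific library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Illustrative pre-segmented sentence (an assumption for this sketch).
sentence = ["治理", "雾霾", "刻不容缓"]
pairs = skipgram_pairs(sentence, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Word‑level models stop here: each target is an opaque token, so nothing inside the characters contributes to the learned vector.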

Related Work

Early work (Harris 1954) introduced the distributional hypothesis. Subsequent models include NNLM and word2vec with negative sampling and hierarchical softmax. CWE (averaging character vectors), radical‑enhanced embeddings, and glyph‑based approaches (GWE) attempted to incorporate character‑level or sub‑character information, but each had limitations.

cw2vec Model

cw2vec proposes the concept of n‑gram strokes: contiguous sequences of n strokes within a Chinese word or character. Each n‑gram stroke is assigned a vector of the same dimensionality as the word vectors. The algorithm proceeds as follows:

Decompose each word into its constituent characters and further into stroke sequences.

Map each stroke to a numeric ID and generate overlapping n‑gram windows (e.g., n=3,4,5).

Initialize n‑gram stroke vectors randomly.

During training, keep context words intact while splitting the target word into its n‑gram strokes.

Optimize a loss function that combines a sigmoid‑based similarity term with negative sampling, analogous to word2vec but applied at the stroke level.
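The first two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `STROKES` table is a hypothetical two‑character lookup, and the five‑class stroke ID scheme (1 horizontal, 2 vertical, 3 left‑falling, 4 right‑falling or dot, 5 turning) is an assumption made for the example.

```python
# Hypothetical toy stroke table: character -> string of stroke IDs.
# Assumed ID scheme: 1 horizontal, 2 vertical, 3 left-falling,
# 4 right-falling or dot, 5 turning.
STROKES = {"大": "134", "人": "34"}


def stroke_ngrams(word, n_min=3, n_max=5, table=STROKES):
    """Concatenate the stroke IDs of every character in `word`,
    then slide windows of length n_min..n_max over the sequence."""
    seq = "".join(table[ch] for ch in word)
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            grams.append(seq[i:i + n])
    return grams


grams = stroke_ngrams("大人")
print(grams)
```

Because the stroke sequence is built for the whole word, windows can span character boundaries. A real system would need a full stroke dictionary and would likely deduplicate the n‑grams per word.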

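The loss function, written as an objective to be maximized, is:

$$
\mathcal{L} = \sum_{w \in D} \sum_{c \in T(w)} \left[ \log \sigma\big(\mathrm{sim}(w, c)\big) + \lambda \, \mathbb{E}_{c' \sim P}\big[ \log \sigma\big(-\mathrm{sim}(w, c')\big) \big] \right], \qquad \mathrm{sim}(w, c) = \sum_{q \in S(w)} \vec{q} \cdot \vec{c}
$$

where $w$ denotes the target word, represented by the vectors $\vec{q}$ of its n‑gram strokes $S(w)$; $\vec{c}$ is the vector of a context word $c$; $\sigma$ is the sigmoid function; $T(w)$ is the set of all words in the sliding window around $w$; $D$ is the set of words in the training corpus; $P$ is the noise distribution from which negative samples $c'$ are drawn; and $\lambda$ is the number of negative samples.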
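A minimal sketch of how one term of this objective could be evaluated for a single (target, context) pair. It assumes the similarity between a target word and a context word is the sum of dot products between each of the target's n‑gram stroke vectors and the context vector, and that the expectation over negatives is approximated by summing over the sampled negative words. The function names and the toy vectors are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sim(gram_vecs, ctx_vec):
    """Sum of dot products between each n-gram stroke vector and the context vector."""
    return sum(dot(q, ctx_vec) for q in gram_vecs)


def pair_objective(gram_vecs, ctx_vec, neg_vecs):
    """log sigmoid(sim) for the observed pair plus log sigmoid(-sim)
    for each sampled negative context word."""
    pos = math.log(sigmoid(sim(gram_vecs, ctx_vec)))
    neg = sum(math.log(sigmoid(-sim(gram_vecs, c))) for c in neg_vecs)
    return pos + neg


# Toy values, made up for illustration (not trained vectors).
grams = [[0.1, 0.2], [0.3, -0.1]]  # n-gram stroke vectors of one target word
context = [0.5, 0.5]               # vector of an observed context word
negatives = [[-0.4, 0.2]]          # one sampled negative context word
score = sim(grams, context)
value = pair_objective(grams, context, negatives)
print(score, value)
```

Both log terms are at most zero, so training pushes the value toward zero by raising the similarity of observed pairs and lowering it for negatives.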

Training Procedure

Before training, each word’s n‑gram stroke set is pre‑computed and vectors are randomly initialized. During each iteration, the algorithm updates both the context word vectors and the n‑gram stroke vectors based on the gradient of the loss.
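One update of this kind can be sketched as below, assuming plain stochastic gradient steps with a fixed learning rate and the same sum‑of‑dot‑products similarity between a word's n‑gram stroke vectors and a context vector. The name `sgd_step`, the `lr` value, and the toy vectors are assumptions for illustration. Real trainers typically add learning‑rate decay and batching.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sgd_step(gram_vecs, ctx_vec, label, lr=0.05):
    """One gradient step on log sigmoid(+/- sim) for a single pair.

    label is 1 for an observed context word and 0 for a negative sample.
    Vectors are updated in place; the similarity before the update is returned.
    """
    dim = len(ctx_vec)
    # h is the sum of the target word's n-gram stroke vectors.
    h = [sum(q[k] for q in gram_vecs) for k in range(dim)]
    s = sum(h[k] * ctx_vec[k] for k in range(dim))
    g = lr * (label - sigmoid(s))
    old_ctx = list(ctx_vec)  # keep the pre-update context for the stroke update
    for k in range(dim):
        ctx_vec[k] += g * h[k]
    for q in gram_vecs:
        for k in range(dim):
            q[k] += g * old_ctx[k]
    return s


# Toy values, made up for illustration.
grams = [[0.1, 0.2], [0.3, -0.1]]
ctx = [0.5, 0.5]
before = sgd_step(grams, ctx, label=1)
after = sgd_step(grams, ctx, label=1)  # similarity measured after the first update
print(before, after)
```

For a positive pair the step raises the similarity, and with `label=0` the same code lowers it, which is how negative samples are pushed away.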

Experimental Evaluation

cw2vec was benchmarked against word2vec (skip‑gram & CBOW), GloVe, CWE, and recent glyph‑based models on public datasets. Results show consistent improvements in word similarity, word analogy, text classification, and named‑entity recognition across multiple embedding dimensions.
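The source does not state how the word‑similarity task is scored. A common choice is Spearman's rank correlation between model similarities and human ratings, which the sketch below computes under that assumption. The scores are made‑up toy numbers, not results from any benchmark, and the simple rank formula assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy data for four word pairs (illustrative only).
human = [9.1, 7.4, 3.2, 1.0]      # hypothetical human similarity ratings
model = [0.83, 0.61, 0.65, 0.12]  # hypothetical cosine similarities
rho = spearman(human, model)
print(rho)
```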

Further experiments on a reduced Chinese Wikipedia corpus (20% of the full data) show that cw2vec outperforms baseline models that do not leverage Chinese‑specific features, confirming the effectiveness of stroke‑level information, especially in low‑resource settings.

Case Studies

For the term “水污染” (water pollution), cw2vec retrieves semantically related words such as “水质” (water quality) by jointly considering stroke patterns and context, whereas other models miss this connection. For “孙悟空” (Sun Wukong), cw2vec correctly associates related characters and works of literature, highlighting its ability to capture fine‑grained semantics.
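Lookups like these are usually nearest‑neighbor queries by cosine similarity over the learned vectors. The sketch below shows the mechanics only: the three‑dimensional vectors are made‑up toy values chosen for the example, not output from cw2vec or any trained model, and `nearest` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def nearest(query, vectors, k=1):
    """Return the k words whose vectors are most cosine-similar to the query word."""
    scored = [(cosine(vectors[query], v), w) for w, v in vectors.items() if w != query]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


# Made-up toy vectors for illustration; not produced by any model.
toy = {
    "水污染": [0.9, 0.1, 0.0],
    "水质": [0.8, 0.2, 0.1],
    "孙悟空": [0.0, 0.1, 0.9],
}
top = nearest("水污染", toy)
print(top)
```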

Conclusion and Impact

cw2vec demonstrates that incorporating n‑gram stroke information substantially enhances Chinese word embeddings, leading to better performance in both academic benchmarks and real‑world Alibaba scenarios such as intelligent customer service, text risk control, and recommendation. The approach is also described as extending to other languages with stroke‑based scripts (Japanese, Korean), and several related patents have been filed.
