How AI-Driven Clustering Boosts Smart Customer Service Knowledge Bases
This article outlines an AI-powered workflow for constructing and enriching a business knowledge base in intelligent customer service, covering preprocessing, intent recognition, deep and shallow semantic feature engineering, bucket-wise hierarchical clustering, and automated summary extraction to improve FAQ coverage and reduce manual workload.
A high‑quality intelligent customer service system requires both accurate user‑question understanding and a comprehensive knowledge base. Leveraging algorithms to build the knowledge base can quickly surface unanswered high‑frequency user issues, assist agents in configuration, and significantly raise resolution rates while reducing manual workload, especially during large promotions or sudden events such as pandemics.
1. Overview
The knowledge base consists of two parts: structured product knowledge and unstructured business knowledge; this article focuses on constructing the business knowledge base.
When the intelligent customer service answers business questions, it primarily uses question-to-question (QQ) matching. Unresolved user queries may stem from (1) missing knowledge in the knowledge base, or (2) existing knowledge that lacks the specific user phrasing.
Knowledge base mining therefore involves (1) adding new standard questions and (2) adding similar questions to existing standards.
Within the overall framework of the intelligent customer service system, this article concentrates on the offline mining component. Automatic extraction of user questions is essentially a clustering problem, involving preprocessing, feature construction, and clustering algorithms, plus matching clusters to existing knowledge and extracting FAQ summaries.
2. Preprocessing
Preprocessing includes (1) text normalization and (2) intent recognition. Reducing noise and merging similar texts improves clustering quality and reduces computational cost.
2.1 Text Normalization
Normalization removes noise and corrects errors; a minimal sketch follows the list. The system performs:
Basic preprocessing: conversion between traditional and simplified Chinese, case folding, half-width/full-width conversion, and punctuation normalization.
Special entity recognition: phone numbers, order IDs, URLs, after-sale numbers, tracking numbers, addresses, and similar entities are replaced with generic placeholders.
Text correction: spelling and character-shape correction using a correction algorithm.
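The Python sketch below illustrates the placeholder substitution and width normalization; the regex patterns and placeholder tokens are illustrative assumptions, not the production rules, and traditional-to-simplified conversion and spell correction are omitted because they need external resources.

```python
import re

# Illustrative placeholder patterns; the production system recognizes phone
# numbers, order IDs, URLs, after-sale numbers, tracking numbers, and
# addresses with more robust entity recognizers.
PLACEHOLDER_PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"1[3-9]\d{9}"), "[PHONE]"),   # mainland-China mobile format
    (re.compile(r"\d{10,20}"), "[ORDER_ID]"),  # long digit runs as order/tracking IDs
]

def to_halfwidth(text: str) -> str:
    """Map full-width characters (U+FF01..U+FF5E) to their half-width forms."""
    return "".join(
        chr(ord(ch) - 0xFEE0) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )

def normalize(text: str) -> str:
    # Traditional->simplified conversion and spell correction would plug in
    # here; both are omitted from this sketch.
    text = to_halfwidth(text).lower().strip()
    for pattern, placeholder in PLACEHOLDER_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(normalize("订单１２３４５６７８９０１２哪天到？"))  # -> 订单[ORDER_ID]哪天到?
```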
2.2 Intent Recognition
After normalization, user questions are passed through the existing intent-recognition pipeline. Unresolved queries are filtered out if they belong to intents that cannot be answered via FAQ (e.g., “shopping guide”, “human assistance”). For the remaining intents, the supervised intent models partition queries into smaller per-intent sets, which reduces the sample size of each clustering run and helps surface fine-grained, high-frequency unanswered issues.
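A minimal sketch of this filtering step, assuming the intent model is exposed as a `predict_intent` callable and that intents are plain string labels (both hypothetical; “chitchat” is an assumed extra example):

```python
from collections import defaultdict

# Intents that cannot be answered via FAQ; "shopping_guide" and
# "human_assistance" come from the article, "chitchat" is an assumed example.
NON_FAQ_INTENTS = {"shopping_guide", "human_assistance", "chitchat"}

def group_by_intent(queries, predict_intent):
    """Bucket unresolved queries by predicted intent, dropping non-FAQ intents.

    `predict_intent` stands in for the existing supervised intent model.
    """
    groups = defaultdict(list)
    for query in queries:
        intent = predict_intent(query)
        if intent not in NON_FAQ_INTENTS:
            groups[intent].append(query)
    return groups
```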
3. Feature Construction
Post‑preprocessing, questions under a specific intent are vectorized using two levels of semantic features: deep and shallow.
3.1 Deep Semantic Features
Typical text vectorization uses pretrained word embeddings such as word2vec or GloVe, or contextual language models such as ELMo and BERT. In practice, word2vec-based sentence vectors outperform direct BERT embeddings for this domain, because BERT captures overall sentence semantics but underweights business-specific keywords. Consider:
q1: 申请的退货进度怎么样了? (What is the status of the return I requested?)
q2: 申请的价保进度怎么样了? (What is the status of the price protection I requested?)
Although the two questions differ by only two characters, BERT yields very similar vectors, failing to distinguish the core terms 退货 (return) and 价保 (price protection). Using word2vec, we weight words to emphasize:
Business entity words (e.g., return, logistics).
Key parts of speech (verbs, nouns).
This weighting tilts the sentence vector toward the core terms, producing more reasonable representations.
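A minimal sketch of such a weighted sentence vector, assuming a gensim-style `w2v` keyed-vectors object, jieba-style POS tags, and illustrative weight values (none of these specifics come from the article):

```python
import numpy as np

def sentence_vector(tokens, pos_tags, w2v, entity_lexicon,
                    entity_weight=3.0, key_pos_weight=2.0):
    """Weighted average of word2vec vectors for one question.

    Business-entity terms (e.g., 退货/return, 价保/price protection) and key
    parts of speech (verbs, nouns) receive larger weights, tilting the
    sentence vector toward the core terms. Weight values are assumptions.
    """
    vecs, weights = [], []
    for token, pos in zip(tokens, pos_tags):
        if token not in w2v:           # skip out-of-vocabulary tokens
            continue
        if token in entity_lexicon:
            weight = entity_weight
        elif pos in ("v", "n"):        # jieba-style tags for verbs and nouns
            weight = key_pos_weight
        else:
            weight = 1.0
        vecs.append(w2v[token])
        weights.append(weight)
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(np.array(vecs), axis=0, weights=weights)
```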
3.2 Shallow Semantic Features
Shallow features capture fine‑grained lexical differences that deep models miss. For example:
q1: 如何申请退货? (How do I request a return?)
q2: 如何申请换货? (How do I request an exchange?)
These questions differ by a single character yet require completely different solutions. Shallow features are built by tokenizing, removing stop words, constructing a 1-4-gram bag-of-words encoding, and applying PCA for dimensionality reduction.
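One plausible rendering of this shallow pipeline, using jieba for tokenization and scikit-learn for the n-gram encoding; the stop-word list and component count are placeholders, and TruncatedSVD stands in for PCA because it handles sparse bag-of-words matrices directly:

```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

STOP_WORDS = {"的", "了", "吗", "呢"}  # illustrative stop-word list

def tokenize(text):
    # Tokenize with jieba, then drop stop words.
    return [tok for tok in jieba.lcut(text) if tok not in STOP_WORDS]

# 1-4-gram bag-of-words built over the custom token stream.
vectorizer = CountVectorizer(tokenizer=tokenize, ngram_range=(1, 4),
                             token_pattern=None)

questions = ["如何申请退货?", "如何申请换货?"]
bow = vectorizer.fit_transform(questions)

# PCA-style dimensionality reduction; n_components would be far larger
# on a real corpus.
svd = TruncatedSVD(n_components=2)
shallow_features = svd.fit_transform(bow)
```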
4. Clustering Algorithm
After feature construction, a clustering algorithm groups similar questions. Traditional algorithms such as k-means and DBSCAN depend on prior knowledge (the number of clusters for k-means, density parameters for DBSCAN) and performed poorly here. Hierarchical clustering avoids such priors and yields better global structure, but its O(N³) complexity makes it infeasible for hundreds of thousands of sentences.
To scale, a bucket-wise hierarchical clustering method was devised that limits the number of samples in each hierarchical run; a sketch follows the steps below.
Set bucket count: for N samples and a bucket capacity M, use at least K = N/M buckets (e.g., 150k samples with M = 20k gives K ≥ 8).
Distribute samples: randomly assign samples to buckets, or pre-cluster with k-means to improve within-bucket relevance, adjusting bucket counts to avoid imbalance.
Hierarchical clustering within each bucket: perform standard hierarchical clustering, raising the merge threshold each round to prevent divergence.
Merge samples inside clusters: after clustering, combine the samples in each cluster into a new representative vector by taking a weighted average of the ten most frequent samples.
Stopping criteria: stop when (a) only one bucket remains, (b) the sample count no longer decreases, or (c) the bucket count stabilizes with negligible merges.
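A sketch of the loop under simplifying assumptions: k-means assigns the buckets, scikit-learn's AgglomerativeClustering performs the per-bucket step, each cluster is replaced by the plain mean of its members rather than the article's weighted top-10 average, and stopping criterion (c) is omitted. All parameter values are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def bucket_hierarchical_clustering(vectors, bucket_size=20_000,
                                   start_threshold=0.5, threshold_step=0.1,
                                   max_rounds=10):
    """Bucket-wise hierarchical clustering over sentence vectors.

    Returns representative vectors, one per final cluster. The merge rule
    (plain mean) and all parameter values are simplifications.
    """
    reps = np.asarray(vectors, dtype=float)
    threshold = start_threshold
    for _ in range(max_rounds):
        n = len(reps)
        k = max(1, int(np.ceil(n / bucket_size)))       # bucket count K = N / M
        bucket_ids = (KMeans(n_clusters=k, n_init=10).fit_predict(reps)
                      if k > 1 else np.zeros(n, dtype=int))
        merged = []
        for b in range(k):
            members = reps[bucket_ids == b]
            if len(members) < 2:
                merged.extend(members)
                continue
            # Standard hierarchical clustering inside the bucket.
            labels = AgglomerativeClustering(
                n_clusters=None, distance_threshold=threshold,
                linkage="average").fit_predict(members)
            for c in np.unique(labels):
                # Replace each cluster by a representative vector.
                merged.append(members[labels == c].mean(axis=0))
        merged = np.array(merged)
        if k == 1 or len(merged) >= n:   # stopping criteria (a) and (b)
            return merged
        reps = merged
        threshold += threshold_step       # raise the merge threshold each round
    return reps
```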
This bucket‑wise approach maintains clustering quality while handling large corpora within acceptable time.
5. Summary Extraction
Clustered questions are compared with existing FAQs using vector similarity; if the similarity exceeds a threshold, the cluster is treated as a set of variant phrasings of an existing FAQ and queued for human review.
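A minimal sketch of the matching step, using cosine similarity and an assumed threshold of 0.85 (the article does not give the actual value):

```python
import numpy as np

def match_to_faq(cluster_vec, faq_vecs, faq_ids, threshold=0.85):
    """Return the id of the best-matching FAQ if cosine similarity clears
    the threshold, otherwise None (meaning a new standard question)."""
    faq_vecs = np.asarray(faq_vecs, dtype=float)
    sims = faq_vecs @ cluster_vec / (
        np.linalg.norm(faq_vecs, axis=1) * np.linalg.norm(cluster_vec) + 1e-12)
    best = int(np.argmax(sims))
    return faq_ids[best] if sims[best] >= threshold else None
```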
When no match is found, a new standard question is created. A simple summarization method extracts the core user intent from the cluster using a “jump-word” algorithm, which allows gaps of up to two words between adjacent summary terms. For example:
q1: 物流查询 (logistics query)
q2: 物流怎么查询? (How do I check the logistics?)
q3: 我的物流能不能帮忙查询一下? (Could you help me check my logistics?)
The extracted summary is 物流查询 (logistics query). The algorithm evaluates two metrics:
Mutual information: measures the co-occurrence likelihood of the summary words.
Left/right neighbor entropy: captures the randomness of the words surrounding the summary phrase, indicating a robust boundary.
Formulas:
\mathrm{MI}(a,b) = \log \frac{p(a,b)}{p(a)\,p(b)}
H_{\text{left}} = -\sum_{w} p(w \mid a,b)\, \log p(w \mid a,b)
where w ranges over the words appearing immediately to the left of the candidate phrase (a, b); the right neighbor entropy H_{\text{right}} is defined symmetrically over right-neighbor words.
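The sketch below computes both metrics for a candidate two-word summary over a tokenized corpus. Smoothing and the jump-word gap handling (gaps of up to two words) are omitted, so it approximates the described scoring rather than reimplementing it:

```python
import math
from collections import Counter

def mi_and_neighbor_entropy(corpus_tokens, a, b):
    """MI(a, b) and left/right neighbor entropies of the adjacent bigram (a, b).

    `corpus_tokens` is a list of tokenized questions. Gap handling for the
    jump-word algorithm is omitted in this sketch.
    """
    unigrams, bigrams = Counter(), Counter()
    left_ctx, right_ctx = Counter(), Counter()
    for sent in corpus_tokens:
        unigrams.update(sent)
        for i in range(len(sent) - 1):
            pair = (sent[i], sent[i + 1])
            bigrams[pair] += 1
            if pair == (a, b):
                if i > 0:
                    left_ctx[sent[i - 1]] += 1   # word just left of the phrase
                if i + 2 < len(sent):
                    right_ctx[sent[i + 2]] += 1  # word just right of the phrase
    if bigrams[(a, b)] == 0:
        return float("-inf"), 0.0, 0.0
    p_ab = bigrams[(a, b)] / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    mi = math.log(p_ab / (p_a * p_b))

    def entropy(ctx):
        total = sum(ctx.values())
        return -sum((c / total) * math.log(c / total) for c in ctx.values())

    return mi, entropy(left_ctx), entropy(right_ctx)
```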
6. Conclusion
This article presented an algorithm-assisted solution for building the business knowledge base of the Yanxuan intelligent customer service system, covering preprocessing, deep and shallow semantic feature engineering, bucket-wise hierarchical clustering, and automated summary extraction. In experiments, the pipeline both added similar questions to existing standard questions and created new standard questions across intents such as after-sales, logistics, and invoicing. Remaining challenges include occasional “unclustered” or “mis-clustered” cases, leaving room for further improvements in text representation and clustering methods.
Yanxuan Tech Team