Improving Text Representation and Clustering for Small‑Sample Scenarios in 58.com's Second‑Hand Car Intelligent Customer Service
This article describes how 58.com improved text representation and clustering under small‑sample conditions in its second‑hand car intelligent customer service, by introducing a Bi‑LSTM based pre‑training language model and an improved Deep Embedded Clustering (DEC) algorithm. Experiments show significant gains in classification accuracy, silhouette score, and online answer rate.
Background 58.com’s intelligent customer service system ("BangBang") provides automated Q&A, human online chat, and smart assistance across the company's business lines. In the second‑hand car domain, the system suffers from weak text representation and low clustering purity because only a small amount of labeled data is available.
Problem Statement The key issues are (1) insufficient representation of diverse user queries in a small‑sample scenario, leading to poor model generalization, and (2) difficulty discovering new user questions to improve coverage.
Proposed Solutions Two algorithms are explored: (1) a Bi‑LSTM based pre‑training language model that adapts BERT’s masked‑LM task to the vertical domain and replaces the Transformer encoder with a Bi‑LSTM for lower computational cost; (2) the Deep Embedded Clustering (DEC) algorithm, which jointly learns feature representations and cluster assignments.
Bi‑LSTM Pre‑training Model The model is trained on 40 million unlabeled second‑hand car sentences using only the masked‑LM task, adds residual connections and layer normalization between Bi‑LSTM layers, and is trained on a single NVIDIA TESLA P40 GPU for 30 k iterations (~28 h). Experiments on a 26 k‑sample classification task show accuracy improving from 0.8107 to 0.8662, outperforming a BERT‑based baseline (0.8487).
| Model | Acc (with pre-training) | Acc (no pre-training) |
|---|---|---|
| Bi‑LSTM | 0.8662 | 0.8107 |
| BERT | 0.8487 (5 epochs) / 0.8530 (10 epochs) | 0.7884 (5 epochs) / 0.8342 (10 epochs) |
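The masked‑LM pre‑training task borrowed from BERT can be sketched in plain Python. This is an illustrative masking routine, not the article's code: the 15 % mask rate and the 80/10/10 replacement split are BERT defaults assumed here, and the toy vocabulary is hypothetical.

```python
import random

MASK = "[MASK]"
# Hypothetical toy vocabulary for random-token replacement.
VOCAB = ["car", "price", "fee", "loan", "mileage", "deal"]

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, and 10% are
    left unchanged (the model must still predict the original).
    Returns (masked_tokens, labels), where labels[i] holds the
    original token at selected positions and None elsewhere."""
    rng = rng or random.Random(0)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                out[i] = MASK            # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)  # 10%: random token
            # else 10%: keep the original token unchanged
    return out, labels

masked, labels = mask_tokens(["the", "car", "price", "includes", "other", "fees"])
```

During pre‑training, only positions with a non‑None label contribute to the loss, which is what lets the encoder learn from unlabeled domain text.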
DEC Algorithm Description DEC consists of two stages: (1) pre‑training an auto‑encoder to obtain initial features, and (2) fine‑tuning the encoder together with cluster centroids using a KL‑divergence loss between a soft assignment distribution q and a target distribution p. The original K‑means initialization is replaced with custom centroids derived from the average vectors of existing standard questions, which reduces randomness.
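The two DEC distributions described above can be written compactly with NumPy. This is a minimal sketch of the standard DEC formulas (Student's t soft assignment q, sharpened target p, and the KL objective), not code from the article:

```python
import numpy as np

def soft_assign(z, centroids):
    """DEC soft assignment (Student's t kernel, alpha = 1):
    q[i, j] ∝ 1 / (1 + ||z_i - mu_j||^2); each row sums to 1."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target: p[i, j] ∝ q[i, j]^2 / f_j, with the
    cluster frequency f_j = sum_i q[i, j]; rows sum to 1."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q):
    """KL(P || Q), the objective minimized while fine-tuning
    the encoder and centroids jointly."""
    return float((p * np.log(p / q)).sum())
```

Fine‑tuning alternates between recomputing p from the current q and taking gradient steps on the KL loss, which pulls points toward high‑confidence clusters.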
Experimental Comparison Three experiments were conducted on a small labeled dataset: (1) K‑means + Word2Vec static features, (2) K‑means + Bi‑LSTM static features, (3) DEC + Bi‑LSTM pre‑trained features. Results (Table 2) show that DEC with Bi‑LSTM achieves the highest accuracy (0.8437) and silhouette score (0.142), albeit with the longest runtime (30 min).
| Method | Accuracy | Silhouette | Runtime |
|---|---|---|---|
| K‑means + Word2Vec | 0.354 | 0.047 | <5 min |
| K‑means + Bi‑LSTM | 0.377 | 0.025 | <5 min |
| DEC + Bi‑LSTM | 0.8437 | 0.142 | 30 min |
Impact on Online System Applying the improved DEC to the online Q&A robot uncovered new standard questions (e.g., “What do ‘other fees’ include?”) and increased the weekly average answer rate from 79.71 % to 83.62 %.
Iterative Improvement of Question Expansion After deployment, analysis of bad cases revealed many variant utterances of existing standard questions. By initializing DEC centroids with the averaged vectors of all known expansion queries, the system’s precision rose from 98.11 % to 98.24 % and its recall from 89.66 % to 92.27 %.
| Metric | Before Iteration | After Iteration |
|---|---|---|
| Precision | 98.11 % | 98.24 % |
| Recall | 89.66 % | 92.27 % |
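The centroid customization used in this iteration can be sketched as follows, assuming each standard question's expansion queries have already been encoded into vectors (the function name and input layout are illustrative, not from the article):

```python
import numpy as np

def init_centroids(expansion_vecs_by_question):
    """Replace DEC's random K-means initialization with one centroid
    per known standard question: the mean of the encoded vectors of
    all its expansion queries. Input maps question id -> (n_i, d)
    array of query embeddings; returns a (k, d) centroid matrix."""
    return np.stack(
        [vecs.mean(axis=0) for vecs in expansion_vecs_by_question.values()]
    )

# Hypothetical example: two standard questions with toy 2-d embeddings.
centroids = init_centroids({
    "other_fees": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "loan_terms": np.array([[2.0, 2.0]]),
})
```

Seeding DEC with these averaged centroids anchors each cluster to a known question, which is what reduces the randomness of the original K‑means initialization.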
Conclusion and Future Work The study demonstrates that (1) a domain‑specific Bi‑LSTM pre‑training model markedly improves text representation for small‑sample NLP tasks, and (2) an enhanced DEC algorithm with custom centroids boosts clustering purity and downstream Q&A performance. Future directions include leveraging transfer learning between online/offline data, designing more suitable unsupervised objectives, and incorporating self‑supervision.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.