User Identity Recognition on Internet Platforms: Solving Cold‑Start with Keyword Matching, XGBoost, TextCNN, and an Improved Wide & Deep Model
This article presents a comprehensive study on C‑side user identity recognition for internet platforms, addressing cold‑start and sample‑scarcity challenges by comparing keyword matching, XGBoost, TextCNN, a fusion model, and an improved Wide & Deep architecture. The improved Wide & Deep model achieves the highest F1 score, 80.67%.
On internet platforms, distinguishing between C‑side and B‑side customers is essential, and accurately identifying C‑side user identities is crucial for reducing harassment and improving user experience. This paper details methods for tackling the cold‑start problem with limited samples and compares the performance of keyword matching, XGBoost, TextCNN, a fusion model, and an improved Wide & Deep model.
Background : Malicious users and black‑market activities can be mitigated by recognizing true C‑side identities using natural language processing (NLP) techniques.
Cold‑Start and Sample Collection : Two cold‑start issues are highlighted—insufficient labeled data and class imbalance (e.g., positive samples may constitute only 1% of data). A three‑stage workflow is proposed: (1) define labels with clear, incremental difficulty; (2) algorithmic recognition using existing labeled data; (3) continuous iteration by feeding corrected results back to the model.
Model Iteration : The study evaluates five models on the same test set.
1. Keyword Matching : Early‑stage solution in which candidate keywords are extracted with TF‑IDF and then manually curated. Optimizations include combining weak keywords into strong rules and applying a hit‑count threshold. The optimized keyword model reaches 98.75% precision, 37.44% recall, and an F1 score of 54.3%.
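The two optimizations can be sketched as follows; the keyword lists and threshold below are illustrative assumptions, since the article does not publish its curated rules:

```python
# Strong rules: every weak keyword in a group must co-occur for the rule to fire.
STRONG_RULES = [{"rent", "myself"}, {"looking", "apartment"}]
# Weak keywords are counted individually against a hit-count threshold.
WEAK_KEYWORDS = {"budget", "move in", "roommate"}
HIT_THRESHOLD = 2  # assumed value; tuned on validation data in practice

def is_c_side(dialogue: str) -> bool:
    """Flag a dialogue as C-side if a strong rule fires or enough weak keywords hit."""
    text = dialogue.lower()
    if any(all(kw in text for kw in rule) for rule in STRONG_RULES):
        return True
    hits = sum(1 for kw in WEAK_KEYWORDS if kw in text)
    return hits >= HIT_THRESHOLD

print(is_c_side("I want to rent a place for myself"))       # strong rule fires
print(is_c_side("My budget is 2000, when can I move in?"))  # two weak hits
print(is_c_side("Hello, is this available?"))               # no hits
```

Grouping weak keywords into conjunctive strong rules is what lifts precision: each weak keyword alone is ambiguous, but their co-occurrence is a much stronger C‑side signal.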
2. XGBoost : Utilizes word‑frequency features from concatenated dialogue text. Achieves 96.77% precision, 42.65% recall, and an F1 score of 59.21%, offering greater stability than keyword matching.
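A minimal sketch of the word‑frequency featurization, assuming a fixed vocabulary and illustrative dialogues (the article uses the platform's own concatenated customer‑service conversations):

```python
import re
from collections import Counter

# Illustrative vocabulary; in practice this would be built from the corpus.
VOCAB = ["rent", "price", "post", "refresh", "promote"]

def word_freq_vector(dialogue_turns: list) -> list:
    """Concatenate all turns of a dialogue and count vocabulary occurrences."""
    text = " ".join(dialogue_turns).lower()
    counts = Counter(re.findall(r"[a-z]+", text))  # strip punctuation/digits
    return [counts[w] for w in VOCAB]

features = word_freq_vector(["How do I promote my post?", "Refresh the post daily."])
print(features)  # [0, 0, 2, 1, 1]
# These fixed-length vectors are what a gradient-boosted tree model such as
# xgboost.XGBClassifier would be trained on.
```

Tree models tolerate such sparse count features well, which is one reason this stage is more stable than raw keyword rules.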
3. TextCNN : Deep‑learning model that captures semantic and sentence‑structure information. Records 99.02% precision, 48.34% recall, and an F1 score of 64.97%.
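A minimal PyTorch sketch of a TextCNN classifier; all hyperparameters (vocabulary size, embedding dimension, kernel sizes, filter count) are assumptions, as the article does not list its architecture details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed)
        x = x.transpose(1, 2)                  # (batch, embed, seq_len)
        # Convolve with each kernel size, max-pool over time, concatenate.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(pooled, dim=1))
        return torch.sigmoid(logits).squeeze(1)  # P(C-side) per dialogue

model = TextCNN()
probs = model(torch.randint(0, 5000, (8, 50)))  # batch of 8 dialogues, 50 tokens
print(probs.shape)  # torch.Size([8])
```

The multiple kernel sizes act like n‑gram detectors of different widths, which is how the model picks up the sentence‑structure signals that word‑frequency features miss.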
4. Fusion Model (Keyword + XGBoost + TextCNN) : Combines predictions of the three models, using averaged scores and threshold rules to improve recall while maintaining high precision. Results: 100% precision, 54.03% recall, F1 score 70.15%.
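The score fusion can be sketched as below; the two thresholds are illustrative, since in practice they are tuned to keep precision high while lifting recall:

```python
def fuse(keyword_hit: bool, xgb_score: float, cnn_score: float,
         avg_threshold: float = 0.5, strong_threshold: float = 0.9) -> bool:
    """Average the three model scores, with an override when a single
    learned model is highly confident."""
    if max(xgb_score, cnn_score) >= strong_threshold:
        return True  # threshold rule: trust a very confident model
    scores = [1.0 if keyword_hit else 0.0, xgb_score, cnn_score]
    return sum(scores) / len(scores) >= avg_threshold

print(fuse(True, 0.4, 0.55))   # average 0.65 -> positive
print(fuse(False, 0.95, 0.2))  # strong single-model score -> positive
print(fuse(False, 0.3, 0.3))   # -> negative
```

Averaging smooths out individual models' blind spots, while the override rule recovers positives that only one model detects, which is how recall improves without sacrificing precision.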
5. Improved Wide & Deep : Replaces the deep component of the original Wide & Deep model with TextCNN, keeping the wide part as a linear model on word‑frequency features. This joint learning approach yields 98.63% precision, 68.25% recall, and the highest F1 score of 80.67%.
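A PyTorch sketch of the improved architecture, with the wide part a linear layer over word‑frequency features and the deep part a TextCNN over token ids; all sizes are assumptions, as the article does not publish its configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WideDeepTextCNN(nn.Module):
    def __init__(self, wide_dim=5000, vocab_size=5000, embed_dim=128,
                 kernel_sizes=(2, 3, 4), num_filters=64):
        super().__init__()
        self.wide = nn.Linear(wide_dim, 1)  # linear model on word frequencies
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.deep_fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, word_freq, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        deep_logit = self.deep_fc(torch.cat(pooled, dim=1))
        # Joint training: wide and deep logits are summed before the sigmoid,
        # so both parts are optimized together end to end.
        return torch.sigmoid(self.wide(word_freq) + deep_logit).squeeze(1)

model = WideDeepTextCNN()
probs = model(torch.rand(4, 5000), torch.randint(0, 5000, (4, 60)))
print(probs.shape)  # torch.Size([4])
```

Unlike the fusion model, which combines independently trained predictors after the fact, summing the logits lets the wide memorization features and the deep semantic features compensate for each other during training itself.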
Results : All five models achieve >96% precision. Ranking by F1 score: Improved Wide & Deep > Fusion > TextCNN > XGBoost > Keyword Matching. The improved Wide & Deep model is the best overall.
Conclusion and Outlook : The paper outlines the cold‑start and sample‑collection workflow, details the iterative development of models, and recommends using keyword matching for rapid early deployment, while transitioning to TextCNN and the improved Wide & Deep models as data volume grows for superior performance.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.