How Tmall’s “Most Concerned” Feature Uses AI to Match Reviews with Consumer Questions
The article explains how Tmall’s new “Most Concerned” module leverages NLP techniques, fastText embeddings, Bi‑LSTM classifiers, and a custom clustering algorithm to filter, group, and link consumer questions with relevant product reviews, improving the shopping experience across many product categories.
Overview
Tmall’s mobile client recently launched a “Most Concerned” feature that, when users search for a product category (e.g., refrigerators), displays a module listing frequently asked questions such as “Is it noisy?” or “Does it consume a lot of power?”. Clicking a question shows detailed information and related product reviews.
Problem Statement
To build this module, several challenges must be solved:
Question Selection: Keep only generic questions applicable to a product category and discard item‑specific or vague queries.
Duplicate Question Merging: Consolidate semantically identical questions (e.g., “Is it noisy?” vs. “Is the noise loud?”) into a single representative.
Question‑Comment Association: Map each review to the questions it can answer, recognizing that a single review may address multiple questions or none at all.
Data Sources
tbods.s_macross_feed – contains all user‑submitted questions and answers from the “Ask Everyone” module.
search_kg.s_kg_all_comment_for_ha3 – stores all product comments.
Additional tables include tbcdm.dim_tb_itm (product catalog), search_ats.ali_seller_matrix_open_d (seller scores), and a category‑keyword dictionary.
Preprocessing
Noisy characters and punctuation are removed from questions; empty, invalid, or default comments are filtered out. Low‑frequency questions, low‑sales items, and low‑rating sellers are also excluded to improve data quality.
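The question-side steps can be sketched as follows. This is a minimal illustration, not the production pipeline: the character whitelist, the `min_count` threshold, and the function names are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Assumption: keep CJK characters, ASCII letters, and digits; drop everything else.
_NOISE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]+")


def clean_question(text):
    """Strip punctuation, whitespace, and other noisy characters."""
    return _NOISE.sub("", text).lower()


def filter_frequent(questions, min_count=2):
    """Clean questions, drop empty ones, and keep those asked at least min_count times."""
    counts = Counter(q for q in map(clean_question, questions) if q)
    return {q: n for q, n in counts.items() if n >= min_count}


raw = ["噪音大吗？", "噪音大吗", "   ", "耗电吗?"]
frequent = filter_frequent(raw)  # only the question seen twice survives
```

Counting after cleaning means that variants differing only in punctuation are treated as the same question before any model is involved.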
Algorithm
Word Embedding
Pretrained Chinese fastText word vectors, trained on Wikipedia, are used as the embeddings 【1】.
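As a rough sketch of how such vectors can be consumed, the snippet below parses fastText's text (`.vec`) format, which starts with a "count dimension" header followed by one "word v1 v2 ..." line per token. The helper names and the zero-vector fallback for unknown tokens are assumptions for illustration; the article does not say how out-of-vocabulary words were handled.

```python
import io


def load_vec(handle, limit=None):
    """Parse .vec text: a 'count dim' header, then 'word v1 v2 ...' lines."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def embed(tokens, vectors, dim):
    """Map tokens to vectors; unknown tokens become zero vectors (an assumption)."""
    zero = [0.0] * dim
    return [vectors.get(token, zero) for token in tokens]


# Tiny in-memory stand-in for a real .vec file.
sample = io.StringIO("2 3\n噪音 0.1 0.2 0.3\n大 1 0 0\n")
vecs = load_vec(sample)
sequence = embed(["大", "吗"], vecs, 3)
```

With a real download, `handle` would be the opened `.vec` file, and `limit` can cap memory use by loading only the first entries.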
Question Filtering
A Bi‑LSTM encoder extracts a sentence representation, which passes through dropout and an MLP that predicts whether the question should be filtered out. The model was trained on more than 5,000 manually labeled questions and reaches over 95% accuracy on a held‑out test set.
Question Clustering
A symmetric Bi‑LSTM‑based classifier determines whether two questions share the same meaning. The model applies Luong attention 【3】 followed by a second Bi‑LSTM layer, then outputs the probability that the pair is a duplicate. More than 10,000 question pairs were manually labeled for training and evaluation.
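To make the attention step concrete, here is a small pure-Python sketch of dot-product attention in the style of Luong et al. The article does not say which score function was used, so the dot variant is an assumption, as are the function and variable names. In the classifier, `query` would be a hidden state from one question and `keys` the hidden states of the other.

```python
import math


def luong_dot_attention(query, keys):
    """Return (weights, context): softmax of query . key_i, then the weighted sum of keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * key[i] for w, key in zip(weights, keys))
        for i in range(len(query))
    ]
    return weights, context


weights, context = luong_dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The key that aligns best with the query receives the largest weight, so the context vector is pulled toward the most relevant position in the other question.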
A custom clustering algorithm then processes questions in descending order of frequency. Each question joins an existing cluster only if the classifier judges it a duplicate of every member of that cluster; otherwise it starts a new cluster.
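The clustering procedure can be sketched as a greedy loop. In this illustration `is_duplicate` stands in for the trained pairwise classifier, and joining the first qualifying cluster is an assumption, since the article does not specify how ties between clusters are resolved.

```python
def cluster_questions(question_counts, is_duplicate):
    """Greedy clustering over questions sorted from most to least frequent.

    A question joins the first cluster (assumption) in which it is a
    duplicate of every member; otherwise it opens a new cluster.
    """
    clusters = []
    ordered = sorted(question_counts, key=question_counts.get, reverse=True)
    for question in ordered:
        for cluster in clusters:
            if all(is_duplicate(question, member) for member in cluster):
                cluster.append(question)
                break
        else:
            clusters.append([question])  # most frequent member comes first
    return clusters


# Toy stand-in for the classifier: two questions match if both mention noise.
def toy_is_duplicate(a, b):
    return ("nois" in a) == ("nois" in b)


counts = {"is it noisy": 10, "is the noise loud": 4, "does it use much power": 7}
clusters = cluster_questions(counts, toy_is_duplicate)
```

Because questions are visited by frequency, the first element of each cluster is its most common phrasing, which makes it a natural representative. Requiring agreement with all members keeps clusters tight, since pairwise duplicate judgments are not guaranteed to be transitive.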
Question‑Comment Association
Because comments are longer than typical “Ask Everyone” answers, a rule‑based keyword matching approach is used to retrieve comments that answer a given question, favoring higher precision over recall.
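A minimal sketch of keyword-based association is shown below. The actual rules are not described in the article, so the keyword lists, the substring test, and the function name are hypothetical. Keywords are assumed to be lowercase.

```python
def match_reviews(reviews, question_keywords):
    """Link each review to every question whose keyword list it hits.

    A review may be attached to several questions or to none.
    """
    matches = {question: [] for question in question_keywords}
    for review in reviews:
        text = review.lower()
        for question, keywords in question_keywords.items():
            if any(keyword in text for keyword in keywords):
                matches[question].append(review)
    return matches


rules = {
    "Is it noisy?": ["noise", "quiet"],
    "Does it consume a lot of power?": ["power", "electricity"],
}
reviews = ["Very quiet and saves electricity.", "Fast delivery.", "Low noise."]
linked = match_reviews(reviews, rules)
```

Narrow keyword lists miss reviews that answer a question in other words, which is the recall cost accepted in exchange for precision.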
Future Work
Deploy the module on the main Taobao app to reach more users.
Generate question‑relevant comments automatically to expand coverage.
Adopt advanced reading‑comprehension models such as BERT to improve accuracy.
References
【1】 https://fasttext.cc/docs/en/pretrained-vectors.html
【2】 https://www.kaggle.com/c/quora-question-pairs
【3】 https://arxiv.org/abs/1508.04025
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
