
Sentiment Classification of iQIYI User Comments: Model Selection, Feature Engineering, and Online Deployment

The team built a lightweight three‑class sentiment classifier for iQIYI user comments using a linear‑kernel SVM over high‑dimensional bag‑of‑words features and an expanded ~100k‑word lexicon, achieving over 96% accuracy across domains. The model is deployed as a Spring Boot PMML service with zero‑downtime refresh, with GBDT‑enhanced features and word‑embedding optimizations planned next.

iQIYI Technical Product Team

Authors: Dixon (iQIYI Open Platform; background in master data management, NLP, graph computing, machine learning, and big data) and Samuel (iQIYI Open Platform backend engineer responsible for workflow systems and data statistics, exploring ML and big‑data applications).

Problem definition: With iQIYI consolidating its various creator channels under a single "iQIYI Account", user comments from video, feed, and article sources have become a key indicator of content popularity. Classifying the sentiment of these comments (positive, negative, or neutral) is essential for the iQIYI Index and for providing feedback to creators.

Data characteristics: Comments are short texts (≤100 characters) that often contain emojis, slang, and trending internet terms. They can be strongly opinionated or neutral ("water" comments, i.e. low‑information filler). Volume spikes during events or celebrity activities.

Task formulation: Define sentiment classification as a three‑class problem (positive, negative, neutral) and require a lightweight, easily extensible service capable of rapid inference on short texts.

Model selection: After evaluating classic classifiers (Naive Bayes, Logistic Regression, GBDT) and deep‑learning approaches, the team chose a linear‑kernel Support Vector Machine (SVM) because of its robustness to high‑dimensional bag‑of‑words features and fast training/inference.
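As a minimal sketch of this setup, here is a linear‑kernel SVM trained on bag‑of‑words vectors with scikit‑learn. The toy comments and labels are illustrative stand‑ins, not iQIYI data:

```python
# Linear-kernel SVM over sparse bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

comments = ["great movie loved it", "boring plot waste of time",
            "watched it yesterday", "amazing acting so good",
            "terrible ending very bad", "it aired last week"]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

vec = CountVectorizer()          # produces high-dimensional sparse vectors
X = vec.fit_transform(comments)
clf = LinearSVC(C=1.0)           # linear kernel; fast to train and to score
clf.fit(X, labels)

print(clf.predict(vec.transform(["so good loved it"])))
```

Linear SVMs scale well here because inference reduces to a single sparse dot product per class, which keeps per‑comment latency low.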

Feature engineering:
• Use a bag‑of‑words model to vectorize text, producing high‑dimensional sparse vectors.
• Build an initial sentiment lexicon by counting word frequencies in iQIYI comments, filtering stop words, and manually selecting sentiment‑related terms (a few thousand words).
• Expand the lexicon with public sentiment dictionaries, yielding a ~100k‑word dictionary (the "expanded dictionary").
• Apply a scikit‑learn Pipeline with SelectKBest to rank feature dimensions; the top 20,000 dimensions (≈20k words) were kept, achieving >96% accuracy on the test set.
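The steps above can be sketched as a single scikit‑learn Pipeline. This is a hedged illustration: the scorer (chi2) and the tiny k are assumptions to fit the toy corpus, whereas the production system kept the top 20,000 dimensions:

```python
# Bag-of-words -> feature selection -> linear SVM, as one Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

comments = ["great movie loved it", "boring plot waste of time",
            "watched it yesterday", "amazing acting so good",
            "terrible ending very bad", "it aired last week"]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

pipe = Pipeline([
    ("bow", CountVectorizer()),            # sparse word-count vectors
    ("select", SelectKBest(chi2, k=10)),   # k=20000 in the real system
    ("svm", LinearSVC()),
])
pipe.fit(comments, labels)
```

Keeping vectorizer, selector, and classifier in one Pipeline ensures the exact same transforms are applied at training and inference time, which also makes the later PMML export straightforward.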

Experimental results: Using the initial 1k–5k‑word dictionary, the SVM achieved ~97% accuracy on an internal iQIYI comment set but performed poorly on a cross‑domain Weibo set. After expanding the training data (adding diverse Weibo comments) and the dictionary (to ~100k words, then selecting the top 20k features), the model maintained >96% accuracy on both domains.

Online deployment architecture:
• Implemented as a Spring Boot service (Java ecosystem) to leverage existing team expertise.
• The service exposes both synchronous APIs and asynchronous MQ interfaces.
• Model files are exported from scikit‑learn to PMML via JPMML and stored in Redis with metadata (version, timestamp).
• At startup, the service loads the PMML model with the JPMML‑Evaluator library and registers it as a Spring bean.
• Model updates are performed without downtime using Spring Cloud's @RefreshScope and Spring Cloud Bus: a new model version is uploaded to Redis, a refresh message is sent through MQ, and the bean is re‑instantiated.

Model refresh mechanism: The PMML‑based model bean is annotated with @RefreshScope. Upon receiving a refresh signal, the bean reloads the latest PMML file from Redis, allowing instantaneous rollout of new models and sentiment dictionaries across multiple instances.
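Java specifics aside, the refresh pattern is language‑agnostic. The sketch below illustrates it in Python for brevity (the real service is Spring Boot + JPMML): a versioned store, played here by a plain dict standing in for Redis, holds model payloads plus metadata, and a refresh signal triggers an atomic swap to the newest version, mimicking @RefreshScope re‑instantiation:

```python
# Versioned model store + atomic hot swap (Redis/Spring stand-in).
import threading

store = {}  # stand-in for Redis: version -> (metadata, model_payload)

class ModelHolder:
    """Holds the live model; refresh() mimics @RefreshScope re-instantiation."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = None
        self.model = None

    def refresh(self):
        latest = max(store)              # pick the newest version key
        meta, payload = store[latest]
        with self._lock:                 # swap atomically; in-flight requests
            self.version = latest        # finish on the old model
            self.model = payload

holder = ModelHolder()
store[1] = ({"ts": "v1-timestamp"}, "pmml-model-v1")
holder.refresh()
store[2] = ({"ts": "v2-timestamp"}, "pmml-model-v2")
holder.refresh()                         # triggered by an MQ message in production
print(holder.version, holder.model)      # now serving version 2
```

Because each instance reacts independently to the broadcast refresh message, all replicas converge on the new model without any being taken offline.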

Future work:
• Incorporate GBDT‑generated leaf‑index categorical features into the SVM pipeline (an estimated ≈1% accuracy gain).
• Explore word2vec‑derived semantic dictionaries.
• Reduce inference latency by replacing JPMML with a native C/C++ implementation, or by using sklearn‑porter to transpile the model to native code.
• Continue improving model robustness and scaling to handle millions of comments per day; current online A/B checks show >90% prediction correctness.
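To make the GBDT‑leaf‑feature idea concrete, here is a hedged sketch of the standard recipe: train a small GBDT, record which leaf each sample lands in per tree, one‑hot encode those indices, and append them to the original features before fitting the linear SVM. The synthetic numeric data is a placeholder for the real comment features:

```python
# GBDT leaf indices as extra categorical features for a linear model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(80, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy binary target

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
gbdt.fit(X, y)

# apply() returns the leaf index each sample reaches in every tree.
leaves = gbdt.apply(X).reshape(X.shape[0], -1)
leaf_feats = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)

# Concatenate original features with the one-hot leaf indicators.
X_aug = np.hstack([X, leaf_feats.toarray()])
svm = LinearSVC().fit(X_aug, y)
```

The leaf indicators let the linear SVM exploit the non‑linear splits the trees discovered, which is the mechanism behind the hoped‑for accuracy gain.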

Tags: feature engineering, deployment, sentiment analysis, NLP, machine learning, SVM, text classification