
Sentiment Classification of iQIYI User Comments: Model Selection, Feature Engineering, and Online Deployment

The team built a lightweight three‑class sentiment classifier for iQIYI user comments using a linear‑kernel SVM over high‑dimensional bag‑of‑words features and an expanded ~100k‑word lexicon, achieving over 96% accuracy across domains. The model is deployed as a Spring Boot PMML service with zero‑downtime refresh, with GBDT‑enhanced features and word‑embedding optimizations planned next.

iQIYI Technical Product Team

Authors: Dixon (iQIYI Open Platform; background in master data management, NLP, graph computing, machine learning, and big data) and Samuel (iQIYI Open Platform backend engineer responsible for workflow systems and data statistics, exploring ML and big‑data applications).

Problem definition: With iQIYI consolidating its various creator channels under a single "iQIYI Account", user comments from video, feed, and article sources have become a key indicator of content popularity. Classifying the sentiment of these comments (positive, negative, or neutral) is essential for the iQIYI Index and for providing feedback to creators.

Data characteristics: Comments are short texts (≤100 characters) that often contain emojis, slang, and trending internet terms. They can be strongly opinionated or neutral ("water" comments, i.e. low‑information filler). Volume spikes during events or celebrity activities.

Task formulation: Define sentiment classification as a three‑class problem (positive, negative, neutral) and require a lightweight, easily extensible service capable of rapid inference on short texts.

Model selection: After evaluating classic classifiers (Naive Bayes, Logistic Regression, GBDT) and deep‑learning approaches, the team chose a linear‑kernel Support Vector Machine (SVM) because of its robustness to high‑dimensional bag‑of‑words features and fast training/inference.
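As a minimal sketch of this setup, here is a linear‑kernel SVM trained on bag‑of‑words vectors with scikit‑learn. The toy comments and labels are illustrative stand‑ins, not iQIYI data:

```python
# Linear-kernel SVM over sparse bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

comments = ["great movie loved it", "boring plot waste of time",
            "watched it yesterday", "amazing acting so good",
            "terrible ending very bad", "it aired last week"]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

vec = CountVectorizer()          # produces high-dimensional sparse vectors
X = vec.fit_transform(comments)
clf = LinearSVC(C=1.0)           # linear kernel; fast to train and to score
clf.fit(X, labels)

print(clf.predict(vec.transform(["so good loved it"])))
```

Linear SVMs scale well here because inference reduces to a single sparse dot product per class, which keeps per‑comment latency low.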

Feature engineering:
• Use a bag‑of‑words model to vectorize text, producing high‑dimensional sparse vectors.
• Build an initial sentiment lexicon by counting word frequencies in iQIYI comments, filtering stop words, and manually selecting sentiment‑related terms (a few thousand words).
• Expand the lexicon with public sentiment dictionaries, yielding a ~100k‑word dictionary (the "expanded dictionary").
• Apply a scikit‑learn Pipeline with SelectKBest to rank feature dimensions; the top 20,000 dimensions (≈20k words) were kept, achieving >96% accuracy on the test set.
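The steps above can be sketched as a single scikit‑learn Pipeline. This is a hedged illustration: the scorer (chi2) and the tiny k are assumptions to fit the toy corpus, whereas the production system kept the top 20,000 dimensions:

```python
# Bag-of-words -> feature selection -> linear SVM, as one Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

comments = ["great movie loved it", "boring plot waste of time",
            "watched it yesterday", "amazing acting so good",
            "terrible ending very bad", "it aired last week"]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

pipe = Pipeline([
    ("bow", CountVectorizer()),            # sparse word-count vectors
    ("select", SelectKBest(chi2, k=10)),   # k=20000 in the real system
    ("svm", LinearSVC()),
])
pipe.fit(comments, labels)
```

Keeping vectorizer, selector, and classifier in one Pipeline ensures the exact same transforms are applied at training and inference time, which also makes the later PMML export straightforward.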

Experimental results: Using the initial 1k–5k‑word dictionary, the SVM achieved ~97% accuracy on an internal iQIYI comment set but performed poorly on a cross‑domain Weibo set. After expanding the training data (adding diverse Weibo comments) and the dictionary (to ~100k words, then selecting the top 20k features), the model maintained >96% accuracy on both domains.

Online deployment architecture:
• Implemented as a Spring Boot service (Java ecosystem) to leverage existing team expertise.
• The service exposes both synchronous APIs and asynchronous MQ interfaces.
• Model files are exported from scikit‑learn to PMML via JPMML and stored in Redis with metadata (version, timestamp).
• At startup, the service loads the PMML model with the JPMML‑Evaluator library and registers it as a Spring bean.
• Model updates are performed without downtime using Spring Cloud's @RefreshScope and Spring Cloud Bus: a new model version is uploaded to Redis, a refresh message is sent through MQ, and the bean is re‑instantiated.

Model refresh mechanism: The PMML‑based model bean is annotated with @RefreshScope. Upon receiving a refresh signal, the bean reloads the latest PMML file from Redis, allowing instantaneous rollout of new models and sentiment dictionaries across multiple instances.
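Java specifics aside, the refresh pattern is language‑agnostic. The sketch below illustrates it in Python for brevity (the real service is Spring Boot + JPMML): a versioned store, played here by a plain dict standing in for Redis, holds model payloads plus metadata, and a refresh signal triggers an atomic swap to the newest version, mimicking @RefreshScope re‑instantiation:

```python
# Versioned model store + atomic hot swap (Redis/Spring stand-in).
import threading

store = {}  # stand-in for Redis: version -> (metadata, model_payload)

class ModelHolder:
    """Holds the live model; refresh() mimics @RefreshScope re-instantiation."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = None
        self.model = None

    def refresh(self):
        latest = max(store)              # pick the newest version key
        meta, payload = store[latest]
        with self._lock:                 # swap atomically; in-flight requests
            self.version = latest        # finish on the old model
            self.model = payload

holder = ModelHolder()
store[1] = ({"ts": "v1-timestamp"}, "pmml-model-v1")
holder.refresh()
store[2] = ({"ts": "v2-timestamp"}, "pmml-model-v2")
holder.refresh()                         # triggered by an MQ message in production
print(holder.version, holder.model)      # now serving version 2
```

Because each instance reacts independently to the broadcast refresh message, all replicas converge on the new model without any being taken offline.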

Future work:
• Incorporate GBDT‑generated leaf‑index categorical features into the SVM pipeline (an estimated ≈1% accuracy gain).
• Explore word2vec‑derived semantic dictionaries.
• Reduce inference latency by replacing JPMML with a native C/C++ implementation, or by using sklearn‑porter to transpile the model to native code.
• Continue improving model robustness and scaling to handle millions of comments per day; current online A/B checks show >90% prediction correctness.
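To make the GBDT‑leaf‑feature idea concrete, here is a hedged sketch of the standard recipe: train a small GBDT, record which leaf each sample lands in per tree, one‑hot encode those indices, and append them to the original features before fitting the linear SVM. The synthetic numeric data is a placeholder for the real comment features:

```python
# GBDT leaf indices as extra categorical features for a linear model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(80, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy binary target

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
gbdt.fit(X, y)

# apply() returns the leaf index each sample reaches in every tree.
leaves = gbdt.apply(X).reshape(X.shape[0], -1)
leaf_feats = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)

# Concatenate original features with the one-hot leaf indicators.
X_aug = np.hstack([X, leaf_feats.toarray()])
svm = LinearSVC().fit(X_aug, y)
```

The leaf indicators let the linear SVM exploit the non‑linear splits the trees discovered, which is the mechanism behind the hoped‑for accuracy gain.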

Tags: feature engineering, deployment, sentiment analysis, NLP, machine learning, SVM, text classification