Artificial Intelligence 19 min read

How to Build a Scalable Spark-Based Text Sentiment Analysis System

This article walks through constructing a Spark-powered text sentiment analysis pipeline—from crawling movie reviews, preprocessing and feature extraction with jieba and TF‑IDF, to training Naive Bayes and SVM classifiers—while discussing Spark's advantages and ways to improve model accuracy.

Huawei Cloud Developer Alliance

Jan 16, 2018

How to Build a Scalable Spark-Based Text Sentiment Analysis System

Based on Spark Text Sentiment Analysis

The article describes a cognitive system built on Apache Spark for analyzing unstructured text data from social forums, using movie reviews from Douban as a case study.

Four Technical Stages

Data Collection : Crawl short comments and ratings for the movie "Zootopia" from Douban, creating a dataset of over 100,000 records.

Text Representation Model : Convert each document into a vector space model where words, phrases, or characters become features.

Feature Selection : Apply jieba.cut_for_search for Chinese word segmentation, then extract TF‑IDF features.

Classification Model : Train classifiers such as Naive Bayes and Support Vector Machine (SVM) on the feature vectors.

Why Use Spark

Single‑node processing cannot handle the massive volume of user‑generated data (e.g., 111,421 comments for a single movie). Spark provides in‑memory computation, lazy evaluation, and rich APIs for Scala, Java, Python, and R, enabling linear scalability.

Environment Setup

Python 3.5.0, Spark 1.6.1, and Jupyter Notebook are used. The following command sets the PySpark environment:

export PYSPARK_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook"

System Architecture

Data Preprocessing

Load the raw data into an RDD, remove duplicates, split each line by tabs, and filter rows with exactly two fields (rating and comment). Example code:

originData = sc.textFile('YOUR_FILE_PATH')
originDistinctData = originData.distinct()
rateDocument = originDistinctData.map(lambda line: line.split('\t')).filter(lambda line: len(line) == 2)

Balance positive (rating 5) and negative (rating ≤3) samples, then repartition for efficient computation.

negRateDocument = oneRateDocument.union(twoRateDocument).union(threeRateDocument)
negRateDocument = negRateDocument.repartition(1)
posRateDocument = sc.parallelize(fiveRateDocument.take(negRateDocument.count()))
allRateDocument = negRateDocument.union(posRateDocument).repartition(1)

Feature Extraction

Segment Chinese text using jieba (search engine mode) and transform words into term frequencies with HashingTF:

words = document.map(lambda w: "/".join(jieba.cut_for_search(w))).map(lambda line: line.split("/"))
hashingTF = HashingTF()
tf = hashingTF.transform(words)
tf.cache()

Compute TF‑IDF:

idfModel = IDF().fit(tf)
tfidf = idfModel.transform(tf)

Model Training

Combine labels and features, split into training (60%) and test (40%) sets, and train a Naive Bayes classifier:

zipped = rate.zip(tfidf)
data = zipped.map(lambda line: LabeledPoint(line[0], line[1]))
training, test = data.randomSplit([0.6, 0.4], seed=0)
NBmodel = NaiveBayes.train(training, 1.0)
predictionAndLabel = test.map(lambda p: (NBmodel.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / test.count()

The Naive Bayes model achieves an accuracy of 74.83%.

Replacing Naive Bayes with an SVM improves accuracy to 78.59%:

SVMmodel = SVMWithSGD.train(training, iterations=100)
predictionAndLabel = test.map(lambda p: (SVMmodel.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / test.count()

Improving Accuracy

Four main reasons for low accuracy are coarse preprocessing, limited feature extraction, simple Naive Bayes model, and insufficient data. Suggested improvements include removing stop words, using full‑mode jieba segmentation, and adopting more advanced classifiers such as SVM.

Conclusion

The article demonstrates end‑to‑end construction of a Spark‑based text sentiment classification system, covering data cleaning, feature extraction with HashingTF and TF‑IDF, and model training. It shows how Spark’s distributed computing capabilities enable handling high‑dimensional text data and provides practical tips for boosting model performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Python Sentiment Analysis NLP Spark Text Classification

Written by

Huawei Cloud Developer Alliance

The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.