How to Build a Scalable Spark-Based Text Sentiment Analysis System
This article walks through constructing a Spark-powered text sentiment analysis pipeline—from crawling movie reviews, preprocessing and feature extraction with jieba and TF‑IDF, to training Naive Bayes and SVM classifiers—while discussing Spark's advantages and ways to improve model accuracy.
Spark-Based Text Sentiment Analysis
The article describes a cognitive system built on Apache Spark for analyzing unstructured text data from social forums, using movie reviews from Douban as a case study.
Four Technical Stages
Data Collection: Crawl short comments and ratings for the movie "Zootopia" from Douban, creating a dataset of over 100,000 records.
Text Representation Model: Convert each document into a vector space model where words, phrases, or characters become features.
Feature Selection: Apply jieba.cut_for_search for Chinese word segmentation, then extract TF-IDF features.
Classification Model: Train classifiers such as Naive Bayes and Support Vector Machine (SVM) on the feature vectors.
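To make the text representation and feature selection stages concrete, here is a minimal pure-Python sketch of hashed term frequencies followed by TF-IDF weighting. It is an illustration under stated assumptions, not the article's Spark code: the function names are hypothetical, zlib.crc32 stands in for MLlib's own hash function, and the IDF uses the smoothed form log((m + 1) / (df + 1)) documented for MLlib.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 1 << 20  # same default bucket count as MLlib's HashingTF (2**20)


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map a token list to a sparse {bucket_index: count} term-frequency vector."""
    # Assumption: crc32 is a stand-in hash chosen for reproducibility; MLlib uses its own.
    buckets = (zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return dict(Counter(buckets))


def fit_idf(tf_vectors):
    """Compute a smoothed IDF per bucket: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    # Document frequency: in how many documents each bucket appears at least once.
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its bucket."""
    return {idx: count * idf.get(idx, 0.0) for idx, count in tf_vector.items()}


# Toy corpus of already-segmented documents (hypothetical data).
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "fun"]]
tf_vectors = [hashing_tf(d) for d in docs]
idf = fit_idf(tf_vectors)
weighted = [tf_idf(v, idf) for v in tf_vectors]
```

A token that appears in most documents, such as "movie" here, gets a small IDF weight, so it contributes less to the final vector than a rarer token.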
Why Use Spark
Single-node processing struggles with the volume of user-generated data (for example, 111,421 comments for a single movie). Spark provides in-memory computation, lazy evaluation, and APIs for Scala, Java, Python, and R, so the same pipeline can scale out across a cluster as the data grows.
Environment Setup
Python 3.5.0, Spark 1.6.1, and Jupyter Notebook are used. The following command sets the PySpark environment:
export PYSPARK_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook"
System Architecture
Data Preprocessing
Load the raw data into an RDD, remove duplicates, split each line on tabs, and keep only the rows with exactly two fields (rating and comment). Example code:
originData = sc.textFile('YOUR_FILE_PATH')
originDistinctData = originData.distinct()
rateDocument = originDistinctData.map(lambda line: line.split('\t')).filter(lambda line: len(line) == 2)
Balance positive (rating 5) and negative (rating ≤3) samples, then repartition for efficient computation.
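# oneRateDocument ... fiveRateDocument are assumed to be RDDs filtered from rateDocument by rating (their definitions are not shown here).
negRateDocument = oneRateDocument.union(twoRateDocument).union(threeRateDocument)
negRateDocument = negRateDocument.repartition(1)
# Take as many 5-star comments as there are negative comments so both classes have equal size.
posRateDocument = sc.parallelize(fiveRateDocument.take(negRateDocument.count()))
allRateDocument = negRateDocument.union(posRateDocument).repartition(1)
Feature Extraction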
Segment Chinese text using jieba (search engine mode) and transform words into term frequencies with HashingTF:
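import jieba
from pyspark.mllib.feature import HashingTF, IDF
# document is assumed to be an RDD of comment strings (its definition is not shown here).
words = document.map(lambda w: "/".join(jieba.cut_for_search(w))).map(lambda line: line.split("/"))
hashingTF = HashingTF()
tf = hashingTF.transform(words)
tf.cache()
Compute TF-IDF: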
idfModel = IDF().fit(tf)
tfidf = idfModel.transform(tf)
Model Training
Combine labels and features, split into training (60%) and test (40%) sets, and train a Naive Bayes classifier:
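from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes
# rate is assumed to be an RDD of numeric labels aligned one-to-one with tfidf.
zipped = rate.zip(tfidf)
data = zipped.map(lambda line: LabeledPoint(line[0], line[1]))
training, test = data.randomSplit([0.6, 0.4], seed=0)
NBmodel = NaiveBayes.train(training, 1.0)  # 1.0 is the additive smoothing parameter
predictionAndLabel = test.map(lambda p: (NBmodel.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / test.count()
The Naive Bayes model achieves an accuracy of 74.83%.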
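For intuition about what a Naive Bayes classifier with a smoothing value of 1.0 fits, the following is a minimal pure-Python sketch of multinomial Naive Bayes with additive smoothing, evaluated with the same accuracy ratio used above. It is a textbook formulation under stated assumptions, not MLlib's implementation: train_nb, predict_nb, and the toy data are hypothetical, and features are sparse dictionaries of non-negative weights.

```python
import math
from collections import defaultdict


def train_nb(samples, num_features, smoothing=1.0):
    """Fit multinomial Naive Bayes on (label, {feature_index: value}) pairs.

    smoothing must be positive so that unseen features get a finite log-probability.
    """
    doc_counts = defaultdict(int)
    feat_sums = defaultdict(lambda: defaultdict(float))
    for label, features in samples:
        doc_counts[label] += 1
        for idx, value in features.items():
            feat_sums[label][idx] += value

    n_docs = len(samples)
    n_classes = len(doc_counts)
    model = {}
    for label, count in doc_counts.items():
        # Smoothed class prior.
        log_prior = math.log((count + smoothing) / (n_docs + n_classes * smoothing))
        # Smoothed per-feature likelihoods, normalized over all feature slots.
        denom = sum(feat_sums[label].values()) + num_features * smoothing
        log_likelihood = {
            idx: math.log((v + smoothing) / denom) for idx, v in feat_sums[label].items()
        }
        log_unseen = math.log(smoothing / denom)
        model[label] = (log_prior, log_likelihood, log_unseen)
    return model


def predict_nb(model, features):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood, log_unseen = model[label]
        return log_prior + sum(
            value * log_likelihood.get(idx, log_unseen) for idx, value in features.items()
        )
    return max(model, key=score)


# Toy data (hypothetical): label 1.0 is positive, 0.0 is negative.
train_set = [
    (1.0, {0: 2.0, 1: 1.0}),
    (1.0, {0: 1.0, 2: 1.0}),
    (0.0, {3: 2.0, 1: 1.0}),
    (0.0, {3: 1.0, 2: 1.0}),
]
held_out = [(1.0, {0: 1.0}), (0.0, {3: 1.0})]

model = train_nb(train_set, num_features=4)
prediction_and_label = [(predict_nb(model, f), label) for label, f in held_out]
accuracy = sum(1 for p, l in prediction_and_label if p == l) / len(held_out)
```

The smoothing term keeps a feature that never occurred with a class from forcing that class's probability to zero, which matters for sparse, high-dimensional text vectors.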
Replacing Naive Bayes with an SVM improves accuracy to 78.59%:
from pyspark.mllib.classification import SVMWithSGD
SVMmodel = SVMWithSGD.train(training, iterations=100)
predictionAndLabel = test.map(lambda p: (SVMmodel.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / test.count()
Improving Accuracy
The article attributes the low accuracy to four main causes: coarse preprocessing, limited feature extraction, a simple Naive Bayes model, and insufficient data. Suggested improvements include removing stop words, using full-mode jieba segmentation, and adopting more advanced classifiers such as SVM.
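The stop-word suggestion can be sketched as a small filter applied to each segmented comment before feature hashing. This is a minimal illustration, assuming the comment has already been segmented into a token list; STOP_WORDS is a tiny hypothetical set, and a real pipeline would load a full stop-word list from a file.

```python
# Hypothetical stop-word set for illustration only.
STOP_WORDS = {"的", "了", "是", "我", "也", "，", "。", "！"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words, empty strings, and whitespace-only tokens from a token list."""
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]


# A segmented comment (hypothetical example data).
segmented = ["这部", "电影", "的", "画面", "是", "很", "精彩", "，", " ", "我", "也", "喜欢"]
cleaned = remove_stop_words(segmented)
# cleaned == ["这部", "电影", "画面", "很", "精彩", "喜欢"]
```

In the Spark pipeline, one possible placement would be a map over the segmented RDD, for example words.map(remove_stop_words), before the term-frequency step, so that high-frequency function words and punctuation do not occupy feature buckets.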
Conclusion
The article demonstrates end‑to‑end construction of a Spark‑based text sentiment classification system, covering data cleaning, feature extraction with HashingTF and TF‑IDF, and model training. It shows how Spark’s distributed computing capabilities enable handling high‑dimensional text data and provides practical tips for boosting model performance.
Huawei Cloud Developer Alliance