How to Turn Crawled CSV Data into Word Clouds and Sentiment Scores with Python
This guide walks you through extracting text from a CSV obtained via Python web scraping, cleaning it with stop‑words, generating a word‑cloud, performing jieba tokenization and frequency analysis, and finally applying SnowNLP for sentiment scoring, with all code snippets and data links provided.
Introduction
A fan asked how to process text data scraped with Python: the data is in a CSV, needs to be saved to txt, then visualized as a word cloud, tokenized, and sentiment‑analyzed.
1. Approach
The overall workflow is: extract the text from the CSV, tokenize it and remove stop words, generate a word cloud, count word frequencies, and finally run sentiment analysis.
1. Extract each line of text from the CSV and write to a new txt file.
2. Run a script that uses stop‑words to clean the text and output a processed txt file.
3. Run a script to create a word‑cloud image.
4. Run a script that tokenizes with jieba, counts word frequencies, and writes results to Excel and txt files; the txt output is further processed for sentiment analysis.
5. Run a sentiment‑analysis script to obtain average sentiment scores.
The source code and data are packaged and available on GitHub; reply with the keyword 小明的数据 to receive the download link.
2. Implementation
1. Extract CSV text line by line into a new txt file
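Before the full script, here is a minimal, dependency-free sketch of the same idea using the standard-library csv module instead of pandas. It is an illustration only, not the author's script: the helper name `column_to_lines` and the sample rows are made up, and only the column name Job_Description comes from the script in this section.

```python
import csv
import io


def column_to_lines(csv_file, column):
    """Return the non-empty values of one CSV column, one string per row.

    `column_to_lines` is a hypothetical helper for illustration only.
    """
    reader = csv.DictReader(csv_file)
    lines = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of writing blank lines
            lines.append(value)
    return lines


# Made-up sample standing in for the scraped CSV.
sample = io.StringIO(
    "Job_Title,Job_Description\n"
    "Analyst,熟悉Python和SQL\n"
    "Engineer,\n"
    "Crawler Dev,负责数据采集\n"
)
lines = column_to_lines(sample, "Job_Description")

# One description per line, ready for the line-by-line steps that follow.
text_for_txt = "\n".join(lines) + "\n"
print(text_for_txt)
```

With a real file, the same function could be called on `open('职位描述.csv', encoding='gbk', newline='')`, and `text_for_txt` written to the txt file.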
Run the script 读取csv文件中文本并存txt文档.py to produce 职位表述文本.txt :
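# coding: utf-8
import pandas as pd

df = pd.read_csv('./职位描述.csv', encoding='gbk')
# print(df.head())

# Open the output file once; 'w' avoids duplicating text when the script is rerun.
with open('职位表述文本.txt', mode='w', encoding='utf-8') as file:
    for text in df['Job_Description']:
        # pandas reads empty cells as NaN, not None, so test with pd.notna().
        if pd.notna(text):
            # One description per line, so later steps can read line by line.
            file.write(str(text) + '\n')
print('写入完成')  # "writing finished"
2. Use stop‑words to obtain the final cleaned text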
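The core of this step is a membership test against the stop-word list. The sketch below shows that filter in isolation on tokens that are already split, so it runs without jieba. `filter_stopwords` and the sample words are hypothetical and only illustrate the idea; unlike the script in this section, the sketch also drops whitespace-only tokens.

```python
def filter_stopwords(tokens, stopwords):
    """Keep tokens that are neither stop words nor pure whitespace.

    Hypothetical helper for illustration; `stopwords` should be a set so
    each lookup is a hash check rather than a scan of a list.
    """
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]


# Made-up stop words and tokens; in the real script they would come from
# stop_word.txt and from jieba.cut(line).
stopwords = {"的", "和", "了"}
tokens = ["熟悉", "Python", "和", "SQL", "的", "使用", " "]

cleaned = filter_stopwords(tokens, stopwords)
print(" ".join(cleaned))
```

Loading the stop words into a set once, rather than re-reading the file for every line, keeps the filter fast on large inputs.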
Run the script 使用停用词获取最后的文本内容.py to generate 职位表述文本分词后_outputs.txt :
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba


def stopwordslist(filepath):
    # Read one stop word per line; a set makes the membership test fast.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


def seg_sentence(sentence, stopwords):
    # Tokenize with jieba and keep only tokens that are not stop words.
    outstr = ''
    for word in jieba.cut(sentence.strip()):
        if word not in stopwords and word != '\t':
            outstr += word + " "
    return outstr


# Load the stop-word list once instead of re-reading it for every line.
stopwords = stopwordslist('stop_word.txt')
with open('职位表述文本.txt', 'r', encoding='utf-8') as inputs, \
        open('职位表述文本分词后_outputs.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        outputs.write(seg_sentence(line, stopwords) + '\n')
3. Create a word‑cloud image
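One possible variation, sketched under assumptions: build the cloud from the stop-word-cleaned file of step 2 by counting its tokens and passing the counts to `WordCloud.generate_from_frequencies`, so that stop words do not take up space in the image. The helper names and the sample text are invented for the example, and the `wordcloud` import is deferred so that the counting part runs on its own.

```python
from collections import Counter


def token_frequencies(cleaned_text):
    """Count space-separated tokens, as written by the stop-word step."""
    return Counter(cleaned_text.split())


def render_cloud(frequencies, font_path, out_path):
    """Draw a cloud from precomputed counts (hypothetical helper).

    Requires the third-party `wordcloud` package and a font file that
    contains Chinese glyphs; both are assumptions about the environment.
    """
    from wordcloud import WordCloud  # imported lazily on purpose

    cloud = WordCloud(font_path=font_path, max_words=100)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Made-up sample in the "word word word" format of the cleaned file.
freqs = token_frequencies("数据 分析 Python 数据 挖掘 Python 数据")
print(freqs.most_common(2))
```

Calling `render_cloud(freqs, font_path, '词云图.png')` with a suitable font path would then write the image.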
Run the script 指定txt词云图.py to produce 词云图.png :
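from wordcloud import WordCloud
import jieba
import numpy
import PIL.Image as Image


def cut(text):
    # Chinese text has no spaces, so join the jieba tokens with spaces
    # to give WordCloud explicit word boundaries.
    wordlist_jieba = jieba.cut(text)
    space_wordlist = " ".join(wordlist_jieba)
    return space_wordlist


# The paths below are specific to the author's machine; adjust them locally.
with open(r"C:\Users\pdcfi\Desktop\xiaoming\职位表述文本.txt", encoding="utf-8") as file:
    text = file.read()
text = cut(text)

# The image is used as a mask that defines the shape of the cloud.
mask_pic = numpy.array(Image.open(r"C:\Users\pdcfi\Desktop\xiaoming\python.png"))
# font_path must point to a font that contains Chinese glyphs.
wordcloud = WordCloud(font_path=r"C:/Windows/Fonts/simfang.ttf",
                      collocations=False,
                      max_words=100,
                      min_font_size=10,
                      max_font_size=500,
                      mask=mask_pic).generate(text)
wordcloud.to_file('词云图.png')
4. Token frequency statistics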
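If installing xlwt is inconvenient, the frequency table can also be written as a CSV file that Excel opens directly. This is a sketch of an alternative, not what the script in this section does: the function name, the output file name, and the sample words are invented, and the counting uses `collections.Counter` from the standard library.

```python
import csv
import os
import tempfile
from collections import Counter


def write_frequency_csv(words, path):
    """Count words and write "word,count" rows, most frequent first.

    Hypothetical helper. The utf-8-sig encoding adds a BOM so that Excel
    detects UTF-8 and shows Chinese text correctly.
    """
    ranked = Counter(words).most_common()
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(ranked)
    return ranked


# Made-up keywords standing in for the jieba output.
words = ["数据", "分析", "数据", "Python", "分析", "数据"]
out_path = os.path.join(tempfile.mkdtemp(), "word_count.csv")
ranked = write_frequency_csv(words, out_path)
print(ranked)
```

`most_common()` returns the pairs sorted by count in descending order, which replaces the manual sorting loop.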
Run the script jieba分词并统计词频后输出结果到Excel和txt文档.py to generate wordCount_all_lyrics.xls and 分词结果.txt :
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from collections import Counter

import jieba.analyse
import xlwt

if __name__ == "__main__":
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")

    # extract_tags returns the top TF-IDF keywords of each line (20 by default),
    # so the counts below are keyword occurrences, not raw token counts.
    word_lst = []
    with open('职位表述文本.txt', encoding='utf-8') as f:
        for line in f:
            item = line.strip('\n\r').split('\t')
            word_lst.extend(jieba.analyse.extract_tags(item[0]))

    # most_common() sorts by count, highest first; ties keep first-seen order.
    ranked = Counter(word_lst).most_common()

    # Write each "word count" pair to the txt file and to the Excel sheet.
    with open("分词结果.txt", 'w', encoding='utf-8') as wf2:
        for row, (word, count) in enumerate(ranked):
            wf2.write(word + ' ' + str(count) + '\n')
            sheet.write(row, 0, label=word)
            sheet.write(row, 1, label=count)
    wbk.save('wordCount_all_lyrics.xls')
5. Sentiment analysis results
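The sentiment script reads 情感分析用词.txt, which the workflow describes only as a further-processed version of the frequency output. One plausible preparation, shown purely as an assumption, is to drop the counts from the "word count" lines of 分词结果.txt and keep one word per line. `strip_counts` is a hypothetical helper and not part of the original code.

```python
def strip_counts(lines):
    """Turn "word count" lines into a list of words.

    Assumes each line ends with a space and an integer count, which is
    the format written by the frequency script; other lines are skipped.
    """
    words = []
    for line in lines:
        word, sep, count = line.strip().rpartition(" ")
        if sep and word and count.isdigit():
            words.append(word)
    return words


# Made-up lines in the format of 分词结果.txt.
sample_lines = ["数据分析 12\n", "Python 9\n", "\n", "沟通能力 4\n"]
words = strip_counts(sample_lines)
print("\n".join(words))
```

Writing `words` to 情感分析用词.txt, one entry per line, would give the one-item-per-line input that the script reads.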
Run the script 情感分析.py to compute sentiment scores using SnowNLP:
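#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from snownlp import SnowNLP


def get_word():
    # Read one entry per line and skip blank lines. Iterating over the file
    # keeps the first line, which the earlier readline() loop dropped.
    with open("情感分析用词.txt", encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def get_sentiment(word):
    # s.sentiments is the probability that the text is positive (0 to 1).
    s = SnowNLP(word)
    print(s.sentiments)
    return s.sentiments


if __name__ == '__main__':
    words = get_word()
    scores = [get_sentiment(word) for word in words]
    if scores:
        # Average score over all entries, as described in the approach.
        print('average:', sum(scores) / len(scores))
Conclusion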
The article demonstrates a complete mini‑project: starting from scraped CSV data, it covers stop‑word cleaning, word‑cloud visualization, jieba tokenization, frequency counting, and finally sentiment analysis, providing a practical template for similar text‑processing tasks.
All source code and data are packaged on GitHub; reply with the keyword 小明的数据 in the public account backend to download.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
