Transform Crawled CSV Text into Word Clouds and Sentiment Analysis with Python
Learn step‑by‑step how to extract text from a CSV generated by a Python web crawler, clean it with stop‑words, create a word‑cloud visualization, compute word frequencies, and perform sentiment analysis using jieba and SnowNLP, with all code snippets provided.
Preface
A follower asked how to process text data obtained via a Python web crawler: convert a CSV file into a TXT corpus, then generate a word cloud, perform tokenization, and conduct sentiment analysis.
Approach
The workflow is: extract the text from the CSV, tokenize it and filter out stop words, create a word‑cloud image, count word frequencies, and finally run sentiment analysis on a prepared word list.
Implementation
1. Extract text from CSV to a new TXT file
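If pandas is not available, the same extraction can be sketched with the standard library alone. This is an alternative written for illustration, not the author's script: the helper name `csv_column_to_txt` is hypothetical, and writing one description per line is an assumption made so that later steps can read the file line by line.

```python
import csv


def csv_column_to_txt(csv_path, txt_path, column, encoding="gbk"):
    """Copy the non-empty cells of one CSV column into a text file, one per line.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    written = 0
    with open(csv_path, newline="", encoding=encoding) as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            cell = (row.get(column) or "").strip()
            if not cell:
                continue  # skip missing or blank descriptions
            # Collapse runs of whitespace (including line breaks) so that
            # each record stays on a single line.
            dst.write(" ".join(cell.split()) + "\n")
            written += 1
    return written
```

With the file names used in this article, a call might look like `csv_column_to_txt('./职位描述.csv', '职位表述文本.txt', 'Job_Description')`.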
Run the script 读取csv文件中文本并存txt文档.py to produce 职位表述文本.txt:
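# coding: utf-8
import pandas as pd

df = pd.read_csv('./职位描述.csv', encoding='gbk')
# Open the output file once; append mode is kept from the original script.
with open('职位表述文本.txt', mode='a', encoding='utf-8') as file:
    for text in df['Job_Description']:
        # Empty cells are read as NaN, which an "is not None" check lets through.
        if pd.notna(text):
            # End each record with a newline so descriptions do not run together.
            file.write(str(text) + '\n')
print('写入完成')  # "writing finished"
2. Apply stop‑words and generate the cleaned text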
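Stop-word filtering itself does not depend on the tokenizer. As a minimal sketch, assuming the text has already been split into tokens (for example by `jieba.cut`), the hypothetical helpers below load the stop words once into a `set`, which avoids re-reading the file for every line and keeps each membership test cheap.

```python
def load_stopwords(path, encoding="utf-8"):
    """Read one stop word per line into a set, ignoring blank lines."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


def filter_tokens(tokens, stopwords):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]


# Hand-written sample tokens, not real jieba output.
sample_tokens = ["负责", "的", "数据", "\t", "分析", "和", "建模"]
print(" ".join(filter_tokens(sample_tokens, {"的", "和"})))
# prints: 负责 数据 分析 建模
```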
Run 使用停用词获取最后的文本内容.py to produce 职位表述文本分词后_outputs.txt:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba

def stopwordslist(filepath):
    # Load the stop words into a set for fast membership tests.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

# Load the list once instead of re-reading the file for every line.
stopwords = stopwordslist('stop_word.txt')

def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word + ' '
    return outstr

with open('职位表述文本.txt', 'r', encoding='utf-8') as inputs, \
        open('职位表述文本分词后_outputs.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line)
        outputs.write(line_seg + '\n')
3. Generate a word‑cloud image
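The script in this step hard-codes a Windows font path, and WordCloud needs a font that contains Chinese glyphs, otherwise the words tend to render as empty boxes. The sketch below shows one way to fall back across platforms. The helper `pick_font_path` is hypothetical, and the macOS and Linux paths are assumed typical locations that may not exist on a given machine.

```python
import os

# Candidate fonts with CJK glyphs. Only the first path comes from the script in
# this article; the other two are assumptions and may differ per system.
FONT_CANDIDATES = [
    "C:/Windows/Fonts/simfang.ttf",
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]


def pick_font_path(candidates=FONT_CANDIDATES):
    """Return the first candidate that exists on disk, or None if none do."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

The result could then be passed as `font_path=` to `WordCloud`; when it is `None`, pointing to an installed CJK font by hand is the safer choice.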
Run 指定txt词云图.py to create 词云图.png (the mask image can be replaced with any picture):
from wordcloud import WordCloud
import jieba
import numpy as np
from PIL import Image

def cut(text):
    # WordCloud splits on whitespace, so join the jieba tokens with spaces.
    return " ".join(jieba.cut(text))

with open(r"职位表述文本.txt", encoding="utf-8") as file:
    text = cut(file.read())
# Words are drawn only on the non-white areas of the mask image.
mask_pic = np.array(Image.open(r"python.png"))
wordcloud = WordCloud(font_path=r"C:/Windows/Fonts/simfang.ttf",  # font with Chinese glyphs
                      collocations=False,
                      max_words=100,
                      min_font_size=10,
                      max_font_size=500,
                      mask=mask_pic).generate(text)
wordcloud.to_file('词云图.png')
4. Token frequency statistics
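Counting and ranking can be sketched with `collections.Counter` from the standard library. The sample words are made up for illustration; in the script for this step, the words come from `jieba.analyse.extract_tags`.

```python
from collections import Counter

# Hand-written sample keywords standing in for tokenizer output.
words = ["数据", "分析", "Python", "数据", "沟通", "Python", "数据"]

counts = Counter(words)

# most_common() sorts by count, highest first; ties keep first-seen order.
for word, freq in counts.most_common():
    print(word, freq)
```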
Run jieba分词并统计词频后输出结果到Excel和txt文档.py to obtain wordCount_all_lyrics.xls and 分词结果.txt, then create 情感分析用词.txt for the next step:
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from collections import Counter
import jieba.analyse
import xlwt

wbk = xlwt.Workbook(encoding='utf-8')
sheet = wbk.add_sheet('wordCount')
# Collect the keywords that jieba extracts from each line.
word_lst = []
with open('职位表述文本.txt', encoding='utf-8') as f:
    for line in f:
        word_lst.extend(jieba.analyse.extract_tags(line.strip()))
# Count occurrences and sort from most to least frequent.
ranked = Counter(word_lst).most_common()
with open('分词结果.txt', 'w', encoding='utf-8') as wf2:
    for key, count in ranked:
        wf2.write(key + ' ' + str(count) + '\n')
# Column 0 holds the word, column 1 its count.
for i, (key, count) in enumerate(ranked):
    sheet.write(i, 0, label=key)
    sheet.write(i, 1, label=count)
wbk.save('wordCount_all_lyrics.xls')
5. Sentiment analysis
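Step 4 mentions creating 情感分析用词.txt but does not show how. One possible approach, sketched here as an assumption rather than the author's method, is to take the top-ranked words from 分词结果.txt, whose lines have the form `word count`. The helper name `top_words_to_file` and the default of 50 words are hypothetical, and the `encoding` argument should match however 分词结果.txt was written.

```python
def top_words_to_file(freq_path, out_path, limit=50, encoding="utf-8"):
    """Write the first `limit` words of a 'word count' file, one word per line."""
    words = []
    with open(freq_path, encoding=encoding) as src:
        for line in src:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            words.append(parts[0])
            if len(words) == limit:
                break
    with open(out_path, "w", encoding=encoding) as dst:
        dst.write("\n".join(words) + ("\n" if words else ""))
    return words
```

A call such as `top_words_to_file('分词结果.txt', '情感分析用词.txt')` would produce the one-word-per-line file that the sentiment script reads.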
Run 情感分析.py to compute sentiment scores for each word in 情感分析用词.txt using SnowNLP:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from snownlp import SnowNLP

def get_word():
    # Read one word per line, skipping blank lines.
    with open('情感分析用词.txt', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def get_sentiment(word):
    # sentiments is a score between 0 and 1; higher means more positive.
    s = SnowNLP(word)
    print(s.sentiments)

if __name__ == '__main__':
    words = get_word()
    for word in words:
        get_sentiment(word)
The resulting sentiment scores are visualized in the following image; an average score above 0.5 indicates overall positive sentiment.
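To turn the printed scores into the single average mentioned above, they can be collected and compared with 0.5. The scores in this sketch are invented placeholders, not results from the original data, and `summarize_scores` is a hypothetical helper.

```python
from statistics import mean


def summarize_scores(scores, threshold=0.5):
    """Return (average, label) for a list of sentiment scores in [0, 1]."""
    if not scores:
        raise ValueError("no scores to summarize")
    avg = mean(scores)
    label = "positive" if avg > threshold else "not positive"
    return avg, label


# Placeholder values for illustration only.
example_scores = [0.9, 0.6, 0.3, 0.8]
avg, label = summarize_scores(example_scores)
print(f"average={avg:.2f} -> {label}")
```

In practice the list would be built by having `get_sentiment` return `s.sentiments` instead of printing it.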
Conclusion
This tutorial walks through a complete mini‑project: starting from crawled CSV data, it covers tokenization, word‑cloud generation, frequency counting, and sentiment analysis, providing ready‑to‑run Python scripts and sample outputs for learners to practice.
Python Crawling & Data Mining