
Transform Crawled CSV Text into Word Clouds and Sentiment Analysis with Python

Learn step‑by‑step how to extract text from a CSV generated by a Python web crawler, clean it with stop‑words, create a word‑cloud visualization, compute word frequencies, and perform sentiment analysis using jieba and SnowNLP, with all code snippets provided.

Python Crawling & Data Mining

Sep 27, 2024


Preface

A follower asked how to process text data obtained via a Python web crawler: convert a CSV file into a TXT corpus, then generate a word cloud, perform tokenization, and conduct sentiment analysis.

Approach

The workflow is: extract the text column from the CSV into a TXT file, tokenize it with jieba and filter out stop words, create a word‑cloud image, count word frequencies, and finally run sentiment analysis with SnowNLP on a word list derived from the frequency results.

Implementation

1. Extract text from CSV to a new TXT file

Run the script 读取csv文件中文本并存txt文档.py to produce 职位表述文本.txt:

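# coding: utf-8
import pandas as pd

# The source CSV is GBK-encoded; change the encoding if yours differs
df = pd.read_csv('./职位描述.csv', encoding='gbk')
# Open the output once; dropna() skips empty cells (pandas stores them as NaN, not None)
with open('职位表述文本.txt', mode='w', encoding='utf-8') as file:
    for text in df['Job_Description'].dropna():
        # One description per line so the later steps can iterate line by line
        file.write(str(text) + '\n')
print('写入完成')  # "write complete"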
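
If the encoding of the crawled CSV is not known in advance, the extraction step can try a few candidates in order. The following is a minimal sketch using only the standard library `csv` module instead of pandas; the function name `extract_column`, the encoding order, and the sample file names are assumptions for illustration, not part of the original scripts.

```python
import csv

def extract_column(csv_path, column, txt_path, encodings=("utf-8-sig", "gbk")):
    """Copy the non-empty cells of one CSV column into a text file, one per line."""
    rows = None
    last_error = None
    for enc in encodings:
        try:
            # Decoding errors surface while reading, so read everything inside the try
            with open(csv_path, newline="", encoding=enc) as src:
                rows = list(csv.DictReader(src))
            break
        except UnicodeDecodeError as exc:
            last_error = exc
    if rows is None:
        raise last_error or ValueError("no encodings were supplied")

    written = 0
    with open(txt_path, "w", encoding="utf-8") as dst:
        for row in rows:
            text = (row.get(column) or "").strip()
            if text:
                # Collapse internal line breaks so each record stays on one line
                dst.write(" ".join(text.split()) + "\n")
                written += 1
    return written
```

A call such as `extract_column('职位描述.csv', 'Job_Description', '职位表述文本.txt')` returns the number of descriptions written. UTF-8 is tried first because GBK bytes rarely decode as valid UTF-8, whereas the reverse can succeed silently and produce garbled text.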

2. Apply stop‑words and generate the cleaned text

Run 使用停用词获取最后的文本内容.py to produce 职位表述文本分词后_outputs.txt:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba

def stopwordslist(filepath):
    # Load stop words into a set for fast membership tests
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

# Load the stop-word list once instead of once per input line
stopwords = stopwordslist('stop_word.txt')

def seg_sentence(sentence):
    # Tokenize with jieba, then drop stop words and whitespace-only tokens
    sentence_seged = jieba.cut(sentence.strip())
    return ' '.join(w for w in sentence_seged if w not in stopwords and w.strip())

# Context managers close both files even if an error occurs
with open('职位表述文本.txt', 'r', encoding='utf-8') as inputs, \
        open('职位表述文本分词后_outputs.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        outputs.write(seg_sentence(line) + '\n')
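
The filtering logic itself can be tried without jieba or any files. The sketch below assumes the tokens have already been produced (for example by `jieba.cut`); the helper names `load_stopwords` and `filter_tokens` and the sample words are hypothetical.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines (e.g. an open file)."""
    return {line.strip() for line in lines if line.strip()}

def filter_tokens(tokens, stopwords):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stopwords]

# Made-up sample data: two stop words and one blank line in the stop-word list
stopwords = load_stopwords(["的\n", "和\n", "\n"])
tokens = ["负责", "数据", "的", "清洗", " ", "和", "分析"]
print(" ".join(filter_tokens(tokens, stopwords)))
```

Using a set rather than a list keeps each membership check constant-time, which matters when the stop-word file has thousands of entries and the corpus has many tokens.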

3. Generate a word‑cloud image

Run 指定txt词云图.py to create 词云图.png (the mask image can be replaced with any picture):

from wordcloud import WordCloud
import jieba
import numpy as np
from PIL import Image

def cut(text):
    return " ".join(jieba.cut(text))

with open(r"职位表述文本.txt", encoding="utf-8") as file:
    text = cut(file.read())
    mask_pic = np.array(Image.open(r"python.png"))
    wordcloud = WordCloud(font_path=r"C:/Windows/Fonts/simfang.ttf",
                          collocations=False,
                          max_words=100,
                          min_font_size=10,
                          max_font_size=500,
                          mask=mask_pic).generate(text)
    wordcloud.to_file('词云图.png')

4. Token frequency statistics

Run jieba分词并统计词频后输出结果到Excel和txt文档.py to obtain wordCount_all_lyrics.xls and 分词结果.txt, then create 情感分析用词.txt for the next step:

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import jieba
import jieba.analyse
import xlwt

wbk = xlwt.Workbook(encoding='ascii')
sheet = wbk.add_sheet('wordCount')
word_lst = []
key_list = []
for line in open('职位表述文本.txt', encoding='utf-8'):
    tags = jieba.analyse.extract_tags(line.strip('\n\r'))
    for t in tags:
        word_lst.append(t)
word_dict = {}
with open('分词结果.txt', 'w', encoding='utf-8') as wf2:
    for item in word_lst:
        if item not in word_dict:
            word_dict[item] = 1
        else:
            word_dict[item] += 1
    orderList = sorted(word_dict.values(), reverse=True)
    for i in range(len(orderList)):
        for key in list(word_dict.keys()):
            if word_dict[key] == orderList[i]:
                wf2.write(key + ' ' + str(word_dict[key]) + '\n')
                key_list.append(key)
                word_dict[key] = 0
for i in range(len(key_list)):
    sheet.write(i, 0, label=key_list[i])
    sheet.write(i, 1, label=orderList[i])
wbk.save('wordCount_all_lyrics.xls')
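
Note that `jieba.analyse.extract_tags` returns only the top keywords of each line (ranked by TF‑IDF by default), so the script above counts how many lines feature a keyword rather than every raw occurrence. If plain token counts are wanted instead, a minimal sketch with `collections.Counter` could look like the following. It assumes whitespace-separated tokens such as those in the step 2 output file; the function names and the CSV output (in place of the `.xls` file) are illustrative choices, not the original method.

```python
from collections import Counter
import csv

def count_words(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def write_counts(counts, csv_path):
    """Write (word, count) rows sorted by descending count."""
    # utf-8-sig adds a BOM so Excel detects the encoding of Chinese text
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(counts.most_common())

# Made-up sample lines standing in for the tokenized corpus
sample = ["python 数据 python", "数据 分析 python"]
counts = count_words(sample)
print(counts.most_common(2))
```

`Counter.most_common()` already returns the pairs in descending order, which avoids the nested loop over sorted values used in the script above.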

5. Sentiment analysis

Run 情感分析.py to compute sentiment scores for each word in 情感分析用词.txt using SnowNLP:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from snownlp import SnowNLP

def get_word():
    with open('情感分析用词.txt', encoding='utf-8') as f:
        return [line.strip('\n') for line in f]

def get_sentiment(word):
    s = SnowNLP(word)
    print(s.sentiments)

if __name__ == '__main__':
    words = get_word()
    for word in words:
        get_sentiment(word)

The script prints one score per word; the original post charts these scores in an image. SnowNLP scores range from 0 (negative) to 1 (positive), so an average score above 0.5 indicates overall positive sentiment.
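
Since the script only prints individual scores, the average has to be computed separately. The sketch below shows one way to do that, assuming the scores have been collected into a list (for example by returning `s.sentiments` instead of printing it); the function name `summarize`, the labels, and the sample scores are made up for illustration.

```python
from statistics import mean

def summarize(scores, threshold=0.5):
    """Return the average score, the share of positive scores, and an overall label."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to summarize")
    avg = mean(scores)
    return {
        "average": avg,
        # Fraction of individual scores strictly above the threshold
        "positive_share": sum(s > threshold for s in scores) / len(scores),
        "label": "positive" if avg > threshold else "negative or neutral",
    }

# Made-up scores standing in for SnowNLP output
result = summarize([0.75, 0.5, 0.25, 1.0])
print(result)
```

Reporting the share of positive words alongside the mean helps when a few extreme scores dominate the average.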

Conclusion

This tutorial walks through a complete mini‑project: starting from crawled CSV data, it covers text extraction, tokenization with stop‑word filtering, word‑cloud generation, frequency counting, and sentiment analysis, with Python scripts that learners can run and adapt for practice.
