How to Turn Crawled CSV Data into Word Clouds and Sentiment Scores with Python

This guide walks you through extracting text from a CSV obtained via Python web scraping, cleaning it with stop‑words, generating a word‑cloud, performing jieba tokenization and frequency analysis, and finally applying SnowNLP for sentiment scoring, with all code snippets and data links provided.

Python Crawling & Data Mining

Feb 9, 2022

Introduction

A reader asked how to process text data scraped with Python: the data sits in a CSV file and needs to be saved to a txt file, visualized as a word cloud, tokenized, and analyzed for sentiment.

1. Approach

The overall workflow is: extract the text from the CSV, tokenize it and filter out stop‑words, generate a word cloud, count word frequencies, and finally run sentiment analysis.

1. Extract each line of text from the CSV and write to a new txt file.

2. Run a script that uses stop‑words to clean the text and output a processed txt file.

3. Run a script to create a word‑cloud image.

4. Run a script that tokenizes with jieba, counts word frequencies, and writes results to Excel and txt files; the txt output is further processed for sentiment analysis.

5. Run a sentiment‑analysis script to obtain average sentiment scores.

The source code and data are packaged and available on GitHub; reply with the keyword 小明的数据 to receive the download link.

2. Implementation

1. Extract CSV text line by line into a new txt file

Run the script 读取csv文件中文本并存txt文档.py to produce 职位表述文本.txt :

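# coding: utf-8
import pandas as pd

df = pd.read_csv('./职位描述.csv', encoding='gbk')
# print(df.head())

# Open the output once; mode='w' avoids duplicating text on repeated runs.
with open('职位表述文本.txt', mode='w', encoding='utf-8') as file:
    # dropna() skips empty cells, which pandas reads as NaN (not None).
    for text in df['Job_Description'].dropna():
        # End each description with a newline so later scripts can read line by line.
        file.write(str(text) + '\n')

print('写入完成')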
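To check the extraction logic without pandas, here is a minimal sketch that uses only the standard library. The function name extract_column and the sample rows are illustrative assumptions, not part of the original scripts; only the column name Job_Description comes from the script above. It skips empty cells and collapses line breaks inside a cell, so each description ends up on exactly one line.

```python
import csv
import io


def extract_column(csv_file, column):
    """Return the non-empty values of `column`, one cleaned string per row."""
    lines = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            # Collapse embedded line breaks so each description stays on one line.
            lines.append(" ".join(value.split()))
    return lines


# Illustrative in-memory CSV; the real file would be opened with
# open('职位描述.csv', encoding='gbk', newline='').
sample = io.StringIO('Job_Description\n"熟悉Python\n和SQL"\n\n负责数据分析\n')
descriptions = extract_column(sample, "Job_Description")
print("\n".join(descriptions))
```

Keeping one description per line matters because the next script reads the txt file line by line.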

2. Use stop‑words to obtain the final cleaned text

Run the script 使用停用词获取最后的文本内容.py to generate 职位表述文本分词后_outputs.txt :

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import jieba


def stopwordslist(filepath):
    # Read one stop-word per line; a set gives fast membership tests.
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


# Load the stop-words once instead of re-reading the file for every line.
stopwords = stopwordslist('stop_word.txt')


def seg_sentence(sentence):
    # Tokenize with jieba, then drop stop-words and tab characters.
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word + " "
    return outstr


with open('职位表述文本.txt', 'r', encoding='utf-8') as inputs, \
        open('职位表述文本分词后_outputs.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line)
        outputs.write(line_seg + '\n')
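The filtering step can also be looked at in isolation, without jieba installed. The sketch below is only an illustration: load_stopwords, filter_tokens, and the sample tokens and stop-words are hypothetical, and the token list stands in for what jieba.cut would return.

```python
def load_stopwords(lines):
    """Build a set of stop-words from an iterable of lines, ignoring blanks."""
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(tokens, stopwords):
    """Drop stop-words and whitespace-only tokens, then join with spaces."""
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return " ".join(kept)


# Hypothetical stop-word lines and a pre-tokenized sentence; in the real script
# the lines come from stop_word.txt and the tokens from jieba.cut(...).
stopwords = load_stopwords(["的\n", "和\n", "\n"])
tokens = ["熟悉", "Python", "和", "SQL", "的", "使用", "\t"]
print(filter_tokens(tokens, stopwords))
```

Storing the stop-words in a set keeps each membership test cheap, which helps when the stop-word file is large.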

3. Create a word‑cloud image

Run the script 指定txt词云图.py to produce 词云图.png :

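from wordcloud import WordCloud
import jieba
import numpy
import PIL.Image as Image


def cut(text):
    # Join jieba tokens with spaces so WordCloud can split on whitespace.
    wordlist_jieba = jieba.cut(text)
    space_wordlist = " ".join(wordlist_jieba)
    return space_wordlist


# The absolute paths below are from the author's machine; adjust them to your own.
with open(r"C:\Users\pdcfi\Desktop\xiaoming\职位表述文本.txt", encoding="utf-8") as file:
    text = file.read()

text = cut(text)
# The image acts as a mask: words are drawn only where the mask is not white.
mask_pic = numpy.array(Image.open(r"C:\Users\pdcfi\Desktop\xiaoming\python.png"))
# A font with Chinese glyphs is required, otherwise characters render as boxes.
wordcloud = WordCloud(font_path=r"C:/Windows/Fonts/simfang.ttf",
                      collocations=False,
                      max_words=100,
                      min_font_size=10,
                      max_font_size=500,
                      mask=mask_pic).generate(text)
wordcloud.to_file('词云图.png')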
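If you already have a list of tokens, another option is to count them yourself and pass the counts to WordCloud.generate_from_frequencies. The following is a rough sketch under that assumption. The helpers top_frequencies and render_cloud and the sample tokens are hypothetical, and the font path in the commented-out call is machine-specific.

```python
from collections import Counter


def top_frequencies(tokens, limit=100):
    """Count non-blank tokens and keep the `limit` most common ones."""
    counts = Counter(t for t in tokens if t.strip())
    return dict(counts.most_common(limit))


def render_cloud(frequencies, font_path, out_path="词云图.png"):
    """Render a word cloud from a {word: count} mapping (requires wordcloud)."""
    from wordcloud import WordCloud  # imported lazily; third-party dependency

    cloud = WordCloud(font_path=font_path)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Made-up tokens standing in for the output of jieba.cut(...).
tokens = ["数据", "分析", "数据", "的", "Python", "数据", "分析"]
freqs = top_frequencies(tokens, limit=2)
print(freqs)
# render_cloud(freqs, font_path="C:/Windows/Fonts/simfang.ttf")  # path is machine-specific
```

Counting first makes it easy to inspect or filter the words before any image is drawn.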

4. Token frequency statistics

Run the script jieba分词并统计词频后输出结果到Excel和txt文档.py to generate wordCount_all_lyrics.xls and 分词结果.txt :

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from collections import Counter

import jieba
import jieba.analyse
import xlwt

if __name__ == "__main__":
    wbk = xlwt.Workbook(encoding='utf-8')
    sheet = wbk.add_sheet("wordCount")

    # Collect the keywords jieba extracts from each line of the text.
    # extract_tags returns a line's top keywords by TF-IDF (20 by default),
    # so a word is counted at most once per line.
    word_lst = []
    for line in open('职位表述文本.txt', encoding='utf-8'):
        item = line.strip('\n\r').split('\t')
        tags = jieba.analyse.extract_tags(item[0])
        word_lst.extend(tags)

    # Count occurrences and sort from most to least frequent.
    word_counts = Counter(word_lst).most_common()

    with open("分词结果.txt", 'w', encoding='utf-8') as wf2:
        for word, count in word_counts:
            wf2.write(word + ' ' + str(count) + '\n')

    # Column 0 holds the word, column 1 its count.
    for i, (word, count) in enumerate(word_counts):
        sheet.write(i, 0, label=word)
        sheet.write(i, 1, label=count)
    wbk.save('wordCount_all_lyrics.xls')
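The article does not show how the sentiment script's input file, 情感分析用词.txt, is derived from 分词结果.txt. One plausible reading, and it is only an assumption, is that it holds the word column with the counts stripped. The sketch below illustrates that conversion; words_from_counts and the sample lines are hypothetical.

```python
import io


def words_from_counts(count_lines):
    """Extract the word from each 'word count' line, skipping malformed lines."""
    words = []
    for line in count_lines:
        parts = line.rsplit(" ", 1)  # split on the last space only
        if len(parts) == 2 and parts[1].strip().isdigit() and parts[0].strip():
            words.append(parts[0].strip())
    return words


# Stand-in for open('分词结果.txt', encoding='utf-8'); the contents are made up.
sample = io.StringIO("数据分析 12\nPython 9\n\n沟通能力 3\n")
words = words_from_counts(sample)
print("\n".join(words))
```

Writing the resulting list to a file, one word per line, gives input in the shape the sentiment script reads.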

5. Sentiment analysis results

Run the script 情感分析.py to compute sentiment scores using SnowNLP:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from snownlp import SnowNLP


def get_word():
    # Read one word per line, skipping blank lines.
    with open("情感分析用词.txt", encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def get_sentiment(word):
    # sentiments is the probability (0 to 1) that the text is positive.
    s = SnowNLP(word)
    print(s.sentiments)


if __name__ == '__main__':
    words = get_word()
    for word in words:
        get_sentiment(word)
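The workflow mentions an average sentiment score, while the script prints one score per word. A minimal sketch of the averaging step follows. The names snownlp_score and average_sentiment are hypothetical, and the demonstration uses a stub scorer with made-up values so that it runs without SnowNLP installed.

```python
from statistics import mean


def snownlp_score(text):
    """Score one text with SnowNLP (third-party); returns a float in [0, 1]."""
    from snownlp import SnowNLP  # imported lazily so the helper below stays testable

    return SnowNLP(text).sentiments


def average_sentiment(texts, scorer=snownlp_score):
    """Average the scores of all non-empty texts; None if there is nothing to score."""
    scores = [scorer(t) for t in texts if t.strip()]
    return mean(scores) if scores else None


# Made-up scores via a stub scorer, purely to show the call shape.
fake_scores = {"积极": 0.75, "一般": 0.5, "消极": 0.25}
print(average_sentiment(["积极", "一般", "", "消极"], scorer=fake_scores.get))
```

With real data, calling average_sentiment(get_word()) would use the default SnowNLP scorer.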

Conclusion

This article walks through a complete mini‑project: starting from scraped data in a CSV file, it covers text extraction, stop‑word cleaning, word‑cloud visualization, jieba keyword extraction with frequency counting, and SnowNLP sentiment analysis, providing a practical template for similar text‑processing tasks.

All source code and data are packaged on GitHub; reply with the keyword 小明的数据 to the official account to receive the download link.

