How to Scrape Web Text with Python and Visualize Word Frequencies

This article demonstrates how to use Python's requests and BeautifulSoup to crawl text from a news site, process it with collections, numpy, and jieba for word‑frequency analysis, and then visualize the top terms using pyecharts, providing complete code snippets and explanations.

Python Crawling & Data Mining

Introduction

The author, a Python enthusiast, shares a simple project that combines web crawling, word‑frequency statistics, and visualization, inspired by a student's request about tokenization and visual analysis.

Data Source

The text is fetched from a news platform, but the same approach works for any large‑scale textual data such as reports, papers, social media comments, or music reviews.

Data Acquisition

The following script downloads a page, extracts paragraph text, and saves it to 报告.txt:

import re
import collections  # word‑frequency library
import numpy as np   # data processing
import jieba          # Chinese tokenization
import requests
from bs4 import BeautifulSoup
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType
import warnings
warnings.filterwarnings('ignore')

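# Download the page; fail early on HTTP errors instead of parsing an error page
r = requests.get("https://m.thepaper.cn/baijiahao_11694997", timeout=10)
r.raise_for_status()
r.encoding = "utf-8"
s = BeautifulSoup(r.text, "html.parser")

# Write the text of every <p> element to the file, one paragraph per line
with open("报告.txt", "w", encoding="utf-8") as f:
    for p in s.find_all("p"):
        f.write("{}\n".format(p.text))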

Running this code creates a 报告.txt file containing the page's text.
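
To see what the find_all("p") step collects without touching the network, here is a minimal sketch that pulls paragraph text out of an HTML string. It uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is a swapped-in technique rather than the article's method. The ParagraphExtractor class, the extract_paragraphs function, and the sample HTML are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._depth = 0        # greater than 0 while inside a <p> element
        self._buffer = []      # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Paragraph closed: join its pieces, including nested tags' text
                self.paragraphs.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of every <p> element found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Assumed sample input, not taken from the scraped page
sample = "<html><body><h1>Title</h1><p>First <b>bold</b> line.</p><p>Second line.</p></body></html>"
print(extract_paragraphs(sample))
```

As with c.text in the script above, text inside nested tags such as <b> is kept, while content outside <p> elements (the heading here) is skipped.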

Word‑Frequency Statistics

The next script reads the file, cleans the text, tokenizes it with jieba, removes stop words, counts frequencies, and prints the top 30 terms:

# Read file
with open("./报告.txt", "r", encoding="utf-8") as fn:
    string_data = fn.read()

# Text preprocessing: strip tabs, newlines, and common punctuation
pattern = re.compile(r'\t|,|/|。|\n|\.|-|:|;|\)|\(|\?|"')
string_data = re.sub(pattern, '', string_data)

# Tokenization (precise mode)
seg_list_exact = jieba.cut(string_data, cut_all=False)
object_list = []
remove_words = [u'的', u'要', u'“', u'”', u'和', u',', u'为', u'是', u'以', u'随着', u'对于', u'对', u'等', u'能', u'都', u'。', u' ', u'、', u'中', u'在', u'了', u'通常', u'如果', u'我', u'她', u'(', u')', u'他', u'你', u'?', u'—', u'就', u'着', u'说', u'上', u'这', u'那', u'有', u'也', u'什么', u'·', u'将', u'没有', u'到', u'不', u'去']
for word in seg_list_exact:
    if word not in remove_words:
        object_list.append(word)

# Frequency count
word_counts = collections.Counter(object_list)
word_counts_top30 = word_counts.most_common(30)
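# len(word_counts) is the number of distinct words, not the total token count
print("The 2021 Government Work Report contains %d distinct words" % len(word_counts))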
print(word_counts_top30)

The script outputs the total number of distinct words and the 30 most frequent terms, which are later visualized.
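
The clean, tokenize, filter, and count steps can also be wrapped in one reusable function. The sketch below is one possible arrangement, not the author's code: top_terms and its parameters are hypothetical names, the tokenizer is passed in as an argument so the example runs without jieba, and whitespace-only tokens are dropped explicitly instead of through the stop-word list.

```python
import collections
import re

# Same punctuation pattern as the script above, written as a raw string
PUNCTUATION = re.compile(r'\t|,|/|。|\n|\.|-|:|;|\)|\(|\?|"')


def top_terms(text, tokenizer, stop_words, n=30):
    """Return (distinct word count, the n most common (word, count) pairs)."""
    cleaned = PUNCTUATION.sub("", text)
    # Keep tokens that are not blank and not in the stop-word collection
    tokens = [w for w in tokenizer(cleaned) if w.strip() and w not in stop_words]
    counts = collections.Counter(tokens)
    return len(counts), counts.most_common(n)


# Assumed English sample so the sketch runs with a plain whitespace tokenizer
distinct, top = top_terms("data, data and code. code and data", str.split, {"and"}, n=2)
print(distinct, top)
```

For the Chinese text in this article, the call would be along the lines of top_terms(string_data, jieba.cut, set(remove_words)); passing the stop words as a set makes each membership check faster than scanning a list.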

Visualization

The final step creates a line chart of the top‑30 word frequencies using pyecharts:

import pyecharts
from pyecharts.charts import Line
from pyecharts import options as opts

cate = [i[0] for i in word_counts_top30]
data1 = [i[1] for i in word_counts_top30]

line = (Line()
        .add_xaxis(cate)
        .add_yaxis('Word frequency', data1, markline_opts=opts.MarkLineOpts(data=[opts.MarkLineItem(type_="average")]))
        .set_global_opts(title_opts=opts.TitleOpts(title="Top 30 Word Frequencies"),
                         xaxis_opts=opts.AxisOpts(name_rotate=60, axislabel_opts={"rotate":45})))
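# render_notebook() displays the chart inline in Jupyter;
# outside a notebook, call line.render("word_freq.html") to write an HTML file instead
line.render_notebook()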

The resulting line chart displays the frequency distribution of the most common words.
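
The chart is fed by two parallel lists, and the "average" mark line sits at the arithmetic mean of the plotted counts. The following sketch reproduces that data preparation with the standard library only. The split_pairs function and the sample pairs are hypothetical and stand in for word_counts_top30.

```python
from statistics import mean


def split_pairs(pairs):
    """Split (word, count) pairs into x labels, y values, and the mean count."""
    if not pairs:
        return [], [], None
    words, counts = zip(*pairs)   # two parallel tuples
    return list(words), list(counts), mean(counts)


# Assumed sample in the same shape as word_counts_top30
sample_pairs = [("发展", 12), ("建设", 9), ("经济", 6)]
cate, data1, average = split_pairs(sample_pairs)
print(cate, data1, average)
```

Computing the mean yourself is a quick way to check that the mark line in the rendered chart sits where you expect.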

Conclusion

This straightforward project shows how to combine Python web scraping, text preprocessing, word‑frequency analysis, and data visualization to turn raw online articles into insightful visual reports.
