Build a Real‑Time News Scraper & Word Cloud with Flask and Python

This tutorial shows how to create a multithreaded Python web scraper that fetches news from Sina and NetEase, stores the results in MySQL, generates a live word‑cloud, and serves the data through a simple Flask web application with an HTML front‑end.

Python Crawling & Data Mining

This article demonstrates how to combine Python web scraping, word‑cloud generation, and a Flask web service to display real‑time news recommendations.

Three main steps:

1. Crawl data
2. Generate a word cloud from the crawled text
3. Recommend hot news on a web page

Web‑scraping part

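The scraper uses multithreading to fetch news from Sina and NetEase across 14 categories. The pages load their data via AJAX, and each response wraps a list literal in a JavaScript callback named data_callback. Stripping the wrapper and parsing what remains yields a Python list. Parsing with ast.literal_eval rather than eval is the safer choice, because literal_eval accepts only literals and never executes text fetched from a remote server.

import ast
import requests

def get_wy_teach():
    url = 'https://tech.163.com/special/00097UHL/tech_datalist.js?callback=data_callback'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
    res = requests.get(url=url, headers=headers, timeout=10)
    text = res.text.strip()
    # Keep only what sits between the first '(' and the last ')'.
    # Calling replace() on the final character would delete every ')' in the payload.
    payload = text[text.index('(') + 1:text.rindex(')')]
    # literal_eval parses Python literals only, so the response is never run as code.
    return ast.literal_eval(payload)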

The extracted news items are stored in a MySQL database. A main script launches 14 threads, one per category:

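import threading

def multi_thread():
    # One thread per news category; threads 3 to 12 are elided here.
    t1 = threading.Thread(target=xzg)
    t2 = threading.Thread(target=xz)
    # ...
    t13 = threading.Thread(target=wy_hua)
    t14 = threading.Thread(target=wy_chn)
    # The threads are started but never joined, so this function
    # returns immediately while the scrapers keep running.
    t1.start()
    t2.start()
    # ...
    t13.start()
    t14.start()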
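
The same launch pattern can be written more compactly by keeping the threads in a list, which also makes it easy to wait for all of them. The following is a minimal sketch, not the article's code: run_all, crawl_tech, and crawl_sports are hypothetical names standing in for the real category scrapers.

```python
import threading


def run_all(workers):
    """Start one thread per worker function and wait for all of them."""
    threads = [threading.Thread(target=w, name=w.__name__) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until this worker has finished
    return [t.name for t in threads]


# Hypothetical stand-ins for the real category scrapers.
results = []
results_lock = threading.Lock()


def crawl_tech():
    with results_lock:  # protect the shared list from concurrent appends
        results.append("tech")


def crawl_sports():
    with results_lock:
        results.append("sports")


finished = run_all([crawl_tech, crawl_sports])
```

Because run_all returns only after every join has completed, a caller knows the crawl is done without guessing at a sleep duration.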

Once crawling is complete, a word cloud is generated from the collected text.
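
The article does not show the word-cloud code, so the following is only a sketch of one way to do it: count token frequencies, then hand them to the third-party wordcloud package. The function names, the regex fallback tokenizer, and the output path are assumptions. Chinese headlines have no spaces between words, so a segmenter such as jieba.lcut would normally be passed as tokenize, along with a font_path that contains CJK glyphs.

```python
import re
from collections import Counter


def word_frequencies(titles, tokenize=None, min_length=2):
    """Count how often each token appears across all titles."""
    if tokenize is None:
        # Fallback tokenizer: runs of word characters. For Chinese text,
        # pass a segmenter (for example jieba.lcut) instead.
        def tokenize(text):
            return re.findall(r"\w+", text)
    counts = Counter()
    for title in titles:
        counts.update(
            tok.lower() for tok in tokenize(title) if len(tok) >= min_length
        )
    return counts


def render_word_cloud(frequencies, out_path="wordcloud.png", font_path=None):
    """Render the frequencies to an image file (needs the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily; counting works without it

    cloud = WordCloud(
        width=800, height=400, background_color="white", font_path=font_path
    )
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return out_path


freqs = word_frequencies(["Python news today", "Python Flask news"])
```

Calling render_word_cloud(freqs) would then write the image, which the web page can reference as a static file.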

Flask part

The Flask application provides two routes: /test to trigger the crawler and /news to render the news list.

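from flask import Flask, render_template
from flask_cors import CORS
import time

import main  # the crawler module that defines multi_thread()

app = Flask(__name__)
CORS(app, resources=r'/*')  # allow cross-origin requests on every route

@app.route('/test', methods=['GET'])
def mytest():
    main.multi_thread()
    # multi_thread() returns as soon as the threads start, so this fixed
    # wait only approximates completion; slow fetches may still be running.
    time.sleep(10)
    return 'Crawling completed~'

@app.route('/news')
def news_list():
    # get_mysql() is not shown in the source; it is assumed to return
    # (title, url) rows read from the MySQL table.
    data = get_mysql()
    return render_template('index4.html', data=data)

if __name__ == '__main__':
    app.run(debug=True, port=5000)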
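
The /news route relies on get_mysql, which the listing does not show. The sketch below illustrates the storage and retrieval step under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server, and the news table, its columns, and the names save_news and get_latest_news are all hypothetical. With a MySQL driver such as pymysql the placeholders would be %s rather than ?, and the duplicate-skipping clause would be INSERT IGNORE.

```python
import sqlite3


def save_news(conn, items):
    """Insert (title, url) pairs, skipping URLs that are already stored."""
    cur = conn.cursor()
    # Parameterized query: values are never spliced into the SQL text.
    cur.executemany(
        "INSERT OR IGNORE INTO news (title, url) VALUES (?, ?)", items
    )
    conn.commit()


def get_latest_news(conn, limit=20):
    """Return up to `limit` (title, url) rows, newest first."""
    cur = conn.cursor()
    cur.execute("SELECT title, url FROM news ORDER BY id DESC")
    return cur.fetchmany(limit)


# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT NOT NULL, "
    "url TEXT NOT NULL UNIQUE)"
)
save_news(conn, [
    ("First headline", "https://example.com/1"),
    ("Second headline", "https://example.com/2"),
    ("First headline", "https://example.com/1"),  # duplicate URL, ignored
])
rows = get_latest_news(conn, limit=20)
```

Each row is a (title, url) tuple, which matches the item[0] and item[1] lookups in the template.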

The HTML template (index4.html) iterates over the data list to display 20 news items:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Today's News Recommendations</title>
</head>
<body>
    <h1>Today's News Recommendations</h1>
    <ul>
        {% for item in data %}
        <li><a href="{{item[1]}}">{{item[0]}}</a></li>
        {% endfor %}
    </ul>
</body>
</html>

The original article includes images of the initial unstyled page, the generated word cloud, and the final Flask interface.

Finally, the article notes that Flask is a lightweight, flexible framework ideal for small‑scale web projects.

Written by

Python Crawling & Data Mining
