Build a Real‑Time News Scraper & Word Cloud with Flask and Python
This tutorial shows how to build a multithreaded Python web scraper that fetches news from Sina and NetEase, stores the results in MySQL, generates a live word cloud from the crawled text, and serves real-time news recommendations through a simple Flask web application with an HTML front end.
Three main steps
Crawl data
Generate a word cloud from the crawled text
Recommend hot news on a web page
Web‑scraping part
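The scraper uses multithreading to fetch news from Sina and NetEase across 14 categories. The pages load their data via AJAX, and each endpoint returns a JSONP-style response: a list literal wrapped in a call to a JavaScript callback named data_callback. Stripping the data_callback( prefix and the closing ) leaves a list literal, which is then evaluated into a Python list.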
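The unwrapping step can be tried in isolation, without a network request. The following is a minimal sketch, assuming the payload inside the callback is a valid Python list literal. The helper name unwrap_jsonp and the sample string are hypothetical and do not come from the original article. It parses with ast.literal_eval, which accepts only literals, so a response can produce data but cannot run code.

```python
import ast


def unwrap_jsonp(text, callback='data_callback'):
    """Return the Python object wrapped in a JSONP-style callback call.

    unwrap_jsonp is a hypothetical helper used for illustration only.
    """
    text = text.strip()
    prefix = callback + '('
    if not text.startswith(prefix):
        raise ValueError('response does not start with ' + prefix)
    # Keep what lies between the opening "(" and the last ")".
    payload = text[len(prefix):text.rindex(')')]
    # literal_eval accepts only literals (lists, dicts, strings, numbers).
    return ast.literal_eval(payload)


# Made-up sample response; the ")" inside the title must survive unwrapping.
sample = 'data_callback([{"title": "Chip news (update)", "docurl": "https://example.com/a"}])'
items = unwrap_jsonp(sample)
print(items[0]['title'])
```

Only the outermost parentheses are removed, so parentheses that appear inside a headline are left intact.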
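import ast
import requests

def get_wy_teach():
    """Fetch the NetEase tech news list and return it as a Python list."""
    url = 'https://tech.163.com/special/00097UHL/tech_datalist.js?callback=data_callback'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
    }
    res = requests.get(url=url, headers=headers)
    data = res.text.strip()
    # Strip the JSONP wrapper: keep only the text between the first "(" and the last ")".
    start = data.index('(') + 1
    end = data.rindex(')')
    # literal_eval parses literals only, so the response cannot execute code as it could with eval.
    return ast.literal_eval(data[start:end])

After the news items are extracted, they are stored in a MySQL database. A main script launches 14 threads, one per category: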
import threading

def multi_thread():
    # One thread per news category; xzg, xz, wy_hua and wy_chn are crawler
    # functions defined elsewhere in the project.
    t1 = threading.Thread(target=xzg)
    t2 = threading.Thread(target=xz)
    # ...
    t13 = threading.Thread(target=wy_hua)
    t14 = threading.Thread(target=wy_chn)
    t1.start()
    t2.start()
    # ...
    t13.start()
    t14.start()

Once crawling is complete, a word cloud is generated from the collected text.
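The article does not show the word cloud code, so the following is only a sketch of one way to get from "crawling is complete" to word cloud input. It joins the worker threads so that completion is well defined, then counts word frequencies with the standard library. The names fake_crawl, run_crawlers, word_frequencies and render_word_cloud, and the sample headlines, are all hypothetical. The regex tokenizer suits the English sample only; Chinese headlines would need a word segmenter such as jieba. The rendering function assumes the third-party wordcloud package is installed.

```python
import re
import threading
from collections import Counter

collected = []            # headlines gathered by all workers
lock = threading.Lock()   # guards `collected` across threads


def fake_crawl(headlines):
    """Hypothetical stand-in for one category crawler."""
    with lock:
        collected.extend(headlines)


def run_crawlers(batches):
    """Start one thread per batch and block until every thread has finished."""
    threads = [threading.Thread(target=fake_crawl, args=(b,)) for b in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def word_frequencies(titles, min_length=2):
    """Count how often each word appears across a list of headlines."""
    counts = Counter()
    for title in titles:
        # Simple tokenizer for the English sample text.
        words = re.findall(r'\w+', title.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts


def render_word_cloud(frequencies, path='wordcloud.png'):
    """Render the frequencies to an image file (needs the wordcloud package)."""
    from wordcloud import WordCloud  # imported here so counting works without it
    cloud = WordCloud(width=800, height=400, background_color='white')
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Made-up headlines standing in for crawled news titles.
run_crawlers([
    ['New chip boosts phone battery life'],
    ['Phone makers race to adopt new chip', 'Battery startup raises funding'],
])
freqs = word_frequencies(collected)
print(freqs.most_common(3))
```

Calling render_word_cloud(freqs) would then write the image. For Chinese text, WordCloud also needs a font_path that points to a font containing Chinese glyphs.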
Flask part
The Flask application provides two routes: /test to trigger the crawler and /news to render the news list.
from flask import Flask, render_template
from flask_cors import CORS
import time
import main

app = Flask(__name__)
CORS(app, resources=r'/*')  # allow cross-origin requests to every route

@app.route('/test', methods=['GET'])
def mytest():
    # Start the crawler threads, then pause to give them time to finish.
    main.multi_thread()
    time.sleep(10)
    return 'Crawling completed~'

@app.route('/news')
def news_list():
    # get_mysql() is not shown in the article; it reads the stored news rows from MySQL.
    data = get_mysql()
    return render_template('index4.html', data=data)

if __name__ == '__main__':
    app.run(debug=True, port=5000)

The HTML template (index4.html) iterates over the data list to display 20 news items:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Today's News Recommendations</title>
</head>
<body>
    <h1>Today's News Recommendations</h1>
    <ul>
        {% for item in data %}
        <li><a href="{{item[1]}}">{{item[0]}}</a></li>
        {% endfor %}
    </ul>
</body>
</html>

The original article includes screenshots of the initial unstyled page, the generated word cloud, and the final Flask interface.
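The template reads item[0] as the title and item[1] as the link, so get_mysql presumably returns a sequence of (title, url) rows. The article shows neither that function nor the table layout. As a rough sketch only, the block below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server. The table name news, its columns, and the helper names save_news and get_news are all assumptions.

```python
import sqlite3


def save_news(conn, items):
    """Insert (title, url) pairs, skipping URLs that are already stored."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS news ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, url TEXT UNIQUE)'
    )
    conn.executemany(
        'INSERT OR IGNORE INTO news (title, url) VALUES (?, ?)', items
    )
    conn.commit()


def get_news(conn, limit=20):
    """Return the newest rows as (title, url) tuples, the shape index4.html reads."""
    cur = conn.execute(
        'SELECT title, url FROM news ORDER BY id DESC LIMIT ?', (limit,)
    )
    return cur.fetchall()


# In-memory database with made-up rows, for demonstration only.
conn = sqlite3.connect(':memory:')
save_news(conn, [
    ('First headline', 'https://example.com/1'),
    ('Second headline', 'https://example.com/2'),
    ('First headline again', 'https://example.com/1'),  # duplicate URL, ignored
])
rows = get_news(conn)
print(rows)
```

With MySQL the same pattern would apply through a driver such as PyMySQL, using %s placeholders instead of ? and INSERT IGNORE instead of INSERT OR IGNORE.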
Finally, the article notes that Flask is a lightweight, flexible framework ideal for small‑scale web projects.
Python Crawling & Data Mining
