Building a Daily News Summarizer: Design, Implementation, and Automation (Part 4)
This article walks through the complete design and implementation of a daily news summarizer, covering source selection, web‑scraping with BeautifulSoup, database schema with SQLModel, LLM‑based summarization, FastAPI endpoints, front‑end layout, category/date browsing, and a scheduled update loop.
Core functional requirements
Select an appropriate news source – the example uses Sina News.
Scrape news titles and full article text for LLM summarization.
Support timed updates; handle possible duplicate entries during refresh.
Provide a minimal UI that shows a concise headline, a link to the original article, and the generated summary.
Classify news into categories such as domestic, international, entertainment, finance, etc.
Allow historical browsing by date.
Overall implementation approach
Use requests to fetch the Sina News homepage and BeautifulSoup (with the lxml parser) to extract headline elements and their hyperlinks; a minimal fetch-and-parse sketch follows this list.
Pass the raw article text to a large language model (e.g., DeepSeek‑Chat) via the OpenAI‑compatible API; the model returns a summary limited to 100 Chinese characters.
Schedule a periodic crawler (every two hours at minute 30) that repeats steps 1‑2 and writes results to the database.
Because the news site’s layout is stable, category mapping can be hard‑coded in the scraper.
Store the update timestamp in the database; use it for grouping by date in the front‑end.
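To make the fetch-and-parse step concrete, here is a minimal sketch. The homepage URL and the explicit UTF-8 encoding are assumptions on my part; the selector shown in the comment is the one get_hots uses later in the article.

import requests
from bs4 import BeautifulSoup

# Fetch the Sina News homepage and parse it with the lxml parser
response = requests.get("https://news.sina.com.cn/", timeout=10)
response.encoding = "utf-8"
html = BeautifulSoup(response.text, "lxml")

# This 'html' object is what the crawler functions below search,
# e.g. html.find(name="div", id="syncad_1", class_="ct_t_01") in get_hots()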
Database schema
DROP TABLE IF EXISTS `news`;
CREATE TABLE `news` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `category` varchar(255) DEFAULT NULL,
  `headline` varchar(255) DEFAULT NULL,
  `hyperlink` varchar(255) DEFAULT NULL,
  `content` tinytext DEFAULT NULL,
  `summary` varchar(255) DEFAULT NULL,
  `createtime` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=253 DEFAULT CHARSET=utf8mb4;

The columns correspond to the fields used throughout the article: id, category, headline, hyperlink, content, summary, and createtime.
Backend code snippets
SQLModel model definition (news.py):
from sqlmodel import Field, SQLModel, Session, select, between
from typing import Optional
from datetime import datetime

class News(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    category: str
    headline: str
    hyperlink: str
    content: str = ""
    summary: str = ""
    createtime: datetime
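The snippets below reference an engine object that the article does not list. A minimal sketch of how it might be created with SQLModel, assuming a local MySQL database reached through the pymysql driver; the URL, credentials, and database name are placeholders, not the article's configuration:

from sqlmodel import create_engine

# MySQL via the pymysql driver; user, password, host, and database name are placeholders
DATABASE_URL = "mysql+pymysql://user:password@localhost:3306/newsdb?charset=utf8mb4"
engine = create_engine(DATABASE_URL, echo=False)

Since the CREATE TABLE statement above already defines the table, calling SQLModel.metadata.create_all(engine) is optional here.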
Deduplication logic in insert_news (crawler.py):

def insert_news(news):
    with Session(engine) as session:
        # A record counts as a duplicate if its hyperlink is already stored
        query = select(News.id).where(News.hyperlink == news.hyperlink)
        result = session.exec(query).all()
        if len(result) > 0:
            print(f"Data {result} already exists, no update needed")
        else:
            session.add(news)
            session.commit()
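A quick illustration of the deduplication behaviour; the sample values are made up:

first = News(category="hot", headline="Example headline",
             hyperlink="https://news.sina.com.cn/example", createtime=datetime.now())
second = News(category="hot", headline="Example headline",
              hyperlink="https://news.sina.com.cn/example", createtime=datetime.now())
insert_news(first)    # row is written
insert_news(second)   # prints "already exists", nothing is written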
Fetching hot news (news.py):

def get_hots():
    # 'html' is the BeautifulSoup object parsed from the Sina News homepage (see the fetch sketch above)
    daily_hots = html.find(name="div", id="syncad_1", class_="ct_t_01")
    for hots in daily_hots.find_all("h1", attrs={"data-client": "headline"}):
        for link in hots.find_all("a", class_="linkNewsTopBold"):
            hyperlink = link.get("href")
            headline = link.text
            news = News(category="hot", headline=headline, hyperlink=hyperlink, createtime=datetime.now())
            insert_news(news)
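The scheduler below also calls get_mils and a generic get_news(block_id, category), which the article does not list. A plausible sketch of get_news, assuming each category section on the homepage is a div identified by the block id passed in and that its links are ordinary anchor tags; the selectors here are assumptions, not the article's code:

def get_news(block_id, category):
    # Assumption: each category block is a <div> whose id matches the id passed by the scheduler
    block = html.find(name="div", id=block_id)
    if block is None:
        return
    for link in block.find_all("a"):
        hyperlink = link.get("href")
        headline = link.text.strip()
        if not hyperlink or not headline:
            continue
        news = News(category=category, headline=headline, hyperlink=hyperlink,
                    createtime=datetime.now())
        insert_news(news)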
LLM summarization (summarize.py):

import os
from openai import OpenAI

def summarize(content):
    # Only call the LLM for texts longer than 100 characters; short texts are returned as-is
    if len(content) > 100:
        # Prompt (in Chinese): "The following content is from Sina News and is public information;
        # please summarize it in no more than 100 characters."
        message = {"role": "user",
                   "content": f"以下内容来自于新浪新闻,属于公开信息,请在100字以内进行摘要:\n{content}"}
        client = OpenAI(api_key=os.getenv("DeepSeek_API_Key"), base_url="https://api.deepseek.com")
        completion = client.chat.completions.create(model="deepseek-chat", messages=[message], stream=False)
        return completion.choices[0].message.content
    return content
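schedule.py also imports get_and_update from summarize.py, which is not listed in the article. A minimal sketch of what it might do, assuming it picks rows without a summary, fetches each article page, and stores the generated summary; the query, the article-body selector, and the module imports are all assumptions:

import requests
from bs4 import BeautifulSoup
from sqlmodel import Session, select
# News and engine are assumed to be imported from the project's model/database modules

def get_and_update():
    with Session(engine) as session:
        rows = session.exec(select(News).where(News.summary == "")).all()
        for item in rows:
            resp = requests.get(item.hyperlink, timeout=10)
            resp.encoding = "utf-8"
            page = BeautifulSoup(resp.text, "lxml")
            # The class name of the article body container is an assumption about Sina's page layout
            article = page.find(name="div", class_="article")
            if article is None:
                continue
            text = article.get_text(strip=True)
            item.summary = summarize(text)[:255]   # the summary column is varchar(255)
            session.add(item)
        session.commit()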
Scheduled update loop (schedule.py):

from datetime import datetime
import time

from crawler import get_hots, get_mils, get_news
from summarize import get_and_update

while True:
    now = datetime.now()
    # Run at minute 30 of every even hour, i.e. roughly every two hours
    if now.hour % 2 == 0 and now.minute == 30:
        get_hots()
        get_mils()
        get_news("blk_gnxw_011", "p_china")
        get_news("blk_gjxw_011", "p_world")
        get_news("blk_cjkjqcfc_011", "p_finance")
        get_news("blk_lctycp_011", "p_ent")
        get_and_update()
        print("News update completed", now)
    else:
        print("Scheduler running, waiting...")
    time.sleep(60)
FastAPI endpoints (main.py):

@app.get('/')
def index(request: Request):
    today = time.strftime("%Y-%m-%d")
    news_list = get_news_by_dc(date=today)
    return templates.TemplateResponse(request, name="news.html", context={"news_list": news_list, "today": today})

@app.get('/{category}')
def query_news(request: Request, category: str):
    today = time.strftime("%Y-%m-%d")
    news_list = get_news_by_dc(date=today, category=category)
    return templates.TemplateResponse(request, name="news.html", context={"news_list": news_list, "today": today})

@app.get('/{date}/{category}')
def query_news_dc(request: Request, date: str, category: str):
    sql = text("select DISTINCT SUBSTRING(createtime,1,10) as mydate from news")
    with Session(engine) as session:
        date_list = session.execute(sql).mappings().all()
    news_list = get_news_by_dc(date=date, category=category)
    return templates.TemplateResponse(request, name="news.html", context={"news_list": news_list, "date_list": date_list, "today": date})
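All three endpoints call get_news_by_dc, which the article does not list. A minimal sketch of how it could look, consistent with the between import in news.py; the signature, default category, and date handling are assumptions:

def get_news_by_dc(date, category="hot"):
    # Assumption: 'date' is a "YYYY-MM-DD" string; match every createtime on that day
    start = f"{date} 00:00:00"
    end = f"{date} 23:59:59"
    query = (select(News)
             .where(News.category == category)
             .where(between(News.createtime, start, end)))
    with Session(engine) as session:
        # session.execute() returns Row objects, which is why the template reads result.News.headline
        return session.execute(query).all()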
Front-end template (news.html):

<body>
<div id="menu">
<div id="logo">每日新闻摘要</div>
<div id="category">
<span style="margin-right:50px">{{today}}</span>
<a href="/hot">今日要闻</a>
<a href="/china">国内新闻</a>
<a href="/world">国际新闻</a>
<a href="/mil">军事新闻</a>
<a href="/finance">财经科技</a>
<a href="/ent">娱乐体育</a>
</div>
</div>
{% for result in news_list %}
<div class="news">
<div class="headline">
<div class="title">{{result.News.headline}}</div>
<div class="view"><a href="{{result.News.hyperlink}}" target="_blank">查看原文</a></div>
</div>
<div class="content">{{result.News.summary}}</div>
</div>
{% endfor %}
<div id="history">
{% for date in date_list %}
<span class="date"><a href="/{{date.mydate}}/hot">{{date.mydate}}</a></span>
{% endfor %}
</div>
</body>

The article demonstrates a complete pipeline: selecting a news source, crawling headlines and content, deduplicating entries, generating AI-driven summaries, persisting the data, exposing FastAPI endpoints, and rendering a simple HTML UI, with category-based and date-based browsing and a scheduled update loop.