Building a Daily News Summarizer: Design, Implementation, and Automation (Part 4)

This article walks through the complete design and implementation of a daily news summarizer, covering source selection, web‑scraping with BeautifulSoup, database schema with SQLModel, LLM‑based summarization, FastAPI endpoints, front‑end layout, category/date browsing, and a scheduled update loop.


Core functional requirements

Select an appropriate news source – the example uses Sina News.

Scrape news titles and full article text for LLM summarization.

Support timed updates; handle possible duplicate entries during refresh.

Provide a minimal UI that shows a concise headline, a link to the original article, and the generated summary.

Classify news into categories such as domestic, international, entertainment, finance, etc.

Allow historical browsing by date.

Overall implementation approach

Use requests to fetch the Sina News homepage and BeautifulSoup (with the lxml parser) to extract headline elements and their hyperlinks.

Pass the raw article text to a large language model (e.g., DeepSeek‑Chat) via the OpenAI‑compatible API; the model returns a summary limited to 100 Chinese characters.

Schedule a periodic crawler (every two hours at minute 30) that repeats steps 1‑2 and writes results to the database.

Because the news site’s layout is stable, category mapping can be hard‑coded in the scraper.

Store the update timestamp in the database; use it for grouping by date in the front‑end.
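
The headline-extraction step described above can be sketched against a static HTML fragment using the same selectors the article's scraper targets. This is an illustrative sketch only: the sample markup is invented, and `html.parser` is used here instead of the article's lxml parser so the snippet needs no extra dependency.

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the Sina homepage structure the scraper targets
SAMPLE = """
<div id="syncad_1" class="ct_t_01">
  <h1 data-client="headline">
    <a class="linkNewsTopBold" href="https://news.sina.com.cn/a.shtml">Example headline</a>
  </h1>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")  # the article uses the lxml parser
block = soup.find("div", id="syncad_1", class_="ct_t_01")
# Collect (headline, link) pairs exactly as the crawler does
pairs = [(a.text.strip(), a.get("href"))
         for h1 in block.find_all("h1", attrs={"data-client": "headline"})
         for a in h1.find_all("a", class_="linkNewsTopBold")]
print(pairs)
```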

Database schema

DROP TABLE IF EXISTS `news`;
CREATE TABLE `news` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `category` varchar(255) DEFAULT NULL,
  `headline` varchar(255) DEFAULT NULL,
  `hyperlink` varchar(255) DEFAULT NULL,
  `content` tinytext,
  `summary` varchar(255) DEFAULT NULL,
  `createtime` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=253 DEFAULT CHARSET=utf8mb4;

The columns correspond to the list in the article (id, category, headline, hyperlink, content, summary, createtime).
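
Since scheduled refreshes can re-crawl the same links, the select-then-insert check shown later could also be enforced at the database level. This is an optional alternative not used in the article:

```sql
-- Optional: let MySQL reject duplicate links outright
ALTER TABLE `news` ADD UNIQUE KEY `uq_news_hyperlink` (`hyperlink`);
```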

Backend code snippets

SQLModel model definition (news.py):

from sqlmodel import Field, SQLModel, Session, select, between
from typing import Optional
from datetime import datetime

class News(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    category: str
    headline: str
    hyperlink: str
    content: str = ""
    summary: str = ""
    createtime: datetime

Deduplication logic in insert_news (crawler.py):

from sqlmodel import Session, select
from news import News  # model defined above; `engine` is created elsewhere in crawler.py

def insert_news(news):
    with Session(engine) as session:
        # Skip the insert if an article with the same link is already stored
        query = select(News.id).where(News.hyperlink == news.hyperlink)
        result = session.exec(query).all()
        if len(result) > 0:
            print(f"Data {result} already exists, no update needed")
        else:
            session.add(news)
            session.commit()
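
The deduplication idea can be exercised end to end. The sketch below reproduces it with stdlib sqlite3, using a UNIQUE constraint plus `INSERT OR IGNORE` — a hypothetical alternative to the article's SQLModel code, not what the article itself does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, hyperlink TEXT UNIQUE, headline TEXT)")

def insert_once(hyperlink, headline):
    # UNIQUE + INSERT OR IGNORE makes the dedup atomic, avoiding the race
    # between the SELECT check and the INSERT in the select-then-insert version
    conn.execute("INSERT OR IGNORE INTO news (hyperlink, headline) VALUES (?, ?)",
                 (hyperlink, headline))
    conn.commit()

insert_once("https://news.sina.com.cn/a.shtml", "first crawl")
insert_once("https://news.sina.com.cn/a.shtml", "second crawl, same link")
count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(count)  # 1
```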

Fetching hot news (news.py):

import requests
from bs4 import BeautifulSoup
from datetime import datetime

# Fetch and parse the Sina News homepage once per crawl run
response = requests.get("https://news.sina.com.cn/")
response.encoding = "utf-8"
html = BeautifulSoup(response.text, "lxml")

def get_hots():
    daily_hots = html.find(name="div", id="syncad_1", class_="ct_t_01")
    for hots in daily_hots.find_all("h1", attrs={"data-client": "headline"}):
        for link in hots.find_all("a", class_="linkNewsTopBold"):
            hyperlink = link.get("href")
            headline = link.text
            news = News(category="hot", headline=headline, hyperlink=hyperlink, createtime=datetime.now())
            insert_news(news)

LLM summarization (summarize.py):

import os
from openai import OpenAI

def summarize(content):
    # Only call the LLM for articles longer than 100 characters; shorter ones pass through
    if len(content) > 100:
        # Prompt (zh): "The following is public content from Sina News; summarize it in 100 characters or fewer."
        message = {"role": "user", "content": f"以下内容来自于新浪新闻,属于公开信息,请在100字以内进行摘要:\n{content}"}
        client = OpenAI(api_key=os.getenv("DeepSeek_API_Key"), base_url="https://api.deepseek.com")
        completion = client.chat.completions.create(model="deepseek-chat", messages=[message], stream=False)
        return completion.choices[0].message.content
    return content
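
`get_and_update`, which the scheduler below imports, is not listed in the article; presumably it walks rows whose summary is still empty and fills them in via `summarize`. The following is a hypothetical sketch of that logic using stdlib sqlite3 and a stand-in summarizer (both invented for illustration):

```python
import sqlite3

def fake_summarize(content):
    # Stand-in for the LLM call above; truncation mimics the 100-character cap
    return content[:100]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, content TEXT, summary TEXT DEFAULT '')")
conn.execute("INSERT INTO news (content) VALUES (?)", ("x" * 300,))

def get_and_update(connection, summarizer):
    # Summarize every article that has content but no summary yet
    rows = connection.execute(
        "SELECT id, content FROM news WHERE summary = '' AND content != ''").fetchall()
    for row_id, content in rows:
        connection.execute("UPDATE news SET summary = ? WHERE id = ?",
                           (summarizer(content), row_id))
    connection.commit()

get_and_update(conn, fake_summarize)
summary_len = len(conn.execute("SELECT summary FROM news WHERE id = 1").fetchone()[0])
print(summary_len)  # 100
```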

Scheduled update loop (schedule.py):

from datetime import datetime
import time
from crawler import get_hots, get_mils, get_news
from summarize import get_and_update

while True:
    now = datetime.now()
    if now.hour % 2 == 0 and now.minute == 30:
        get_hots()
        get_mils()
        get_news("blk_gnxw_011", "p_china")
        get_news("blk_gjxw_011", "p_world")
        get_news("blk_cjkjqcfc_011", "p_finance")
        get_news("blk_lctycp_011", "p_ent")
        get_and_update()
        print("News update completed", now)
    else:
        print("Scheduler running, waiting...")
    time.sleep(60)
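
The trigger condition in the loop above can be factored into a small pure function, which makes the "every two hours at minute 30" cadence easy to check without waiting on the wall clock:

```python
from datetime import datetime

def should_run(now: datetime) -> bool:
    # True on even hours at minute 30, matching the scheduler loop above
    return now.hour % 2 == 0 and now.minute == 30

# Fires at 00:30 and 02:30, but not at 01:30 or 02:29
runs = [should_run(datetime(2024, 1, 1, h, m))
        for h, m in [(0, 30), (1, 30), (2, 30), (2, 29)]]
print(runs)  # [True, False, True, False]
```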

FastAPI endpoints (main.py)

@app.get('/')
def index(request: Request):
    today = time.strftime("%Y-%m-%d")
    news_list = get_news_by_dc(date=today)
    return templates.TemplateResponse(request, name="news.html", context={"news_list": news_list, "today": today})

@app.get('/{category}')
def query_news(request: Request, category: str):
    today = time.strftime("%Y-%m-%d")
    news_list = get_news_by_dc(date=today, category=category)
    return templates.TemplateResponse(request, name="news.html", context={"news_list": news_list, "today": today})

@app.get('/{date}/{category}')
def query_news_dc(request: Request, date: str, category: str):
    sql = text("select DISTINCT SUBSTRING(createtime,1,10) as mydate from news")
    with Session(engine) as session:
        date_list = session.execute(sql).mappings().all()
    news_list = get_news_by_dc(date=date, category=category)
    return templates.TemplateResponse(request, name="news.html", context={"news_list": news_list, "date_list": date_list, "today": date})
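
`get_news_by_dc` itself is not listed in the article; given that the model file imports `between`, it most likely filters `createtime` over a one-day window (and additionally by `category` when one is given). The date arithmetic for that window can be sketched independently of the database — `day_bounds` is a hypothetical helper, not the article's code:

```python
from datetime import datetime, timedelta

def day_bounds(date_str: str):
    # Half-open [start, end) window for one calendar day, suitable for a
    # createtime range filter (e.g. between() or two where() comparisons)
    start = datetime.strptime(date_str, "%Y-%m-%d")
    return start, start + timedelta(days=1)

start, end = day_bounds("2024-05-01")
print(start.isoformat(), end.isoformat())
```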

Front‑end template (news.html)

<body>
  <div id="menu">
    <div id="logo">每日新闻摘要</div>
    <div id="category">
      <span style="margin-right:50px">{{today}}</span>
      <a href="/hot">今日要闻</a> &nbsp;&nbsp;&nbsp;
      <a href="/china">国内新闻</a> &nbsp;&nbsp;&nbsp;
      <a href="/world">国际新闻</a> &nbsp;&nbsp;&nbsp;
      <a href="/mil">军事新闻</a> &nbsp;&nbsp;&nbsp;
      <a href="/finance">财经科技</a> &nbsp;&nbsp;&nbsp;
      <a href="/ent">娱乐体育</a>
    </div>
  </div>
  {% for result in news_list %}
    <div class="news">
      <div class="headline">
        <div class="title">{{result.News.headline}}</div>
        <div class="view"><a href="{{result.News.hyperlink}}" target="_blank">查看原文</a></div>
      </div>
      <div class="content">{{result.News.summary}}</div>
    </div>
  {% endfor %}
  <div id="history">
    {% for date in date_list %}
      <span class="date"><a href="/{{date.mydate}}/hot">{{date.mydate}}</a></span>
    {% endfor %}
  </div>
</body>

The article demonstrates a complete pipeline: selecting a news source, crawling headlines and content, deduplicating entries, generating AI-driven summaries, persisting data, exposing FastAPI endpoints, and rendering a simple HTML UI, with category-based and date-based browsing and a periodic update schedule.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, LLM, FastAPI, Web Scraping, SQLModel, News Summarization
Written by

Woodpecker Software Testing

The Woodpecker Software Testing public account shares software testing knowledge, connects testing enthusiasts, founded by Gu Xiang, website: www.3testing.com. Author of five books, including "Mastering JMeter Through Case Studies".
