Extracting Article Content with Python: BeautifulSoup, urllib, and the newspaper Library

This article explains how to automate web article extraction by first using BeautifulSoup and urllib to crawl search results and retrieve page links, then demonstrates a much simpler approach with the Python newspaper library to directly download and parse article text, including full installation and code examples.

Python Programming Learning Circle

Dec 23, 2021

The author contrasts two approaches to extracting article content. The traditional method combines XPath, CSS selectors, regular expressions, and BeautifulSoup; it requires analyzing each page's HTML by hand and often runs into difficulties. The recommended method uses the newspaper library, which makes extraction much simpler.

In their own work, the author needs to gather content quickly by issuing large numbers of search engine queries to build a corpus, so they initially used BeautifulSoup together with urllib to fetch web pages and extract the main text.

The process is broken into three steps: (1) search Baidu with a keyword like "person name company said", (2) collect the result page links, (3) fetch each page, extract the article body, and save it, optionally performing word segmentation to identify quoted statements.
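
Step (1) can be sketched with the standard library alone. The snippet below is a minimal illustration, not the author's code: the function name `build_search_urls` and the `page_size` default are assumptions made for this example, while the `wd` and `pn` query parameters are the ones the article's crawler uses.

```python
from urllib.parse import urlencode

BAIDU_SEARCH_URL = "https://www.baidu.com/s"


def build_search_urls(keyword, pages, page_size=10):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for page in range(pages):
        # "pn" is the result offset: 0 for page one, 10 for page two, ...
        params = {"wd": keyword, "pn": page * page_size, "ie": "utf-8"}
        urls.append(BAIDU_SEARCH_URL + "?" + urlencode(params))
    return urls


search_urls = build_search_urls("python newspaper", pages=3)
for u in search_urls:
    print(u)
```

`urlencode` takes care of escaping the keyword, which matters for queries containing spaces or Chinese characters.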

Example code for extracting links from a page using BeautifulSoup is provided:

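# Python 3 port of the original Python 2 (urllib2) snippet
import urllib.request
from bs4 import BeautifulSoup

def get_url_list(purl):
    """Return the unique <a> tags found under div.main > div.left > div.newsList."""
    req = urllib.request.Request(purl, headers={'User-Agent': 'Magic Browser'})
    with urllib.request.urlopen(req) as page:
        soup = BeautifulSoup(page.read(), 'html.parser')
    a_div = soup.find('div', {'class': 'main'})
    b_div = a_div.find('div', {'class': 'left'})
    c_div = b_div.find('div', {'class': 'newsList'})
    links4 = set()
    # Walk tag children only; text nodes between tags carry no links
    for link_aa in c_div.find_all(True, recursive=False):
        for link_bb in link_aa.find_all(True, recursive=False):
            links4.add(link_bb.find('a'))
    # find() returns None when an element contains no <a>; discard() is safe if None is absent
    links4.discard(None)
    return list(links4)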
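
To make the idea of "collect the links inside one container" concrete without any network access, here is a dependency-free sketch that swaps BeautifulSoup for `html.parser` from the standard library. The class name `NewsListLinkParser`, the sample HTML, and the `example.com` base URL are all invented for illustration; only the `newsList` class name comes from the code above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NewsListLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div class="newsList">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0    # div nesting depth inside the target container; 0 means outside
        self.links = []   # absolute URLs in document order, without duplicates

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth > 0:
                self.depth += 1
            elif "newsList" in classes:
                self.depth = 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            link = urljoin(self.base_url, attrs["href"])
            if link not in self.links:
                self.links.append(link)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


SAMPLE_HTML = """
<div class="main"><div class="left">
  <div class="newsList">
    <ul><li><a href="/a/1.shtml">One</a></li>
        <li><a href="/a/2.shtml">Two</a></li>
        <li><a href="/a/1.shtml">One again</a></li></ul>
  </div>
</div>
<div class="right"><a href="/ads/x.html">Ad</a></div></div>
"""

parser = NewsListLinkParser("http://example.com/")
parser.feed(SAMPLE_HTML)
links = parser.links
print(links)
```

Unlike the set-based version, this keeps the links in page order and resolves relative paths against the page URL, which is usually what the later download step needs.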

Finding this method cumbersome, the author recommends using newspaper.Article to directly download and parse articles without manual HTML analysis.

from newspaper import Article
url = 'http://news.ifeng.com/a/20180504/58107235_0.shtml'
news = Article(url, language='zh')
news.download()
news.parse()
print(news.text)
print(news.title)

An alternative approach using newspaper.build is also shown:

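import newspaper

# build() scans a news site for article links, so it is normally given a site URL.
# The address below is an assumed example; the source does not show how url is set here.
url = 'http://news.ifeng.com/'
news = newspaper.build(url, language='zh')
if news.articles:  # the list can be empty when no articles are found
    article = news.articles[0]
    article.download()
    article.parse()
    print(article.text)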

The article then presents a more comprehensive Baidu search crawler that sends GET requests with appropriate headers, parses the result page with lxml.etree, extracts titles and URLs, and writes unique URLs to a file:

import requests
from lxml import etree

def baidu_search(wd, pn_max, save_file_name):
    url = "https://www.baidu.com/s"
    seen = set()  # URLs already written during this run
    with open(save_file_name, 'a', encoding='utf-8') as out_data:
        for page in range(pn_max):
            pn = page * 10  # result offset: 10 results per page
            querystring = {"wd": wd, "pn": pn, "oq": wd, "ie": "utf-8", "usm": 2}
            # The header values are elided in the source; supply a real dict before running
            headers = {...}
            try:
                response = requests.request("GET", url, headers=headers, params=querystring)
                html = etree.HTML(response.text, parser=etree.HTMLParser(encoding='utf-8'))
                titles_tags = html.xpath('//div[@id="content_left"]/div/h3/a')
                titles = [tag.xpath('string(.)').strip() for tag in titles_tags]
                urls = html.xpath('//div[@id="content_left"]/div/h3/a/@href')
                for title, link in zip(titles, urls):
                    if link not in seen:  # write each URL only once
                        seen.add(link)
                        out_data.write(link + '\n')
            except Exception as e:
                print("Page failed to load", e)
                continue

Finally, the author shows how to read the saved URLs and process each with newspaper.Article to print the extracted text, and provides installation instructions for the library using pip:

pip3 install --ignore-installed --upgrade newspaper3k
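
The read-and-process step is described but not shown, so the following is a minimal sketch of one way it might look, not the author's code. The helper names `load_urls`, `extract_text`, and `process_file` are hypothetical; the `Article(url, language='zh')`, `download()`, and `parse()` calls are the same ones used earlier in the article.

```python
def load_urls(path):
    """Read one URL per line, skipping blank lines and repeated entries."""
    seen = set()
    urls = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def extract_text(url, language="zh"):
    """Download and parse one article; return (title, text)."""
    # Imported lazily so load_urls() works even without newspaper3k installed
    from newspaper import Article
    article = Article(url, language=language)
    article.download()
    article.parse()
    return article.title, article.text


def process_file(path):
    """Print the title and body of every article listed in the file."""
    for url in load_urls(path):
        try:
            title, text = extract_text(url)
        except Exception as exc:  # network errors, parse failures, etc.
            print("skipped", url, exc)
            continue
        print(title)
        print(text)
```

Calling `process_file` with the file name passed to `baidu_search` would run the whole pipeline. Catching exceptions per URL keeps one unreachable page from stopping the rest of the batch.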

Overall, the article serves as a practical guide for Python developers needing to automate web article collection, contrasting low‑level scraping techniques with the high‑level newspaper library.
