Python Web Scraping Tutorial for Toutiao Using Multiprocessing and AJAX Data Extraction
This article demonstrates how to build a Python web scraper for Toutiao that leverages a multiprocessing pool, identifies AJAX endpoints, fetches JSON data, extracts titles and image URLs, handles Unicode encoding, and downloads images into organized folders.
The tutorial begins by noting that traditional web crawling is outdated for modern sites like Toutiao, which load content via AJAX. It introduces a simple process pool using Python's multiprocessing.Pool to parallelize tasks.
Key code for creating the pool:
<code>from multiprocessing import Pool

p = Pool(4)  # create a pool of 4 worker processes
# ... submit work with p.map(...) or p.apply_async(...) ...
p.close()    # no more tasks will be submitted
p.join()     # block until every worker has finished</code>It then explains how to recognize an AJAX interface by three signs: the content is absent from the raw HTML, XHR requests appear in the browser's network panel, and those requests carry telltale headers such as X-Requested-With. Screenshots illustrate how to locate the correct URLs.
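As a minimal, self-contained sketch of that pool pattern (the fetch function here is a stand-in for real work, not part of the original script), map distributes a list of offsets across the workers and returns the results in submission order:

```python
from multiprocessing import Pool

def fetch(offset):
    # Stand-in for real work such as get_page(offset); just echoes its input.
    return offset

if __name__ == '__main__':
    with Pool(4) as pool:  # 4 worker processes
        # map blocks until every task is done and preserves input order
        results = pool.map(fetch, [j * 20 for j in range(8)])
    print(results)  # [0, 20, 40, 60, 80, 100, 120, 140]
```

The `with` block replaces the explicit close/join pair: leaving it terminates the workers cleanly.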
To retrieve the JSON data, the script constructs request headers (including cookies and a user-agent) and query parameters, then builds the request URL with urlencode. The core function is:
<code>import time
import requests
from urllib.parse import urlencode

def get_page(offset):
    global headers  # shared later by get_image and save_image
    headers = {
        'cookie': 'tt_webid=...; csrftoken=...; ...',
        'user-agent': 'Mozilla/5.0 ...',
        'referer': 'https://www.toutiao.com/search/?keyword=美女',
        'x-requested-with': 'XMLHttpRequest'
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '美女',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time())
    }
    # urlencode builds the full query string, so params need not be passed again;
    # stray spaces in the values (the cause of the original '=+' replace hack) are gone
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    try:
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            return r.json()
    except requests.ConnectionError as e:
        print(e)
</code>After obtaining the JSON, get_image extracts each article's title and image URLs with regular expressions, skipping entries whose title or article URL is missing.
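As an aside on get_page, the query-string construction can be verified in isolation; the parameters below are a small subset of the real ones:

```python
from urllib.parse import urlencode

params = {'aid': '24', 'offset': 20, 'keyword': '美女'}
query = urlencode(params)  # non-string values are converted with str()
print(query)  # aid=24&offset=20&keyword=%E7%BE%8E%E5%A5%B3
```

Note that non-ASCII values such as the keyword are percent-encoded automatically, which is exactly why no manual escaping appears in get_page.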
<code>import re

def get_image(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            if title is None:
                continue
            url_page = item.get('article_url')
            if url_page is None:
                continue
            rr = requests.get(url_page, headers=headers)
            if rr.status_code == 200:
                # the article page embeds its content in a BASE_DATA script block
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match:
                    # image URLs appear as HTML-escaped src attributes
                    result = re.findall(r'img src&#x3D;\\&quot;(.*?)\\&quot;', match.group(), re.S)
                    yield {'title': title, 'image': result}
</code>The save_image function creates a base directory, then a sub-folder for each article title (sanitizing illegal characters), and downloads each image, converting Unicode-escaped URLs back to plain strings before the request.
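Stepping back to get_image for a moment, its findall pattern can be exercised on an invented fragment of the escaped markup (the URL below is made up for illustration):

```python
import re

# Invented fragment mimicking the escaped markup inside BASE_DATA
html = r'&lt;img src&#x3D;\&quot;https://p3.example.com/a.jpg\&quot;&gt;'

# \\ in the raw pattern matches the literal backslash before &quot;
urls = re.findall(r'img src&#x3D;\\&quot;(.*?)\\&quot;', html)
print(urls)  # ['https://p3.example.com/a.jpg']
```

With extraction covered, the download step follows.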
<code>import os

def save_image(content):
    path = 'D://今日头条美女//'
    if not os.path.exists(path):
        os.mkdir(path)
    os.chdir(path)
    # strip characters that are illegal in folder names
    title = content['title'].replace('\t', '')
    if not os.path.exists(title):
        os.mkdir(title)
    os.chdir(title)  # always enter the folder, even if it already existed
    for q, u in enumerate(content['image']):
        # turn \u002F-style escapes back into a normal URL
        u = u.encode('utf-8').decode('unicode_escape')
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
            print(f'Series ----> downloaded image {q}')
</code>The main function ties everything together, and a multiprocessing pool distributes the work across offset ranges to speed up crawling.
<code>def main(offset):
    data = get_page(offset)  # named 'data' to avoid shadowing the json module
    if data is None:
        return
    for content in get_image(data):
        try:
            save_image(content)
        except (FileExistsError, OSError):
            print('Could not create the folder: the title contains illegal characters')
            continue

if __name__ == '__main__':
    pool = Pool()
    groups = [j * 20 for j in range(8)]  # offsets 0, 20, ..., 140
    pool.map(main, groups)
    pool.close()
    pool.join()
</code>The article concludes with reflections on the challenges faced: handling CAPTCHAs, fixing URL-concatenation bugs, decoding Unicode-escaped links, and using a process pool to speed up the downloads.
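One of those challenges, decoding the escaped image links, reduces to the unicode_escape round-trip used in save_image; a made-up sample URL shows the effect:

```python
# A Toutiao-style escaped URL as it appears in the raw JSON/HTML
# (this sample value is invented for illustration).
escaped = 'https:\\u002F\\u002Fp3.example.com\\u002Fimg\\u002F1.jpg'

# utf-8 encoding keeps the backslash sequences intact as bytes;
# unicode_escape then interprets each \u002F as '/'.
decoded = escaped.encode('utf-8').decode('unicode_escape')
print(decoded)  # https://p3.example.com/img/1.jpg
```

Without this step, requests would be sent the literal backslash-escaped string and fail to resolve the host.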