Python Web Scraping Tutorial for Toutiao Using Multiprocessing and AJAX Data Extraction
This article demonstrates how to build a Python web scraper for Toutiao that leverages a multiprocessing pool, identifies AJAX endpoints, fetches JSON data, extracts titles and image URLs, handles Unicode encoding, and downloads images into organized folders.
The tutorial begins by noting that traditional web crawling is outdated for modern sites like Toutiao, which load content via AJAX. It introduces a simple process pool using Python's multiprocessing.Pool to parallelize tasks.
Key code for creating the pool:
<code>from multiprocessing import Pool

p = Pool(4)  # create a pool of 4 worker processes
# ... submit work with p.map(...) or p.apply_async(...) ...
p.close()    # no more tasks will be submitted
p.join()     # block until every worker has finished</code>It then explains how to recognize an AJAX interface by three signs: the content is absent from the raw HTML, XHR requests appear in the browser's network panel, and those requests carry telltale headers such as X-Requested-With. Screenshots illustrate how to locate the correct URLs.
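As a minimal, self-contained sketch of that pool pattern (the fetch function here is a stand-in for real work, not part of the original script), map distributes a list of offsets across the workers and returns the results in submission order:

```python
from multiprocessing import Pool

def fetch(offset):
    # Stand-in for real work such as get_page(offset); just echoes its input.
    return offset

if __name__ == '__main__':
    with Pool(4) as pool:  # 4 worker processes
        # map blocks until every task is done and preserves input order
        results = pool.map(fetch, [j * 20 for j in range(8)])
    print(results)  # [0, 20, 40, 60, 80, 100, 120, 140]
```

The `with` block replaces the explicit close/join pair: leaving it terminates the workers cleanly.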
To retrieve the JSON data, the script constructs request headers (including cookies and a user-agent) and query parameters, then builds the request URL with urlencode. The core function is:
<code>import time
import requests
from urllib.parse import urlencode

def get_page(offset):
    global headers  # shared later by get_image and save_image
    headers = {
        'cookie': 'tt_webid=...; csrftoken=...; ...',
        'user-agent': 'Mozilla/5.0 ...',
        'referer': 'https://www.toutiao.com/search/?keyword=美女',
        'x-requested-with': 'XMLHttpRequest'
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '美女',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time())
    }
    # urlencode builds the full query string, so params need not be passed again;
    # stray spaces in the values (the cause of the original '=+' replace hack) are gone
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    try:
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            return r.json()
    except requests.ConnectionError as e:
        print(e)
</code>After obtaining the JSON, get_image extracts each article's title and image URLs with regular expressions, skipping entries whose title or article URL is missing.
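As an aside on get_page, the query-string construction can be verified in isolation; the parameters below are a small subset of the real ones:

```python
from urllib.parse import urlencode

params = {'aid': '24', 'offset': 20, 'keyword': '美女'}
query = urlencode(params)  # non-string values are converted with str()
print(query)  # aid=24&offset=20&keyword=%E7%BE%8E%E5%A5%B3
```

Note that non-ASCII values such as the keyword are percent-encoded automatically, which is exactly why no manual escaping appears in get_page.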
<code>import re

def get_image(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            if title is None:
                continue
            url_page = item.get('article_url')
            if url_page is None:
                continue
            rr = requests.get(url_page, headers=headers)
            if rr.status_code == 200:
                # the article page embeds its content in a BASE_DATA script block
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match:
                    # image URLs appear as HTML-escaped src attributes
                    result = re.findall(r'img src&#x3D;\\&quot;(.*?)\\&quot;', match.group(), re.S)
                    yield {'title': title, 'image': result}
</code>The save_image function creates a base directory, then a sub-folder for each article title (sanitizing illegal characters), and downloads each image, converting Unicode-escaped URLs back to plain strings before the request.
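Stepping back to get_image for a moment, its findall pattern can be exercised on an invented fragment of the escaped markup (the URL below is made up for illustration):

```python
import re

# Invented fragment mimicking the escaped markup inside BASE_DATA
html = r'&lt;img src&#x3D;\&quot;https://p3.example.com/a.jpg\&quot;&gt;'

# \\ in the raw pattern matches the literal backslash before &quot;
urls = re.findall(r'img src&#x3D;\\&quot;(.*?)\\&quot;', html)
print(urls)  # ['https://p3.example.com/a.jpg']
```

With extraction covered, the download step follows.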
<code>import os

def save_image(content):
    path = 'D://今日头条美女//'
    if not os.path.exists(path):
        os.mkdir(path)
    os.chdir(path)
    # strip characters that are illegal in folder names
    title = content['title'].replace('\t', '')
    if not os.path.exists(title):
        os.mkdir(title)
    os.chdir(title)  # always enter the folder, even if it already existed
    for q, u in enumerate(content['image']):
        # turn \u002F-style escapes back into a normal URL
        u = u.encode('utf-8').decode('unicode_escape')
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
            print(f'Series ----> downloaded image {q}')
</code>The main function ties everything together, and a multiprocessing pool distributes the work across offset ranges to speed up crawling.
<code>def main(offset):
    data = get_page(offset)  # named 'data' to avoid shadowing the json module
    if data is None:
        return
    for content in get_image(data):
        try:
            save_image(content)
        except (FileExistsError, OSError):
            print('Could not create the folder: the title contains illegal characters')
            continue

if __name__ == '__main__':
    pool = Pool()
    groups = [j * 20 for j in range(8)]  # offsets 0, 20, ..., 140
    pool.map(main, groups)
    pool.close()
    pool.join()
</code>The article concludes with reflections on the challenges faced: handling CAPTCHAs, fixing URL-concatenation bugs, decoding Unicode-escaped links, and using a process pool to speed up the downloads.
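One of those challenges, decoding the escaped image links, reduces to the unicode_escape round-trip used in save_image; a made-up sample URL shows the effect:

```python
# A Toutiao-style escaped URL as it appears in the raw JSON/HTML
# (this sample value is invented for illustration).
escaped = 'https:\\u002F\\u002Fp3.example.com\\u002Fimg\\u002F1.jpg'

# utf-8 encoding keeps the backslash sequences intact as bytes;
# unicode_escape then interprets each \u002F as '/'.
decoded = escaped.encode('utf-8').decode('unicode_escape')
print(decoded)  # https://p3.example.com/img/1.jpg
```

Without this step, requests would be sent the literal backslash-escaped string and fail to resolve the host.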