How to Build a Python Baidu Tieba Crawler that Saves Posts to Text Files

This article explains how to create a Python web crawler for Baidu Tieba that extracts the original poster's content, determines page counts, retrieves the thread title, and saves all posts into a local TXT file, complete with usage instructions and code details.

MaGe Linux Operations

Mar 27, 2017

Project Overview: A Python script that crawls a Baidu Tieba thread, extracts the original poster's content, and saves it to a local txt file.

Usage: Create a file named BugBaidu.py, paste the code below, and run it with Python 2.7. Enter the thread URL when prompted.

Functionality: The spider fetches the first page of the thread, reads the total page count from it, extracts the title (which becomes the file name), and then writes the content of every post on every page to a text file.
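The page-count and file-name steps can be sketched in isolation. This is a minimal Python 3 sketch, not the article's Python 2.7 listing: the regular expression and the list of forbidden characters are taken from the script, while the helper names count_pages and safe_filename and the sample markup are assumptions for illustration. Tieba's real markup may no longer match this 2013-era pattern.

```python
import re

def count_pages(page):
    """Return the page count found in the HTML, or 0 if no count is present."""
    # Same pattern as the script's page_counter method.
    match = re.search(r'class="red">(\d+?)</span>', page, re.S)
    return int(match.group(1)) if match else 0

def safe_filename(title):
    """Strip characters that are not allowed in Windows file names, then add .txt."""
    return re.sub(r'[\\/:*?"<>|]', '', title) + '.txt'

# Hypothetical fragment of a thread page, for illustration only.
sample = '<li class="l_reply_num"><span class="red">12</span> pages</li>'
print(count_pages(sample))
print(safe_filename('What is this? A "test" thread'))
```

When no page count is found the sketch returns 0, as the script does, so the download loop simply does not run.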

Key URL pattern: Appending ?see_lz=1&pn=1 to a thread URL shows only the original poster's posts; see_lz=1 turns on the filter and pn is the page number.
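As a minimal Python 3 sketch of that URL pattern (the script itself concatenates strings under Python 2.7), the query string can be built with the standard library. The helper name build_page_url and the thread ID 123456 are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def build_page_url(thread_url, page):
    """Return thread_url with see_lz=1 and pn=<page> as its query string."""
    parts = urlsplit(thread_url)
    # Any query string already on thread_url is replaced, not merged.
    query = urlencode({"see_lz": 1, "pn": page})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

# Hypothetical thread ID, for illustration only.
print(build_page_url("http://tieba.baidu.com/p/123456", 2))
# -> http://tieba.baidu.com/p/123456?see_lz=1&pn=2
```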

HTML parsing: The script uses regular expressions to locate the title (the <h1 class="core_title_txt"> element) and each post body (elements whose id starts with post_content), then strips HTML tags and decodes HTML entities.
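The extraction can be tried on a small sample without touching the network. This is a hedged Python 3 sketch: the two search patterns are the ones used in the script, but the sample HTML and the helper names extract_title and extract_posts are assumptions, and html.unescape stands in for the script's hand-written entity table.

```python
import html
import re

# Hypothetical page fragment, for illustration only.
SAMPLE = (
    '<h1 class="core_title_txt" title="Demo">Demo thread</h1>'
    '<div id="post_content_1" class="d_post_content">First line<br/>A &amp; B</div>'
    '<div id="post_content_2" class="d_post_content">Second <a href="#">post</a></div>'
)

def extract_title(page):
    """Return the text of the first <h1> element, or None if there is none."""
    match = re.search(r'<h1.*?>(.*?)</h1>', page, re.S)
    return match.group(1) if match else None

def extract_posts(page):
    """Return the cleaned text of every element whose id starts with post_content."""
    raw_posts = re.findall(r'id="post_content.*?>(.*?)</div>', page, re.S)
    cleaned = []
    for raw in raw_posts:
        text = re.sub(r'<br\s*/?>', '\n', raw)  # line breaks become newlines
        text = re.sub(r'<.*?>', '', text)        # drop any remaining tags
        cleaned.append(html.unescape(text))      # decode entities such as &amp;
    return cleaned

print(extract_title(SAMPLE))
print(extract_posts(SAMPLE))
```

Regular expressions are fragile on real-world HTML: a post that contains a nested div is cut off at the first closing tag. An HTML parser would be more robust, but the sketch keeps to the approach described here.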

Core Classes:

# -*- coding: utf-8 -*-
#---------------------------------------
# Program: Baidu Tieba Crawler
# Version: 0.5
# Author: why
# Date: 2013-05-16
# Language: Python 2.7
# Operation: Input URL, fetch original poster's content, save locally
# Function: Save original posts to txt file
#---------------------------------------
import string
import urllib2
import re

class HTML_Tool:
    # processing various tags
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")
    EndCharToNoneRex = re.compile("<.*?>")
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
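    # HTML entities and their replacements; &amp; goes last so that text such
    # as "&amp;lt;" is not decoded twice
    replaceTab = [("&lt;","<"),("&gt;",">"),("&quot;","\""),("&nbsp;"," "),("&amp;","&")]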
    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("",x)
        x = self.BgnPartRex.sub("\n    ",x)
        x = self.CharToNewLineRex.sub("\n",x)
        x = self.CharToNextTabRex.sub("\t",x)
        x = self.EndCharToNoneRex.sub("",x)
        for t in self.replaceTab:
            x = x.replace(t[0],t[1])
        return x

class Baidu_Spider:
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'Baidu Tieba crawler started'
    def baidu_tieba(self):
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        endPage = self.page_counter(myPage)
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        self.save_data(self.myUrl, title, endPage)
    def page_counter(self, myPage):
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: found %d page(s) of original-poster content' % endPage
        else:
            endPage = 0
            print u'Crawler report: could not determine how many pages the original poster wrote!'
        return endPage
    def find_title(self, myPage):
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: could not load the thread title!'
        # sanitize filename characters
        for ch in '\\/:*?"<>|':
            title = title.replace(ch, '')
        return title
    def save_data(self, url, title, endPage):
        self.get_data(url, endPage)
        f = open(title + '.txt', 'w+')
        f.writelines(self.datas)
        f.close()
        print u'Crawler report: posts downloaded and saved to a local txt file'
        print u'Press Enter to exit...'
        raw_input()
    def get_data(self, url, endPage):
        url = url + '&pn='
        for i in range(1, endPage+1):
            print u'Crawler report: loading page %d...' % i
            myPage = urllib2.urlopen(url + str(i)).read()
            self.deal_data(myPage.decode('gbk'))
    def deal_data(self, myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n", "").encode('gbk'))
            self.datas.append(data + '\n')

# Entry point. This driver is not in the original listing; it is a minimal
# assumed sketch matching the usage note (prompt for the thread URL, then crawl).
if __name__ == '__main__':
    bdurl = raw_input('Thread URL: ').strip()
    mySpider = Baidu_Spider(bdurl)
    mySpider.baidu_tieba()

After execution, the crawler prints progress messages and creates a .txt file, named after the thread title, containing the extracted posts.
