
Simulating Zhihu Login and Scraping Content with Python Requests

This tutorial demonstrates how to use Python's requests library to simulate Zhihu login by handling dynamic _xsrf tokens, optional captcha verification, saving cookies, and then crawling the main page to extract questions and answer abstracts.



The login endpoint expects three POST parameters: the account (a phone number or an email address), the password, and a hidden _xsrf value that changes for each session. By capturing this token from the login page, the script can construct a valid login request.
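The token-extraction step can be sketched in isolation. The HTML snippet below is a made-up stand-in for Zhihu's real login page, which embeds the token in the same hidden-input form:

```python
import re

# Stand-in for the login-page HTML; the real page embeds the token the same way.
html = '<input type="hidden" name="_xsrf" value="a1b2c3d4"/>'

match = re.search(r'name="_xsrf" value="(.*?)"', html)
xsrf = match.group(1) if match else None
print(xsrf)  # a1b2c3d4
```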

A requests.Session object is created to maintain state across requests, with custom headers mimicking a browser. The session's cookie jar is loaded from a local cookies file if it exists, and saved after a successful login.

Key functions include:

get_xsrf(): fetches the homepage and extracts the _xsrf token with a regular expression.

get_captcha(): downloads the captcha image, displays it with Pillow (or prompts for manual entry), and returns the user-entered text.

isLogin(): checks login status by requesting the profile settings page.

login(secret, account): determines whether the account is a phone number or an email address, builds the appropriate POST data, handles the captcha if required, and saves the cookies.

getPageQuestion(url), getPageAnswerAbstract(url), getPageALL(url): use BeautifulSoup to parse the returned HTML and print question titles, answer abstracts, and related links.

The script’s main block first checks if the user is already logged in; if so, it calls getPageALL to list questions on the Zhihu homepage. Otherwise, it prompts for username and password, performs the login, and then proceeds with scraping.

<code>#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import os.path
import re
import time

import requests

try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

try:
    input = raw_input  # Python 2 compatibility
except NameError:
    pass

try:
    from PIL import Image
except ImportError:
    Image = None

from bs4 import BeautifulSoup

agent = 'Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers = {"Host": "www.zhihu.com", "Referer": "https://www.zhihu.com/", 'User-Agent': agent}

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename='cookies')
try:
    session.cookies.load(ignore_discard=True)
except (IOError, OSError):
    print("Could not load cookies")

def get_xsrf():
    """_xsrf is a dynamic anti-CSRF token embedded in the login page."""
    index_url = 'https://www.zhihu.com'
    index_page = session.get(index_url, headers=headers)
    match = re.search(r'name="_xsrf" value="(.*?)"', index_page.text)
    return match.group(1) if match else ''

def get_captcha():
    """Download the captcha image, show it if Pillow is available, and read the answer."""
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + '&type=login'
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    if Image is not None:
        im = Image.open('captcha.jpg')
        im.show()
    else:
        print('Open %s and enter the captcha manually' % os.path.abspath('captcha.jpg'))
    return input("please input the captcha\n> ")

def isLogin():
    # The settings page redirects (302) to the login page when not authenticated.
    url = "https://www.zhihu.com/settings/profile"
    login_code = session.get(url, headers=headers, allow_redirects=False).status_code
    return login_code == 200

def login(secret, account):
    if re.match(r"^1\d{10}$", account):
        print("Logging in with phone number\n")
        post_url = 'https://www.zhihu.com/login/phone_num'
        postdata = {'_xsrf': get_xsrf(), 'password': secret, 'remember_me': 'true', 'phone_num': account}
    elif "@" in account:
        print("Logging in with email\n")
        post_url = 'https://www.zhihu.com/login/email'
        postdata = {'_xsrf': get_xsrf(), 'password': secret, 'remember_me': 'true', 'email': account}
    else:
        print("Invalid account, please try again")
        return

    login_page = session.post(post_url, data=postdata, headers=headers)
    result = json.loads(login_page.text)
    # r == 0 means success; anything else usually means a captcha is required.
    if result.get('r') != 0:
        postdata["captcha"] = get_captcha()
        login_page = session.post(post_url, data=postdata, headers=headers)
        result = json.loads(login_page.text)
    print(result.get('msg'))
    session.cookies.save()

def getPageQuestion(url2):
    mainpage = session.get(url2, headers=headers)
    soup = BeautifulSoup(mainpage.text, 'html.parser')
    for tag in soup.find_all("a", class_="question_link"):
        print(tag.string)

def getPageAnswerAbstract(url2):
    mainpage = session.get(url2, headers=headers)
    soup = BeautifulSoup(mainpage.text, 'html.parser')
    for tag in soup.find_all('div', class_='zh-summary summary clearfix'):
        print(tag.get_text())
        link = tag.find('a')
        if link is not None:
            print('Link to the full answer:', link.get('href'))

def getPageALL(url2):
    mainpage = session.get(url2, headers=headers)
    soup = BeautifulSoup(mainpage.text, 'html.parser')
    for tag in soup.find_all('div', class_='feed-content'):
        link = tag.find('a', class_='question_link')
        if link is not None:
            print(link.get_text())

if __name__ == '__main__':
    if isLogin():
        print('You are already logged in')
        getPageALL('https://www.zhihu.com')
    else:
        account = input('Please enter your username\n>  ')
        secret = input('Please enter your password\n>  ')
        login(secret, account)
        getPageALL('https://www.zhihu.com')</code>

The article also shows an example of the cookie file content generated after a successful login and includes screenshots of the login process and the final scraping results.

Tags: python, web scraping, Requests, Zhihu, login simulation
Written by Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full-stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.