Extracting QQ Space Posts and Photos with Selenium and Python
This tutorial demonstrates how to install Selenium, log into QQ Space, and programmatically retrieve both status updates and album photos using Python's Selenium and BeautifulSoup libraries, including detailed code examples for login, scrolling, and image downloading.
QQ Space, launched by Tencent in 2005, holds a wealth of memories for users born in the 80s and 90s; this guide shows how to use Python's selenium module to export those posts and album photos.
Install Selenium
Selenium simulates user actions in a browser; install it via:
<code>pip install selenium</code>
Download the matching ChromeDriver from http://npm.taobao.org/mirrors/chromedriver and place it in the same directory as your script.
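Since the script expects ChromeDriver to sit next to it, a small helper can verify the setup before launching the browser. This is a sketch added for illustration (the function name and the PATH fallback are assumptions, not part of the original script):

```python
import os
import shutil
import sys


def find_chromedriver(script_dir=None):
    """Return the path to a chromedriver binary next to the script,
    falling back to any chromedriver found on PATH, else None."""
    if script_dir is None:
        script_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    # On Windows the binary carries an .exe suffix
    name = 'chromedriver.exe' if os.name == 'nt' else 'chromedriver'
    local = os.path.join(script_dir, name)
    if os.path.isfile(local):
        return local
    return shutil.which(name)
```

If this returns None, download the driver matching your Chrome version before running the rest of the tutorial.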
Login
Inspect the login page to locate the username and password fields, then use the following function to log in:
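The function below takes the password as a plain argument; rather than hard-coding credentials in the script, they can be read from environment variables first. A hypothetical sketch (the variable names `QQ_LOGIN` and `QQ_PASSWORD` are assumptions, not part of the original script):

```python
import os


def load_credentials():
    """Read QQ credentials from environment variables.
    Raises if either variable is missing or empty."""
    qq = os.environ.get('QQ_LOGIN')
    password = os.environ.get('QQ_PASSWORD')
    if not qq or not password:
        raise RuntimeError('Set QQ_LOGIN and QQ_PASSWORD before running')
    return qq, password
```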
<code>import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


def login(login_qq, password, business_qq):
    '''
    Log in to QQ Space.
    :param login_qq: QQ account used to log in
    :param password: password for that account
    :param business_qq: QQ number whose space will be visited
    :return: driver on success, None if the space is not accessible
    '''
    driver = webdriver.Chrome()
    driver.get('https://user.qzone.qq.com/{}/311'.format(business_qq))
    driver.implicitly_wait(10)
    driver.find_element_by_id('login_div')
    # The login form is rendered inside an iframe
    driver.switch_to.frame('login_frame')
    # Switch from QR-code login to account/password login
    driver.find_element_by_id('switcher_plogin').click()
    driver.find_element_by_id('u').clear()
    driver.find_element_by_id('u').send_keys(login_qq)
    driver.find_element_by_id('p').clear()
    driver.find_element_by_id('p').send_keys(password)
    driver.find_element_by_id('login_button').click()
    driver.switch_to.default_content()
    driver.implicitly_wait(10)
    time.sleep(5)
    try:
        # This element only exists when the space is accessible
        driver.find_element_by_id('QM_OwnerInfo_Icon')
        return driver
    except NoSuchElementException:
        print('Cannot access ' + business_qq)
        return None
</code>
Extracting Posts (说说)
After logging in, the default page shows the status updates, which load incrementally as you scroll. The script repeatedly scrolls, switches to the appropriate iframe, builds a BeautifulSoup object, and extracts each post's text and author.
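The core of that extraction, collecting the text of every `<pre class="content">` element, can be sketched with only the standard library before looking at the full Selenium function (the simplified markup below is an illustrative assumption; the real script uses BeautifulSoup):

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <pre class="content"> element,
    mimicking the BeautifulSoup find_all step."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_post = False

    def handle_starttag(self, tag, attrs):
        if tag == 'pre' and dict(attrs).get('class') == 'content':
            self._in_post = True
            self.posts.append('')

    def handle_endtag(self, tag):
        if tag == 'pre':
            self._in_post = False

    def handle_data(self, data):
        if self._in_post:
            self.posts[-1] += data


html = '<pre class="content">hello world</pre><pre class="other">skip</pre>'
parser = PostExtractor()
parser.feed(html)
print(parser.posts)  # → ['hello world']
```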
<code>import time

from bs4 import BeautifulSoup


def get_shuoshuo(driver):
    page = 1
    while True:
        # Scroll down several times so lazy-loaded posts appear
        for j in range(1, 5):
            driver.execute_script("window.scrollBy(0,5000)")
            time.sleep(2)
        # The posts are rendered inside an iframe
        driver.switch_to.frame('app_canvas_frame')
        # Round-tripping through GBK drops characters (e.g. emoji)
        # that would otherwise break decoding on some systems
        html = driver.page_source.encode('gbk', 'ignore').decode('gbk')
        bs = BeautifulSoup(html, 'html.parser')
        pres = bs.find_all('pre', class_='content')
        for pre in pres:
            shuoshuo = pre.text
            # The author link stores the poster's name in its title attribute
            tx = pre.parent.parent.find('a', class_='c_tx c_tx3 goDetail')['title']
            print(tx + ':' + shuoshuo)
        # Pagination: stop once the last page has been processed
        page += 1
        maxPage = bs.find('a', title='末页').text  # '末页' = last page
        if int(maxPage) < page:
            break
        driver.find_element_by_link_text(u'下一页').click()  # '下一页' = next page
        driver.switch_to.default_content()
        time.sleep(3)
</code>
Extracting Album Photos
Downloading photos requires navigating the album UI with Selenium, clicking the album button, iterating through each album, and saving each image file.
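Photo names come from the page and may contain characters that are illegal in Windows paths, which would make `urlretrieve` fail. A small helper, added here for illustration (not part of the original script), can sanitize the name before saving:

```python
import re


def safe_filename(name, replacement='_'):
    """Replace characters that Windows rejects in filenames
    (\\ / : * ? " < > |) as well as control characters."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, name)
    # Fall back to a placeholder if nothing usable remains
    return cleaned.strip() or 'unnamed'


print(safe_filename('trip 2008: beach?'))  # → 'trip 2008_ beach_'
```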
<code>import time
from urllib.request import urlretrieve

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains


def get_photo(driver, qq):
    photo_path = "C:/Users/xxx/Desktop/photo/{}/{}.jpg"
    photoIndex = 1  # start from the second cover, as in the original script
    while True:
        driver.switch_to.default_content()
        # Open the album tab from the navigation menu
        driver.find_element_by_xpath('//*[@id="menuContainer"]/div/ul/li[3]/a').click()
        driver.implicitly_wait(10)
        time.sleep(3)
        driver.switch_to.frame('app_canvas_frame')
        albums = driver.find_elements_by_class_name('album-cover')
        albums[photoIndex].click()
        driver.implicitly_wait(10)
        time.sleep(3)
        # Open the first photo to launch the viewer
        p = driver.find_elements_by_class_name('item-cover')[0]
        p.click()
        time.sleep(3)
        driver.switch_to.parent_frame()
        while True:
            img = driver.find_element_by_id('js-img-disp')
            # Strip the thumbnail parameter to get the full-size image
            src = img.get_attribute('src').replace('&t=5', '')
            name = driver.find_element_by_id('js-photo-name').text
            urlretrieve(src, photo_path.format(qq, name))
            # The counter reads e.g. '3/25' (current photo / total)
            counts = driver.find_element_by_xpath('//*[@id="js-ctn-infoBar"]/div/div[1]/span').text.split('/')
            if int(counts[0]) == int(counts[1]):
                # Last photo in this album: close the viewer
                driver.find_element_by_xpath('//*[@id="js-viewer-main"]/div[1]/a').click()
                break
            # Poll for the "next photo" button before clicking it
            for i in range(1, 10):
                try:
                    n = driver.find_element_by_id('js-btn-nextPhoto')
                    ActionChains(driver).click(n).perform()
                    break
                except NoSuchElementException:
                    time.sleep(5)
        photoIndex += 1
        if len(albums) <= photoIndex:
            break
</code>
Conclusion
By following these steps, you can retrieve decades‑old QQ Space posts and photos, preserving nostalgic memories that would otherwise be difficult to access.