Extracting QQ Space Posts and Photos with Selenium and Python
This tutorial demonstrates how to install Selenium, log into QQ Space, and programmatically retrieve both status updates and album photos using Python's Selenium and BeautifulSoup libraries, including detailed code examples for login, scrolling, and image downloading.
QQ Space, launched by Tencent in 2005, holds a wealth of memories for users born in the 80s and 90s; this guide shows how to use Python's selenium module to export those posts and album photos.
Install Selenium
Selenium simulates user actions in a browser; install it via:
<code>pip install selenium</code>
Download the matching ChromeDriver from http://npm.taobao.org/mirrors/chromedriver and place it in the same directory as your script.
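Since the script expects ChromeDriver to sit next to it, a small helper can verify the setup before launching the browser. This is a sketch added for illustration (the function name and the PATH fallback are assumptions, not part of the original script):

```python
import os
import shutil
import sys


def find_chromedriver(script_dir=None):
    """Return the path to a chromedriver binary next to the script,
    falling back to any chromedriver found on PATH, else None."""
    if script_dir is None:
        script_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    # On Windows the binary carries an .exe suffix
    name = 'chromedriver.exe' if os.name == 'nt' else 'chromedriver'
    local = os.path.join(script_dir, name)
    if os.path.isfile(local):
        return local
    return shutil.which(name)
```

If this returns None, download the driver matching your Chrome version before running the rest of the tutorial.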
Login
Inspect the login page to locate the username and password fields, then use the following function to log in:
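The function below takes the password as a plain argument; rather than hard-coding credentials in the script, they can be read from environment variables first. A hypothetical sketch (the variable names `QQ_LOGIN` and `QQ_PASSWORD` are assumptions, not part of the original script):

```python
import os


def load_credentials():
    """Read QQ credentials from environment variables.
    Raises if either variable is missing or empty."""
    qq = os.environ.get('QQ_LOGIN')
    password = os.environ.get('QQ_PASSWORD')
    if not qq or not password:
        raise RuntimeError('Set QQ_LOGIN and QQ_PASSWORD before running')
    return qq, password
```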
<code>import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


def login(login_qq, password, business_qq):
    '''
    Log in to QQ Space.
    :param login_qq: QQ account used to log in
    :param password: password for that account
    :param business_qq: QQ number whose space will be visited
    :return: driver on success, None if the space is not accessible
    '''
    driver = webdriver.Chrome()
    driver.get('https://user.qzone.qq.com/{}/311'.format(business_qq))
    driver.implicitly_wait(10)
    driver.find_element_by_id('login_div')
    # The login form is rendered inside an iframe
    driver.switch_to.frame('login_frame')
    # Switch from QR-code login to account/password login
    driver.find_element_by_id('switcher_plogin').click()
    driver.find_element_by_id('u').clear()
    driver.find_element_by_id('u').send_keys(login_qq)
    driver.find_element_by_id('p').clear()
    driver.find_element_by_id('p').send_keys(password)
    driver.find_element_by_id('login_button').click()
    driver.switch_to.default_content()
    driver.implicitly_wait(10)
    time.sleep(5)
    try:
        # This element only exists when the space is accessible
        driver.find_element_by_id('QM_OwnerInfo_Icon')
        return driver
    except NoSuchElementException:
        print('Cannot access ' + business_qq)
        return None
</code>
Extracting Posts (说说)
After logging in, the default page shows the status updates, which load incrementally as you scroll. The script repeatedly scrolls, switches to the appropriate iframe, builds a BeautifulSoup object, and extracts each post's text and author.
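The core of that extraction, collecting the text of every `<pre class="content">` element, can be sketched with only the standard library before looking at the full Selenium function (the simplified markup below is an illustrative assumption; the real script uses BeautifulSoup):

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <pre class="content"> element,
    mimicking the BeautifulSoup find_all step."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_post = False

    def handle_starttag(self, tag, attrs):
        if tag == 'pre' and dict(attrs).get('class') == 'content':
            self._in_post = True
            self.posts.append('')

    def handle_endtag(self, tag):
        if tag == 'pre':
            self._in_post = False

    def handle_data(self, data):
        if self._in_post:
            self.posts[-1] += data


html = '<pre class="content">hello world</pre><pre class="other">skip</pre>'
parser = PostExtractor()
parser.feed(html)
print(parser.posts)  # → ['hello world']
```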
<code>import time

from bs4 import BeautifulSoup


def get_shuoshuo(driver):
    page = 1
    while True:
        # Scroll down several times so lazy-loaded posts appear
        for j in range(1, 5):
            driver.execute_script("window.scrollBy(0,5000)")
            time.sleep(2)
        # The posts are rendered inside an iframe
        driver.switch_to.frame('app_canvas_frame')
        # Round-tripping through GBK drops characters (e.g. emoji)
        # that would otherwise break decoding on some systems
        html = driver.page_source.encode('gbk', 'ignore').decode('gbk')
        bs = BeautifulSoup(html, 'html.parser')
        pres = bs.find_all('pre', class_='content')
        for pre in pres:
            shuoshuo = pre.text
            # The author link stores the poster's name in its title attribute
            tx = pre.parent.parent.find('a', class_='c_tx c_tx3 goDetail')['title']
            print(tx + ':' + shuoshuo)
        # Pagination: stop once the last page has been processed
        page += 1
        maxPage = bs.find('a', title='末页').text  # '末页' = last page
        if int(maxPage) < page:
            break
        driver.find_element_by_link_text(u'下一页').click()  # '下一页' = next page
        driver.switch_to.default_content()
        time.sleep(3)
</code>
Extracting Album Photos
Downloading photos requires navigating the album UI with Selenium, clicking the album button, iterating through each album, and saving each image file.
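Photo names come from the page and may contain characters that are illegal in Windows paths, which would make `urlretrieve` fail. A small helper, added here for illustration (not part of the original script), can sanitize the name before saving:

```python
import re


def safe_filename(name, replacement='_'):
    """Replace characters that Windows rejects in filenames
    (\\ / : * ? " < > |) as well as control characters."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, name)
    # Fall back to a placeholder if nothing usable remains
    return cleaned.strip() or 'unnamed'


print(safe_filename('trip 2008: beach?'))  # → 'trip 2008_ beach_'
```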
<code>import time
from urllib.request import urlretrieve

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains


def get_photo(driver, qq):
    photo_path = "C:/Users/xxx/Desktop/photo/{}/{}.jpg"
    photoIndex = 1  # start from the second cover, as in the original script
    while True:
        driver.switch_to.default_content()
        # Open the album tab from the navigation menu
        driver.find_element_by_xpath('//*[@id="menuContainer"]/div/ul/li[3]/a').click()
        driver.implicitly_wait(10)
        time.sleep(3)
        driver.switch_to.frame('app_canvas_frame')
        albums = driver.find_elements_by_class_name('album-cover')
        albums[photoIndex].click()
        driver.implicitly_wait(10)
        time.sleep(3)
        # Open the first photo to launch the viewer
        p = driver.find_elements_by_class_name('item-cover')[0]
        p.click()
        time.sleep(3)
        driver.switch_to.parent_frame()
        while True:
            img = driver.find_element_by_id('js-img-disp')
            # Strip the thumbnail parameter to get the full-size image
            src = img.get_attribute('src').replace('&t=5', '')
            name = driver.find_element_by_id('js-photo-name').text
            urlretrieve(src, photo_path.format(qq, name))
            # The counter reads e.g. '3/25' (current photo / total)
            counts = driver.find_element_by_xpath('//*[@id="js-ctn-infoBar"]/div/div[1]/span').text.split('/')
            if int(counts[0]) == int(counts[1]):
                # Last photo in this album: close the viewer
                driver.find_element_by_xpath('//*[@id="js-viewer-main"]/div[1]/a').click()
                break
            # Poll for the "next photo" button before clicking it
            for i in range(1, 10):
                try:
                    n = driver.find_element_by_id('js-btn-nextPhoto')
                    ActionChains(driver).click(n).perform()
                    break
                except NoSuchElementException:
                    time.sleep(5)
        photoIndex += 1
        if len(albums) <= photoIndex:
            break
</code>
Conclusion
By following these steps, you can retrieve decades‑old QQ Space posts and photos, preserving nostalgic memories that would otherwise be difficult to access.