Scraping Taobao Live Chat Messages Using Puppeteer and WebSocket Decryption
This article details a step-by-step method for extracting live chat bullet comments from Taobao live streams: analyzing the page source, intercepting the token-providing API with Puppeteer, establishing a WebSocket connection, and decoding the Base64-encoded, sometimes GZIP-compressed messages to retrieve clean usernames and comment texts.
The author needed to scrape live chat bullet comments from Taobao live streams, found no existing solution, and built a custom crawler. What follows is a detailed reconstruction of the research process.
Page Analysis
The live room URL can be obtained from the share link. Opening the browser's dev tools and filtering for WebSocket requests reveals the WebSocket address. Each message contains a compressType field (either COMMON or GZIP) and a data field that appears to be Base64-encoded.
To acquire the WebSocket address, the token must be retrieved via an API request. The API endpoint is:
<code>http://h5api.m.taobao.com/h5/mtop.mediaplatform.live.encryption/1.0/</code>
With the token, the full WebSocket URL can be constructed.
Writing the Crawler
Because the API query string contains many dynamic parameters, a headless browser (Puppeteer) is used to intercept the request that returns the token. The relevant code is:
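<code>const puppeteer = require('puppeteer')

// Run inside an async function; `url` is the live room URL taken from the share link.
const browser = await puppeteer.launch()
// Reuse the about:blank tab that the browser opens at launch.
const page = (await browser.pages())[0]
await page.setRequestInterception(true)
const api = 'http://h5api.m.taobao.com/h5/mtop.mediaplatform.live.encryption/1.0/'
page.on('request', req => {
  // In current Puppeteer releases url() is a method, not a property.
  if (req.url().includes(api)) {
    console.log(`[${url}] getting token`)
  }
  // With interception enabled, every request must be continued explicitly.
  req.continue()
})
page.on('response', async res => {
  if (!res.url().includes(api)) return
  const data = await res.text()
  const matched = data.match(/"result":"(.*?)"/)
  if (!matched) return // no token in this response; ignore it
  const token = matched[1]
  const wsUrl = `ws://acs.m.taobao.com/accs/auth?token=${token}`
  // wsUrl is what the WebSocket client connects to in the next step.
})
await page.goto(url, { timeout: 0 })
console.log(`[${url}] page loaded`)
</code>
A performance tip: reuse the default about:blank page that the browser opens at launch instead of creating a new one, which reduces the number of processes.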
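The response listener above builds wsUrl inside a callback, so the rest of the program has no direct way to wait for it. One possible way to expose it is to wrap the listener in a Promise. The sketch below is an illustration, not the author's code: the helper name waitForWsUrl is made up, and it only assumes that `page` offers `on`/`off` and that responses offer `url()` and `text()`, as Puppeteer's objects do.

```javascript
// Hypothetical helper (not from the original crawler): resolves with the
// WebSocket URL once a response from the token API has been seen.
function waitForWsUrl(page, api) {
  return new Promise((resolve, reject) => {
    const onResponse = async res => {
      if (!res.url().includes(api)) return
      try {
        const body = await res.text()
        const matched = body.match(/"result":"(.*?)"/)
        if (!matched) return // no token in this response; keep listening
        page.off('response', onResponse)
        resolve(`ws://acs.m.taobao.com/accs/auth?token=${matched[1]}`)
      } catch (err) {
        page.off('response', onResponse)
        reject(err)
      }
    }
    page.on('response', onResponse)
  })
}

// Usage sketch: register the listener before navigating, then await the result.
//   const pending = waitForWsUrl(page, api)
//   await page.goto(url, { timeout: 0 })
//   const wsUrl = await pending
```

Registering the listener before calling page.goto matters here, since the token request may fire while the page is still loading.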
After obtaining the WebSocket URL, a connection is opened and messages are logged:
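<code>// Assumes the `ws` npm package, whose API matches the .on() calls below.
const WebSocket = require('ws')

const ws = new WebSocket(wsUrl)
ws.on('open', () => console.log(`\nOPEN: ${wsUrl}\n`))
ws.on('close', () => console.log('DISCONN'))
ws.on('message', msg => console.log(msg))
</code>
Message Decryption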
Messages arrive with a compressType of either COMMON (plain) or GZIP (needs gunzip). After Base64 decoding, GZIP messages are decompressed with zlib.gunzipSync. The byte values of the resulting buffer are then joined into a comma-separated string of decimal numbers so that patterns can be matched against it.
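The decoding steps just described can be sketched as follows. This is a minimal sketch rather than the author's code: it assumes each frame has already been parsed into an object carrying the compressType and data fields seen in dev tools, and the name toBufferStr is made up for illustration.

```javascript
const zlib = require('zlib')

// Hypothetical helper: the { compressType, data } message shape is an
// assumption based on the fields observed in dev tools.
function toBufferStr(message) {
  // The data field is Base64 text; decode it to raw bytes first.
  const raw = Buffer.from(message.data, 'base64')
  // Only GZIP messages need decompressing; COMMON payloads are used as-is.
  const bytes = message.compressType === 'GZIP' ? zlib.gunzipSync(raw) : raw
  // Join the decimal byte values, so the bytes of "hi" become "104,105".
  return Array.from(bytes).join(',')
}
```

Working on a string of decimal byte values lets one regular expression match binary framing bytes and UTF-8 text bytes alike, at the cost of decoding the captured groups back into text afterwards.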
A regular expression extracts the nickname and the comment content, while a specific buffer pattern is used to filter out follow‑notifications:
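<code>// Runs inside the message handler; bufferStr is the comma-separated byte string described above.
// Follow notifications contain the UTF-8 bytes of "⁂∰⏇follow"; skip them.
const followedPattern = '226,129,130,226,136,176,226,143,135,102,111,108,108,111,119'
if (bufferStr.includes(followedPattern)) return
// Capture group 1 holds the nickname bytes, group 2 the comment bytes.
const barragePattern = /.*,[0-9]+,0,18,[0-9]+,(.*?),32,[0-9]+,[0-9]+,[0-9]+,[0-9]+,[0-9]+,44,50,2,116,98,[0-9]+,0,10,[0-9]+,(.*?),18,20,10,12/
const matched = bufferStr.match(barragePattern)
if (matched) {
  // parseStr (not shown in the source) converts a captured byte string back into text.
  const nick = parseStr(matched[1])
  const barrage = parseStr(matched[2])
  console.log(`${nick}: ${barrage}`)
}
</code>
The author notes that some parts of the buffer appear to be time-related and may change over days, requiring future regex adjustments.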
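The parseStr helper is used above but never shown. A plausible reconstruction, assuming it simply reverses the comma-joining step and decodes the bytes as UTF-8, might look like this; the body is a guess at the author's implementation, not a copy of it.

```javascript
// Hypothetical reconstruction of parseStr; the original helper is not shown.
function parseStr(byteStr) {
  // "102,111,..." -> [102, 111, ...] -> Buffer -> UTF-8 text
  const bytes = byteStr.split(',').map(Number)
  return Buffer.from(bytes).toString('utf8')
}

console.log(parseStr('102,111,108,108,111,119')) // prints "follow"
```

Because nicknames and comments are usually Chinese, most characters span three bytes each, which is why the capture groups are decoded as a whole rather than byte by byte.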
Process Maintenance
In production, the main process forks a child process to run the crawler. The child obtains the WebSocket URL, returns it to the parent, and then terminates, which closes the Puppeteer browser and frees its resources. The token expires shortly after the WebSocket disconnects, so a reconnection needs a fresh token.
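The parent side of this arrangement could be sketched with Node's child_process.fork and its built-in IPC channel. This is an illustration under assumptions, not the author's production code: the helper name fetchWsUrl, the `{ wsUrl }` message shape, and the child script's behavior described in the comments are all made up for the example.

```javascript
const { fork } = require('child_process')

// Hypothetical parent-side helper. The child script (whatever file holds the
// Puppeteer crawler) is assumed to finish with something like:
//   await browser.close()
//   process.send({ wsUrl })
// and then exit on its own once its event loop is empty.
function fetchWsUrl(scriptPath, args = []) {
  return new Promise((resolve, reject) => {
    const child = fork(scriptPath, args)
    child.once('message', msg => resolve(msg.wsUrl))
    child.once('error', reject)
    // The IPC channel closes when the child exits. If no URL arrived by then,
    // fail; if one did, this reject is a no-op on the settled promise.
    child.once('disconnect', () => {
      reject(new Error('child exited without sending a WebSocket URL'))
    })
  })
}
```

Keeping the browser in a short-lived child means the long-running parent only holds the WebSocket connection, and it can call the helper again whenever it needs a new token.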