
Common Regular Expressions and Methods for Python Web Scraping

This article presents a practical collection of Python regular‑expression techniques for extracting HTML elements such as table rows, links, titles, images, and scripts, showing how to filter tags and handle URL parameters during web crawling.

Python Programming Learning Circle

Apr 12, 2021


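This article introduces frequently used regular‑expression patterns and Python code for web scraping, aiming to solve common crawling problems and help readers extract information from HTML pages. The listings are written for Python 2: they use the print statement, unicode(), and urllib.urlopen.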

1. Extract content between <tr> and </tr> tags

res_tr = r'<tr>(.*?)</tr>'

m_tr = re.findall(res_tr, language, re.S|re.M)

Example:

# coding=utf-8
import re

language = '''<tr><th>性別：</th><td>男</td></tr><tr>'''

# Pull out each table row first
res_tr = r'<tr>(.*?)</tr>'
m_tr = re.findall(res_tr, language, re.S|re.M)
for line in m_tr:
    print line
    # Header cells of this row
    res_th = r'<th>(.*?)</th>'
    m_th = re.findall(res_th, line, re.S|re.M)
    for mm in m_th:
        print unicode(mm, 'utf-8'),  # trailing comma keeps the value on the same line
    # Data cells of this row
    res_td = r'<td>(.*?)</td>'
    m_td = re.findall(res_td, line, re.S|re.M)
    for nn in m_td:
        print unicode(nn, 'utf-8')

Output:

>> <th>性別：</th><td>男</td>

性別： 男
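On Python 3, where print is a function and str is already Unicode, the same row-then-cell extraction might look like the sketch below. The helper name extract_rows and the combined th/td pattern are illustrative choices, not part of the original listing.

```python
import re

# Compile once; re.S lets '.' span newlines inside a row or cell.
ROW_RE = re.compile(r'<tr>(.*?)</tr>', re.S)
CELL_RE = re.compile(r'<t[hd]>(.*?)</t[hd]>', re.S)


def extract_rows(html):
    """Return one list of cell texts (th and td, in document order) per <tr> row."""
    return [CELL_RE.findall(row) for row in ROW_RE.findall(html)]


language = '''<tr><th>性別：</th><td>男</td></tr><tr>'''
for cells in extract_rows(language):
    print(' '.join(cells))
```

The trailing unclosed <tr> in the sample is ignored because the row pattern only matches when a closing </tr> is present.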

2. Extract text between <a href=..> and </a> tags

res = r'<a .*?>(.*?)</a>'

mm = re.findall(res, content, re.S|re.M)

urls = re.findall(r"<a.*?href=.*?</a>", content, re.I|re.S|re.M)

Example:

# coding=utf-8
import re

content = '''<td><a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a><a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a></td>'''

# Link text (the label reads "link text content")
print u'获取链接文本内容:'
res = r'<a .*?>(.*?)</a>'
mm = re.findall(res, content, re.S|re.M)
for value in mm:
    print value

# Complete <a> elements (the label reads "full link content")
print u'\n获取完整链接内容:'
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I|re.S|re.M)
for i in urls:
    print i

# URL inside each href attribute (the label reads "URL in the link")
print u'\n获取链接中URL:'
res_url = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
link = re.findall(res_url, content, re.I|re.S|re.M)
for url in link:
    print url

Output:

获取链接文本内容:

浙江省主题介绍

贵州省主题介绍

获取完整链接内容:

<a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>

<a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a>

获取链接中URL:

https://www.baidu.com/articles/zj.html

https://www.baidu.com//articles/gz.html
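The three passes above can also be folded into one pattern that captures the URL and the link text together. The following is a minimal Python 3 sketch; LINK_RE and extract_links are hypothetical names, and the pattern assumes the href value is quoted.

```python
import re

# Group 1 is the quote character, so the closing quote must be the same one.
# Group 2 is the URL, group 3 the link text.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref=(["\'])(.*?)\1[^>]*>(.*?)</a>',
    re.I | re.S,
)


def extract_links(html):
    """Return (url, text) tuples for every <a href="..."> element found."""
    return [(m.group(2), m.group(3)) for m in LINK_RE.finditer(html)]


content = ('<td><a href="https://www.baidu.com/articles/zj.html" title="浙江省">'
           '浙江省主题介绍</a>'
           '<a href="https://www.baidu.com//articles/gz.html" title="贵州省">'
           '贵州省主题介绍</a></td>')
for url, text in extract_links(content):
    print(url, text)
```

Keeping the URL and text in one tuple avoids having to line up two separate result lists afterwards.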

3. Obtain the last segment of a URL for naming images or parameters

urls = "http://i1.hoopchina.com.cn/blogfile/201411/11/BbsImg141568417848931_640*640.jpg"
values = urls.split('/')[-1]
print values

Output: BbsImg141568417848931_640*640.jpg

For query strings:

url = 'http://localhost/test.py?a=hello&b=world'
values = url.split('?')[-1]
print values
for key_value in values.split('&'):
    print key_value.split('=')

Output:

a=hello&b=world

['a', 'hello']

['b', 'world']
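Plain split() works for these two URLs, but split('/')[-1] keeps any query string attached to the file name. As an alternative sketch, the Python 3 standard library module urllib.parse separates the path from the query first; last_segment and query_pairs are hypothetical helper names.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def last_segment(url):
    """Return the final path component, without query string or fragment."""
    return posixpath.basename(urlsplit(url).path)


def query_pairs(url):
    """Return the query string as a list of URL-decoded (key, value) tuples."""
    return parse_qsl(urlsplit(url).query)


print(last_segment("http://i1.hoopchina.com.cn/blogfile/201411/11/BbsImg141568417848931_640*640.jpg"))
print(last_segment('http://localhost/test.py?a=hello&b=world'))
print(query_pairs('http://localhost/test.py?a=hello&b=world'))
```

For the second URL, last_segment returns test.py, whereas url.split('/')[-1] would return test.py?a=hello&b=world.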

4. Crawl all URL links from a page

# coding=utf-8
import re, urllib

url = "http://www.csdn.net/"
content = urllib.urlopen(url).read()

# Complete <a> elements
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I)
for url in urls:
    print unicode(url, 'utf-8')

# URL inside each href attribute
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", content)
for url in link_list:
    print url

Sample output shows several anchor tags and raw URLs.
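On Python 3, urllib.urlopen has moved to urllib.request.urlopen and returns bytes that must be decoded. The sketch below shows one way to port the idea, and additionally resolves relative links against the page URL with urljoin. The names fetch and extract_urls, the UTF-8 default, and the sample paths are assumptions for illustration.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Group 1 is the quote character, group 2 the href value.
HREF_RE = re.compile(r'href=(["\'])(.+?)\1', re.I)


def extract_urls(html, base_url):
    """Return every href value in html, resolved against base_url."""
    return [urljoin(base_url, m.group(2)) for m in HREF_RE.finditer(html)]


def fetch(url, encoding='utf-8'):
    """Download a page and decode it; the encoding default is an assumption."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode(encoding, errors='replace')


# Offline demonstration with a made-up fragment; for a live page one could
# call extract_urls(fetch(url), url) instead.
sample = '<a href="/nav/python">Python</a> <a href=\'https://example.com/a\'>A</a>'
print(extract_urls(sample, 'http://www.csdn.net/'))
```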

5. Extract the page <title> using two methods

# coding=utf-8
import re, urllib

url = "http://www.csdn.net/"
content = urllib.urlopen(url).read()

# Method 1: lookbehind/lookahead, so the match is the title text itself
print u'方法一:'
title_pat = r'(?<=<title>).*?(?=</title>)'
title_ex = re.compile(title_pat, re.M|re.S)
title_obj = re.search(title_ex, content)
title = title_obj.group()
print title

# Method 2: capture group; findall returns a list of the captured texts
print u'方法二:'
title = re.findall(r'<title>(.*?)</title>', content)
print title[0]

Both methods return the same CSDN title string.
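Both methods raise an exception when the page has no <title> element: re.search returns None, and indexing an empty findall result fails. A defensive Python 3 sketch might look like this; extract_title is a hypothetical helper, and decoding HTML entities and collapsing whitespace are additions beyond the original code.

```python
import html
import re

# re.I accepts <TITLE>, re.S lets the title span several lines,
# and [^>]* tolerates attributes on the opening tag.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)


def extract_title(page):
    """Return the cleaned title text, or None when the page has no title."""
    match = TITLE_RE.search(page)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    return ' '.join(html.unescape(match.group(1)).split())


print(extract_title('<html><head><TITLE>\n  Tom &amp; Jerry\n</TITLE></head></html>'))
print(extract_title('<p>no title here</p>'))
```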

6. Locate a table and extract attribute‑value pairs

# coding=utf-8
import re

s = '''<table> <tr> <td>序列号</td><td>DEIN3-39CD3-2093J3</td> <td>日期</td><td>2013年1月22日</td> <td>售价</td><td>392.70 元</td> <td>说明</td><td>仅限5用户使用</td> </tr> </table>'''

# Each match is an (attribute, value) tuple taken from two adjacent cells
res = r'<td>(.*?)</td><td>(.*?)</td>'
m = re.findall(res, s, re.S|re.M)
for line in m:
    print unicode(line[0], 'utf-8'), unicode(line[1], 'utf-8')

Output shows each attribute name with its value.
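Because findall returns (attribute, value) tuples, the result converts directly into a dictionary. Here is a small Python 3 sketch; table_to_dict is a hypothetical name, and the \s* between the two cells is an added tolerance for whitespace that the original pattern does not have.

```python
import re

# Two adjacent cells form one pair; \s* allows whitespace between them.
PAIR_RE = re.compile(r'<td>(.*?)</td>\s*<td>(.*?)</td>', re.S)


def table_to_dict(html):
    """Map each attribute cell to the value cell that follows it."""
    return {key.strip(): value.strip() for key, value in PAIR_RE.findall(html)}


s = '''<table> <tr> <td>序列号</td><td>DEIN3-39CD3-2093J3</td> <td>日期</td><td>2013年1月22日</td> <td>售价</td><td>392.70 元</td> <td>说明</td><td>仅限5用户使用</td> </tr> </table>'''
info = table_to_dict(s)
for key, value in info.items():
    print(key, value)
```

This assumes the cells strictly alternate between attribute and value; a row with an odd number of cells would shift the pairing.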

7. Filter <span> and similar tags

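# nn holds the inner HTML of one <td> cell; this test can sit in an
# if/elif chain alongside checks for other tags
if "span" in nn:
    res_value = r'<span .*?>(.*?)</span>'
    m_value = re.findall(res_value, nn, re.S|re.M)
    for value in m_value:
        print unicode(value, 'utf-8'),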

Example input:

<td><span class="nickname">(字) 翔宇</span></td>

produces (字) 翔宇.
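The same idea generalizes to other tags. In the Python 3 sketch below, inner_text is a hypothetical helper that builds the pattern from a tag name; unlike the pattern above, it also matches a tag written without attributes, such as a bare <span>.

```python
import re


def inner_text(tag, fragment):
    """Return the contents of every <tag ...>...</tag> element in fragment."""
    # \b stops 'span' from matching a longer tag name; [^>]* skips attributes.
    pattern = r'<{0}\b[^>]*>(.*?)</{0}>'.format(re.escape(tag))
    return re.findall(pattern, fragment, re.I | re.S)


cell = '<td><span class="nickname">(字) 翔宇</span></td>'
print(inner_text('span', cell))
```

Nested elements of the same tag are not handled: the lazy match stops at the first closing tag.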

8. Extract content inside <script> tags (e.g., image URLs)

# coding=utf-8
import re, os, urllib

content = '''<script>var images = [{ "big":"...", "thumb":"...", "original":"http://example.com/img1.jpg" }, { "original":"http://example.com/img2.jpg" }];</script>'''

html_script = r'<script>(.*?)</script>'
m_script = re.findall(html_script, content, re.S|re.M)
for script in m_script:
    # Every "original" field holds a full-size image URL
    res_original = r'"original":"(.*?)"'
    m_original = re.findall(res_original, script)
    for pic_url in m_original:
        print pic_url
        filename = os.path.basename(pic_url)
        # Save to the root of drive E: (Windows path)
        urllib.urlretrieve(pic_url, 'E:\\' + filename)

The script prints each original image URL and downloads the file.
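When the array assigned in the script is valid JSON, as it is in this sample, a different approach is to cut out the array literal and hand it to the json module instead of running a second regex. The Python 3 sketch below does that; original_urls is a hypothetical name, the variable name images is taken from the sample, and the download step is left out.

```python
import json
import re

# Capture the array literal assigned to "images", up to the closing "];".
ARRAY_RE = re.compile(r'var\s+images\s*=\s*(\[.*?\])\s*;', re.S)


def original_urls(script_text):
    """Return the "original" URL of every entry that has one."""
    match = ARRAY_RE.search(script_text)
    if match is None:
        return []
    items = json.loads(match.group(1))  # raises ValueError if not valid JSON
    return [item['original'] for item in items if 'original' in item]


content = '''<script>var images = [{ "big":"...", "thumb":"...", "original":"http://example.com/img1.jpg" }, { "original":"http://example.com/img2.jpg" }];</script>'''
for pic_url in original_urls(content):
    print(pic_url)
```

This only works for JSON-compatible literals (double-quoted keys and strings, no trailing commas); for looser JavaScript the regex approach above is more forgiving.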

9. Remove <br /> tags using replace

if '<br />' in value:
    value = value.replace('<br />', '')
    value = value.replace('\n', ' ')

Transforms strings like

達洪阿 異名：(字) 厚菴<br /> (諡) 武壯<br /> (勇號) 阿克達春巴圖魯

into a clean single line.
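str.replace only catches the exact spelling <br />. Pages often mix <br>, <br/> and <BR>, so a regex may be more robust. A Python 3 sketch follows; flatten is a hypothetical name, and unlike the snippet above it replaces each tag with a space before collapsing whitespace.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants.
BR_RE = re.compile(r'<br\s*/?>', re.I)


def flatten(value):
    """Replace <br> tags with spaces, then collapse whitespace runs to one space."""
    return ' '.join(BR_RE.sub(' ', value).split())


raw = '達洪阿 異名：(字) 厚菴<br />\n(諡) 武壯<BR>(勇號) 阿克達春巴圖魯'
print(flatten(raw))
```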

10. Extract src from <img> tags and filter the tags

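# Strip every tag from value, keeping only the text between tags
value = re.sub('<[^>]+>', '', value)

# Pull the src attribute out of an <img> tag
test = '''<img alt="中國國民黨" src="../images/Kuomintang.png" width="19" height="19" border="0" />'''
print re.findall('src="(.*?)"', test)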

Output: ['../images/Kuomintang.png'].
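The pattern src="(.*?)" also matches src attributes on other elements such as <script>, and it misses single-quoted values. A slightly stricter Python 3 sketch is shown below; image_sources and strip_tags are hypothetical helper names.

```python
import re

# Only look inside <img ...> tags; group 1 is the quote, group 2 the URL.
SRC_RE = re.compile(r'<img\b[^>]*?\bsrc=(["\'])(.*?)\1', re.I | re.S)
TAG_RE = re.compile(r'<[^>]+>')


def image_sources(html):
    """Return the src value of every <img> tag in html."""
    return [m.group(2) for m in SRC_RE.finditer(html)]


def strip_tags(html):
    """Remove all tags, keeping only the text between them."""
    return TAG_RE.sub('', html)


test = '''<img alt="中國國民黨" src="../images/Kuomintang.png" width="19" height="19" border="0" />'''
print(image_sources(test))
print(strip_tags('<b>bold</b> text'))
```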

The article concludes by encouraging readers to apply these regex patterns for efficient web data extraction.
