Common Regular Expressions and Methods for Python Web Scraping
This article presents a practical collection of Python regular‑expression techniques for extracting HTML elements such as table rows, links, titles, images, and scripts, showing how to filter tags and handle URL parameters during web crawling.
It introduces frequently used regular‑expression patterns with Python code, aiming to solve common crawling problems and help readers extract information from HTML pages. The listings use Python 2 syntax (print statements, unicode(), urllib.urlopen).
1. Extract content between <tr> and </tr> tags
res_tr = r'<tr>(.*?)</tr>'
m_tr = re.findall(res_tr, language, re.S|re.M)

Example:

# coding=utf-8
import re

# Sample row: 性別 (gender), 男 (male)
language = '''<tr><th>性別:</th><td>男</td></tr><tr>'''

# Non-greedy match of each row; re.S lets . match newlines
res_tr = r'<tr>(.*?)</tr>'
m_tr = re.findall(res_tr, language, re.S|re.M)
for line in m_tr:
    print line
    # Header cells of the row
    res_th = r'<th>(.*?)</th>'
    m_th = re.findall(res_th, line, re.S|re.M)
    for mm in m_th:
        print unicode(mm, 'utf-8')
    # Data cells of the row
    res_td = r'<td>(.*?)</td>'
    m_td = re.findall(res_td, line, re.S|re.M)
    for nn in m_td:
        print unicode(nn, 'utf-8')

Output:

<th>性別:</th><td>男</td>
性別:
男

2. Extract text between <a href=..> and </a> tags
res = r'<a .*?>(.*?)</a>'
mm = re.findall(res, content, re.S|re.M)
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I|re.S|re.M)

Example:

# coding=utf-8
import re

content = '''<td><a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a><a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a></td>'''

# Text between <a ...> and </a>
print u'Link text:'
res = r'<a .*?>(.*?)</a>'
mm = re.findall(res, content, re.S|re.M)
for value in mm:
    print value

# Complete anchor elements
print u'\nComplete links:'
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I|re.S|re.M)
for i in urls:
    print i

# href values only, in double or single quotes
print u'\nURLs in links:'
res_url = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
link = re.findall(res_url, content, re.I|re.S|re.M)
for url in link:
    print url

Output:

Link text:
浙江省主题介绍
贵州省主题介绍

Complete links:
<a href="https://www.baidu.com/articles/zj.html" title="浙江省">浙江省主题介绍</a>
<a href="https://www.baidu.com//articles/gz.html" title="贵州省">贵州省主题介绍</a>

URLs in links:
https://www.baidu.com/articles/zj.html
https://www.baidu.com//articles/gz.html

3. Obtain the last segment of a URL for naming images or parameters
urls = "http://i1.hoopchina.com.cn/blogfile/201411/11/BbsImg141568417848931_640*640.jpg"
values = urls.split('/')[-1]
print values

Output:

BbsImg141568417848931_640*640.jpg

For query strings:

url = 'http://localhost/test.py?a=hello&b=world'
values = url.split('?')[-1]
print values
for key_value in values.split('&'):
    print key_value.split('=')

Output:

a=hello&b=world
['a', 'hello']
['b', 'world']

4. Crawl all URL links from a page
# coding=utf-8
import re, urllib

url = "http://www.csdn.net/"
content = urllib.urlopen(url).read()

# Complete anchor elements
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I)
for url in urls:
    print unicode(url, 'utf-8')

# href values only
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", content)
for url in link_list:
    print url

The output lists the complete anchor tags first, followed by the raw URL taken from each href attribute.
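The listing above relies on Python 2 (urllib.urlopen and the print statement). The following is a minimal Python 3 sketch of the same href extraction, not the author's code. It works on an HTML string so it runs offline, and it additionally resolves relative links with urllib.parse.urljoin. The function name extract_links and the sample HTML are hypothetical and exist only for illustration.

```python
import re
from urllib.parse import urljoin

# Same href pattern as the article: the text between href=" (or href=') and the closing quote
HREF_RE = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", re.I)

def extract_links(html, base_url):
    """Return absolute URLs for every href in html, in document order, without duplicates."""
    seen = set()
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)  # resolve relative paths against the page URL
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links

# Hypothetical sample markup standing in for a downloaded page
sample_html = '''<a href="/articles/zj.html">one</a>
<a href='https://example.com/a?x=1'>two</a>
<a href="/articles/zj.html">duplicate</a>'''

links = extract_links(sample_html, "http://www.csdn.net/")
for link in links:
    print(link)
```

For a live page, the HTML string would typically come from urllib.request.urlopen(url).read() decoded with the page's character encoding.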
5. Extract the page <title> using two methods
# coding=utf-8
import re, urllib

url = "http://www.csdn.net/"
content = urllib.urlopen(url).read()

# Method 1: lookbehind and lookahead around the title text
print u'Method 1:'
title_pat = r'(?<=<title>).*?(?=</title>)'
title_ex = re.compile(title_pat, re.M|re.S)
title_obj = re.search(title_ex, content)
title = title_obj.group()
print title

# Method 2: capture group with findall
print u'Method 2:'
title = re.findall(r'<title>(.*?)</title>', content)
print title[0]

Both methods return the same CSDN title string.
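Both methods above assume the page contains a title: title_obj.group() raises AttributeError when re.search finds nothing, and title[0] raises IndexError on an empty list. The following Python 3 sketch shows one way to guard against that, under the assumption that returning None for a missing title is acceptable. The helper name extract_title and the sample pages are hypothetical.

```python
import re

# re.I accepts <TITLE>, re.S lets the title span several lines
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)

def extract_title(html):
    """Return the stripped <title> text, or None when the page has no title."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()

# Hypothetical sample pages
page = "<html><head><TITLE>\n  CSDN sample page \n</TITLE></head><body></body></html>"
print(extract_title(page))                                   # CSDN sample page
print(extract_title("<html><body>no title</body></html>"))   # None
```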
6. Locate a table and extract attribute‑value pairs
# coding=utf-8
import re

# Cells alternate between a label and its value:
# serial number, date, price, note
s = '''<table>
<tr>
<td>序列号</td><td>DEIN3-39CD3-2093J3</td>
<td>日期</td><td>2013年1月22日</td>
<td>售价</td><td>392.70 元</td>
<td>说明</td><td>仅限5用户使用</td>
</tr>
</table>'''

# Capture two adjacent cells as one (name, value) pair
res = r'<td>(.*?)</td><td>(.*?)</td>'
m = re.findall(res, s, re.S|re.M)
for line in m:
    print unicode(line[0], 'utf-8'), unicode(line[1], 'utf-8')

The output prints each attribute name followed by its value, one pair per line.
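As a Python 3 sketch of the same idea, the pairs can be collected into a dictionary so that values are looked up by attribute name. This is an illustration rather than the author's code: the name table_to_dict is hypothetical, and the pattern also tolerates whitespace between the two cells, which the original pattern does not.

```python
import re

# Same sample table as the article (labels: serial number, date, price, note)
table_html = '''<table>
<tr>
<td>序列号</td><td>DEIN3-39CD3-2093J3</td>
<td>日期</td><td>2013年1月22日</td>
<td>售价</td><td>392.70 元</td>
<td>说明</td><td>仅限5用户使用</td>
</tr>
</table>'''

# Two consecutive cells form one (label, value) pair
PAIR_RE = re.compile(r"<td>(.*?)</td>\s*<td>(.*?)</td>", re.S)

def table_to_dict(html):
    """Map each label cell to the value cell that follows it."""
    return {label.strip(): value.strip() for label, value in PAIR_RE.findall(html)}

record = table_to_dict(table_html)
for label, value in record.items():
    print(label, value)
```

Python 3 strings are already Unicode, so the unicode(..., 'utf-8') calls from the original listing are not needed here.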
7. Filter <span> and similar tags
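This branch is a fragment of a larger if/elif chain; nn holds the HTML fragment being processed:

elif "span" in nn:
    res_value = r'<span .*?>(.*?)</span>'
    m_value = re.findall(res_value, nn, re.S|re.M)
    for value in m_value:
        # The trailing comma keeps the output on one line (Python 2)
        print unicode(value, 'utf-8'),

Example input: <td><span class="nickname">(字) 翔宇</span></td> produces (字) 翔宇.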
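Because the branch above is only a fragment, here is a self-contained Python 3 sketch of the same span filter. The helper name span_texts is hypothetical. Note one deliberate difference: the article's pattern '<span .*?>' requires a space after span, while the pattern below also accepts a bare <span> without attributes.

```python
import re

# \b stops the match from accepting longer tag names that merely start with "span"
SPAN_RE = re.compile(r"<span\b[^>]*>(.*?)</span>", re.S)

def span_texts(cell_html):
    """Return the text inside each <span> of a fragment, or [] when there is none."""
    return [text.strip() for text in SPAN_RE.findall(cell_html)]

# Example input taken from the article
cell = '<td><span class="nickname">(字) 翔宇</span></td>'
print(span_texts(cell))  # ['(字) 翔宇']
```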
8. Extract content inside <script> tags (e.g., image URLs)
# coding=utf-8
import re, os, urllib

content = '''<script>var images = [{ "big":"...", "thumb":"...", "original":"http://example.com/img1.jpg" }, { "original":"http://example.com/img2.jpg" }];</script>'''

# Body of every <script> element
html_script = r'<script>(.*?)</script>'
m_script = re.findall(html_script, content, re.S|re.M)
for script in m_script:
    # Value of each "original" key
    res_original = r'"original":"(.*?)"'
    m_original = re.findall(res_original, script)
    for pic_url in m_original:
        print pic_url
        filename = os.path.basename(pic_url)
        urllib.urlretrieve(pic_url, 'E:\\' + filename)

The script prints each original image URL and downloads the file to the E:\ drive under its original filename.
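When the script body holds an array literal that happens to be valid JSON, as in the sample above, an alternative to matching each key with a regex is to cut out the array and parse it with json.loads. The Python 3 sketch below swaps in that technique; it is not the author's method, and it assumes the variable is named images and the literal uses double-quoted keys with no trailing commas. The helper name original_image_urls is hypothetical, and the download step is left out so the example runs offline.

```python
import json
import posixpath
import re
from urllib.parse import urlsplit

# Same sample markup as the article
content = '''<script>var images = [{ "big":"...", "thumb":"...", "original":"http://example.com/img1.jpg" }, { "original":"http://example.com/img2.jpg" }];</script>'''

# Capture the array literal assigned to the (assumed) variable name "images"
ARRAY_RE = re.compile(r"var\s+images\s*=\s*(\[.*?\])\s*;", re.S)

def original_image_urls(html):
    """Return the "original" URL of every entry in the images array."""
    match = ARRAY_RE.search(html)
    if match is None:
        return []
    images = json.loads(match.group(1))
    return [item["original"] for item in images if "original" in item]

urls = original_image_urls(content)
# File name = last path segment of the URL, ignoring any query string
filenames = [posixpath.basename(urlsplit(u).path) for u in urls]
print(urls)
print(filenames)
```

In Python 3 the download call from the original listing lives at urllib.request.urlretrieve.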
9. Remove <br /> tags using replace
if '<br />' in value:
    value = value.replace('<br />', '')
    value = value.replace('\n', ' ')

This transforms strings like 達洪阿 異名:(字) 厚菴<br /> (諡) 武壯<br /> (勇號) 阿克達春巴圖魯 into a clean single line.
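The replace call above matches only the exact spelling '<br />'. As a hedged Python 3 sketch, a regex can also cover <br>, <br/>, and uppercase variants. Unlike the original, it substitutes a space for each tag and then collapses runs of whitespace. The helper name flatten_breaks is hypothetical.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case
BR_RE = re.compile(r"<br\s*/?>", re.I)

def flatten_breaks(value):
    """Replace line-break tags with spaces and collapse all whitespace runs."""
    value = BR_RE.sub(" ", value)
    return " ".join(value.split())

# Sample string based on the article's example, with one uppercase tag added
raw = "達洪阿 異名:(字) 厚菴<br />\n(諡) 武壯<BR>\n(勇號) 阿克達春巴圖魯"
print(flatten_breaks(raw))
```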
10. Extract src from <img> tags and filter the tags
To filter out every tag and keep only the text:

value = re.sub('<[^>]+>', '', value)

To extract the src attribute of an image:

test = '''<img alt="中國國民黨" src="../images/Kuomintang.png" width="19" height="19" border="0" />'''
print re.findall('src="(.*?)"', test)

Output:

['../images/Kuomintang.png']
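The two steps can be combined in a short Python 3 sketch. The helper names image_sources and strip_tags are hypothetical, and the sample cell wraps the article's <img> tag in a <td> with trailing text for illustration. The src pattern is restricted to <img> tags and accepts either quote style, which goes slightly beyond the original pattern.

```python
import re

# src value of <img> tags only, in single or double quotes
IMG_SRC_RE = re.compile(r"<img\b[^>]*?\bsrc=[\"'](.*?)[\"']", re.I | re.S)
# Any tag, as in the article's re.sub call
TAG_RE = re.compile(r"<[^>]+>")

def image_sources(html):
    """Return the src attribute of every <img> tag."""
    return IMG_SRC_RE.findall(html)

def strip_tags(html):
    """Remove all tags and surrounding whitespace, keeping only the text."""
    return TAG_RE.sub("", html).strip()

# Hypothetical cell built around the article's sample <img> tag
cell = '''<td><img alt="中國國民黨" src="../images/Kuomintang.png" width="19" height="19" border="0" /> 中國國民黨</td>'''
print(image_sources(cell))  # ['../images/Kuomintang.png']
print(strip_tags(cell))     # 中國國民黨
```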
The article concludes by encouraging readers to apply these regex patterns for efficient web data extraction.
Python Programming Learning Circle