Python Web Scraping Tutorial: Downloading Emoji Images from DouTuBa with Multithreading
This tutorial demonstrates how to crawl the DouTuBa emoji website using Python, extract image URLs with regular expressions and BeautifulSoup, and download tens of thousands of images efficiently through a multithreaded downloader.
Preface – The author describes a situation where a friend needed emoji images to lighten a chat, discovered that the local collection was insufficient, and decided to build a web crawler to fetch emojis from the DouTuBa website.
Page Analysis – The target site contains a massive number of emoji images. By opening the browser’s developer tools (F12), the author shows how to locate the src attribute of an img tag that holds the actual image URL.
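The inspection step can be reproduced offline on a small piece of markup. The sketch below is illustrative only: the sample HTML and the example.com URLs are assumptions modeled on the attribute names that appear in the tutorial's regular expression, and it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. Note that the author's regular expression captures the data-original attribute rather than src, which is why the sketch prints both.

```python
from html.parser import HTMLParser

# Hypothetical markup; attribute names follow the tutorial's regex, URLs are made up.
SAMPLE_HTML = '''
<div>
  <img alt="happy cat" class="img-responsive lazy image_dta"
       data-backup="https://example.com/backup/cat.jpg"
       data-original="https://example.com/img/cat.jpg"
       referrerpolicy="no-referrer" src="https://example.com/loading.gif"/>
  <img class="logo" src="https://example.com/logo.png"/>
</div>
'''


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> are also routed through here.
        if tag == "img":
            self.images.append(dict(attrs))


def collect_images(html):
    """Return one attribute dict per <img> tag found in html."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images


for attrs in collect_images(SAMPLE_HTML):
    # Compare the placeholder src with the data-original value, when present.
    print(attrs.get("alt"), attrs.get("data-original"), attrs.get("src"))
```

In this sample the second tag has no data-original attribute, which mirrors why the parsing code has to skip img tags that do not match the expected shape.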
Implementation – Fetching Page Content
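import urllib.request

def askURL(url):
    # Send a browser-like User-Agent so the request looks like it comes from a browser
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read()  # raw bytes; BeautifulSoup accepts bytes directly
    except Exception as result:
        # On failure, print the error and return the empty string
        print(result)
    return html
Implementation – Parsing HTML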
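import re
from bs4 import BeautifulSoup

# Regex that captures the image name (alt) and the image URL (data-original)
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

# getFileType and getFileName are the author's helpers; their definitions are not shown here
def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find every img tag on the page
    for item in bs.find_all('img'):
        item = str(item)
        # Apply the regex above to pull out the image name and URL
        imgsrc = re.findall(imglink, item)
        # Not every img tag is an emoji image, so the match may be empty; skip those
        if len(imgsrc) != 0:
            imgname = ""
            if imgsrc[0][0] != '':
                # Use the alt text as the name and take the extension from the URL
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                # No alt text: fall back to the file name in the URL
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs
After obtaining the image URLs and filenames, the author proceeds to download the files.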
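The parsing code calls getFileType and getFileName, but the source does not include their definitions. The following is a minimal sketch of what such helpers might look like, assuming getFileType returns the extension without the leading dot (the caller adds the dot itself) and getFileName returns the last path component of the URL. The bodies are hypothetical, not the author's implementation.

```python
import posixpath
from urllib.parse import urlsplit


def getFileName(url):
    """Return the last path component of the URL (assumed behavior)."""
    # urlsplit drops the query string and fragment from the path.
    return posixpath.basename(urlsplit(url).path)


def getFileType(url):
    """Return the file extension without the leading dot (assumed behavior)."""
    ext = posixpath.splitext(getFileName(url))[1]
    return ext.lstrip(".")


# Example URLs are made up for illustration.
print(getFileName("https://example.com/img/2021/cat.jpg?x=1"))  # cat.jpg
print(getFileType("https://example.com/img/2021/cat.gif"))      # gif
```

Because the alt text is used directly as a file name, it may contain characters that are not valid in paths on some systems; a real script would likely need to sanitize it before saving.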
File Download – Multithreaded Approach
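from concurrent.futures import ThreadPoolExecutor
# urls and filelocation appear to be parallel lists of image URLs and save paths; FileDownload.downloadFile is the author's helper (not shown)
pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    pool.submit(FileDownload.downloadFile, urls[j], filelocation[j])
Result – The script successfully scraped and saved over one hundred thousand emoji images, making the author a major collector of emoji resources.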
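The download loop in the File Download section leaves FileDownload.downloadFile undefined and never inspects the futures it submits, so an exception raised inside a worker thread goes unreported. A minimal sketch of the same thread-pool approach with failures collected is shown here. The names fetch_to_file and download_all are hypothetical, and fetch_to_file is only an assumed stand-in for the author's download helper.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_to_file(url, path):
    """Download one URL to path (hypothetical stand-in for FileDownload.downloadFile)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp, open(path, "wb") as f:
        f.write(resp.read())


def download_all(urls, paths, fetch=fetch_to_file, max_workers=50):
    """Run fetch(url, path) for each pair in a thread pool.

    Returns (ok_paths, failures), where failures is a list of (url, exception).
    """
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its inputs so errors can be reported per URL.
        futures = {pool.submit(fetch, u, p): (u, p) for u, p in zip(urls, paths)}
        for fut in as_completed(futures):
            url, path = futures[fut]
            try:
                fut.result()  # re-raises any exception from the worker thread
                ok.append(path)
            except Exception as exc:
                failed.append((url, exc))
    return ok, failed
```

With the lists returned by getimgsrcs, a call might look like `ok, failed = download_all(srcs, [os.path.join("emoji", n) for n in names])`, assuming the emoji directory already exists. Passing fetch as a parameter also makes it possible to try the pool logic with a fake function and no network access.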
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.