Automating Alibaba International Data Extraction with Selenium and Python
This article demonstrates how to use Python's Selenium WebDriver to log into Alibaba International, scrape company contact information across multiple pages, save the data to CSV, download product images, and embed them into an Excel file, providing a complete automation workflow.
The guide explains a step‑by‑step process for automating the collection of company phone numbers and product images from Alibaba International using Python and Selenium.
1. Launch WebDriver and log in – A Chrome WebDriver is configured to disable images, hide automation flags, and open the Alibaba login page. After manual login, the script proceeds.
from selenium.webdriver import ChromeOptions
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from lxml import etree  # used by get_products to parse the page source
import re, time, csv


class Chrome_drive():
    def __init__(self):
        option = ChromeOptions()
        # Hide the usual automation switch and extension
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # A value of 2 blocks image loading, which speeds up page loads
        NoImage = {"profile.managed_default_content_settings.images": 2}
        option.add_experimental_option("prefs", NoImage)
        # Selenium 3 style call; newer Selenium 4 releases no longer accept
        # executable_path and expect service=Service("./chromedriver") instead
        self.browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
        # Make navigator.webdriver report undefined on every new document
        self.browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'})
        self.browser.set_window_size(1200, 768)
        self.wait = WebDriverWait(self.browser, 12)

    def get_login(self):
        url = 'https://passport.alibaba.com/icbu_login.htm'
        self.browser.get(url)
        # Block here until the user has logged in manually in the browser window
        input("After logging in, enter 1 to continue: ")
        self.browser.refresh()
        return

After logging in manually, the script waits for the user to confirm before proceeding.
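As an alternative to confirming the login through `input`, the script could poll the browser's current URL until it leaves the login host. The sketch below is a swapped-in technique, not the article's method. `wait_for_login` is a hypothetical helper, and it assumes that a successful login redirects away from `passport.alibaba.com`, which the article does not confirm. It relies only on the `current_url` attribute that Selenium WebDriver objects expose.

```python
import time
from urllib.parse import urlparse


def wait_for_login(browser, login_host="passport.alibaba.com", timeout=120.0, poll=1.0):
    """Return True once the browser has left login_host, False on timeout.

    Hypothetical helper: assumes a successful login redirects to another host.
    `browser` is any object with a `current_url` attribute, such as a
    Selenium WebDriver instance.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        host = urlparse(browser.current_url).netloc
        if host and host != login_host:
            return True
        time.sleep(poll)  # wait before checking the URL again
    return False
```

Inside `get_login`, a call such as `wait_for_login(self.browser)` could then take the place of the `input(...)` prompt.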
2. Extract page content – The script navigates through search result pages, scrolls to load lazy‑loaded images, and parses the HTML with lxml.etree to collect company name, phone page link, product, country, revenue, address, and image URLs. Each record is written to a CSV file.
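The `save_csv` helper that writes each record is not shown in the article. A minimal sketch follows, assuming one appended row per record. The file name `alibaba_com_img.csv` is taken from the download step later in the article; the column names are assumptions modeled on the variables in `get_products`, and only `product_img` is confirmed by the later code.

```python
import csv
import os

# Assumed column names; only "product_img" is confirmed by the download step.
HEADER = ["keyword", "company_name", "company_phone_page", "product", "country",
          "phone", "mobile_phone", "address", "total_revenue", "sell_address",
          "product_img"]


def save_csv(row, path="./alibaba_com_img.csv"):
    """Append one record to the CSV file, writing the header on first use."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if need_header:
            writer.writerow(HEADER)
        writer.writerow(row)
```

Appending row by row means a crash partway through a run still leaves the records collected so far on disk.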
    def index_page(self, page, wd):
        url = f'https://www.alibaba.com/trade/search?page={page}&keyword={wd}&f1=y&indexArea=company_en&viewType=L&n=38'
        # Open the results page in a new tab and switch to it
        self.browser.execute_script(f"window.open('{url}')")
        self.browser.switch_to.window(self.browser.window_handles[-1])
        self.buffer()  # scroll so that lazy-loaded images are rendered
        time.sleep(3)
        self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J-items-content')))
        html = self.browser.page_source
        self.get_products(wd, html)
        self.close_window()

    def get_products(self, wd, html_text):
        e = etree.HTML(html_text)
        # Each item-main block is one company in the result list
        items = e.xpath('//div[@id="J-items-content"]//div[@class="item-main"]')
        for li in items:
            company_name = ''.join(li.xpath('./div[@class="top"]//h2[@class="title ellipsis"]/a/text()'))
            company_phone_page = ''.join(li.xpath('./div[@class="top"]//a[@class="cd"]/@href'))
            product = ''.join(li.xpath('.//div[@class="value ellipsis ph"]/text()'))
            Attrs = li.xpath('.//div[@class="attrs"]//span[@class="ellipsis search"]/text()')
            # extract country, revenue, address, etc.
            self.browser.get(company_phone_page)
            # extract phone, mobile, address with regex
            # counctry, phone, mobilePhone, address, total_evenue, sell_adress and
            # product_img are assigned in the elided extraction steps above
            all_down = [wd, company_name, company_phone_page, product, counctry, phone, mobilePhone, address, total_evenue, sell_adress, product_img]
            save_csv(all_down)

The helper methods buffer (scrolling) and close_window manage page navigation and resource cleanup.
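The bodies of `buffer` and `close_window` are not shown. One way they might look is sketched below as standalone functions. The step count, the pause length and the choice to return to the first tab are assumptions, not the article's implementation. The functions use only `execute_script`, `close`, `switch_to.window` and `window_handles`, all of which Selenium WebDriver provides.

```python
import time


def buffer(browser, steps=10, pause=0.5):
    """Scroll down in equal steps so lazy-loaded images get a chance to load."""
    for i in range(1, steps + 1):
        # Scroll to i/steps of the full document height
        browser.execute_script(
            "window.scrollTo(0, document.body.scrollHeight * arguments[0] / arguments[1]);",
            i, steps,
        )
        time.sleep(pause)


def close_window(browser):
    """Close the current tab and switch back to the first remaining one."""
    browser.close()
    browser.switch_to.window(browser.window_handles[0])
```

Scrolling in several steps rather than jumping straight to the bottom gives each part of the page time in the viewport, which is what usually triggers lazy loading.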
3. Download product images – Using requests and pandas, the script reads the CSV, splits the image URL list, and saves each image to a local folder.
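# -*- coding: utf-8 -*-
import os
import requests, pandas as pd


def open_requests(img, img_name):
    # The stored URLs lack a scheme, so prepend one before requesting
    img_url = 'https:' + img
    res = requests.get(img_url)
    with open(f"./downloads_picture/{img_name}", 'wb') as fn:
        fn.write(res.content)


# Create the target folder if needed so that open() does not fail
os.makedirs('./downloads_picture', exist_ok=True)
df1 = pd.read_csv('./alibaba_com_img.csv')
for imgs in df1["product_img"]:
    # Empty cells are read as NaN; skip them instead of requesting "https:nan"
    if pd.isna(imgs):
        continue
    imgList = str(imgs).split(',')
    img = imgList[0]  # only the first image of each record is downloaded
    if img:
        # Drop the first 24 characters of the URL, leaving the file name
        img_name = img[24:]
        open_requests(img, img_name)

This step stores the first product picture of each record locally for later insertion.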
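The fixed `img[24:]` slice only yields a clean file name when every URL shares the same 24-character prefix. A more tolerant variant, sketched here as an alternative rather than the article's approach, derives the name from the URL path. `normalize_url` and `file_name_from_url` are hypothetical helpers, and the sample URL in the usage note is made up for illustration.

```python
import posixpath
from urllib.parse import urlparse


def normalize_url(img):
    """Prepend https: to scheme-relative URLs such as //host/path.jpg."""
    img = img.strip()
    return "https:" + img if img.startswith("//") else img


def file_name_from_url(img):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlparse(normalize_url(img)).path)
```

For example, `file_name_from_url("//img.example.com/photos/abc_123.jpg?x=1")` returns `abc_123.jpg`. If this naming were adopted for downloads, the Excel step would need to use the same function so that both scripts agree on file names.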
4. Insert images into Excel – The CSV is converted to an Excel file, and xlwings together with PIL inserts each image into the appropriate cell, adjusting size proportionally.
# -*- coding: utf-8 -*-
from PIL import Image
import os, xlwings as xw

path = 'alibaba_com.xlsx'
app = xw.App(visible=True, add_book=False)
wb = app.books.open(path)
sht = wb.sheets['Sheet1']
# Read the image URL column, starting below the header row
img_list = sht.range('L2').expand('down').value


def write_pic(cell, img_name):
    # An absolute path avoids depending on Excel's own working directory
    file_path = os.path.abspath(f'./downloads_picture/{img_name}')
    img = Image.open(file_path).convert('RGB')
    w, h = img.size
    x_s = 70           # target width
    y_s = h * x_s / w  # height that preserves the aspect ratio
    sht.pictures.add(file_path, left=sht.range(cell).left, top=sht.range(cell).top, width=x_s, height=y_s)


for idx, imgs in enumerate(img_list):
    cell = f"C{idx+2}"  # row 2 is the first data row
    imgsList = str(imgs).split(',')
    if imgsList:
        img_name = imgsList[0][24:]
        try:
            write_pic(cell, img_name)
        except Exception:
            # Typically the file was never downloaded; report it and continue
            print('Image not found', img_name)
wb.save()
wb.close()
app.quit()

After running the script, the Excel workbook displays each company's details alongside its product image.
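The proportional resize in `write_pic` is plain ratio arithmetic: the height is scaled by the same factor as the width. The sketch below isolates that calculation in a hypothetical `scaled_size` helper so that it can be checked without Excel. The default width of 70 is the value used in the script; the input validation is an addition.

```python
def scaled_size(width, height, target_width=70):
    """Return (new_width, new_height) preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Same formula as y_s = h * x_s / w in write_pic
    return target_width, height * target_width / width
```

For an 800 by 600 image, `scaled_size(800, 600)` gives `(70, 52.5)`.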
Result – The final Excel file contains rows of scraped company information with embedded product pictures, providing a ready‑to‑use dataset for further analysis or outreach.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
