Automating Alibaba International Data Extraction with Selenium and Python

This article demonstrates how to use Python's Selenium WebDriver to log into Alibaba International, scrape company contact information across multiple pages, save the data to CSV, download product images, and embed them into an Excel file, providing a complete automation workflow.

Python Programming Learning Circle

Nov 29, 2021

The guide explains a step‑by‑step process for automating the collection of company phone numbers and product images from Alibaba International using Python and Selenium.

1. Launch WebDriver and log in – A Chrome WebDriver is configured to block image loading, hide the usual automation flags, and open the Alibaba login page. You log in by hand in the browser window, and the script pauses until you confirm in the terminal.

import csv
import re
import time

from lxml import etree  # used by get_products to parse the page source
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

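class Chrome_drive():
    def __init__(self):
        option = ChromeOptions()
        # Hide the automation banner and the automation extension.
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # 2 = block image loading, which speeds up page loads.
        NoImage = {"profile.managed_default_content_settings.images": 2}
        option.add_experimental_option("prefs", NoImage)
        # executable_path is accepted by Selenium 3 and early 4.x; newer releases expect a Service object.
        self.browser = webdriver.Chrome(executable_path="./chromedriver", options=option)
        # Make navigator.webdriver report undefined on every new document.
        self.browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'})
        self.browser.set_window_size(1200, 768)
        # Shared explicit wait with a 12-second timeout.
        self.wait = WebDriverWait(self.browser, 12)

    def get_login(self):
        url = 'https://passport.alibaba.com/icbu_login.htm'
        self.browser.get(url)
        # Block here until the user has finished logging in by hand.
        input("Log in in the browser window, then type 1 and press Enter: ")
        self.browser.refresh()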

The input() call blocks until you press Enter in the terminal, so finish the manual login first. The script then refreshes the page and continues with the authenticated session.

2. Extract page content – The script navigates through search result pages, scrolls to load lazy‑loaded images, and parses the HTML with lxml.etree to collect company name, phone page link, product, country, revenue, address, and image URLs. Each record is written to a CSV file.

def index_page(self, page, wd):
    url = f'https://www.alibaba.com/trade/search?page={page}&keyword={wd}&f1=y&indexArea=company_en&viewType=L&n=38'
    self.browser.execute_script(f"window.open('{url}')")
    self.browser.switch_to.window(self.browser.window_handles[-1])
    self.buffer()
    time.sleep(3)
    self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J-items-content')))
    html = self.browser.page_source
    self.get_products(wd, html)
    self.close_window()

def get_products(self, wd, html_text):
    e = etree.HTML(html_text)
    items = e.xpath('//div[@id="J-items-content"]//div[@class="item-main"]')
    for li in items:
        company_name = ''.join(li.xpath('./div[@class="top"]//h2[@class="title ellipsis"]/a/text()'))
        company_phone_page = ''.join(li.xpath('./div[@class="top"]//a[@class="cd"]/@href'))
        product = ''.join(li.xpath('.//div[@class="value ellipsis ph"]/text()'))
        Attrs = li.xpath('.//div[@class="attrs"]//span[@class="ellipsis search"]/text()')
        # extract country, revenue, address, etc.
        self.browser.get(company_phone_page)
        # extract phone, mobile, address with regex
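        # country, phone, mobilePhone, address, total_revenue, sell_address and
        # product_img are assigned in the extraction steps elided above.
        all_down = [wd, company_name, company_phone_page, product, country, phone, mobilePhone, address, total_revenue, sell_address, product_img]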
        save_csv(all_down)
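
The save_csv helper called above is not shown in the article. A minimal sketch, assuming it appends one row per record to the CSV file that step 3 later reads: the csv_path parameter and every column name except product_img (which step 3 reads by name) are assumptions made for illustration.

```python
import csv
import os

# Column names are assumptions, except product_img, which step 3 reads by name.
HEADER = ["keyword", "company_name", "company_phone_page", "product", "country",
          "phone", "mobile_phone", "address", "total_revenue", "sell_address",
          "product_img"]


def save_csv(row, csv_path="./alibaba_com_img.csv"):
    """Append one record, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)
        writer.writerow(row)
```

Opening the file in append mode for every record is slower than keeping one writer open, but it means rows already scraped survive if the browser session fails partway through.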

The helper methods buffer (scrolling) and close_window manage page navigation and resource cleanup.
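
The article does not list these two helpers. The following is one possible sketch, not the author's code: it is written as standalone functions that take the WebDriver as a browser argument (in the class they would be methods using self.browser), and the steps, step_px, and pause parameters are assumptions.

```python
import time


def buffer(browser, steps=10, step_px=800, pause=0.5):
    """Scroll down in increments so lazy-loaded content has a chance to load."""
    for _ in range(steps):
        browser.execute_script(f"window.scrollBy(0, {step_px});")
        time.sleep(pause)


def close_window(browser):
    """Close the current tab and return focus to the first (original) tab."""
    browser.close()
    browser.switch_to.window(browser.window_handles[0])
```

Switching back to a remaining handle after close() matters because the driver otherwise keeps pointing at the tab that no longer exists, and the next command would fail.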

3. Download product images – Using requests and pandas, the script reads the CSV, splits each record's comma-separated image URL list, and saves the first image to a local folder.

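# -*- coding: utf-8 -*-
import os

import pandas as pd
import requests

# open() does not create missing folders, so make sure the target exists.
os.makedirs('./downloads_picture', exist_ok=True)

def open_requests(img, img_name):
    # The code assumes protocol-relative URLs ("//host/path"), so add the scheme.
    img_url = 'https:' + img
    res = requests.get(img_url, timeout=30)
    if not res.ok:
        print('Download failed', res.status_code, img_url)
        return
    with open(f"./downloads_picture/{img_name}", 'wb') as fn:
        fn.write(res.content)

df1 = pd.read_csv('./alibaba_com_img.csv')
# dropna() skips records without images; str(NaN) would otherwise become 'nan'.
for imgs in df1["product_img"].dropna():
    imgList = str(imgs).split(',')
    img = imgList[0]
    if img:
        # Dropping the first 24 characters leaves the file name (depends on the URL prefix length).
        img_name = img[24:]
        open_requests(img, img_name)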

This step stores the first product picture of each record locally so that it can be inserted into the workbook later.
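
The slice img[24:] only yields a usable file name while every URL shares the same 24-character prefix. A more tolerant sketch, using only the standard library, takes the last path segment instead. The function name image_file_name and the sample URL are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def image_file_name(img):
    """Return the last path segment of a full or protocol-relative image URL."""
    path = urlsplit(img).path  # drops scheme, host, query string, and fragment
    return posixpath.basename(path)


print(image_file_name("//img.example.com/kf/abc123/photo.jpg_300x300.jpg"))
# photo.jpg_300x300.jpg
```

The names this produces can differ from what the slice returns, so the download step and the Excel step would both need to use the same function for the files to be found again.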

4. Insert images into Excel – The CSV is converted to an Excel file, and xlwings together with PIL inserts each image into the appropriate cell, adjusting size proportionally.

# -*- coding: utf-8 -*-
import os

import xlwings as xw
from PIL import Image

path = 'alibaba_com.xlsx'
app = xw.App(visible=True, add_book=False)
wb = app.books.open(path)
sht = wb.sheets['Sheet1']
# Column L holds the comma-separated image URLs of each row.
img_list = sht.range('L2').expand('down').value

def write_pic(cell, img_name):
    # Use an absolute path so the file is found whatever Excel's working directory is.
    file_path = os.path.abspath(f'./downloads_picture/{img_name}')
    with Image.open(file_path) as img:
        w, h = img.size
    x_s = 70            # target width
    y_s = h * x_s / w   # height that preserves the aspect ratio
    sht.pictures.add(file_path, left=sht.range(cell).left, top=sht.range(cell).top, width=x_s, height=y_s)

for idx, imgs in enumerate(img_list):
    cell = f"C{idx+2}"  # data starts on row 2, below the header
    imgsList = str(imgs).split(',')
    if imgsList:
        img_name = imgsList[0][24:]
        try:
            write_pic(cell, img_name)
        except Exception as exc:
            # Usually a missing or unreadable file; report it and keep going.
            print('Could not insert image', img_name, exc)
wb.save()
wb.close()
app.quit()

After running the script, the Excel workbook displays each company's details alongside its product image.
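
The proportional resize inside write_pic can be isolated as a small pure function, which makes the arithmetic easy to check without opening Excel. This is an illustrative sketch; the name scaled_size and the validation step are not part of the original script.

```python
def scaled_size(width, height, target_width=70):
    """Return (target_width, new_height) with the original aspect ratio preserved."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return target_width, height * target_width / width


print(scaled_size(400, 300))  # (70, 52.5)
```

A 400 x 300 image scaled to a width of 70 gets a height of 300 * 70 / 400 = 52.5, the same value write_pic computes as y_s.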

Result – The final Excel file contains rows of scraped company information with embedded product pictures, providing a ready‑to‑use dataset for further analysis or outreach.

