How to Scrape Alibaba International Phone Numbers with Selenium and Export to Excel

This tutorial walks through using Selenium to log into Alibaba International, scrape supplier phone numbers and related details across multiple pages, save the data to CSV, download product images, and finally embed those images into an Excel workbook for easy reference.

MaGe Linux Operations

Jan 2, 2022

Introduction

Alibaba International hides supplier phone numbers behind login pages; the author needed a way to collect these numbers and related company information into a single Excel file.

1. Launch WebDriver and log in

Configure ChromeOptions to disable image loading and hide the Selenium automation flags, then start a Chrome WebDriver. After opening the Alibaba login page, the script pauses so you can log in manually, and continues once you enter 1 at the prompt.

from selenium.webdriver import ChromeOptions
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import re, time, csv
from lxml import etree

class Chrome_drive():
    def __init__(self):
        option = ChromeOptions()
        # Drop the switches that mark the browser as automation-controlled.
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # 2 = block images, which speeds up page loads.
        NoImage = {"profile.managed_default_content_settings.images": 2}
        option.add_experimental_option('prefs', NoImage)
        # Selenium 4 style; on Selenium 3 pass executable_path='./chromedriver' instead.
        self.browser = webdriver.Chrome(service=Service('./chromedriver'), options=option)
        # Make navigator.webdriver report undefined on every new document.
        self.browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'})
        self.browser.set_window_size(1200, 768)
        self.wait = WebDriverWait(self.browser, 12)

    def get_login(self):
        url = 'https://passport.alibaba.com/icbu_login.htm'
        self.browser.get(url)
        # Block until the user has logged in manually and entered 1.
        # input() returns a string, so compare against '1', not the integer 1.
        while input('Log in in the browser, then enter 1: ').strip() != '1':
            pass
        if 'Your Alibaba.com account is temporarily unavailable' in self.browser.page_source:
            # The account is blocked; stop here instead of refreshing a closed window.
            self.browser.close()
            return
        self.browser.refresh()

2. Extract page content

For each search result page, the script builds the URL, opens it in a new tab, scrolls to load lazy‑loaded images, and uses lxml.etree to parse the HTML. It extracts the company name, link to the phone‑detail page, main product, country, revenue, sales region, and product image URLs. Then it navigates to the phone‑detail page to scrape telephone, mobile phone, and address using regular expressions.
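The regular-expression step can be tried in isolation, without a browser. The following is a minimal sketch, assuming the detail page renders each contact field as a `<th>Label:</th><td>value</td>` pair, which is what the patterns used in this tutorial imply. `SAMPLE_HTML` and `extract_field` are hypothetical names introduced for illustration, and the sketch uses `re.search` (first match only) rather than joining every `re.findall` result.

```python
import re

# Hypothetical fragment shaped like the rows the tutorial's patterns target.
SAMPLE_HTML = """
<table>
  <tr><th>Telephone:</th><td>86-571-00000000</td></tr>
  <tr><th>Mobile Phone:</th><td>86-13000000000</td></tr>
  <tr><th>Address:</th><td>No. 1 Example Road,
      Hangzhou</td></tr>
</table>
"""

def extract_field(label, html):
    """Return the text of the first <td> that follows '<th>{label}:</th>', or ''."""
    pattern = re.escape(label) + r':</th><td>(.*?)</td>'
    # re.S lets '.' match newlines, so values that wrap across lines are captured.
    match = re.search(pattern, html, re.S)
    if match is None:
        return ''
    # Collapse runs of whitespace so a multi-line cell becomes a single line.
    return ' '.join(match.group(1).split())

phone = extract_field('Telephone', SAMPLE_HTML)
mobile = extract_field('Mobile Phone', SAMPLE_HTML)
address = extract_field('Address', SAMPLE_HTML)
fax = extract_field('Fax', SAMPLE_HTML)  # not present in the sample, so ''
```

Returning an empty string for a missing field keeps every CSV row the same length, even for companies that list no phone number.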

# The functions below are methods of Chrome_drive.
def index_page(self, page, wd):
    # Company search results for keyword wd, in list view.
    url = f'https://www.alibaba.com/trade/search?page={page}&keyword={wd}&f1=y&indexArea=company_en&viewType=L&n=38'
    # Open the results in a new tab and switch to it.
    self.browser.execute_script(f"window.open('{url}')")
    self.browser.switch_to.window(self.browser.window_handles[-1])
    self.buffer()  # helper not shown here; presumably the scrolling step
    time.sleep(3)
    self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J-items-content')))
    html = self.browser.page_source
    self.get_products(wd, html)
    self.close_window()  # helper not shown here

def get_products(self, wd, html_text):
    e = etree.HTML(html_text)
    items = e.xpath('//div[@id="J-items-content"]//div[@class="item-main"]')
    for li in items:
        company_name = ''.join(li.xpath('./div[@class="top"]//h2[@class="title ellipsis"]/a/text()'))
        company_phone_page = ''.join(li.xpath('./div[@class="top"]//a[@class="cd"]/@href'))
        product = ''.join(li.xpath('.//div[@class="value ellipsis ph"]/text()'))
        Attrs = li.xpath('.//div[@class="attrs"]//span[@class="ellipsis search"]/text()')
        # extract country, revenue, sales address from Attrs ...
        # (this elided step sets counctry, total_evenue and sell_adress, used below)
        product_img_list = li.xpath('.//div[@class="product"]/div/a/img/@src')
        product_img = ','.join(product_img_list) if product_img_list else ''
        # Defaults, so the row can still be saved when contact details are missing.
        phone = mobilePhone = address = ''
        self.browser.get(company_phone_page)
        try:
            if 'Your Alibaba.com account is temporarily unavailable' in self.browser.page_source:
                self.browser.close()
            # Click the mask that hides the contact details.
            self.browser.find_element(By.XPATH, '//div[@class="sens-mask"]/a').click()
            source = self.browser.page_source
            phone = ''.join(re.findall('Telephone:</th><td>(.*?)</td>', source, re.S))
            mobilePhone = ''.join(re.findall('Mobile Phone:</th><td>(.*?)</td>', source, re.S))
            address = ''.join(re.findall('Address:</th><td>(.*?)</td>', source, re.S))
        except Exception:
            print('No phone number found for this company')
        all_down = [wd, company_name, company_phone_page, product, counctry, phone, mobilePhone, address, total_evenue, sell_adress, product_img]
        save_csv(all_down)  # helper not shown here
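
The `save_csv` helper called at the end of `get_products` is not shown in the article. One possible shape for it is sketched below. The column names in `HEADER` are assumptions (only `product_img` is confirmed, since step 3 reads it), the default path mirrors the file that step 3 opens, and the demo rows are made-up values.

```python
import csv
import os
import tempfile

# Assumed column order, mirroring the all_down list built in get_products.
HEADER = ['keyword', 'company_name', 'company_phone_page', 'product', 'country',
          'phone', 'mobile_phone', 'address', 'total_revenue', 'sales_region',
          'product_img']

def save_csv(row, path='./alibaba_com_img.csv'):
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(HEADER)
        writer.writerow(row)

# Demo with made-up rows, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
save_csv(['led light', 'Example Co., Ltd.', 'https://example.com/contact',
          'LED bulbs', 'CN', '86-571-00000000', '', 'Hangzhou', '', '',
          '//img.example.com/a.jpg,//img.example.com/b.jpg'], demo_path)
save_csv(['led strip', 'Sample Trading', 'https://example.com/contact2',
          'LED strips', 'CN', '', '', '', '', '', ''], demo_path)

with open(demo_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
```

Appending row by row means a crash midway through a long scrape loses at most the current company. The csv module also quotes fields that contain commas, so the comma-joined `product_img` value stays in a single column.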

3. Download product images

After the CSV is generated, the script reads the product_img column, splits each cell on commas, takes the first URL, prefixes it with https:, and saves the image to a local downloads_picture folder using requests.

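import os
import pandas as pd
import requests

def open_requests(img, img_name):
    # The scraped src values are expected to be protocol-relative (//host/path).
    img_url = 'https:' + img
    res = requests.get(img_url, timeout=10)
    if res.ok:  # skip failed downloads instead of saving an error page
        with open(f"./downloads_picture/{img_name}", 'wb') as fn:
            fn.write(res.content)

os.makedirs('./downloads_picture', exist_ok=True)
df1 = pd.read_csv('./alibaba_com_img.csv')
# dropna() skips rows without images (NaN would otherwise become the string 'nan').
for imgs in df1["product_img"].dropna():
    img = str(imgs).split(',')[0]  # only the first image of each row is downloaded
    img_name = img[24:]            # drop the first 24 characters (URL prefix) to get the file name
    open_requests(img, img_name)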
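
The fixed `[24:]` slice only works while every image URL shares a prefix of exactly that length. A more defensive variant is sketched below, using only the standard library. `normalize_img_urls` and `img_file_name` are hypothetical helpers, not part of the original script, and the example host is made up. If these helpers were used for downloading, step 4 would need the same naming rule in place of its own `[24:]` slice.

```python
import posixpath
from urllib.parse import urlsplit

def normalize_img_urls(cell):
    """Split a comma-separated product_img cell into absolute https URLs."""
    if not isinstance(cell, str):  # pandas yields a float NaN for empty cells
        return []
    urls = []
    for part in cell.split(','):
        part = part.strip()
        if not part:
            continue
        # Protocol-relative URLs ("//host/path") need an explicit scheme.
        urls.append('https:' + part if part.startswith('//') else part)
    return urls

def img_file_name(url):
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(urlsplit(url).path)

urls = normalize_img_urls('//img.example.com/kf/abc.jpg, //img.example.com/kf/def.png')
names = [img_file_name(u) for u in urls]
```

Unlike the original loop, this returns every URL in the cell, so the caller can choose between downloading only the first image and downloading all of them.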

4. Insert images into Excel

The CSV is imported into Excel (UTF‑8, with columns set to text format so phone numbers are preserved). Using xlwings and PIL, the script opens the workbook, reads the image URLs from column L, scales each picture to a width of 70 while keeping its aspect ratio, and places it at the matching cell in column C.

import os
from PIL import Image
import xlwings as xw

path = 'alibaba_com.xlsx'
app = xw.App(visible=True, add_book=False)
wb = app.books.open(path)
sht = wb.sheets['Sheet1']
# Column L holds the comma-separated image URLs, starting below the header row.
img_list = sht.range('L2').expand('down').value

def write_pic(cell, img_name):
    # Use an absolute path, since Excel may not resolve a relative one.
    file_path = os.path.abspath(f'./downloads_picture/{img_name}')
    with Image.open(file_path) as img:
        w, h = img.size
    x_s = 70           # target width
    y_s = h * x_s / w  # height that keeps the aspect ratio
    sht.pictures.add(file_path, left=sht.range(cell).left, top=sht.range(cell).top, width=x_s, height=y_s)

for index, imgs in enumerate(img_list):
    cell = 'C' + str(index + 2)              # picture goes in column C of the same row
    img_name = str(imgs).split(',')[0][24:]  # same naming rule as the download step
    try:
        write_pic(cell, img_name)
    except Exception:
        print('No image found for this img_name:', img_name)

wb.save()
wb.close()
app.quit()
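
The proportional resize inside `write_pic` is plain arithmetic and can be checked without opening Excel. The following is a small sketch; `fit_to_width` is a hypothetical helper name, and the default of 70 matches the width used in the script above.

```python
def fit_to_width(width, height, target_width=70):
    """Return (new_width, new_height) that preserves the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError('image dimensions must be positive')
    # Multiply before dividing, matching y_s = h * x_s / w in write_pic.
    return target_width, height * target_width / width

size = fit_to_width(800, 600)  # (70, 52.5)
```

If the default row height is smaller than the returned height, neighbouring pictures will overlap. Setting `sht.range(cell).row_height` to at least that value before inserting is one way to avoid this.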

Result

The final Excel file contains each supplier’s name, phone numbers, product details, and the corresponding product image embedded in the sheet.
