InfoSpider: Open‑Source Python Toolbox for Secure Personal Data Scraping and Analysis

InfoSpider is an open‑source, Python‑based web‑scraping toolbox that securely aggregates personal data from more than 24 sources, including email, e‑commerce, and social platforms. It provides a GUI for easy operation, stores results as JSON, and offers basic visual analysis of the collected data.

Programmer DD

InfoSpider is an open‑source web‑scraping toolbox written in Python that helps users retrieve their own personal data from many online services safely and quickly. The project quickly reached GitHub's weekly trending list, has more than 1.3K stars, and ships with full source code, documentation, and video tutorials.

Project links:

Code repository: https://github.com/kangvcar/InfoSpider

Documentation site: https://infospider.vercel.app

Video demonstration: https://www.bilibili.com/video/BV14f4y1R7oF/

The toolbox supports more than 24 data sources, such as GitHub, various email providers (QQ, NetEase, Alibaba, Sina, Hotmail, Outlook), e‑commerce platforms (JD, Taobao, Alipay), telecom operators (China Mobile, China Unicom, China Telecom), Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, browser history, 12306, and multiple blogging platforms.

Key features:

Secure and reliable: the code is open source, simple, and transparent, and it runs locally.

Easy to use: a GUI lets users select data sources and follow the prompts.

Clear structure: each data source is independent under the Spiders directory, which improves portability.

Rich data sources: more than 24 sources are supported, and the list keeps growing.

Unified JSON format: all scraped data is saved as JSON for convenient analysis.

Comprehensive personal data: the tool attempts to collect as much of the user's personal information as possible.

Basic data analysis: visual charts are provided for the collected data (supported for some sources only).
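To make the unified JSON format concrete, here is a minimal sketch of how scraped records could be written to disk. The function name save_records, the file name, and the record fields are hypothetical and are not taken from the InfoSpider source; only the standard library json module is used.

```python
import json
import os
import tempfile


def save_records(records, directory, filename):
    """Write a list of dicts to directory/filename as UTF-8 JSON and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII text (for example Chinese) readable in the file.
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return path


if __name__ == "__main__":
    # Hypothetical record; the field names are illustrative only.
    sample = [{"title": "示例订单", "price": 19.9}]
    out_path = save_records(sample, tempfile.mkdtemp(), "orders.json")
    print("saved to", out_path)
```

Writing with ensure_ascii=False and an explicit UTF-8 encoding is a common choice when the data contains Chinese text, since the resulting file stays readable in any editor.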

Installation steps:

Install Python 3 and Google Chrome.

Install a Chrome driver matching the browser version.

Run ./install_deps.sh (or pip install -r requirements.txt on Windows) to install dependencies.
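Before launching the tool, it can help to confirm that the prerequisites above are in place. The following is a rough sketch, assuming the Chrome driver executable is named chromedriver and is discoverable on PATH (the project itself may locate the driver differently). The helper check_environment is hypothetical, and it does not verify that the driver version matches the installed browser.

```python
import shutil
import sys


def check_environment(driver_name="chromedriver"):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # The installation steps call for Python 3.
    if sys.version_info < (3,):
        problems.append("Python 3 is required")
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which(driver_name) is None:
        problems.append(f"{driver_name} was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```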

Running the tool:

Navigate to the tools directory.

Execute python3 main.py.

In the opened GUI, click the desired data source button, choose a save path, and enter account credentials; the tool will automatically scrape the data.

Scraped JSON files and optional HTML charts are saved in the selected directory.
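Once a run finishes, the saved JSON files can be inspected with a few lines of Python. This is a sketch only: the helper summarize_json_dir is hypothetical, and the layout of the files InfoSpider writes is not specified here, so the code simply counts the top-level items in every .json file found in the chosen directory.

```python
import json
import tempfile
from pathlib import Path


def summarize_json_dir(directory):
    """Map each *.json file name in directory to the number of top-level items it holds."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Lists and dicts have a natural size; any other JSON value counts as one item.
        summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary


if __name__ == "__main__":
    # Demo with a throwaway directory standing in for the save path chosen in the GUI.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "orders.json").write_text('[{"id": 1}, {"id": 2}]', encoding="utf-8")
        print(summarize_json_dir(tmp))
```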

The repository also includes example code for scraping Taobao orders, favorites, browsing footprints, and address information, demonstrating how to use Selenium, lxml, and PyQuery for data extraction.

import json
import random
import time
import sys
import os
import requests
import numpy as np
import math
from lxml import etree
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver import ChromeOptions, ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from tkinter.filedialog import askdirectory
from tqdm import trange

# ... (rest of the TaobaoSpider implementation) ...
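Since the implementation is elided here, the extraction step can be illustrated separately. The sketch below is not the TaobaoSpider code: it parses a small, hypothetical, well‑formed markup snippet with lxml, falling back to the standard library's ElementTree if lxml is not installed. The markup, class names, and the parse_orders helper are all assumptions made for illustration; real Taobao pages are structured differently.

```python
try:
    from lxml import etree  # the library named in the article
except ImportError:  # assumption: standard-library fallback so the demo still runs
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed markup used only for this demonstration.
SAMPLE = """
<div>
  <div class="order"><span class="title">Keyboard</span><span class="price">199.00</span></div>
  <div class="order"><span class="title">Mouse</span><span class="price">59.50</span></div>
</div>
"""


def parse_orders(markup):
    """Extract title/price pairs from order blocks using ElementPath queries."""
    root = etree.fromstring(markup.strip())
    orders = []
    # Find every descendant <div> whose class attribute is exactly "order".
    for node in root.findall(".//div[@class='order']"):
        title = node.findtext("span[@class='title']")
        price = node.findtext("span[@class='price']")
        orders.append({"title": title, "price": float(price)})
    return orders


if __name__ == "__main__":
    for order in parse_orders(SAMPLE):
        print(order)
```

In a Selenium‑driven workflow the markup would typically come from the driver's page_source attribute, and because real pages are rarely well‑formed XML, a tolerant parser such as lxml's etree.HTML would likely be the better entry point than fromstring.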

The author encourages users to star the repository to support its development.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
