Boost Your Web Scraping Speed with Photon: A High‑Performance Multithreaded Crawler

Photon is a fast, multithreaded Python web crawler that extracts URLs, files, and intelligence such as emails and social media accounts, offering flexible options, Ninja mode, and extensive command‑line parameters for precise and efficient data harvesting across multiple operating systems.

MaGe Linux Operations

Feb 1, 2022

Project URL

https://github.com/s0md3v/Photon

Main Features

Photon provides numerous options for custom crawling, but its standout capability is high‑efficiency multithreaded data extraction.

Data Extraction

By default, Photon extracts the following data:

All URLs (both in‑scope and out‑of‑scope)
Parameterized URLs (e.g., example.com/gallery.php?id=2)
Intelligence such as email addresses, social media accounts, Amazon buckets, etc.
Files (pdf, png, xml, etc.)
JavaScript and other files
Strings matching custom regular‑expression patterns
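To make the categories above concrete, here is a minimal sketch of pulling emails, links, and parameterized links out of an HTML string. The patterns and the `extract` helper are illustrative assumptions, not Photon's actual regexes or code.

```python
import re

# Illustrative patterns only; these are assumptions, not Photon's actual regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def extract(html):
    """Return emails, all links, and parameterized links found in an HTML string."""
    links = set(HREF_RE.findall(html))
    return {
        "emails": set(EMAIL_RE.findall(html)),
        "links": links,
        # A link counts as parameterized here if its query string has a key=value pair.
        "parameterized": {
            link for link in links
            if "?" in link and "=" in link.split("?", 1)[1]
        },
    }


sample = (
    '<a href="http://example.com/gallery.php?id=2">Gallery</a> '
    '<a href="/about.html">About</a> '
    "Contact: admin@example.com"
)
result = extract(sample)
print(sorted(result["emails"]))
print(sorted(result["parameterized"]))
```

A real crawler would also resolve relative links such as /about.html against the page URL before deciding whether they are in scope.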

The extracted information is saved in an organized manner in the output directory, or can be exported as JSON.

Smart Multithreading

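Many tools that claim to be multithreaded either hand the same list of items to every thread, so several threads end up processing the same item, or rely on a thread lock that effectively serializes the work. Photon instead distributes the work among threads, avoiding both shared‑task bottlenecks and lock contention.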
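One common way to get this kind of distribution in Python is a thread pool that hands each item to exactly one worker. The sketch below uses the standard library's `concurrent.futures`; it is not taken from Photon's source, and `fetch` is a hypothetical stand‑in for a real HTTP request.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import threading

processed = Counter()            # how many times each URL was handled
counter_lock = threading.Lock()  # protects only the bookkeeping, not the work itself


def fetch(url):
    """Stand-in for an HTTP request; a real crawler would download the page here."""
    with counter_lock:
        processed[url] += 1
    return url, len(url)


urls = ["http://example.com/page%d" % i for i in range(20)]

# executor.map submits each URL once, so no two workers handle the same item.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, urls))

print(all(count == 1 for count in processed.values()))
```

The lock here guards only the shared counter; the simulated request itself runs without any global lock, which is what lets the workers proceed in parallel.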

Ninja Mode

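In Ninja mode, three online services make requests to the target on your behalf, so four clients (your machine plus the three services) issue requests simultaneously. This speeds up crawling on slow connections, reduces the risk of connection resets, and spreads out the requests coming from any single client.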

Compatibility & Dependencies

Compatibility

Photon works on Python 2.x and 3.x (future versions may drop Python 2.x), and has been tested on Linux (Arch, Debian, Ubuntu), Termux, Windows 7/10, and macOS.

Operating Systems

Color output is not supported on macOS and Windows terminals that lack ANSI escape sequence handling.

Dependencies

requests
urllib3
argparse

All other required libraries are part of the standard Python distribution.

How to Use Photon

Syntax: photon.py [options]
  -u, --url          Target URL
  -l, --level        Crawl depth (default 2)
  -t, --threads      Number of threads (default 2)
  -d, --delay        Delay between requests (seconds)
  -c, --cookies      Cookie header
  -r, --regex        Custom regex pattern
  -s, --seeds        Additional seed URLs (comma‑separated)
  -e, --export       Export format (e.g., json)
  -o, --output       Output directory
  --exclude          Exclude URLs matching regex
  --timeout          Request timeout (seconds)
  --ninja            Enable Ninja mode
  --update           Check for updates
  --dns              Dump DNS data
  --only-urls        Extract URLs only
  --user-agent       Specify custom User‑Agent(s)

Single‑Site Crawl

python photon.py -u "http://example.com"

Crawl Depth

python photon.py -u "http://example.com" -l 3

Depth sets how many levels of links Photon follows. With a depth of 2, Photon collects the URLs found on the homepage and any seeds (level 1), then crawls those URLs as well (level 2).
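The effect of the level setting can be sketched as a level‑by‑level traversal. The example below runs over a hypothetical in‑memory link graph (`LINKS`) instead of a real site, and the `crawl` function is an illustration of the idea, not Photon's implementation.

```python
# Hypothetical link graph standing in for a real site: page -> links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/b1"],
    "/a1": ["/deep"],
    "/b1": [],
    "/deep": [],
}


def crawl(start, level):
    """Crawl level by level; each pass fetches the pages discovered by the previous one."""
    seen = {start}
    frontier = [start]
    for _ in range(level):
        discovered = []
        for page in frontier:
            for link in LINKS.get(page, []):
                if link not in seen:
                    seen.add(link)
                    discovered.append(link)
        frontier = discovered
    return seen


print(sorted(crawl("/", 2)))
print(sorted(crawl("/", 3)))
```

With a level of 2, the pages fetched are the start page and the pages it links to; links found on those pages (/a1 and /b1) are recorded but not followed, so /deep only appears at level 3.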

Thread Count

python photon.py -u "http://example.com" -t 10

Increasing threads speeds up crawling but may trigger security mechanisms or overload small sites.

Delay Between Requests

python photon.py -u "http://example.com" -d 2

Specifies a pause (in seconds) between each HTTP(S) request.

Timeout

python photon.py -u "http://example.com" --timeout=4

Sets the maximum wait time for a response before timing out.

Cookies

python photon.py -u "http://example.com" -c "PHPSESSID=u5423d78fqbaju9a0qke25ca87"

Allows sending a Cookie header with each request (non‑Ninja mode).
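For readers unfamiliar with the format, the value passed to -c is a standard Cookie header string made of name=value pairs. The sketch below parses such a string with the standard library and rebuilds the header a client would attach to each request; it illustrates the format only and makes no claim about how Photon handles the value internally.

```python
from http.cookies import SimpleCookie

cookie_string = "PHPSESSID=u5423d78fqbaju9a0qke25ca87"

# Parse the name=value pairs out of the cookie string.
jar = SimpleCookie()
jar.load(cookie_string)
cookies = {name: morsel.value for name, morsel in jar.items()}

# The header a client would attach to each request.
headers = {"Cookie": "; ".join("%s=%s" % pair for pair in cookies.items())}
print(headers)
```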

Output Directory

python photon.py -u "http://example.com" -o "my_directory"

Results are saved in a folder named after the target domain by default; this option overrides the directory name.

Exclude Specific URLs

python photon.py -u "http://example.com" --exclude="/blog/20(17|18)"

URLs matching the given regex are omitted from crawling and results.
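When writing an exclusion pattern, note the difference between a group and a character class. The sketch below filters a hypothetical URL list with two patterns meant to skip blog posts from 2017 and 2018, assuming the pattern is applied with search semantics (a match anywhere in the URL).

```python
import re

urls = [
    "http://example.com/blog/2016/intro",
    "http://example.com/blog/2017/recap",
    "http://example.com/blog/2018/launch",
    "http://example.com/blog/2019/roadmap",
    "http://example.com/about",
]

group = re.compile(r"/blog/20(17|18)")       # alternation: 2017 or 2018
char_class = re.compile(r"/blog/20[17|18]")  # character class: one of 1, 7, |, 8

kept_group = [u for u in urls if not group.search(u)]
kept_class = [u for u in urls if not char_class.search(u)]

print(kept_group)
print(kept_class)
```

The grouped pattern keeps the 2016 and 2019 posts, while the character‑class pattern drops every /blog/201x URL because a single "1" after "20" is enough to match.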

Seed URLs

python photon.py -u "http://example.com" --seeds "http://example.com/blog/2018,http://example.com/portals.html"

Add custom seed URLs, separated by commas.

Custom User‑Agents

python photon.py -u "http://example.com" --user-agent "curl/7.35.0,Wget/1.15 (linux-gnu)"

Use specific User‑Agent strings without editing the default file.

Custom Regex Pattern

python photon.py -u "http://example.com" --regex "\d{10}"

Extracts strings that match the provided regular expression during crawling.
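A pattern such as \d{10} matches any run of ten digits, including the first ten digits of a longer number. The sketch below shows this on a hypothetical piece of page text, together with a stricter variant that uses lookarounds to require exactly ten digits.

```python
import re

page_text = "Support: 9876543210, order id 123456789012, zip 90210"

# Any ten consecutive digits, even inside a longer number.
loose = re.findall(r"\d{10}", page_text)

# Exactly ten digits: no digit immediately before or after the match.
strict = re.findall(r"(?<!\d)\d{10}(?!\d)", page_text)

print(loose)
print(strict)
```

If the stricter behavior is what you want, the second pattern can be passed to --regex in the same way.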

Export Formatted Results

python photon.py -u "http://example.com" --export=json

Currently supports JSON output.
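JSON output is convenient for post‑processing in other scripts. The sketch below serializes a hypothetical result set and loads it back; the keys and values are illustrative assumptions and do not reflect Photon's actual export schema or file name.

```python
import json

# Hypothetical crawl results; the keys are illustrative, not Photon's actual schema.
results = {
    "internal": {"http://example.com/", "http://example.com/about.html"},
    "intel": {"admin@example.com"},
}

# Sets are not JSON-serializable, so convert each one to a sorted list first.
exported = json.dumps(
    {key: sorted(values) for key, values in results.items()}, indent=4
)

loaded = json.loads(exported)
print(loaded["intel"])
```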

Skip Data Extraction

python photon.py -u "http://example.com" --only-urls

Only URLs are collected; other data such as JavaScript files are ignored.

Update

python photon.py --update

Checks for a newer version, downloads it, and merges updates without overwriting other files.

Ninja Mode

python photon.py -u "http://example.com" --ninja

Enables Ninja mode, which relays requests through the following online services:

codebeautify.org
photopea.com
pixlr.com

DNS Data Dump

python photon.py -u "http://example.com" --dns

Generates an image displaying DNS information for the target domain. This currently does not work when the target itself is a subdomain.
