Boost Your Web Scraping Speed with Photon: A High‑Performance Multithreaded Crawler
Photon is a fast, multithreaded Python web crawler that extracts URLs, files, and intelligence such as emails and social media accounts, offering flexible options, Ninja mode, and extensive command‑line parameters for precise and efficient data harvesting across multiple operating systems.
Project URL
https://github.com/s0md3v/Photon
Main Features
Photon provides numerous options for custom crawling, but its standout capability is high‑efficiency multithreaded data extraction.
Data Extraction
By default, Photon extracts the following data:
All URLs (both in‑scope and out‑of‑scope)
Parameterized URLs (e.g., example.com/gallery.php?id=2)
Intelligence such as email addresses, social media accounts, Amazon buckets, etc.
Files (pdf, png, xml, etc.)
JavaScript and other files
Strings matching custom regular‑expression patterns
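As a rough illustration of these categories, the sketch below sorts a few links into in-scope, out-of-scope, and parameterized URLs and pulls email addresses out of page text. It is not Photon's own code: the function names, the bucket names, the sample data, and the deliberately simple email pattern are all assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Deliberately simple pattern for illustration; real-world email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def classify_urls(urls, target_host):
    """Sort URLs into in-scope, out-of-scope, and parameterized buckets."""
    buckets = {"internal": set(), "external": set(), "fuzzable": set()}
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc == target_host:
            buckets["internal"].add(url)
            # A query string marks the URL as parameterized.
            if parsed.query:
                buckets["fuzzable"].add(url)
        else:
            buckets["external"].add(url)
    return buckets


def extract_emails(text):
    """Return the unique email-like strings found in a block of text."""
    return set(EMAIL_RE.findall(text))


links = [
    "http://example.com/about.html",
    "http://example.com/gallery.php?id=2",
    "http://other.example.org/page",
]
result = classify_urls(links, "example.com")
emails = extract_emails("Contact admin@example.com or sales@example.com.")
```

Here `result["fuzzable"]` holds only the gallery URL, because it is the one in-scope link that carries a query string.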
The extracted information is saved to disk, by default in a folder named after the target domain.
Smart Multithreading
Unlike many tools that misuse threads, Photon correctly distributes work among threads, avoiding shared‑task bottlenecks and lock contention.
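The idea of giving each thread its own share of the work can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration under assumptions, not Photon's implementation: `fetch` is a hypothetical stand-in that performs no network access, and `crawl_batch` is a name invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for an HTTP request; returns a fake page body (hypothetical)."""
    return "<html>body of %s</html>" % url


def crawl_batch(urls, thread_count=2):
    """Process each URL exactly once, spreading the work over a pool of threads."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # map() hands every URL to one worker and returns results in input order,
        # so workers never compete for the same task.
        bodies = list(pool.map(fetch, urls))
    return dict(zip(urls, bodies))


pages = crawl_batch(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    thread_count=3,
)
```

Because every URL is handed to exactly one worker and the results are collected by the pool, the workers in this sketch share no mutable state and need no explicit locks.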
Ninja Mode
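In Ninja mode, three online services issue requests on your behalf, so four clients (your own machine plus the three services) query the target simultaneously. This speeds up crawling on slow connections, reduces the risk of connection resets, and spreads the requests so that fewer come from any single client.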
Compatibility & Dependencies
Compatibility
Photon works on Python 2.x and 3.x (future versions may drop Python 2.x), and has been tested on Linux (Arch, Debian, Ubuntu), Termux, Windows 7/10, and macOS.
Operating Systems
Color output is not supported on macOS and Windows terminals that lack ANSI escape sequence handling.
Dependencies
requests
urllib3
argparse
All other required libraries are part of the standard Python distribution.
How to Use Photon
Syntax: photon.py [options]
-u, --url Target URL
-l, --level Crawl depth (default 2)
-t, --threads Number of threads (default 2)
-d, --delay Delay between requests (seconds)
-c, --cookies Cookie header
-r, --regex Custom regex pattern
-s, --seeds Additional seed URLs
-e, --export Export format (e.g., json)
-o, --output Output directory
--exclude Exclude URLs matching regex
--timeout Request timeout (seconds)
--ninja Enable Ninja mode
--update Check for updates
--dns Dump DNS data
--only-urls Extract URLs only
--user-agent Specify custom User‑Agent(s)
Single‑Site Crawl
python photon.py -u "http://example.com"
Crawl Depth
python photon.py -u "http://example.com" -l 3
Depth sets how many recursion levels are crawled. A depth of 2 means Photon collects the URLs found on the homepage and seeds (level 1) and then crawls those URLs as well (level 2).
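One way to picture the depth setting is a level-by-level walk over a site's link graph. The sketch below uses a made-up in-memory site map (`LINKS`) instead of real HTTP requests, and the `crawl` function is a hypothetical helper, so it only illustrates the idea rather than Photon's actual traversal.

```python
# Hypothetical site map: each page lists the pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/b1"],
    "/a1": ["/a2"],
    "/b1": [],
    "/a2": [],
}


def crawl(start, depth):
    """Return (pages fetched, URLs discovered) after `depth` levels."""
    fetched = set()
    discovered = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for page in frontier:
            fetched.add(page)
            for link in LINKS.get(page, []):
                if link not in discovered:
                    discovered.add(link)
                    next_frontier.add(link)
        # The links found at this level become the pages fetched at the next one.
        frontier = next_frontier
    return fetched, discovered


fetched, discovered = crawl("/", depth=2)
```

With `depth=2`, the sketch fetches the homepage and the two pages it links to, and it records the links found on those pages without fetching them. Raising the depth to 3 would fetch that next layer as well.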
Thread Count
python photon.py -u "http://example.com" -t 10
Increasing the thread count speeds up crawling but may trigger security mechanisms or overload small sites.
Delay Between Requests
python photon.py -u "http://example.com" -d 2
Specifies a pause (in seconds) between each HTTP(S) request.
Timeout
python photon.py -u "http://example.com" --timeout=4
Sets the maximum time (in seconds) to wait for a response before the request times out.
Cookies
python photon.py -u "http://example.com" -c "PHPSESSID=u5423d78fqbaju9a0qke25ca87"
Sends the given Cookie header with each request made in non‑Ninja mode.
Output Directory
python photon.py -u "http://example.com" -o "my_directory"
Results are saved in a folder named after the target domain by default; this option overrides the directory name.
Exclude Specific URLs
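python photon.py -u "http://example.com" --exclude="/blog/20(17|18)"
URLs matching the given regex are omitted from crawling and from the results. The pattern uses a group, (17|18), rather than a character class, [17|18], which would match any single one of the characters 1, 7, | and 8.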
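A pattern of this kind is easy to get subtly wrong, because square brackets define a character class while parentheses define alternation. The sketch below compares the two with Python's `re` module. The sample URLs are made up, and the use of `re.search` is an assumption, since the source does not say how Photon applies the pattern.

```python
import re

urls = [
    "http://example.com/blog/2017/post",
    "http://example.com/blog/2018/post",
    "http://example.com/blog/2011/post",
    "http://example.com/about",
]

# [17|18] is a character class: it matches ONE character out of 1, 7, | or 8.
char_class = re.compile(r"/blog/20[17|18]")
# (17|18) is a group with alternation: it matches "17" or "18".
group = re.compile(r"/blog/20(17|18)")

excluded_by_class = [u for u in urls if char_class.search(u)]
excluded_by_group = [u for u in urls if group.search(u)]
kept = [u for u in urls if not group.search(u)]
```

The character-class version also excludes the 2011 post, since "1" is one of the characters in the class. The grouped version excludes only the 2017 and 2018 posts.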
Specify Sub‑URLs
python photon.py -u "http://example.com" --seeds "http://example.com/blog/2018,http://example.com/portals.html"
Adds custom seed URLs, separated by commas.
Custom User‑Agents
python photon.py -u "http://example.com" --user-agent "curl/7.35.0,Wget/1.15 (linux-gnu)"
Uses the specified User‑Agent strings without editing the default file.
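A comma-separated value like the one above has to be split into individual strings before use. The following sketch shows one plausible way to do that and to pick a User-Agent for each request. Both helper names are hypothetical, and choosing an agent at random is an assumption for illustration, not a documented description of Photon's behavior.

```python
import random


def parse_user_agents(option_value):
    """Split a comma-separated option value into individual User-Agent strings."""
    return [ua.strip() for ua in option_value.split(",") if ua.strip()]


def build_headers(user_agents, rng=random):
    """Pick one User-Agent at random for the next request (assumed strategy)."""
    return {"User-Agent": rng.choice(user_agents)}


agents = parse_user_agents("curl/7.35.0,Wget/1.15 (linux-gnu)")
headers = build_headers(agents)
```

Note that a plain comma split like this one would break User-Agent strings that themselves contain commas, such as browser strings that include "KHTML, like Gecko".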
Custom Regex Pattern
python photon.py -u "http://example.com" --regex "\d{10}"
Extracts strings that match the provided regular expression during crawling.
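To get a feel for what a pattern such as `\d{10}` picks up, it can be tried against sample text with Python's `re` module first. The text below is invented for the example, and the stricter variant with lookarounds is a suggestion of this sketch, not something taken from Photon.

```python
import re

page_text = "Call 9876543210 or 555-0100. Order ID: 12345678901234."

# Unanchored: also matches the first ten digits of a longer digit run.
loose = re.findall(r"\d{10}", page_text)

# Lookarounds require that the ten digits are not part of a longer number.
strict = re.findall(r"(?<!\d)\d{10}(?!\d)", page_text)
```

In this sample, `loose` contains two matches, because the 14-digit order ID starts with ten digits, while `strict` contains only the standalone ten-digit number.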
Export Formatted Results
python photon.py -u "http://example.com" --export=json
Currently supports JSON output.
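A JSON export of crawl results could look roughly like the sketch below. The dataset names (`internal`, `fuzzable`) and the `export_json` helper are assumptions made for this example; the actual keys and layout of Photon's export file are not described in the source.

```python
import json


def export_json(datasets):
    """Serialize crawl results; sets are converted to sorted lists for JSON."""
    serializable = {name: sorted(values) for name, values in datasets.items()}
    return json.dumps(serializable, indent=4)


results = {
    "internal": {"http://example.com/", "http://example.com/about"},
    "fuzzable": {"http://example.com/gallery.php?id=2"},
}
exported = export_json(results)
```

Sets are converted to sorted lists because JSON has no set type, and sorting keeps the output stable between runs.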
Skip Data Extraction
python photon.py -u "http://example.com" --only-urls
Only URLs are collected; other data, such as JavaScript files, is ignored.
Update
python photon.py --update
Checks for a newer version, downloads it, and merges the updates without overwriting other files.
Ninja Mode
python photon.py -u "http://example.com" --ninja
Enables Ninja mode, which routes requests through the following proxy sites:
codebeautify.org
photopea.com
pixlr.com
DNS Data Dump
python photon.py -u "http://example.com" --dns
Generates an image displaying DNS information for the target domain (sub‑domains are not supported).
MaGe Linux Operations
