Scrapling: Self‑Healing Web Scraper That Bypasses Cloudflare and Is 784× Faster Than BS4
Scrapling is an open‑source, adaptive web‑scraping framework that automatically tracks element changes, bypasses Cloudflare and other anti‑scraping defenses, offers multiple fetchers (including stealth mode), and delivers extraction speeds up to 784× faster than BeautifulSoup (BS4) while supporting concurrency, AI integration, and easy CLI usage.
Traditional Pain Points
When a website is redesigned, hard‑coded CSS or XPath selectors become invalid, forcing developers to repeatedly rewrite and debug code. Anti‑scraping measures such as Cloudflare Turnstile, CAPTCHAs, and fingerprint detection block ordinary requests, making large‑scale crawling difficult. Existing tools like BeautifulSoup are simple but slow at scale, while Scrapy is powerful but has a steep learning curve.
Project Overview
Scrapling is an open‑source adaptive web‑scraping framework (GitHub ★30K) designed for "write once, run forever". It automatically learns website structure changes, re‑locates target elements after updates, and includes out‑of‑the‑box Cloudflare bypass capabilities. The core is written for Python 3.10+ and licensed under BSD‑3‑Clause. It has 92% test coverage and full type hints, and is used daily by hundreds of developers.
Core Features
Adaptive Element Tracking: Intelligent algorithms record element features, eliminating reliance on fixed selectors. Enabling adaptive=True automatically re‑finds targets; auto_save=True persists adaptations for long‑term stability.
Smart Similar‑Element Search: Automatically locates elements with similar features to improve adaptability.
Hard‑Core Anti‑Anti‑Scraping: Four fetchers cover all scenarios (regular HTTP, async, stealth via StealthyFetcher, and dynamic rendering). StealthyFetcher spoofs fingerprints to bypass Cloudflare Turnstile, reCAPTCHA, hCAPTCHA, funcaptcha, textcaptcha, awscaptcha, etc.
Built‑in Proxy Rotation, DNS‑over‑HTTPS, and Ad/Tracker Blocking (≈3500 domains) for maximum stealth.
Scrapy‑Like Full Framework: Async API, start_urls, parse callbacks, Request / Response objects, and easy migration for Scrapy users.
Concurrency & Session Management: Configurable concurrency, multi‑session mixing, checkpoint‑based pause/resume, streaming output via async for item in spider.stream().
Robots.txt Compliance: Optional robots_txt_obey respects Disallow, Crawl-delay, and Request-rate directives.
Automatic Failure Detection: Custom retry logic reduces crawl failures.
Performance: Text extraction is 784× faster than BS4 (Lxml parser) and 12× faster than PyQuery, comparable to Parsel/Scrapy (only 0.02 ms slower). JSON serialization is 10× faster than Python’s standard library.
Memory‑Optimized Design: Lazy loading and efficient data structures keep memory usage low for massive crawls.
AI‑Friendly MCP Service: Built‑in MCP server integrates with Claude, Cursor, etc., pre‑extracts content to cut token usage and can automate CAPTCHA solving when combined with tools like NopeCHA.
Developer‑Friendly: Interactive IPython‑based Web Scraping Shell, CLI commands for code‑free extraction, rich navigation API (CSS, XPath, BeautifulSoup‑style, regex), automatic selector generation, and ready‑to‑use Docker images.
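To make the adaptive element tracking idea from the feature list more concrete, here is a minimal standard-library sketch. It is not Scrapling's algorithm, which this article does not describe. It only illustrates the general technique of saving an element's features and, when the original selector stops matching, picking the most similar element in the new markup. The names ElementCollector, fingerprint, relocate, the sample HTML, and the 0.5 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect tag, attributes, and direct text for every element."""

    def __init__(self):
        super().__init__()
        self.elements = []  # every element seen, in document order
        self._stack = []    # elements that are currently open

    def handle_starttag(self, tag, attrs):
        record = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(record)
        self._stack.append(record)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        while self._stack:
            if self._stack.pop()["tag"] == tag:
                break

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"] += data.strip()


def parse(html):
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements


def find_by_class(elements, cls):
    """Stand-in for a hard-coded selector such as '.price'."""
    for el in elements:
        if cls in (el["attrs"].get("class") or "").split():
            return el
    return None


def fingerprint(el):
    """Flatten the saved features into one comparable string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(el["attrs"].items()))
    return f'{el["tag"]} {attrs} {el["text"]}'


def relocate(saved, elements, threshold=0.5):
    """Return the element most similar to the saved one, if similar enough."""
    def score(el):
        return SequenceMatcher(None, fingerprint(saved), fingerprint(el)).ratio()

    best = max(elements, key=score, default=None)
    return best if best is not None and score(best) >= threshold else None


OLD_HTML = '<div><h1 class="title">Blue Widget</h1><p class="price">$19.99</p></div>'
NEW_HTML = ('<section><h2 class="product-title">Blue Widget</h2>'
            '<span class="product-price">$19.99</span></section>')

# First run: the selector works, so the element's features are saved.
saved = find_by_class(parse(OLD_HTML), "price")

# After the redesign the same selector finds nothing...
new_elements = parse(NEW_HTML)
print(find_by_class(new_elements, "price"))

# ...but the saved features still point at the renamed price element.
relocated = relocate(saved, new_elements)
print(relocated)
```

A real implementation would also have to persist the saved features between runs (the role auto_save=True plays in Scrapling) and would likely weigh more signals, such as position in the tree and neighbouring elements.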
Performance Test
Benchmarking 5 000 nested elements (100+ runs) shows:
Scrapling: 2.02 ms (1.0× baseline)
Parsel/Scrapy: 2.04 ms (1.01×)
Raw Lxml: 2.54 ms (1.26×)
PyQuery: 24.17 ms (~12× slower)
BS4 (Lxml): 1 584.31 ms (~784× slower)
Element‑search latency: Scrapling 2.39 ms vs AutoScraper 12.45 ms (5.2× slower).
All benchmarks are averages of 100+ runs; the benchmark script (benchmarks.py) is provided in the repository.
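The multipliers quoted above follow directly from the reported mean timings. The short check below recomputes them from the numbers in this section; it does not rerun the benchmark, and the helper name relative_to is only illustrative.

```python
# Mean timings in milliseconds, copied from the benchmark list above.
TIMINGS_MS = {
    "Scrapling": 2.02,
    "Parsel/Scrapy": 2.04,
    "Raw Lxml": 2.54,
    "PyQuery": 24.17,
    "BS4 (Lxml)": 1584.31,
}


def relative_to(baseline, timings):
    """Express every timing as a multiple of the baseline's timing."""
    base = timings[baseline]
    return {name: ms / base for name, ms in timings.items()}


ratios = relative_to("Scrapling", TIMINGS_MS)
for name, ratio in ratios.items():
    print(f"{name:<14} {TIMINGS_MS[name]:>9.2f} ms  {ratio:>7.2f}x")

# The "0.02 ms slower" figure for Parsel/Scrapy is the absolute gap.
gap_ms = TIMINGS_MS["Parsel/Scrapy"] - TIMINGS_MS["Scrapling"]
print(f"Parsel/Scrapy gap: {gap_ms:.2f} ms")

# Element-search comparison: AutoScraper relative to Scrapling.
search_ratio = 12.45 / 2.39
print(f"AutoScraper: {search_ratio:.1f}x slower")
```

Note that the 784× figure compares against BS4 using the Lxml parser on this specific workload of 5 000 nested elements, so it should be read as a result for this benchmark rather than a general speedup.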
Quick Start
Installation (Python ≥ 3.10):
# Basic install
pip install scrapling
# Install fetchers (anti‑scraping, browsers)
pip install "scrapling[fetchers]"
scrapling install
# Full install (AI + Shell)
pip install "scrapling[all]"
Basic Fetcher Example (quotes site):
from scrapling.fetchers import Fetcher
page = Fetcher.get("https://quotes.toscrape.com/")
quotes = page.css(".quote .text::text").getall()
authors = page.css(".quote .author::text").getall()
print(list(zip(quotes, authors)))
Stealth Mode Example (Cloudflare demo):
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch("https://nopecha.com/demo/cloudflare")
print(page.css("#padded_content a").getall())
Full Spider with Pagination:
from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        # Emit one item per quote block on the page
        for quote in response.css(".quote"):
            yield {"text": quote.css(".text::text").get(),
                   "author": quote.css(".author::text").get()}
        # Follow the "next" link until there are no more pages
        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page)

result = QuotesSpider().start()
result.items.to_json("quotes.json")
Multi‑Session Mixed Crawl (fast HTTP + stealth):
from scrapling.spiders import Spider, Request
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        # Plain HTTP session for ordinary pages
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth browser session, registered lazily
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
Async Session Example (high concurrency):
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession

async def main():
    # FetcherSession used as an async context manager; requests are awaited
    async with FetcherSession(http3=True) as session:
        page1 = await session.get('https://example.com/page1')
        page2 = await session.get('https://example.com/page2', impersonate='firefox135')

    # Fetch several pages concurrently through a pool of browser tabs
    async with AsyncStealthySession(max_pages=2) as session:
        tasks = [session.fetch(url) for url in ['https://example.com/page1', 'https://example.com/page2']]
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())

asyncio.run(main())
CLI Usage (no code required):
# Interactive shell
scrapling shell
# Basic extraction to markdown
scrapling extract get 'https://example.com' content.md
# CSS‑selector extraction with Chrome impersonation
scrapling extract get 'https://example.com' content.txt --css-selector '#target-element' --impersonate 'chrome'
# Stealth fetch to bypass Cloudflare
scrapling extract stealthy-fetch 'https://example.com/protected-page' result.html --solve-cloudflare
# Dynamic page rendering (wait 3 s)
scrapling extract dynamic-fetch 'https://example.com/dynamic-page' dynamic-result.txt --wait 3
# Help
scrapling --help
scrapling extract --help
Repository
https://github.com/D4Vinci/Scrapling
AI Architecture Path
Focused on AI open-source practice, sharing AI news, tools, technologies, learning resources, and GitHub projects.