How Crawl4AI Transforms Web Scraping with AI‑Powered Automation
Crawl4AI is a free, open‑source, AI‑driven web crawling tool that automates previously time‑consuming extraction tasks: it parses pages intelligently, returns structured JSON or Markdown, and supports scrolling, multi‑URL crawling, and media and metadata extraction. This guide walks through step‑by‑step Python examples, from basic crawling to schema‑based LLM extraction and integration with AI agents.
Key Features
Open‑source and free to use.
AI‑based element identification and parsing, saving manual selector work.
Structured output in JSON, Markdown, etc., for easy analysis.
Supports scrolling, multi‑URL crawling, media tag extraction, metadata extraction, and screenshots.
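Because results come back as structured JSON, downstream analysis is plain data wrangling. As a minimal illustration, the records below are hypothetical (they match the schema used in Step 3 of this guide), and the extracted content is processed with nothing but the standard library:

```python
import json

# Hypothetical extracted_content string, in the JSON shape produced by a
# schema-based extraction run (see Step 3 below).
extracted_content = json.dumps([
    {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens",
     "output_fee": "US$30.00 / 1M tokens"},
    {"model_name": "GPT-3.5", "input_fee": "US$0.50 / 1M tokens",
     "output_fee": "US$1.50 / 1M tokens"},
])

# Load the JSON string into Python objects and pick out the model names.
models = json.loads(extracted_content)
names = [m["model_name"] for m in models]
print(names)  # ['GPT-4', 'GPT-3.5']
```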
Step‑by‑Step Guide
Step 1: Install
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
Step 2: Basic Data Extraction
from crawl4ai import WebCrawler
# Create an instance of WebCrawler
crawler = WebCrawler()
# Warm up the crawler (load necessary models)
crawler.warmup()
# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")
# Print the extracted content
print(result.markdown)
Step 3: Use an LLM for Structured Extraction
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this: {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}."""
    ),
    bypass_cache=True,
)
print(result.extracted_content)
Step 4: Integrate with an AI Agent (CrewAI)
Install PraisonAI:
pip install praisonai
Create a tool wrapper (tools.py):
# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool
class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")

class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()
        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this: {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}."""
            ),
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)
Finally, configure the CrewAI framework to use the ModelFeeTool for web scraping, data cleaning, and analysis.
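The article refers to a configuration snippet that is not included here. As a hedged sketch only, a PraisonAI-style agents.yaml wiring the ModelFeeTool into a CrewAI agent might look roughly like the following; the role name, goal, backstory, and task wording are illustrative assumptions, not taken from the original:

```yaml
framework: crewai
topic: OpenAI model pricing
roles:
  web_scraper:
    role: Web Scraper
    goal: Extract model pricing data from the given page
    backstory: An expert at collecting structured data from websites.
    tools:
      - ModelFeeTool
    tasks:
      scrape_pricing:
        description: Scrape the pricing page and return all model fees.
        expected_output: A JSON list of models with input and output token fees.
```

Running praisonai against such a file would let the agent call the tool, then clean and analyze the returned JSON.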
Conclusion
Crawl4AI is a powerful, free, open‑source tool that enables AI agents to perform web crawling and data extraction more efficiently and accurately with just a few lines of code, making it a valuable asset for developers building intelligent, data‑driven applications.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
