How Crawl4AI Transforms Web Scraping with AI‑Powered Automation

Crawl4AI is an open‑source, AI‑powered tool that automates web crawling and data extraction. It is free to use, parses pages intelligently, produces structured JSON or Markdown output, and supports scrolling, multi‑URL scraping, and media and metadata extraction. This article demonstrates these features through step‑by‑step Python examples and integration with AI agents.


Crawl4AI is an open‑source AI‑driven web crawling tool that automates previously time‑consuming tasks, enabling developers to build intelligent agents for efficient data collection and analysis.

Key Features

Open‑source and free to use.

AI‑based element identification and parsing, which saves the time normally spent hand‑writing selectors.

Structured output in JSON, Markdown, etc., for easy analysis.

Supports scrolling, multi‑URL crawling, media tag extraction, metadata extraction, and screenshots.
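The multi‑URL crawling feature amounts to reusing one warmed‑up crawler instance across pages instead of recreating it per URL. A minimal sketch of that pattern (the `crawler` argument is assumed to follow the `WebCrawler` API shown in the steps below; this is an illustration, not the library's own helper):

```python
def crawl_all(crawler, urls):
    """Run one crawler instance over several URLs and collect Markdown keyed by URL."""
    results = {}
    for url in urls:
        # Each call reuses the same warmed-up crawler, avoiding repeated model loads.
        result = crawler.run(url=url)
        results[url] = result.markdown
    return results
```
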

Step‑by‑Step Guide

Step 1: Install

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk

Step 2: Basic Data Extraction

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")

# Print the extracted content
print(result.markdown)
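Because `result.markdown` is plain text, persisting it for later analysis is straightforward. A minimal sketch using only the standard library (the literal string here stands in for a real `result.markdown`; the filename is arbitrary):

```python
from pathlib import Path

# Stand-in for result.markdown produced by the crawl above.
markdown = "# Pricing\n\nGPT-4: US$10.00 / 1M input tokens\n"

# Write the extracted Markdown to disk so it can be diffed or analyzed later.
out_path = Path("pricing.md")
out_path.write_text(markdown, encoding="utf-8")
```
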

Step 3: Use an LLM for Structured Extraction

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this: {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}."""
    ),
    bypass_cache=True,
)

print(result.extracted_content)
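The extracted content comes back as a JSON string matching the schema, so downstream code can parse it with the standard library. A sketch, where the literal stands in for real crawler output (the exact shape of `result.extracted_content` is an assumption based on the schema above):

```python
import json

# Stand-in for result.extracted_content: a JSON array matching OpenAIModelFee.
extracted = '''[
    {"model_name": "GPT-4",
     "input_fee": "US$10.00 / 1M tokens",
     "output_fee": "US$30.00 / 1M tokens"}
]'''

# Parse the JSON string back into Python dicts for further processing.
fees = json.loads(extracted)
for fee in fees:
    print(f"{fee['model_name']}: {fee['input_fee']} in, {fee['output_fee']} out")
```
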

Step 4: Integrate with an AI Agent (CrewAI)

Install PraisonAI:

pip install praisonai

Create a tool wrapper (tools.py):

# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool

class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input token for the model.")
    output_fee: str = Field(..., description="Fee for output token for the model.")

class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fees for input and output tokens from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()
        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. Do not miss any models in the entire content. One extracted model JSON format should look like this: {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}."""
            ),
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)

Finally, configure the CrewAI framework so that its agents use the ModelFeeTool for web scraping, data cleaning, and analysis.
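As an illustration only, a PraisonAI‑style agents.yaml wiring ModelFeeTool into a CrewAI agent might look roughly like this. The field names and layout are assumptions based on PraisonAI's YAML conventions, not taken from the source; consult the PraisonAI documentation for the authoritative format.

```yaml
framework: crewai
topic: LLM API pricing
roles:
  scraper:
    role: Pricing Scraper
    goal: Extract model fees from pricing pages
    backstory: A web-scraping specialist focused on accurate pricing data
    tools:
      - ModelFeeTool
    tasks:
      scrape_pricing:
        description: Scrape the given pricing page and return model fees as JSON
        expected_output: A JSON list of model fee objects
```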

Conclusion

Crawl4AI is a powerful, free, open‑source tool that enables AI agents to perform web crawling and data extraction more efficiently and accurately with just a few lines of code, making it a valuable asset for developers building intelligent, data‑driven applications.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Automation, AI Agents, web-scraping, Crawl4AI, LLM extraction
Written by 21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.