Web Scraping Agent

A three-way comparison of AI-powered web scraping frameworks: Crawl4AI, ScrapeGraphAI, and Firecrawl. Each scraper runs as a child agent, with the orchestrator comparing extraction quality and synthesizing results via an LLM call.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.
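For live runs, all three variables must be set before launching the demo. The values below are placeholders, not real credentials or endpoints:

```shell
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="https://..."
```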

Architecture

Key Code

Three Scraping Frameworks

Each scraper runs as a child agent with @tool-decorated functions exercising the instrumented methods.

@waxell.tool(tool_type="web_scraper")
async def crawl_single(crawler: MockAsyncWebCrawler, url: str) -> dict:
    """Crawl a single URL with Crawl4AI (AsyncWebCrawler.arun)."""
    result = await crawler.arun(url=url)
    return {"url": url, "success": result.success, "content_length": len(result.markdown)}

@waxell.tool(tool_type="web_scraper")
def smart_scrape(prompt: str, source: str) -> dict:
    """Extract structured data with ScrapeGraphAI (SmartScraperGraph.run)."""
    scraper = MockSmartScraperGraph(prompt=prompt, source=source)
    result = scraper.run()
    return {"title": result["title"], "key_points": len(result["key_points"])}

@waxell.tool(tool_type="web_scraper")
def scrape_url(url: str) -> dict:
    """Scrape a single URL with Firecrawl (FirecrawlApp.scrape_url)."""
    firecrawl = MockFirecrawlApp()
    result = firecrawl.scrape_url(url)
    return {"url": url, "content_length": len(result["markdown"])}
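Taken together, the orchestrator can fan the same URL out to all three tools and collect the per-framework results. The sketch below illustrates that flow with the waxell decorators omitted and simplified stand-ins for the mock classes (the `MockAsyncWebCrawler` here and the canned return values are assumptions for illustration, not the demo's actual mocks):

```python
import asyncio

class MockAsyncWebCrawler:
    """Simplified stand-in that returns a canned successful crawl result."""
    class _Result:
        success = True
        markdown = "# Example\nSome crawled content."
    async def arun(self, url: str):
        return self._Result()

async def crawl_single(crawler, url: str) -> dict:
    result = await crawler.arun(url=url)
    return {"url": url, "success": result.success, "content_length": len(result.markdown)}

def smart_scrape(prompt: str, source: str) -> dict:
    # Stand-in for SmartScraperGraph.run: structured JSON extraction.
    result = {"title": "Example", "key_points": ["a", "b", "c"]}
    return {"title": result["title"], "key_points": len(result["key_points"])}

def scrape_url(url: str) -> dict:
    # Stand-in for FirecrawlApp.scrape_url: raw markdown scrape.
    result = {"markdown": "# Example page"}
    return {"url": url, "content_length": len(result["markdown"])}

async def fan_out(url: str) -> dict:
    """Run all three scrapers against one URL and collect their outputs."""
    crawl = await crawl_single(MockAsyncWebCrawler(), url)
    return {
        "crawl4ai": crawl,
        "scrapegraphai": smart_scrape("Extract key points", url),
        "firecrawl": scrape_url(url),
    }

results = asyncio.run(fan_out("https://example.com"))
```

In the real demo each of these runs inside its own child agent; this sketch only shows the fan-out shape of the data the orchestrator receives.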

Quality Comparison and Scraper Selection

The orchestrator compares frameworks and selects the best one.

@waxell.step_dec(name="compare_quality")
def compare_quality(crawl4ai: dict, scrapegraph: dict, firecrawl: dict) -> dict:
    return {
        "crawl4ai": {"type": "raw_crawl", "structured": False},
        "scrapegraphai": {"type": "llm_extraction", "structured": True},
        "firecrawl": {"type": "api_scraping", "structured": False},
    }

@waxell.decision(name="choose_best_scraper", options=["crawl4ai", "scrapegraphai", "firecrawl"])
async def choose_best_scraper(comparison: dict) -> dict:
    return {"chosen": "scrapegraphai", "reasoning": "Structured extraction with LLM analysis"}
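The demo hard-codes the winner, but the same decision could be derived from the comparison itself by keying off the `structured` flag. A hypothetical sketch with plain functions (waxell decorators omitted; the fallback branch is an assumption, not part of the demo):

```python
def compare_quality() -> dict:
    # Same comparison matrix as the demo, without the per-scraper inputs.
    return {
        "crawl4ai": {"type": "raw_crawl", "structured": False},
        "scrapegraphai": {"type": "llm_extraction", "structured": True},
        "firecrawl": {"type": "api_scraping", "structured": False},
    }

def choose_best_scraper(comparison: dict) -> dict:
    # Prefer the first framework that yields structured output.
    for name, verdict in comparison.items():
        if verdict["structured"]:
            return {"chosen": name, "reasoning": f"{verdict['type']} produces structured output"}
    return {"chosen": "crawl4ai", "reasoning": "no structured option; fall back to raw crawl"}

choice = choose_best_scraper(compare_quality())
```

With the comparison above this always selects `scrapegraphai`, matching the demo's recorded decision.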

What this demonstrates

  • Three scraping instrumentors -- Crawl4AI (AsyncWebCrawler.arun/arun_many), ScrapeGraphAI (SmartScraperGraph.run/SearchGraph.run), and Firecrawl (FirecrawlApp.scrape_url/crawl_url).
  • Three child agents -- each scraper runs in its own agent with capture_io=True for full I/O recording.
  • Structured vs. raw extraction -- ScrapeGraphAI returns structured JSON, Crawl4AI and Firecrawl return raw markdown.
  • @decision for scraper selection -- records the best scraper choice with reasoning.
  • Auto-instrumented LLM synthesis -- OpenAI combines results from all three scrapers.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.web_scraping_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.web_scraping_agent

Source

dev/waxell-dev/app/demos/web_scraping_agent.py