# Web Scraping Agent

A three-way comparison of AI-powered web scraping frameworks: Crawl4AI, ScrapeGraphAI, and Firecrawl. Each scraper runs as a child agent, with the orchestrator comparing extraction quality and synthesizing results via an LLM call.
## Environment variables

This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
## Architecture

## Key Code
### Three Scraping Frameworks

Each scraper runs as a child agent with `@tool`-decorated functions exercising the instrumented methods.
```python
@waxell.tool(tool_type="web_scraper")
async def crawl_single(crawler: MockAsyncWebCrawler, url: str) -> dict:
    """Crawl a single URL with Crawl4AI (AsyncWebCrawler.arun)."""
    result = await crawler.arun(url=url)
    return {"url": url, "success": result.success, "content_length": len(result.markdown)}


@waxell.tool(tool_type="web_scraper")
def smart_scrape(prompt: str, source: str) -> dict:
    """Extract structured data with ScrapeGraphAI (SmartScraperGraph.run)."""
    scraper = MockSmartScraperGraph(prompt=prompt, source=source)
    result = scraper.run()
    return {"title": result["title"], "key_points": len(result["key_points"])}


@waxell.tool(tool_type="web_scraper")
def scrape_url(url: str) -> dict:
    """Scrape a single URL with Firecrawl (FirecrawlApp.scrape_url)."""
    firecrawl = MockFirecrawlApp()
    result = firecrawl.scrape_url(url)
    return {"url": url, "content_length": len(result["markdown"])}
```
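The `Mock*` classes themselves are not shown above. As a minimal sketch of what one of them might look like in dry-run mode (the class and method names come from the snippet above; the returned fields are assumptions shaped like a typical scrape payload):

```python
class MockFirecrawlApp:
    """Stand-in for a Firecrawl client in dry-run mode (payload fields assumed)."""

    def scrape_url(self, url: str) -> dict:
        # Return a canned response so the tool runs without network or API keys.
        return {
            "markdown": f"# Scraped content from {url}\n\nExample body text.",
            "metadata": {"sourceURL": url},
        }


def scrape_url(url: str) -> dict:
    """Same logic as the instrumented tool, minus the @waxell.tool decorator."""
    firecrawl = MockFirecrawlApp()
    result = firecrawl.scrape_url(url)
    return {"url": url, "content_length": len(result["markdown"])}
```

Because the mock is deterministic, the tool's output is stable across runs, which is what lets the dry-run mode exercise the full pipeline without credentials.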
### Quality Comparison and Scraper Selection

The orchestrator compares frameworks and selects the best one.
```python
@waxell.step_dec(name="compare_quality")
def compare_quality(crawl4ai: dict, scrapegraph: dict, firecrawl: dict) -> dict:
    return {
        "crawl4ai": {"type": "raw_crawl", "structured": False},
        "scrapegraphai": {"type": "llm_extraction", "structured": True},
        "firecrawl": {"type": "api_scraping", "structured": False},
    }


@waxell.decision(name="choose_best_scraper", options=["crawl4ai", "scrapegraphai", "firecrawl"])
async def choose_best_scraper(comparison: dict) -> dict:
    return {"chosen": "scrapegraphai", "reasoning": "Structured extraction with LLM analysis"}
```
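Stripped of the decorators, the comparison-then-decision flow reduces to the sketch below. Note one assumption: here the winner is derived from the `structured` flag rather than hardcoded, which is one plausible reading of why `scrapegraphai` is chosen.

```python
import asyncio


def compare_quality(crawl4ai: dict, scrapegraph: dict, firecrawl: dict) -> dict:
    # Same body as the @waxell.step_dec step above.
    return {
        "crawl4ai": {"type": "raw_crawl", "structured": False},
        "scrapegraphai": {"type": "llm_extraction", "structured": True},
        "firecrawl": {"type": "api_scraping", "structured": False},
    }


async def choose_best_scraper(comparison: dict) -> dict:
    # Prefer the only framework whose output is structured (assumed heuristic).
    structured = [name for name, info in comparison.items() if info["structured"]]
    return {"chosen": structured[0], "reasoning": "Structured extraction with LLM analysis"}


comparison = compare_quality({}, {}, {})
decision = asyncio.run(choose_best_scraper(comparison))
```

The decorated versions behave the same way; the decorators only add recording of the step inputs/outputs and the decision made.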
## What this demonstrates

- **Three scraping instrumentors** -- Crawl4AI (`AsyncWebCrawler.arun`/`arun_many`), ScrapeGraphAI (`SmartScraperGraph.run`/`SearchGraph.run`), and Firecrawl (`FirecrawlApp.scrape_url`/`crawl_url`).
- **Three child agents** -- each scraper runs in its own agent with `capture_io=True` for full I/O recording.
- **Structured vs. raw extraction** -- ScrapeGraphAI returns structured JSON; Crawl4AI and Firecrawl return raw markdown.
- **`@decision` for scraper selection** -- records the best scraper choice with reasoning.
- **Auto-instrumented LLM synthesis** -- OpenAI combines results from all three scrapers.
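The synthesis step is not shown in the snippets above. As a hedged sketch of how the three results might be folded into a single prompt before the OpenAI call (the helper name and prompt wording are assumptions, not the demo's actual code):

```python
def build_synthesis_prompt(crawl4ai: dict, scrapegraph: dict, firecrawl: dict) -> str:
    """Combine all three scraper results into one prompt for the LLM synthesis call."""
    lines = [
        "Compare these web scraping results and summarize which framework performed best:",
        f"- Crawl4AI (raw crawl): {crawl4ai}",
        f"- ScrapeGraphAI (structured extraction): {scrapegraph}",
        f"- Firecrawl (API scraping): {firecrawl}",
    ]
    return "\n".join(lines)
```

Because the OpenAI client is auto-instrumented, the synthesis call itself needs no decorator; passing a prompt like this to a chat-completion call is recorded automatically.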
## Run it

```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.web_scraping_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.web_scraping_agent
```