Code Sandbox Agent

A multi-agent sandboxed code execution pipeline using E2B Code Interpreter. A parent orchestrator coordinates a sandbox-runner (executes code in sandboxes, handles errors) and a sandbox-evaluator (interprets results via LLM, scores quality).
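The control flow above can be sketched as a minimal stand-in in plain Python, with no real waxell or E2B calls: a parent coordinates a runner (executes snippets, records errors) and an evaluator (summarizes and scores). All names in this sketch are illustrative, not the demo's actual API.

```python
import asyncio

async def sandbox_runner(snippets):
    """Stand-in runner: 'execute' each snippet, recording stdout or an error."""
    results = []
    for code in snippets:
        try:
            out = str(eval(code))  # stand-in for sandbox.run_code(code)
            results.append({"stdout": out, "has_error": False})
        except Exception as exc:
            results.append({"stdout": "", "has_error": True,
                            "error_type": type(exc).__name__})
    return results

async def sandbox_evaluator(results):
    """Stand-in evaluator: reduce run results to a simple reliability score."""
    ok = sum(1 for r in results if not r["has_error"])
    return {"successful": ok, "total": len(results), "score": ok / len(results)}

async def orchestrate(snippets):
    # parent orchestrator: run first, then evaluate
    results = await sandbox_runner(snippets)
    return await sandbox_evaluator(results)

summary = asyncio.run(orchestrate(["1 + 1", "sum(range(5))", "1 / 0"]))
print(summary)  # 2 of 3 snippets succeed; the division by zero is caught
```

The real pipeline swaps `eval` for E2B sandbox execution and the score computation for an LLM-backed evaluator, but the runner/evaluator split is the same.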

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

Sandbox Tool Calls

Four @tool-decorated functions execute code in the E2B sandbox with tool_type="sandbox"; three are shown below.

@waxell.tool(tool_type="sandbox")
async def run_basic_code(sandbox, code: str) -> dict:
    """Execute basic code in sandbox to verify connectivity."""
    result = await sandbox.run_code(code)
    stdout = "\n".join(result.logs.stdout)
    return {
        "sandbox_id": sandbox.sandbox_id,
        "stdout": stdout,
        "has_error": bool(result.error),
    }

@waxell.tool(tool_type="sandbox")
async def run_statistics(sandbox, code: str) -> dict:
    """Execute statistical computation in sandbox."""
    result = await sandbox.run_code(code)
    return {"code_type": "statistics", "stdout": "\n".join(result.logs.stdout)}

@waxell.tool(tool_type="sandbox")
async def run_error_test(sandbox, code: str) -> dict:
    """Execute code expected to error, testing sandbox error handling."""
    result = await sandbox.run_code(code)
    return {"has_error": bool(result.error),
            "error_type": result.error.name if result.error else None}
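The tools above all unpack the same execution-result shape: `result.logs.stdout` as a list of lines and `result.error` with a `.name`. A minimal stand-in of that shape, using illustrative dataclasses rather than the real `e2b_code_interpreter` types, shows how the `run_error_test` return value is derived:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for the E2B-style result object; the real SDK
# provides these types, so the class names here are assumptions.
@dataclass
class Logs:
    stdout: list = field(default_factory=list)

@dataclass
class ExecError:
    name: str

@dataclass
class Execution:
    logs: Logs
    error: Optional[ExecError] = None

def summarize(result: Execution) -> dict:
    # mirrors the dict the run_error_test tool returns
    return {"has_error": bool(result.error),
            "error_type": result.error.name if result.error else None}

ok = Execution(logs=Logs(stdout=["hello"]))
bad = Execution(logs=Logs(), error=ExecError(name="ZeroDivisionError"))
print(summarize(ok))   # {'has_error': False, 'error_type': None}
print(summarize(bad))  # {'has_error': True, 'error_type': 'ZeroDivisionError'}
```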

Execution Quality Assessment

The evaluator uses @reasoning and @decision to assess and format results.

@waxell.reasoning_dec(step="execution_quality_assessment")
async def assess_execution_quality(results: dict) -> dict:
    return {
        "thought": f"Executed {results['total_executions']} code snippets. "
                   f"{results['successful']}/{results['total_executions']} completed successfully.",
        "evidence": [f"successful_runs: {results['successful']}", "errors_caught: 1"],
        "conclusion": "Sandbox execution is reliable with proper error handling",
    }

waxell.score("execution_reliability", 0.92, comment="3/4 successful, 1 intentional error caught")
waxell.score("output_quality", 0.85, comment="Rich statistical and analytical output")
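The @decision side of the pair (output format selection, recorded as a structured span) is not shown above. A minimal stand-in sketch of what such a step might look like, using a stub decorator rather than the real waxell one, whose name and fields are assumptions:

```python
# Stub decorator that only mimics the shape of a decision span;
# the real decorator lives in waxell.
def decision(step):
    def wrap(fn):
        fn._span = {"kind": "decision", "step": step}
        return fn
    return wrap

@decision(step="output_format_selection")
def choose_output_format(results: dict) -> dict:
    # pick a richer format only when every run produced output
    fmt = ("detailed_report"
           if results["successful"] == results["total_executions"]
           else "summary")
    return {
        "decision": fmt,
        "alternatives": ["summary", "detailed_report"],
        "rationale": f"{results['successful']}/{results['total_executions']} runs succeeded",
    }

choice = choose_output_format({"successful": 3, "total_executions": 4})
print(choice["decision"])  # summary
```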

What this demonstrates

  • E2B sandbox instrumentation -- Sandbox.run_code and AsyncSandbox.run_code calls traced with tool_type="sandbox".
  • Multi-agent orchestration -- sandbox-runner executes code, sandbox-evaluator interprets results via LLM.
  • Error handling tracing -- intentional errors caught and recorded as part of the trace.
  • @reasoning + @decision -- quality assessment and output format selection recorded as structured spans.
  • Auto-instrumented LLM -- the evaluator's OpenAI call is captured without extra code.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.code_sandbox_agent --dry-run

# Live mode (requires the environment variables listed above)
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.code_sandbox_agent

Source

dev/waxell-dev/app/demos/code_sandbox_agent.py