Code Sandbox Agent

A multi-agent sandboxed code execution pipeline using E2B Code Interpreter. A parent orchestrator coordinates a sandbox-runner (executes code in sandboxes, handles errors) and a sandbox-evaluator (interprets results via LLM, scores quality).
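The control flow above can be sketched as a minimal stand-in in plain Python, with no real waxell or E2B calls: a parent coordinates a runner (executes snippets, records errors) and an evaluator (summarizes and scores). All names in this sketch are illustrative, not the demo's actual API.

```python
import asyncio

async def sandbox_runner(snippets):
    """Stand-in runner: 'execute' each snippet, recording stdout or an error."""
    results = []
    for code in snippets:
        try:
            out = str(eval(code))  # stand-in for sandbox.run_code(code)
            results.append({"stdout": out, "has_error": False})
        except Exception as exc:
            results.append({"stdout": "", "has_error": True,
                            "error_type": type(exc).__name__})
    return results

async def sandbox_evaluator(results):
    """Stand-in evaluator: reduce run results to a simple reliability score."""
    ok = sum(1 for r in results if not r["has_error"])
    return {"successful": ok, "total": len(results), "score": ok / len(results)}

async def orchestrate(snippets):
    # parent orchestrator: run first, then evaluate
    results = await sandbox_runner(snippets)
    return await sandbox_evaluator(results)

summary = asyncio.run(orchestrate(["1 + 1", "sum(range(5))", "1 / 0"]))
print(summary)  # 2 of 3 snippets succeed; the division by zero is caught
```

The real pipeline swaps `eval` for E2B sandbox execution and the score computation for an LLM-backed evaluator, but the runner/evaluator split is the same.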

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

Sandbox Tool Calls

Four @tool-decorated functions execute code in the E2B sandbox with tool_type="sandbox"; three are shown below.

@waxell.tool(tool_type="sandbox")
async def run_basic_code(sandbox, code: str) -> dict:
    """Execute basic code in sandbox to verify connectivity."""
    result = await sandbox.run_code(code)
    stdout = "\n".join(result.logs.stdout)
    return {
        "sandbox_id": sandbox.sandbox_id,
        "stdout": stdout,
        "has_error": bool(result.error),
    }

@waxell.tool(tool_type="sandbox")
async def run_statistics(sandbox, code: str) -> dict:
    """Execute statistical computation in sandbox."""
    result = await sandbox.run_code(code)
    return {"code_type": "statistics", "stdout": "\n".join(result.logs.stdout)}

@waxell.tool(tool_type="sandbox")
async def run_error_test(sandbox, code: str) -> dict:
    """Execute code expected to error, testing sandbox error handling."""
    result = await sandbox.run_code(code)
    return {"has_error": bool(result.error),
            "error_type": result.error.name if result.error else None}
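The tools above all unpack the same execution-result shape: `result.logs.stdout` as a list of lines and `result.error` with a `.name`. A minimal stand-in of that shape, using illustrative dataclasses rather than the real `e2b_code_interpreter` types, shows how the `run_error_test` return value is derived:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for the E2B-style result object; the real SDK
# provides these types, so the class names here are assumptions.
@dataclass
class Logs:
    stdout: list = field(default_factory=list)

@dataclass
class ExecError:
    name: str

@dataclass
class Execution:
    logs: Logs
    error: Optional[ExecError] = None

def summarize(result: Execution) -> dict:
    # mirrors the dict the run_error_test tool returns
    return {"has_error": bool(result.error),
            "error_type": result.error.name if result.error else None}

ok = Execution(logs=Logs(stdout=["hello"]))
bad = Execution(logs=Logs(), error=ExecError(name="ZeroDivisionError"))
print(summarize(ok))   # {'has_error': False, 'error_type': None}
print(summarize(bad))  # {'has_error': True, 'error_type': 'ZeroDivisionError'}
```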

Execution Quality Assessment

The evaluator uses @reasoning and @decision to assess and format results.

@waxell.reasoning_dec(step="execution_quality_assessment")
async def assess_execution_quality(results: dict) -> dict:
    return {
        "thought": f"Executed {results['total_executions']} code snippets. "
                   f"{results['successful']}/{results['total_executions']} completed successfully.",
        "evidence": [f"successful_runs: {results['successful']}", "errors_caught: 1"],
        "conclusion": "Sandbox execution is reliable with proper error handling",
    }

waxell.score("execution_reliability", 0.92, comment="3/4 successful, 1 intentional error caught")
waxell.score("output_quality", 0.85, comment="Rich statistical and analytical output")
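The @decision side of the pair (output format selection, recorded as a structured span) is not shown above. A minimal stand-in sketch of what such a step might look like, using a stub decorator rather than the real waxell one, whose name and fields are assumptions:

```python
# Stub decorator that only mimics the shape of a decision span;
# the real decorator lives in waxell.
def decision(step):
    def wrap(fn):
        fn._span = {"kind": "decision", "step": step}
        return fn
    return wrap

@decision(step="output_format_selection")
def choose_output_format(results: dict) -> dict:
    # pick a richer format only when every run produced output
    fmt = ("detailed_report"
           if results["successful"] == results["total_executions"]
           else "summary")
    return {
        "decision": fmt,
        "alternatives": ["summary", "detailed_report"],
        "rationale": f"{results['successful']}/{results['total_executions']} runs succeeded",
    }

choice = choose_output_format({"successful": 3, "total_executions": 4})
print(choice["decision"])  # summary
```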

What this demonstrates

  • E2B sandbox instrumentation -- Sandbox.run_code and AsyncSandbox.run_code calls traced with tool_type="sandbox".
  • Multi-agent orchestration -- sandbox-runner executes code, sandbox-evaluator interprets results via LLM.
  • Error handling tracing -- intentional errors caught and recorded as part of the trace.
  • @reasoning + @decision -- quality assessment and output format selection recorded as structured spans.
  • Auto-instrumented LLM -- the evaluator's OpenAI call is captured without extra code.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.code_sandbox_agent --dry-run

# Live mode (requires the environment variables listed above)
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.code_sandbox_agent

Source

dev/waxell-dev/app/demos/code_sandbox_agent.py