Code Sandbox Agent
A multi-agent sandboxed code execution pipeline using E2B Code Interpreter. A parent orchestrator coordinates a sandbox-runner (executes code in sandboxes, handles errors) and a sandbox-evaluator (interprets results via LLM, scores quality).
Environment variables
This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
Architecture
Key Code
Sandbox Tool Calls
Four `@tool`-decorated functions (three shown here) execute code in the E2B sandbox with `tool_type="sandbox"`.
```python
@waxell.tool(tool_type="sandbox")
async def run_basic_code(sandbox, code: str) -> dict:
    """Execute basic code in sandbox to verify connectivity."""
    result = await sandbox.run_code(code)
    stdout = "\n".join(result.logs.stdout)
    return {
        "sandbox_id": sandbox.sandbox_id,
        "stdout": stdout,
        "has_error": bool(result.error),
    }


@waxell.tool(tool_type="sandbox")
async def run_statistics(sandbox, code: str) -> dict:
    """Execute statistical computation in sandbox."""
    result = await sandbox.run_code(code)
    return {"code_type": "statistics", "stdout": "\n".join(result.logs.stdout)}


@waxell.tool(tool_type="sandbox")
async def run_error_test(sandbox, code: str) -> dict:
    """Execute code expected to error, testing sandbox error handling."""
    result = await sandbox.run_code(code)
    return {
        "has_error": bool(result.error),
        "error_type": result.error.name if result.error else None,
    }
```
Execution Quality Assessment
The evaluator uses `@reasoning` and `@decision` to assess and format results.
```python
@waxell.reasoning_dec(step="execution_quality_assessment")
async def assess_execution_quality(results: dict) -> dict:
    return {
        "thought": f"Executed {results['total_executions']} code snippets. "
                   f"{results['successful']}/{results['total_executions']} completed successfully.",
        "evidence": [f"successful_runs: {results['successful']}", "errors_caught: 1"],
        "conclusion": "Sandbox execution is reliable with proper error handling",
    }
```
```python
waxell.score("execution_reliability", 0.92, comment="3/4 successful, 1 intentional error caught")
waxell.score("output_quality", 0.85, comment="Rich statistical and analytical output")
```
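The scores above are hand-assigned in the demo. If you wanted to derive a reliability score mechanically from the evaluator's `results` dict instead, one possible approach (the `execution_reliability` helper is hypothetical, not part of waxell) counts a run as "behaving as intended" when it either succeeded or raised an error that was deliberately triggered and caught:

```python
def execution_reliability(results: dict, expected_errors: int = 1) -> float:
    """Fraction of runs that behaved as intended: successful runs plus
    intentionally triggered errors that the sandbox caught."""
    total = results["total_executions"]
    if total == 0:
        return 0.0
    intended = results["successful"] + min(expected_errors, results.get("errors_caught", 0))
    return round(intended / total, 2)


# 3 successes + 1 intentional, caught error out of 4 runs -> 1.0
print(execution_reliability({"total_executions": 4, "successful": 3, "errors_caught": 1}))
```

A derived score like this keeps the metric reproducible across runs, at the cost of losing the human judgment encoded in the hand-picked 0.92.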
What this demonstrates
- E2B sandbox instrumentation -- `Sandbox.run_code` and `AsyncSandbox.run_code` calls traced with `tool_type="sandbox"`.
- Multi-agent orchestration -- `sandbox-runner` executes code, `sandbox-evaluator` interprets results via LLM.
- Error handling tracing -- intentional errors caught and recorded as part of the trace.
- `@reasoning` + `@decision` -- quality assessment and output format selection recorded as structured spans.
- Auto-instrumented LLM -- the evaluator's OpenAI call is captured without extra code.
Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.code_sandbox_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.code_sandbox_agent
```