# Observability Platforms Agent
A multi-agent comparison demo that exercises 4 LLM observability platforms -- Arize Phoenix (`launch_app`, `run_evals`, `llm_classify`, `using_project`), Opik (`trace`, `log_traces`, `evaluate`, `log_spans_feedback_scores`), LangSmith (`create_run`, `update_run`, `create_feedback`, `evaluate`), and Langfuse (`trace`, `generation`, `span`, `score`) -- through a platform-runner child agent and a platform-evaluator child agent. It includes `@waxell.decision` for evaluation strategy selection, `waxell.decide()` for platform routing, and `@waxell.reasoning_dec` for coverage assessment.
This example runs in dry-run mode by default (no API key needed). For live mode, set `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`.
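How the entrypoint chooses between modes is not shown in this excerpt; a minimal sketch, assuming a `--dry-run` flag plus a silent fallback to dry-run when `OPENAI_API_KEY` is absent (both behaviors are assumptions):

```python
import argparse
import os

def resolve_mode() -> str:
    """Pick 'dry-run' unless live credentials are present and --dry-run is not passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    # Assumption: a missing OPENAI_API_KEY falls back to dry-run rather than erroring.
    if args.dry_run or not os.environ.get("OPENAI_API_KEY"):
        return "dry-run"
    return "live"
```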
## Architecture

An orchestrator agent delegates to two child agents: a platform-runner that invokes each platform's instrumentor methods, and a platform-evaluator that assesses coverage and emits per-platform scores.
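A minimal sketch of that shape, assuming `@waxell.observe` is an async-friendly decorator that names the span it wraps (the exact signature is an assumption):

```python
import waxell

@waxell.observe(name="platform-runner")
async def run_platforms(platforms: list[str]) -> dict:
    # Child agent 1: exercise each platform's instrumentor methods.
    return {name: {"type": "trace"} for name in platforms}

@waxell.observe(name="platform-evaluator")
async def evaluate_platforms(comparison: dict) -> dict:
    # Child agent 2: assess coverage and score each platform.
    return {"platforms_evaluated": len(comparison)}

@waxell.observe(name="observability-platforms-orchestrator")
async def orchestrate() -> dict:
    comparison = await run_platforms(["phoenix", "opik", "langsmith", "langfuse"])
    return await evaluate_platforms(comparison)
```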
## Key Code
### Platform-specific tool operations

Each platform's exact instrumentor methods are wrapped with `@waxell.tool(tool_type="observability")`.
```python
@waxell.tool(tool_type="observability")
def run_phoenix_evals(phoenix, dataframe, evaluators) -> dict:
    # Arize Phoenix: run a batch of evaluators over a dataframe.
    results = phoenix.evals.run_evals(dataframe=dataframe, evaluators=evaluators)
    return {"evaluator_count": len(evaluators), "result_sets": len(results)}

@waxell.tool(tool_type="observability")
def run_opik_trace(opik, name, query) -> dict:
    # Opik: log a single trace with its input and output.
    trace = opik.trace(name=name, input={"query": query}, output={"response": "analysis"})
    return {"trace_id": trace.id, "trace_name": trace.name}

@waxell.tool(tool_type="observability")
def run_langsmith_run(langsmith, name, query) -> dict:
    # LangSmith: create a run, attach outputs, then record feedback.
    run = langsmith.create_run(name=name, run_type="chain", inputs={"query": query})
    langsmith.update_run(run_id=run.id, outputs={"response": "comparison"})
    feedback = langsmith.create_feedback(run_id=run.id, key="quality", score=0.91)
    return {"run_id": run.id, "feedback_score": feedback["score"]}

@waxell.tool(tool_type="observability")
def run_langfuse_trace(langfuse, name, query, user_id, session_id) -> dict:
    # Langfuse: emit a trace, a generation, a span, and a score.
    trace = langfuse.trace(name=name, input={"query": query}, user_id=user_id)
    gen = langfuse.generation(name="analysis-gen", model="gpt-4o-mini", input={"prompt": query})
    span = langfuse.span(name="retrieval-step", input={"query": query})
    score = langfuse.score(name="helpfulness", value=0.88, trace_id=trace.id)
    return {"trace_id": trace.id, "gen_model": gen.model, "score_value": score["value"]}
```
### Coverage assessment and platform scoring

The evaluator reasons about platform coverage and produces per-platform quality scores.
```python
@waxell.reasoning_dec(step="platform_coverage_assessment")
async def assess_platform_coverage(comparison: dict) -> dict:
    return {
        "thought": f"Evaluated {len(comparison)} platforms. All support tracing and evaluation.",
        "evidence": [f"{name}: {v.get('type')}" for name, v in comparison.items()],
        "conclusion": "Choice depends on open-source preference and integration needs.",
    }

# Per-platform scores
waxell.score("phoenix_coverage", 0.88, comment="Open-source with strong evaluation")
waxell.score("opik_coverage", 0.86, comment="Commercial with good experiment tracking")
waxell.score("langsmith_coverage", 0.91, comment="Commercial with hub integration")
waxell.score("langfuse_coverage", 0.88, comment="Open-source with prompt management")
```
## What this demonstrates
- `@waxell.tool(tool_type="observability")` -- 4 platform tool calls covering all wrapt-target methods across Phoenix, Opik, LangSmith, and Langfuse.
- `@waxell.retrieval(source="platform_registry")` -- platform feature gathering (see the sketch after this list).
- `@waxell.reasoning_dec` -- platform coverage assessment.
- `@waxell.decision` -- evaluation strategy selection (comprehensive/quick/targeted) via OpenAI.
- `waxell.decide()` -- inline platform routing decision.
- `@waxell.step_dec` -- platform client preparation step.
- `waxell.score()` -- 4 per-platform coverage scores.
- Auto-instrumented LLM calls -- 2 OpenAI calls (strategy decision + synthesis).
- Nested `@waxell.observe` -- orchestrator + 2 child agents (platform-runner, platform-evaluator).
- 4 observability platforms compared -- Arize Phoenix (open-source), Opik (commercial), LangSmith (commercial), Langfuse (open-source).
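The `@waxell.retrieval` step is listed above but not excerpted; a hedged sketch of what the registry lookup could look like (the registry contents are inferred from the score comments, and everything beyond the `source` parameter is an assumption):

```python
PLATFORM_REGISTRY = {
    "phoenix": {"type": "open-source", "strengths": ["evaluation", "llm_classify"]},
    "opik": {"type": "commercial", "strengths": ["experiment tracking"]},
    "langsmith": {"type": "commercial", "strengths": ["hub integration"]},
    "langfuse": {"type": "open-source", "strengths": ["prompt management"]},
}

@waxell.retrieval(source="platform_registry")
def gather_platform_features(names: list[str]) -> dict:
    # Look up each requested platform's feature record from the in-memory registry.
    return {name: PLATFORM_REGISTRY[name] for name in names}
```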
## Run it
```bash
# Dry-run (no API key needed)
python -m app.demos.observability_platforms_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.observability_platforms_agent
```