Provider Routing
Drop a one-liner into any existing agent and pick up:
- Cross-provider fallback: try Fireworks first, fall through to OpenAI on a rate limit
- Per-instance secrets: each provider account uses its own env var, so secrets never collide
- Capability filtering: automatically skip instances without tool support when a call passes tools=[...]
- Group references: point at "group:cheap-llama-70b" and let the chain decide
- Per-call telemetry: the same provider and cost attribution that Waxell runtime users get
You don't need the Waxell runtime to use any of this. Configure your
provider instances at /settings/llm-routing in the controlplane;
your agent code calls waxell.llm.call(...), and dispatch is driven
by the same provider configuration the Waxell runtime reads.
Quick start
```shell
pip install 'waxell-observe[all-providers]'
```
Configure your API keys in env (the same names you'd otherwise pass to the SDKs directly):
```shell
export OPENAI_API_KEY=sk-...
export FIREWORKS_API_KEY=fw-...
# ...plus whatever secret_ref names you set in /settings/llm-routing
```
Then in code:
```python
import waxell_observe as waxell

waxell.init()  # reads WAXELL_API_KEY + WAXELL_API_URL from env

response = waxell.llm.call(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
That's it: no runtime, no Django, no servers. On the first call the SDK pulls your provider config from the controlplane (cached for 5 minutes and revalidated by ETag), resolves the model, dispatches through the right provider SDK, and emits the same telemetry runtime users get.
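The manifest caching described above (a 5-minute freshness window plus ETag revalidation) can be pictured as a small time-to-live cache that issues a conditional request once the window expires. This is an illustrative sketch under assumptions, not the SDK's code: ManifestCache and the injected fetch callable are hypothetical, and a real client would send the stored ETag in an If-None-Match header and treat a 304 as "unchanged".

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

# Hypothetical fetcher: takes the cached ETag (or None) and returns
# (HTTP status, ETag, parsed body). A real client would send If-None-Match.
Fetcher = Callable[[Optional[str]], Tuple[int, Optional[str], Optional[dict]]]


@dataclass
class ManifestCache:
    fetch: Fetcher
    ttl_seconds: float = 300.0  # the 5-minute freshness window
    clock: Callable[[], float] = time.monotonic
    _etag: Optional[str] = field(default=None, init=False)
    _manifest: Optional[dict] = field(default=None, init=False)
    _fetched_at: float = field(default=float("-inf"), init=False)

    def get(self) -> dict:
        now = self.clock()
        if self._manifest is not None and now - self._fetched_at < self.ttl_seconds:
            return self._manifest  # still fresh: no network round trip
        status, etag, body = self.fetch(self._etag)
        if status == 304 and self._manifest is not None:
            pass  # unchanged on the server: reuse the cached body
        elif status == 200 and body is not None:
            self._manifest, self._etag = body, etag
        else:
            raise RuntimeError(f"unexpected manifest response: {status}")
        self._fetched_at = now  # either way, restart the TTL window
        return self._manifest
```

Injecting both the fetcher and the clock keeps the cache testable without a network or real time passing.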
How it works
```text
your code                                  waxell-observe                controlplane
    |                                             |                            |
    | waxell.llm.call(model="...", ...)           |                            |
    |--------------------------------------------->|                            |
    |                                             | GET /llm-config/manifest   |
    |                                             |--------------------------->|
    |                                             |<------------- 200 / 304 ---|
    |                                             | resolve_chain(model)       |
    |                                             | filter_chain_for_mode()    |
    |                                             | os.environ[secret_ref]     |
    |                                             | openai.OpenAI(base_url=)   |
    |                                             | .chat.completions.create   |
    |<---- SDK response object (no wrapper) ------|                            |
```
The SDK's Manifest is a snapshot of:
- instances: your registered provider accounts (kind, base_url, secret_ref name, capabilities)
- tenant_models: model_id → instance_id mappings (e.g. llama-3.1-70b-instruct lives on fireworks-prod)
- groups: ordered fallback chains (e.g. cheap-llama-70b → Fireworks first, Ollama as fallback)
- capability_overrides: per-(instance, model) tri-state flag pins

Models are resolved against the manifest at dispatch time; the manifest itself is cached by ETag.
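To make the four parts concrete, here is one possible in-memory shape for a manifest. The four top-level field names come from the list above; the Instance class, its fields, and the instance_for helper are hypothetical choices for illustration, and the real Manifest type may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instance:
    instance_id: str
    kind: str                   # e.g. "openai", "openai_compat"
    base_url: Optional[str]     # None means "SDK default"
    secret_ref: str             # env var NAME, never the secret value
    capabilities: frozenset = frozenset()


@dataclass
class Manifest:
    instances: Dict[str, Instance] = field(default_factory=dict)
    # model_id -> instance_id
    tenant_models: Dict[str, str] = field(default_factory=dict)
    # group name -> ordered [(instance_id, provider model name)]
    groups: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)
    # (instance_id, model) -> {capability: True / False / None}
    capability_overrides: Dict[Tuple[str, str], Dict[str, Optional[bool]]] = field(
        default_factory=dict
    )

    def instance_for(self, model_id: str) -> Optional[Instance]:
        """Return the instance a TenantModel maps to, or None if unmapped."""
        instance_id = self.tenant_models.get(model_id)
        return self.instances.get(instance_id) if instance_id else None
```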
Resolving a model
waxell.llm.call(model=...) accepts three shapes:
| Shape | Example | What happens |
|---|---|---|
| Plain | "gpt-4o" | Look up TenantModel; if missing, use the default instance for the prefix-inferred kind. |
| Qualified | "fireworks-prod/llama-3.1-70b-instruct" | Use that exact instance with that exact model name. |
| Group | "group:cheap-llama-70b" | Walk the group's entries in declared order. |
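The three shapes can be told apart with simple string checks. The sketch below borrows the resolve_chain name from the diagram above, but its signature and rules are assumptions: a name counts as qualified only when the text before the first slash is a known instance id (so slash-containing provider model ids are not misread), and the PREFIX_KINDS table for inferring a kind is made up for illustration.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, str]  # (instance_id, provider model name)

# Hypothetical prefix rules for plain names with no TenantModel entry.
PREFIX_KINDS = {"gpt-": "openai", "claude-": "anthropic", "gemini-": "gemini"}


def infer_kind(model: str) -> str:
    for prefix, kind in PREFIX_KINDS.items():
        if model.startswith(prefix):
            return kind
    raise LookupError(f"cannot infer a provider kind for {model!r}")


def resolve_chain(
    model: str,
    tenant_models: Dict[str, str],
    groups: Dict[str, List[Entry]],
    default_instance_by_kind: Dict[str, str],
    instance_ids: Set[str],
) -> List[Entry]:
    if model.startswith("group:"):
        # Group: every entry, in declared order.
        return list(groups[model[len("group:"):]])
    head, sep, tail = model.partition("/")
    if sep and head in instance_ids:
        # Qualified: split on the first slash only, so Fireworks-style ids
        # such as "accounts/fireworks/models/..." survive intact.
        return [(head, tail)]
    if model in tenant_models:
        # Plain, with a TenantModel mapping.
        return [(tenant_models[model], model)]
    # Plain, unmapped: default instance for the prefix-inferred kind.
    return [(default_instance_by_kind[infer_kind(model)], model)]
```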
Mode-specific helpers
```python
# Plain chat (default)
waxell.llm.text(model="gpt-4o", messages=[...])

# JSON mode: adds response_format={"type": "json_object"}
waxell.llm.json(model="gpt-4o", messages=[...])

# JSON with schema
waxell.llm.json(model="gpt-4o", messages=[...], schema={...})

# Tool calls
waxell.llm.tool(
    model="gpt-4o",
    messages=[...],
    tools=[{"type": "function", "function": {...}}],
)
```
The capability filter drops chain entries whose instances don't
advertise the required capability. If you ask for tool mode against
a chain whose only candidate is Ollama (no native tools), you'll get
NoCandidateForMode rather than a confusing OpenAI error from the
provider.
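A capability filter of this kind reduces to a list comprehension plus an empty-result check. In this sketch the mode names mirror the helpers above and native_tools matches the override example further down this page, while json_mode, the REQUIRED_CAPABILITY table, and the function signature are assumptions; only the NoCandidateForMode name comes from this page.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Entry = Tuple[str, str]  # (instance_id, provider model name)


class NoCandidateForMode(Exception):
    """Every entry in the chain lacks the capability the mode requires."""


# Hypothetical mode -> capability names; plain chat needs nothing extra.
REQUIRED_CAPABILITY: Dict[str, Optional[str]] = {
    "chat": None,
    "json": "json_mode",
    "tool": "native_tools",
}


def filter_chain_for_mode(
    chain: List[Entry],
    capabilities: Dict[str, FrozenSet[str]],
    mode: str,
) -> List[Entry]:
    required = REQUIRED_CAPABILITY[mode]
    if required is None:
        return list(chain)
    kept = [e for e in chain if required in capabilities.get(e[0], frozenset())]
    if not kept:
        # Fail before any network call instead of surfacing a provider error.
        raise NoCandidateForMode(f"no candidate in the chain supports mode {mode!r}")
    return kept
```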
Secrets — the contract
Provider instances reference an env var name, not a secret value.
The controlplane stores the name (secret_ref); your process reads
the value from os.environ[secret_ref] at dispatch time.
When the env var is unset:
```text
SecretNotInEnvironment: Provider instance 'fireworks-prod' references
env var FIREWORKS_API_KEY which is not set in this process. Either set
the env var, or change the instance's secret_ref in the controlplane
at /settings/llm-routing.
```
This intentionally mirrors how you'd already provide keys to the direct SDK — the dispatcher is opt-in, not magical. Keys never leave your process.
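The contract amounts to a lookup by name with a descriptive failure. This is a minimal sketch: resolve_secret is a hypothetical helper, the environ parameter exists only to make it testable, and treating an empty string as unset is an assumption.

```python
import os
from typing import Mapping, Optional


class SecretNotInEnvironment(Exception):
    pass


def resolve_secret(
    instance_id: str, secret_ref: str, environ: Optional[Mapping[str, str]] = None
) -> str:
    # The controlplane only ever supplies the NAME; the value stays local.
    env = os.environ if environ is None else environ
    value = env.get(secret_ref)
    if not value:  # assumption: empty counts as unset
        raise SecretNotInEnvironment(
            f"Provider instance {instance_id!r} references env var {secret_ref} "
            "which is not set in this process. Either set the env var, or change "
            "the instance's secret_ref in the controlplane at /settings/llm-routing."
        )
    return value
```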
Groups for cross-provider fallback
Define a group at /settings/llm-routing (or via the API), then
reference it as model="group:...":
```python
# Group "cheap-llama-70b" defined as:
#   1. fireworks-prod / accounts/fireworks/models/llama-v3p1-70b-instruct
#   2. groq-prod      / llama-3.1-70b-versatile
#   3. ollama-local   / llama3.1:70b
response = waxell.llm.call(
    model="group:cheap-llama-70b",
    messages=[...],
)
# If Fireworks rate-limits, dispatch retries against Groq.
# If Groq is down, it falls through to local Ollama.
# The capability filter drops Ollama if you pass tools=[...].
```
Fallback advances to the next entry only on retryable errors (rate limits, 5xx, connection errors, NotFound). Auth errors, BadRequest, and unknown errors raise immediately: they're caller bugs, not provider hiccups.
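The retry policy is a loop that catches only an allowlist of exception types. The exception classes below are hypothetical stand-ins for whatever a provider SDK raises, and dispatch_with_fallback is illustrative; the list of skipped entries it returns is one way the fallback_chain metadata mentioned under "What gets recorded" could be collected.

```python
from typing import Any, Callable, List, Tuple

Entry = Tuple[str, str]  # (instance_id, provider model name)


# Hypothetical stand-ins for provider SDK exceptions.
class RateLimitError(Exception): ...
class ServerError(Exception): ...  # any 5xx
class ConnectionFailure(Exception): ...
class NotFoundError(Exception): ...
class AuthError(Exception): ...
class BadRequestError(Exception): ...


RETRYABLE = (RateLimitError, ServerError, ConnectionFailure, NotFoundError)


class AllCandidatesFailed(Exception):
    pass


def dispatch_with_fallback(
    chain: List[Entry], call: Callable[[str, str], Any]
) -> Tuple[Any, List[Tuple[str, str]]]:
    skipped: List[Tuple[str, str]] = []  # (instance_id, error type) per skipped entry
    for instance_id, model in chain:
        # Exceptions outside RETRYABLE (auth, bad request, unknown) are not
        # caught here, so they propagate to the caller immediately.
        try:
            return call(instance_id, model), skipped
        except RETRYABLE as exc:
            # Provider hiccup: note it and move on to the next entry.
            skipped.append((instance_id, type(exc).__name__))
    raise AllCandidatesFailed(f"every candidate failed: {skipped}")
```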
Capability overrides
Sometimes a model that "should" support tools doesn't on a specific
provider. Override per-(instance, model) at /settings/llm-routing:
```yaml
override:
  instance: together-prod
  model: llama-3.1-70b-instruct
  native_tools: false  # tri-state: true / false / null
```
The dispatcher's capability filter treats overrides as a veto: an
explicit false skips this candidate even if its instance baseline
says true. null means "no override; defer to baseline."
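Resolving a tri-state pin against an instance baseline takes three branches. This page specifies false (veto) and null (defer to baseline); the sketch additionally assumes an explicit true pins the capability on, which should be confirmed against the actual dispatcher. passes_capability and its data shapes are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Overrides = Dict[Tuple[str, str], Dict[str, Optional[bool]]]


def passes_capability(
    instance_caps: Dict[str, FrozenSet[str]],
    overrides: Overrides,
    instance_id: str,
    model: str,
    capability: str,
) -> bool:
    pinned = overrides.get((instance_id, model), {}).get(capability)
    if pinned is False:
        return False  # explicit veto wins over the instance baseline
    if pinned is True:
        return True  # assumption: an explicit true pins the capability on
    # null or absent: defer to the instance baseline
    return capability in instance_caps.get(instance_id, frozenset())
```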
Inspect what would happen
```shell
wax llm call --model gpt-4o --show-config
```
Prints the resolved chain, which entries pass the capability filter, whether each entry's env var is set, and each candidate's base_url, without actually dispatching.
```text
Resolved chain for 'gpt-4o' (mode=chat)
┌───────────────────────────────────────────────────────────────┐
│ # │ instance_id │ kind   │ base_url      │ env set? │ passes? │
├───────────────────────────────────────────────────────────────┤
│ 1 │ oai-prod    │ openai │ (SDK default) │ ✓        │ ✓       │
└───────────────────────────────────────────────────────────────┘
```
Useful for:
- Debugging "why is this routing to provider X?"
- Verifying secret_ref env vars are set before running batch jobs
- Confirming capability filter behavior under tool / JSON modes
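The report's columns can be derived from data the SDK already holds, with no network call. A rough sketch, assuming plain-dict instance records and a hypothetical show_config_rows helper; it mimics the table above rather than reproducing the CLI's implementation, and it ignores capability overrides for brevity.

```python
import os
from typing import Any, Dict, List, Mapping, Optional, Tuple


def show_config_rows(
    chain: List[Tuple[str, str]],
    instances: Dict[str, Dict[str, Any]],
    required_capability: Optional[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[Dict[str, Any]]:
    env = os.environ if environ is None else environ
    rows = []
    for n, (instance_id, _model) in enumerate(chain, start=1):
        inst = instances[instance_id]
        rows.append({
            "#": n,
            "instance_id": instance_id,
            "kind": inst["kind"],
            "base_url": inst.get("base_url") or "(SDK default)",
            # Only checks presence; the secret value is never printed.
            "env set?": bool(env.get(inst["secret_ref"])),
            "passes?": required_capability is None
            or required_capability in inst["capabilities"],
        })
    return rows
```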
What gets recorded
Every waxell.llm.call(...) produces one LlmCallRecord in your
controlplane with:
- provider (e.g. "openai_compat")
- provider_instance_id (e.g. "fireworks-prod")
- model (the resolved provider model id)
- tokens_in, tokens_out, cost
- dispatch_source: "observe-sdk" (for analytics distinguishing observe-side dispatch from runtime dispatch)
- Plus any fallback_chain metadata, if the call walked past a retryable error before succeeding
The same LlmCallRecord ingest path stamps last_success_at on the
provider instance — your "is this Fireworks instance healthy?"
analytics work the same whether the call went through observe SDK
dispatch or the Waxell runtime.
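Here is one possible shape for the record, using the field names listed above. The per-million-token cost formula and the make_record helper are assumptions for illustration; the controlplane's actual pricing logic is not described on this page.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class LlmCallRecord:
    provider: str              # e.g. "openai_compat"
    provider_instance_id: str  # e.g. "fireworks-prod"
    model: str                 # resolved provider model id
    tokens_in: int
    tokens_out: int
    cost: float
    dispatch_source: str = "observe-sdk"
    # (instance_id, error type) for each entry skipped before success
    fallback_chain: List[Tuple[str, str]] = field(default_factory=list)


def make_record(
    provider: str,
    instance_id: str,
    model: str,
    tokens_in: int,
    tokens_out: int,
    usd_per_m_in: float,   # hypothetical price per million input tokens
    usd_per_m_out: float,  # hypothetical price per million output tokens
    skipped: Sequence[Tuple[str, str]] = (),
) -> LlmCallRecord:
    cost = (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000
    return LlmCallRecord(
        provider, instance_id, model, tokens_in, tokens_out, cost,
        fallback_chain=list(skipped),
    )
```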
Coexistence with raw SDK calls
Your existing direct SDK calls — openai.chat.completions.create(...),
anthropic.messages.create(...) — keep working unchanged. The
auto-instrumentor records them as before. The dispatcher sets a
context-var around its own SDK call so the instrumentor doesn't
double-record when both code paths run.
You can mix freely: dispatch when you want fallback / groups, direct SDK calls everywhere else. The tracing and cost attribution are unified.
Provider extras
Install only the SDKs you need:
```shell
pip install 'waxell-observe[openai]'      # OpenAI + all OpenAI-compat
pip install 'waxell-observe[anthropic]'   # Anthropic
pip install 'waxell-observe[fireworks]'   # alias for [openai]
pip install 'waxell-observe[together]'    # alias for [openai]
pip install 'waxell-observe[xai]'         # alias for [openai]
pip install 'waxell-observe[bedrock]'     # boto3
pip install 'waxell-observe[vertex]'      # google-cloud-aiplatform
pip install 'waxell-observe[gemini]'      # google-generativeai
pip install 'waxell-observe[cohere]'      # cohere
pip install 'waxell-observe[mistral]'     # mistralai
pip install 'waxell-observe[groq]'        # groq

# Or install the whole thing
pip install 'waxell-observe[all-providers]'
```
Most "OpenAI-compatible" providers (Fireworks, Together, Groq, xAI,
NVIDIA, Mistral, AI21, Replicate, Ollama, vLLM, HF TGI) work with just
the openai SDK because they speak the OpenAI HTTP wire format on a
different base_url. The dispatcher handles the base_url override
transparently.
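Because these providers accept the OpenAI wire format, the per-instance difference boils down to the constructor arguments handed to the openai client. A sketch assuming plain-dict instance records and a hypothetical openai_client_kwargs helper; openai.OpenAI does accept api_key and base_url, and omitting base_url keeps the SDK default endpoint.

```python
import os
from typing import Any, Dict, Mapping, Optional


def openai_client_kwargs(
    instance: Mapping[str, Any], environ: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    env = os.environ if environ is None else environ
    # Key comes from the env var the instance names (see the secrets contract).
    kwargs = {"api_key": env[instance["secret_ref"]]}
    if instance.get("base_url"):
        # OpenAI-compatible provider: same SDK, different endpoint.
        kwargs["base_url"] = instance["base_url"]
    return kwargs
```

Usage would look like client = openai.OpenAI(**openai_client_kwargs(instance)), after which client.chat.completions.create(...) is the same call for every compatible provider.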
When to use this vs the runtime
| Use observe SDK dispatch (this) when… | Use the Waxell runtime when… |
|---|---|
| You have an existing agent (LangGraph, LangChain, ad-hoc) | You're building a new agent from scratch |
| You want fallback + groups but not durable execution | You need durable execution + replay |
| You want to add Waxell to a serverless function | You want supervised + governed agent fleets |
| You want minimal migration cost | You want the full governance surface |
The data layer is shared — the same provider instances, groups, and capability overrides feed both adoption modes. You can start with observe SDK dispatch and graduate to the runtime later without re-configuring providers.
Reference
- agentforge/areas/llm-providers/plans/OBSERVE_DISPATCH_PLAN.md: design plan
- Provider Catalog: list of supported provider kinds
- Controlplane UI: /settings/llm-routing (configure your instances)
- Controlplane UI: /admin/llm-providers/ (cross-tenant admin, requires billing:admin)