Provider Routing

Drop a one-liner into any existing agent and pick up:

Cross-provider fallback: try Fireworks first, fall through to OpenAI on a rate limit
Per-instance secrets: each provider account uses its own env var, so secrets never collide
Capability filtering: automatically skip instances without tool support when a call passes tools=[...]
Group references: point at "group:cheap-llama-70b" and let the chain decide
Per-call telemetry: the same provider and cost attribution that Waxell runtime users get

You don't need the Waxell runtime to use any of this. Configure your provider instances at /settings/llm-routing in the controlplane; your agent code calls waxell.llm.call(...) and dispatch is driven by the same data.

Quick start

pip install 'waxell-observe[all-providers]'

Configure your API keys in the environment (the same variable names you'd otherwise use with the SDKs directly):

export OPENAI_API_KEY=sk-...
export FIREWORKS_API_KEY=fw-...
# Whatever secret_ref names you set in /settings/llm-routing

Then in code:

import waxell_observe as waxell

waxell.init()  # reads WAXELL_API_KEY + WAXELL_API_URL from env

response = waxell.llm.call(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

That's it. No runtime, no Django, no servers. The SDK pulls your provider config from the controlplane on first call (cached 5 minutes with ETag), resolves the model, dispatches through the right SDK, and emits the same telemetry runtime users get.
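The caching behaviour mentioned above (a 5-minute TTL plus ETag revalidation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDK's actual implementation: `ManifestCache`, the `fetch` callable, and its return shape are hypothetical names introduced here, and the HTTP request itself is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical fetcher: takes the last known ETag (or None) and returns
# (status_code, etag, body). A real client would send If-None-Match.
Fetcher = Callable[[Optional[str]], Tuple[int, Optional[str], Optional[dict]]]


@dataclass
class ManifestCache:
    fetch: Fetcher
    ttl_seconds: float = 300.0  # 5 minutes, per the text above
    _etag: Optional[str] = None
    _body: Optional[dict] = None
    _fetched_at: float = float("-inf")

    def get(self, now: Optional[float] = None) -> dict:
        now = time.monotonic() if now is None else now
        if self._body is not None and now - self._fetched_at < self.ttl_seconds:
            return self._body  # still fresh: no round trip at all
        status, etag, body = self.fetch(self._etag)
        if status == 200:
            self._etag, self._body = etag, body
        elif status != 304 or self._body is None:
            raise RuntimeError(f"unexpected manifest response: {status}")
        # On 304 the cached body is reused; either way the TTL restarts.
        self._fetched_at = now
        return self._body
```

Within the TTL no request is made; after it expires, a 304 response keeps the cached body and only restarts the clock.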

How it works

your code                                  waxell-observe              controlplane
  |                                            |                          |
  | waxell.llm.call(model="...", ...)          |                          |
  |------------------------------------------> |                          |
  |                                            | GET /llm-config/manifest |
  |                                            |------------------------> |
  |                                            | <----------- 200 / 304 - |
  |                                            | resolve_chain(model)     |
  |                                            | filter_chain_for_mode()  |
  |                                            | os.environ[secret_ref]   |
  |                                            | openai.OpenAI(base_url=) |
  |                                            | .chat.completions.create |
  | <---- SDK response object (no wrapper) --- |                          |

The SDK's Manifest is a snapshot of:

instances — your registered provider accounts (kind, base_url, secret_ref name, capabilities)
tenant_models — model_id → instance_id mappings (e.g. llama-3.1-70b-instruct lives on fireworks-prod)
groups — ordered fallback chains (cheap-llama-70b → fireworks first, ollama fallback)
capability_overrides — per-(instance, model) tri-state flag pins

The manifest is resolved at dispatch time and cached by ETag.
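To make the four parts of the manifest concrete, here is one way the snapshot could be modelled. The class and field layout below is an assumption for illustration; only the section names (instances, tenant_models, groups, capability_overrides) and the per-instance attributes come from the list above, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instance:
    instance_id: str
    kind: str                    # provider kind, e.g. "openai"
    base_url: Optional[str]      # None means "use the SDK default"
    secret_ref: str              # env var NAME, never the secret value
    capabilities: Dict[str, bool] = field(default_factory=dict)


@dataclass(frozen=True)
class GroupEntry:
    instance_id: str
    model: str


@dataclass(frozen=True)
class Manifest:
    instances: Dict[str, Instance]
    tenant_models: Dict[str, str]          # model_id -> instance_id
    groups: Dict[str, List[GroupEntry]]    # ordered fallback chains
    # (instance_id, model, capability) -> True / False / None
    capability_overrides: Dict[Tuple[str, str, str], Optional[bool]] = field(
        default_factory=dict
    )


# Illustrative values only.
manifest = Manifest(
    instances={
        "fireworks-prod": Instance(
            "fireworks-prod", "openai_compat",
            "https://api.fireworks.ai/inference/v1", "FIREWORKS_API_KEY",
            {"native_tools": True},
        ),
        "ollama-local": Instance(
            "ollama-local", "openai_compat",
            "http://localhost:11434/v1", "OLLAMA_API_KEY",
            {"native_tools": False},
        ),
    },
    tenant_models={"llama-3.1-70b-instruct": "fireworks-prod"},
    groups={
        "cheap-llama-70b": [
            GroupEntry("fireworks-prod", "llama-3.1-70b-instruct"),
            GroupEntry("ollama-local", "llama3.1:70b"),
        ]
    },
)
```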

Resolving a model

waxell.llm.call(model=...) accepts three shapes:

| Shape | Example | What happens |
| --- | --- | --- |
| Plain | `"gpt-4o"` | Look up TenantModel; if missing, use the default instance for the prefix-inferred kind. |
| Qualified | `"fireworks-prod/llama-3.1-70b-instruct"` | Use that exact instance with that exact model name. |
| Group | `"group:cheap-llama-70b"` | Walk the group's entries in declared order. |
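As a rough sketch of how the three shapes could be told apart, the function below classifies a model string. `ModelRef` and `parse_model_ref` are hypothetical names, and the rule "anything containing a slash is qualified" is an assumption; the real resolver may handle plain model ids that contain slashes differently.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    shape: str                   # "plain", "qualified", or "group"
    instance_id: Optional[str]   # only set for the qualified shape
    name: str                    # model name or group name


def parse_model_ref(model: str) -> ModelRef:
    if model.startswith("group:"):
        return ModelRef("group", None, model[len("group:"):])
    if "/" in model:
        # Split on the FIRST slash so provider model ids that themselves
        # contain slashes stay intact after the instance prefix.
        instance_id, _, name = model.partition("/")
        return ModelRef("qualified", instance_id, name)
    return ModelRef("plain", None, model)
```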

Mode-specific helpers

# Plain chat (default)
waxell.llm.text(model="gpt-4o", messages=[...])

# JSON mode — adds response_format={"type": "json_object"}
waxell.llm.json(model="gpt-4o", messages=[...])

# JSON with schema
waxell.llm.json(model="gpt-4o", messages=[...], schema={...})

# Tool calls
waxell.llm.tool(
    model="gpt-4o",
    messages=[...],
    tools=[{"type": "function", "function": {...}}],
)

The capability filter drops chain entries whose instances don't advertise the required capability. If you ask for tool mode against a chain whose only candidate is Ollama (no native tools), you'll get NoCandidateForMode rather than a confusing OpenAI error from the provider.
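The filtering step can be pictured with a small stand-alone sketch. The function name matches the one in the diagram, but its signature, the `Candidate` class, the mode-to-capability mapping, and the `json_mode` flag name are all assumptions made for illustration; only `native_tools` and `NoCandidateForMode` appear in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class NoCandidateForMode(Exception):
    """Local stand-in for the error named in the text above."""


@dataclass
class Candidate:
    instance_id: str
    model: str
    capabilities: Dict[str, bool] = field(default_factory=dict)


# Assumed mapping from call mode to the capability flag it requires.
REQUIRED_CAPABILITY: Dict[str, Optional[str]] = {
    "chat": None,
    "json": "json_mode",      # hypothetical flag name
    "tool": "native_tools",
}


def filter_chain_for_mode(chain: List[Candidate], mode: str) -> List[Candidate]:
    required = REQUIRED_CAPABILITY[mode]
    if required is None:
        return list(chain)  # plain chat: every candidate qualifies
    kept = [c for c in chain if c.capabilities.get(required, False)]
    if not kept:
        raise NoCandidateForMode(
            f"no instance in the chain advertises {required!r} (mode={mode!r})"
        )
    return kept
```

Failing before any provider is contacted is what keeps the error readable: the caller sees which capability was missing rather than a provider-side rejection.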

Secrets — the contract

Provider instances reference an env var name, not a secret value. The controlplane stores the name (secret_ref); your process reads the value from os.environ[secret_ref] at dispatch time.

When the env var is unset:

SecretNotInEnvironment: Provider instance 'fireworks-prod' references
env var FIREWORKS_API_KEY which is not set in this process. Either set
the env var, or change the instance's secret_ref in the controlplane
at /settings/llm-routing.

This intentionally mirrors how you'd already provide keys to the direct SDK — the dispatcher is opt-in, not magical. Keys never leave your process.
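The contract boils down to a lookup by name at dispatch time. A minimal sketch, assuming a helper called `resolve_secret` (a hypothetical name) and a locally defined exception that stands in for the SDK's `SecretNotInEnvironment`:

```python
import os
from typing import Mapping, Optional


class SecretNotInEnvironment(Exception):
    """Local stand-in for the SDK error shown above."""


def resolve_secret(
    instance_id: str,
    secret_ref: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    # The manifest only carries the variable NAME; the value is read from
    # this process's environment and is never sent anywhere by this helper.
    env = os.environ if env is None else env
    value = env.get(secret_ref)
    if not value:
        raise SecretNotInEnvironment(
            f"Provider instance {instance_id!r} references env var "
            f"{secret_ref} which is not set in this process."
        )
    return value
```

The optional `env` argument exists only so the lookup can be exercised without touching the real environment.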

Groups for cross-provider fallback

Define a group at /settings/llm-routing (or via the API), then reference it as model="group:...":

# Group "cheap-llama-70b" defined as:
#   1. fireworks-prod / accounts/fireworks/models/llama-v3p1-70b-instruct
#   2. groq-prod      / llama-3.1-70b-versatile
#   3. ollama-local   / llama3.1:70b

response = waxell.llm.call(
    model="group:cheap-llama-70b",
    messages=[...],
)
# If Fireworks rate-limits, dispatch retries against Groq.
# If Groq is down, falls through to local Ollama.
# Capability filter drops Ollama if you passed tools=[...].

Fallback walks on retryable errors only (rate limits, 5xx, connection errors, NotFound). Auth errors, BadRequest, and unknown errors raise immediately — they're caller bugs, not provider hiccups.
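The walk described above can be sketched as a loop that advances only on retryable failures. Everything here is illustrative: `RetryableError` stands in for the retryable classes (rate limits, 5xx, connection errors, NotFound), and `call_with_fallback` and the `attempt` callable are hypothetical names, not SDK API.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Stand-in for rate limits, 5xx, connection errors, NotFound."""


def call_with_fallback(
    chain: Sequence[str],
    attempt: Callable[[str], T],
) -> Tuple[T, List[str]]:
    skipped: List[str] = []
    last_error: Optional[RetryableError] = None
    for candidate in chain:
        try:
            # First success wins; report which candidates were skipped.
            return attempt(candidate), skipped
        except RetryableError as exc:
            skipped.append(candidate)
            last_error = exc
        # Any other exception type propagates immediately (no fallback).
    if last_error is None:
        raise LookupError("fallback chain is empty")
    raise last_error
```

If every candidate fails with a retryable error, the last one is re-raised so the caller still sees a concrete cause.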

Capability overrides

Sometimes a model that "should" support tools doesn't on a specific provider. Override per-(instance, model) at /settings/llm-routing:

override:
  instance: together-prod
  model: llama-3.1-70b-instruct
  native_tools: false   # tri-state: true / false / null

The dispatcher's capability filter treats an override as a veto: an explicit false skips the candidate even if its instance baseline says true. null means "no override; defer to the baseline."
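The tri-state logic is small enough to write out. In this sketch `effective_capability` is a hypothetical helper; the false and null cases follow the text above, while letting an explicit true win over a false baseline is an assumption the text does not confirm.

```python
from typing import Optional


def effective_capability(baseline: bool, override: Optional[bool]) -> bool:
    if override is None:
        return baseline   # null: no override, defer to the instance baseline
    # Explicit False vetoes the candidate. Explicit True enabling a
    # capability the baseline lacks is ASSUMED here, not documented.
    return override
```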

Inspect what would happen

wax llm call --model gpt-4o --show-config

Prints the resolved chain, which entries pass the capability filter, whether the env var is set, and the candidate's base_url — without actually dispatching.

Resolved chain for 'gpt-4o' (mode=chat)
┌──────────────────────────────────────────────────────────────────┐
│ # │ instance_id │ kind   │ base_url        │ env set? │ passes?  │
├──────────────────────────────────────────────────────────────────┤
│ 1 │ oai-prod    │ openai │ (SDK default)   │ ✓        │ ✓        │
└──────────────────────────────────────────────────────────────────┘

Useful for:

Debugging "why is this routing to provider X?"
Verifying secret_ref env vars are set before running batch jobs
Confirming capability filter behavior under tool / JSON modes

What gets recorded

Every waxell.llm.call(...) produces one LlmCallRecord in your controlplane with:

provider (e.g. "openai_compat")
provider_instance_id (e.g. "fireworks-prod")
model (the resolved provider model id)
tokens_in, tokens_out, cost
dispatch_source: "observe-sdk" (for analytics distinguishing observe-side dispatch from runtime dispatch)
Plus any fallback_chain metadata if the call walked past a retryable error before succeeding

The same LlmCallRecord ingest path stamps last_success_at on the provider instance — your "is this Fireworks instance healthy?" analytics work the same whether the call went through observe SDK dispatch or the Waxell runtime.

Coexistence with raw SDK calls

Your existing direct SDK calls — openai.chat.completions.create(...), anthropic.messages.create(...) — keep working unchanged. The auto-instrumentor records them as before. The dispatcher sets a context-var around its own SDK call so the instrumentor doesn't double-record when both code paths run.

You can mix freely: dispatch when you want fallback / groups, direct SDK calls everywhere else. The tracing and cost attribution are unified.

Provider extras

Install only the SDKs you need:

pip install 'waxell-observe[openai]'      # OpenAI + all OpenAI-compat
pip install 'waxell-observe[anthropic]'   # Anthropic
pip install 'waxell-observe[fireworks]'   # alias for [openai]
pip install 'waxell-observe[together]'    # alias for [openai]
pip install 'waxell-observe[xai]'         # alias for [openai]
pip install 'waxell-observe[bedrock]'     # boto3
pip install 'waxell-observe[vertex]'      # google-cloud-aiplatform
pip install 'waxell-observe[gemini]'      # google-generativeai
pip install 'waxell-observe[cohere]'      # cohere
pip install 'waxell-observe[mistral]'     # mistralai
pip install 'waxell-observe[groq]'        # groq

# Or install the whole thing
pip install 'waxell-observe[all-providers]'

Most "OpenAI-compatible" providers (Fireworks, Together, Groq, xAI, NVIDIA, Mistral, AI21, Replicate, Ollama, vLLM, HF TGI) work with just the openai SDK because they speak the OpenAI HTTP wire format on a different base_url. The dispatcher handles the base_url override transparently.

When to use this vs the runtime

| Use observe SDK dispatch (this) when… | Use the Waxell runtime when… |
| --- | --- |
| You have an existing agent (LangGraph, LangChain, ad-hoc) | You're building a new agent from scratch |
| You want fallback + groups but not durable execution | You need durable execution + replay |
| You want to add Waxell to a serverless function | You want supervised + governed agent fleets |
| You want minimal migration cost | You want the full governance surface |

The data layer is shared — the same provider instances, groups, and capability overrides feed both adoption modes. You can start with observe SDK dispatch and graduate to the runtime later without re-configuring providers.

Reference

agentforge/areas/llm-providers/plans/OBSERVE_DISPATCH_PLAN.md — design plan
Provider Catalog — list of supported provider kinds
Controlplane UI: /settings/llm-routing (configure your instances)
Controlplane UI: /admin/llm-providers/ (cross-tenant admin, requires billing:admin)
