Lineage
Every operator running agents at scale asks two questions. Why did this run happen? What did this run cause? Waxell Lineage answers both.
The problem it solves
Without lineage, when an unexpected run fires at 3AM you get to:
- Eyeball a trace tree
- Cross-reference workflow IDs by hand
- Hope you remember which signal fired when
- Eventually give up and restart the agent
With lineage, you open the session DAG, see the exact sequence of spawns / signals / resumes, filter by edge kind, scrub through a replay, and know in under a minute what caused what.
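The "what caused what" question amounts to walking inbound edges from a run back to its origin. The sketch below illustrates that walk over an in-memory list of edges. It is not Waxell's implementation: the `(source, target, kind)` tuple shape, the use of `None` as the source of a root edge, and the function name `upstream_chain` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical edge shape: (source_run_id, target_run_id, edge_kind).
# A source of None stands in for "no source run" (assumption).
Edge = Tuple[Optional[str], str, str]


def upstream_chain(run_id: str, edges: List[Edge]) -> List[Edge]:
    """Walk inbound edges from run_id back to the run that started the chain."""
    inbound = {}
    for source, target, kind in edges:
        # Self-edges (retries, timers) do not explain why a run was born.
        if source != target:
            inbound.setdefault(target, (source, kind))

    chain: List[Edge] = []
    seen = set()
    current: Optional[str] = run_id
    while current in inbound and current not in seen:
        seen.add(current)  # guard against accidental cycles
        source, kind = inbound[current]
        chain.append((source, current, kind))
        current = source
    return chain


edges = [
    (None, "a", "user_start"),
    ("a", "b", "spawn"),
    ("b", "b", "retry"),
    ("b", "c", "signal_fire"),
]
for source, target, kind in upstream_chain("c", edges):
    print(f"{source} -[{kind}]-> {target}")
```

Reading the printed chain top to bottom answers "why did run c happen": it was triggered by a signal from b, which was spawned by a, which a user started.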
How it works
The runtime emits a typed edge into the RunEdge table at every dispatch site:
- ctx.spawn → spawn edge (parent → child)
- Signal route → signal_fire edge (source run → triggered run)
- on_child_complete wake → resume edge
- ctx.sleep expiry → timer_fire self-edge
- ctx.ask_user answered → ctx_ask_user self-edge
- Celery retry → retry self-edge
- External callback → domain_callback edge
- REST trigger / signal with no source → user_start edge
- Cross-session trigger → cross_session_bridge edge
Each edge carries cost_attributed, tokens_attributed, and free-form metadata. An invariant check ensures every runtime-source run has exactly one inbound edge at birth.
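As a rough mental model, an edge and the birth invariant could be sketched as follows. Only the edge kinds and the `cost_attributed`, `tokens_attributed`, and `metadata` fields come from the text above. The class layout, the `source_run_id` / `target_run_id` field names, the use of `None` for a missing source, and the choice to exclude self-edges from the inbound count are assumptions for illustration, not the actual RunEdge schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Iterable, List, Optional


class EdgeKind(str, Enum):
    SPAWN = "spawn"
    SIGNAL_FIRE = "signal_fire"
    RESUME = "resume"
    TIMER_FIRE = "timer_fire"
    CTX_ASK_USER = "ctx_ask_user"
    RETRY = "retry"
    DOMAIN_CALLBACK = "domain_callback"
    USER_START = "user_start"
    CROSS_SESSION_BRIDGE = "cross_session_bridge"


@dataclass
class RunEdge:
    kind: EdgeKind
    source_run_id: Optional[str]  # hypothetical: None when there is no source run
    target_run_id: str
    cost_attributed: float = 0.0
    tokens_attributed: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)


def runs_violating_invariant(run_ids: Iterable[str], edges: Iterable[RunEdge]) -> List[str]:
    """Return runs that do not have exactly one inbound (non-self) edge."""
    inbound = Counter()
    for edge in edges:
        # Assumption: self-edges (retry, timer_fire, ...) arrive after birth,
        # so they are not counted toward the birth invariant.
        if edge.source_run_id != edge.target_run_id:
            inbound[edge.target_run_id] += 1
    return sorted(run_id for run_id in run_ids if inbound[run_id] != 1)
```

A run that shows up in the returned list either has no recorded cause or has more than one, which is the condition the invariant check is meant to catch.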
Does this replace Observe?
No. Observe's span tree and parent_workflow_id stay exactly as they were — never deprecated. Lineage is purely additive. The read layer merges both sources transparently so Observe-only tenants get the same UX.
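One way to picture the merged read layer is a lookup that combines lineage edges with Observe's parent_workflow_id. The sketch below is illustrative only: the function name, the dictionary inputs, and the rule that a lineage edge takes precedence over parent_workflow_id when both exist are assumptions, not documented behavior.

```python
from typing import Dict, Optional, Tuple


def merge_parents(
    lineage_parents: Dict[str, str],
    observe_parents: Dict[str, Optional[str]],
) -> Dict[str, Tuple[str, str]]:
    """Map run_id -> (parent_run_id, origin), where origin names the data source."""
    merged: Dict[str, Tuple[str, str]] = {}

    # Start from Observe's parent_workflow_id so Observe-only runs are covered.
    for run_id, parent in observe_parents.items():
        if parent is not None:
            merged[run_id] = (parent, "observe")

    # Assumed precedence: a lineage edge, when present, overrides the fallback.
    for run_id, parent in lineage_parents.items():
        merged[run_id] = (parent, "lineage")

    return merged


print(merge_parents({"b": "a"}, {"b": "a", "c": "b", "d": None}))
```

Under these assumptions, run c still gets a parent even though it has no lineage edge, which is the behavior an Observe-only tenant would rely on.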
Read surface
| Surface | Use it for |
|---|---|
| Session DAG at /lineage/sessions/<id> | Visual react-flow graph with edge-kind filters, governance overlay, cost overlay, sibling aggregation, replay cursor, live streaming |
| Single-run Lineage tab | Upstream + downstream panels on every agent execution detail page |
| Run diff at /lineage/diff?a=...&b=... | Compare two runs' downstream subgraphs side-by-side with edge-kind deltas |
| Reliability heatmap at /lineage/reliability | Per-agent success rate + edge-kind signals |
| wax lineage CLI | upstream / downstream / session subcommands, table / tree / json formats |
| MCP tools | waxell_run_why, waxell_run_upstream, waxell_run_downstream, waxell_session_dag |
| HTTP API | GET /api/v1/lineage/{runs/<id>/upstream, runs/<id>/downstream, sessions/<id>/graph, runs/diff, agents/reliability} |
| WebSocket | ws://host/ws/lineage/sessions/<id>/ streams every new edge as it's emitted |
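For scripting against the HTTP API, the paths in the table can be assembled with a small helper. This is a sketch under assumptions: the base URL `https://waxell.example` is a placeholder, the helper names are hypothetical, and authentication and response shapes are not covered because the table does not specify them.

```python
from urllib.parse import quote

API_PREFIX = "/api/v1/lineage"


def lineage_url(base_url: str, *segments: str) -> str:
    """Join path segments under the lineage API prefix, escaping each one."""
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base_url.rstrip('/')}{API_PREFIX}/{path}"


def upstream_url(base_url: str, run_id: str) -> str:
    return lineage_url(base_url, "runs", run_id, "upstream")


def downstream_url(base_url: str, run_id: str) -> str:
    return lineage_url(base_url, "runs", run_id, "downstream")


def session_graph_url(base_url: str, session_id: str) -> str:
    return lineage_url(base_url, "sessions", session_id, "graph")


def reliability_url(base_url: str) -> str:
    return lineage_url(base_url, "agents", "reliability")


BASE = "https://waxell.example"  # placeholder host, not a real deployment
print(upstream_url(BASE, "run-123"))
print(session_graph_url(BASE, "sess-9"))
```

Each URL is meant for a GET request. The runs/diff endpoint is left out because its query parameters are not stated for the API route.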
Feature flags
Both default off. Flip in order.
| Flag | Effect |
|---|---|
| WAXELL_LINEAGE_V1 | Turn on runtime edge emission. When disabled, the runtime takes the flag-off path (less than 0.5 ms p99 overhead). |
| WAXELL_LINEAGE_UI_V1 | Turn on read endpoints + CLI + MCP + UI. Read-only; safe to flip globally. |
| WAXELL_LINEAGE_LIVE | Broadcast every emit to the session's WebSocket group. Requires channels + Redis. |
At a glance
- Load: 10k emits/sec per process (measured on CI).
- Storage: ~1 MB/day/tenant at 10k dispatches/day.
- Retention: 90-day default. Daily archival task (archive_lineage_edges) cleans older rows per tenant.
- Observe safety: zero regression. parent_workflow_id is never touched by runtime emission. The back-compat trigger only backfills it when it IS NULL.
- Monitoring: three CloudWatch alarms in the Waxell/Lineage namespace: MissingInboundEdge (P0), EmissionFailures (P1), EdgeFailuresSustained (P2).
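The storage and retention figures above can be turned into a quick back-of-the-envelope estimate. The calculation below assumes one edge per dispatch, decimal megabytes (1 MB = 1,000,000 bytes), and a steady dispatch rate. These are simplifications for illustration, not measured values.

```python
DISPATCHES_PER_DAY = 10_000   # from the storage figure above
BYTES_PER_DAY = 1_000_000     # ~1 MB/day/tenant, decimal megabyte (assumption)
RETENTION_DAYS = 90           # default retention


def bytes_per_edge(bytes_per_day: int, dispatches_per_day: int) -> float:
    """Average row size, assuming one edge per dispatch."""
    return bytes_per_day / dispatches_per_day


def steady_state_mb(bytes_per_day: int, retention_days: int) -> float:
    """Live storage per tenant once the retention window is full."""
    return bytes_per_day * retention_days / 1_000_000


print(bytes_per_edge(BYTES_PER_DAY, DISPATCHES_PER_DAY))  # 100.0 bytes per edge
print(steady_state_mb(BYTES_PER_DAY, RETENTION_DAYS))     # 90.0 MB per tenant
```

Under these assumptions, each edge costs roughly 100 bytes, and a tenant at 10k dispatches/day levels off near 90 MB of live edge data before archival removes older rows.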
Next steps
- Read the Lineage Guide for the rollout runbook
- See docs/LINEAGE_PLAN.md for the full design thesis
- Start instrumenting: flip WAXELL_LINEAGE_V1 on one tenant, run the 3-level spawn chain test, watch the DAG fill in