Lineage

Every operator running agents at scale asks two questions. Why did this run happen? What did this run cause? Waxell Lineage answers both.

The problem it solves

Without lineage, when an unexpected run fires at 3AM you get to:

Eyeball a trace tree
Cross-reference workflow IDs by hand
Hope you remember which signal fired when
Eventually give up and restart the agent

With lineage, you open the session DAG, see the exact sequence of spawns / signals / resumes, filter by edge kind, scrub through a replay, and know in under a minute what caused what.

How it works

The runtime emits a typed edge into the RunEdge table at every dispatch site:

ctx.spawn → spawn edge (parent → child)
Signal route → signal_fire edge (source run → triggered run)
on_child_complete wake → resume edge
ctx.sleep expiry → timer_fire self-edge
ctx.ask_user answered → ctx_ask_user self-edge
Celery retry → retry self-edge
External callback → domain_callback edge
REST trigger / signal with no source → user_start edge
Cross-session trigger → cross_session_bridge edge

Each edge carries cost_attributed, tokens_attributed, and free-form metadata. An invariant check ensures every runtime-source run has exactly one inbound edge at birth.
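To make the edge model concrete, here is a minimal Python sketch of an edge record and the inbound-edge invariant. Only the edge kinds and the cost_attributed, tokens_attributed, and metadata fields come from the description above. The source_run_id and target_run_id field names, the EdgeKind enum, the use of None for a missing source, and the helper function are illustrative assumptions, not Waxell's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class EdgeKind(str, Enum):
    SPAWN = "spawn"
    SIGNAL_FIRE = "signal_fire"
    RESUME = "resume"
    TIMER_FIRE = "timer_fire"
    CTX_ASK_USER = "ctx_ask_user"
    RETRY = "retry"
    DOMAIN_CALLBACK = "domain_callback"
    USER_START = "user_start"
    CROSS_SESSION_BRIDGE = "cross_session_bridge"


@dataclass
class RunEdge:
    kind: EdgeKind
    source_run_id: Optional[str]  # assumed: None when there is no source run
    target_run_id: str
    cost_attributed: float = 0.0
    tokens_attributed: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)


def runs_violating_birth_invariant(
    new_run_ids: List[str], birth_edges: List[RunEdge]
) -> List[str]:
    """Return run IDs that do not have exactly one inbound edge at birth."""
    inbound = Counter(edge.target_run_id for edge in birth_edges)
    return [run_id for run_id in new_run_ids if inbound[run_id] != 1]


# Hypothetical data: run-a was started by a user, run-b was spawned by run-a,
# and run-c has no inbound edge, so it is reported as a violation.
edges = [
    RunEdge(EdgeKind.USER_START, None, "run-a"),
    RunEdge(EdgeKind.SPAWN, "run-a", "run-b",
            cost_attributed=0.02, tokens_attributed=1200),
]
violations = runs_violating_birth_invariant(["run-a", "run-b", "run-c"], edges)
print(violations)  # ['run-c']
```

A run with zero inbound edges and a run with two are both reported, since the invariant requires exactly one.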

Does this replace Observe?

No. Observe's span tree and parent_workflow_id stay exactly as they were and are not being deprecated. Lineage is purely additive. The read layer merges both sources transparently, so Observe-only tenants get the same UX.
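One way to picture the merge is a lookup that prefers a lineage edge and falls back to Observe's parent_workflow_id when no edge exists. The sketch below is an assumption about how such a read layer could behave, not the actual implementation. The function names and the two dictionaries are hypothetical stand-ins for the real data sources.

```python
from typing import Dict, List, Optional


def resolve_parent(
    run_id: str,
    lineage_parent: Dict[str, str],
    observe_parent: Dict[str, Optional[str]],
) -> Optional[str]:
    """Prefer the lineage edge; fall back to Observe's parent_workflow_id."""
    if run_id in lineage_parent:
        return lineage_parent[run_id]
    return observe_parent.get(run_id)


def upstream_chain(
    run_id: str,
    lineage_parent: Dict[str, str],
    observe_parent: Dict[str, Optional[str]],
) -> List[str]:
    """Walk parents from run_id to the root, guarding against cycles."""
    chain: List[str] = []
    seen = {run_id}
    current = resolve_parent(run_id, lineage_parent, observe_parent)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = resolve_parent(current, lineage_parent, observe_parent)
    return chain


# Hypothetical data: run "c" has a lineage edge, while "b" and "a" are only
# known through Observe's parent_workflow_id.
lineage_parent = {"c": "b"}
observe_parent = {"b": "a", "a": None}
print(upstream_chain("c", lineage_parent, observe_parent))  # ['b', 'a']
```

In this sketch an Observe-only tenant has an empty lineage_parent, and the same walk still returns a full chain from parent_workflow_id alone.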

Read surface

Surface	Use it for
Session DAG at `/lineage/sessions/<id>`	Visual react-flow graph with edge-kind filters, governance overlay, cost overlay, sibling aggregation, replay cursor, live streaming
Single-run Lineage tab	Upstream + downstream panels on every agent execution detail page
Run diff at `/lineage/diff?a=...&b=...`	Compare two runs' downstream subgraphs side-by-side with edge-kind deltas
Reliability heatmap at `/lineage/reliability`	Per-agent success rate + edge-kind signals
`wax lineage` CLI	`upstream` / `downstream` / `session` subcommands, table / tree / json formats
MCP tools	`waxell_run_why`, `waxell_run_upstream`, `waxell_run_downstream`, `waxell_session_dag`
HTTP API	`GET /api/v1/lineage/{runs/<id>/upstream, runs/<id>/downstream, sessions/<id>/graph, runs/diff, agents/reliability}`
WebSocket	`ws://host/ws/lineage/sessions/<id>/` streams every new edge as it's emitted
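As a quick orientation to the HTTP paths listed above, the following Python sketch builds the endpoint URLs for a given host. The path segments come from the table. The base URL, the helper names, and the a and b query parameters on the diff endpoint (borrowed from the UI route) are assumptions.

```python
from urllib.parse import quote, urlencode

API_PREFIX = "/api/v1/lineage"


def _segment(value: str) -> str:
    """Percent-encode an ID so it is safe to use as one path segment."""
    return quote(value, safe="")


def upstream_url(base: str, run_id: str) -> str:
    return f"{base}{API_PREFIX}/runs/{_segment(run_id)}/upstream"


def downstream_url(base: str, run_id: str) -> str:
    return f"{base}{API_PREFIX}/runs/{_segment(run_id)}/downstream"


def session_graph_url(base: str, session_id: str) -> str:
    return f"{base}{API_PREFIX}/sessions/{_segment(session_id)}/graph"


def run_diff_url(base: str, run_a: str, run_b: str) -> str:
    # Assumption: the API takes the same a/b parameters as the UI diff route.
    return f"{base}{API_PREFIX}/runs/diff?{urlencode({'a': run_a, 'b': run_b})}"


def reliability_url(base: str) -> str:
    return f"{base}{API_PREFIX}/agents/reliability"


# Hypothetical host and run ID.
print(upstream_url("https://waxell.example", "run-123"))
```

Any HTTP client can issue a GET against these URLs. Authentication and response shape are not described here, so the sketch leaves them out.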

Feature flags

All three default off. Flip them in the order listed.

Flag	Effect
`WAXELL_LINEAGE_V1`	Turn on runtime edge emission. When disabled, the runtime takes the flag-off path, which adds less than 0.5 ms of p99 overhead.
`WAXELL_LINEAGE_UI_V1`	Turn on read endpoints + CLI + MCP + UI. Read-only; safe to flip globally.
`WAXELL_LINEAGE_LIVE`	Broadcast every emit to the session's WebSocket group. Requires channels + Redis.
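A small pre-flight check can catch flags flipped out of order. The sketch below assumes the flags are read as environment-style string values, that "1", "true", "yes", and "on" count as enabled, and that the required order is the order of the table above. None of these details is confirmed here, and the helper name is hypothetical.

```python
from typing import List, Mapping

# Assumed rollout order, taken from the table order above.
FLAG_ORDER = [
    "WAXELL_LINEAGE_V1",
    "WAXELL_LINEAGE_UI_V1",
    "WAXELL_LINEAGE_LIVE",
]
TRUTHY = {"1", "true", "yes", "on"}  # assumed truthy spellings


def out_of_order_flags(env: Mapping[str, str]) -> List[str]:
    """Return flags that are on while an earlier flag in FLAG_ORDER is off."""
    problems: List[str] = []
    earlier_off = False
    for name in FLAG_ORDER:
        is_on = env.get(name, "").strip().lower() in TRUTHY
        if is_on and earlier_off:
            problems.append(name)
        if not is_on:
            earlier_off = True
    return problems


# The UI flag is on but emission is off, so the UI flag is reported.
print(out_of_order_flags({"WAXELL_LINEAGE_UI_V1": "1"}))
```

Passing `os.environ` as the mapping would run the same check against the current process environment.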

At a glance

Load: 10k emits/sec per process (measured on CI).
Storage: ~1 MB/day/tenant at 10k dispatches/day.
Retention: 90-day default. A daily archival task (archive_lineage_edges) cleans up older rows per tenant.
Observe safety: zero regression. Runtime emission never touches parent_workflow_id. The back-compat trigger only backfills it when it IS NULL.
Monitoring: three CloudWatch alarms in the Waxell/Lineage namespace: MissingInboundEdge (P0), EmissionFailures (P1), EdgeFailuresSustained (P2).
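The storage and retention figures can be combined into a rough sizing estimate. The arithmetic below assumes decimal megabytes, a constant dispatch rate, and that the quoted ~1 MB/day covers all edges for those dispatches. It is a back-of-the-envelope sketch, not a measured number.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def lineage_storage_estimate(
    mb_per_day: float = 1.0,
    dispatches_per_day: int = 10_000,
    retention_days: int = 90,
) -> dict:
    """Estimate bytes per dispatch and steady-state storage per tenant."""
    daily_bytes = mb_per_day * BYTES_PER_MB
    return {
        "bytes_per_dispatch": daily_bytes / dispatches_per_day,
        "steady_state_mb": mb_per_day * retention_days,
    }


estimate = lineage_storage_estimate()
print(estimate)  # {'bytes_per_dispatch': 100.0, 'steady_state_mb': 90.0}
```

Under these assumptions each dispatch costs about 100 bytes, and a tenant at 10k dispatches/day levels off near 90 MB once the 90-day retention window is full.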

Next steps

Read the Lineage Guide for the rollout runbook
See docs/LINEAGE_PLAN.md for the full design thesis
Start instrumenting: flip WAXELL_LINEAGE_V1 on one tenant, run the 3-level spawn chain test, watch the DAG fill in
