
Lineage

Every operator running agents at scale asks two questions. Why did this run happen? What did this run cause? Waxell Lineage answers both.

The problem it solves

Without lineage, when an unexpected run fires at 3 AM, you get to:

  1. Eyeball a trace tree
  2. Cross-reference workflow IDs by hand
  3. Hope you remember which signal fired when
  4. Eventually give up and restart the agent

With lineage, you open the session DAG, see the exact sequence of spawns / signals / resumes, filter by edge kind, scrub through a replay, and know in under a minute what caused what.

How it works

The runtime emits a typed edge into a RunEdge table at every dispatch site:

  • ctx.spawn → spawn edge (parent → child)
  • Signal route → signal_fire edge (source run → triggered run)
  • on_child_complete wake → resume edge
  • ctx.sleep expiry → timer_fire self-edge
  • ctx.ask_user answered → ctx_ask_user self-edge
  • Celery retry → retry self-edge
  • External callback → domain_callback edge
  • REST trigger / signal with no source → user_start edge
  • Cross-session trigger → cross_session_bridge edge

Each edge carries cost_attributed, tokens_attributed, and free-form metadata. An invariant check ensures every runtime-source run has exactly one inbound edge at birth.
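
The edge record and the birth invariant can be sketched in Python as follows. This is a minimal illustration, not Waxell's actual schema: the field names `src_run_id` and `dst_run_id` and the helper `runs_violating_birth_invariant` are hypothetical, and the check assumes it only sees edges that exist at the moment each run is created. Only the edge kinds and the `cost_attributed`, `tokens_attributed`, and `metadata` fields come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Iterable, Optional

# Edge kinds named in the list above.
EDGE_KINDS = frozenset({
    "spawn", "signal_fire", "resume", "timer_fire", "ctx_ask_user",
    "retry", "domain_callback", "user_start", "cross_session_bridge",
})


@dataclass(frozen=True)
class RunEdge:
    kind: str
    src_run_id: Optional[str]  # hypothetical name; None when there is no source run
    dst_run_id: str            # hypothetical name; the run the edge points at
    cost_attributed: float = 0.0
    tokens_attributed: int = 0
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {self.kind!r}")


def runs_violating_birth_invariant(
    run_ids: Iterable[str], edges_at_birth: Iterable[RunEdge]
) -> list[str]:
    """Return runs that do not have exactly one inbound edge at birth."""
    inbound = Counter(edge.dst_run_id for edge in edges_at_birth)
    return [run_id for run_id in run_ids if inbound[run_id] != 1]


edges = [
    RunEdge("user_start", None, "run-1"),
    RunEdge("spawn", "run-1", "run-2", cost_attributed=0.02, tokens_attributed=1500),
]
print(runs_violating_birth_invariant(["run-1", "run-2", "run-3"], edges))  # ['run-3']
```

Here `run-3` is flagged because no edge points at it, which is the situation the invariant check is described as guarding against.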

Does this replace Observe?

No. Observe's span tree and parent_workflow_id stay exactly as they were — never deprecated. Lineage is purely additive. The read layer merges both sources transparently so Observe-only tenants get the same UX.

Read surface

Surface | Use it for
Session DAG at /lineage/sessions/<id> | Visual react-flow graph with edge-kind filters, governance overlay, cost overlay, sibling aggregation, replay cursor, live streaming
Single-run Lineage tab | Upstream + downstream panels on every agent execution detail page
Run diff at /lineage/diff?a=...&b=... | Compare two runs' downstream subgraphs side-by-side with edge-kind deltas
Reliability heatmap at /lineage/reliability | Per-agent success rate + edge-kind signals
wax lineage CLI | upstream / downstream / session subcommands, table / tree / json formats
MCP tools | waxell_run_why, waxell_run_upstream, waxell_run_downstream, waxell_session_dag
HTTP API | GET /api/v1/lineage/{runs/<id>/upstream, runs/<id>/downstream, sessions/<id>/graph, runs/diff, agents/reliability}
WebSocket | ws://host/ws/lineage/sessions/<id>/ streams every new edge as it's emitted
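
As a rough sketch of how a client might address the HTTP and WebSocket surfaces, the helpers below build the paths listed in the table. The function names are hypothetical, and the sketch assumes IDs are placed in the path as single URL-encoded segments. The `runs/diff` endpoint is left out because its query parameters are not documented here.

```python
from urllib.parse import quote

API_PREFIX = "/api/v1/lineage"  # prefix taken from the HTTP API row above


def _segment(value: str) -> str:
    # Encode an ID so it is safe as a single path segment.
    return quote(value, safe="")


def upstream_path(run_id: str) -> str:
    return f"{API_PREFIX}/runs/{_segment(run_id)}/upstream"


def downstream_path(run_id: str) -> str:
    return f"{API_PREFIX}/runs/{_segment(run_id)}/downstream"


def session_graph_path(session_id: str) -> str:
    return f"{API_PREFIX}/sessions/{_segment(session_id)}/graph"


def reliability_path() -> str:
    return f"{API_PREFIX}/agents/reliability"


def session_stream_url(host: str, session_id: str) -> str:
    # WebSocket URL for live edges; note the trailing slash from the table.
    return f"ws://{host}/ws/lineage/sessions/{_segment(session_id)}/"


print(upstream_path("run-42"))
print(session_stream_url("localhost:8000", "sess-7"))
```

Any HTTP client can then issue a GET against these paths. Authentication and the response format are not covered by this page, so the sketch stops at URL construction.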

Feature flags

All three default off. Flip them in the order listed.

Flag | Effect
WAXELL_LINEAGE_V1 | Turns on runtime edge emission. When disabled, the runtime takes the flag-off path (less than 0.5 ms p99 overhead).
WAXELL_LINEAGE_UI_V1 | Turns on read endpoints + CLI + MCP + UI. Read-only; safe to flip globally.
WAXELL_LINEAGE_LIVE | Broadcasts every emit to the session's WebSocket group. Requires channels + Redis.
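
One way to guard the flip order is a small pre-deploy check like the one below. It is a sketch under several assumptions: that the flags are read from environment variables, that values such as "1" or "true" mean on, and that "in order" means the order of the table above. The helper names are hypothetical.

```python
import os
from typing import Mapping

# Flags in the order they are listed in the table above.
FLAG_ORDER = ("WAXELL_LINEAGE_V1", "WAXELL_LINEAGE_UI_V1", "WAXELL_LINEAGE_LIVE")
TRUTHY = {"1", "true", "yes", "on"}  # assumed truthy spellings


def is_on(env: Mapping[str, str], flag: str) -> bool:
    return env.get(flag, "").strip().lower() in TRUTHY


def out_of_order_flags(env: Mapping[str, str]) -> list[str]:
    """Return flags that are on while an earlier flag in FLAG_ORDER is off."""
    seen_off = False
    offenders = []
    for flag in FLAG_ORDER:
        if not is_on(env, flag):
            seen_off = True
        elif seen_off:
            offenders.append(flag)
    return offenders


if __name__ == "__main__":
    offenders = out_of_order_flags(os.environ)
    if offenders:
        print("Enabled before an earlier flag:", ", ".join(offenders))
    else:
        print("Lineage flags are enabled in order.")
```

For example, enabling only WAXELL_LINEAGE_LIVE would be reported, since both earlier flags are still off.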

At a glance

  • Load: 10k emits/sec per process (measured on CI).
  • Storage: ~1 MB/day/tenant at 10k dispatches/day.
  • Retention: 90-day default. Daily archival task (archive_lineage_edges) cleans older rows per tenant.
  • Observe safety: zero regression. Runtime emission never touches parent_workflow_id. The back-compat trigger backfills it only when it IS NULL.
  • Monitoring: three CloudWatch alarms in Waxell/Lineage namespace — MissingInboundEdge (P0), EmissionFailures (P1), EdgeFailuresSustained (P2).
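
The storage and retention figures above imply a rough per-edge size and steady-state footprint. The arithmetic below is a back-of-the-envelope sketch that assumes decimal megabytes and one edge per dispatch. Neither assumption is stated on this page.

```python
MB = 1_000_000  # assumption: decimal megabytes

storage_bytes_per_day = 1 * MB   # ~1 MB/day/tenant
dispatches_per_day = 10_000      # at 10k dispatches/day
retention_days = 90              # 90-day default retention

# Assuming one edge per dispatch, the average stored size of an edge.
bytes_per_edge = storage_bytes_per_day / dispatches_per_day

# Steady-state footprint per tenant once the retention window is full.
steady_state_mb = storage_bytes_per_day * retention_days / MB

print(f"{bytes_per_edge:.0f} bytes per edge")        # 100 bytes per edge
print(f"{steady_state_mb:.0f} MB per tenant at 90d")  # 90 MB per tenant at 90d
```

At the quoted load ceiling of 10k emits/sec per process, a full day of this tenant's edges corresponds to about one second of emission capacity.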

Next steps

  • Read the Lineage Guide for the rollout runbook
  • See docs/LINEAGE_PLAN.md for the full design thesis
  • Start instrumenting: flip WAXELL_LINEAGE_V1 on one tenant, run the 3-level spawn chain test, watch the DAG fill in