
Execution Isolation Tier

For regulated workloads (insurance, financial services, healthcare), shared workers can be made safe but cannot be proven safe to a CISO or an auditor. Waxell's isolation tier moves these workloads onto per-execution Fargate slots with cryptographic per-tenant identity. Each run gets its own slot, its own short-lived STS credentials, and its own KMS encryption context. The slot exits after one run.

This page describes the model and how it affects your tenant's behavior. Operators should see the management runbook.

Status

The isolation tier is rolling out for selected enterprise tenants in 2026. It runs alongside the default shared-worker tier — your code does not change. The platform decides which tier handles each run based on your tenant's sensitivity classification.


What changes in your tenant

Nothing in your agent code. The same wax push deploy flow, the same agent.run() call, the same SDK. The platform inspects your tenant's configuration and routes each run to the appropriate execution tier:

Tier | When it's used | Per-run isolation
--- | --- | ---
Tier 0: Shared workers | Default for paid + free tiers | Schema-per-tenant + tenant-scoped Redis. Strong logical isolation.
Tier 1: Warm Fargate slots | Sensitivity ≥ sensitive | Dedicated Fargate task per run. Per-execution STS credentials with session tags. Per-cell KMS CMK with encryption-context match.
Tier 2: Heavy Fargate | OOM-prone or burst workloads (auto-promoted) | Same isolation as Tier 1, larger memory ceiling, on-demand provisioning.
Tier 3: Custom image | When your wax push includes custom dependencies | Same isolation as Tier 1, runs your own ECR image.

What the audit story looks like

For each run on Tier 1+ you get a complete cryptographic chain:

  1. Slot claim event — Redis-atomic slot pop, recorded with tenant_id, run_id, cell_id
  2. STS AssumeRole — short-lived credentials (1h max) with session tags tenant_id, cell_id, run_id. Logged in CloudTrail.
  3. KMS Decrypt with encryption context — every secret your agent reads. Encryption-context tenant_id MUST match session-tag tenant_id. KMS denies any mismatch.
  4. Secrets Manager reads — scoped to your tenant's prefix only. Cross-tenant access denied at the IAM policy.
  5. Slot exit — slot marked draining, replaced by a fresh slot. Single-use guarantee: the slot never serves another tenant's run.
  6. OTel spans — every step of execution exported with waxell.execution.tier, waxell.execution.cell_id, waxell.execution.tenant_id, waxell.execution.run_id, waxell.execution.slot_arn resource attributes.

A single run_id correlates the entire chain. Hand it to an auditor as the proof bundle.
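
Steps 2 and 3 of the chain can be illustrated with a minimal sketch. It is not Waxell's implementation: the function names, the role ARN, and the session-name format are hypothetical. Only the session-tag keys, the 1h ceiling, and the tenant_id match rule come from this page. The first function builds keyword arguments in the shape that the STS AssumeRole API accepts; the second mirrors, in plain Python, the comparison that KMS is described as enforcing.

```python
import re

MAX_SESSION_SECONDS = 3600  # this page states slot credentials last at most 1h


def build_assume_role_request(role_arn: str, tenant_id: str, cell_id: str, run_id: str) -> dict:
    """Build keyword arguments for STS AssumeRole with per-run session tags."""
    # STS restricts RoleSessionName to [\w+=,.@-] and at most 64 characters.
    # The "run-<run_id>" naming scheme is an assumption for illustration.
    session_name = re.sub(r"[^\w+=,.@-]", "-", f"run-{run_id}", flags=re.ASCII)[:64]
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": MAX_SESSION_SECONDS,
        "Tags": [
            {"Key": "tenant_id", "Value": tenant_id},
            {"Key": "cell_id", "Value": cell_id},
            {"Key": "run_id", "Value": run_id},
        ],
    }


def decrypt_allowed(session_tags: dict, encryption_context: dict) -> bool:
    """Mirror the rule: encryption-context tenant_id must equal session-tag tenant_id."""
    tenant = session_tags.get("tenant_id")
    return tenant is not None and encryption_context.get("tenant_id") == tenant
```

With boto3, a dictionary like this could be passed as `sts_client.assume_role(**request)`. In a real deployment the match would be enforced by AWS policy evaluation, not by application code; the Python check only shows which two values are compared.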


Cell-based control plane

Every tenant is bound to a cell at creation. A cell is a logical partition of:

  • A dedicated SQS dispatch queue
  • A dedicated warm pool of isolation slots
  • A dedicated Redis namespace (shared cluster, isolated by key prefix)
  • A cell-local KMS CMK (alias/waxell-cell-NNN-master)

Cells cap at ~200 active tenants. As cells fill, new ones spin up. The binding lets us answer commercial questions ("can my data live in eu-west?", "can I have a dedicated cell?") without re-architecting — cell_id is the unit of physical promotion.

Self-service signups are assigned to the lowest-numbered cell with capacity. Enterprise tenants get explicit cell_id at onboarding — for dedicated cells, region-specific cells, or BYOC arrangements.
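
The self-service assignment rule above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the cap is treated as exactly 200 (the page says "~200"), and a new cell is assumed to take the next number after the highest existing one.

```python
CELL_CAPACITY = 200  # assumed exact value; the page gives "~200 active tenants"


def assign_cell(active_tenants: dict[int, int], capacity: int = CELL_CAPACITY) -> int:
    """Return the lowest-numbered cell with room, or a new cell number if all are full.

    `active_tenants` maps cell number -> current count of active tenants.
    """
    for cell_number in sorted(active_tenants):
        if active_tenants[cell_number] < capacity:
            return cell_number
    # Every existing cell is full (or none exist yet): spin up the next cell.
    return max(active_tenants, default=0) + 1
```

For example, `assign_cell({1: 200, 2: 150, 3: 10})` picks cell 2, because cell 1 is full and cell 2 is the lowest-numbered cell that still has capacity. Enterprise tenants with an explicit cell_id would bypass this rule entirely.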


Sensitivity tiers

Your Tenant.sensitivity_tier field selects routing behavior:

Tier | Routing
--- | ---
standard | Shared workers (Tier 0). Default for new tenants.
sensitive | Forces Tier 1 minimum. Per-execution isolation, per-tenant credentials.
regulated | Forces Tier 1 minimum + always-on minimum slot pool (no scale-to-zero). Required for HIPAA/SOC2/PCI-class workloads.

Setting this at tenant creation or via the platform admin panel takes effect on the next run.

You can also override per-agent via TenantAgentVersion.sensitivity_override to force stricter isolation for a single agent without bumping the whole tenant.
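
One way to picture how the tenant tier and the per-agent override combine is the sketch below. The names `resolve_routing` and `RoutingDecision` are hypothetical, and the sketch assumes an override can only tighten isolation (the page describes overrides as a way to force stricter isolation; how a looser override is handled is not stated here).

```python
from dataclasses import dataclass
from typing import Optional

# Ordering from least to most strict, following the table above.
STRICTNESS = {"standard": 0, "sensitive": 1, "regulated": 2}


@dataclass(frozen=True)
class RoutingDecision:
    sensitivity: str       # effective sensitivity after applying the override
    minimum_tier: int      # 0 = shared workers, 1 = isolation tier or above
    always_on_pool: bool   # True only for "regulated" (no scale-to-zero)


def resolve_routing(tenant_tier: str, agent_override: Optional[str] = None) -> RoutingDecision:
    """Pick the stricter of the tenant tier and the per-agent override."""
    effective = tenant_tier
    if agent_override is not None and STRICTNESS[agent_override] > STRICTNESS[tenant_tier]:
        effective = agent_override
    return RoutingDecision(
        sensitivity=effective,
        minimum_tier=0 if effective == "standard" else 1,
        always_on_pool=effective == "regulated",
    )
```

Under these assumptions, `resolve_routing("standard", "sensitive")` yields a Tier 1 minimum for that one agent while the rest of the tenant stays on shared workers.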


Failure modes you should know about

  • Pool underrun. If the warm pool is exhausted by burst traffic, claims return a transient pool_underrun error and the platform falls back to Tier 0 (with a logged warning) or queues the run for retry. An alarm fires when the Waxell/IsolationTier/PoolIdleSlots metric stays below 1 for 5 minutes.

  • STS credential expiry mid-run. Slot credentials last at most 1h. Runs longer than 1h need a credential refresh, which the runtime handles transparently for Tier 1+. If the refresh fails, the run is marked failed with claim_failed exit status.

  • KMS encryption-context mismatch. Wrong-tenant decrypt attempt. KMS denies, slot exits with crash status, KmsEncryptionContextMismatch alarm fires. Always treated as a security incident.

  • Slot crash. Worker process died (OOM, segfault, host failure). ECS replaces the slot. The in-flight run is marked failed; downstream signals fire.
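
The pool-underrun alarm condition can be read as "idle slots below 1 for five consecutive minutes". The sketch below shows that evaluation in plain Python. It assumes one metric sample per minute, which this page does not state, and the function name is hypothetical.

```python
def pool_underrun_alarm(idle_slot_samples: list[int], threshold: int = 1, periods: int = 5) -> bool:
    """Return True when the most recent `periods` samples are all below `threshold`.

    `idle_slot_samples` is ordered oldest to newest, one sample per minute (assumed).
    """
    if len(idle_slot_samples) < periods:
        return False  # not enough data to cover the evaluation window
    return all(sample < threshold for sample in idle_slot_samples[-periods:])
```

A single recovered sample inside the window, such as `[0, 0, 0, 1, 0]`, keeps the alarm from firing under this reading.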


What's deferred to later releases

  • CMK migration for RDS, EBS, Secrets Manager, ECR, CloudWatch Logs. Today these use AWS-managed default keys. Customer-managed keys per cell are Phase 1 of the rollout. The audit story is not affected: the slot's KMS encryption-context guarantee is independent of the data-store CMK.
  • ElastiCache cluster-mode + RBAC + per-tenant ACL. Today the slot reaches Redis on a shared cluster with namespace isolation. Cluster-mode + per-execution ACL is Phase 1.1.
  • Per-tenant FQDN egress allowlist. Today the slot's platform-egress SG limits outbound to AWS services + 443/53 only. Per-tenant allowlists via egress proxy are a Phase 2 follow-up.
  • Customer-managed encryption keys (BYOK). Year 2.
  • Sub-second cold-start tier (microVM / Kata). Year 2 — the ExecutionTier protocol is designed so this lands as a new tier, not a rewrite.

How to verify your tenant is on the isolation tier

Go to the platform admin → Tenants → [your tenant]. The Isolation Tier panel shows:

  • Current cell binding
  • Sensitivity tier
  • Whether your runs route to Tier 0 or Tier 1+
  • Live pool health (idle / claimed / draining slot counts)

Your runs in Tempo will have waxell.execution.tier="tier-1-warm-fargate" on the root span when isolation is active.
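
If you export span attributes from Tempo, a small check like the one below can confirm a run carries the isolation attributes named on this page. It is a sketch: the function names are hypothetical, the input is assumed to be a flat dictionary of attribute names to values, and only the "tier-1-warm-fargate" value is known from this page (values for other tiers are not listed here).

```python
# Attribute names as listed in the audit chain on this page.
REQUIRED_ATTRIBUTES = (
    "waxell.execution.tier",
    "waxell.execution.cell_id",
    "waxell.execution.tenant_id",
    "waxell.execution.run_id",
    "waxell.execution.slot_arn",
)


def missing_isolation_attributes(attributes: dict) -> list[str]:
    """Return the expected attribute names that are absent from the span."""
    return [key for key in REQUIRED_ATTRIBUTES if key not in attributes]


def is_warm_fargate_run(attributes: dict) -> bool:
    """True when the span reports the Tier 1 value documented on this page."""
    return attributes.get("waxell.execution.tier") == "tier-1-warm-fargate"
```

An empty list from `missing_isolation_attributes` together with `is_warm_fargate_run` returning True suggests the run was handled by Tier 1.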


Reference

  • Architecture spec: .planning/ISOLATION_TIER_DESIGN.md (engineering-internal)
  • Operations: docs/runbooks/isolation_tier_management.md (engineering-internal)
  • Cutover procedure: docs/runbooks/isolation_tier_callsine_cutover.md
  • Failure-injection tests: docs/runbooks/isolation_tier_f3_test.md