Execution Isolation Tier
For regulated workloads (insurance, financial services, healthcare), shared workers can be made safe but cannot be proven safe to a CISO or an auditor. Waxell's isolation tier moves these workloads onto per-execution Fargate slots with cryptographic per-tenant identity. Each run gets its own slot, its own short-lived STS credentials, and its own KMS encryption context. The slot exits after one run.
This page describes the model and how it shows up in your tenant's behavior. Operators should refer to the management runbook.
The isolation tier is rolling out for selected enterprise tenants in 2026. It runs alongside the default shared-worker tier — your code does not change. The platform decides which tier handles each run based on your tenant's sensitivity classification.
What changes in your tenant
Nothing in your agent code. The same wax push deploy flow, the same
agent.run() call, the same SDK.
If you don't already have the wax CLI:
pip install waxell (or pipx install waxell for an isolated install). See CLI reference for Windows PATH issues and troubleshooting.
The platform inspects your tenant's configuration and routes each run to
the appropriate execution tier:
| Tier | When it's used | Per-run isolation |
|---|---|---|
| Tier 0 — Shared workers | Default for paid + free tiers | Schema-per-tenant + tenant-scoped Redis. Strong logical isolation. |
| Tier 1 — Warm Fargate slots | Sensitivity ≥ sensitive | Dedicated Fargate task per run. Per-execution STS credentials with session tags. Per-cell KMS CMK with encryption-context match. |
| Tier 2 — Heavy Fargate | OOM-prone or burst workloads (auto-promoted) | Same isolation as Tier 1, larger memory ceiling, on-demand provisioning. |
| Tier 3 — Custom image | When your wax push includes custom dependencies | Same isolation as Tier 1, runs your own ECR image. |
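The routing in the table above can be sketched as a small decision function. This is a minimal illustration, not the platform's router: the RunRequest type, its two boolean flags, and the precedence order (custom image first, then heavy, then warm) are assumptions made for the example.

```python
from dataclasses import dataclass

# Ordering of the sensitivity classifications named on this page.
SENSITIVITY_RANK = {"standard": 0, "sensitive": 1, "regulated": 2}


@dataclass(frozen=True)
class RunRequest:
    """Hypothetical description of one run, for illustration only."""
    sensitivity: str                       # "standard", "sensitive", or "regulated"
    has_custom_dependencies: bool = False  # assumed flag: wax push shipped custom deps
    oom_prone: bool = False                # assumed flag: workload was auto-promoted


def select_tier(run: RunRequest) -> int:
    """Return the execution tier number (0-3) for a run.

    The order of the checks is an assumption; the table does not say
    which rule wins when several apply.
    """
    if run.sensitivity not in SENSITIVITY_RANK:
        raise ValueError(f"unknown sensitivity: {run.sensitivity!r}")
    if run.has_custom_dependencies:
        return 3  # custom image
    if run.oom_prone:
        return 2  # heavy Fargate
    if SENSITIVITY_RANK[run.sensitivity] >= SENSITIVITY_RANK["sensitive"]:
        return 1  # warm Fargate slot
    return 0      # shared workers


if __name__ == "__main__":
    print(select_tier(RunRequest("regulated")))
```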
What the audit story looks like
For each run on Tier 1+ you get a complete cryptographic chain:
- Slot claim event: Redis-atomic slot pop, recorded with tenant_id, run_id, cell_id.
- STS AssumeRole: short-lived credentials (1h max) with session tags tenant_id, cell_id, run_id. Logged in CloudTrail.
- KMS Decrypt with encryption context: every secret your agent reads. Encryption-context tenant_id MUST match session-tag tenant_id. KMS denies any mismatch.
- Secrets Manager reads: scoped to your tenant's prefix only. Cross-tenant access denied at the IAM policy.
- Slot exit: slot marked draining, replaced by a fresh slot. Single-use guarantee: the slot never serves another tenant's run.
- OTel spans: every step of execution exported with waxell.execution.tier, waxell.execution.cell_id, waxell.execution.tenant_id, waxell.execution.run_id, waxell.execution.slot_arn resource attributes.
A single run_id correlates the entire chain. Hand it to an auditor as the proof bundle.
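To make the STS and KMS links of the chain concrete, the sketch below builds the keyword arguments that boto3's sts.assume_role and kms.decrypt accept, then checks the tenant_id match locally. The role ARN, the session-name format, and the helper names are assumptions for illustration. In a real deployment the match is enforced by AWS (for example through a key-policy condition comparing the encryption context to the principal tag), not by application code; this page does not say which mechanism the platform uses.

```python
def build_assume_role_params(role_arn: str, tenant_id: str,
                             cell_id: str, run_id: str) -> dict:
    """Keyword arguments for boto3's sts.assume_role (no AWS call is made here)."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"run-{run_id}",  # assumed naming scheme
        "DurationSeconds": 3600,             # 1h maximum, as stated above
        "Tags": [
            {"Key": "tenant_id", "Value": tenant_id},
            {"Key": "cell_id", "Value": cell_id},
            {"Key": "run_id", "Value": run_id},
        ],
    }


def build_decrypt_params(ciphertext: bytes, tenant_id: str) -> dict:
    """Keyword arguments for boto3's kms.decrypt with an encryption context."""
    return {
        "CiphertextBlob": ciphertext,
        "EncryptionContext": {"tenant_id": tenant_id},
    }


def context_matches_session(sts_params: dict, kms_params: dict) -> bool:
    """Local illustration of the rule: context tenant_id == session-tag tenant_id."""
    tags = {t["Key"]: t["Value"] for t in sts_params["Tags"]}
    context_tenant = kms_params["EncryptionContext"].get("tenant_id")
    return context_tenant is not None and context_tenant == tags.get("tenant_id")


if __name__ == "__main__":
    # All identifiers below are made-up example values.
    sts = build_assume_role_params(
        "arn:aws:iam::123456789012:role/example-slot-role", "t-42", "cell-001", "r-1")
    print(context_matches_session(sts, build_decrypt_params(b"...", "t-42")))
```

With boto3 installed and credentials available, the dictionaries would be passed as boto3.client("sts").assume_role(**params) and boto3.client("kms").decrypt(**params).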
Cell-based control plane
Every tenant is bound to a cell at creation. A cell is a logical partition of:
- A dedicated SQS dispatch queue
- A dedicated warm pool of isolation slots
- A dedicated Redis namespace (shared cluster, isolated by key prefix)
- A cell-local KMS CMK (alias/waxell-cell-NNN-master)
Cells cap at ~200 active tenants. As cells fill, new ones spin up. The
binding lets us answer commercial questions ("can my data live in eu-west?",
"can I have a dedicated cell?") without re-architecting — cell_id is
the unit of physical promotion.
Self-service signups are assigned to the lowest-numbered cell with capacity.
Enterprise tenants get explicit cell_id at onboarding — for dedicated cells,
region-specific cells, or BYOC arrangements.
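The assignment rule can be sketched as follows. This is an illustration under stated assumptions: the Cell type, the hard cap of exactly 200, and the "next integer" numbering for a newly created cell are all hypothetical, since the page only says cells cap at roughly 200 tenants and that new ones spin up as cells fill.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

CELL_CAPACITY = 200  # the page says "~200"; the exact figure is an assumption


@dataclass
class Cell:
    """Hypothetical view of a cell's occupancy."""
    cell_id: int
    active_tenants: int
    capacity: int = CELL_CAPACITY


def assign_cell(cells: list[Cell], explicit_cell_id: Optional[int] = None) -> int:
    """Pick a cell_id for a new tenant.

    Enterprise onboarding passes explicit_cell_id; self-service signups
    get the lowest-numbered cell that still has capacity.
    """
    if explicit_cell_id is not None:
        return explicit_cell_id
    open_ids = [c.cell_id for c in cells if c.active_tenants < c.capacity]
    if open_ids:
        return min(open_ids)
    # Every cell is full: a new cell is needed. Numbering it as the next
    # integer is an assumption made for this sketch.
    return max((c.cell_id for c in cells), default=0) + 1


if __name__ == "__main__":
    print(assign_cell([Cell(1, 200), Cell(2, 150), Cell(3, 10)]))
```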
Sensitivity tiers
Your Tenant.sensitivity_tier field selects routing behavior:
| Tier | Routing |
|---|---|
| standard | Shared workers (Tier 0). Default for new tenants. |
| sensitive | Forces Tier 1 minimum. Per-execution isolation, per-tenant credentials. |
| regulated | Forces Tier 1 minimum + always-on minimum slot pool (no scale-to-zero). Required for HIPAA/SOC2/PCI-class workloads. |
Setting this at tenant creation or via the platform admin panel takes effect on the next run.
You can also override per-agent via TenantAgentVersion.sensitivity_override
to force stricter isolation for a single agent without bumping the whole
tenant.
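One way to read the override rule is that the effective sensitivity for a run is the stricter of the tenant's tier and the agent's override. The sketch below assumes exactly that; whether an override may also lower the tier is not stated here, so this example never lets it. The function names are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class SensitivityTier(IntEnum):
    """The three classifications from the table, ordered least to most strict."""
    STANDARD = 0
    SENSITIVE = 1
    REGULATED = 2


def effective_sensitivity(tenant_tier: SensitivityTier,
                          agent_override: Optional[SensitivityTier] = None
                          ) -> SensitivityTier:
    """Combine Tenant.sensitivity_tier with an optional per-agent override.

    Assumption: the override can only raise the tier, never lower it.
    """
    if agent_override is None:
        return tenant_tier
    return max(tenant_tier, agent_override)


def needs_always_on_pool(tier: SensitivityTier) -> bool:
    """Only regulated workloads keep a minimum slot pool (no scale-to-zero)."""
    return tier == SensitivityTier.REGULATED


if __name__ == "__main__":
    tier = effective_sensitivity(SensitivityTier.STANDARD, SensitivityTier.SENSITIVE)
    print(tier.name, needs_always_on_pool(tier))
```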
Failure modes you should know about
- Pool underrun. If the warm pool is exhausted by burst traffic, claims return a transient pool_underrun error and the platform falls back to Tier 0 (with a logged warning) or queues the run for retry. The Waxell/IsolationTier/PoolIdleSlots metric alarms when it stays below 1 for 5 minutes.
- STS credential expiry mid-run. Slot credentials last 1h. Runs longer than 1h need a refresh, which the runtime handles transparently for Tier 1+. If the refresh fails, the run is marked failed with a claim_failed exit status.
- KMS encryption-context mismatch. Wrong-tenant decrypt attempt. KMS denies the request, the slot exits with crash status, and the KmsEncryptionContextMismatch alarm fires. Always treated as a security incident.
- Slot crash. Worker process died (OOM, segfault, host failure). ECS replaces the slot. The in-flight run is marked failed; downstream signals fire.
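The pool-underrun alarm condition (idle slots below 1 for 5 minutes) can be illustrated with a small evaluator. This is a sketch, not the platform's alarm definition: it assumes one PoolIdleSlots sample per minute and that every sample in the window must breach the threshold.

```python
from typing import Sequence


def pool_alarm_firing(idle_slots_per_minute: Sequence[int],
                      threshold: int = 1,
                      window_minutes: int = 5) -> bool:
    """Return True when the most recent window is entirely below the threshold.

    Assumes one sample per minute, oldest first. With fewer samples than
    the window, the alarm is treated as not firing.
    """
    if len(idle_slots_per_minute) < window_minutes:
        return False
    recent = idle_slots_per_minute[-window_minutes:]
    return all(sample < threshold for sample in recent)


if __name__ == "__main__":
    print(pool_alarm_firing([3, 0, 0, 0, 0, 0]))  # five consecutive empty minutes
    print(pool_alarm_firing([0, 0, 0, 1, 0]))     # one idle slot appeared in the window
```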
What's deferred to later releases
- CMK migration for RDS, EBS, Secrets Manager, ECR, CloudWatch Logs. Today these use AWS-managed default keys. Customer-managed keys per cell are Phase 1 of the rollout. The audit story is not affected: the slot's KMS encryption-context guarantee is independent of the data-store CMK.
- ElastiCache cluster-mode + RBAC + per-tenant ACL. Today the slot reaches Redis on a shared cluster with namespace isolation. Cluster-mode + per-execution ACL is Phase 1.1.
- Per-tenant FQDN egress allowlist. Today the slot's platform-egress SG limits outbound to AWS services + 443/53 only. Per-tenant allowlists via egress proxy are a Phase 2 follow-up.
- Customer-managed encryption keys (BYOK). Year 2.
- Sub-second cold-start tier (microVM / Kata). Year 2. The ExecutionTier protocol is designed so this lands as a new tier, not a rewrite.
How to verify your tenant is on the isolation tier
Go to the platform admin → Tenants → [your tenant]. The Isolation Tier panel shows:
- Current cell binding
- Sensitivity tier
- Whether your runs route to Tier 0 or Tier 1+
- Live pool health (idle / claimed / draining slot counts)
Your runs in Tempo will have waxell.execution.tier="tier-1-warm-fargate"
on the root span when isolation is active.
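If you export span attributes from your tracing backend, a check along these lines can confirm the routing for a given run. It is a sketch that works on a plain dictionary of attributes; how you fetch the root span from Tempo is left out, and the helper name is hypothetical. Only the tier-1-warm-fargate value appears on this page, so the check does not recognize Tier 2 or Tier 3 values.

```python
REQUIRED_ATTRIBUTES = (
    "waxell.execution.tier",
    "waxell.execution.cell_id",
    "waxell.execution.tenant_id",
    "waxell.execution.run_id",
    "waxell.execution.slot_arn",
)
TIER_1_VALUE = "tier-1-warm-fargate"


def isolation_report(root_span_attributes: dict) -> dict:
    """Summarize whether a root span shows Tier 1 isolation.

    Returns the attributes that are missing and whether the tier
    attribute equals the one documented Tier 1 value.
    """
    missing = [k for k in REQUIRED_ATTRIBUTES if k not in root_span_attributes]
    tier = root_span_attributes.get("waxell.execution.tier")
    return {"isolated": tier == TIER_1_VALUE, "tier": tier, "missing": missing}


if __name__ == "__main__":
    # Example values are made up for illustration.
    attrs = {
        "waxell.execution.tier": "tier-1-warm-fargate",
        "waxell.execution.cell_id": "cell-001",
        "waxell.execution.tenant_id": "t-42",
        "waxell.execution.run_id": "r-1",
        "waxell.execution.slot_arn": "arn:aws:ecs:us-east-1:123456789012:task/example/abc",
    }
    print(isolation_report(attrs))
```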
Reference
- Architecture spec: .planning/ISOLATION_TIER_DESIGN.md (engineering-internal)
- Operations: docs/runbooks/isolation_tier_management.md (engineering-internal)
- Cutover procedure: docs/runbooks/isolation_tier_callsine_cutover.md
- Failure-injection tests: docs/runbooks/isolation_tier_f3_test.md