Compare Models with Experiments

When choosing between LLMs (or prompt variants), you need data -- not guesses. Waxell Observe's experiment system lets you run the same test cases through different configurations and compare the results side by side.

What You'll Learn

  • How to create a test dataset with curated test cases
  • How to import items from production runs or bulk JSON
  • How to set up and run experiments across different model configurations
  • How to compare experiment results on accuracy, cost, and latency

Step 1: Create a Test Dataset

A dataset is a collection of test cases, each with an input and optionally an expected output. Start by creating an empty dataset:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot Evaluation Set",
"description": "Curated test cases for comparing model performance on support queries",
"tags": ["support", "evaluation", "v1"]
}'

Response:

{
"id": "ds-uuid-1234",
"name": "Support Bot Evaluation Set",
"description": "Curated test cases for comparing model performance on support queries",
"tags": ["support", "evaluation", "v1"],
"item_count": 0,
"created_at": "2026-02-07T10:00:00Z"
}

Step 2: Add Test Items

Add individual test items with input, expected output, and optional context:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/ds-uuid-1234/items/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"input": {"message": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password. Click the reset link sent to your email."},
"context": {"category": "account", "difficulty": "easy"},
"metadata": {"source": "manual"}
}'

Response:

{
"id": "item-uuid-1",
"dataset_id": "ds-uuid-1234",
"input": {"message": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password. Click the reset link sent to your email."},
"context": {"category": "account", "difficulty": "easy"},
"sort_order": 1,
"created_at": "2026-02-07T10:01:00Z"
}
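If you are scripting item creation, a quick local validation step catches malformed items before they reach the API. This is an illustrative sketch -- the required and optional fields mirror the request body above, but treat the exact schema as an assumption, not the authoritative contract:

```python
# Sketch: pre-flight validation of a dataset item before POSTing it.
# Field rules are inferred from the examples in this guide (assumption):
# 'input' is required; 'expected_output', 'context', 'metadata' are optional objects.

def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = []
    if not isinstance(item.get("input"), dict) or not item["input"]:
        problems.append("'input' must be a non-empty object")
    for key in ("expected_output", "context", "metadata"):
        if key in item and not isinstance(item[key], dict):
            problems.append(f"'{key}' must be an object when present")
    return problems
```

Run it over your items and only POST the ones that come back clean.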

Step 3: Import Items from Production

Capture real-world cases by importing from an existing agent execution run. This creates a dataset item using the run's inputs as input and the run's result as expected_output.

curl -X POST "https://acme.waxell.dev/api/v1/datasets/ds-uuid-1234/from-run/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"run_id": "existing-run-uuid"
}'

Response:

{
"id": "item-uuid-2",
"dataset_id": "ds-uuid-1234",
"input": {"message": "What are your business hours?"},
"expected_output": {"output": "Our business hours are Monday-Friday, 9am-5pm EST."},
"context": {"agent_name": "support-bot", "workflow_name": "handle-query"},
"sort_order": 2,
"source_run_id": "existing-run-uuid",
"metadata": {
"captured_from": "run",
"run_agent": "support-bot",
"run_started_at": "2026-02-06T14:30:00Z"
},
"created_at": "2026-02-07T10:02:00Z"
}
tip

Importing from production runs is the fastest way to build a representative test set. Start with 20-30 diverse runs that cover your most important use cases.

Step 4: Bulk Import

For larger test sets, import multiple items at once:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/ds-uuid-1234/import/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"items": [
{
"input": {"message": "Can I get a refund?"},
"expected_output": {"answer": "Refunds are available within 30 days of purchase. Go to Orders > Request Refund."},
"context": {"category": "billing", "difficulty": "medium"}
},
{
"input": {"message": "How do I export my data?"},
"expected_output": {"answer": "Go to Settings > Data > Export. You can download a CSV or JSON file."},
"context": {"category": "data", "difficulty": "easy"}
},
{
"input": {"message": "My account was charged twice. What should I do?"},
"expected_output": {"answer": "I apologize for the inconvenience. Please contact our billing team at billing@example.com with your order number."},
"context": {"category": "billing", "difficulty": "hard"}
}
]
}'

Response:

{
"dataset_id": "ds-uuid-1234",
"imported": 3,
"total_items": 5
}
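The import endpoint accepts a list of items in a single request. For very large test sets you may want to split the list into batches; the 100-items-per-request ceiling below is an assumed limit for illustration -- check your deployment's actual request-size limits:

```python
# Sketch: split a large item list into smaller bulk-import batches.
# batch_size=100 is an illustrative limit, not a documented one.

def chunk_items(items: list, batch_size: int = 100) -> list:
    """Return successive batches, each shaped like the 'items' field above."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each batch then becomes one POST to the /import/ endpoint.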

Step 5: Create Experiment A (GPT-4o)

An experiment ties a dataset to a specific configuration (model, prompt, parameters) and a set of evaluators. Create your first experiment for GPT-4o:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot - GPT-4o",
"dataset_id": "ds-uuid-1234",
"config": {
"model": "gpt-4o",
"temperature": 0.3,
"max_tokens": 500,
"system_prompt": "You are a helpful support agent. Answer user questions accurately and concisely."
},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"metadata": {
"hypothesis": "GPT-4o should have highest accuracy but also highest cost"
}
}'

Response:

{
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"dataset_id": "ds-uuid-1234",
"status": "pending",
"config": {
"model": "gpt-4o",
"temperature": 0.3,
"max_tokens": 500,
"system_prompt": "You are a helpful support agent..."
},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"created_at": "2026-02-07T10:10:00Z"
}

Step 6: Create Experiment B (Claude Sonnet)

Create a second experiment with a different model:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot - Claude Sonnet",
"dataset_id": "ds-uuid-1234",
"config": {
"model": "claude-sonnet-4-5-20250929",
"temperature": 0.3,
"max_tokens": 500,
"system_prompt": "You are a helpful support agent. Answer user questions accurately and concisely."
},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"metadata": {
"hypothesis": "Claude Sonnet may offer better cost/quality tradeoff"
}
}'
info

Keep everything identical between experiments except the variable you are testing (in this case, the model). Same dataset, same evaluators, same prompt, same temperature.
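One way to enforce that discipline in a script is to generate both payloads from a single template, varying only the model. A minimal sketch using the example IDs and values from this guide:

```python
# Sketch: build experiment payloads that differ only in the model under test.
# The dataset ID, evaluator IDs, and prompt are the example values from this
# guide; everything except the model is held constant.
import copy

BASE = {
    "dataset_id": "ds-uuid-1234",
    "config": {
        "model": None,
        "temperature": 0.3,
        "max_tokens": 500,
        "system_prompt": "You are a helpful support agent. Answer user questions accurately and concisely.",
    },
    "evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
}

def experiment_payload(label: str, model: str) -> dict:
    payload = copy.deepcopy(BASE)  # avoid mutating the shared template
    payload["name"] = f"Support Bot - {label}"
    payload["config"]["model"] = model
    return payload
```

POST each generated payload to the /experiments/ endpoint as shown above.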

Step 7: Start Experiments

Start each experiment. This creates an ExperimentRun for each dataset item and begins async execution:

# Start Experiment A
curl -X POST "https://acme.waxell.dev/api/v1/experiments/exp-uuid-gpt4o/start/" \
-H "Authorization: Bearer <your-session-token>"

Response:

{
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"status": "running",
"runs_created": 5,
"started_at": "2026-02-07T10:15:00Z"
}
# Start Experiment B
curl -X POST "https://acme.waxell.dev/api/v1/experiments/exp-uuid-claude/start/" \
-H "Authorization: Bearer <your-session-token>"

Step 8: Monitor Progress

Check experiment status while it runs:

curl -X GET "https://acme.waxell.dev/api/v1/experiments/exp-uuid-gpt4o/" \
-H "Authorization: Bearer <your-session-token>"

Response:

{
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"dataset_id": "ds-uuid-1234",
"dataset_name": "Support Bot Evaluation Set",
"status": "running",
"config": {"model": "gpt-4o", "temperature": 0.3},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"summary": null,
"runs": [
{
"id": "exprun-1",
"dataset_item_id": "item-uuid-1",
"status": "completed",
"output": {"answer": "To reset your password, navigate to Settings..."},
"latency_ms": 1250,
"tokens_in": 85,
"tokens_out": 62,
"cost": 0.0034,
"item_input": {"message": "How do I reset my password?"},
"item_expected_output": {"answer": "Go to Settings > Security > Reset Password..."}
},
{
"id": "exprun-2",
"dataset_item_id": "item-uuid-2",
"status": "running",
"output": null,
"latency_ms": null
}
],
"started_at": "2026-02-07T10:15:00Z"
}
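While polling, you can reduce the status response to a single progress line. A minimal sketch, assuming the runs array has the shape shown above:

```python
# Sketch: summarize experiment progress from the status response above.
from collections import Counter

def progress(experiment: dict) -> str:
    """Count run statuses and report how many runs have finished."""
    counts = Counter(run["status"] for run in experiment.get("runs", []))
    done = counts["completed"] + counts["failed"]
    total = sum(counts.values())
    return f"{done}/{total} runs finished ({counts['failed']} failed)"
```

Poll until the experiment-level status flips from running to completed.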

Step 9: View Results

Once an experiment completes, fetch the full results with scores:

curl -X GET "https://acme.waxell.dev/api/v1/experiments/exp-uuid-gpt4o/results/" \
-H "Authorization: Bearer <your-session-token>"

Response:

{
"experiment_id": "exp-uuid-gpt4o",
"experiment_name": "Support Bot - GPT-4o",
"dataset_name": "Support Bot Evaluation Set",
"status": "completed",
"summary": {
"total": 5,
"completed": 5,
"failed": 0,
"pending": 0,
"avg_latency_ms": 1180,
"total_cost": 0.0172,
"avg_cost": 0.00344
},
"results": [
{
"id": "exprun-1",
"dataset_item_id": "item-uuid-1",
"status": "completed",
"output": {"answer": "To reset your password, navigate to Settings > Security..."},
"latency_ms": 1250,
"tokens_in": 85,
"tokens_out": 62,
"cost": 0.0034,
"item_input": {"message": "How do I reset my password?"},
"item_expected_output": {"answer": "Go to Settings > Security > Reset Password..."},
"scores": [
{"name": "accuracy", "data_type": "numeric", "numeric_value": 0.95, "source": "evaluator"},
{"name": "helpfulness", "data_type": "numeric", "numeric_value": 0.88, "source": "evaluator"}
]
}
]
}
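To roll the per-run evaluator scores up into one average per metric (accuracy, helpfulness, etc.), a small aggregation over the results array is enough. A sketch, assuming the scores shape shown above:

```python
# Sketch: average each named evaluator score across all completed runs
# in the results payload above.
from collections import defaultdict

def average_scores(results: list) -> dict:
    """Return {score_name: mean numeric_value} across all runs."""
    values = defaultdict(list)
    for run in results:
        for score in run.get("scores", []):
            if score.get("numeric_value") is not None:
                values[score["name"]].append(score["numeric_value"])
    return {name: sum(vals) / len(vals) for name, vals in values.items()}
```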

Step 10: Compare Experiments

The compare endpoint gives you a side-by-side view across experiments:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"experiment_ids": ["exp-uuid-gpt4o", "exp-uuid-claude"]
}'

Response:

{
"experiments": {
"exp-uuid-gpt4o": {
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"dataset_name": "Support Bot Evaluation Set",
"status": "completed",
"config": {"model": "gpt-4o", "temperature": 0.3},
"summary": {
"total": 5,
"completed": 5,
"avg_latency_ms": 1180,
"total_cost": 0.0172
}
},
"exp-uuid-claude": {
"id": "exp-uuid-claude",
"name": "Support Bot - Claude Sonnet",
"dataset_name": "Support Bot Evaluation Set",
"status": "completed",
"config": {"model": "claude-sonnet-4-5-20250929", "temperature": 0.3},
"summary": {
"total": 5,
"completed": 5,
"avg_latency_ms": 980,
"total_cost": 0.0098
}
}
},
"comparisons": [
{
"dataset_item": {
"id": "item-uuid-1",
"input": {"message": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password..."}
},
"results": {
"exp-uuid-gpt4o": {
"run_id": "exprun-gpt4o-1",
"status": "completed",
"output": {"answer": "To reset your password, navigate to Settings..."},
"latency_ms": 1250,
"tokens_in": 85,
"tokens_out": 62,
"cost": 0.0034
},
"exp-uuid-claude": {
"run_id": "exprun-claude-1",
"status": "completed",
"output": {"answer": "You can reset your password by going to Settings..."},
"latency_ms": 890,
"tokens_in": 82,
"tokens_out": 58,
"cost": 0.0018
}
}
}
]
}
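For a per-item view, you can walk the comparisons array and compute how the challenger differs from the baseline on latency and cost. A sketch against the response shape above:

```python
# Sketch: per-item latency and cost deltas between two experiments,
# computed from the compare response above (negative = challenger is
# faster / cheaper on that item).

def item_deltas(compare: dict, baseline_id: str, challenger_id: str) -> list:
    rows = []
    for comp in compare["comparisons"]:
        base = comp["results"].get(baseline_id)
        chal = comp["results"].get(challenger_id)
        if not base or not chal:
            continue  # item missing a run on one side
        if base["status"] != "completed" or chal["status"] != "completed":
            continue  # only compare items that finished on both sides
        rows.append({
            "item_id": comp["dataset_item"]["id"],
            "latency_delta_ms": chal["latency_ms"] - base["latency_ms"],
            "cost_delta": chal["cost"] - base["cost"],
        })
    return rows
```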

Step 11: Analyze the Comparison

From the comparison data, build a summary table:

| Metric          | GPT-4o  | Claude Sonnet | Winner        |
|-----------------|---------|---------------|---------------|
| Avg Latency     | 1,180ms | 980ms         | Claude (-17%) |
| Total Cost      | $0.0172 | $0.0098       | Claude (-43%) |
| Avg Accuracy    | 0.93    | 0.91          | GPT-4o (+2%)  |
| Avg Helpfulness | 0.88    | 0.86          | GPT-4o (+2%)  |
| Failed Runs     | 0       | 0             | Tie           |

In this example, Claude Sonnet offers a significantly better cost/latency profile with only a marginal quality difference. Whether the 2% accuracy gap matters depends on your use case.
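The percentage deltas in the table can be reproduced from each experiment's summary block with a one-line helper:

```python
# Sketch: percentage change of a challenger metric relative to a baseline,
# rounded to a whole percent (negative = challenger is lower/faster/cheaper).

def pct_change(baseline: float, challenger: float) -> int:
    return round((challenger - baseline) / baseline * 100)
```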

tip

Run experiments with at least 30-50 test cases for statistically meaningful results. Small datasets can produce misleading comparisons.

Decision Framework

Choose your model based on what matters most:

  • Quality-critical (medical, legal, financial) -- Pick the highest-accuracy model regardless of cost
  • Cost-sensitive (high-volume support, content generation) -- Pick the cheapest model that meets your quality threshold
  • Latency-sensitive (real-time chat, interactive tools) -- Pick the fastest model that meets your quality threshold
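The framework above can be sketched as a selection function. The candidate dicts and the quality_floor parameter are illustrative inputs for this example, not Waxell Observe API fields:

```python
# Sketch: pick a model per the decision framework above.
# priority is one of "quality", "cost", or "latency"; quality_floor is
# the minimum acceptable accuracy for cost/latency-driven choices.

def pick_model(candidates: list, priority: str, quality_floor: float = 0.0) -> str:
    """candidates: [{'model', 'accuracy', 'avg_cost', 'avg_latency_ms'}, ...]"""
    if priority == "quality":
        # Quality-critical: highest accuracy regardless of cost.
        return max(candidates, key=lambda c: c["accuracy"])["model"]
    eligible = [c for c in candidates if c["accuracy"] >= quality_floor]
    if not eligible:
        raise ValueError("no candidate meets the quality threshold")
    # Cheapest or fastest model that clears the quality bar.
    key = "avg_cost" if priority == "cost" else "avg_latency_ms"
    return min(eligible, key=lambda c: c[key])["model"]
```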

Next Steps