Compare Models with Experiments

When choosing between LLMs (or prompt variants), you need data -- not guesses. Waxell Observe's experiment system lets you run the same test cases through different configurations and compare the results side by side.

What You'll Learn

  • How to create a test dataset with curated test cases
  • How to import items from production runs or bulk JSON
  • How to set up and run experiments across different model configurations
  • How to compare experiment results on accuracy, cost, and latency

Step 1: Create a Test Dataset

A dataset is a collection of test cases, each with an input and optionally an expected output. Start by creating an empty dataset:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot Evaluation Set",
"description": "Curated test cases for comparing model performance on support queries",
"tags": ["support", "evaluation", "v1"]
}'

Response:

{
"id": "ds-uuid-1234",
"name": "Support Bot Evaluation Set",
"description": "Curated test cases for comparing model performance on support queries",
"tags": ["support", "evaluation", "v1"],
"item_count": 0,
"created_at": "2026-02-07T10:00:00Z"
}

Step 2: Add Test Items

Add individual test items with input, expected output, and optional context:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/ds-uuid-1234/items/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"input": {"message": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password. Click the reset link sent to your email."},
"context": {"category": "account", "difficulty": "easy"},
"metadata": {"source": "manual"}
}'

Response:

{
"id": "item-uuid-1",
"dataset_id": "ds-uuid-1234",
"input": {"message": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password. Click the reset link sent to your email."},
"context": {"category": "account", "difficulty": "easy"},
"sort_order": 1,
"created_at": "2026-02-07T10:01:00Z"
}
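If you are scripting item creation, a quick local validation step catches malformed items before they reach the API. This is an illustrative sketch -- the required and optional fields mirror the request body above, but treat the exact schema as an assumption, not the authoritative contract:

```python
# Sketch: pre-flight validation of a dataset item before POSTing it.
# Field rules are inferred from the examples in this guide (assumption):
# 'input' is required; 'expected_output', 'context', 'metadata' are optional objects.

def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = []
    if not isinstance(item.get("input"), dict) or not item["input"]:
        problems.append("'input' must be a non-empty object")
    for key in ("expected_output", "context", "metadata"):
        if key in item and not isinstance(item[key], dict):
            problems.append(f"'{key}' must be an object when present")
    return problems
```

Run it over your items and only POST the ones that come back clean.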

Step 3: Import Items from Production

Capture real-world cases by importing from an existing agent execution run. This creates a dataset item using the run's inputs as input and the run's result as expected_output.

curl -X POST "https://acme.waxell.dev/api/v1/datasets/ds-uuid-1234/from-run/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"run_id": "existing-run-uuid"
}'

Response:

{
"id": "item-uuid-2",
"dataset_id": "ds-uuid-1234",
"input": {"message": "What are your business hours?"},
"expected_output": {"output": "Our business hours are Monday-Friday, 9am-5pm EST."},
"context": {"agent_name": "support-bot", "workflow_name": "handle-query"},
"sort_order": 2,
"source_run_id": "existing-run-uuid",
"metadata": {
"captured_from": "run",
"run_agent": "support-bot",
"run_started_at": "2026-02-06T14:30:00Z"
},
"created_at": "2026-02-07T10:02:00Z"
}
tip

Importing from production runs is the fastest way to build a representative test set. Start with 20-30 diverse runs that cover your most important use cases.

Step 4: Bulk Import

For larger test sets, import multiple items at once:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/ds-uuid-1234/import/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"items": [
{
"input": {"message": "Can I get a refund?"},
"expected_output": {"answer": "Refunds are available within 30 days of purchase. Go to Orders > Request Refund."},
"context": {"category": "billing", "difficulty": "medium"}
},
{
"input": {"message": "How do I export my data?"},
"expected_output": {"answer": "Go to Settings > Data > Export. You can download a CSV or JSON file."},
"context": {"category": "data", "difficulty": "easy"}
},
{
"input": {"message": "My account was charged twice. What should I do?"},
"expected_output": {"answer": "I apologize for the inconvenience. Please contact our billing team at billing@example.com with your order number."},
"context": {"category": "billing", "difficulty": "hard"}
}
]
}'

Response:

{
"dataset_id": "ds-uuid-1234",
"imported": 3,
"total_items": 5
}
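The import endpoint accepts a list of items in a single request. For very large test sets you may want to split the list into batches; the 100-items-per-request ceiling below is an assumed limit for illustration -- check your deployment's actual request-size limits:

```python
# Sketch: split a large item list into smaller bulk-import batches.
# batch_size=100 is an illustrative limit, not a documented one.

def chunk_items(items: list, batch_size: int = 100) -> list:
    """Return successive batches, each shaped like the 'items' field above."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each batch then becomes one POST to the /import/ endpoint.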

Step 5: Create Experiment A (GPT-4o)

An experiment ties a dataset to a specific configuration (model, prompt, parameters) and a set of evaluators. Create your first experiment for GPT-4o:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot - GPT-4o",
"dataset_id": "ds-uuid-1234",
"config": {
"model": "gpt-4o",
"temperature": 0.3,
"max_tokens": 500,
"system_prompt": "You are a helpful support agent. Answer user questions accurately and concisely."
},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"metadata": {
"hypothesis": "GPT-4o should have highest accuracy but also highest cost"
}
}'

Response:

{
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"dataset_id": "ds-uuid-1234",
"status": "pending",
"config": {
"model": "gpt-4o",
"temperature": 0.3,
"max_tokens": 500,
"system_prompt": "You are a helpful support agent..."
},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"created_at": "2026-02-07T10:10:00Z"
}

Step 6: Create Experiment B (Claude Sonnet)

Create a second experiment with a different model:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot - Claude Sonnet",
"dataset_id": "ds-uuid-1234",
"config": {
"model": "claude-sonnet-4-5-20250929",
"temperature": 0.3,
"max_tokens": 500,
"system_prompt": "You are a helpful support agent. Answer user questions accurately and concisely."
},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"metadata": {
"hypothesis": "Claude Sonnet may offer better cost/quality tradeoff"
}
}'
info

Keep everything identical between experiments except the variable you are testing (in this case, the model). Same dataset, same evaluators, same prompt, same temperature.
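One way to enforce that discipline in a script is to generate both payloads from a single template, varying only the model. A minimal sketch using the example IDs and values from this guide:

```python
# Sketch: build experiment payloads that differ only in the model under test.
# The dataset ID, evaluator IDs, and prompt are the example values from this
# guide; everything except the model is held constant.
import copy

BASE = {
    "dataset_id": "ds-uuid-1234",
    "config": {
        "model": None,
        "temperature": 0.3,
        "max_tokens": 500,
        "system_prompt": "You are a helpful support agent. Answer user questions accurately and concisely.",
    },
    "evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
}

def experiment_payload(label: str, model: str) -> dict:
    payload = copy.deepcopy(BASE)  # avoid mutating the shared template
    payload["name"] = f"Support Bot - {label}"
    payload["config"]["model"] = model
    return payload
```

POST each generated payload to the /experiments/ endpoint as shown above.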

Step 7: Start Experiments

Start each experiment. This creates an ExperimentRun for each dataset item and begins async execution:

# Start Experiment A
curl -X POST "https://acme.waxell.dev/api/v1/experiments/exp-uuid-gpt4o/start/" \
-H "Authorization: Bearer <your-session-token>"

Response:

{
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"status": "running",
"runs_created": 5,
"started_at": "2026-02-07T10:15:00Z"
}
# Start Experiment B
curl -X POST "https://acme.waxell.dev/api/v1/experiments/exp-uuid-claude/start/" \
-H "Authorization: Bearer <your-session-token>"

Step 8: Monitor Progress

Check experiment status while it runs:

curl -X GET "https://acme.waxell.dev/api/v1/experiments/exp-uuid-gpt4o/" \
-H "Authorization: Bearer <your-session-token>"

Response:

{
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"dataset_id": "ds-uuid-1234",
"dataset_name": "Support Bot Evaluation Set",
"status": "running",
"config": {"model": "gpt-4o", "temperature": 0.3},
"evaluator_ids": ["evaluator-uuid-accuracy", "evaluator-uuid-helpfulness"],
"summary": null,
"runs": [
{
"id": "exprun-1",
"dataset_item_id": "item-uuid-1",
"status": "completed",
"output": {"answer": "To reset your password, navigate to Settings..."},
"latency_ms": 1250,
"tokens_in": 85,
"tokens_out": 62,
"cost": 0.0034,
"item_input": {"message": "How do I reset my password?"},
"item_expected_output": {"answer": "Go to Settings > Security > Reset Password..."}
},
{
"id": "exprun-2",
"dataset_item_id": "item-uuid-2",
"status": "running",
"output": null,
"latency_ms": null
}
],
"started_at": "2026-02-07T10:15:00Z"
}
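While polling, you can reduce the status response to a single progress line. A minimal sketch, assuming the runs array has the shape shown above:

```python
# Sketch: summarize experiment progress from the status response above.
from collections import Counter

def progress(experiment: dict) -> str:
    """Count run statuses and report how many runs have finished."""
    counts = Counter(run["status"] for run in experiment.get("runs", []))
    done = counts["completed"] + counts["failed"]
    total = sum(counts.values())
    return f"{done}/{total} runs finished ({counts['failed']} failed)"
```

Poll until the experiment-level status flips from running to completed.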

Step 9: View Results

Once an experiment completes, fetch the full results with scores:

curl -X GET "https://acme.waxell.dev/api/v1/experiments/exp-uuid-gpt4o/results/" \
-H "Authorization: Bearer <your-session-token>"

Response:

{
"experiment_id": "exp-uuid-gpt4o",
"experiment_name": "Support Bot - GPT-4o",
"dataset_name": "Support Bot Evaluation Set",
"status": "completed",
"summary": {
"total": 5,
"completed": 5,
"failed": 0,
"pending": 0,
"avg_latency_ms": 1180,
"total_cost": 0.0172,
"avg_cost": 0.00344
},
"results": [
{
"id": "exprun-1",
"dataset_item_id": "item-uuid-1",
"status": "completed",
"output": {"answer": "To reset your password, navigate to Settings > Security..."},
"latency_ms": 1250,
"tokens_in": 85,
"tokens_out": 62,
"cost": 0.0034,
"item_input": {"message": "How do I reset my password?"},
"item_expected_output": {"answer": "Go to Settings > Security > Reset Password..."},
"scores": [
{"name": "accuracy", "data_type": "numeric", "numeric_value": 0.95, "source": "evaluator"},
{"name": "helpfulness", "data_type": "numeric", "numeric_value": 0.88, "source": "evaluator"}
]
}
]
}
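To roll the per-run evaluator scores up into one average per metric (accuracy, helpfulness, etc.), a small aggregation over the results array is enough. A sketch, assuming the scores shape shown above:

```python
# Sketch: average each named evaluator score across all completed runs
# in the results payload above.
from collections import defaultdict

def average_scores(results: list) -> dict:
    """Return {score_name: mean numeric_value} across all runs."""
    values = defaultdict(list)
    for run in results:
        for score in run.get("scores", []):
            if score.get("numeric_value") is not None:
                values[score["name"]].append(score["numeric_value"])
    return {name: sum(vals) / len(vals) for name, vals in values.items()}
```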

Step 10: Compare Experiments

The compare endpoint gives you a side-by-side view across experiments:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"experiment_ids": ["exp-uuid-gpt4o", "exp-uuid-claude"]
}'

Response:

{
"experiments": {
"exp-uuid-gpt4o": {
"id": "exp-uuid-gpt4o",
"name": "Support Bot - GPT-4o",
"dataset_name": "Support Bot Evaluation Set",
"status": "completed",
"config": {"model": "gpt-4o", "temperature": 0.3},
"summary": {
"total": 5,
"completed": 5,
"avg_latency_ms": 1180,
"total_cost": 0.0172
}
},
"exp-uuid-claude": {
"id": "exp-uuid-claude",
"name": "Support Bot - Claude Sonnet",
"dataset_name": "Support Bot Evaluation Set",
"status": "completed",
"config": {"model": "claude-sonnet-4-5-20250929", "temperature": 0.3},
"summary": {
"total": 5,
"completed": 5,
"avg_latency_ms": 980,
"total_cost": 0.0098
}
}
},
"comparisons": [
{
"dataset_item": {
"id": "item-uuid-1",
"input": {"message": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password..."}
},
"results": {
"exp-uuid-gpt4o": {
"run_id": "exprun-gpt4o-1",
"status": "completed",
"output": {"answer": "To reset your password, navigate to Settings..."},
"latency_ms": 1250,
"tokens_in": 85,
"tokens_out": 62,
"cost": 0.0034
},
"exp-uuid-claude": {
"run_id": "exprun-claude-1",
"status": "completed",
"output": {"answer": "You can reset your password by going to Settings..."},
"latency_ms": 890,
"tokens_in": 82,
"tokens_out": 58,
"cost": 0.0018
}
}
}
]
}
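For a per-item view, you can walk the comparisons array and compute how the challenger differs from the baseline on latency and cost. A sketch against the response shape above:

```python
# Sketch: per-item latency and cost deltas between two experiments,
# computed from the compare response above (negative = challenger is
# faster / cheaper on that item).

def item_deltas(compare: dict, baseline_id: str, challenger_id: str) -> list:
    rows = []
    for comp in compare["comparisons"]:
        base = comp["results"].get(baseline_id)
        chal = comp["results"].get(challenger_id)
        if not base or not chal:
            continue  # item missing a run on one side
        if base["status"] != "completed" or chal["status"] != "completed":
            continue  # only compare items that finished on both sides
        rows.append({
            "item_id": comp["dataset_item"]["id"],
            "latency_delta_ms": chal["latency_ms"] - base["latency_ms"],
            "cost_delta": chal["cost"] - base["cost"],
        })
    return rows
```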

Step 11: Analyze the Comparison

From the comparison data, build a summary table:

| Metric          | GPT-4o  | Claude Sonnet | Winner        |
|-----------------|---------|---------------|---------------|
| Avg Latency     | 1,180ms | 980ms         | Claude (-17%) |
| Total Cost      | $0.0172 | $0.0098       | Claude (-43%) |
| Avg Accuracy    | 0.93    | 0.91          | GPT-4o (+2%)  |
| Avg Helpfulness | 0.88    | 0.86          | GPT-4o (+2%)  |
| Failed Runs     | 0       | 0             | Tie           |

In this example, Claude Sonnet offers a significantly better cost/latency profile with only a marginal quality difference. Whether the 2% accuracy gap matters depends on your use case.
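The percentage deltas in the table can be reproduced from each experiment's summary block with a one-line helper:

```python
# Sketch: percentage change of a challenger metric relative to a baseline,
# rounded to a whole percent (negative = challenger is lower/faster/cheaper).

def pct_change(baseline: float, challenger: float) -> int:
    return round((challenger - baseline) / baseline * 100)
```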

tip

Run experiments with at least 30-50 test cases for statistically meaningful results. Small datasets can produce misleading comparisons.

Decision Framework

Choose your model based on what matters most:

  • Quality-critical (medical, legal, financial) -- Pick the highest-accuracy model regardless of cost
  • Cost-sensitive (high-volume support, content generation) -- Pick the cheapest model that meets your quality threshold
  • Latency-sensitive (real-time chat, interactive tools) -- Pick the fastest model that meets your quality threshold
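The framework above can be sketched as a selection function. The candidate dicts and the quality_floor parameter are illustrative inputs for this example, not Waxell Observe API fields:

```python
# Sketch: pick a model per the decision framework above.
# priority is one of "quality", "cost", or "latency"; quality_floor is
# the minimum acceptable accuracy for cost/latency-driven choices.

def pick_model(candidates: list, priority: str, quality_floor: float = 0.0) -> str:
    """candidates: [{'model', 'accuracy', 'avg_cost', 'avg_latency_ms'}, ...]"""
    if priority == "quality":
        # Quality-critical: highest accuracy regardless of cost.
        return max(candidates, key=lambda c: c["accuracy"])["model"]
    eligible = [c for c in candidates if c["accuracy"] >= quality_floor]
    if not eligible:
        raise ValueError("no candidate meets the quality threshold")
    # Cheapest or fastest model that clears the quality bar.
    key = "avg_cost" if priority == "cost" else "avg_latency_ms"
    return min(eligible, key=lambda c: c[key])["model"]
```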

Next Steps