
Datasets & Experiments

Datasets and experiments let you systematically evaluate your agents and prompts. Build a dataset of test cases, run them through different configurations, attach evaluators for automated scoring, and compare results side by side.

Datasets

A dataset is a named collection of test cases. Each item has an input, optional expected output, and optional context.

Dataset Item Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input | object | Yes | The test input (JSON object or string) |
| expected_output | object | No | Ground truth output for comparison |
| context | object | No | Additional context (e.g., retrieved documents, agent config) |
| metadata | object | No | Arbitrary metadata |
| sort_order | int | No | Display order (auto-assigned if omitted) |
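
As a rough illustration, an item payload can be assembled client-side before it is posted. The helper below is a hypothetical sketch (build_item is not part of the API); it only mirrors the field list above, requiring input and leaving out optional fields that are unset.

```python
from typing import Any, Optional


def build_item(
    item_input: Any,
    expected_output: Optional[dict] = None,
    context: Optional[dict] = None,
    metadata: Optional[dict] = None,
    sort_order: Optional[int] = None,
) -> dict:
    """Build a dataset item payload; optional fields are omitted when None."""
    if item_input is None:
        raise ValueError("input is required")
    item = {"input": item_input}
    optional_fields = {
        "expected_output": expected_output,
        "context": context,
        "metadata": metadata,
        "sort_order": sort_order,
    }
    for name, value in optional_fields.items():
        if value is not None:
            item[name] = value
    return item


item = build_item(
    {"query": "How do I reset my password?"},
    context={"category": "account"},
)
print(item)
```

Leaving sort_order out of the payload lets the server auto-assign the display order, as described above.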

Creating a Dataset

curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Agent QA",
    "description": "Test cases for the customer support agent",
    "tags": ["support", "qa"]
  }'

Adding Items Manually

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"query": "How do I reset my password?"},
    "expected_output": {"answer": "Go to Settings > Security > Reset Password"},
    "context": {"category": "account"}
  }'

Bulk Import

Import multiple items at once from a JSON payload:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/import/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      {
        "input": {"query": "What are your pricing plans?"},
        "expected_output": {"answer": "We offer Free, Pro, and Enterprise plans"}
      },
      {
        "input": {"query": "How do I contact support?"},
        "expected_output": {"answer": "Email support@example.com or use the in-app chat"}
      }
    ]
  }'

Response:

{
  "dataset_id": "a1b2c3d4-...",
  "imported": 3,
  "total_items": 3
}

Note: To import from a CSV or JSON file, first convert its contents to the items array format above. Each item must include at least an input field.
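
The conversion can be done with a few lines of Python. This is a minimal sketch that assumes a CSV with hypothetical columns named query and answer; the column names and the csv_to_items helper are illustrative, not something the API prescribes. Rows without a query are skipped because every item needs an input.

```python
import csv
import io
import json


def csv_to_items(csv_text: str) -> list[dict]:
    """Convert CSV rows into the items array used by the import endpoint.

    Assumes hypothetical "query" and "answer" columns.
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        query = (row.get("query") or "").strip()
        if not query:
            continue  # an item without input would be rejected
        item = {"input": {"query": query}}
        answer = (row.get("answer") or "").strip()
        if answer:
            item["expected_output"] = {"answer": answer}
        items.append(item)
    return items


sample_csv = (
    "query,answer\n"
    "How do I reset my password?,Go to Settings > Security > Reset Password\n"
    "What are your pricing plans?,\n"
    ",An answer with no question\n"
)
payload = {"items": csv_to_items(sample_csv)}
print(json.dumps(payload, indent=2))
```

The resulting payload can be sent as the request body of the import call shown above.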

Capture from Production (Run to Dataset)

Turn a real agent execution into a test case. This is useful for building regression test sets from interesting production runs:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/from-run/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_id": "42"}'

The run's inputs become the item's input, and the run's result becomes expected_output. The agent name and workflow are stored in the context field.
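
The mapping can be pictured as a small function. The sketch below is illustrative only: the shape of the run record (inputs, result, agent_name, workflow keys) and the key names inside context are assumptions, since the server performs this conversion and its internal representation is not documented here.

```python
def run_to_item(run: dict) -> dict:
    """Illustrate how a run maps to a dataset item (assumed field names)."""
    return {
        "input": run["inputs"],
        "expected_output": run["result"],
        "context": {
            "agent_name": run.get("agent_name"),
            "workflow": run.get("workflow"),
        },
    }


# Hypothetical run record, for illustration only
example_run = {
    "inputs": {"query": "How do I reset my password?"},
    "result": {"answer": "Go to Settings > Security > Reset Password"},
    "agent_name": "support-agent",
    "workflow": "answer_question",
}
print(run_to_item(example_run))
```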

Dataset REST API Summary

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/v1/datasets/ | GET | List datasets (supports search, tags, sort, pagination) |
| /api/v1/datasets/ | POST | Create dataset |
| /api/v1/datasets/{id}/ | GET | Dataset detail with recent items |
| /api/v1/datasets/{id}/ | PUT | Update dataset metadata |
| /api/v1/datasets/{id}/ | DELETE | Delete dataset and all items |
| /api/v1/datasets/{id}/items/ | GET | List items (supports sort, pagination) |
| /api/v1/datasets/{id}/items/ | POST | Create single item |
| /api/v1/datasets/{id}/items/{item_id}/ | GET | Item detail |
| /api/v1/datasets/{id}/items/{item_id}/ | PUT | Update item |
| /api/v1/datasets/{id}/items/{item_id}/ | DELETE | Delete item |
| /api/v1/datasets/{id}/import/ | POST | Bulk import items |
| /api/v1/datasets/{id}/from-run/ | POST | Create item from a production run |

Experiments

An experiment runs every item in a dataset through a specific configuration (prompt + model, or agent) and records the results.

Creating an Experiment

curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-4o vs GPT-4o-mini on Support QA",
    "dataset_id": "a1b2c3d4-...",
    "config": {
      "prompt_name": "support-system-prompt",
      "prompt_label": "production",
      "model": "gpt-4o",
      "temperature": 0.3
    },
    "evaluator_ids": ["e1f2g3h4-...", "i5j6k7l8-..."],
    "metadata": {"hypothesis": "GPT-4o should outperform mini on complex queries"}
  }'

Experiment Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Experiment name |
| dataset_id | uuid | Yes | Dataset to run against |
| config | object | No | Configuration for execution (prompt, model, temperature, etc.) |
| evaluator_ids | list[uuid] | No | Evaluators to auto-score results |
| metadata | object | No | Arbitrary metadata (hypothesis, notes) |

Experiment Lifecycle

Experiments follow a state machine:

pending --> running --> completed
running --> failed
pending --> cancelled
running --> cancelled
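
Client code that tracks experiments can encode these transitions as a lookup table. The sketch below is one way to do that, reading the failed branch as leaving the running state; the names ALLOWED_TRANSITIONS, can_transition, and is_terminal are illustrative and not part of the API.

```python
# Allowed next states for each experiment status, per the diagram above.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_terminal(status: str) -> bool:
    """Return True for known statuses that have no outgoing transitions."""
    return status in ALLOWED_TRANSITIONS and not ALLOWED_TRANSITIONS[status]


print(can_transition("pending", "running"))
print(can_transition("completed", "running"))
print(is_terminal("failed"))
```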

Start an experiment:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/start/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"

This creates an ExperimentRun for each dataset item and kicks off asynchronous execution. The response includes the number of runs created:

{
  "id": "x1y2z3-...",
  "name": "GPT-4o vs GPT-4o-mini on Support QA",
  "status": "running",
  "runs_created": 25,
  "started_at": "2026-02-07T15:00:00Z"
}
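
Because execution is asynchronous, a client usually has to poll until the experiment reaches a terminal status. The following is a minimal polling sketch, not an official client: wait_for_experiment is a hypothetical helper, and fetch_status is any callable you supply that returns the current status string (for example, by reading the status field from the experiment detail endpoint, assuming that response includes it).

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_experiment(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until a terminal status is seen or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"experiment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Simulated status sequence standing in for real HTTP calls
simulated = iter(["running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(simulated), interval_s=0.0)
print(final_status)
```

Passing the status fetcher as a callable keeps the loop independent of any particular HTTP library and makes it easy to test without network access.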

Cancel a running experiment:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/cancel/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"

Viewing Results

GET /api/v1/experiments/{experiment_id}/results/

Returns results for every dataset item, including output, latency, cost, tokens, and any evaluator scores:

{
  "experiment_id": "x1y2z3-...",
  "experiment_name": "GPT-4o vs GPT-4o-mini on Support QA",
  "dataset_name": "Support Agent QA",
  "status": "completed",
  "summary": {
    "total": 25,
    "completed": 24,
    "failed": 1,
    "pending": 0,
    "avg_latency_ms": 1250,
    "total_cost": 0.1845,
    "avg_cost": 0.007687
  },
  "results": [
    {
      "id": "r1s2t3-...",
      "dataset_item_id": "d1e2f3-...",
      "status": "completed",
      "output": "Go to Settings > Security > Reset Password",
      "error": "",
      "latency_ms": 980,
      "tokens_in": 145,
      "tokens_out": 32,
      "cost": 0.0052,
      "item_input": {"query": "How do I reset my password?"},
      "item_expected_output": {"answer": "Go to Settings > Security > Reset Password"},
      "scores": [
        {
          "name": "helpfulness",
          "data_type": "numeric",
          "numeric_value": 0.95,
          "string_value": null,
          "source": "evaluator"
        }
      ]
    }
  ]
}
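
A results payload like this can be reduced to one number per evaluator. The sketch below averages numeric scores by name across completed runs; mean_numeric_scores is a hypothetical helper, and the sample data is made up in the shape of the response above.

```python
from collections import defaultdict
from statistics import mean


def mean_numeric_scores(results_response: dict) -> dict[str, float]:
    """Average numeric evaluator scores by score name over completed runs."""
    values_by_name = defaultdict(list)
    for result in results_response.get("results", []):
        if result.get("status") != "completed":
            continue  # failed or pending runs carry no usable scores
        for score in result.get("scores", []):
            value = score.get("numeric_value")
            if score.get("data_type") == "numeric" and value is not None:
                values_by_name[score["name"]].append(value)
    return {name: mean(values) for name, values in values_by_name.items()}


# Made-up sample in the shape of the results response
sample_response = {
    "results": [
        {"status": "completed", "scores": [
            {"name": "helpfulness", "data_type": "numeric", "numeric_value": 1.0},
        ]},
        {"status": "completed", "scores": [
            {"name": "helpfulness", "data_type": "numeric", "numeric_value": 0.5},
        ]},
        {"status": "failed", "scores": []},
    ]
}
averages = mean_numeric_scores(sample_response)
print(averages)
```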

Comparing Experiments

The comparison endpoint puts multiple experiments side by side, aligned by dataset item:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_ids": ["exp-gpt4o-id", "exp-gpt4o-mini-id"]
  }'

Response:

{
  "experiments": {
    "exp-gpt4o-id": {
      "name": "GPT-4o on Support QA",
      "config": {"model": "gpt-4o", "temperature": 0.3},
      "status": "completed"
    },
    "exp-gpt4o-mini-id": {
      "name": "GPT-4o-mini on Support QA",
      "config": {"model": "gpt-4o-mini", "temperature": 0.3},
      "status": "completed"
    }
  },
  "comparisons": [
    {
      "dataset_item": {
        "id": "d1e2f3-...",
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      "results": {
        "exp-gpt4o-id": {
          "run_id": "r1-...",
          "status": "completed",
          "output": "Go to Settings > Security > Reset Password",
          "latency_ms": 980,
          "cost": 0.0052
        },
        "exp-gpt4o-mini-id": {
          "run_id": "r2-...",
          "status": "completed",
          "output": "Navigate to Settings, then Security, and click Reset Password",
          "latency_ms": 420,
          "cost": 0.0003
        }
      }
    }
  ]
}

Note: At least two experiment IDs are required for comparison. The experiments do not need to share a dataset, but the comparison is most meaningful when they do.
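
Once a comparison response is in hand, per-item differences are simple to derive. The sketch below computes latency and cost deltas between a baseline and a candidate experiment; latency_cost_deltas is a hypothetical helper, and it skips items where either side is missing or did not complete.

```python
def latency_cost_deltas(comparison: dict, baseline_id: str, candidate_id: str) -> list[dict]:
    """Return candidate-minus-baseline latency and cost for each dataset item."""
    rows = []
    for entry in comparison.get("comparisons", []):
        results = entry.get("results", {})
        base = results.get(baseline_id)
        cand = results.get(candidate_id)
        if not base or not cand:
            continue  # item was not run in both experiments
        if base.get("status") != "completed" or cand.get("status") != "completed":
            continue
        rows.append({
            "dataset_item_id": entry["dataset_item"]["id"],
            "latency_delta_ms": cand["latency_ms"] - base["latency_ms"],
            "cost_delta": round(cand["cost"] - base["cost"], 6),
        })
    return rows


# Trimmed copy of the example response above
sample_comparison = {
    "comparisons": [
        {
            "dataset_item": {"id": "d1e2f3-..."},
            "results": {
                "exp-gpt4o-id": {"status": "completed", "latency_ms": 980, "cost": 0.0052},
                "exp-gpt4o-mini-id": {"status": "completed", "latency_ms": 420, "cost": 0.0003},
            },
        }
    ]
}
deltas = latency_cost_deltas(sample_comparison, "exp-gpt4o-id", "exp-gpt4o-mini-id")
print(deltas)
```

A negative delta means the candidate was faster or cheaper than the baseline on that item.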

Experiment REST API Summary

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/v1/experiments/ | GET | List experiments (supports dataset_id, status, sort, pagination) |
| /api/v1/experiments/ | POST | Create experiment |
| /api/v1/experiments/{id}/ | GET | Experiment detail with all runs |
| /api/v1/experiments/{id}/ | DELETE | Delete experiment |
| /api/v1/experiments/{id}/start/ | POST | Start experiment execution |
| /api/v1/experiments/{id}/cancel/ | POST | Cancel running experiment |
| /api/v1/experiments/{id}/results/ | GET | Full results with scores |
| /api/v1/experiments/compare/ | POST | Compare multiple experiments |

Compare Prompt Versions

The compare-prompt-versions endpoint runs two versions of the same prompt against every item in a dataset and creates a pair of linked experiments. This is the fastest way to measure the impact of a prompt change:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/compare-prompt-versions/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_name": "support-system-prompt",
    "version_a": 1,
    "version_b": 2,
    "model": "gpt-4o-mini",
    "evaluator_ids": ["helpfulness-evaluator-id"]
  }'

Response:

{
  "experiment_a": {"id": "...", "name": "support-system-prompt-v1-vs-v2-a", "version": 1},
  "experiment_b": {"id": "...", "name": "support-system-prompt-v1-vs-v2-b", "version": 2},
  "compare_url": "/api/v1/experiments/compare/?experiment_ids=<a>,<b>"
}

Both experiments start running immediately. Once both have completed, request the compare_url to get a side-by-side view of outputs, scores, latency, and cost for each dataset item.

From the UI

On the dataset detail page, click Compare Versions next to the "New Experiment" button. Enter the prompt name, the two version numbers, and optionally override the model. The UI redirects to the comparison view when both experiments finish.

Python Example

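import httpx

# Placeholder values (assumptions): substitute your own tenant URL,
# session cookie, and production run IDs.
base_url = "https://acme.waxell.dev"
session_id = "..."
recent_run_ids = ["42"]

client = httpx.Client(cookies={"sessionid": session_id})

# Create an empty dataset to hold the regression cases
response = client.post(
    f"{base_url}/api/v1/datasets/",
    json={"name": "QA Regression Set", "tags": ["qa"]},
)
response.raise_for_status()
dataset = response.json()

# Add one item per production run
for run_id in recent_run_ids:
    client.post(
        f"{base_url}/api/v1/datasets/{dataset['id']}/from-run/",
        json={"run_id": run_id},
    ).raise_for_status()

# Start the v1 vs v2 comparison; both experiments begin running immediately
result = client.post(
    f"{base_url}/api/v1/datasets/{dataset['id']}/compare-prompt-versions/",
    json={"prompt_name": "support-system-prompt", "version_a": 1, "version_b": 2},
).json()

# Fetch the comparison once both experiments have completed
comparison = client.get(f"{base_url}{result['compare_url']}").json()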

Workflow: Evaluation Pipeline

A typical evaluation workflow combines datasets, experiments, and evaluators:

  1. Build a dataset from production runs using the "Run to Dataset" feature, or import curated test cases
  2. Create evaluators for the quality dimensions you care about (helpfulness, accuracy, safety)
  3. Run experiment A with your current prompt/model configuration
  4. Run experiment B with a new prompt version or different model
  5. Compare experiments side by side to see per-item differences
  6. Review scores from automated evaluators to quantify improvement
  7. Promote the winning configuration to production via prompt labels

Next Steps