Datasets & Experiments
Datasets and experiments let you systematically evaluate your agents and prompts. Build a dataset of test cases, run them through different configurations, attach evaluators for automated scoring, and compare results side by side.
Datasets
A dataset is a named collection of test cases. Each item has an input, plus an optional expected output, optional context, and optional metadata.
Dataset Item Fields
| Field | Type | Required | Description |
|---|---|---|---|
| input | object | Yes | The test input (JSON object or string) |
| expected_output | object | No | Ground truth output for comparison |
| context | object | No | Additional context (e.g., retrieved documents, agent config) |
| metadata | object | No | Arbitrary metadata |
| sort_order | int | No | Display order (auto-assigned if omitted) |
Creating a Dataset
curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{
"name": "Support Agent QA",
"description": "Test cases for the customer support agent",
"tags": ["support", "qa"]
}'
Adding Items Manually
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/items/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{
"input": {"query": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password"},
"context": {"category": "account"}
}'
Bulk Import
Import multiple items at once from a JSON payload:
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/import/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{
"items": [
{
"input": {"query": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password"}
},
{
"input": {"query": "What are your pricing plans?"},
"expected_output": {"answer": "We offer Free, Pro, and Enterprise plans"}
},
{
"input": {"query": "How do I contact support?"},
"expected_output": {"answer": "Email support@example.com or use the in-app chat"}
}
]
}'
Response:
{
"dataset_id": "a1b2c3d4-...",
"imported": 3,
"total_items": 3
}
For CSV/JSON file import, convert your file to the items array format above. Each item must have at minimum an input field.
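As a rough illustration of that conversion, the sketch below turns CSV text into the items payload. The column names `query` and `answer` and the helper name `csv_to_items` are assumptions made for this example, not part of the API; adapt them to your own file.

```python
import csv
import io
import json


def csv_to_items(csv_text: str) -> dict:
    """Convert CSV text into the {"items": [...]} payload used by bulk import.

    Assumes a header row with hypothetical columns "query" and "answer".
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        query = (row.get("query") or "").strip()
        if not query:
            # Every item needs at least an input, so skip rows without one.
            continue
        item = {"input": {"query": query}}
        answer = (row.get("answer") or "").strip()
        if answer:
            # expected_output is optional; include it only when present.
            item["expected_output"] = {"answer": answer}
        items.append(item)
    return {"items": items}


sample = (
    "query,answer\n"
    "How do I reset my password?,Go to Settings > Security > Reset Password\n"
    "What are your pricing plans?,\n"
)
payload = csv_to_items(sample)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of the import request shown above.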
Capture from Production (Run to Dataset)
Turn a real agent execution into a test case. This is useful for building regression test sets from interesting production runs:
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/from-run/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{"run_id": "42"}'
The run's inputs become the item's input, and the run's result becomes expected_output. The agent name and workflow are stored in the context field.
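The mapping described above can be pictured with a small local function. This is only a sketch of the documented behavior, not the server's implementation: the run keys (`inputs`, `result`, `agent_name`, `workflow`) and the key names inside `context` are assumptions chosen for illustration.

```python
def run_to_item(run: dict) -> dict:
    """Mirror the documented run-to-item mapping.

    The keys read from `run` and the keys written into "context"
    are hypothetical names used only for this example.
    """
    return {
        "input": run["inputs"],            # run inputs become the item input
        "expected_output": run["result"],  # run result becomes expected_output
        "context": {
            "agent_name": run.get("agent_name"),
            "workflow": run.get("workflow"),
        },
    }


example_run = {
    "inputs": {"query": "How do I reset my password?"},
    "result": {"answer": "Go to Settings > Security > Reset Password"},
    "agent_name": "support-agent",
    "workflow": "answer-question",
}
item = run_to_item(example_run)
print(item)
```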
Dataset REST API Summary
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/datasets/ | GET | List datasets (supports search, tags, sort, pagination) |
| /api/v1/datasets/ | POST | Create dataset |
| /api/v1/datasets/{id}/ | GET | Dataset detail with recent items |
| /api/v1/datasets/{id}/ | PUT | Update dataset metadata |
| /api/v1/datasets/{id}/ | DELETE | Delete dataset and all items |
| /api/v1/datasets/{id}/items/ | GET | List items (supports sort, pagination) |
| /api/v1/datasets/{id}/items/ | POST | Create single item |
| /api/v1/datasets/{id}/items/{item_id}/ | GET | Item detail |
| /api/v1/datasets/{id}/items/{item_id}/ | PUT | Update item |
| /api/v1/datasets/{id}/items/{item_id}/ | DELETE | Delete item |
| /api/v1/datasets/{id}/import/ | POST | Bulk import items |
| /api/v1/datasets/{id}/from-run/ | POST | Create item from a production run |
Experiments
An experiment runs every item in a dataset through a specific configuration (prompt + model, or agent) and records the results.
Creating an Experiment
curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{
"name": "GPT-4o vs GPT-4o-mini on Support QA",
"dataset_id": "a1b2c3d4-...",
"config": {
"prompt_name": "support-system-prompt",
"prompt_label": "production",
"model": "gpt-4o",
"temperature": 0.3
},
"evaluator_ids": ["e1f2g3h4-...", "i5j6k7l8-..."],
"metadata": {"hypothesis": "GPT-4o should outperform mini on complex queries"}
}'
Experiment Fields
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Experiment name |
| dataset_id | uuid | Yes | Dataset to run against |
| config | object | No | Configuration for execution (prompt, model, temperature, etc.) |
| evaluator_ids | list[uuid] | No | Evaluators to auto-score results |
| metadata | object | No | Arbitrary metadata (hypothesis, notes) |
Experiment Lifecycle
Experiments follow a state machine:
pending --> running --> completed
                   \--> failed
pending --> cancelled
running --> cancelled
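The lifecycle can also be written as a transition table, which is handy for client-side checks before calling start or cancel. The names `TRANSITIONS`, `can_transition`, and `is_terminal` are illustrative; the table itself is transcribed from the state list above.

```python
# Allowed transitions, transcribed from the lifecycle states above.
TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(status: str) -> bool:
    """Return True for known states that have no outgoing transitions."""
    return status in TRANSITIONS and not TRANSITIONS[status]


print(can_transition("pending", "running"))
print(can_transition("completed", "running"))
print(is_terminal("failed"))
```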
Start an experiment:
curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/start/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json"
This creates an ExperimentRun for each dataset item and kicks off asynchronous execution. The response includes the number of runs created:
{
"id": "x1y2z3-...",
"name": "GPT-4o vs GPT-4o-mini on Support QA",
"status": "running",
"runs_created": 25,
"started_at": "2026-02-07T15:00:00Z"
}
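Because execution is asynchronous, a client usually polls until the experiment reaches a terminal state. The sketch below keeps the HTTP call out of the loop by accepting a `fetch_status` callable; that callable, the function name, and the default interval and timeout are assumptions for illustration. The example run uses a canned sequence of statuses in place of real requests.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_experiment(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"experiment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Stand-in for real requests: three polls, the last one terminal.
_fake_statuses = iter(["running", "running", "completed"])
final_status = wait_for_experiment(
    lambda: next(_fake_statuses),
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(final_status)
```

In practice `fetch_status` might request the experiment detail endpoint and return its status; whether that response exposes a `status` field in the same form as the start response is an assumption to verify against your deployment.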
Cancel a running experiment:
curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/cancel/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json"
Viewing Results
GET /api/v1/experiments/{experiment_id}/results/
Returns results for every dataset item, including output, latency, cost, tokens, and any evaluator scores:
{
"experiment_id": "x1y2z3-...",
"experiment_name": "GPT-4o vs GPT-4o-mini on Support QA",
"dataset_name": "Support Agent QA",
"status": "completed",
"summary": {
"total": 25,
"completed": 24,
"failed": 1,
"pending": 0,
"avg_latency_ms": 1250,
"total_cost": 0.1845,
"avg_cost": 0.007687
},
"results": [
{
"id": "r1s2t3-...",
"dataset_item_id": "d1e2f3-...",
"status": "completed",
"output": "Go to Settings > Security > Reset Password",
"error": "",
"latency_ms": 980,
"tokens_in": 145,
"tokens_out": 32,
"cost": 0.0052,
"item_input": {"query": "How do I reset my password?"},
"item_expected_output": {"answer": "Go to Settings > Security > Reset Password"},
"scores": [
{
"name": "helpfulness",
"data_type": "numeric",
"numeric_value": 0.95,
"string_value": null,
"source": "evaluator"
}
]
}
]
}
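To turn per-item scores into a per-evaluator summary, one option is to average the numeric scores client-side. The sketch below works on a payload shaped like the response above; the function name `mean_scores` and the choice to ignore non-completed runs and non-numeric scores are assumptions for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(results_payload: dict) -> dict:
    """Average numeric scores per score name across completed runs."""
    by_name = defaultdict(list)
    for result in results_payload.get("results", []):
        if result.get("status") != "completed":
            continue  # failed or pending runs carry no usable scores
        for score in result.get("scores", []):
            value = score.get("numeric_value")
            if score.get("data_type") == "numeric" and value is not None:
                by_name[score["name"]].append(value)
    return {name: mean(values) for name, values in by_name.items()}


sample_results = {
    "results": [
        {"status": "completed", "scores": [
            {"name": "helpfulness", "data_type": "numeric", "numeric_value": 0.95}]},
        {"status": "completed", "scores": [
            {"name": "helpfulness", "data_type": "numeric", "numeric_value": 0.85}]},
        {"status": "failed", "scores": []},
    ]
}
averages = mean_scores(sample_results)
print(averages)
```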
Comparing Experiments
The comparison endpoint puts multiple experiments side by side, aligned by dataset item:
curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{
"experiment_ids": ["exp-gpt4o-id", "exp-gpt4o-mini-id"]
}'
Response:
{
"experiments": {
"exp-gpt4o-id": {
"name": "GPT-4o on Support QA",
"config": {"model": "gpt-4o", "temperature": 0.3},
"status": "completed"
},
"exp-gpt4o-mini-id": {
"name": "GPT-4o-mini on Support QA",
"config": {"model": "gpt-4o-mini", "temperature": 0.3},
"status": "completed"
}
},
"comparisons": [
{
"dataset_item": {
"id": "d1e2f3-...",
"input": {"query": "How do I reset my password?"},
"expected_output": {"answer": "Go to Settings > Security > Reset Password"}
},
"results": {
"exp-gpt4o-id": {
"run_id": "r1-...",
"status": "completed",
"output": "Go to Settings > Security > Reset Password",
"latency_ms": 980,
"cost": 0.0052
},
"exp-gpt4o-mini-id": {
"run_id": "r2-...",
"status": "completed",
"output": "Navigate to Settings, then Security, and click Reset Password",
"latency_ms": 420,
"cost": 0.0003
}
}
}
]
}
At least two experiment IDs are required for comparison. The experiments do not need to use the same dataset, but because results are aligned by dataset item, the comparison is most meaningful when they do.
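A comparison response can be reduced to per-item deltas to see where a candidate configuration is faster or cheaper than a baseline. This is a minimal sketch over the response shape shown above; the function name and the decision to skip items where either run did not complete are assumptions.

```python
def latency_cost_deltas(comparison: dict, baseline_id: str, candidate_id: str) -> list:
    """Return candidate-minus-baseline latency and cost for each dataset item."""
    rows = []
    for entry in comparison.get("comparisons", []):
        results = entry.get("results", {})
        base = results.get(baseline_id)
        cand = results.get(candidate_id)
        if not base or not cand:
            continue  # item missing from one of the experiments
        if base.get("status") != "completed" or cand.get("status") != "completed":
            continue  # only compare runs that finished
        rows.append({
            "dataset_item_id": entry["dataset_item"]["id"],
            "latency_delta_ms": cand["latency_ms"] - base["latency_ms"],
            "cost_delta": round(cand["cost"] - base["cost"], 6),
        })
    return rows


sample_comparison = {
    "comparisons": [{
        "dataset_item": {"id": "d1e2f3-..."},
        "results": {
            "exp-gpt4o-id": {"status": "completed", "latency_ms": 980, "cost": 0.0052},
            "exp-gpt4o-mini-id": {"status": "completed", "latency_ms": 420, "cost": 0.0003},
        },
    }]
}
deltas = latency_cost_deltas(sample_comparison, "exp-gpt4o-id", "exp-gpt4o-mini-id")
print(deltas)
```

Negative deltas mean the candidate was faster or cheaper than the baseline for that item.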
Experiment REST API Summary
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/experiments/ | GET | List experiments (supports dataset_id, status, sort, pagination) |
| /api/v1/experiments/ | POST | Create experiment |
| /api/v1/experiments/{id}/ | GET | Experiment detail with all runs |
| /api/v1/experiments/{id}/ | DELETE | Delete experiment |
| /api/v1/experiments/{id}/start/ | POST | Start experiment execution |
| /api/v1/experiments/{id}/cancel/ | POST | Cancel running experiment |
| /api/v1/experiments/{id}/results/ | GET | Full results with scores |
| /api/v1/experiments/compare/ | POST | Compare multiple experiments |
Compare Prompt Versions
The compare-prompt-versions endpoint runs two versions of the same prompt against every item in a dataset and creates a pair of linked experiments. This is the fastest way to measure the impact of a prompt change:
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/compare-prompt-versions/" \
-H "Cookie: sessionid=..." \
-H "Content-Type: application/json" \
-d '{
"prompt_name": "support-system-prompt",
"version_a": 1,
"version_b": 2,
"model": "gpt-4o-mini",
"evaluator_ids": ["helpfulness-evaluator-id"]
}'
Response:
{
"experiment_a": {"id": "...", "name": "support-system-prompt-v1-vs-v2-a", "version": 1},
"experiment_b": {"id": "...", "name": "support-system-prompt-v1-vs-v2-b", "version": 2},
"compare_url": "/api/v1/experiments/compare/?experiment_ids=<a>,<b>"
}
Both experiments start running immediately. Once both have completed, request the compare_url to get a side-by-side view of outputs, scores, latency, and cost per dataset item.
From the UI
On the dataset detail page, click Compare Versions next to the "New Experiment" button. Enter the prompt name, the two version numbers, and optionally override the model. The UI redirects to the comparison view when both experiments finish.
Python Example
import httpx

# Placeholders: replace with your own base URL, session cookie, and run IDs.
base_url = "https://acme.waxell.dev"
session_id = "..."
recent_run_ids = ["42"]

client = httpx.Client(cookies={"sessionid": session_id})

# Create an empty dataset
dataset = client.post(
    f"{base_url}/api/v1/datasets/",
    json={"name": "QA Regression Set", "tags": ["qa"]},
).json()

# Add one item per production run
for run_id in recent_run_ids:
    client.post(
        f"{base_url}/api/v1/datasets/{dataset['id']}/from-run/",
        json={"run_id": run_id},
    )

# Compare v1 vs v2 (both experiments start running immediately)
result = client.post(
    f"{base_url}/api/v1/datasets/{dataset['id']}/compare-prompt-versions/",
    json={"prompt_name": "support-system-prompt", "version_a": 1, "version_b": 2},
).json()

# Fetch the comparison once both experiments have completed
comparison = client.get(f"{base_url}{result['compare_url']}").json()
Workflow: Evaluation Pipeline
A typical evaluation workflow combines datasets, experiments, and evaluators:
- Build a dataset from production runs using the "Run to Dataset" feature, or import curated test cases
- Create evaluators for the quality dimensions you care about (helpfulness, accuracy, safety)
- Run experiment A with your current prompt/model configuration
- Run experiment B with a new prompt version or different model
- Compare experiments side by side to see per-item differences
- Review scores from automated evaluators to quantify improvement
- Promote the winning configuration to production via prompt labels
Next Steps
- Prompt Management -- Version and label the prompts used in experiments
- Evaluators (LLM-as-Judge) -- Set up automated scoring for experiments
- Scoring & Feedback -- Understanding the score data model