# Datasets & Experiments
Datasets and experiments let you systematically evaluate your agents and prompts. Build a dataset of test cases, run them through different configurations, attach evaluators for automated scoring, and compare results side by side.
## Datasets
A dataset is a named collection of test cases. Each item has an input, optional expected output, and optional context.
### Dataset Item Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | object | Yes | The test input (JSON object or string) |
| `expected_output` | object | No | Ground-truth output for comparison |
| `context` | object | No | Additional context (e.g., retrieved documents, agent config) |
| `metadata` | object | No | Arbitrary metadata |
| `sort_order` | int | No | Display order (auto-assigned if omitted) |
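Before POSTing, the required/optional split above can be checked client-side. A minimal sketch; the field names come from the table, and the function itself is illustrative, not part of the API:

```python
def validate_dataset_item(item: dict) -> list[str]:
    """Return a list of problems with a dataset-item payload (empty list = valid)."""
    errors = []
    if "input" not in item:
        errors.append("missing required field: input")
    # Optional fields must be JSON objects when present.
    for field in ("expected_output", "context", "metadata"):
        if field in item and not isinstance(item[field], dict):
            errors.append(f"{field} must be an object")
    if "sort_order" in item and not isinstance(item["sort_order"], int):
        errors.append("sort_order must be an int")
    return errors
```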
### Creating a Dataset

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Agent QA",
    "description": "Test cases for the customer support agent",
    "tags": ["support", "qa"]
  }'
```
### Adding Items Manually

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"query": "How do I reset my password?"},
    "expected_output": {"answer": "Go to Settings > Security > Reset Password"},
    "context": {"category": "account"}
  }'
```
### Bulk Import

Import multiple items at once from a JSON payload:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/import/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      {
        "input": {"query": "What are your pricing plans?"},
        "expected_output": {"answer": "We offer Free, Pro, and Enterprise plans"}
      },
      {
        "input": {"query": "How do I contact support?"},
        "expected_output": {"answer": "Email support@example.com or use the in-app chat"}
      }
    ]
  }'
```
Response:

```json
{
  "dataset_id": "a1b2c3d4-...",
  "imported": 3,
  "total_items": 3
}
```

For CSV/JSON file import, convert your file to the `items` array format above. Each item must have at minimum an `input` field.
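That conversion is a few lines of Python. A sketch assuming a CSV with `query` and `answer` columns; your column names will differ:

```python
import csv
import io

def csv_to_items(csv_text: str) -> dict:
    """Convert CSV rows into the bulk-import payload shape.

    Assumes 'query' and 'answer' columns; adapt the mapping to your file.
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {"input": {"query": row["query"]}}
        if row.get("answer"):  # expected_output is optional
            item["expected_output"] = {"answer": row["answer"]}
        items.append(item)
    return {"items": items}

payload = csv_to_items("query,answer\nHow do I reset my password?,Go to Settings\n")
```

POST the resulting `payload` to the `/import/` endpoint as in the curl example above.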
### Capture from Production (Run to Dataset)

Turn a real agent execution into a test case. This is useful for building regression test sets from interesting production runs:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/from-run/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_id": "42"}'
```

The run's inputs become the item's `input`, and the run's result becomes its `expected_output`. The agent name and workflow are stored in the `context` field.
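The mapping the server applies can be pictured roughly like this; the field names on the run record (`inputs`, `result`, `agent_name`, `workflow`) are assumptions for illustration:

```python
def item_from_run(run: dict) -> dict:
    """Build a dataset item from a run record, mirroring the documented mapping."""
    return {
        "input": run["inputs"],                # run inputs -> item input
        "expected_output": run["result"],      # run result -> ground truth
        "context": {                           # provenance for later inspection
            "agent_name": run.get("agent_name"),
            "workflow": run.get("workflow"),
        },
    }
```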
### Dataset REST API Summary

| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/datasets/` | GET | List datasets (supports search, tags, sort, pagination) |
| `/api/v1/datasets/` | POST | Create dataset |
| `/api/v1/datasets/{id}/` | GET | Dataset detail with recent items |
| `/api/v1/datasets/{id}/` | PUT | Update dataset metadata |
| `/api/v1/datasets/{id}/` | DELETE | Delete dataset and all items |
| `/api/v1/datasets/{id}/items/` | GET | List items (supports sort, pagination) |
| `/api/v1/datasets/{id}/items/` | POST | Create single item |
| `/api/v1/datasets/{id}/items/{item_id}/` | GET | Item detail |
| `/api/v1/datasets/{id}/items/{item_id}/` | PUT | Update item |
| `/api/v1/datasets/{id}/items/{item_id}/` | DELETE | Delete item |
| `/api/v1/datasets/{id}/import/` | POST | Bulk import items |
| `/api/v1/datasets/{id}/from-run/` | POST | Create item from a production run |
## Experiments
An experiment runs every item in a dataset through a specific configuration (prompt + model, or agent) and records the results.
### Creating an Experiment

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-4o vs GPT-4o-mini on Support QA",
    "dataset_id": "a1b2c3d4-...",
    "config": {
      "prompt_name": "support-system-prompt",
      "prompt_label": "production",
      "model": "gpt-4o",
      "temperature": 0.3
    },
    "evaluator_ids": ["e1f2g3h4-...", "i5j6k7l8-..."],
    "metadata": {"hypothesis": "GPT-4o should outperform mini on complex queries"}
  }'
```
### Experiment Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Experiment name |
| `dataset_id` | uuid | Yes | Dataset to run against |
| `config` | object | No | Configuration for execution (prompt, model, temperature, etc.) |
| `evaluator_ids` | list[uuid] | No | Evaluators to auto-score results |
| `metadata` | object | No | Arbitrary metadata (hypothesis, notes) |
### Experiment Lifecycle

Experiments follow a state machine:

```
pending --> running --> completed
                   \--> failed
pending --> cancelled
running --> cancelled
```
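For a client that polls experiment status, the diagram above can be encoded as a transition table. A sketch, not the server's implementation:

```python
# Allowed transitions from the lifecycle diagram; terminal states allow none.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}

def can_transition(current: str, target: str) -> bool:
    """True if the experiment may move from `current` to `target`."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```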
Start an experiment:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/start/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"
```
This creates an ExperimentRun for each dataset item and kicks off asynchronous execution. The response includes the number of runs created:

```json
{
  "id": "x1y2z3-...",
  "name": "GPT-4o vs GPT-4o-mini on Support QA",
  "status": "running",
  "runs_created": 25,
  "started_at": "2026-02-07T15:00:00Z"
}
```
Cancel a running experiment:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/cancel/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"
```
### Viewing Results

```
GET /api/v1/experiments/{experiment_id}/results/
```

Returns results for every dataset item, including output, latency, cost, tokens, and any evaluator scores:
```json
{
  "experiment_id": "x1y2z3-...",
  "experiment_name": "GPT-4o vs GPT-4o-mini on Support QA",
  "dataset_name": "Support Agent QA",
  "status": "completed",
  "summary": {
    "total": 25,
    "completed": 24,
    "failed": 1,
    "pending": 0,
    "avg_latency_ms": 1250,
    "total_cost": 0.1845,
    "avg_cost": 0.007687
  },
  "results": [
    {
      "id": "r1s2t3-...",
      "dataset_item_id": "d1e2f3-...",
      "status": "completed",
      "output": "Go to Settings > Security > Reset Password",
      "error": "",
      "latency_ms": 980,
      "tokens_in": 145,
      "tokens_out": 32,
      "cost": 0.0052,
      "item_input": {"query": "How do I reset my password?"},
      "item_expected_output": {"answer": "Go to Settings > Security > Reset Password"},
      "scores": [
        {
          "name": "helpfulness",
          "data_type": "numeric",
          "numeric_value": 0.95,
          "string_value": null,
          "source": "evaluator"
        }
      ]
    }
  ]
}
```
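The `summary` block is an aggregation over the per-item results, so you can recompute it (or compute custom aggregates) from the `results` array. A sketch of that aggregation, using the field names from the payload above:

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-item results into summary stats like the API's summary block."""
    completed = [r for r in results if r["status"] == "completed"]
    total_cost = sum(r["cost"] for r in completed)
    return {
        "total": len(results),
        "completed": len(completed),
        "failed": sum(1 for r in results if r["status"] == "failed"),
        # Latency and cost averages only make sense over completed runs.
        "avg_latency_ms": (
            sum(r["latency_ms"] for r in completed) / len(completed) if completed else 0
        ),
        "total_cost": round(total_cost, 6),
        "avg_cost": round(total_cost / len(completed), 6) if completed else 0,
    }
```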
### Comparing Experiments

The comparison endpoint puts multiple experiments side by side, aligned by dataset item:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_ids": ["exp-gpt4o-id", "exp-gpt4o-mini-id"]
  }'
```
Response:

```json
{
  "experiments": {
    "exp-gpt4o-id": {
      "name": "GPT-4o on Support QA",
      "config": {"model": "gpt-4o", "temperature": 0.3},
      "status": "completed"
    },
    "exp-gpt4o-mini-id": {
      "name": "GPT-4o-mini on Support QA",
      "config": {"model": "gpt-4o-mini", "temperature": 0.3},
      "status": "completed"
    }
  },
  "comparisons": [
    {
      "dataset_item": {
        "id": "d1e2f3-...",
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      "results": {
        "exp-gpt4o-id": {
          "run_id": "r1-...",
          "status": "completed",
          "output": "Go to Settings > Security > Reset Password",
          "latency_ms": 980,
          "cost": 0.0052
        },
        "exp-gpt4o-mini-id": {
          "run_id": "r2-...",
          "status": "completed",
          "output": "Navigate to Settings, then Security, and click Reset Password",
          "latency_ms": 420,
          "cost": 0.0003
        }
      }
    }
  ]
}
```
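If you prefer to build a custom report, you can fetch each experiment's results yourself and join them on `dataset_item_id`, which is essentially what the comparison endpoint does. A sketch for two experiments:

```python
def align_by_item(results_a: list[dict], results_b: list[dict]) -> list[dict]:
    """Pair two experiments' results on dataset_item_id.

    Items missing from either experiment are skipped.
    """
    by_item_b = {r["dataset_item_id"]: r for r in results_b}
    pairs = []
    for ra in results_a:
        rb = by_item_b.get(ra["dataset_item_id"])
        if rb is not None:
            pairs.append({"item": ra["dataset_item_id"], "a": ra, "b": rb})
    return pairs
```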
Comparison requires at least two experiment IDs. The experiments do not need to use the same dataset, but comparison is most meaningful when they do.
### Experiment REST API Summary

| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/experiments/` | GET | List experiments (supports dataset_id, status, sort, pagination) |
| `/api/v1/experiments/` | POST | Create experiment |
| `/api/v1/experiments/{id}/` | GET | Experiment detail with all runs |
| `/api/v1/experiments/{id}/` | DELETE | Delete experiment |
| `/api/v1/experiments/{id}/start/` | POST | Start experiment execution |
| `/api/v1/experiments/{id}/cancel/` | POST | Cancel running experiment |
| `/api/v1/experiments/{id}/results/` | GET | Full results with scores |
| `/api/v1/experiments/compare/` | POST | Compare multiple experiments |
## Workflow: Evaluation Pipeline

A typical evaluation workflow combines datasets, experiments, and evaluators:

1. Build a dataset from production runs using the "Run to Dataset" feature, or import curated test cases
2. Create evaluators for the quality dimensions you care about (helpfulness, accuracy, safety)
3. Run experiment A with your current prompt/model configuration
4. Run experiment B with a new prompt version or different model
5. Compare experiments side by side to see per-item differences
6. Review scores from automated evaluators to quantify improvement
7. Promote the winning configuration to production via prompt labels
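The middle of that workflow (create, start, compare) can be scripted against the endpoints documented on this page. A sketch where `call(method, path, payload)` stands in for your HTTP client; the helper itself and the experiment names are illustrative:

```python
def run_evaluation(call, dataset_id, config_a, config_b, evaluator_ids):
    """Create and start two experiments on the same dataset, then compare them.

    `call(method, path, payload)` is any JSON-speaking HTTP helper that
    sends the request and returns the decoded response body.
    """
    experiment_ids = []
    for name, config in (("baseline", config_a), ("candidate", config_b)):
        exp = call("POST", "/api/v1/experiments/", {
            "name": name,
            "dataset_id": dataset_id,
            "config": config,
            "evaluator_ids": evaluator_ids,
        })
        # Kick off asynchronous execution of every dataset item.
        call("POST", f"/api/v1/experiments/{exp['id']}/start/", None)
        experiment_ids.append(exp["id"])
    # In practice, poll each experiment's status until "completed" before comparing.
    return call("POST", "/api/v1/experiments/compare/", {"experiment_ids": experiment_ids})
```

Note the sketch omits the polling step; comparing before both experiments complete will return partial results.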
## Next Steps
- Prompt Management -- Version and label the prompts used in experiments
- Evaluators (LLM-as-Judge) -- Set up automated scoring for experiments
- Scoring & Feedback -- Understanding the score data model