Datasets & Experiments

Datasets and experiments let you systematically evaluate your agents and prompts. Build a dataset of test cases, run them through different configurations, attach evaluators for automated scoring, and compare results side by side.

Datasets

A dataset is a named collection of test cases. Each item has an input, optional expected output, and optional context.

Dataset Item Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | object | Yes | The test input (JSON object or string) |
| `expected_output` | object | No | Ground-truth output for comparison |
| `context` | object | No | Additional context (e.g., retrieved documents, agent config) |
| `metadata` | object | No | Arbitrary metadata |
| `sort_order` | int | No | Display order (auto-assigned if omitted) |
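The rules in the table above can be checked client-side before an item is sent. A minimal sketch in Python (not part of any official client):

```python
def validate_item(item: dict) -> list[str]:
    """Return a list of problems with a dataset item; empty means valid."""
    errors = []
    # `input` is the only required field; it may be an object or a string.
    if "input" not in item:
        errors.append("missing required field: input")
    # The optional fields must be JSON objects when present.
    for field in ("expected_output", "context", "metadata"):
        if field in item and not isinstance(item[field], dict):
            errors.append(f"{field} must be a JSON object")
    if "sort_order" in item and not isinstance(item["sort_order"], int):
        errors.append("sort_order must be an integer")
    return errors
```

Running this over a batch before a bulk import surfaces malformed rows without burning API calls.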

Creating a Dataset

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Agent QA",
    "description": "Test cases for the customer support agent",
    "tags": ["support", "qa"]
  }'
```

Adding Items Manually

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"query": "How do I reset my password?"},
    "expected_output": {"answer": "Go to Settings > Security > Reset Password"},
    "context": {"category": "account"}
  }'
```

Bulk Import

Import multiple items at once from a JSON payload:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/import/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      {
        "input": {"query": "What are your pricing plans?"},
        "expected_output": {"answer": "We offer Free, Pro, and Enterprise plans"}
      },
      {
        "input": {"query": "How do I contact support?"},
        "expected_output": {"answer": "Email support@example.com or use the in-app chat"}
      }
    ]
  }'
```

Response:

```json
{
  "dataset_id": "a1b2c3d4-...",
  "imported": 3,
  "total_items": 3
}
```
> **info**: For CSV/JSON file import, convert your file to the `items` array format above. Each item must have at minimum an `input` field.
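As an illustration of that conversion, here is a sketch that turns a two-column CSV export into the import payload. The `input`/`expected_output` column names and the `{"query": ...}`/`{"answer": ...}` wrapping are assumptions; adapt them to your file and schema:

```python
import csv
import io

def csv_to_import_payload(csv_text: str) -> dict:
    """Build the {"items": [...]} body for the bulk-import endpoint.

    Assumes columns named 'input' and 'expected_output'; rows without an
    expected output become items with only the required `input` field.
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {"input": {"query": row["input"]}}
        if row.get("expected_output"):
            item["expected_output"] = {"answer": row["expected_output"]}
        items.append(item)
    return {"items": items}
```

The resulting dict can be serialized with `json.dumps` and posted to `/api/v1/datasets/{dataset_id}/import/`.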

Capture from Production (Run to Dataset)

Turn a real agent execution into a test case. This is useful for building regression test sets from interesting production runs:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/from-run/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_id": "42"}'
```

The run's inputs become the item's `input`, and the run's result becomes its `expected_output`. The agent name and workflow are stored in the `context` field.
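That mapping can be sketched as a pure function. The `inputs`, `result`, `agent_name`, and `workflow` key names are assumptions about the run record's shape, not a documented schema:

```python
def run_to_item(run: dict) -> dict:
    """Mirror the server-side run-to-dataset-item mapping described above."""
    return {
        "input": run["inputs"],                 # run inputs -> item input
        "expected_output": run["result"],       # run result -> ground truth
        "context": {                            # provenance for the test case
            "agent_name": run["agent_name"],
            "workflow": run["workflow"],
        },
    }
```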

Dataset REST API Summary

| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/datasets/` | GET | List datasets (supports search, tags, sort, pagination) |
| `/api/v1/datasets/` | POST | Create dataset |
| `/api/v1/datasets/{id}/` | GET | Dataset detail with recent items |
| `/api/v1/datasets/{id}/` | PUT | Update dataset metadata |
| `/api/v1/datasets/{id}/` | DELETE | Delete dataset and all items |
| `/api/v1/datasets/{id}/items/` | GET | List items (supports sort, pagination) |
| `/api/v1/datasets/{id}/items/` | POST | Create single item |
| `/api/v1/datasets/{id}/items/{item_id}/` | GET | Item detail |
| `/api/v1/datasets/{id}/items/{item_id}/` | PUT | Update item |
| `/api/v1/datasets/{id}/items/{item_id}/` | DELETE | Delete item |
| `/api/v1/datasets/{id}/import/` | POST | Bulk import items |
| `/api/v1/datasets/{id}/from-run/` | POST | Create item from a production run |

Experiments

An experiment runs every item in a dataset through a specific configuration (prompt + model, or agent) and records the results.

Creating an Experiment

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-4o vs GPT-4o-mini on Support QA",
    "dataset_id": "a1b2c3d4-...",
    "config": {
      "prompt_name": "support-system-prompt",
      "prompt_label": "production",
      "model": "gpt-4o",
      "temperature": 0.3
    },
    "evaluator_ids": ["e1f2g3h4-...", "i5j6k7l8-..."],
    "metadata": {"hypothesis": "GPT-4o should outperform mini on complex queries"}
  }'
```

Experiment Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Experiment name |
| `dataset_id` | uuid | Yes | Dataset to run against |
| `config` | object | No | Configuration for execution (prompt, model, temperature, etc.) |
| `evaluator_ids` | list[uuid] | No | Evaluators to auto-score results |
| `metadata` | object | No | Arbitrary metadata (hypothesis, notes) |
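A small helper can assemble this payload and enforce the two required fields before the POST. This is a sketch, not an official client:

```python
def build_experiment_payload(name, dataset_id, config=None,
                             evaluator_ids=None, metadata=None):
    """Build the POST body for /api/v1/experiments/, omitting empty optionals."""
    if not name:
        raise ValueError("name is required")
    if not dataset_id:
        raise ValueError("dataset_id is required")
    payload = {"name": name, "dataset_id": dataset_id}
    if config is not None:
        payload["config"] = config
    if evaluator_ids:
        payload["evaluator_ids"] = evaluator_ids
    if metadata is not None:
        payload["metadata"] = metadata
    return payload
```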

Experiment Lifecycle

Experiments follow a state machine:

```
pending --> running --> completed
                   \--> failed
pending --> cancelled
running --> cancelled
```
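The lifecycle above can be encoded as a transition table, e.g. for client-side validation before calling `start` or `cancel`:

```python
# Allowed transitions, transcribed from the lifecycle diagram.
TRANSITIONS = {
    "pending":   {"running", "cancelled"},
    "running":   {"completed", "failed", "cancelled"},
    "completed": set(),   # terminal
    "failed":    set(),   # terminal
    "cancelled": set(),   # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if an experiment may move from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())
```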

Start an experiment:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/start/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"
```

This creates an `ExperimentRun` for each dataset item and kicks off asynchronous execution. The response includes the number of runs created:

```json
{
  "id": "x1y2z3-...",
  "name": "GPT-4o vs GPT-4o-mini on Support QA",
  "status": "running",
  "runs_created": 25,
  "started_at": "2026-02-07T15:00:00Z"
}
```

Cancel a running experiment:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/cancel/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"
```

Viewing Results

```
GET /api/v1/experiments/{experiment_id}/results/
```

Returns results for every dataset item, including output, latency, cost, tokens, and any evaluator scores:

```json
{
  "experiment_id": "x1y2z3-...",
  "experiment_name": "GPT-4o vs GPT-4o-mini on Support QA",
  "dataset_name": "Support Agent QA",
  "status": "completed",
  "summary": {
    "total": 25,
    "completed": 24,
    "failed": 1,
    "pending": 0,
    "avg_latency_ms": 1250,
    "total_cost": 0.1845,
    "avg_cost": 0.007687
  },
  "results": [
    {
      "id": "r1s2t3-...",
      "dataset_item_id": "d1e2f3-...",
      "status": "completed",
      "output": "Go to Settings > Security > Reset Password",
      "error": "",
      "latency_ms": 980,
      "tokens_in": 145,
      "tokens_out": 32,
      "cost": 0.0052,
      "item_input": {"query": "How do I reset my password?"},
      "item_expected_output": {"answer": "Go to Settings > Security > Reset Password"},
      "scores": [
        {
          "name": "helpfulness",
          "data_type": "numeric",
          "numeric_value": 0.95,
          "string_value": null,
          "source": "evaluator"
        }
      ]
    }
  ]
}
```
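The summary block can also be recomputed client-side from the `results` list, e.g. to verify numbers or aggregate a filtered subset. Averaging over completed runs only is an assumption about the server's exact formula:

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate latency and cost over a results list (completed runs only)."""
    completed = [r for r in results if r["status"] == "completed"]
    n = len(completed)
    total_cost = sum(r["cost"] for r in completed)
    return {
        "total": len(results),
        "completed": n,
        "failed": sum(1 for r in results if r["status"] == "failed"),
        "pending": sum(1 for r in results if r["status"] == "pending"),
        "avg_latency_ms": sum(r["latency_ms"] for r in completed) / n if n else 0,
        "total_cost": round(total_cost, 6),
        "avg_cost": round(total_cost / n, 6) if n else 0,
    }
```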

Comparing Experiments

The comparison endpoint puts multiple experiments side by side, aligned by dataset item:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_ids": ["exp-gpt4o-id", "exp-gpt4o-mini-id"]
  }'
```

Response:

```json
{
  "experiments": {
    "exp-gpt4o-id": {
      "name": "GPT-4o on Support QA",
      "config": {"model": "gpt-4o", "temperature": 0.3},
      "status": "completed"
    },
    "exp-gpt4o-mini-id": {
      "name": "GPT-4o-mini on Support QA",
      "config": {"model": "gpt-4o-mini", "temperature": 0.3},
      "status": "completed"
    }
  },
  "comparisons": [
    {
      "dataset_item": {
        "id": "d1e2f3-...",
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      "results": {
        "exp-gpt4o-id": {
          "run_id": "r1-...",
          "status": "completed",
          "output": "Go to Settings > Security > Reset Password",
          "latency_ms": 980,
          "cost": 0.0052
        },
        "exp-gpt4o-mini-id": {
          "run_id": "r2-...",
          "status": "completed",
          "output": "Navigate to Settings, then Security, and click Reset Password",
          "latency_ms": 420,
          "cost": 0.0003
        }
      }
    }
  ]
}
```
> **info**: At least two experiment IDs are required for comparison. The experiments do not need to use the same dataset, but the comparison is most meaningful when they do.
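Per-item deltas are easy to derive client-side from a comparison response. A sketch, with field names taken from the response shape above:

```python
def cost_deltas(comparison: dict, base_id: str, other_id: str) -> list[dict]:
    """Compute cost/latency differences (other minus base) per dataset item,
    skipping items where either run did not complete."""
    deltas = []
    for row in comparison["comparisons"]:
        a = row["results"].get(base_id)
        b = row["results"].get(other_id)
        if a and b and a["status"] == b["status"] == "completed":
            deltas.append({
                "item_id": row["dataset_item"]["id"],
                "cost_delta": round(b["cost"] - a["cost"], 6),
                "latency_delta_ms": b["latency_ms"] - a["latency_ms"],
            })
    return deltas
```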

Experiment REST API Summary

EndpointMethodDescription
/api/v1/experiments/GETList experiments (supports dataset_id, status, sort, pagination)
/api/v1/experiments/POSTCreate experiment
/api/v1/experiments/{id}/GETExperiment detail with all runs
/api/v1/experiments/{id}/DELETEDelete experiment
/api/v1/experiments/{id}/start/POSTStart experiment execution
/api/v1/experiments/{id}/cancel/POSTCancel running experiment
/api/v1/experiments/{id}/results/GETFull results with scores
/api/v1/experiments/compare/POSTCompare multiple experiments

Workflow: Evaluation Pipeline

A typical evaluation workflow combines datasets, experiments, and evaluators:

  1. Build a dataset from production runs using the "Run to Dataset" feature, or import curated test cases
  2. Create evaluators for the quality dimensions you care about (helpfulness, accuracy, safety)
  3. Run experiment A with your current prompt/model configuration
  4. Run experiment B with a new prompt version or different model
  5. Compare experiments side by side to see per-item differences
  6. Review scores from automated evaluators to quantify improvement
  7. Promote the winning configuration to production via prompt labels

Next Steps