Datasets & Experiments

Datasets and experiments let you systematically evaluate your agents and prompts. Build a dataset of test cases, run them through different configurations, attach evaluators for automated scoring, and compare results side by side.

Datasets

A dataset is a named collection of test cases. Each item has an input, optional expected output, and optional context.

Dataset Item Fields

Field	Type	Required	Description
`input`	`object`	Yes	The test input (JSON object or string)
`expected_output`	`object`	No	Ground truth output for comparison
`context`	`object`	No	Additional context (e.g., retrieved documents, agent config)
`metadata`	`object`	No	Arbitrary metadata
`sort_order`	`int`	No	Display order (auto-assigned if omitted)
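As a rough illustration of the field rules in the table, the sketch below builds an item payload in Python and leaves out optional fields that were not supplied. The helper name `build_item` is hypothetical and not part of any SDK; it only mirrors the required/optional split shown above.

```python
from typing import Any, Optional


def build_item(
    input_value: Any,
    expected_output: Optional[dict] = None,
    context: Optional[dict] = None,
    metadata: Optional[dict] = None,
    sort_order: Optional[int] = None,
) -> dict:
    """Build a dataset item payload, omitting optional fields that are unset."""
    # `input` is the only required field in the table above.
    if input_value is None or input_value == {} or input_value == "":
        raise ValueError("`input` is required")

    item = {"input": input_value}
    optional = {
        "expected_output": expected_output,
        "context": context,
        "metadata": metadata,
        "sort_order": sort_order,  # auto-assigned by the server if omitted
    }
    for key, value in optional.items():
        if value is not None:
            item[key] = value
    return item


item = build_item(
    {"query": "How do I reset my password?"},
    context={"category": "account"},
)
print(item)
```

The resulting dictionary can be sent as the JSON body when creating an item.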

Creating a Dataset

curl -X POST "https://acme.waxell.dev/api/v1/datasets/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Agent QA",
    "description": "Test cases for the customer support agent",
    "tags": ["support", "qa"]
  }'

Adding Items Manually

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"query": "How do I reset my password?"},
    "expected_output": {"answer": "Go to Settings > Security > Reset Password"},
    "context": {"category": "account"}
  }'

Bulk Import

Import multiple items at once from a JSON payload:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/import/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      {
        "input": {"query": "What are your pricing plans?"},
        "expected_output": {"answer": "We offer Free, Pro, and Enterprise plans"}
      },
      {
        "input": {"query": "How do I contact support?"},
        "expected_output": {"answer": "Email support@example.com or use the in-app chat"}
      }
    ]
  }'

Response:

{
  "dataset_id": "a1b2c3d4-...",
  "imported": 3,
  "total_items": 3
}

Note: To import from a CSV or JSON file, first convert the file to the `items` array format above. Each item must have at least an `input` field.
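One possible way to do that conversion for a CSV file is sketched below. It assumes, purely for illustration, that the CSV has `query` and `answer` columns; adjust the column names and the shape of `input` and `expected_output` to match your own data.

```python
import csv
import io
import json


def csv_to_import_payload(csv_text: str) -> dict:
    """Convert CSV text with `query` and `answer` columns into an import payload."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        query = (row.get("query") or "").strip()
        if not query:
            continue  # every item needs at least an input, so skip empty rows
        item = {"input": {"query": query}}
        answer = (row.get("answer") or "").strip()
        if answer:
            # expected_output is optional, so only include it when present
            item["expected_output"] = {"answer": answer}
        items.append(item)
    return {"items": items}


sample_csv = (
    "query,answer\n"
    "How do I reset my password?,Go to Settings > Security > Reset Password\n"
    "What are your pricing plans?,\n"
)
payload = csv_to_import_payload(sample_csv)
print(json.dumps(payload, indent=2))
```

The printed payload can be used as the request body for the bulk import endpoint.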

Capture from Production (Run to Dataset)

Turn a real agent execution into a test case. This is useful for building regression test sets from interesting production runs:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/from-run/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_id": "42"}'

The run's inputs become the item's `input`, and the run's result becomes its `expected_output`. The agent name and workflow are stored in the `context` field.
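The mapping described above can be pictured with a small Python sketch. The shape of the run record here (`inputs`, `result`, `agent_name`, `workflow`) and the keys used inside `context` are assumptions made for illustration; the server performs the real conversion and its internal field names may differ.

```python
def run_to_item(run: dict) -> dict:
    """Illustrate how a production run could map onto a dataset item."""
    return {
        # The run's inputs become the item's input.
        "input": run["inputs"],
        # The run's result becomes the expected output (the ground truth).
        "expected_output": run["result"],
        # The agent name and workflow are kept as context.
        "context": {
            "agent_name": run["agent_name"],
            "workflow": run["workflow"],
        },
    }


example_run = {
    "inputs": {"query": "How do I reset my password?"},
    "result": {"answer": "Go to Settings > Security > Reset Password"},
    "agent_name": "support-agent",
    "workflow": "answer-question",
}
example_item = run_to_item(example_run)
print(example_item)
```

Because the captured result becomes the ground truth, it is worth reviewing each captured item before relying on it as a regression case.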

Dataset REST API Summary

Endpoint	Method	Description
`/api/v1/datasets/`	GET	List datasets (supports `search`, `tags`, `sort`, pagination)
`/api/v1/datasets/`	POST	Create dataset
`/api/v1/datasets/{id}/`	GET	Dataset detail with recent items
`/api/v1/datasets/{id}/`	PUT	Update dataset metadata
`/api/v1/datasets/{id}/`	DELETE	Delete dataset and all items
`/api/v1/datasets/{id}/items/`	GET	List items (supports `sort`, pagination)
`/api/v1/datasets/{id}/items/`	POST	Create single item
`/api/v1/datasets/{id}/items/{item_id}/`	GET	Item detail
`/api/v1/datasets/{id}/items/{item_id}/`	PUT	Update item
`/api/v1/datasets/{id}/items/{item_id}/`	DELETE	Delete item
`/api/v1/datasets/{id}/import/`	POST	Bulk import items
`/api/v1/datasets/{id}/from-run/`	POST	Create item from a production run

Experiments

An experiment runs every item in a dataset through a specific configuration (prompt + model, or agent) and records the results.

Creating an Experiment

curl -X POST "https://acme.waxell.dev/api/v1/experiments/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-4o vs GPT-4o-mini on Support QA",
    "dataset_id": "a1b2c3d4-...",
    "config": {
      "prompt_name": "support-system-prompt",
      "prompt_label": "production",
      "model": "gpt-4o",
      "temperature": 0.3
    },
    "evaluator_ids": ["e1f2g3h4-...", "i5j6k7l8-..."],
    "metadata": {"hypothesis": "GPT-4o should outperform mini on complex queries"}
  }'

Experiment Fields

Field	Type	Required	Description
`name`	`string`	Yes	Experiment name
`dataset_id`	`uuid`	Yes	Dataset to run against
`config`	`object`	No	Configuration for execution (prompt, model, temperature, etc.)
`evaluator_ids`	`list[uuid]`	No	Evaluators to auto-score results
`metadata`	`object`	No	Arbitrary metadata (hypothesis, notes)

Experiment Lifecycle

Experiments follow a state machine:

pending --> running --> completed
                  \--> failed
pending --> cancelled
running --> cancelled
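The same state machine can be written as a transition table, which is handy for client-side checks before calling the start or cancel endpoints. This is only a sketch of the diagram above; the names `TRANSITIONS`, `can_transition`, and `is_terminal` are illustrative.

```python
# Allowed transitions, copied from the lifecycle diagram.
TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(status: str) -> bool:
    """A status is terminal when it is known and has no outgoing transitions."""
    return status in TRANSITIONS and not TRANSITIONS[status]


print(can_transition("pending", "running"))
print(can_transition("completed", "running"))
print(is_terminal("failed"))
```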

Start an experiment:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/start/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"

This creates an ExperimentRun for each dataset item and kicks off asynchronous execution. The response includes the number of runs created:

{
  "id": "x1y2z3-...",
  "name": "GPT-4o vs GPT-4o-mini on Support QA",
  "status": "running",
  "runs_created": 25,
  "started_at": "2026-02-07T15:00:00Z"
}
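Because execution is asynchronous, a client usually has to poll until the experiment reaches a terminal status. The sketch below shows one way to structure that loop. The status lookup is passed in as a callable so the loop stays independent of any HTTP client; reading `status` from the experiment detail endpoint is an assumption, as are the default interval and timeout.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_experiment(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the experiment reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"experiment still {status!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with a fake status source instead of a real HTTP call.
fake_statuses = iter(["running", "running", "completed"])
final_status = wait_for_experiment(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)
```

In real use, `get_status` could be a small function that fetches the experiment and returns its `status` value.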

Cancel a running experiment:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/cancel/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json"

Viewing Results

curl "https://acme.waxell.dev/api/v1/experiments/{experiment_id}/results/" \
  -H "Cookie: sessionid=..."

Returns results for every dataset item, including output, latency, cost, tokens, and any evaluator scores:

{
  "experiment_id": "x1y2z3-...",
  "experiment_name": "GPT-4o vs GPT-4o-mini on Support QA",
  "dataset_name": "Support Agent QA",
  "status": "completed",
  "summary": {
    "total": 25,
    "completed": 24,
    "failed": 1,
    "pending": 0,
    "avg_latency_ms": 1250,
    "total_cost": 0.1845,
    "avg_cost": 0.007687
  },
  "results": [
    {
      "id": "r1s2t3-...",
      "dataset_item_id": "d1e2f3-...",
      "status": "completed",
      "output": "Go to Settings > Security > Reset Password",
      "error": "",
      "latency_ms": 980,
      "tokens_in": 145,
      "tokens_out": 32,
      "cost": 0.0052,
      "item_input": {"query": "How do I reset my password?"},
      "item_expected_output": {"answer": "Go to Settings > Security > Reset Password"},
      "scores": [
        {
          "name": "helpfulness",
          "data_type": "numeric",
          "numeric_value": 0.95,
          "string_value": null,
          "source": "evaluator"
        }
      ]
    }
  ]
}
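For custom reporting, the `results` array can be aggregated on the client. The sketch below computes counts, average latency, total cost, and the mean of each numeric score, using only the fields shown in the response above. The function name `summarize` and the sample data are illustrative.

```python
from collections import defaultdict
from statistics import mean


def summarize(results: list[dict]) -> dict:
    """Aggregate per-run results into counts, latency, cost, and mean scores."""
    completed = [r for r in results if r["status"] == "completed"]

    # Collect numeric scores by evaluator name; string-valued scores are skipped.
    scores = defaultdict(list)
    for run in completed:
        for score in run.get("scores", []):
            if score["data_type"] == "numeric" and score["numeric_value"] is not None:
                scores[score["name"]].append(score["numeric_value"])

    return {
        "completed": len(completed),
        "failed": sum(1 for r in results if r["status"] == "failed"),
        "avg_latency_ms": mean(r["latency_ms"] for r in completed) if completed else None,
        "total_cost": sum(r["cost"] for r in completed),
        "avg_score": {name: mean(values) for name, values in scores.items()},
    }


def _score(value: float) -> dict:
    return {"name": "helpfulness", "data_type": "numeric", "numeric_value": value,
            "string_value": None, "source": "evaluator"}


sample_results = [
    {"status": "completed", "latency_ms": 980, "cost": 0.0052, "scores": [_score(1.0)]},
    {"status": "completed", "latency_ms": 1020, "cost": 0.0048, "scores": [_score(0.5)]},
    {"status": "failed", "latency_ms": None, "cost": None, "scores": []},
]
report = summarize(sample_results)
print(report)
```

Failed runs are counted but excluded from the latency, cost, and score averages, since they may not carry usable values.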

Comparing Experiments

The comparison endpoint puts multiple experiments side by side, aligned by dataset item:

curl -X POST "https://acme.waxell.dev/api/v1/experiments/compare/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_ids": ["exp-gpt4o-id", "exp-gpt4o-mini-id"]
  }'

Response:

{
  "experiments": {
    "exp-gpt4o-id": {
      "name": "GPT-4o on Support QA",
      "config": {"model": "gpt-4o", "temperature": 0.3},
      "status": "completed"
    },
    "exp-gpt4o-mini-id": {
      "name": "GPT-4o-mini on Support QA",
      "config": {"model": "gpt-4o-mini", "temperature": 0.3},
      "status": "completed"
    }
  },
  "comparisons": [
    {
      "dataset_item": {
        "id": "d1e2f3-...",
        "input": {"query": "How do I reset my password?"},
        "expected_output": {"answer": "Go to Settings > Security > Reset Password"}
      },
      "results": {
        "exp-gpt4o-id": {
          "run_id": "r1-...",
          "status": "completed",
          "output": "Go to Settings > Security > Reset Password",
          "latency_ms": 980,
          "cost": 0.0052
        },
        "exp-gpt4o-mini-id": {
          "run_id": "r2-...",
          "status": "completed",
          "output": "Navigate to Settings, then Security, and click Reset Password",
          "latency_ms": 420,
          "cost": 0.0003
        }
      }
    }
  ]
}

Note: A comparison requires at least two experiment IDs. The experiments do not need to use the same dataset, but the comparison is most meaningful when they do.
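To turn a comparison response into per-item differences, something like the following Python sketch could be used. It relies only on the fields shown in the response above. Skipping items that lack a completed result for either experiment is an assumption about how you may want to treat gaps, not documented server behavior.

```python
def pairwise_deltas(comparison: dict, baseline_id: str, candidate_id: str) -> list[dict]:
    """Compute candidate-minus-baseline latency and cost for each dataset item."""
    rows = []
    for entry in comparison["comparisons"]:
        base = entry["results"].get(baseline_id)
        cand = entry["results"].get(candidate_id)
        if not base or not cand:
            continue  # item has no result in one of the experiments
        if base["status"] != "completed" or cand["status"] != "completed":
            continue  # only compare runs that finished successfully
        rows.append({
            "item_id": entry["dataset_item"]["id"],
            "latency_delta_ms": cand["latency_ms"] - base["latency_ms"],
            "cost_delta": cand["cost"] - base["cost"],
        })
    return rows


sample_comparison = {
    "comparisons": [
        {
            "dataset_item": {"id": "d1e2f3-..."},
            "results": {
                "exp-gpt4o-id": {"status": "completed", "latency_ms": 980, "cost": 0.0052},
                "exp-gpt4o-mini-id": {"status": "completed", "latency_ms": 420, "cost": 0.0003},
            },
        }
    ]
}
deltas = pairwise_deltas(sample_comparison, "exp-gpt4o-id", "exp-gpt4o-mini-id")
print(deltas)
```

Negative deltas mean the candidate was faster or cheaper than the baseline for that item.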

Experiment REST API Summary

Endpoint	Method	Description
`/api/v1/experiments/`	GET	List experiments (supports `dataset_id`, `status`, `sort`, pagination)
`/api/v1/experiments/`	POST	Create experiment
`/api/v1/experiments/{id}/`	GET	Experiment detail with all runs
`/api/v1/experiments/{id}/`	DELETE	Delete experiment
`/api/v1/experiments/{id}/start/`	POST	Start experiment execution
`/api/v1/experiments/{id}/cancel/`	POST	Cancel running experiment
`/api/v1/experiments/{id}/results/`	GET	Full results with scores
`/api/v1/experiments/compare/`	POST	Compare multiple experiments

Compare Prompt Versions

The compare-prompt-versions endpoint runs two versions of the same prompt against every item in a dataset and creates a pair of linked experiments. This is the fastest way to measure the impact of a prompt change:

curl -X POST "https://acme.waxell.dev/api/v1/datasets/{dataset_id}/compare-prompt-versions/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_name": "support-system-prompt",
    "version_a": 1,
    "version_b": 2,
    "model": "gpt-4o-mini",
    "evaluator_ids": ["helpfulness-evaluator-id"]
  }'

Response:

{
  "experiment_a": {"id": "...", "name": "support-system-prompt-v1-vs-v2-a", "version": 1},
  "experiment_b": {"id": "...", "name": "support-system-prompt-v1-vs-v2-b", "version": 2},
  "compare_url": "/api/v1/experiments/compare/?experiment_ids=<a>,<b>"
}

Both experiments start running immediately. Once both have completed, use `compare_url` to get a side-by-side view of outputs, scores, latency, and cost for each dataset item.

From the UI

On the dataset detail page, click Compare Versions next to the "New Experiment" button. Enter the prompt name, the two version numbers, and optionally override the model. The UI redirects to the comparison view when both experiments finish.

Python Example

import httpx

# Placeholders: substitute your own tenant URL, session cookie, and run IDs.
base_url = "https://acme.waxell.dev"
session_id = "..."
recent_run_ids = ["42"]

client = httpx.Client(cookies={"sessionid": session_id})

# Create an empty dataset to hold the regression cases
dataset = client.post(
    f"{base_url}/api/v1/datasets/",
    json={"name": "QA Regression Set", "tags": ["qa"]},
).json()

# Add one item per production run
for run_id in recent_run_ids:
    client.post(
        f"{base_url}/api/v1/datasets/{dataset['id']}/from-run/",
        json={"run_id": run_id},
    )

# Compare v1 vs v2 of the prompt; this starts two linked experiments
result = client.post(
    f"{base_url}/api/v1/datasets/{dataset['id']}/compare-prompt-versions/",
    json={"prompt_name": "support-system-prompt", "version_a": 1, "version_b": 2},
).json()

# Fetch the comparison. The experiments run asynchronously, so wait until
# both have completed before relying on the results.
comparison = client.get(f"{base_url}{result['compare_url']}").json()

Workflow: Evaluation Pipeline

A typical evaluation workflow combines datasets, experiments, and evaluators:

1. Build a dataset from production runs using the "Run to Dataset" feature, or import curated test cases
2. Create evaluators for the quality dimensions you care about (helpfulness, accuracy, safety)
3. Run experiment A with your current prompt/model configuration
4. Run experiment B with a new prompt version or different model
5. Compare experiments side by side to see per-item differences
6. Review scores from automated evaluators to quantify improvement
7. Promote the winning configuration to production via prompt labels

Next Steps

Prompt Management -- Version and label the prompts used in experiments
Evaluators (LLM-as-Judge) -- Set up automated scoring for experiments
Scoring & Feedback -- Understanding the score data model
