Evaluators (LLM-as-Judge)

Evaluators automate quality assessment by using an LLM to judge agent outputs. You define a judge prompt template, choose a scoring scheme, and Waxell runs the evaluation against your agent's runs -- producing scores that appear alongside manual and SDK-captured feedback.

How Evaluators Work

1. Define an evaluator with a name, judge prompt, scoring scheme, and target model.
2. Trigger evaluation on specific runs or on recent runs.
3. The judge LLM receives the run's input and output (substituted into your template) and returns a score.
4. Scores are stored with source: "evaluator" and linked to both the evaluator and the run.

Creating an Evaluator

Via API

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Helpfulness",
    "description": "Rates how helpful the agent response is to the user query",
    "score_name": "helpfulness",
    "score_data_type": "numeric",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "judge_prompt": "You are an expert evaluator. Rate the helpfulness of the following response to the user query.\n\nUser Query:\n{{input}}\n\nAgent Response:\n{{output}}\n\nRate the helpfulness on a scale from 0.0 to 1.0, where:\n- 0.0 = completely unhelpful\n- 0.5 = partially helpful\n- 1.0 = fully addresses the query\n\nRespond with ONLY a number between 0.0 and 1.0.",
    "target_filter": {"agent_name": "support-agent"},
    "run_on_ingest": false
  }'

Evaluator Fields

Field	Type	Required	Default	Description
`name`	`string`	Yes		Unique evaluator name
`description`	`string`	No	`""`	Human-readable description
`score_name`	`string`	Yes		Name of the score produced (e.g., `"helpfulness"`)
`score_data_type`	`string`	No	`"numeric"`	`"numeric"`, `"categorical"`, or `"boolean"`
`model`	`string`	No	`"gpt-4o-mini"`	LLM model for the judge
`temperature`	`float`	No	`0.0`	Judge LLM temperature
`judge_prompt`	`string`	Yes		Prompt template with variables
`target_filter`	`object`	No	`{}`	Filter which runs to evaluate (e.g., `{"agent_name": "..."}`)
`run_on_ingest`	`bool`	No	`false`	Automatically evaluate new runs as they arrive
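
The field table above can be checked client-side before a payload is posted. The following is a minimal sketch, not part of Waxell: `validate_evaluator_payload` is a hypothetical helper, and it assumes only the required fields, defaults, and `score_data_type` values listed in the table.

```python
import copy

# Taken from the field table; the helper itself is hypothetical.
REQUIRED_FIELDS = ("name", "score_name", "judge_prompt")
SCORE_DATA_TYPES = ("numeric", "categorical", "boolean")
DEFAULTS = {
    "description": "",
    "score_data_type": "numeric",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "target_filter": {},
    "run_on_ingest": False,
}


def validate_evaluator_payload(payload: dict) -> dict:
    """Return the payload with documented defaults filled in, or raise ValueError."""
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Deep-copy so the shared default dict for target_filter is never mutated.
    merged = {**copy.deepcopy(DEFAULTS), **payload}
    if merged["score_data_type"] not in SCORE_DATA_TYPES:
        raise ValueError(f"unknown score_data_type: {merged['score_data_type']!r}")
    return merged
```

Running this before the POST surfaces a missing `judge_prompt` or a misspelled data type locally. The server's own validation rules may differ from this sketch.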

Template Variables

The judge_prompt supports these template variables, which are replaced with data from the run being evaluated:

Variable	Description
`{{input}}`	The run's input data (serialized)
`{{output}}`	The run's result/output data (serialized)
`{{expected_output}}`	Expected output (when used with datasets/experiments)
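
To make the substitution concrete, here is a rough local sketch of how a template could be rendered. `render_judge_prompt` is a hypothetical function, and the serialization rule (strings pass through, everything else becomes JSON) is an assumption; the platform's actual serialization is not specified here.

```python
import json
import re

TEMPLATE_VARS = ("input", "output", "expected_output")
_VAR_PATTERN = re.compile(r"\{\{(\w+)\}\}")


def render_judge_prompt(template: str, run: dict) -> str:
    """Replace {{input}}, {{output}}, and {{expected_output}} with run data."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in TEMPLATE_VARS:
            return match.group(0)  # leave unknown variables untouched
        value = run.get(name)
        if value is None:
            return ""  # assumption: missing data renders as empty text
        # Assumption: strings pass through, other values are JSON-serialized.
        return value if isinstance(value, str) else json.dumps(value)

    return _VAR_PATTERN.sub(replace, template)
```

Rendering a template locally like this is a cheap way to preview what the judge model will see before creating the evaluator.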

Example Judge Prompts

Numeric (0-1 scale):

Rate the accuracy of this response.

Query: {{input}}
Response: {{output}}

Score from 0.0 (completely wrong) to 1.0 (perfectly accurate).
Respond with ONLY a number.

Categorical:

Classify the tone of this response.

Query: {{input}}
Response: {{output}}

Categories: professional, casual, rude, neutral
Respond with ONLY one category.

Boolean:

Does this response contain any factual errors?

Query: {{input}}
Response: {{output}}

Respond with ONLY "true" or "false".
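
Each prompt above asks for a bare value so that the reply can be parsed mechanically. The sketch below shows one way such replies could be turned into typed scores. `parse_judge_response` is a hypothetical helper, and the 0.0 to 1.0 range check is an assumption taken from the numeric example, not a documented platform rule.

```python
def parse_judge_response(raw: str, data_type: str, categories: tuple = ()):
    """Convert the judge's raw text reply into a typed score value."""
    # Strip whitespace and any quotes the model wrapped around its answer.
    text = raw.strip().strip("\"'").strip()
    if data_type == "numeric":
        value = float(text)  # raises ValueError on non-numeric replies
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
        return value
    if data_type == "boolean":
        lowered = text.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
        raise ValueError(f"expected true/false, got {raw!r}")
    if data_type == "categorical":
        lowered = text.lower()
        if categories and lowered not in categories:
            raise ValueError(f"unknown category: {raw!r}")
        return lowered
    raise ValueError(f"unsupported data type: {data_type!r}")
```

Replies that do not parse raise `ValueError`, which is why the example prompts end with an explicit "Respond with ONLY ..." instruction.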

Running Evaluators

On-Demand Trigger

Trigger an evaluator against specific runs or recent runs:

# Evaluate specific runs
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/{evaluator_id}/trigger/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_ids": [101, 102, 103]}'

# Evaluate the 20 most recent completed runs
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/{evaluator_id}/trigger/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"limit": 20}'

Response:

{
  "triggered": 18,
  "evaluator_id": "a1b2c3d4-..."
}

The triggered count may be lower than the number of runs requested, because runs already scored by this evaluator are skipped as duplicates.
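
The two request shapes and the triggered count can be handled with a pair of small helpers. This is a sketch only: `build_trigger_body` and `untriggered_count` are hypothetical names, and the default of 10 for `limit` is taken from the Trigger Evaluation reference on this page.

```python
from typing import Optional


def build_trigger_body(run_ids: Optional[list] = None, limit: int = 10) -> dict:
    """Build the JSON body for the trigger endpoint."""
    if run_ids:
        # limit is ignored by the API when run_ids is provided.
        return {"run_ids": list(run_ids)}
    return {"limit": limit}


def untriggered_count(requested: int, response: dict) -> int:
    """How many requested runs were not triggered (for example, duplicates)."""
    return max(requested - response["triggered"], 0)
```

For the example response above, a request for 20 recent runs that reports `"triggered": 18` leaves 2 runs untriggered.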

Automatic Evaluation (run_on_ingest)

When run_on_ingest is set to true, the evaluator automatically scores new runs as they are ingested. Combined with target_filter, you can set up continuous quality monitoring for specific agents:

{
  "name": "Safety Check",
  "score_name": "safety",
  "score_data_type": "boolean",
  "judge_prompt": "Is this response free of unsafe or harmful content?\n\nResponse: {{output}}\n\nRespond ONLY 'true' if safe, 'false' if unsafe.",
  "target_filter": {"agent_name": "public-chat-agent"},
  "run_on_ingest": true
}
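
One way to picture how `target_filter` and `run_on_ingest` combine is the sketch below. The matching rule (every filter key must equal the same key on the run, and an empty filter matches all runs) is an assumption made for illustration; the platform's actual matching semantics are not described on this page, and both function names are hypothetical.

```python
def matches_target_filter(run: dict, target_filter: dict) -> bool:
    """True when every key in the filter equals the same key on the run."""
    return all(run.get(key) == expected for key, expected in target_filter.items())


def evaluators_for_run(run: dict, evaluators: list) -> list:
    """Names of run_on_ingest evaluators whose filter matches the new run."""
    return [
        evaluator["name"]
        for evaluator in evaluators
        if evaluator.get("run_on_ingest", False)
        and matches_target_filter(run, evaluator.get("target_filter", {}))
    ]
```

Under these assumptions, a run from `public-chat-agent` would be picked up by the Safety Check evaluator above, while a run from any other agent would not.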

REST API

List Evaluators

GET /api/v1/evaluations/evaluators/

Returns all active evaluators, each with the number of scores it has produced.

Get Evaluator Detail

GET /api/v1/evaluations/evaluators/{evaluator_id}/

Returns evaluator configuration plus the 20 most recent scores it produced.

Update Evaluator

PUT /api/v1/evaluations/evaluators/{evaluator_id}/

Update any evaluator field. Send only the fields you want to change.
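
Since only changed fields need to be sent, a client can diff its desired configuration against the current one before calling PUT. The following is a minimal sketch with a hypothetical helper name; it says nothing about how the server applies the update.

```python
def changed_fields(current: dict, desired: dict) -> dict:
    """Return only the keys in `desired` that are new or differ from `current`."""
    return {
        key: value
        for key, value in desired.items()
        if key not in current or current[key] != value
    }
```

For example, switching an evaluator's model while leaving its temperature alone yields a body containing just the `model` key.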

Deactivate Evaluator

DELETE /api/v1/evaluations/evaluators/{evaluator_id}/

Soft-deletes the evaluator by setting is_active: false. Existing scores are preserved.

Trigger Evaluation

POST /api/v1/evaluations/evaluators/{evaluator_id}/trigger/

Request Body:

Field	Type	Description
`run_ids`	`list[int]`	Specific run IDs to evaluate. If empty or omitted, recent completed runs are evaluated instead.
`limit`	`int`	Number of recent completed runs to evaluate (default 10). Ignored if `run_ids` is provided.

Annotation Queues

For manual human review, annotation queues let you build a workflow where team members score runs one by one.

Create a Queue

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Weekly QA Review",
    "description": "Manual review of flagged agent responses",
    "score_names": ["accuracy", "tone"],
    "score_configs": [
      {"name": "accuracy", "data_type": "numeric", "description": "0-1 accuracy rating"},
      {"name": "tone", "data_type": "categorical", "options": ["professional", "casual", "rude"]}
    ]
  }'

Add Items to a Queue

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/{queue_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_ids": [101, 102, 103]}'

Annotation Workflow

The annotation workflow follows a pull-based model:

1. Get next item: GET /api/v1/evaluations/annotation-queues/{queue_id}/next/ fetches the highest-priority pending item and marks it in_progress.
2. Review: The annotator sees the run's input, output, and LLM call details.
3. Submit scores: POST /api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/submit/ with scores.
4. Or skip: POST /api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/skip/
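
The pull-based model above can be pictured as a small state machine: pending, then in_progress, then completed or skipped. The toy class below simulates it in memory. It is not Waxell's implementation, every name in it is hypothetical, and it assumes items are served in insertion order because the priority rules are not described here.

```python
from collections import Counter

STATUSES = ("pending", "in_progress", "completed", "skipped")


class InMemoryQueue:
    """Toy stand-in for an annotation queue; not Waxell's implementation."""

    def __init__(self, run_ids):
        # Assumption: items are served in insertion order.
        self.items = [
            {"id": n, "run_id": run_id, "status": "pending", "scores": []}
            for n, run_id in enumerate(run_ids, start=1)
        ]

    def pull_next(self):
        """Mimics GET .../next/: return a pending item and mark it in_progress."""
        for item in self.items:
            if item["status"] == "pending":
                item["status"] = "in_progress"
                return item
        return None  # nothing left to review

    def submit(self, item_id, scores):
        """Mimics POST .../submit/: store scores and complete the item."""
        item = self._claimed(item_id)
        item["scores"] = list(scores)
        item["status"] = "completed"

    def skip(self, item_id):
        """Mimics POST .../skip/: mark the item skipped without scores."""
        self._claimed(item_id)["status"] = "skipped"

    def counts(self):
        """Item counts by status, in the same shape as the queue detail response."""
        by_status = Counter(item["status"] for item in self.items)
        return {"total": len(self.items), **{s: by_status[s] for s in STATUSES}}

    def _claimed(self, item_id):
        for item in self.items:
            if item["id"] == item_id and item["status"] == "in_progress":
                return item
        raise ValueError(f"item {item_id} is not in progress")
```

Because each annotator pulls the next item rather than being assigned one, several reviewers can work through the same queue without coordinating who takes which run.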

Submit Example:

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/submit/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "scores": [
      {"name": "accuracy", "value": 0.9, "data_type": "numeric"},
      {"name": "tone", "value": "professional", "data_type": "categorical"}
    ]
  }'
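
Before submitting, a client can check the scores against the queue's `score_configs`. This is an illustrative sketch with hypothetical helper names; it assumes the `name`, `data_type`, and `options` keys shown in the Create a Queue example and does not reflect any documented server-side validation.

```python
def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_scores(scores: list, score_configs: list) -> list:
    """Return a list of problems; an empty list means the submission looks valid."""
    configs = {config["name"]: config for config in score_configs}
    problems = []
    for score in scores:
        name, value = score["name"], score["value"]
        config = configs.get(name)
        if config is None:
            problems.append(f"{name}: not configured for this queue")
        elif score.get("data_type") != config["data_type"]:
            problems.append(f"{name}: data_type should be {config['data_type']}")
        elif config["data_type"] == "numeric" and not _is_number(value):
            problems.append(f"{name}: value must be a number")
        elif "options" in config and value not in config["options"]:
            problems.append(f"{name}: {value!r} is not one of {config['options']}")
    return problems
```

With the queue defined earlier, the submit example above produces no problems, while a tone of "sarcastic" would be flagged because it is not among the configured options.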

Queue Status

GET /api/v1/evaluations/annotation-queues/{queue_id}/

Returns item counts by status:

{
  "item_counts": {
    "total": 50,
    "pending": 32,
    "in_progress": 3,
    "completed": 12,
    "skipped": 3
  }
}
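
The counts can be reduced to a simple progress figure. In this sketch, `review_progress` is a hypothetical helper, and treating skipped items as handled is an assumption about how a team might want to report progress.

```python
def review_progress(item_counts: dict) -> dict:
    """Summarize queue progress from the item_counts object."""
    total = item_counts["total"]
    # Assumption: skipped items count as handled, since no one needs to revisit them.
    handled = item_counts["completed"] + item_counts["skipped"]
    return {
        "handled": handled,
        "remaining": item_counts["pending"] + item_counts["in_progress"],
        "percent_handled": round(100 * handled / total, 1) if total else 0.0,
    }
```

For the response above, 15 of 50 items are handled and 35 remain.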

Annotation Queue API Summary

Endpoint	Method	Description
`/api/v1/evaluations/annotation-queues/`	GET	List queues
`/api/v1/evaluations/annotation-queues/`	POST	Create queue
`/api/v1/evaluations/annotation-queues/{id}/`	GET	Queue detail with counts
`/api/v1/evaluations/annotation-queues/{id}/`	PUT	Update queue
`/api/v1/evaluations/annotation-queues/{id}/`	DELETE	Deactivate queue
`/api/v1/evaluations/annotation-queues/{id}/items/`	GET	List items
`/api/v1/evaluations/annotation-queues/{id}/items/`	POST	Add items
`/api/v1/evaluations/annotation-queues/{id}/next/`	GET	Get next item
`/api/v1/evaluations/annotation-queues/{id}/items/{item_id}/submit/`	POST	Submit scores
`/api/v1/evaluations/annotation-queues/{id}/items/{item_id}/skip/`	POST	Skip item

Evaluators in the Prompt Metrics Tab

When evaluators produce scores for runs linked to a registered prompt (via prompt_hash), those scores appear in the Prompt Metrics tab, grouped by prompt version. Each score column header links back to the evaluator that produced it, using the evaluator_metadata map keyed by score name:

{
  "prompt_name": "chat-system-prompt",
  "versions": [...],
  "evaluator_metadata": {
    "helpfulness": {
      "evaluator_id": "a1b2c3d4-...",
      "evaluator_name": "helpfulness-v1"
    },
    "factuality": {
      "evaluator_id": "e5f6g7h8-...",
      "evaluator_name": "factuality-v1"
    }
  }
}

This closes the loop between evaluation and prompt management: you can see at a glance which evaluators are scoring each prompt version, and click through to the evaluator configuration.
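
A client consuming this payload could derive the column headers from `evaluator_metadata` as sketched below. `evaluator_columns` is a hypothetical helper; it relies only on the keys shown in the example above and does not build any link URLs, since the UI's URL scheme is not documented here.

```python
def evaluator_columns(metrics: dict) -> list:
    """One (score_name, evaluator_name, evaluator_id) tuple per scored column."""
    metadata = metrics.get("evaluator_metadata", {})
    return [
        (score_name, info["evaluator_name"], info["evaluator_id"])
        for score_name, info in sorted(metadata.items())
    ]
```

Sorting by score name keeps the column order stable between requests.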

Using Evaluators with Experiments

Evaluators can be attached to experiments so that each experiment run is automatically scored. When creating an experiment, pass evaluator_ids to wire up scoring:

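import httpx

# Placeholders: substitute your own session cookie and IDs.
session_id = "..."
dataset_id = "..."
helpfulness_evaluator_id = "..."
safety_evaluator_id = "..."

resp = httpx.post(
    "https://acme.waxell.dev/api/v1/experiments/",
    cookies={"sessionid": session_id},
    json={
        "name": "v2 with helpfulness scoring",
        "dataset_id": dataset_id,
        "config": {"prompt_name": "chat-system-prompt", "prompt_label": "staging"},
        # Each experiment run is scored by every evaluator listed here.
        "evaluator_ids": [helpfulness_evaluator_id, safety_evaluator_id],
    },
)
resp.raise_for_status()  # fail loudly on a 4xx/5xx response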

See Datasets & Experiments for the full experiment workflow.

Next Steps

Scoring & Feedback -- Understanding the score data model
Datasets & Experiments -- Use evaluators in experiment pipelines
Prompt Management -- Version prompts and link them to evaluation
