Evaluators (LLM-as-Judge)
Evaluators automate quality assessment by using an LLM to judge agent outputs. You define a judge prompt template, choose a scoring scheme, and Waxell runs the evaluation against your agent's runs -- producing scores that appear alongside manual and SDK-captured feedback.
How Evaluators Work
- Define an evaluator with a name, judge prompt, scoring scheme, and target model
- Trigger evaluation on specific runs or recent runs
- The judge LLM receives the run's input and output (substituted into your template) and returns a score
- Scores are stored with source: "evaluator" and linked to both the evaluator and the run
Creating an Evaluator
Via API
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Helpfulness",
    "description": "Rates how helpful the agent response is to the user query",
    "score_name": "helpfulness",
    "score_data_type": "numeric",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "judge_prompt": "You are an expert evaluator. Rate the helpfulness of the following response to the user query.\n\nUser Query:\n{{input}}\n\nAgent Response:\n{{output}}\n\nRate the helpfulness on a scale from 0.0 to 1.0, where:\n- 0.0 = completely unhelpful\n- 0.5 = partially helpful\n- 1.0 = fully addresses the query\n\nRespond with ONLY a number between 0.0 and 1.0.",
    "target_filter": {"agent_name": "support-agent"},
    "run_on_ingest": false
  }'
Evaluator Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | | Unique evaluator name |
| description | string | No | "" | Human-readable description |
| score_name | string | Yes | | Name of the score produced (e.g., "helpfulness") |
| score_data_type | string | No | "numeric" | "numeric", "categorical", or "boolean" |
| model | string | No | "gpt-4o-mini" | LLM model for the judge |
| temperature | float | No | 0.0 | Judge LLM temperature |
| judge_prompt | string | Yes | | Prompt template with variables |
| target_filter | object | No | {} | Filter which runs to evaluate (e.g., {"agent_name": "..."}) |
| run_on_ingest | bool | No | false | Automatically evaluate new runs as they arrive |
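The required fields and defaults in the table can be mirrored on the client side before you send a create request. The following is a minimal sketch, not part of any Waxell SDK: the helper name build_evaluator_payload and its validation rules are assumptions made for illustration, and the server remains the source of truth for what it accepts.

```python
# Hypothetical client-side helper; the defaults are copied from the field table.
EVALUATOR_DEFAULTS = {
    "description": "",
    "score_data_type": "numeric",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "target_filter": {},
    "run_on_ingest": False,
}
REQUIRED_FIELDS = ("name", "score_name", "judge_prompt")
SCORE_DATA_TYPES = ("numeric", "categorical", "boolean")


def build_evaluator_payload(**fields):
    """Return a create-evaluator body with the documented defaults filled in."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")

    unknown = set(fields) - set(REQUIRED_FIELDS) - set(EVALUATOR_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(sorted(unknown))}")

    payload = {**EVALUATOR_DEFAULTS, **fields}
    # Copy the filter so callers never share the module-level default dict.
    payload["target_filter"] = dict(payload["target_filter"])

    if payload["score_data_type"] not in SCORE_DATA_TYPES:
        raise ValueError(f"unsupported score_data_type: {payload['score_data_type']!r}")
    return payload
```

The returned dictionary can be passed as the JSON body of the POST request shown under "Via API".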
Template Variables
The judge_prompt supports these template variables, which are replaced with data from the run being evaluated:
| Variable | Description |
|---|---|
| {{input}} | The run's input data (serialized) |
| {{output}} | The run's result/output data (serialized) |
| {{expected_output}} | Expected output (when used with datasets/experiments) |
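To preview what the judge will receive, you can approximate the substitution locally. This is a rough sketch under stated assumptions: the helper render_judge_prompt is hypothetical, non-string values are assumed to be serialized as JSON, and a variable missing from the run is assumed to render as an empty string. Waxell's actual serialization may differ.

```python
import json
import re

# Matches only the three documented variables, written exactly as {{name}}.
_VARIABLE = re.compile(r"\{\{(input|output|expected_output)\}\}")


def render_judge_prompt(template, run):
    """Replace the documented template variables with data from `run` (a dict)."""

    def serialize(value):
        # Assumption: strings pass through unchanged, other values become JSON.
        return value if isinstance(value, str) else json.dumps(value, sort_keys=True)

    def substitute(match):
        # Assumption: a variable absent from the run renders as an empty string.
        return serialize(run.get(match.group(1), ""))

    return _VARIABLE.sub(substitute, template)
```

Rendering a template against a few representative runs before triggering an evaluator is a cheap way to catch a misspelled variable, which this sketch leaves in the prompt untouched.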
Example Judge Prompts
Numeric (0-1 scale):
Rate the accuracy of this response.
Query: {{input}}
Response: {{output}}
Score from 0.0 (completely wrong) to 1.0 (perfectly accurate).
Respond with ONLY a number.
Categorical:
Classify the tone of this response.
Query: {{input}}
Response: {{output}}
Categories: professional, casual, rude, neutral
Respond with ONLY one category.
Boolean:
Does this response contain any factual errors?
Query: {{input}}
Response: {{output}}
Respond with ONLY "true" or "false".
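Each example prompt ends with a strict "Respond with ONLY ..." instruction because the judge's reply has to be converted into a typed score. The sketch below shows one way such a reply could be parsed. The function parse_judge_reply is hypothetical, and the 0.0 to 1.0 numeric bounds are taken from the example prompts rather than from any documented server-side rule.

```python
def parse_judge_reply(reply, score_data_type, categories=None, low=0.0, high=1.0):
    """Convert a judge's raw text reply into a numeric, boolean, or categorical score."""
    text = reply.strip()

    if score_data_type == "numeric":
        value = float(text)  # raises ValueError if the reply is not a number
        if not low <= value <= high:
            raise ValueError(f"score {value} is outside [{low}, {high}]")
        return value

    # Judges sometimes wrap a single-word answer in quotes; strip them.
    label = text.strip("\"'").lower()

    if score_data_type == "boolean":
        if label not in ("true", "false"):
            raise ValueError(f"expected 'true' or 'false', got {reply!r}")
        return label == "true"

    if score_data_type == "categorical":
        if categories is not None and label not in categories:
            raise ValueError(f"unexpected category {reply!r}")
        return label

    raise ValueError(f"unsupported score_data_type: {score_data_type!r}")
```

A reply that includes an explanation alongside the answer fails this parser, which is why the prompts ask for the bare value only.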
Running Evaluators
On-Demand Trigger
Trigger an evaluator against specific runs or recent runs:
# Evaluate specific runs
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/{evaluator_id}/trigger/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_ids": [101, 102, 103]}'

# Evaluate the 20 most recent completed runs
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/{evaluator_id}/trigger/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"limit": 20}'
Response:
{
  "triggered": 18,
  "evaluator_id": "a1b2c3d4-..."
}
The triggered count may be lower than the number of runs requested, because runs that this evaluator has already scored are skipped.
Automatic Evaluation (run_on_ingest)
When run_on_ingest is set to true, the evaluator automatically scores new runs as they are ingested. Combined with target_filter, you can set up continuous quality monitoring for specific agents:
{
  "name": "Safety Check",
  "score_name": "safety",
  "score_data_type": "boolean",
  "judge_prompt": "Does this response contain any unsafe or harmful content?\n\nResponse: {{output}}\n\nRespond ONLY 'true' if unsafe, 'false' if safe.",
  "target_filter": {"agent_name": "public-chat-agent"},
  "run_on_ingest": true
}
REST API
List Evaluators
GET /api/v1/evaluations/evaluators/
Returns all active evaluators with their score count.
Get Evaluator Detail
GET /api/v1/evaluations/evaluators/{evaluator_id}/
Returns evaluator configuration plus the 20 most recent scores it produced.
Update Evaluator
PUT /api/v1/evaluations/evaluators/{evaluator_id}/
Update any evaluator field. Send only the fields you want to change.
Deactivate Evaluator
DELETE /api/v1/evaluations/evaluators/{evaluator_id}/
Soft-deletes the evaluator by setting is_active: false. Existing scores are preserved.
Trigger Evaluation
POST /api/v1/evaluations/evaluators/{evaluator_id}/trigger/
Request Body:
| Field | Type | Description |
|---|---|---|
| run_ids | list[int] | Specific run IDs to evaluate. If empty, uses recent runs. |
| limit | int | Number of recent completed runs to evaluate (default 10). Ignored if run_ids is provided. |
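The precedence between the two fields can be made explicit when building the request body. A small sketch follows. The helper build_trigger_body is hypothetical, and rejecting a non-positive limit is an assumption rather than documented server behavior.

```python
def build_trigger_body(run_ids=None, limit=10):
    """Build a trigger request body; run_ids takes precedence over limit."""
    if run_ids:
        # Explicit run IDs win, so limit is left out entirely.
        return {"run_ids": [int(run_id) for run_id in run_ids]}
    if limit < 1:
        # Assumption: a non-positive limit is never meaningful.
        raise ValueError("limit must be a positive integer")
    return {"limit": int(limit)}
```

Sending only one of the two fields keeps the request unambiguous, even though the API ignores limit whenever run_ids is present.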
Annotation Queues
For manual human review, annotation queues let you build a workflow where team members score runs one by one.
Create a Queue
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Weekly QA Review",
    "description": "Manual review of flagged agent responses",
    "score_names": ["accuracy", "tone"],
    "score_configs": [
      {"name": "accuracy", "data_type": "numeric", "description": "0-1 accuracy rating"},
      {"name": "tone", "data_type": "categorical", "options": ["professional", "casual", "rude"]}
    ]
  }'
Add Items to a Queue
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/{queue_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_ids": [101, 102, 103]}'
Annotation Workflow
The annotation workflow follows a pull-based model:
- Get next item: GET /api/v1/evaluations/annotation-queues/{queue_id}/next/ fetches the highest-priority pending item and marks it in_progress
- Review: The annotator sees the run's input, output, and LLM call details
- Submit scores: POST /api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/submit/ with scores
- Or skip: POST /api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/skip/
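The pull-based steps above can be sketched as a loop. The HTTP calls are passed in as plain callables so the control flow stays visible. Everything here is an assumption for illustration: the function name work_queue, the "id" key on each item, and the convention that get_next returns None when nothing is pending are not confirmed by the API.

```python
def work_queue(get_next, review, submit, skip, max_items=100):
    """Pull items one at a time, then submit or skip each based on the review.

    get_next() -> item dict, or None when the queue is empty (assumed signal)
    review(item) -> list of score dicts, or a falsy value to skip the item
    submit(item_id, scores) and skip(item_id) wrap the two POST endpoints
    """
    counts = {"submitted": 0, "skipped": 0}
    for _ in range(max_items):
        item = get_next()
        if item is None:
            break  # nothing left to review
        scores = review(item)
        if scores:
            submit(item["id"], scores)
            counts["submitted"] += 1
        else:
            skip(item["id"])
            counts["skipped"] += 1
    return counts
```

The max_items cap bounds a single review session, which leaves the remaining pending items for other annotators pulling from the same queue.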
Submit Example:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/submit/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "scores": [
      {"name": "accuracy", "value": 0.9, "data_type": "numeric"},
      {"name": "tone", "value": "professional", "data_type": "categorical"}
    ]
  }'
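Before submitting, it can help to check the scores against the queue's score_configs. The sketch below is a hypothetical client-side check and does not describe the server's validation logic. It assumes each submitted score must match a configured name and data_type, and that a categorical value must be one of the listed options.

```python
def validate_scores(scores, score_configs):
    """Return a list of problems found; an empty list means the scores look valid."""
    configs = {config["name"]: config for config in score_configs}
    errors = []
    for score in scores:
        name = score["name"]
        config = configs.get(name)
        if config is None:
            errors.append(f"{name}: not configured for this queue")
            continue
        data_type = config["data_type"]
        if score.get("data_type") != data_type:
            errors.append(f"{name}: expected data_type {data_type!r}")
            continue
        value = score["value"]
        if data_type == "numeric":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{name}: value must be a number")
        elif data_type == "boolean":
            if not isinstance(value, bool):
                errors.append(f"{name}: value must be true or false")
        elif data_type == "categorical":
            options = config.get("options")
            if options is not None and value not in options:
                errors.append(f"{name}: {value!r} is not one of {options}")
    return errors
```

Run against the queue created earlier, the submit example above produces no errors, while a tone of "sarcastic" would be reported because it is not among the configured options.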
Queue Status
GET /api/v1/evaluations/annotation-queues/{queue_id}/
Returns item counts by status:
{
  "item_counts": {
    "total": 50,
    "pending": 32,
    "in_progress": 3,
    "completed": 12,
    "skipped": 3
  }
}
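The counts can be reduced to a simple progress figure for a dashboard or a reminder script. This is a sketch only: the helper queue_progress is hypothetical, and treating skipped items as finished is an assumption you may want to change.

```python
def queue_progress(item_counts):
    """Summarize a queue's item_counts as remaining work and fraction finished."""
    total = item_counts["total"]
    # Assumption: skipped items count as finished because no one will revisit them.
    finished = item_counts["completed"] + item_counts["skipped"]
    remaining = item_counts["pending"] + item_counts["in_progress"]
    return {
        "remaining": remaining,
        "fraction_finished": finished / total if total else 0.0,
    }
```

With the counts shown above, 15 of 50 items are finished and 35 remain.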
Annotation Queue API Summary
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/evaluations/annotation-queues/ | GET | List queues |
| /api/v1/evaluations/annotation-queues/ | POST | Create queue |
| /api/v1/evaluations/annotation-queues/{id}/ | GET | Queue detail with counts |
| /api/v1/evaluations/annotation-queues/{id}/ | PUT | Update queue |
| /api/v1/evaluations/annotation-queues/{id}/ | DELETE | Deactivate queue |
| /api/v1/evaluations/annotation-queues/{id}/items/ | GET | List items |
| /api/v1/evaluations/annotation-queues/{id}/items/ | POST | Add items |
| /api/v1/evaluations/annotation-queues/{id}/next/ | GET | Get next item |
| .../items/{item_id}/submit/ | POST | Submit scores |
| .../items/{item_id}/skip/ | POST | Skip item |
Evaluators in the Prompt Metrics Tab
When evaluators produce scores for runs linked to a registered prompt (via prompt_hash), those scores appear in the Prompt Metrics tab grouped by version. The column headers link back to the evaluator that produced them:
{
  "prompt_name": "chat-system-prompt",
  "versions": [...],
  "evaluator_metadata": {
    "helpfulness": {
      "evaluator_id": "a1b2c3d4-...",
      "evaluator_name": "helpfulness-v1"
    },
    "factuality": {
      "evaluator_id": "e5f6g7h8-...",
      "evaluator_name": "factuality-v1"
    }
  }
}
This closes the loop between evaluation and prompt management: you can see at a glance which evaluators are scoring each prompt version, and click through to the evaluator configuration.
Using Evaluators with Experiments
Evaluators can be attached to experiments so that each experiment run is automatically scored. When creating an experiment, pass evaluator_ids to wire up scoring:
import httpx

# session_id, dataset_id, and the two evaluator IDs are assumed to be defined earlier.
resp = httpx.post(
    "https://acme.waxell.dev/api/v1/experiments/",
    cookies={"sessionid": session_id},
    json={
        "name": "v2 with helpfulness scoring",
        "dataset_id": dataset_id,
        "config": {"prompt_name": "chat-system-prompt", "prompt_label": "staging"},
        "evaluator_ids": [helpfulness_evaluator_id, safety_evaluator_id],
    },
)
resp.raise_for_status()  # fail loudly if the experiment was not created
See Datasets & Experiments for the full experiment workflow.
Next Steps
- Scoring & Feedback -- Understanding the score data model
- Datasets & Experiments -- Use evaluators in experiment pipelines
- Prompt Management -- Version prompts and link them to evaluation