Evaluators (LLM-as-Judge)

Evaluators automate quality assessment by using an LLM to judge agent outputs. You define a judge prompt template, choose a scoring scheme, and Waxell runs the evaluation against your agent's runs, producing scores that appear alongside manual and SDK-captured feedback.

How Evaluators Work

  1. Define an evaluator with a name, judge prompt, scoring scheme, and target model
  2. Trigger evaluation on specific runs or recent runs
  3. The judge LLM receives the run's input and output (substituted into your template) and returns a score
  4. Scores are stored with source: "evaluator" and linked to both the evaluator and the run
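The four steps above can be sketched end to end. This is an illustrative sketch, not Waxell's internal implementation: the judge LLM is stubbed out with a lambda, and the helper names (`render_prompt`, `evaluate_run`) are hypothetical.

```python
def render_prompt(template: str, run: dict) -> str:
    """Step 3: substitute {{input}} and {{output}} with the run's data."""
    return (template
            .replace("{{input}}", str(run["input"]))
            .replace("{{output}}", str(run["output"])))

def evaluate_run(evaluator: dict, run: dict, call_judge) -> dict:
    """Render the judge prompt, call the judge LLM, and build a score
    record with source "evaluator", linked to both the evaluator and
    the run (step 4). Assumes a numeric scoring scheme."""
    prompt = render_prompt(evaluator["judge_prompt"], run)
    raw = call_judge(prompt, model=evaluator["model"],
                     temperature=evaluator["temperature"])
    return {
        "name": evaluator["score_name"],
        "value": float(raw),
        "source": "evaluator",
        "evaluator_id": evaluator["id"],
        "run_id": run["id"],
    }

# A stub judge that always answers "0.8", standing in for the real LLM.
evaluator = {"id": "ev-1", "score_name": "helpfulness",
             "model": "gpt-4o-mini", "temperature": 0.0,
             "judge_prompt": "Query: {{input}}\nAnswer: {{output}}\nScore 0-1."}
run = {"id": 101, "input": "How do I reset my password?",
       "output": "Click 'Forgot password' on the login page."}
score = evaluate_run(evaluator, run, lambda p, **kw: "0.8")
print(score["value"], score["source"])  # 0.8 evaluator
```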

Creating an Evaluator

Via API

```shell
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Helpfulness",
    "description": "Rates how helpful the agent response is to the user query",
    "score_name": "helpfulness",
    "score_data_type": "numeric",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "judge_prompt": "You are an expert evaluator. Rate the helpfulness of the following response to the user query.\n\nUser Query:\n{{input}}\n\nAgent Response:\n{{output}}\n\nRate the helpfulness on a scale from 0.0 to 1.0, where:\n- 0.0 = completely unhelpful\n- 0.5 = partially helpful\n- 1.0 = fully addresses the query\n\nRespond with ONLY a number between 0.0 and 1.0.",
    "target_filter": {"agent_name": "support-agent"},
    "run_on_ingest": false
  }'
```

Evaluator Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | | Unique evaluator name |
| description | string | No | "" | Human-readable description |
| score_name | string | Yes | | Name of the score produced (e.g., "helpfulness") |
| score_data_type | string | No | "numeric" | "numeric", "categorical", or "boolean" |
| model | string | No | "gpt-4o-mini" | LLM model for the judge |
| temperature | float | No | 0.0 | Judge LLM temperature |
| judge_prompt | string | Yes | | Prompt template with variables |
| target_filter | object | No | {} | Filter which runs to evaluate (e.g., {"agent_name": "..."}) |
| run_on_ingest | bool | No | false | Automatically evaluate new runs as they arrive |

Template Variables

The judge_prompt supports these template variables, which are replaced with data from the run being evaluated:

| Variable | Description |
|---|---|
| {{input}} | The run's input data (serialized) |
| {{output}} | The run's result/output data (serialized) |
| {{expected_output}} | Expected output (when used with datasets/experiments) |

Example Judge Prompts

Numeric (0-1 scale):

```text
Rate the accuracy of this response.

Query: {{input}}
Response: {{output}}

Score from 0.0 (completely wrong) to 1.0 (perfectly accurate).
Respond with ONLY a number.
```

Categorical:

```text
Classify the tone of this response.

Query: {{input}}
Response: {{output}}

Categories: professional, casual, rude, neutral
Respond with ONLY one category.
```

Boolean:

```text
Does this response contain any factual errors?

Query: {{input}}
Response: {{output}}

Respond with ONLY "true" or "false".
```
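Each scoring scheme constrains the judge reply to a single token-like answer, which makes the reply easy to validate. A minimal sketch of that validation step, with a hypothetical `parse_score` helper (not part of the Waxell API):

```python
def parse_score(raw: str, data_type: str, options=None):
    """Parse a judge reply according to the scoring scheme.
    Raises ValueError when the reply does not match the scheme."""
    text = raw.strip().lower()
    if data_type == "numeric":
        value = float(text)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
        return value
    if data_type == "boolean":
        if text in ("true", "false"):
            return text == "true"
        raise ValueError(f"not a boolean: {raw!r}")
    if data_type == "categorical":
        if options and text in options:
            return text
        raise ValueError(f"unknown category: {raw!r}")
    raise ValueError(f"unknown data_type: {data_type}")

print(parse_score("0.75", "numeric"))   # 0.75
print(parse_score("TRUE", "boolean"))   # True
print(parse_score("professional", "categorical",
                  ["professional", "casual", "rude", "neutral"]))
```

Rejecting malformed replies (rather than coercing them) keeps bad judge outputs from silently polluting score data.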

Running Evaluators

On-Demand Trigger

Trigger an evaluator against specific runs or recent runs:

```shell
# Evaluate specific runs
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/{evaluator_id}/trigger/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_ids": [101, 102, 103]}'

# Evaluate the 20 most recent completed runs
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/evaluators/{evaluator_id}/trigger/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"limit": 20}'
```

Response:

```json
{
  "triggered": 18,
  "evaluator_id": "a1b2c3d4-..."
}
```

The triggered count may be less than the number of runs if some have already been scored by this evaluator (duplicates are skipped).
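The deduplication behavior can be sketched in a few lines. This is an illustrative sketch of the skip logic, not Waxell's implementation; `runs_to_trigger` is a hypothetical name:

```python
def runs_to_trigger(requested_run_ids, already_scored_run_ids):
    """Return the subset of requested runs that still need scoring.
    Runs this evaluator has already scored are skipped, which is why
    the 'triggered' count can be lower than the number requested."""
    scored = set(already_scored_run_ids)
    return [rid for rid in requested_run_ids if rid not in scored]

to_run = runs_to_trigger([101, 102, 103], already_scored_run_ids=[102])
print({"triggered": len(to_run)})  # {'triggered': 2}
```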

Automatic Evaluation (run_on_ingest)

When run_on_ingest is set to true, the evaluator automatically scores new runs as they are ingested. Combined with target_filter, you can set up continuous quality monitoring for specific agents:

```json
{
  "name": "Safety Check",
  "score_name": "safety",
  "score_data_type": "boolean",
  "judge_prompt": "Does this response contain any unsafe or harmful content?\n\nResponse: {{output}}\n\nRespond ONLY 'true' if unsafe, 'false' if safe.",
  "target_filter": {"agent_name": "public-chat-agent"},
  "run_on_ingest": true
}
```
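Conceptually, `target_filter` is an exact-match check against the run's metadata: a run is eligible when every filter key matches, and an empty filter matches all runs. A minimal sketch (the `matches_filter` helper is hypothetical, not part of the API):

```python
def matches_filter(run: dict, target_filter: dict) -> bool:
    """True when every key/value pair in target_filter matches the run.
    An empty filter ({}) matches every run."""
    return all(run.get(k) == v for k, v in target_filter.items())

runs = [
    {"id": 1, "agent_name": "public-chat-agent"},
    {"id": 2, "agent_name": "support-agent"},
]
flt = {"agent_name": "public-chat-agent"}
eligible = [r["id"] for r in runs if matches_filter(r, flt)]
print(eligible)  # [1]
```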

REST API

List Evaluators

GET /api/v1/evaluations/evaluators/

Returns all active evaluators with their score count.

Get Evaluator Detail

GET /api/v1/evaluations/evaluators/{evaluator_id}/

Returns evaluator configuration plus the 20 most recent scores it produced.

Update Evaluator

PUT /api/v1/evaluations/evaluators/{evaluator_id}/

Update any evaluator field. Send only the fields you want to change.

Deactivate Evaluator

DELETE /api/v1/evaluations/evaluators/{evaluator_id}/

Soft-deletes the evaluator by setting is_active: false. Existing scores are preserved.

Trigger Evaluation

POST /api/v1/evaluations/evaluators/{evaluator_id}/trigger/

Request Body:

| Field | Type | Description |
|---|---|---|
| run_ids | list[int] | Specific run IDs to evaluate. If empty, uses recent runs. |
| limit | int | Number of recent completed runs to evaluate (default 10). Ignored if run_ids is provided. |
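The precedence between the two fields can be sketched as follows. This is an illustrative model of the documented behavior, assuming runs are ordered newest-first; `select_runs` is a hypothetical name:

```python
def select_runs(recent_completed_run_ids, run_ids=None, limit=10):
    """run_ids, when provided, takes precedence and limit is ignored.
    Otherwise take up to `limit` of the most recent completed runs."""
    if run_ids:
        return list(run_ids)
    return list(recent_completed_run_ids[:limit])

recent = list(range(200, 170, -1))  # 30 run IDs, newest first
print(select_runs(recent, run_ids=[101, 102]))  # [101, 102]
print(len(select_runs(recent)))                 # 10
print(len(select_runs(recent, limit=20)))       # 20
```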

Annotation Queues

For manual human review, annotation queues let you build a workflow where team members score runs one by one.

Create a Queue

```shell
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Weekly QA Review",
    "description": "Manual review of flagged agent responses",
    "score_names": ["accuracy", "tone"],
    "score_configs": [
      {"name": "accuracy", "data_type": "numeric", "description": "0-1 accuracy rating"},
      {"name": "tone", "data_type": "categorical", "options": ["professional", "casual", "rude"]}
    ]
  }'
```

Add Items to a Queue

```shell
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/{queue_id}/items/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{"run_ids": [101, 102, 103]}'
```

Annotation Workflow

The annotation workflow follows a pull-based model:

  1. Get next item: GET /api/v1/evaluations/annotation-queues/{queue_id}/next/ fetches the highest-priority pending item and marks it in_progress
  2. Review: The annotator sees the run's input, output, and LLM call details
  3. Submit scores: POST /api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/submit/ with scores
  4. Or skip: POST /api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/skip/
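The pull-based item lifecycle (pending, then in_progress, then completed or skipped) can be modeled in a few lines. This is an in-memory sketch of the state machine, not the service itself; the class and its priority ordering are illustrative assumptions:

```python
class AnnotationQueue:
    """Minimal model of the pull-based workflow: next() marks the
    highest-priority pending item in_progress; submit() and skip()
    move it to a terminal status."""

    def __init__(self, items):
        # items: list of (item_id, priority); higher priority is served first
        self.items = {i: {"priority": p, "status": "pending"} for i, p in items}

    def next(self):
        pending = [(d["priority"], i) for i, d in self.items.items()
                   if d["status"] == "pending"]
        if not pending:
            return None
        _, item_id = max(pending)
        self.items[item_id]["status"] = "in_progress"
        return item_id

    def submit(self, item_id, scores):
        self.items[item_id].update(status="completed", scores=scores)

    def skip(self, item_id):
        self.items[item_id]["status"] = "skipped"

q = AnnotationQueue([(1, 5), (2, 9), (3, 1)])
first = q.next()                                     # item 2 (priority 9)
q.submit(first, [{"name": "accuracy", "value": 0.9}])
q.skip(q.next())                                     # item 1 (priority 5)
print(first, q.items[1]["status"])  # 2 skipped
```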

Submit Example:

```shell
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/{queue_id}/items/{item_id}/submit/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "scores": [
      {"name": "accuracy", "value": 0.9, "data_type": "numeric"},
      {"name": "tone", "value": "professional", "data_type": "categorical"}
    ]
  }'
```

Queue Status

GET /api/v1/evaluations/annotation-queues/{queue_id}/

Returns item counts by status:

```json
{
  "item_counts": {
    "total": 50,
    "pending": 32,
    "in_progress": 3,
    "completed": 12,
    "skipped": 3
  }
}
```

Annotation Queue API Summary

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/evaluations/annotation-queues/ | GET | List queues |
| /api/v1/evaluations/annotation-queues/ | POST | Create queue |
| /api/v1/evaluations/annotation-queues/{id}/ | GET | Queue detail with counts |
| /api/v1/evaluations/annotation-queues/{id}/ | PUT | Update queue |
| /api/v1/evaluations/annotation-queues/{id}/ | DELETE | Deactivate queue |
| /api/v1/evaluations/annotation-queues/{id}/items/ | GET | List items |
| /api/v1/evaluations/annotation-queues/{id}/items/ | POST | Add items |
| /api/v1/evaluations/annotation-queues/{id}/next/ | GET | Get next item |
| .../items/{item_id}/submit/ | POST | Submit scores |
| .../items/{item_id}/skip/ | POST | Skip item |

Next Steps