Human Review Workflow
Automated evaluators are valuable, but expert human judgment remains essential for nuanced quality assessment. This tutorial walks you through setting up annotation queues where reviewers can score agent outputs using configurable criteria.
Prerequisites
- Waxell Observe with recorded agent execution runs
- Dashboard access with an authenticated session
- Familiarity with the Scoring feature
What You'll Learn
- How to create and configure annotation queues
- How to define scoring criteria (numeric, categorical, boolean)
- How to add items to a queue and manage the review workflow
- How to submit and skip reviews
- Best practices for calibration and consistency
Step 1: Why Human Review?
Automated evaluators (LLM-as-judge) scale well but can miss nuance. Human review complements automation for:
- Subjective quality -- Is the tone appropriate? Is the response helpful?
- Domain expertise -- Does the medical/legal/financial advice meet professional standards?
- Edge cases -- Automated evaluators may not catch subtle errors
- Evaluator calibration -- Human scores serve as ground truth to validate your automated evaluators
A typical workflow combines both: automated evaluators score every run, while human reviewers sample a subset for quality control.
Step 2: Create an Annotation Queue
An annotation queue defines what you want reviewed and how reviewers should score items.
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Quality Review",
"description": "Weekly review of support bot responses for quality and accuracy",
"score_names": ["accuracy", "tone", "safety"],
"score_configs": [
{
"name": "accuracy",
"data_type": "numeric",
"description": "How accurate is the response? (0.0 = completely wrong, 1.0 = perfectly accurate)",
"min": 0.0,
"max": 1.0
},
{
"name": "tone",
"data_type": "categorical",
"description": "What is the tone of the response?",
"options": ["professional", "casual", "inappropriate"]
},
{
"name": "safety",
"data_type": "boolean",
"description": "Is the response safe and appropriate?"
}
]
}'
Response:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"name": "Support Quality Review",
"description": "Weekly review of support bot responses for quality and accuracy",
"score_names": ["accuracy", "tone", "safety"],
"score_configs": [
{
"name": "accuracy",
"data_type": "numeric",
"description": "How accurate is the response? (0.0 = completely wrong, 1.0 = perfectly accurate)",
"min": 0.0,
"max": 1.0
},
{
"name": "tone",
"data_type": "categorical",
"description": "What is the tone of the response?",
"options": ["professional", "casual", "inappropriate"]
},
{
"name": "safety",
"data_type": "boolean",
"description": "Is the response safe and appropriate?"
}
],
"is_active": true,
"created_at": "2026-02-07T10:00:00Z"
}
Keep score configs simple. Three to five scoring dimensions per queue is a good target. Too many dimensions slow reviewers down and reduce consistency.
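Before submitting scores (Step 5), it can help to validate them client-side against the queue's score_configs. A minimal sketch, assuming the payload shapes shown above; `validate_scores` is a hypothetical helper, not part of the API:

```python
# Hypothetical client-side check that a score payload matches a queue's
# score_configs before submission. Field names mirror the API payloads above.

def validate_scores(scores, score_configs):
    """Return a list of problems; an empty list means the payload looks valid."""
    configs = {c["name"]: c for c in score_configs}
    problems = []
    for score in scores:
        config = configs.get(score["name"])
        if config is None:
            problems.append(f"unknown score name: {score['name']}")
            continue
        value, dtype = score["value"], config["data_type"]
        if dtype == "numeric":
            # Exclude bools: isinstance(True, int) is True in Python.
            if isinstance(value, bool) or not isinstance(value, (int, float)) \
                    or not (config["min"] <= value <= config["max"]):
                problems.append(
                    f"{score['name']}: expected number in "
                    f"[{config['min']}, {config['max']}]")
        elif dtype == "categorical":
            if value not in config["options"]:
                problems.append(f"{score['name']}: {value!r} not in {config['options']}")
        elif dtype == "boolean":
            if not isinstance(value, bool):
                problems.append(f"{score['name']}: expected true/false")
    return problems
```

Catching an out-of-range numeric or a misspelled category before the POST saves a round trip and keeps the queue's data clean.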
Step 3: Add Items to the Queue
Add specific agent execution runs to the queue for review. You can add runs individually or in bulk.
Add runs by ID:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"run_ids": [
"run-id-1",
"run-id-2",
"run-id-3"
]
}'
Response:
{
"added": 3,
"skipped": 0,
"queue_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
Add individual LLM calls when you want call-level (rather than run-level) review:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"llm_call_ids": [456, 789, 1012]
}'
Duplicate items are automatically skipped. If a run or LLM call is already in the queue, it will not be added again.
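When queuing a large backlog, you may want to split the run IDs into fixed-size batches rather than one giant request. A small sketch; the batch size of 100 is an assumption, not a documented API limit:

```python
# Split a long list of run IDs into request payloads matching the
# curl example above. The batch size is an assumed client-side choice.

def build_batches(run_ids, batch_size=100):
    """Return a list of {"run_ids": [...]} payloads, each at most batch_size long."""
    return [
        {"run_ids": run_ids[start:start + batch_size]}
        for start in range(0, len(run_ids), batch_size)
    ]
```

Each payload can then be POSTed to the `/items/` endpoint in turn; since duplicates are skipped server-side, re-running the loop after a failure is safe.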
Step 4: Review Items in the Queue
List items in the queue filtered by status:
# Get pending items
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/?status=pending" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"results": [
{
"id": "item-uuid-1",
"status": "pending",
"assigned_to": null,
"priority": 0,
"created_at": "2026-02-07T10:05:00Z",
"run": {
"id": "run-id-1",
"agent_name": "support-bot",
"workflow_name": "handle-query",
"started_at": "2026-02-07T09:00:00Z",
"duration": 2.5
},
"llm_call": null
}
],
"count": 3
}
Get the next item to review (auto-assigns to you):
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/next/" \
-H "Authorization: Bearer <your-session-token>"
This returns the highest-priority, oldest pending item and marks it as in_progress:
{
"item": {
"id": "item-uuid-1",
"status": "in_progress",
"assigned_to": "user-id",
"priority": 0,
"run": {
"id": "run-id-1",
"agent_name": "support-bot",
"workflow_name": "handle-query",
"started_at": "2026-02-07T09:00:00Z",
"completed_at": "2026-02-07T09:00:02Z",
"duration": 2.5,
"inputs": {"message": "How do I reset my password?"},
"result": {"output": "To reset your password, go to Settings > Security..."},
"steps": []
},
"llm_call": null,
"score_names": ["accuracy", "tone", "safety"],
"score_configs": [
{
"name": "accuracy",
"data_type": "numeric",
"description": "How accurate is the response?",
"min": 0.0,
"max": 1.0
}
]
}
}
The /next/ endpoint is designed for a review workflow: fetch, review, submit, fetch next. The score_configs in the response tell the reviewer exactly what to evaluate.
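The loop can be sketched as follows. `fetch_next`, `review`, and `submit_scores` are hypothetical stand-ins for the `/next/` call, the human review step, and the `/submit/` call; only the loop structure is the point:

```python
# Sketch of the fetch-review-submit loop described above. The three callables
# are hypothetical stand-ins, not part of any SDK.

def drain_queue(fetch_next, review, submit_scores):
    """Pull items one at a time until the queue has no pending work."""
    completed = 0
    while True:
        item = fetch_next()                 # GET .../next/ assigns an item, or None
        if item is None:
            break                           # queue exhausted
        scores = review(item)               # human produces scores per score_configs
        submit_scores(item["id"], scores)   # POST .../items/<id>/submit/
        completed += 1
    return completed
```

Because `/next/` marks each fetched item as in_progress and assigns it to the caller, multiple reviewers can run this loop concurrently without stepping on each other's items.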
Step 5: Submit Scores
After reviewing the item, submit your scores:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/item-uuid-1/submit/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"scores": [
{
"name": "accuracy",
"value": 0.95,
"data_type": "numeric",
"comment": "Correct instructions for password reset"
},
{
"name": "tone",
"value": "professional",
"data_type": "categorical",
"comment": "Clear and helpful tone"
},
{
"name": "safety",
"value": true,
"data_type": "boolean"
}
]
}'
Response:
{
"item_id": "item-uuid-1",
"status": "completed",
"scores_created": 3,
"scores": [
{
"id": "score-uuid-1",
"name": "accuracy",
"data_type": "numeric",
"numeric_value": 0.95,
"string_value": null
},
{
"id": "score-uuid-2",
"name": "tone",
"data_type": "categorical",
"numeric_value": null,
"string_value": "professional"
},
{
"id": "score-uuid-3",
"name": "safety",
"data_type": "boolean",
"numeric_value": 1.0,
"string_value": "true"
}
]
}
The item is automatically marked as completed after scores are submitted.
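Note how the response stores each value: numeric scores fill numeric_value, categorical scores fill string_value, and booleans fill both (1.0/"true"). A sketch of that mapping, inferred from the response above rather than from documented server behavior:

```python
# How a submitted value appears to map onto the stored numeric_value /
# string_value pair, inferred from the submit response above (an assumption).

def stored_fields(value, data_type):
    if data_type == "numeric":
        return {"numeric_value": float(value), "string_value": None}
    if data_type == "categorical":
        return {"numeric_value": None, "string_value": value}
    if data_type == "boolean":
        return {"numeric_value": 1.0 if value else 0.0,
                "string_value": "true" if value else "false"}
    raise ValueError(f"unknown data_type: {data_type}")
```

The numeric encoding of booleans (1.0/0.0) is what lets boolean scores participate in averages in the analytics endpoint later.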
Step 6: Skip Items
If an item is unclear, corrupted, or not relevant, skip it:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/item-uuid-2/skip/" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"item_id": "item-uuid-2",
"status": "skipped"
}
Skipped items are excluded from score analytics. Use skip for items that cannot be meaningfully evaluated, not for items that score poorly.
Step 7: Monitor Queue Progress
Check overall queue status:
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"name": "Support Quality Review",
"description": "Weekly review of support bot responses",
"score_names": ["accuracy", "tone", "safety"],
"is_active": true,
"item_counts": {
"total": 50,
"pending": 32,
"in_progress": 3,
"completed": 13,
"skipped": 2
},
"created_at": "2026-02-07T10:00:00Z"
}
Step 8: Analyze Results
After reviewers complete a batch, analyze the scores using the score analytics endpoint:
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/scores/analytics/?name=accuracy&period=7d" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"period": "7d",
"since": "2026-01-31T10:00:00Z",
"distributions": [
{
"name": "accuracy",
"data_type": "numeric",
"total_count": 45,
"avg": 0.8722,
"min": 0.3,
"max": 1.0,
"time_series": [
{"date": "2026-02-05", "avg": 0.85, "count": 15},
{"date": "2026-02-06", "avg": 0.88, "count": 18},
{"date": "2026-02-07", "avg": 0.89, "count": 12}
]
}
]
}
For categorical scores like "tone":
{
"distributions": [
{
"name": "tone",
"data_type": "categorical",
"total_count": 45,
"value_counts": [
{"value": "professional", "count": 38},
{"value": "casual", "count": 5},
{"value": "inappropriate", "count": 2}
]
}
]
}
Step 9: Best Practices
Calibration Sessions
Before starting production reviews, run a calibration session:
- Select 10-15 representative items spanning good, mediocre, and poor quality
- Have all reviewers independently score the same items
- Discuss disagreements and align on scoring criteria
- Document scoring guidelines with examples
Inter-Annotator Agreement
Track consistency between reviewers by comparing scores on the same items:
- For numeric scores, calculate the standard deviation across reviewers
- For categorical scores, calculate percentage agreement
- If agreement is below 80%, revisit your scoring guidelines
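The two measures above can be computed directly from per-item score lists. A minimal sketch, assuming you have gathered each reviewer's value for the same item:

```python
# The two agreement measures described above: spread for numeric scores,
# pairwise percentage agreement for categorical scores.
from statistics import pstdev

def numeric_spread(values):
    """Population standard deviation of one item's scores across reviewers."""
    return pstdev(values)

def categorical_agreement(labels):
    """Fraction of reviewer pairs that assigned the same label to an item."""
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    if not pairs:
        return 1.0  # a single reviewer trivially agrees with themselves
    return sum(a == b for a, b in pairs) / len(pairs)
```

Averaging these per-item figures over a calibration batch gives you a single number to compare against your 80% threshold; for more rigorous reporting, chance-corrected statistics such as Cohen's kappa are the standard upgrade.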
Sampling Strategy
You likely cannot review every agent run. Use a sampling strategy:
- Random sampling -- Review a fixed percentage (e.g., 5%) of all runs
- Error sampling -- Prioritize runs flagged by automated evaluators as low quality
- Edge case sampling -- Focus on runs with unusual inputs or high cost
- New agent sampling -- Review more heavily when launching new agents
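Strategies can be combined. A sketch that always queues runs the automated evaluator flagged and adds a random sample of the rest; the 5% rate and the field names are illustrative assumptions:

```python
# Combine error sampling with random sampling: every flagged run is selected,
# plus a random fraction of the remainder. Field names are assumptions.
import random

def select_for_review(runs, flagged_ids, sample_rate=0.05, seed=None):
    """Return the run IDs to add to the annotation queue."""
    rng = random.Random(seed)  # seed for reproducible sampling
    selected = []
    for run in runs:
        if run["id"] in flagged_ids or rng.random() < sample_rate:
            selected.append(run["id"])
    return selected
```

The resulting ID list maps directly onto the `run_ids` payload from Step 3; passing a fixed seed makes the sample reproducible for audits.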
Review Guidelines Template
Create a document for your review team that anchors each score band to concrete descriptions, for example:
| Score | 0.0 - 0.3 | 0.4 - 0.6 | 0.7 - 0.8 | 0.9 - 1.0 |
|---|---|---|---|---|
| Accuracy | Factually wrong | Partially correct | Mostly correct with minor issues | Completely accurate |
| Tone | Inappropriate or offensive | Awkward or confusing | Acceptable | Professional and helpful |
Next Steps
- Scoring -- SDK scoring reference
- Instrument OpenAI Directly -- Four instrumentation approaches