
Human Review Workflow

Automated evaluators are valuable, but expert human judgment remains essential for nuanced quality assessment. This tutorial walks you through setting up annotation queues where reviewers can score agent outputs using configurable criteria.

Prerequisites

  • Waxell Observe with recorded agent execution runs
  • Dashboard access with an authenticated session
  • Familiarity with the Scoring feature

What You'll Learn

  • How to create and configure annotation queues
  • How to define scoring criteria (numeric, categorical, boolean)
  • How to add items to a queue and manage the review workflow
  • How to submit and skip reviews
  • Best practices for calibration and consistency

Step 1: Why Human Review?

Automated evaluators (LLM-as-judge) scale well but can miss nuance. Human review complements automation for:

  • Subjective quality -- Is the tone appropriate? Is the response helpful?
  • Domain expertise -- Does the medical/legal/financial advice meet professional standards?
  • Edge cases -- Automated evaluators may not catch subtle errors
  • Evaluator calibration -- Human scores serve as ground truth to validate your automated evaluators

A typical workflow combines both: automated evaluators score every run, while human reviewers sample a subset for quality control.

Step 2: Create an Annotation Queue

An annotation queue defines what you want reviewed and how reviewers should score items.

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/" \
  -H "Authorization: Bearer <your-session-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Quality Review",
    "description": "Weekly review of support bot responses for quality and accuracy",
    "score_names": ["accuracy", "tone", "safety"],
    "score_configs": [
      {
        "name": "accuracy",
        "data_type": "numeric",
        "description": "How accurate is the response? (0.0 = completely wrong, 1.0 = perfectly accurate)",
        "min": 0.0,
        "max": 1.0
      },
      {
        "name": "tone",
        "data_type": "categorical",
        "description": "What is the tone of the response?",
        "options": ["professional", "casual", "inappropriate"]
      },
      {
        "name": "safety",
        "data_type": "boolean",
        "description": "Is the response safe and appropriate?"
      }
    ]
  }'

Response:

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "name": "Support Quality Review",
  "description": "Weekly review of support bot responses for quality and accuracy",
  "score_names": ["accuracy", "tone", "safety"],
  "score_configs": [
    {
      "name": "accuracy",
      "data_type": "numeric",
      "description": "How accurate is the response? (0.0 = completely wrong, 1.0 = perfectly accurate)",
      "min": 0.0,
      "max": 1.0
    },
    {
      "name": "tone",
      "data_type": "categorical",
      "description": "What is the tone of the response?",
      "options": ["professional", "casual", "inappropriate"]
    },
    {
      "name": "safety",
      "data_type": "boolean",
      "description": "Is the response safe and appropriate?"
    }
  ],
  "is_active": true,
  "created_at": "2026-02-07T10:00:00Z"
}
tip

Keep score configs simple. Three to five scoring dimensions per queue is a good target. Too many dimensions slow reviewers down and reduce consistency.

Step 3: Add Items to the Queue

Add specific agent execution runs to the queue for review. You can add runs individually or in bulk.

Add runs by ID:

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/" \
  -H "Authorization: Bearer <your-session-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "run_ids": [
      "run-id-1",
      "run-id-2",
      "run-id-3"
    ]
  }'

Response:

{
  "added": 3,
  "skipped": 0,
  "queue_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}

Add LLM calls for review (for call-level quality):

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/" \
  -H "Authorization: Bearer <your-session-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "llm_call_ids": [456, 789, 1012]
  }'
info

Duplicate items are automatically skipped. If a run or LLM call is already in the queue, it will not be added again.

Step 4: Review Items in the Queue

List items in the queue filtered by status:

# Get pending items
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/?status=pending" \
  -H "Authorization: Bearer <your-session-token>"

Response:

{
  "results": [
    {
      "id": "item-uuid-1",
      "status": "pending",
      "assigned_to": null,
      "priority": 0,
      "created_at": "2026-02-07T10:05:00Z",
      "run": {
        "id": "run-id-1",
        "agent_name": "support-bot",
        "workflow_name": "handle-query",
        "started_at": "2026-02-07T09:00:00Z",
        "duration": 2.5
      },
      "llm_call": null
    }
  ],
  "count": 3
}

Get the next item to review (auto-assigns to you):

curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/next/" \
  -H "Authorization: Bearer <your-session-token>"

This returns the highest-priority, oldest pending item and marks it as in_progress:

{
  "item": {
    "id": "item-uuid-1",
    "status": "in_progress",
    "assigned_to": "user-id",
    "priority": 0,
    "run": {
      "id": "run-id-1",
      "agent_name": "support-bot",
      "workflow_name": "handle-query",
      "started_at": "2026-02-07T09:00:00Z",
      "completed_at": "2026-02-07T09:00:02Z",
      "duration": 2.5,
      "inputs": {"message": "How do I reset my password?"},
      "result": {"output": "To reset your password, go to Settings > Security..."},
      "steps": []
    },
    "llm_call": null,
    "score_names": ["accuracy", "tone", "safety"],
    "score_configs": [
      {
        "name": "accuracy",
        "data_type": "numeric",
        "description": "How accurate is the response?",
        "min": 0.0,
        "max": 1.0
      }
    ]
  }
}
tip

The /next/ endpoint is designed for a review workflow: fetch, review, submit, fetch next. The score_configs in the response tell the reviewer exactly what to evaluate.
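In a custom review tool, this fetch-review-submit cycle can be driven programmatically. A minimal sketch of the loop's control flow, with `fetch_next`, `submit`, and `score_fn` as hypothetical stand-ins for the HTTP calls and reviewer input (not a real Waxell client library):

```python
def review_loop(fetch_next, submit, score_fn, max_items=100):
    """Fetch-review-submit loop: pull the next pending item, collect
    scores for it, submit them, and repeat until the queue is drained."""
    reviewed = 0
    while reviewed < max_items:
        item = fetch_next()          # e.g. GET .../next/ -> item dict, or None when queue is empty
        if item is None:
            break
        scores = score_fn(item)      # reviewer (or review UI) produces the scores list
        submit(item["id"], scores)   # e.g. POST .../items/{id}/submit/
        reviewed += 1
    return reviewed
```

Because the loop only depends on the three injected callables, the same skeleton works whether the backend calls are made with `requests`, an SDK, or test stubs.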

Step 5: Submit Scores

After reviewing the item, submit your scores:

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/item-uuid-1/submit/" \
  -H "Authorization: Bearer <your-session-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "scores": [
      {
        "name": "accuracy",
        "value": 0.95,
        "data_type": "numeric",
        "comment": "Correct instructions for password reset"
      },
      {
        "name": "tone",
        "value": "professional",
        "data_type": "categorical",
        "comment": "Clear and helpful tone"
      },
      {
        "name": "safety",
        "value": true,
        "data_type": "boolean"
      }
    ]
  }'

Response:

{
  "item_id": "item-uuid-1",
  "status": "completed",
  "scores_created": 3,
  "scores": [
    {
      "id": "score-uuid-1",
      "name": "accuracy",
      "data_type": "numeric",
      "numeric_value": 0.95,
      "string_value": null
    },
    {
      "id": "score-uuid-2",
      "name": "tone",
      "data_type": "categorical",
      "numeric_value": null,
      "string_value": "professional"
    },
    {
      "id": "score-uuid-3",
      "name": "safety",
      "data_type": "boolean",
      "numeric_value": 1.0,
      "string_value": "true"
    }
  ]
}

The item is automatically marked as completed after scores are submitted.
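If you build your own submission client, it is worth validating scores against the queue's `score_configs` before POSTing, so bad payloads fail fast on the client side. A sketch based on the config shape shown in this tutorial (the validation rules here are an assumption, not the server's actual behavior):

```python
def validate_scores(scores, score_configs):
    """Check each score against its config's data_type and constraints.
    Returns a list of error strings; an empty list means the payload looks valid."""
    configs = {c["name"]: c for c in score_configs}
    errors = []
    for s in scores:
        cfg = configs.get(s["name"])
        if cfg is None:
            errors.append(f"unknown score name: {s['name']}")
            continue
        v = s["value"]
        if cfg["data_type"] == "numeric":
            # Exclude bools: in Python, True/False are also instances of int.
            if isinstance(v, bool) or not isinstance(v, (int, float)) \
                    or not cfg["min"] <= v <= cfg["max"]:
                errors.append(f"{s['name']}: expected a number in [{cfg['min']}, {cfg['max']}]")
        elif cfg["data_type"] == "categorical":
            if v not in cfg["options"]:
                errors.append(f"{s['name']}: expected one of {cfg['options']}")
        elif cfg["data_type"] == "boolean":
            if not isinstance(v, bool):
                errors.append(f"{s['name']}: expected true/false")
    return errors
```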

Step 6: Skip Items

If an item is unclear, corrupted, or not relevant, skip it:

curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/item-uuid-2/skip/" \
  -H "Authorization: Bearer <your-session-token>"

Response:

{
  "item_id": "item-uuid-2",
  "status": "skipped"
}
info

Skipped items are excluded from score analytics. Use skip for items that cannot be meaningfully evaluated, not for items that score poorly.

Step 7: Monitor Queue Progress

Check overall queue status:

curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/" \
  -H "Authorization: Bearer <your-session-token>"

Response:

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "name": "Support Quality Review",
  "description": "Weekly review of support bot responses",
  "score_names": ["accuracy", "tone", "safety"],
  "is_active": true,
  "item_counts": {
    "total": 50,
    "pending": 32,
    "in_progress": 3,
    "completed": 13,
    "skipped": 2
  },
  "created_at": "2026-02-07T10:00:00Z"
}
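For a dashboard or weekly report, the `item_counts` object maps directly to a progress figure. A small helper, assuming the field names match the response above (completed and skipped items both count as resolved, since neither returns to the pending pool):

```python
def queue_progress(item_counts):
    """Fraction of queue items resolved (completed or skipped)."""
    total = item_counts["total"]
    if total == 0:
        return 0.0
    return (item_counts["completed"] + item_counts["skipped"]) / total
```

For the example response above this gives (13 + 2) / 50 = 0.3, i.e. the queue is 30% done.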

Step 8: Analyze Results

After reviewers complete a batch, analyze the scores using the score analytics endpoint:

curl -X GET "https://acme.waxell.dev/api/v1/evaluations/scores/analytics/?name=accuracy&period=7d" \
  -H "Authorization: Bearer <your-session-token>"

Response:

{
  "period": "7d",
  "since": "2026-01-31T10:00:00Z",
  "distributions": [
    {
      "name": "accuracy",
      "data_type": "numeric",
      "total_count": 45,
      "avg": 0.8722,
      "min": 0.3,
      "max": 1.0,
      "time_series": [
        {"date": "2026-02-05", "avg": 0.85, "count": 15},
        {"date": "2026-02-06", "avg": 0.88, "count": 18},
        {"date": "2026-02-07", "avg": 0.89, "count": 12}
      ]
    }
  ]
}

For categorical scores like "tone":

{
  "distributions": [
    {
      "name": "tone",
      "data_type": "categorical",
      "total_count": 45,
      "value_counts": [
        {"value": "professional", "count": 38},
        {"value": "casual", "count": 5},
        {"value": "inappropriate", "count": 2}
      ]
    }
  ]
}

Step 9: Best Practices

Calibration Sessions

Before starting production reviews, run a calibration session:

  1. Select 10-15 representative items spanning good, mediocre, and poor quality
  2. Have all reviewers independently score the same items
  3. Discuss disagreements and align on scoring criteria
  4. Document scoring guidelines with examples

Inter-Annotator Agreement

Track consistency between reviewers by comparing scores on the same items:

  • For numeric scores, calculate the standard deviation across reviewers
  • For categorical scores, calculate percentage agreement
  • If agreement is below 80%, revisit your scoring guidelines
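Both checks are a few lines of standard-library Python. A sketch that computes per-item agreement from lists of scores, one entry per reviewer (how you collect those lists from the API is up to you):

```python
from itertools import combinations
from statistics import pstdev

def numeric_spread(scores_by_reviewer):
    """Population standard deviation of one item's numeric scores across
    reviewers. 0.0 means perfect agreement; larger means more disagreement."""
    return pstdev(scores_by_reviewer)

def percent_agreement(labels_by_reviewer):
    """Share of reviewer pairs that chose the same categorical label
    for one item (pairwise percent agreement)."""
    pairs = list(combinations(labels_by_reviewer, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Pairwise percent agreement is the simplest consistency measure; if you need to correct for chance agreement, a statistic such as Cohen's or Fleiss' kappa is the usual next step.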

Sampling Strategy

You likely cannot review every agent run. Use a sampling strategy:

  • Random sampling -- Review a fixed percentage (e.g., 5%) of all runs
  • Error sampling -- Prioritize runs flagged by automated evaluators as low quality
  • Edge case sampling -- Focus on runs with unusual inputs or high cost
  • New agent sampling -- Review more heavily when launching new agents
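These strategies compose: take everything the automated evaluators flagged, then add a random slice of the remainder. A sketch of that combination (the `flagged` field is a hypothetical marker set by your own evaluator pipeline, not part of the Waxell API):

```python
import random

def sample_for_review(runs, rate=0.05, seed=None):
    """Select runs for human review: all runs flagged low-quality by an
    automated evaluator, plus a random `rate` fraction of the rest."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    flagged = [r for r in runs if r.get("flagged")]
    rest = [r for r in runs if not r.get("flagged")]
    k = min(len(rest), max(1, round(len(rest) * rate))) if rest else 0
    return flagged + rng.sample(rest, k)
```

The selected run IDs can then be posted to the queue with the bulk `run_ids` endpoint from Step 3.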

Review Guidelines Template

Create a document for your review team that includes:

| Score    | 0.0 - 0.3                  | 0.4 - 0.6            | 0.7 - 0.8                        | 0.9 - 1.0                |
|----------|----------------------------|----------------------|----------------------------------|--------------------------|
| Accuracy | Factually wrong            | Partially correct    | Mostly correct with minor issues | Completely accurate      |
| Tone     | Inappropriate or offensive | Awkward or confusing | Acceptable                       | Professional and helpful |

Next Steps