Human Review Workflow
Automated evaluators are valuable, but expert human judgment remains essential for nuanced quality assessment. This tutorial walks you through setting up annotation queues where reviewers can score agent outputs using configurable criteria.
Prerequisites
- Waxell Observe with recorded agent execution runs
- Dashboard access with an authenticated session
- Familiarity with the Scoring feature
What You'll Learn
- How to create and configure annotation queues
- How to define scoring criteria (numeric, categorical, boolean)
- How to add items to a queue and manage the review workflow
- How to submit and skip reviews
- Best practices for calibration and consistency
Step 1: Why Human Review?
Automated evaluators (LLM-as-judge) scale well but can miss nuance. Human review complements automation for:
- Subjective quality -- Is the tone appropriate? Is the response helpful?
- Domain expertise -- Does the medical/legal/financial advice meet professional standards?
- Edge cases -- Automated evaluators may not catch subtle errors
- Evaluator calibration -- Human scores serve as ground truth to validate your automated evaluators
A typical workflow combines both: automated evaluators score every run, while human reviewers sample a subset for quality control.
Step 2: Create an Annotation Queue
An annotation queue defines what you want reviewed and how reviewers should score items.
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Quality Review",
"description": "Weekly review of support bot responses for quality and accuracy",
"score_names": ["accuracy", "tone", "safety"],
"score_configs": [
{
"name": "accuracy",
"data_type": "numeric",
"description": "How accurate is the response? (0.0 = completely wrong, 1.0 = perfectly accurate)",
"min": 0.0,
"max": 1.0
},
{
"name": "tone",
"data_type": "categorical",
"description": "What is the tone of the response?",
"options": ["professional", "casual", "inappropriate"]
},
{
"name": "safety",
"data_type": "boolean",
"description": "Is the response safe and appropriate?"
}
]
}'
Response:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"name": "Support Quality Review",
"description": "Weekly review of support bot responses for quality and accuracy",
"score_names": ["accuracy", "tone", "safety"],
"score_configs": [
{
"name": "accuracy",
"data_type": "numeric",
"description": "How accurate is the response? (0.0 = completely wrong, 1.0 = perfectly accurate)",
"min": 0.0,
"max": 1.0
},
{
"name": "tone",
"data_type": "categorical",
"description": "What is the tone of the response?",
"options": ["professional", "casual", "inappropriate"]
},
{
"name": "safety",
"data_type": "boolean",
"description": "Is the response safe and appropriate?"
}
],
"is_active": true,
"created_at": "2026-02-07T10:00:00Z"
}
Keep score configs simple. Three to five scoring dimensions per queue is a good target. Too many dimensions slow reviewers down and reduce consistency.
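Before submitting scores (Step 5), it can help to validate them client-side against the queue's score_configs. A minimal sketch, assuming the payload shapes shown above; `validate_scores` is a hypothetical helper, not part of the API:

```python
# Hypothetical client-side check that a score payload matches a queue's
# score_configs before submission. Field names mirror the API payloads above.

def validate_scores(scores, score_configs):
    """Return a list of problems; an empty list means the payload looks valid."""
    configs = {c["name"]: c for c in score_configs}
    problems = []
    for score in scores:
        config = configs.get(score["name"])
        if config is None:
            problems.append(f"unknown score name: {score['name']}")
            continue
        value, dtype = score["value"], config["data_type"]
        if dtype == "numeric":
            # Exclude bools: isinstance(True, int) is True in Python.
            if isinstance(value, bool) or not isinstance(value, (int, float)) \
                    or not (config["min"] <= value <= config["max"]):
                problems.append(
                    f"{score['name']}: expected number in "
                    f"[{config['min']}, {config['max']}]")
        elif dtype == "categorical":
            if value not in config["options"]:
                problems.append(f"{score['name']}: {value!r} not in {config['options']}")
        elif dtype == "boolean":
            if not isinstance(value, bool):
                problems.append(f"{score['name']}: expected true/false")
    return problems
```

Catching an out-of-range numeric or a misspelled category before the POST saves a round trip and keeps the queue's data clean.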
Step 3: Add Items to the Queue
Add specific agent execution runs to the queue for review. You can add runs individually or in bulk.
Add runs by ID:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"run_ids": [
"run-id-1",
"run-id-2",
"run-id-3"
]
}'
Response:
{
"added": 3,
"skipped": 0,
"queue_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
Add individual LLM calls when you want call-level (rather than run-level) review:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"llm_call_ids": [456, 789, 1012]
}'
Duplicate items are automatically skipped. If a run or LLM call is already in the queue, it will not be added again.
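When queuing a large backlog, you may want to split the run IDs into fixed-size batches rather than one giant request. A small sketch; the batch size of 100 is an assumption, not a documented API limit:

```python
# Split a long list of run IDs into request payloads matching the
# curl example above. The batch size is an assumed client-side choice.

def build_batches(run_ids, batch_size=100):
    """Return a list of {"run_ids": [...]} payloads, each at most batch_size long."""
    return [
        {"run_ids": run_ids[start:start + batch_size]}
        for start in range(0, len(run_ids), batch_size)
    ]
```

Each payload can then be POSTed to the `/items/` endpoint in turn; since duplicates are skipped server-side, re-running the loop after a failure is safe.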
Step 4: Review Items in the Queue
List items in the queue filtered by status:
# Get pending items
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/?status=pending" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"results": [
{
"id": "item-uuid-1",
"status": "pending",
"assigned_to": null,
"priority": 0,
"created_at": "2026-02-07T10:05:00Z",
"run": {
"id": "run-id-1",
"agent_name": "support-bot",
"workflow_name": "handle-query",
"started_at": "2026-02-07T09:00:00Z",
"duration": 2.5
},
"llm_call": null
}
],
"count": 3
}
Get the next item to review (auto-assigns to you):
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/next/" \
-H "Authorization: Bearer <your-session-token>"
This returns the highest-priority, oldest pending item and marks it as in_progress:
{
"item": {
"id": "item-uuid-1",
"status": "in_progress",
"assigned_to": "user-id",
"priority": 0,
"run": {
"id": "run-id-1",
"agent_name": "support-bot",
"workflow_name": "handle-query",
"started_at": "2026-02-07T09:00:00Z",
"completed_at": "2026-02-07T09:00:02Z",
"duration": 2.5,
"inputs": {"message": "How do I reset my password?"},
"result": {"output": "To reset your password, go to Settings > Security..."},
"steps": []
},
"llm_call": null,
"score_names": ["accuracy", "tone", "safety"],
"score_configs": [
{
"name": "accuracy",
"data_type": "numeric",
"description": "How accurate is the response?",
"min": 0.0,
"max": 1.0
}
]
}
}
The /next/ endpoint is designed for a review workflow: fetch, review, submit, fetch next. The score_configs in the response tell the reviewer exactly what to evaluate.
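The loop can be sketched as follows. `fetch_next`, `review`, and `submit_scores` are hypothetical stand-ins for the `/next/` call, the human review step, and the `/submit/` call; only the loop structure is the point:

```python
# Sketch of the fetch-review-submit loop described above. The three callables
# are hypothetical stand-ins, not part of any SDK.

def drain_queue(fetch_next, review, submit_scores):
    """Pull items one at a time until the queue has no pending work."""
    completed = 0
    while True:
        item = fetch_next()                 # GET .../next/ assigns an item, or None
        if item is None:
            break                           # queue exhausted
        scores = review(item)               # human produces scores per score_configs
        submit_scores(item["id"], scores)   # POST .../items/<id>/submit/
        completed += 1
    return completed
```

Because `/next/` marks each fetched item as in_progress and assigns it to the caller, multiple reviewers can run this loop concurrently without stepping on each other's items.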
Step 5: Submit Scores
After reviewing the item, submit your scores:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/item-uuid-1/submit/" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json" \
-d '{
"scores": [
{
"name": "accuracy",
"value": 0.95,
"data_type": "numeric",
"comment": "Correct instructions for password reset"
},
{
"name": "tone",
"value": "professional",
"data_type": "categorical",
"comment": "Clear and helpful tone"
},
{
"name": "safety",
"value": true,
"data_type": "boolean"
}
]
}'
Response:
{
"item_id": "item-uuid-1",
"status": "completed",
"scores_created": 3,
"scores": [
{
"id": "score-uuid-1",
"name": "accuracy",
"data_type": "numeric",
"numeric_value": 0.95,
"string_value": null
},
{
"id": "score-uuid-2",
"name": "tone",
"data_type": "categorical",
"numeric_value": null,
"string_value": "professional"
},
{
"id": "score-uuid-3",
"name": "safety",
"data_type": "boolean",
"numeric_value": 1.0,
"string_value": "true"
}
]
}
The item is automatically marked as completed after scores are submitted.
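Note how the response stores each value: numeric scores fill numeric_value, categorical scores fill string_value, and booleans fill both (1.0/"true"). A sketch of that mapping, inferred from the response above rather than from documented server behavior:

```python
# How a submitted value appears to map onto the stored numeric_value /
# string_value pair, inferred from the submit response above (an assumption).

def stored_fields(value, data_type):
    if data_type == "numeric":
        return {"numeric_value": float(value), "string_value": None}
    if data_type == "categorical":
        return {"numeric_value": None, "string_value": value}
    if data_type == "boolean":
        return {"numeric_value": 1.0 if value else 0.0,
                "string_value": "true" if value else "false"}
    raise ValueError(f"unknown data_type: {data_type}")
```

The numeric encoding of booleans (1.0/0.0) is what lets boolean scores participate in averages in the analytics endpoint later.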
Step 6: Skip Items
If an item is unclear, corrupted, or not relevant, skip it:
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/items/item-uuid-2/skip/" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"item_id": "item-uuid-2",
"status": "skipped"
}
Skipped items are excluded from score analytics. Use skip for items that cannot be meaningfully evaluated, not for items that score poorly.
Step 7: Monitor Queue Progress
Check overall queue status:
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/annotation-queues/a1b2c3d4-e5f6-7890-abcd-ef1234567890/" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"name": "Support Quality Review",
"description": "Weekly review of support bot responses",
"score_names": ["accuracy", "tone", "safety"],
"is_active": true,
"item_counts": {
"total": 50,
"pending": 32,
"in_progress": 3,
"completed": 13,
"skipped": 2
},
"created_at": "2026-02-07T10:00:00Z"
}
Step 8: Analyze Results
After reviewers complete a batch, analyze the scores using the score analytics endpoint:
curl -X GET "https://acme.waxell.dev/api/v1/evaluations/scores/analytics/?name=accuracy&period=7d" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"period": "7d",
"since": "2026-01-31T10:00:00Z",
"distributions": [
{
"name": "accuracy",
"data_type": "numeric",
"total_count": 45,
"avg": 0.8722,
"min": 0.3,
"max": 1.0,
"time_series": [
{"date": "2026-02-05", "avg": 0.85, "count": 15},
{"date": "2026-02-06", "avg": 0.88, "count": 18},
{"date": "2026-02-07", "avg": 0.89, "count": 12}
]
}
]
}
For categorical scores like "tone":
{
"distributions": [
{
"name": "tone",
"data_type": "categorical",
"total_count": 45,
"value_counts": [
{"value": "professional", "count": 38},
{"value": "casual", "count": 5},
{"value": "inappropriate", "count": 2}
]
}
]
}
Step 9: Best Practices
Calibration Sessions
Before starting production reviews, run a calibration session:
- Select 10-15 representative items spanning good, mediocre, and poor quality
- Have all reviewers independently score the same items
- Discuss disagreements and align on scoring criteria
- Document scoring guidelines with examples
Inter-Annotator Agreement
Track consistency between reviewers by comparing scores on the same items:
- For numeric scores, calculate the standard deviation across reviewers
- For categorical scores, calculate percentage agreement
- If agreement is below 80%, revisit your scoring guidelines
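The two measures above can be computed directly from per-item score lists. A minimal sketch, assuming you have gathered each reviewer's value for the same item:

```python
# The two agreement measures described above: spread for numeric scores,
# pairwise percentage agreement for categorical scores.
from statistics import pstdev

def numeric_spread(values):
    """Population standard deviation of one item's scores across reviewers."""
    return pstdev(values)

def categorical_agreement(labels):
    """Fraction of reviewer pairs that assigned the same label to an item."""
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    if not pairs:
        return 1.0  # a single reviewer trivially agrees with themselves
    return sum(a == b for a, b in pairs) / len(pairs)
```

Averaging these per-item figures over a calibration batch gives you a single number to compare against your 80% threshold; for more rigorous reporting, chance-corrected statistics such as Cohen's kappa are the standard upgrade.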
Sampling Strategy
You likely cannot review every agent run. Use a sampling strategy:
- Random sampling -- Review a fixed percentage (e.g., 5%) of all runs
- Error sampling -- Prioritize runs flagged by automated evaluators as low quality
- Edge case sampling -- Focus on runs with unusual inputs or high cost
- New agent sampling -- Review more heavily when launching new agents
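Strategies can be combined. A sketch that always queues runs the automated evaluator flagged and adds a random sample of the rest; the 5% rate and the field names are illustrative assumptions:

```python
# Combine error sampling with random sampling: every flagged run is selected,
# plus a random fraction of the remainder. Field names are assumptions.
import random

def select_for_review(runs, flagged_ids, sample_rate=0.05, seed=None):
    """Return the run IDs to add to the annotation queue."""
    rng = random.Random(seed)  # seed for reproducible sampling
    selected = []
    for run in runs:
        if run["id"] in flagged_ids or rng.random() < sample_rate:
            selected.append(run["id"])
    return selected
```

The resulting ID list maps directly onto the `run_ids` payload from Step 3; passing a fixed seed makes the sample reproducible for audits.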
Review Guidelines Template
Create a document for your review team that anchors each score band to concrete descriptions, for example:
| Score | 0.0 - 0.3 | 0.4 - 0.6 | 0.7 - 0.8 | 0.9 - 1.0 |
|---|---|---|---|---|
| Accuracy | Factually wrong | Partially correct | Mostly correct with minor issues | Completely accurate |
| Tone | Inappropriate or offensive | Awkward or confusing | Acceptable | Professional and helpful |
Next Steps
- Scoring -- SDK scoring reference
- Instrument OpenAI Directly -- Four instrumentation approaches