Scoring

Scores let you attach quality measurements to agent runs. Use them to capture user feedback and track quality over time.

Score Data Types

Each score has a `data_type` that determines how its value is stored and analyzed:

| Data Type | Value | Storage | Example |
|---|---|---|---|
| `numeric` | float (typically 0-1) | `numeric_value` | `0.85` (relevance score) |
| `categorical` | string | `string_value` | `"good"`, `"bad"`, `"neutral"` |
| `boolean` | bool | both fields | `true` (thumbs up) |

For boolean scores, the value is stored as both numeric_value (1.0 for true, 0.0 for false) and string_value ("true" or "false"), allowing both numeric aggregation and categorical filtering.
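The dual-storage rule can be sketched as a small normalization helper (the function is illustrative, not part of the SDK):

```python
# Illustrative helper (not part of the SDK) showing how a boolean score
# maps onto both storage fields described above.
def normalize_boolean_score(value: bool) -> dict:
    """Return both stored representations of a boolean score."""
    return {
        "numeric_value": 1.0 if value else 0.0,
        "string_value": "true" if value else "false",
    }

normalize_boolean_score(True)
# → {"numeric_value": 1.0, "string_value": "true"}
```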

Score Sources

Scores are tagged with a `source` indicating how they were created:

| Source | Description |
|---|---|
| `sdk` | Recorded programmatically via the SDK during or after execution |
| `manual` | Created through the UI (annotation workflows, manual review) |
| `evaluator` | Generated by an automated evaluator (LLM-as-judge) |

Recording Scores via SDK

Within a Context

Use `record_score()` on the `WaxellContext` to attach scores to the current run. Scores are buffered and flushed when the context exits.

```python
from waxell_observe import WaxellContext

async with WaxellContext(agent_name="support-agent") as ctx:
    response = await handle_query(query)
    ctx.set_result({"output": response})

    # Numeric score (0-1 range)
    ctx.record_score(
        name="relevance",
        value=0.92,
        data_type="numeric",
        comment="High relevance to user query",
    )

    # Categorical score
    ctx.record_score(
        name="tone",
        value="professional",
        data_type="categorical",
    )

    # Boolean score (user feedback)
    ctx.record_score(
        name="thumbs_up",
        value=True,
        data_type="boolean",
        comment="User clicked thumbs up",
    )
```

The `record_score` method signature:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Score name (e.g., `"relevance"`, `"thumbs_up"`) |
| `value` | `float \| str \| bool` | required | Score value |
| `data_type` | `str` | `"numeric"` | One of `"numeric"`, `"categorical"`, `"boolean"` |
| `comment` | `str` | `""` | Optional free-text annotation |
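If you would rather not pass `data_type` explicitly, it can be inferred from the Python type of the value. The helper below is a sketch, not part of the SDK; note that `bool` must be checked before `int`/`float` because `bool` is a subclass of `int`:

```python
# Hypothetical convenience helper (not part of the SDK) that infers
# data_type from the Python type of a score value.
def infer_data_type(value) -> str:
    if isinstance(value, bool):  # check bool first: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"Unsupported score value type: {type(value).__name__}")
```

It could then be used as `ctx.record_score(name="tone", value=v, data_type=infer_data_type(v))`.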

After Execution (Client-Level)

If you need to add scores after the run context has closed, use the client directly:

```python
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient(
    api_url="https://acme.waxell.dev",
    api_key="wax_sk_...",
)

# Record scores on an existing run
await client.record_scores(
    run_id="42",
    scores=[
        {
            "name": "user_feedback",
            "data_type": "numeric",
            "numeric_value": 1.0,
            "comment": "User rated 5 stars",
        },
        {
            "name": "category",
            "data_type": "categorical",
            "string_value": "helpful",
        },
    ],
)
```

Or synchronously:

```python
client.record_scores_sync(
    run_id="42",
    scores=[
        {
            "name": "user_feedback",
            "data_type": "numeric",
            "numeric_value": 1.0,
        },
    ],
)
```
> **info**
>
> Scores recorded via the SDK are sent to `POST /api/v1/observe/runs/{run_id}/scores/` using API key authentication (`X-Wax-Key` header). This is the same ingest path used for LLM calls and steps.
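For illustration, the same ingest request could be constructed without the SDK. The payload shape below is inferred from the client-level `record_scores` example above and should be treated as an assumption; the run ID and key are placeholders:

```python
import json
import urllib.request

# Build (but do not send) the ingest request by hand. The payload shape
# is an assumption based on the SDK examples; run ID and key are placeholders.
payload = {
    "scores": [
        {"name": "relevance", "data_type": "numeric", "numeric_value": 0.92},
    ]
}
req = urllib.request.Request(
    "https://acme.waxell.dev/api/v1/observe/runs/42/scores/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"X-Wax-Key": "wax_sk_...", "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it
```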

REST API (UI Endpoints)

These endpoints are used by the Waxell dashboard and require session authentication.

List Scores

GET /api/v1/evaluations/scores/

Query Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | — | Filter by score name |
| `source` | string | — | Filter by source (`sdk`, `manual`, `evaluator`) |
| `data_type` | string | — | Filter by data type |
| `run_id` | int | — | Filter by run ID |
| `llm_call_id` | int | — | Filter by LLM call ID |
| `sort` | string | `-created_at` | Sort field. Options: `created_at`, `-created_at`, `name`, `-name`, `numeric_value`, `-numeric_value`, `source`, `-source` |
| `limit` | int | `25` | Page size (max 100) |
| `offset` | int | `0` | Pagination offset |
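As a quick sketch, a filtered query string can be assembled from these parameters with the standard library (the filter values are illustrative):

```python
# Assemble the list-scores query string from the parameters above.
from urllib.parse import urlencode

params = {"name": "relevance", "source": "sdk", "sort": "-created_at", "limit": 25, "offset": 0}
url = "/api/v1/evaluations/scores/?" + urlencode(params)
# → '/api/v1/evaluations/scores/?name=relevance&source=sdk&sort=-created_at&limit=25&offset=0'
```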

Response:

```json
{
  "results": [
    {
      "id": "a1b2c3d4-...",
      "name": "relevance",
      "data_type": "numeric",
      "source": "sdk",
      "numeric_value": 0.92,
      "string_value": null,
      "comment": "High relevance to user query",
      "metadata": {},
      "author_user_id": "",
      "evaluator_id": null,
      "evaluator_name": null,
      "run_id": "42",
      "run_agent_name": "support-agent",
      "llm_call_id": null,
      "llm_call_model": null,
      "created_at": "2026-02-07T10:15:00Z"
    }
  ],
  "count": 156,
  "next": "?offset=25&limit=25",
  "previous": null,
  "aggregates": {
    "total_count": 156,
    "avg_numeric_value": 0.7834,
    "score_names": ["relevance", "thumbs_up", "tone"]
  }
}
```
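The `count`, `next`, and `previous` fields support offset pagination. A minimal client-side loop might look like this, where `fetch_page` stands in for whatever HTTP call you use and is not part of the SDK:

```python
# Sketch of paging through all scores via offset pagination.
# fetch_page(limit=..., offset=...) is a placeholder returning the
# response shape shown above ("results", "count", "next").
def iter_scores(fetch_page, limit=25):
    """Yield every score, advancing the offset until the pages run out."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page["results"]
        offset += limit
        if page["next"] is None or offset >= page["count"]:
            break
```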

Create a Manual Score

POST /api/v1/evaluations/scores/

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| `run_id` | int | One of `run_id` or `llm_call_id` | Run to score |
| `llm_call_id` | int | One of `run_id` or `llm_call_id` | LLM call to score |
| `name` | string | Yes | Score name |
| `data_type` | string | No (default `"numeric"`) | `"numeric"`, `"categorical"`, or `"boolean"` |
| `value` | any | Yes | Score value |
| `comment` | string | No | Free-text annotation |
| `metadata` | object | No | Arbitrary JSON metadata |

Example:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/scores/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": 42,
    "name": "accuracy",
    "data_type": "numeric",
    "value": 0.95,
    "comment": "Verified against ground truth"
  }'
```

Score Analytics

GET /api/v1/evaluations/scores/analytics/

Query Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | — | Filter by score name |
| `period` | string | `7d` | Time period: `1d`, `7d`, `30d` |
| `agent` | string | — | Filter by agent name |

Returns score distributions per score name, including:

  • Numeric scores: average, min, max, and daily time series
  • Categorical/boolean scores: value counts
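For intuition, these aggregations can be reproduced client-side from raw score values; the sample values below are made up:

```python
# Reproducing the analytics aggregations on made-up sample values:
# avg/min/max for numeric scores, value counts for categorical ones.
from collections import Counter

numeric = [0.9, 0.7, 0.85, 0.6]
categorical = ["good", "good", "bad"]

numeric_summary = {
    "avg": sum(numeric) / len(numeric),
    "min": min(numeric),
    "max": max(numeric),
}
value_counts = Counter(categorical)  # Counter({'good': 2, 'bad': 1})
```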

Viewing Scores in the UI

Run Detail View

On any run's detail page, scores are displayed in a dedicated section showing:

  • Score name and value
  • Data type badge (numeric, categorical, boolean)
  • Source indicator (SDK, manual, or evaluator name)
  • Timestamp and optional comment

Score Analytics Dashboard

The analytics view (/api/v1/evaluations/scores/analytics/) powers distribution charts:

  • Numeric scores show a time series of daily averages with min/max bands
  • Categorical scores show value frequency bar charts
  • Filter by score name, agent, or time period to drill down

Capturing User Feedback

A common pattern is to capture end-user feedback (thumbs up/down, ratings) and record it as a score:

```python
# In your API endpoint that handles user feedback
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient()

async def handle_feedback(run_id: str, rating: int):
    """Called when user rates an agent response."""
    await client.record_scores(
        run_id=run_id,
        scores=[
            {
                "name": "user_rating",
                "data_type": "numeric",
                "numeric_value": rating / 5.0,  # Normalize to 0-1
                "comment": f"User gave {rating}/5 stars",
            }
        ],
    )
```

Next Steps