Scoring
Scores let you attach quality measurements to agent runs. Use them to capture user feedback and track quality over time.
Score Data Types
Each score has a data_type that determines how its value is stored and analyzed:
| Data Type | Value | Storage | Example |
|---|---|---|---|
| numeric | float (typically 0-1) | numeric_value | 0.85 (relevance score) |
| categorical | string | string_value | "good", "bad", "neutral" |
| boolean | bool | Both fields | true (thumbs up) |
For boolean scores, the value is stored as both numeric_value (1.0 for true, 0.0 for false) and string_value ("true" or "false"), allowing both numeric aggregation and categorical filtering.
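The dual representation can be sketched as a plain function (illustrative only; the SDK performs this expansion internally, and the dict shape mirrors the client-level score payloads shown later):

```python
def expand_boolean_score(name: str, value: bool, comment: str = "") -> dict:
    """Expand a boolean score into the dual representation described above."""
    return {
        "name": name,
        "data_type": "boolean",
        # Numeric form enables aggregation (e.g. a thumbs-up rate of 0.87)
        "numeric_value": 1.0 if value else 0.0,
        # String form enables categorical filtering ("true" / "false")
        "string_value": "true" if value else "false",
        "comment": comment,
    }

score = expand_boolean_score("thumbs_up", True, comment="User clicked thumbs up")
```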
Score Sources
Scores are tagged with a source indicating how they were created:
| Source | Description |
|---|---|
| sdk | Recorded programmatically via the SDK during or after execution |
| manual | Created through the UI (annotation workflows, manual review) |
| evaluator | Generated by an automated evaluator (LLM-as-judge) |
Recording Scores via SDK
Within a Context
Use record_score() on the WaxellContext to attach scores to the current run. Scores are buffered and flushed when the context exits.
```python
from waxell_observe import WaxellContext

async with WaxellContext(agent_name="support-agent") as ctx:
    response = await handle_query(query)
    ctx.set_result({"output": response})

    # Numeric score (0-1 range)
    ctx.record_score(
        name="relevance",
        value=0.92,
        data_type="numeric",
        comment="High relevance to user query",
    )

    # Categorical score
    ctx.record_score(
        name="tone",
        value="professional",
        data_type="categorical",
    )

    # Boolean score (user feedback)
    ctx.record_score(
        name="thumbs_up",
        value=True,
        data_type="boolean",
        comment="User clicked thumbs up",
    )
```
The record_score method signature:
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Score name (e.g., "relevance", "thumbs_up") |
| value | float \| str \| bool | required | Score value |
| data_type | str | "numeric" | One of "numeric", "categorical", "boolean" |
| comment | str | "" | Optional free-text annotation |
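The buffer-and-flush behavior can be illustrated with a minimal stand-in class (this is a sketch, not the real WaxellContext, which also ships the buffered scores to the backend on context exit):

```python
# Hypothetical stand-in showing how record_score() buffers scores
# until flush. Parameter names and defaults match the table above.
class ScoreBuffer:
    def __init__(self):
        self._scores = []

    def record_score(self, name, value, data_type="numeric", comment=""):
        assert data_type in ("numeric", "categorical", "boolean")
        self._scores.append(
            {"name": name, "value": value, "data_type": data_type, "comment": comment}
        )

    def flush(self):
        """In the real SDK this happens automatically when the context exits."""
        pending, self._scores = self._scores, []
        return pending

buf = ScoreBuffer()
buf.record_score("relevance", 0.92)  # data_type defaults to "numeric"
buf.record_score("tone", "professional", data_type="categorical")
flushed = buf.flush()
```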
After Execution (Client-Level)
If you need to add scores after the run context has closed, use the client directly:
```python
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient(
    api_url="https://acme.waxell.dev",
    api_key="wax_sk_...",
)

# Record scores on an existing run
await client.record_scores(
    run_id="42",
    scores=[
        {
            "name": "user_feedback",
            "data_type": "numeric",
            "numeric_value": 1.0,
            "comment": "User rated 5 stars",
        },
        {
            "name": "category",
            "data_type": "categorical",
            "string_value": "helpful",
        },
    ],
)
```
Or synchronously:
```python
client.record_scores_sync(
    run_id="42",
    scores=[
        {
            "name": "user_feedback",
            "data_type": "numeric",
            "numeric_value": 1.0,
        },
    ],
)
```
Scores recorded via the SDK are sent to POST /api/v1/observe/runs/{run_id}/scores/ using API key authentication (X-Wax-Key header). This is the same ingest path used for LLM calls and steps.
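For reference, the ingest request can be assembled by hand with the standard library. The endpoint path and X-Wax-Key header come from the text above; the host, key value, and the exact JSON body shape (a top-level "scores" list) are assumptions for illustration, and the request is built but not sent:

```python
import json
import urllib.request

run_id = "42"
req = urllib.request.Request(
    url=f"https://acme.waxell.dev/api/v1/observe/runs/{run_id}/scores/",
    method="POST",
    headers={"X-Wax-Key": "wax_sk_...", "Content-Type": "application/json"},
    # Assumed payload shape; the SDK constructs this for you.
    data=json.dumps(
        {"scores": [{"name": "relevance", "data_type": "numeric", "numeric_value": 0.92}]}
    ).encode(),
)
# urllib.request.urlopen(req) would send it; omitted here.
```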
REST API (UI Endpoints)
These endpoints are used by the Waxell dashboard and require session authentication.
List Scores
GET /api/v1/evaluations/scores/
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | string | | Filter by score name |
| source | string | | Filter by source (sdk, manual, evaluator) |
| data_type | string | | Filter by data type |
| run_id | int | | Filter by run ID |
| llm_call_id | int | | Filter by LLM call ID |
| sort | string | -created_at | Sort field. Options: created_at, -created_at, name, -name, numeric_value, -numeric_value, source, -source |
| limit | int | 25 | Page size (max 100) |
| offset | int | 0 | Pagination offset |
Response:
```json
{
  "results": [
    {
      "id": "a1b2c3d4-...",
      "name": "relevance",
      "data_type": "numeric",
      "source": "sdk",
      "numeric_value": 0.92,
      "string_value": null,
      "comment": "High relevance to user query",
      "metadata": {},
      "author_user_id": "",
      "evaluator_id": null,
      "evaluator_name": null,
      "run_id": "42",
      "run_agent_name": "support-agent",
      "llm_call_id": null,
      "llm_call_model": null,
      "created_at": "2026-02-07T10:15:00Z"
    }
  ],
  "count": 156,
  "next": "?offset=25&limit=25",
  "previous": null,
  "aggregates": {
    "total_count": 156,
    "avg_numeric_value": 0.7834,
    "score_names": ["relevance", "thumbs_up", "tone"]
  }
}
```
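Composing a filtered, sorted, paginated request is a matter of URL-encoding the parameters from the table above (the base URL here is a placeholder; parameter names are taken from the table):

```python
from urllib.parse import urlencode

# Top 25 SDK-recorded "relevance" scores, highest value first
params = {
    "name": "relevance",
    "source": "sdk",
    "sort": "-numeric_value",
    "limit": 25,
    "offset": 0,
}
url = "https://acme.waxell.dev/api/v1/evaluations/scores/?" + urlencode(params)
```

Increment offset by limit to walk through pages, or follow the "next" value from the response.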
Create a Manual Score
POST /api/v1/evaluations/scores/
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| run_id | int | One of run_id or llm_call_id | Run to score |
| llm_call_id | int | One of run_id or llm_call_id | LLM call to score |
| name | string | Yes | Score name |
| data_type | string | No (default "numeric") | "numeric", "categorical", or "boolean" |
| value | any | Yes | Score value |
| comment | string | No | Free-text annotation |
| metadata | object | No | Arbitrary JSON metadata |
Example:
```bash
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/scores/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": 42,
    "name": "accuracy",
    "data_type": "numeric",
    "value": 0.95,
    "comment": "Verified against ground truth"
  }'
```
Score Analytics
GET /api/v1/evaluations/scores/analytics/
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | string | | Filter by score name |
| period | string | 7d | Time period: 1d, 7d, 30d |
| agent | string | | Filter by agent name |
Returns score distributions per score name, including:
- Numeric scores: average, min, max, and daily time series
- Categorical/boolean scores: value counts
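The aggregation the endpoint performs can be sketched client-side over a handful of score rows (illustrative data; the server also buckets numeric scores into a daily time series, which is omitted here):

```python
from collections import Counter
from statistics import mean

# Numeric scores -> avg/min/max; categorical/boolean scores -> value counts
scores = [
    {"name": "relevance", "data_type": "numeric", "numeric_value": 0.92},
    {"name": "relevance", "data_type": "numeric", "numeric_value": 0.68},
    {"name": "tone", "data_type": "categorical", "string_value": "professional"},
    {"name": "tone", "data_type": "categorical", "string_value": "casual"},
    {"name": "tone", "data_type": "categorical", "string_value": "professional"},
]

numeric = [s["numeric_value"] for s in scores if s["name"] == "relevance"]
relevance_stats = {"avg": mean(numeric), "min": min(numeric), "max": max(numeric)}
tone_counts = Counter(s["string_value"] for s in scores if s["name"] == "tone")
```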
Viewing Scores in the UI
Run Detail View
On any run's detail page, scores are displayed in a dedicated section showing:
- Score name and value
- Data type badge (numeric, categorical, boolean)
- Source indicator (SDK, manual, or evaluator name)
- Timestamp and optional comment
Score Analytics Dashboard
The analytics view (/api/v1/evaluations/scores/analytics/) powers distribution charts:
- Numeric scores show a time series of daily averages with min/max bands
- Categorical scores show value frequency bar charts
- Filter by score name, agent, or time period to drill down
Capturing User Feedback
A common pattern is to capture end-user feedback (thumbs up/down, ratings) and record it as a score:
```python
# In your API endpoint that handles user feedback
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient()

async def handle_feedback(run_id: str, rating: int):
    """Called when user rates an agent response."""
    await client.record_scores(
        run_id=run_id,
        scores=[
            {
                "name": "user_rating",
                "data_type": "numeric",
                "numeric_value": rating / 5.0,  # Normalize to 0-1
                "comment": f"User gave {rating}/5 stars",
            }
        ],
    )
```
Next Steps
- Sessions -- Analyze scores across multi-turn conversations
- Cost Management -- Track and control LLM spending