Scoring

Scores let you attach quality measurements to agent runs. Use them to capture user feedback and track quality over time.

Score Data Types

Each score has a `data_type` that determines how its value is stored and analyzed:

| Data Type | Value | Storage | Example |
|---|---|---|---|
| `numeric` | float (typically 0-1) | `numeric_value` | `0.85` (relevance score) |
| `categorical` | string | `string_value` | `"good"`, `"bad"`, `"neutral"` |
| `boolean` | bool | both fields | `true` (thumbs up) |

For boolean scores, the value is stored as both numeric_value (1.0 for true, 0.0 for false) and string_value ("true" or "false"), allowing both numeric aggregation and categorical filtering.
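The dual-storage rule can be sketched as a small normalization helper (the function is illustrative, not part of the SDK):

```python
# Illustrative helper (not part of the SDK) showing how a boolean score
# maps onto both storage fields described above.
def normalize_boolean_score(value: bool) -> dict:
    """Return both stored representations of a boolean score."""
    return {
        "numeric_value": 1.0 if value else 0.0,
        "string_value": "true" if value else "false",
    }

normalize_boolean_score(True)
# → {"numeric_value": 1.0, "string_value": "true"}
```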

Score Sources

Scores are tagged with a `source` indicating how they were created:

| Source | Description |
|---|---|
| `sdk` | Recorded programmatically via the SDK during or after execution |
| `manual` | Created through the UI (annotation workflows, manual review) |
| `evaluator` | Generated by an automated evaluator (LLM-as-judge) |

Recording Scores via SDK

Within a Context

Use `record_score()` on the `WaxellContext` to attach scores to the current run. Scores are buffered and flushed when the context exits.

```python
from waxell_observe import WaxellContext

async with WaxellContext(agent_name="support-agent") as ctx:
    response = await handle_query(query)
    ctx.set_result({"output": response})

    # Numeric score (0-1 range)
    ctx.record_score(
        name="relevance",
        value=0.92,
        data_type="numeric",
        comment="High relevance to user query",
    )

    # Categorical score
    ctx.record_score(
        name="tone",
        value="professional",
        data_type="categorical",
    )

    # Boolean score (user feedback)
    ctx.record_score(
        name="thumbs_up",
        value=True,
        data_type="boolean",
        comment="User clicked thumbs up",
    )
```

The `record_score` method signature:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Score name (e.g., `"relevance"`, `"thumbs_up"`) |
| `value` | `float \| str \| bool` | required | Score value |
| `data_type` | `str` | `"numeric"` | One of `"numeric"`, `"categorical"`, `"boolean"` |
| `comment` | `str` | `""` | Optional free-text annotation |
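If you would rather not pass `data_type` explicitly, it can be inferred from the Python type of the value. The helper below is a sketch, not part of the SDK; note that `bool` must be checked before `int`/`float` because `bool` is a subclass of `int`:

```python
# Hypothetical convenience helper (not part of the SDK) that infers
# data_type from the Python type of a score value.
def infer_data_type(value) -> str:
    if isinstance(value, bool):  # check bool first: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "categorical"
    raise TypeError(f"Unsupported score value type: {type(value).__name__}")
```

It could then be used as `ctx.record_score(name="tone", value=v, data_type=infer_data_type(v))`.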

After Execution (Client-Level)

If you need to add scores after the run context has closed, use the client directly:

```python
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient(
    api_url="https://acme.waxell.dev",
    api_key="wax_sk_...",
)

# Record scores on an existing run
await client.record_scores(
    run_id="42",
    scores=[
        {
            "name": "user_feedback",
            "data_type": "numeric",
            "numeric_value": 1.0,
            "comment": "User rated 5 stars",
        },
        {
            "name": "category",
            "data_type": "categorical",
            "string_value": "helpful",
        },
    ],
)
```

Or synchronously:

```python
client.record_scores_sync(
    run_id="42",
    scores=[
        {
            "name": "user_feedback",
            "data_type": "numeric",
            "numeric_value": 1.0,
        },
    ],
)
```
> **info**
>
> Scores recorded via the SDK are sent to `POST /api/v1/observe/runs/{run_id}/scores/` using API key authentication (`X-Wax-Key` header). This is the same ingest path used for LLM calls and steps.
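For illustration, the same ingest request could be constructed without the SDK. The payload shape below is inferred from the client-level `record_scores` example above and should be treated as an assumption; the run ID and key are placeholders:

```python
import json
import urllib.request

# Build (but do not send) the ingest request by hand. The payload shape
# is an assumption based on the SDK examples; run ID and key are placeholders.
payload = {
    "scores": [
        {"name": "relevance", "data_type": "numeric", "numeric_value": 0.92},
    ]
}
req = urllib.request.Request(
    "https://acme.waxell.dev/api/v1/observe/runs/42/scores/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"X-Wax-Key": "wax_sk_...", "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it
```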

REST API (UI Endpoints)

These endpoints are used by the Waxell dashboard and require session authentication.

List Scores

GET /api/v1/evaluations/scores/

Query Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | — | Filter by score name |
| `source` | string | — | Filter by source (`sdk`, `manual`, `evaluator`) |
| `data_type` | string | — | Filter by data type |
| `run_id` | int | — | Filter by run ID |
| `llm_call_id` | int | — | Filter by LLM call ID |
| `sort` | string | `-created_at` | Sort field. Options: `created_at`, `-created_at`, `name`, `-name`, `numeric_value`, `-numeric_value`, `source`, `-source` |
| `limit` | int | `25` | Page size (max 100) |
| `offset` | int | `0` | Pagination offset |
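As a quick sketch, a filtered query string can be assembled from these parameters with the standard library (the filter values are illustrative):

```python
# Assemble the list-scores query string from the parameters above.
from urllib.parse import urlencode

params = {"name": "relevance", "source": "sdk", "sort": "-created_at", "limit": 25, "offset": 0}
url = "/api/v1/evaluations/scores/?" + urlencode(params)
# → '/api/v1/evaluations/scores/?name=relevance&source=sdk&sort=-created_at&limit=25&offset=0'
```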

Response:

```json
{
  "results": [
    {
      "id": "a1b2c3d4-...",
      "name": "relevance",
      "data_type": "numeric",
      "source": "sdk",
      "numeric_value": 0.92,
      "string_value": null,
      "comment": "High relevance to user query",
      "metadata": {},
      "author_user_id": "",
      "evaluator_id": null,
      "evaluator_name": null,
      "run_id": "42",
      "run_agent_name": "support-agent",
      "llm_call_id": null,
      "llm_call_model": null,
      "created_at": "2026-02-07T10:15:00Z"
    }
  ],
  "count": 156,
  "next": "?offset=25&limit=25",
  "previous": null,
  "aggregates": {
    "total_count": 156,
    "avg_numeric_value": 0.7834,
    "score_names": ["relevance", "thumbs_up", "tone"]
  }
}
```
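The `count`, `next`, and `previous` fields support offset pagination. A minimal client-side loop might look like this, where `fetch_page` stands in for whatever HTTP call you use and is not part of the SDK:

```python
# Sketch of paging through all scores via offset pagination.
# fetch_page(limit=..., offset=...) is a placeholder returning the
# response shape shown above ("results", "count", "next").
def iter_scores(fetch_page, limit=25):
    """Yield every score, advancing the offset until the pages run out."""
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page["results"]
        offset += limit
        if page["next"] is None or offset >= page["count"]:
            break
```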

Create a Manual Score

POST /api/v1/evaluations/scores/

Request Body:

| Field | Type | Required | Description |
|---|---|---|---|
| `run_id` | int | One of `run_id` or `llm_call_id` | Run to score |
| `llm_call_id` | int | One of `run_id` or `llm_call_id` | LLM call to score |
| `name` | string | Yes | Score name |
| `data_type` | string | No (default `"numeric"`) | `"numeric"`, `"categorical"`, or `"boolean"` |
| `value` | any | Yes | Score value |
| `comment` | string | No | Free-text annotation |
| `metadata` | object | No | Arbitrary JSON metadata |

Example:

```bash
curl -X POST "https://acme.waxell.dev/api/v1/evaluations/scores/" \
  -H "Cookie: sessionid=..." \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": 42,
    "name": "accuracy",
    "data_type": "numeric",
    "value": 0.95,
    "comment": "Verified against ground truth"
  }'
```

Score Analytics

GET /api/v1/evaluations/scores/analytics/

Query Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | string | — | Filter by score name |
| `period` | string | `7d` | Time period: `1d`, `7d`, `30d` |
| `agent` | string | — | Filter by agent name |

Returns score distributions per score name, including:

  • Numeric scores: average, min, max, and daily time series
  • Categorical/boolean scores: value counts
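For intuition, these aggregations can be reproduced client-side from raw score values; the sample values below are made up:

```python
# Reproducing the analytics aggregations on made-up sample values:
# avg/min/max for numeric scores, value counts for categorical ones.
from collections import Counter

numeric = [0.9, 0.7, 0.85, 0.6]
categorical = ["good", "good", "bad"]

numeric_summary = {
    "avg": sum(numeric) / len(numeric),
    "min": min(numeric),
    "max": max(numeric),
}
value_counts = Counter(categorical)  # Counter({'good': 2, 'bad': 1})
```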

Viewing Scores in the UI

Run Detail View

On any run's detail page, scores are displayed in a dedicated section showing:

  • Score name and value
  • Data type badge (numeric, categorical, boolean)
  • Source indicator (SDK, manual, or evaluator name)
  • Timestamp and optional comment

Score Analytics Dashboard

The analytics view (/api/v1/evaluations/scores/analytics/) powers distribution charts:

  • Numeric scores show a time series of daily averages with min/max bands
  • Categorical scores show value frequency bar charts
  • Filter by score name, agent, or time period to drill down

Capturing User Feedback

A common pattern is to capture end-user feedback (thumbs up/down, ratings) and record it as a score:

```python
# In your API endpoint that handles user feedback
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient()

async def handle_feedback(run_id: str, rating: int):
    """Called when user rates an agent response."""
    await client.record_scores(
        run_id=run_id,
        scores=[
            {
                "name": "user_rating",
                "data_type": "numeric",
                "numeric_value": rating / 5.0,  # Normalize to 0-1
                "comment": f"User gave {rating}/5 stars",
            }
        ],
    )
```

Next Steps