# Add User Feedback
Capture user feedback -- thumbs-up/down, star ratings, categorical labels -- and use it to find issues and improve your agents over time.
## Prerequisites

- Python 3.10+
- `waxell-observe` installed and configured with an API key
- A running application that creates observed runs
## What You'll Learn
- Record numeric, boolean, and categorical feedback scores
- Capture feedback both inline (during a run) and after the fact (by run ID)
- Build a feedback API endpoint for your application
- Analyze feedback distributions in the dashboard
## Step 1: Understand the Feedback Pattern
The feedback loop follows three stages:

1. User interacts -- Your agent processes a request and produces a response
2. User provides feedback -- A thumbs-up, a rating, a label
3. You analyze -- Filter by low scores to find failing patterns
Waxell Observe supports three score data types:
| Type | Example | Values |
|---|---|---|
| numeric | Star rating, relevance score | Any float (commonly 0.0--1.0 or 1--5) |
| boolean | Thumbs up/down | True or False |
| categorical | Quality label | Any string (e.g. "helpful", "incorrect", "off-topic") |
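These three types map onto the score dicts used later with record_scores (Steps 6 and 7), which carry numeric_value and/or string_value fields. As an illustrative sketch (this helper is not part of waxell-observe), the mapping looks like:

```python
def to_score(name: str, value, data_type: str) -> dict:
    """Build a score dict in the shape used by record_scores (see Step 6).

    Illustrative helper only -- not a waxell-observe API.
    """
    score = {"name": name, "data_type": data_type}
    if data_type == "numeric":
        score["numeric_value"] = float(value)
    elif data_type == "boolean":
        # Booleans are stored numerically (1.0 / 0.0) with a string mirror
        score["numeric_value"] = 1.0 if value else 0.0
        score["string_value"] = str(value).lower()
    elif data_type == "categorical":
        score["string_value"] = str(value)
    else:
        raise ValueError(f"unknown data_type: {data_type}")
    return score
```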
## Step 2: Record Feedback Inline

The simplest approach records feedback inside the same context that produced the response, using ctx.record_score() -- Steps 3-5 show the call for each data type. Start by returning the run_id from your handler so feedback can also be submitted later:

```python
from waxell_observe import WaxellContext

async def chat(query: str, user_id: str) -> dict:
    async with WaxellContext(
        agent_name="support-bot",
        user_id=user_id,
    ) as ctx:
        response = await generate_response(query)
        ctx.set_result({"response": response})
        # Return the run_id so the frontend can submit feedback later
        return {
            "response": response,
            "run_id": ctx.run_id,
        }
```
When the user later clicks thumbs-up, record the score in a separate context or via the client directly (see Step 6).
## Step 3: Record Numeric Feedback

Numeric scores work well for star ratings or confidence values:

```python
# 5-star rating (normalized to 0-1)
ctx.record_score(
    name="user_rating",
    value=4 / 5,  # 0.8
    data_type="numeric",
    comment="User gave 4 out of 5 stars",
)

# Relevance score from 0 to 1
ctx.record_score(
    name="relevance",
    value=0.92,
    data_type="numeric",
)
```
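If your UI collects whole-star ratings, a small helper keeps the normalization in one place. This is an illustrative sketch (not a waxell-observe API); the stars / max_stars division matches the 4 / 5 example above:

```python
def normalize_rating(stars: int, max_stars: int = 5) -> float:
    """Map a whole-star rating onto the 0-1 range used for numeric scores."""
    if not 1 <= stars <= max_stars:
        raise ValueError(f"expected a rating in 1..{max_stars}, got {stars}")
    return stars / max_stars
```

The result can be passed directly as the value of a numeric score.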
## Step 4: Record Boolean Feedback

Boolean scores are ideal for thumbs-up/down:

```python
# Thumbs up
ctx.record_score(
    name="thumbs_up",
    value=True,
    data_type="boolean",
)

# Was the answer correct?
ctx.record_score(
    name="correct",
    value=False,
    data_type="boolean",
    comment="User reported the answer was wrong",
)
```
Boolean scores are stored as numeric_value=1.0 (True) or numeric_value=0.0 (False) internally, so you can aggregate them as averages to get approval rates.
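Because of that encoding, an approval rate is just the mean of the stored values. A quick sketch with hypothetical numeric_value data:

```python
# Hypothetical numeric_values pulled from stored thumbs_up scores
votes = [1.0, 1.0, 0.0, 1.0, 0.0]

# Mean of 1.0/0.0 values = fraction of thumbs-up
approval_rate = sum(votes) / len(votes)  # 3 of 5 thumbs-up -> 0.6
```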
## Step 5: Record Categorical Feedback

Categorical scores capture labeled feedback:

```python
# Quality category
ctx.record_score(
    name="response_quality",
    value="helpful",
    data_type="categorical",
)

# Issue type (when the user reports a problem)
ctx.record_score(
    name="issue_type",
    value="off-topic",
    data_type="categorical",
    comment="Response did not address the question",
)
```
## Step 6: Record Feedback After the Fact

Often, feedback arrives after the run has completed. Use the client's record_scores method with the run_id you saved earlier:

```python
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient()

async def submit_feedback(run_id: str, thumbs_up: bool, comment: str = ""):
    await client.record_scores(
        run_id=run_id,
        scores=[
            {
                "name": "thumbs_up",
                "data_type": "boolean",
                "numeric_value": 1.0 if thumbs_up else 0.0,
                "string_value": str(thumbs_up).lower(),
                "comment": comment,
            }
        ],
    )
```
Always return the run_id to your frontend so users can submit feedback against the correct run.
## Step 7: Build a Feedback API Endpoint

Here is a complete FastAPI endpoint that accepts feedback from your frontend:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from waxell_observe import WaxellObserveClient

app = FastAPI()
client = WaxellObserveClient()

class FeedbackRequest(BaseModel):
    run_id: str
    thumbs_up: bool | None = None
    rating: float | None = None
    category: str | None = None
    comment: str = ""

@app.post("/api/feedback")
async def submit_feedback(feedback: FeedbackRequest):
    scores = []
    if feedback.thumbs_up is not None:
        scores.append({
            "name": "thumbs_up",
            "data_type": "boolean",
            "numeric_value": 1.0 if feedback.thumbs_up else 0.0,
            "string_value": str(feedback.thumbs_up).lower(),
            "comment": feedback.comment,
        })
    if feedback.rating is not None:
        scores.append({
            "name": "user_rating",
            "data_type": "numeric",
            "numeric_value": feedback.rating,
            "comment": feedback.comment,
        })
    if feedback.category is not None:
        scores.append({
            "name": "response_quality",
            "data_type": "categorical",
            "string_value": feedback.category,
            "comment": feedback.comment,
        })
    if not scores:
        raise HTTPException(status_code=400, detail="No feedback provided")
    await client.record_scores(run_id=feedback.run_id, scores=scores)
    return {"status": "recorded"}
```
Your frontend can call this endpoint:

```javascript
// After the user clicks thumbs-up
await fetch("/api/feedback", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    run_id: "run_abc123",
    thumbs_up: true,
    comment: "Great answer!",
  }),
});
```
## Step 8: Analyze Feedback in the Dashboard
Open your Waxell dashboard and navigate to Observability > Evaluations:
- Score distributions -- See the breakdown of thumbs-up vs thumbs-down, average ratings, and category distributions across all runs.
- Filter by low scores -- Click into runs with low ratings to inspect the inputs, outputs, and LLM calls that produced poor results.
- Track over time -- Monitor how feedback scores trend as you improve prompts and agent logic.
- Compare agents -- If you have multiple agents, compare their feedback distributions side by side.
A quick way to find your worst-performing runs: filter by thumbs_up = false and sort by most recent. These are the runs your users flagged as unhelpful.
## Next Steps
- Build an Evaluation Pipeline -- Automate scoring with LLM-as-judge evaluators
- Track a RAG Pipeline -- Instrument a full RAG pipeline with sessions
- Session Analytics -- Analyze feedback patterns across sessions
- Cost Management -- Correlate feedback with cost to find your best value agents