# Add User Feedback
Capture user feedback -- thumbs-up/down, star ratings, categorical labels -- and use it to find issues and improve your agents over time.
## Prerequisites

- Python 3.10+
- `waxell-observe` installed and configured with an API key
- A running application that creates observed runs
## What You'll Learn
- Record numeric, boolean, and categorical feedback scores
- Capture feedback both inline (during a run) and after the fact (by run ID)
- Build a feedback API endpoint for your application
- Analyze feedback distributions in the dashboard
## Step 1: Understand the Feedback Pattern
The feedback loop follows three stages:

1. User interacts -- Your agent processes a request and produces a response
2. User provides feedback -- A thumbs-up, a rating, a label
3. You analyze -- Filter by low scores to find failing patterns
Waxell Observe supports three score data types:
| Type | Example | Values |
|---|---|---|
| numeric | Star rating, relevance score | Any float (commonly 0.0--1.0 or 1--5) |
| boolean | Thumbs up/down | True or False |
| categorical | Quality label | Any string (e.g. "helpful", "incorrect", "off-topic") |
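These three types map onto the score dicts used later with record_scores (Steps 6 and 7), which carry numeric_value and/or string_value fields. As an illustrative sketch (this helper is not part of waxell-observe), the mapping looks like:

```python
def to_score(name: str, value, data_type: str) -> dict:
    """Build a score dict in the shape used by record_scores (see Step 6).

    Illustrative helper only -- not a waxell-observe API.
    """
    score = {"name": name, "data_type": data_type}
    if data_type == "numeric":
        score["numeric_value"] = float(value)
    elif data_type == "boolean":
        # Booleans are stored numerically (1.0 / 0.0) with a string mirror
        score["numeric_value"] = 1.0 if value else 0.0
        score["string_value"] = str(value).lower()
    elif data_type == "categorical":
        score["string_value"] = str(value)
    else:
        raise ValueError(f"unknown data_type: {data_type}")
    return score
```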
## Step 2: Record Feedback Inline

The simplest approach records feedback inside the same context that produced the response, using ctx.record_score() -- Steps 3-5 show the call for each data type. Start by returning the run_id from your handler so feedback can also be submitted later:

```python
from waxell_observe import WaxellContext

async def chat(query: str, user_id: str) -> dict:
    async with WaxellContext(
        agent_name="support-bot",
        user_id=user_id,
    ) as ctx:
        response = await generate_response(query)
        ctx.set_result({"response": response})
        # Return the run_id so the frontend can submit feedback later
        return {
            "response": response,
            "run_id": ctx.run_id,
        }
```
When the user later clicks thumbs-up, record the score in a separate context or via the client directly (see Step 6).
## Step 3: Record Numeric Feedback

Numeric scores work well for star ratings or confidence values:

```python
# 5-star rating (normalized to 0-1)
ctx.record_score(
    name="user_rating",
    value=4 / 5,  # 0.8
    data_type="numeric",
    comment="User gave 4 out of 5 stars",
)

# Relevance score from 0 to 1
ctx.record_score(
    name="relevance",
    value=0.92,
    data_type="numeric",
)
```
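If your UI collects whole-star ratings, a small helper keeps the normalization in one place. This is an illustrative sketch (not a waxell-observe API); the stars / max_stars division matches the 4 / 5 example above:

```python
def normalize_rating(stars: int, max_stars: int = 5) -> float:
    """Map a whole-star rating onto the 0-1 range used for numeric scores."""
    if not 1 <= stars <= max_stars:
        raise ValueError(f"expected a rating in 1..{max_stars}, got {stars}")
    return stars / max_stars
```

The result can be passed directly as the value of a numeric score.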
## Step 4: Record Boolean Feedback

Boolean scores are ideal for thumbs-up/down:

```python
# Thumbs up
ctx.record_score(
    name="thumbs_up",
    value=True,
    data_type="boolean",
)

# Was the answer correct?
ctx.record_score(
    name="correct",
    value=False,
    data_type="boolean",
    comment="User reported the answer was wrong",
)
```
Boolean scores are stored as numeric_value=1.0 (True) or numeric_value=0.0 (False) internally, so you can aggregate them as averages to get approval rates.
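Because of that encoding, an approval rate is just the mean of the stored values. A quick sketch with hypothetical numeric_value data:

```python
# Hypothetical numeric_values pulled from stored thumbs_up scores
votes = [1.0, 1.0, 0.0, 1.0, 0.0]

# Mean of 1.0/0.0 values = fraction of thumbs-up
approval_rate = sum(votes) / len(votes)  # 3 of 5 thumbs-up -> 0.6
```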
## Step 5: Record Categorical Feedback

Categorical scores capture labeled feedback:

```python
# Quality category
ctx.record_score(
    name="response_quality",
    value="helpful",
    data_type="categorical",
)

# Issue type (when the user reports a problem)
ctx.record_score(
    name="issue_type",
    value="off-topic",
    data_type="categorical",
    comment="Response did not address the question",
)
```
## Step 6: Record Feedback After the Fact

Often, feedback arrives after the run has completed. Use the client's record_scores method with the run_id you saved earlier:

```python
from waxell_observe import WaxellObserveClient

client = WaxellObserveClient()

async def submit_feedback(run_id: str, thumbs_up: bool, comment: str = ""):
    await client.record_scores(
        run_id=run_id,
        scores=[
            {
                "name": "thumbs_up",
                "data_type": "boolean",
                "numeric_value": 1.0 if thumbs_up else 0.0,
                "string_value": str(thumbs_up).lower(),
                "comment": comment,
            }
        ],
    )
```
Always return the run_id to your frontend so users can submit feedback against the correct run.
## Step 7: Build a Feedback API Endpoint

Here is a complete FastAPI endpoint that accepts feedback from your frontend:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from waxell_observe import WaxellObserveClient

app = FastAPI()
client = WaxellObserveClient()

class FeedbackRequest(BaseModel):
    run_id: str
    thumbs_up: bool | None = None
    rating: float | None = None
    category: str | None = None
    comment: str = ""

@app.post("/api/feedback")
async def submit_feedback(feedback: FeedbackRequest):
    scores = []
    if feedback.thumbs_up is not None:
        scores.append({
            "name": "thumbs_up",
            "data_type": "boolean",
            "numeric_value": 1.0 if feedback.thumbs_up else 0.0,
            "string_value": str(feedback.thumbs_up).lower(),
            "comment": feedback.comment,
        })
    if feedback.rating is not None:
        scores.append({
            "name": "user_rating",
            "data_type": "numeric",
            "numeric_value": feedback.rating,
            "comment": feedback.comment,
        })
    if feedback.category is not None:
        scores.append({
            "name": "response_quality",
            "data_type": "categorical",
            "string_value": feedback.category,
            "comment": feedback.comment,
        })
    if not scores:
        raise HTTPException(status_code=400, detail="No feedback provided")
    await client.record_scores(run_id=feedback.run_id, scores=scores)
    return {"status": "recorded"}
```
Your frontend can call this endpoint:

```javascript
// After the user clicks thumbs-up
await fetch("/api/feedback", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    run_id: "run_abc123",
    thumbs_up: true,
    comment: "Great answer!",
  }),
});
```
## Step 8: Analyze Feedback in the Dashboard
Open your Waxell dashboard and navigate to Observability > Evaluations:
- Score distributions -- See the breakdown of thumbs-up vs thumbs-down, average ratings, and category distributions across all runs.
- Filter by low scores -- Click into runs with low ratings to inspect the inputs, outputs, and LLM calls that produced poor results.
- Track over time -- Monitor how feedback scores trend as you improve prompts and agent logic.
- Compare agents -- If you have multiple agents, compare their feedback distributions side by side.
A quick way to find your worst-performing runs: filter by thumbs_up = false and sort by most recent. These are the runs your users flagged as unhelpful.
## Next Steps
- Build an Evaluation Pipeline -- Automate scoring with LLM-as-judge evaluators
- Track a RAG Pipeline -- Instrument a full RAG pipeline with sessions
- Session Analytics -- Analyze feedback patterns across sessions
- Cost Management -- Correlate feedback with cost to find your best value agents