
OpenAI Moderation Agent

A multi-agent content moderation pipeline that coordinates two agents: a moderation-scanner, which pre-screens user input, generates an LLM response, post-screens the output, and demonstrates flagged-content detection via @waxell.tool(tool_type="moderation"); and a moderation-evaluator, which reasons about moderation effectiveness with @waxell.reasoning_dec and scores overall safety. The example demonstrates the full pre-and-post screening pattern using the OpenAI Moderation API (text-moderation-007).

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
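For live mode, the three variables can be exported before launching the demo. The values below are placeholders, not working credentials, and the Waxell URL shown is an assumption about the endpoint shape:

```shell
# Live-mode configuration (placeholder values)
export OPENAI_API_KEY="sk-..."                 # OpenAI key for moderation + generation
export WAXELL_API_KEY="wx-..."                 # Waxell instrumentation key
export WAXELL_API_URL="https://example.invalid" # Waxell ingest endpoint (placeholder)
```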

Architecture

Key Code

Moderation check tool with flag simulation

The moderation tool supports normal and flagged modes to demonstrate both pass and flag scenarios in a single pipeline.

@waxell.tool(tool_type="moderation")
def run_moderation_check(content: str, check_type: str, simulate_flagged: bool = False) -> dict:
    if simulate_flagged:
        return {
            "passed": False, "action": "flag", "flagged": True,
            "flagged_categories": ["violence", "hate"],
            "max_score": 0.87, "model": "text-moderation-007",
        }
    return {
        "passed": True, "action": "pass", "flagged": False,
        "flagged_categories": [],
        "max_score": 0.003 if check_type == "input" else 0.002,
    }
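A minimal sketch of how a caller might act on the tool's result dict. The helper name `apply_moderation_result` is hypothetical (not part of the demo); the two example dicts mirror the tool's pass and flag branches above:

```python
def apply_moderation_result(check: dict, stage: str) -> str:
    """Hypothetical helper: turn a moderation result dict into a pipeline action string."""
    if check["flagged"]:
        # Block and report which categories tripped the filter.
        cats = ", ".join(check["flagged_categories"])
        return f"{stage} blocked ({cats}, max_score={check['max_score']})"
    return f"{stage} passed (max_score={check['max_score']})"

# Example results mirroring the tool's two return branches.
passed = {"passed": True, "action": "pass", "flagged": False,
          "flagged_categories": [], "max_score": 0.003}
flagged = {"passed": False, "action": "flag", "flagged": True,
           "flagged_categories": ["violence", "hate"], "max_score": 0.87}

print(apply_moderation_result(passed, "input"))   # input passed (max_score=0.003)
print(apply_moderation_result(flagged, "demo"))   # demo blocked (violence, hate, max_score=0.87)
```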

Evaluator reasoning about moderation effectiveness

The evaluator analyzes the results of all three moderation checks (input, output, flagged demo) to assess pipeline correctness.

@waxell.reasoning_dec(step="moderation_evaluation")
async def evaluate_moderation_results(input_check, output_check, flagged_check) -> dict:
    all_passed = input_check["passed"] and output_check["passed"]
    return {
        "thought": f"Input screening {'passed' if input_check['passed'] else 'failed'}. "
                   f"Output screening {'passed' if output_check['passed'] else 'failed'}. "
                   f"Demo flagged content detected: {flagged_check['flagged_categories']}.",
        "evidence": [
            f"Input: {'PASS' if input_check['passed'] else 'FAIL'}",
            f"Output: {'PASS' if output_check['passed'] else 'FAIL'}",
        ],
        "conclusion": "Moderation pipeline correctly screens content at all stages"
                      if all_passed else "Moderation pipeline flagged content during screening",
    }
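A quick sketch of driving an evaluator of this shape. The decorator is omitted, the function is a compact undecorated stand-in (the name `evaluate` is hypothetical), and the check dicts are fabricated examples:

```python
import asyncio

async def evaluate(input_check, output_check, flagged_check) -> dict:
    # Compact stand-in for the decorated evaluator: same inputs, trimmed output.
    all_passed = input_check["passed"] and output_check["passed"]
    return {
        "evidence": [
            f"Input: {'PASS' if input_check['passed'] else 'FAIL'}",
            f"Output: {'PASS' if output_check['passed'] else 'FAIL'}",
        ],
        "all_passed": all_passed,
        "flagged_categories": flagged_check["flagged_categories"],
    }

result = asyncio.run(evaluate(
    {"passed": True}, {"passed": True},
    {"flagged_categories": ["violence", "hate"]},
))
print(result["evidence"])  # ['Input: PASS', 'Output: PASS']
```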

What this demonstrates

  • @waxell.tool(tool_type="moderation") -- 3 moderation checks (input screening, output screening, flagged content demo) with per-category flagging and score attribution.
  • @waxell.step_dec -- pipeline preparation recorded as an execution step.
  • @waxell.decision -- moderation strategy selection (pre_screen_only/post_screen_only/pre_and_post_screen).
  • @waxell.reasoning_dec -- evaluation of moderation effectiveness across all check types.
  • waxell.score() -- moderation_confidence and overall_safety scores.
  • Auto-instrumented LLM calls -- OpenAI response generation captured between pre and post screens.
  • Nested @waxell.observe -- orchestrator is parent; moderation-scanner and moderation-evaluator are child agents.
  • Pre-and-post screening pattern -- demonstrates the full content safety lifecycle.
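The pre-and-post screening lifecycle in the last bullet can be sketched roughly as follows. All names here are hypothetical: `screen` uses a toy blocklist in place of the Moderation API, and `generate_response` stands in for the auto-instrumented OpenAI call between the two screens:

```python
def screen(text: str, check_type: str) -> bool:
    # Hypothetical stand-in for run_moderation_check: True means the text passed.
    banned = {"attack"}  # toy blocklist in place of the Moderation API
    return not any(word in text.lower() for word in banned)

def generate_response(prompt: str) -> str:
    # Stands in for the auto-instrumented LLM call between the two screens.
    return f"Echo: {prompt}"

def moderated_pipeline(user_input: str) -> str:
    if not screen(user_input, "input"):        # pre-screen the user input
        return "[blocked at input screening]"
    response = generate_response(user_input)
    if not screen(response, "output"):         # post-screen the model output
        return "[blocked at output screening]"
    return response

print(moderated_pipeline("Hello!"))        # Echo: Hello!
print(moderated_pipeline("attack plan"))   # [blocked at input screening]
```

The point of screening twice is that a benign prompt can still elicit an unsafe completion, so the output check is not redundant with the input check.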

Run it

# Dry-run (no API key needed)
python -m app.demos.openai_moderation_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.openai_moderation_agent

Source

dev/waxell-dev/app/demos/openai_moderation_agent.py