
OpenAI Moderation Agent

A multi-agent content moderation pipeline that coordinates two agents: a moderation-scanner, which pre-screens user input, generates an LLM response, post-screens the output, and demonstrates flagged-content detection via @waxell.tool(tool_type="moderation"); and a moderation-evaluator, which reasons about moderation effectiveness with @waxell.reasoning_dec and scores overall safety. The example demonstrates the full pre-and-post screening pattern using the OpenAI Moderation API (text-moderation-007).

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
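For live mode, the three variables can be exported before launching the demo. The values below are placeholders, not working credentials, and the Waxell URL shown is an assumption about the endpoint shape:

```shell
# Live-mode configuration (placeholder values)
export OPENAI_API_KEY="sk-..."                 # OpenAI key for moderation + generation
export WAXELL_API_KEY="wx-..."                 # Waxell instrumentation key
export WAXELL_API_URL="https://example.invalid" # Waxell ingest endpoint (placeholder)
```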

Architecture

Key Code

Moderation check tool with flag simulation

The moderation tool supports normal and flagged modes to demonstrate both pass and flag scenarios in a single pipeline.

@waxell.tool(tool_type="moderation")
def run_moderation_check(content: str, check_type: str, simulate_flagged: bool = False) -> dict:
    if simulate_flagged:
        return {
            "passed": False, "action": "flag", "flagged": True,
            "flagged_categories": ["violence", "hate"],
            "max_score": 0.87, "model": "text-moderation-007",
        }
    return {
        "passed": True, "action": "pass", "flagged": False,
        "flagged_categories": [],
        "max_score": 0.003 if check_type == "input" else 0.002,
    }
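A minimal sketch of how a caller might act on the tool's result dict. The helper name `apply_moderation_result` is hypothetical (not part of the demo); the two example dicts mirror the tool's pass and flag branches above:

```python
def apply_moderation_result(check: dict, stage: str) -> str:
    """Hypothetical helper: turn a moderation result dict into a pipeline action string."""
    if check["flagged"]:
        # Block and report which categories tripped the filter.
        cats = ", ".join(check["flagged_categories"])
        return f"{stage} blocked ({cats}, max_score={check['max_score']})"
    return f"{stage} passed (max_score={check['max_score']})"

# Example results mirroring the tool's two return branches.
passed = {"passed": True, "action": "pass", "flagged": False,
          "flagged_categories": [], "max_score": 0.003}
flagged = {"passed": False, "action": "flag", "flagged": True,
           "flagged_categories": ["violence", "hate"], "max_score": 0.87}

print(apply_moderation_result(passed, "input"))   # input passed (max_score=0.003)
print(apply_moderation_result(flagged, "demo"))   # demo blocked (violence, hate, max_score=0.87)
```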

Evaluator reasoning about moderation effectiveness

The evaluator analyzes the results of all three moderation checks (input, output, flagged demo) to assess pipeline correctness.

@waxell.reasoning_dec(step="moderation_evaluation")
async def evaluate_moderation_results(input_check, output_check, flagged_check) -> dict:
    all_passed = input_check["passed"] and output_check["passed"]
    return {
        "thought": f"Input screening {'passed' if input_check['passed'] else 'failed'}. "
                   f"Output screening {'passed' if output_check['passed'] else 'failed'}. "
                   f"Demo flagged content detected: {flagged_check['flagged_categories']}.",
        "evidence": [
            f"Input: {'PASS' if input_check['passed'] else 'FAIL'}",
            f"Output: {'PASS' if output_check['passed'] else 'FAIL'}",
        ],
        "conclusion": "Moderation pipeline correctly screens content at all stages"
                      if all_passed else "Moderation pipeline flagged content during screening",
    }
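A quick sketch of driving an evaluator of this shape. The decorator is omitted, the function is a compact undecorated stand-in (the name `evaluate` is hypothetical), and the check dicts are fabricated examples:

```python
import asyncio

async def evaluate(input_check, output_check, flagged_check) -> dict:
    # Compact stand-in for the decorated evaluator: same inputs, trimmed output.
    all_passed = input_check["passed"] and output_check["passed"]
    return {
        "evidence": [
            f"Input: {'PASS' if input_check['passed'] else 'FAIL'}",
            f"Output: {'PASS' if output_check['passed'] else 'FAIL'}",
        ],
        "all_passed": all_passed,
        "flagged_categories": flagged_check["flagged_categories"],
    }

result = asyncio.run(evaluate(
    {"passed": True}, {"passed": True},
    {"flagged_categories": ["violence", "hate"]},
))
print(result["evidence"])  # ['Input: PASS', 'Output: PASS']
```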

What this demonstrates

  • @waxell.tool(tool_type="moderation") -- 3 moderation checks (input screening, output screening, flagged content demo) with per-category flagging and score attribution.
  • @waxell.step_dec -- pipeline preparation recorded as an execution step.
  • @waxell.decision -- moderation strategy selection (pre_screen_only/post_screen_only/pre_and_post_screen).
  • @waxell.reasoning_dec -- evaluation of moderation effectiveness across all check types.
  • waxell.score() -- moderation_confidence and overall_safety scores.
  • Auto-instrumented LLM calls -- OpenAI response generation captured between pre and post screens.
  • Nested @waxell.observe -- orchestrator is parent; moderation-scanner and moderation-evaluator are child agents.
  • Pre-and-post screening pattern -- demonstrates the full content safety lifecycle.
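The pre-and-post screening lifecycle in the last bullet can be sketched roughly as follows. All names here are hypothetical: `screen` uses a toy blocklist in place of the Moderation API, and `generate_response` stands in for the auto-instrumented OpenAI call between the two screens:

```python
def screen(text: str, check_type: str) -> bool:
    # Hypothetical stand-in for run_moderation_check: True means the text passed.
    banned = {"attack"}  # toy blocklist in place of the Moderation API
    return not any(word in text.lower() for word in banned)

def generate_response(prompt: str) -> str:
    # Stands in for the auto-instrumented LLM call between the two screens.
    return f"Echo: {prompt}"

def moderated_pipeline(user_input: str) -> str:
    if not screen(user_input, "input"):        # pre-screen the user input
        return "[blocked at input screening]"
    response = generate_response(user_input)
    if not screen(response, "output"):         # post-screen the model output
        return "[blocked at output screening]"
    return response

print(moderated_pipeline("Hello!"))        # Echo: Hello!
print(moderated_pipeline("attack plan"))   # [blocked at input screening]
```

The point of screening twice is that a benign prompt can still elicit an unsafe completion, so the output check is not redundant with the input check.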

Run it

# Dry-run (no API key needed)
python -m app.demos.openai_moderation_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.openai_moderation_agent

Source

dev/waxell-dev/app/demos/openai_moderation_agent.py