Screening inputs and outputs for toxicity, abuse, violence, or other policy-violating content.
Filters reduce risk and moderation cost but can create false positives that hurt UX. PMs own the thresholds and appeal paths, and must balance safety with inclusivity while keeping latency within budget when filters run on every request.
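To make the threshold-and-appeal trade-off concrete, here is a minimal sketch of how per-market policy might be encoded. `FilterPolicy`, its fields, and the market codes are illustrative assumptions, not any specific product's config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FilterPolicy:
    block_threshold: float   # score above which content is refused outright
    review_threshold: float  # score above which content goes to human review
    appeal_url: str          # where users contest a false-positive refusal

# Hypothetical values: stricter in markets with tighter regulation,
# looser where false positives are the bigger product risk.
POLICIES = {
    "de": FilterPolicy(block_threshold=0.80, review_threshold=0.50, appeal_url="/appeals/de"),
    "us": FilterPolicy(block_threshold=0.90, review_threshold=0.60, appeal_url="/appeals/us"),
}
DEFAULT_POLICY = FilterPolicy(block_threshold=0.85, review_threshold=0.55, appeal_url="/appeals")

def policy_for(market: str) -> FilterPolicy:
    return POLICIES.get(market, DEFAULT_POLICY)
```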
Use fast classifiers pre- and post-model, and tune thresholds per market. Log every hit, show user-friendly refusal messages, and route borderline cases to human review. In 2026, cascade your filters: cheap models score everything first, and heavier ones run only on risky intents, saving latency and cost (see the sketch below).
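A minimal sketch of that cascade, under stated assumptions: `cheap_scorer`, `heavy_scorer`, and the default thresholds are hypothetical stand-ins for whatever classifiers and policy values a team actually deploys. The same function can run pre-model on the prompt and post-model on the draft response:

```python
import logging
from typing import Callable

logger = logging.getLogger("moderation")

def cascade_filter(
    text: str,
    cheap_scorer: Callable[[str], float],   # fast classifier, scores every request
    heavy_scorer: Callable[[str], float],   # slower model, invoked only on risky text
    review_threshold: float = 0.55,         # above this: escalate, then human review
    block_threshold: float = 0.85,          # above this: refuse outright
) -> str:
    """Return 'allow', 'review', or 'block' for one input or output string."""
    score = cheap_scorer(text)
    # Most traffic stops at the cheap model, so added latency stays
    # close to the cheap model's alone.
    if score >= review_threshold:
        score = max(score, heavy_scorer(text))
    if score >= block_threshold:
        logger.info("filter hit: block (score=%.2f)", score)
        return "block"   # caller shows a friendly refusal with an appeal path
    if score >= review_threshold:
        logger.info("filter hit: review (score=%.2f)", score)
        return "review"  # borderline: queue for human review, don't hard-block
    return "allow"
```

Returning a verdict rather than raising an error keeps the routing decision (refuse, queue, allow) in the caller, where the refusal copy and appeal link live.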
A community Q&A assistant added pre- and post-model filters: the toxic response rate fell to near zero, false-positive refusals stayed under 1.5%, and moderation tickets dropped 30% with no noticeable latency impact.