A crafted input designed to bypass safety constraints and make the model produce disallowed content or actions.
Jailbreaks can lead to brand damage and security incidents. PMs need a mitigation strategy: layered filters, robust system prompts, and monitoring. Strong defenses may slightly increase refusals; without them, risk escalates quickly.
Use structured prompts with clear refusal policies, run jailbreak detectors, and limit tool scopes. Add rate limits and anomaly alerts for repeated attack patterns. In 2026, rotate prompt variants and use adversarial training examples in your eval harness to harden defenses.
After shipping a consumer chat, red-teamers forced unsafe responses. Adding a jailbreak detector and stricter refusal block dropped successful attacks from 9% to <0.5% with a 0.3 s latency increase within budget.