Jailbreak

A crafted input designed to bypass safety constraints and make the model produce disallowed content or actions.

When to use it

  • Building public-facing chat or agents with tool access.
  • Before security reviews or enterprise launches.
  • When incidents show users eliciting unsafe outputs.

PM decision impact

Jailbreaks can lead to brand damage and security incidents. PMs need a mitigation strategy: layered filters, robust system prompts, and monitoring. Strong defenses may slightly increase refusals; without them, risk escalates quickly.

How to do it in 2026

Use structured prompts with clear refusal policies, run jailbreak detectors, and limit tool scopes. Add rate limits and anomaly alerts for repeated attack patterns. In 2026, rotate prompt variants and use adversarial training examples in your eval harness to harden defenses.

Example

After shipping a consumer chat, red-teamers forced unsafe responses. Adding a jailbreak detector and stricter refusal block dropped successful attacks from 9% to <0.5% with a 0.3 s latency increase within budget.

Common mistakes

  • Relying on a single safety layer.
  • Not monitoring for attack patterns over time.
  • Failing to separate permissions for different user tiers.

Related terms

Learn it in CraftUp

Last updated: February 2, 2026