A user or document attempt to override system instructions, causing the model to act outside intended bounds.
Prompt injection threatens data safety and product integrity. PMs own mitigations: isolation, input cleaning, trust scoring, and confirmations. Costs include extra latency and possible refusals; benefits include preventing breaches and bad actions.
Separate untrusted content from instructions; use markers and quoted blocks. Run input classifiers for jailbreak patterns. Add allowlists for tools and confirm high-impact actions with users. In 2026, use content-origin metadata in prompts so the model treats untrusted text as data, not instructions.
A RAG chatbot started following hidden instructions in PDFs. After sandboxing retrieved text with provenance tags and adding an injection detector, malicious success dropped to <0.3% with negligible latency change.