Multi-turn Jailbreaks
Attacks that exploit conversational state across many turns to gradually erode safety constraints.
Gradient prompting
Across many turns, the attacker incrementally moves the model toward the unsafe output, never crossing the line in a single turn.
Context saturation
Attacker fills the context window with benign content so safety instructions roll out, then issues the harmful prompt.
Persona drift
Attacker adopts a persona ('grandma who used to work at the chemical plant') that the model treats sympathetically, eroding refusals.
Fictional framing
Attacker frames the request as a scene in a novel, screenplay, or game, decoupling the unsafe content from real-world consequence.
Delegation loop
Attacker asks the model to ask itself, then comply with its own request, looping responsibility.
Step-by-step extraction
Attacker breaks the disallowed answer into many individually-allowed sub-questions and reassembles offline.
Consent coercion
Attacker establishes false consent ('I am authorized', 'this is for my own account') across turns to bypass refusals.
Context poison handoff
Attacker poisons the chat context, then a second user (perhaps unaware) inherits the session and triggers the latent payload.