Multi-turn Jailbreaks

Attacks that exploit conversational state across many turns to gradually erode safety constraints.

OWASP LLM: LLM01high: 4medium: 4

Gradient prompting

Across many turns, the attacker incrementally moves the model toward the unsafe output, never crossing the line in a single turn.

Attacker fills the context window with benign content so safety instructions roll out, then issues the harmful prompt.

Attacker adopts a persona ('grandma who used to work at the chemical plant') that the model treats sympathetically, eroding refusals.

Attacker frames the request as a scene in a novel, screenplay, or game, decoupling the unsafe content from real-world consequence.

Attacker asks the model to ask itself, then comply with its own request, looping responsibility.

Attacker breaks the disallowed answer into many individually-allowed sub-questions and reassembles offline.

Attacker establishes false consent ('I am authorized', 'this is for my own account') across turns to bypass refusals.

Attacker poisons the chat context, then a second user (perhaps unaware) inherits the session and triggers the latent payload.