Multi-turn Jailbreaks

Gradient prompting

Across many turns, the attacker incrementally moves the model toward disallowed output, never crossing the line in any single turn.

Severity: high
OWASP LLM: LLM01

How it works

Each turn looks innocuous in isolation: the attacker first asks for a fictional setting, then for a character, then for dialogue, then for specific harmful detail. Classifiers that score each turn independently miss the escalating trajectory.
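
The blind spot is easy to see with toy numbers. In the sketch below, the per-turn risk scores are invented for illustration: every individual turn stays under the block threshold, so a per-turn filter never fires, yet the risk rises steadily across the conversation.

```python
# Toy illustration of the per-turn blind spot. The scores are made up; a real
# deployment would get them from a moderation model scoring each message alone.

BLOCK_THRESHOLD = 0.7

# (turn summary, hypothetical per-turn risk score)
turns = [
    ("ask for a fictional setting",          0.05),
    ("introduce a morally grey character",   0.10),
    ("request dialogue for that character",  0.20),
    ("ask for 'realistic' technical detail", 0.40),
    ("ask for step-level specifics",         0.65),
]

# No single message crosses the threshold, so a per-turn filter never blocks.
blocked = any(score >= BLOCK_THRESHOLD for _, score in turns)
print(f"per-turn filter fired: {blocked}")                   # False

# The signal lives in the trajectory, not in any one point.
drift = turns[-1][1] - turns[0][1]
print(f"risk drift across the conversation: {drift:.2f}")    # 0.60
```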

Example payload

(8 turns of escalating roleplay culminating in disallowed content.)

Defenses

Run classifiers over the whole conversation rather than over each turn in isolation, and maintain a 'distance from safe baseline' score that tracks how far the dialogue has drifted since the opening turn.
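
A minimal sketch of that defense, assuming a conversation-level moderation model is available: `score_conversation` is a hypothetical hook that returns a risk score in [0, 1] for the entire transcript so far, and the 0.4 drift threshold is an arbitrary placeholder to tune.

```python
from typing import Callable, List

def make_drift_monitor(
    score_conversation: Callable[[List[str]], float],
    drift_threshold: float = 0.4,
) -> Callable[[str], bool]:
    """Track a 'distance from safe baseline' score for one conversation."""
    history: List[str] = []
    baseline: List[float] = []   # one-element list used as a mutable cell

    def on_turn(message: str) -> bool:
        """Score the whole transcript after each turn; return True to escalate."""
        history.append(message)
        score = score_conversation(history)   # conversation-level, not per-turn
        if not baseline:
            baseline.append(score)            # first turn defines the safe baseline
        drift = score - baseline[0]
        return drift >= drift_threshold       # flag the trajectory, not one message

    return on_turn

# usage (with whatever conversation-level scorer is available):
#   check = make_drift_monitor(my_conversation_scorer)
#   for user_msg in incoming_turns:
#       if check(user_msg):
#           refuse_or_escalate()
```

Re-scoring the full transcript every turn grows with context length; a sliding window or running summary can bound the cost while keeping the trajectory signal.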

Related patterns