Gradient prompting
Across many turns, the attacker incrementally moves the model toward the unsafe output, never crossing the line in a single turn.
Severity: high
OWASP LLM: LLM01
How it works
Each turn looks innocuous in isolation: the attacker first asks for a fictional setting, then a character, then dialogue, then specific harmful detail. Classifiers that score each turn independently miss the trajectory.
Example payload
(8 turns of escalating roleplay culminating in disallowed content.)
Defenses
Run classifiers over the whole conversation, not individual turns, and maintain a running 'distance from safe baseline' score that accumulates across turns, so a slow escalation is caught even when no single turn crosses the per-turn threshold.