Prompt Injection
Attacks that override or hijack a model's instructions through user input, retrieved context, or tool output.
Direct instruction override
A user instruction explicitly tells the model to ignore prior rules and follow attacker-supplied behavior.
Indirect injection in RAG context
Attacker-controlled content retrieved by the model contains hidden instructions that the model executes as if from the operator.
Hidden unicode injection
Instructions are smuggled in via zero-width characters, bidi overrides, or homoglyphs that humans do not see in the rendered UI.
Delimiter confusion
Attacker closes a fake delimiter the operator was using to separate user content from instructions, then opens a new fake system block.
Role hijack via fake conversation history
User prompt fabricates a conversation history in which the assistant has already agreed to bypass policy.
Encoded instruction smuggling
The malicious instruction is base64-, hex-, or rot13-encoded; the model decodes it and executes the payload.
Language switch bypass
Attacker sends the malicious instruction in a low-resource language for which safety classifiers are weaker.
Sandwich injection
The attacker wraps a benign request around a hostile core, hoping defenses inspect only the start and end of the prompt.
Tool output injection
An attacker controls the output of a tool the agent calls and uses it to inject instructions back into the model.
Image prompt injection
Hidden instructions are embedded in an image (visible text, steganography, or low-contrast overlays) and read by a vision-capable model.