Prompt Injection

Attacks that override or hijack a model's instructions through user input, retrieved context, or tool output.

OWASP LLM: LLM01high: 6medium: 4

Direct instruction override

A user instruction explicitly tells the model to ignore prior rules and follow attacker-supplied behavior.

Attacker-controlled content retrieved by the model contains hidden instructions that the model executes as if from the operator.

Instructions are smuggled in via zero-width characters, bidi overrides, or homoglyphs that humans do not see in the rendered UI.

Attacker closes a fake delimiter the operator was using to separate user content from instructions, then opens a new fake system block.

User prompt fabricates a conversation history in which the assistant has already agreed to bypass policy.

The malicious instruction is base64-, hex-, or rot13-encoded; the model decodes it and executes the payload.

Attacker sends the malicious instruction in a low-resource language for which safety classifiers are weaker.

The attacker wraps a benign request around a hostile core, hoping defenses inspect only the start and end of the prompt.

An attacker controls the output of a tool the agent calls and uses it to inject instructions back into the model.

Hidden instructions are embedded in an image (visible text, steganography, or low-contrast overlays) and read by a vision-capable model.