What is prompt injection and why are agents especially vulnerable?

Question

Accepted Answer

Prompt injection is a security vulnerability where a user crafts malicious input to override or manipulate an agent's underlying large language model (LLM), forcing it to disregard its original programming or perform unintended actions. This attack leverages the fact that both system instructions and user inputs often reside within the same context window. Agents are especially vulnerable because they are designed to interact autonomously with diverse environments and utilize LLMs to interpret instructions, make decisions, and interact with external tools. A successful prompt injection can trick an agent into misusing these tools, revealing internal states or sensitive information, or executing harmful commands that bypass intended safety guidelines. Their reliance on an LLM for dynamic decision-making and tool use exposes them significantly more than simpler LLM applications, as the impact of manipulation can extend far beyond just generating text.