BIPI
BIPI

How to Red-Team an AI Agent: Goal Hijacking, Tool Abuse, and the Attacks That Actually Work

AI Security

Red-teaming AI agents is not the same as red-teaming LLMs or red-teaming software. Agents have goals, memory, tools, and the ability to take real-world actions. This is the methodology we use to attack them systematically — and the findings that should concern every team shipping agentic systems to production.

By Arjun Raghavan, Security & Systems Lead, BIPI · July 3, 2025 · 12 min read

#red-teaming#agentic-ai#goal-hijacking#tool-abuse#llm-security#ai-security

The AI security community spent 2023 and 2024 developing rigorous red-team methodologies for LLMs: jailbreaks, adversarial prompts, model extraction, training data leakage. Those techniques matter. But an agent is not just an LLM. It is an LLM with a goal state, a tool belt, a memory system, and a loop that keeps running until the goal is satisfied or the budget runs out. That architecture creates attack surfaces that pure LLM red-teaming misses entirely.

The Agent Threat Model

Before you can attack an agent, you need to understand what it is trying to do. Agents are goal-directed. A customer support agent has an implicit goal of resolving tickets. A coding agent has a goal of producing working code. A research agent has a goal of gathering and synthesising information. Every attack vector maps to either corrupting the goal, subverting the tools, manipulating the memory, or poisoning the inputs — or some combination of all four.

Goal Hijacking: Replacing What the Agent Is Trying to Achieve

Goal hijacking is the agentic equivalent of a command injection. The attacker does not break the agent — they redirect it. A customer support agent that is hijacked into exfiltrating other users' tickets is still functioning correctly from a technical standpoint. It is just pursuing the wrong goal.

  • Direct injection: adversarial text in user input that overwrites or appends to the agent's goal specification
  • Environmental injection: malicious content in documents, emails, or web pages that the agent retrieves and processes
  • Memory poisoning: injecting false memories that cause the agent to believe it has different goals or permissions
  • Orchestrator compromise: attacking the system that issues goals to the agent rather than the agent itself
  • Reward hacking: in agents with reinforcement components, finding inputs that maximise the reward signal while violating the intended objective

Tool Abuse: Using Legitimate Capabilities for Illegitimate Ends

Every tool an agent can call is a potential attack surface. Tool abuse differs from tool poisoning — the attacker is not modifying the tool, they are manipulating the agent into using a legitimate tool in an unintended way. A web browsing tool used to exfiltrate data via encoded URL parameters. A code execution tool used to establish persistence. A calendar tool used to enumerate organisational structure.

The Red Team Methodology

  1. Map the agent's goal specification, tool inventory, memory architecture, and permission boundaries before writing a single attack
  2. Test goal hijacking via user input first — the easiest and most reliable attack path in most deployments
  3. Test environmental injection by seeding the agent's data sources with adversarial content and observing downstream behaviour
  4. Enumerate every tool and construct a sequence of legitimate-looking calls that achieves an attacker objective
  5. Test memory manipulation by attempting to write false context into any persistent memory the agent maintains
  6. Test inter-agent trust if the target operates in a multi-agent system — can you impersonate a trusted orchestrator?
  7. Attempt to extract the agent's system prompt and goal specification through probing; treat successful extraction as a critical finding
  8. Measure how far the agent deviates from intended behaviour before triggering any guardrails — this is your 'blast radius' metric

Findings From Real Engagements

Across agent red-team engagements in 2025, several patterns emerged consistently. First, most agents fail to distinguish between instructions from their operators and instructions embedded in the data they process. Second, tool call logging was absent or superficial in the majority of deployments — attackers could abuse tools extensively before triggering any alert. Third, agents with access to write operations (email, database, file system) presented significantly higher risk than read-only agents, but were not subject to proportionally higher scrutiny.

87%
of agentic AI systems tested in 2025 were vulnerable to at least one goal hijacking technique
64%
allowed environmental prompt injection via document retrieval with no mitigation
3.2×
more tool abuse attempts succeeded in agents with write access compared to read-only agents

Building Agents That Resist Attack

The most effective mitigation is architectural: separate the goal specification from the data processing pipeline so that no data the agent reads can modify the goal. This is harder than it sounds when the goal is expressed in natural language and the agent is an LLM, but it is the right design target. Pair this with least-privilege tool access, comprehensive logging, and a human-in-the-loop checkpoint for any irreversible action, and you have a defensible baseline.

Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.