Adversarial Reinforcement Learning
for LLM Agent Safety

The University of Texas at Austin, Google, Google DeepMind, Sony AI
preprint

Overview

Large Language Model (LLM) agents can leverage tools such as Google Search to complete complex tasks. However, this tool usage introduces the risk of indirect prompt injections, where malicious instructions hidden in tool outputs can manipulate the agent, posing security risks like data leakage. Current defense strategies typically rely on fine-tuning LLM agents on datasets of known attacks. However, the generation of these datasets relies on manually crafted attack patterns, which limits their diversity and leaves agents vulnerable to novel prompt injections. To address this limitation, we propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game. ARLAS co-trains two LLMs: an attacker that learns to autonomously generate diverse prompt injections and an agent that learns to defend against them while completing its assigned tasks. To ensure robustness against a wide range of attacks and to prevent cyclic learning, we employ a population-based learning framework that trains the agent to defend against all previous attacker checkpoints. Evaluated on BrowserGym and AgentDojo, agents fine-tuned with ARLAS achieve a significantly lower attack success rate than the original model while also improving their task success rate. Our analysis further confirms that the adversarial process generates a diverse and challenging set of attacks, leading to a more robust agent compared to the base model.

ARLAS enhances LLM agent safety via a jointly trained attacker. In each turn of an episode, the attacker first generates an indirect prompt injection to insert into the observation, and then the agent selects an action (i.e., which tool to call and its parameters). The agent and the attacker receive sparse rewards at the end of the episode, based on whether the attacker tricks the agent into leaking user information and whether the agent successfully completes the task. ARLAS trains both models to maximize their respective rewards using RL.
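To make the turn structure concrete, below is a minimal sketch of one episode under this setup. The helpers (env, attacker_llm, agent_llm, insert_injection, leaked_user_info, task_completed) are hypothetical placeholders standing in for the actual environment and model interfaces, and the terminal reward assignment is only one plausible instantiation of the zero-sum formulation described above, not the paper's exact reward function.

```python
# Sketch of a single ARLAS episode (assumed interfaces, not the paper's code).
def run_episode(env, attacker_llm, agent_llm, max_turns=20):
    obs = env.reset()
    trajectory = []
    for _ in range(max_turns):
        # The attacker writes an indirect prompt injection into the tool output.
        injection = attacker_llm.generate(observation=obs)
        poisoned_obs = env.insert_injection(obs, injection)

        # The agent selects an action: which tool to call and with what parameters.
        action = agent_llm.generate(observation=poisoned_obs)
        trajectory.append((poisoned_obs, injection, action))

        obs, done = env.step(action)
        if done:
            break

    # Sparse terminal rewards (illustrative zero-sum-style assignment).
    attacker_reward = 1.0 if env.leaked_user_info() else 0.0
    agent_reward = (1.0 if env.task_completed() else 0.0) - attacker_reward
    return trajectory, attacker_reward, agent_reward
```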

Population-Based Learning

Compared to iterative training (left), which can lead to cyclic learning, ARLAS (right) leverages population-based learning, training the agent model to be robust against all previous attacker models.
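The following sketch illustrates the population-based alternation under stated assumptions: train_attacker_rl and train_agent_rl are hypothetical routines that run one round of RL fine-tuning against a given opponent (or opponent sampler), and checkpoint() stands in for saving a frozen copy of the attacker.

```python
import random

def arlas_population_training(agent, attacker, num_iterations,
                              train_attacker_rl, train_agent_rl):
    # Keep every attacker checkpoint so the agent never "forgets" old attacks.
    attacker_population = [attacker.checkpoint()]
    for _ in range(num_iterations):
        # 1) Train the attacker against the current agent.
        attacker = train_attacker_rl(attacker, opponent=agent)
        attacker_population.append(attacker.checkpoint())

        # 2) Train the agent against the whole population of past attackers,
        #    sampling an opponent per episode to prevent cyclic learning.
        agent = train_agent_rl(
            agent,
            opponent_sampler=lambda: random.choice(attacker_population),
        )
    return agent, attacker_population
```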

Experiments

In the figure below, we show the performance of ARLAS agents at different learning stages playing against attackers at different learning stages. For an agent playing against a fixed attacker (i.e., from left to right in each row of the heatmap), the attack success rate decreases with more training iterations while the task success rate increases, demonstrating that the agent becomes more secure and capable. Similarly, for a fixed agent (i.e., from top to bottom in each column), the attacker generates stronger attacks with more RL training, as indicated by the rising attack success rate. Correspondingly, the stronger attacker has a greater impact on agent performance, and the task success rate decreases. These results show that ARLAS consistently enhances the performance of both the agent and the attacker, creating increasingly strong indirect prompt injections for the agent to learn from.
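A cross-play evaluation of this kind can be summarized as two matrices, one per metric. The sketch below assumes a hypothetical evaluate(agent, attacker, n_episodes) helper that returns the fraction of episodes with a successful attack and with a completed task; it is meant only to clarify how the heatmaps are organized.

```python
import numpy as np

def cross_play_heatmaps(agent_ckpts, attacker_ckpts, evaluate, n_episodes=100):
    # Rows index attacker checkpoints, columns index agent checkpoints,
    # matching the figure: each row is a fixed attacker, each column a fixed agent.
    attack_success = np.zeros((len(attacker_ckpts), len(agent_ckpts)))
    task_success = np.zeros((len(attacker_ckpts), len(agent_ckpts)))
    for i, attacker in enumerate(attacker_ckpts):
        for j, agent in enumerate(agent_ckpts):
            asr, tsr = evaluate(agent, attacker, n_episodes)
            attack_success[i, j] = asr
            task_success[i, j] = tsr
    return attack_success, task_success
```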




In the figure below, we compare ARLAS against baselines and ablation variants by evaluating the final model from each method against every other model. To determine the best-performing agent, we compute each agent's average performance against all attackers and rank the agents accordingly in each sub-figure.
As shown in the left sub-figure, ARLAS produces the most effective attacker, achieving a significantly higher attack success rate than other methods. Consequently, by training against a set of similarly strong attackers, ARLAS also produces the safest agent, significantly outperforming all rivals regardless of the specific attacker faced.
Meanwhile, the right sub-figure validates that ARLAS's strong safety does not degrade task performance, with ARLAS ranking among the top methods for average task success rate. In the absence of an attack, ARLAS performs comparably to ARLAS w/o AL, the variant that solely optimizes task performance.
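The ranking step reduces to averaging each agent's metric over all attackers and sorting. A minimal sketch, assuming a score matrix whose entry [i, j] is the metric (e.g., task success rate or one minus attack success rate) of agent i against attacker j:

```python
import numpy as np

def rank_agents(score_matrix, agent_names):
    # Average each agent's performance over all attackers, then rank best-first.
    avg_scores = np.asarray(score_matrix, dtype=float).mean(axis=1)
    order = np.argsort(-avg_scores)
    return [(agent_names[k], float(avg_scores[k])) for k in order]

# Example with hypothetical numbers:
# rank_agents([[0.8, 0.7], [0.6, 0.5]], ["ARLAS", "baseline"])
```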




The left sub-figure below shows how the attack embeddings evolve during training. As training progresses, the embeddings from later learning stages gradually deviate from those of earlier iterations. For example, while embeddings in the leftmost cluster gradually spread out, those in the rightmost cluster become more concentrated along the x-axis while dispersing along the y-axis. Moreover, the right sub-figure shows that the internal diversity of the embeddings consistently increases with RL training, further validating that ARLAS learns to generate diverse attacks.
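One simple way to quantify internal diversity is the mean pairwise distance among attack embeddings within a training iteration; the sketch below uses Euclidean distance purely for illustration and may differ from the exact metric used in the paper.

```python
import numpy as np

def embedding_diversity(embeddings):
    """embeddings: array of shape (n_attacks, dim) for one training iteration."""
    X = np.asarray(embeddings, dtype=float)
    n = len(X)
    if n < 2:
        return 0.0
    # Pairwise Euclidean distances, averaged over off-diagonal pairs only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return float(dists.sum() / (n * (n - 1)))

# Tracking diversity across iterations:
# diversities = [embedding_diversity(E_t) for E_t in embeddings_per_iteration]
```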