Confuse the Lost Bot! A Reinforcement Learning Attack Demo

Welcome to Confuse the Lost Bot, a demo of reinforcement learning (RL) attack vectors built for the Secure and Trustworthy AI Systems course. An RL-driven bot agent tries to reach its goal tile, but by tweaking the controls you can lead it astray. Try out the variations below to see how adversarial manipulation can influence RL behavior. The demo showcases two attack vectors against RL agents: reward hacking (gold tiles that lure the agent away from its goal) and adversarial observation manipulation (fake goals and noisy perception). Click on tiles to add or remove reward tiles, adjust the sliders, and observe how the agent's path changes.

Controls

Fake Goal (Observation Attack): Introduces a false goal (blue outline), demonstrating adversarial observation manipulation: the agent moves toward the perceived goal instead of the true goal.
Noise (slider, default 0): Randomly perturbs the agent's perceived goal. Higher noise increases unpredictability, simulating adversarial observation errors.
Step: Advances the agent one step toward its current priority tile.
Auto-run: Moves the agent continuously according to its reward and perceived-goal calculations.
Reset: Returns the agent to the start, clears temporary state, and redraws the grid.

Click on grid cells to add/remove gold tiles (reward tiles). Observe how high-reward tiles can lure the agent away from the true goal.
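The controls above can be summarized by the agent's step rule. Below is a minimal sketch of that logic, not the actual demo code: the names `priority_tile` and `step`, and the greedy "nearest gold tile first, otherwise perceived goal" rule, are assumptions for illustration.

```python
# Assumed step logic: the agent greedily moves one cell toward its
# "priority" tile, which is the nearest gold (reward) tile if any exist,
# otherwise the perceived goal.

def manhattan(a, b):
    # Grid distance between two (x, y) cells.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def priority_tile(agent, reward_tiles, perceived_goal):
    """Nearest reward tile wins; otherwise head for the perceived goal."""
    if reward_tiles:
        return min(reward_tiles, key=lambda t: manhattan(agent, t))
    return perceived_goal

def step(agent, target):
    """Move one cell along the axis with the larger remaining distance."""
    x, y = agent
    dx, dy = target[0] - x, target[1] - y
    if abs(dx) >= abs(dy) and dx != 0:
        return (x + (1 if dx > 0 else -1), y)
    if dy != 0:
        return (x, y + (1 if dy > 0 else -1))
    return agent  # already on the target

agent, goal, gold = (0, 0), (4, 4), [(1, 0)]
agent = step(agent, priority_tile(agent, gold, goal))
print(agent)  # → (1, 0): the nearby gold tile pulls the agent off course
```

With no gold tiles on the grid, the same rule walks the agent straight to the perceived goal; planting a tile closer than the goal immediately changes the priority.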

Legend:
Agent
True Goal
Reward Tile
Perceived/Fake Goal
Priority Tile

Understanding the Attacks

Gold Tile Reward (Reward Hacking)

Impact: The agent exploits high-reward shortcuts and ignores the true objective.

Mitigation: Improve reward design, add constraints, and audit behavior.
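A quick worked example shows why the exploit is rational for the agent. All numbers here (discount factor, rewards, distances) are illustrative assumptions, not values from the demo.

```python
# Reward hacking in one comparison: a smaller reward that is closer can
# beat the true goal once future rewards are discounted.

GAMMA = 0.9          # discount factor (assumed)
GOAL_REWARD = 10.0   # reward for reaching the true goal (assumed)
GOLD_REWARD = 8.0    # reward planted on a nearby gold tile (assumed)

def discounted(reward, steps, gamma=GAMMA):
    """Return for collecting `reward` after `steps` moves."""
    return gamma ** steps * reward

true_goal_return = discounted(GOAL_REWARD, steps=8)  # goal is 8 steps away
gold_return = discounted(GOLD_REWARD, steps=2)       # gold is 2 steps away

print(f"true goal: {true_goal_return:.2f}, gold tile: {gold_return:.2f}")
# 0.9**2 * 8 = 6.48 beats 0.9**8 * 10 ≈ 4.30, so a return-maximizing
# agent "rationally" takes the planted shortcut.
```

This is why mitigations target the reward design itself: capping planted rewards or penalizing detours changes which option maximizes return.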

Fake Goal (Observation Manipulation)

Impact: The agent follows incorrect or manipulated inputs.

Mitigation: Validate inputs, use redundancy, and train for robustness.
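The attack and one mitigation can both be sketched in a few lines. This is a toy model under assumed names (`observe`, `validated_goal`); the point is only that the agent plans against whatever the observation channel reports, not against ground truth.

```python
# Observation manipulation: the agent never sees the true goal directly,
# only an observation that an attacker may rewrite.

def observe(true_goal, fake_goal=None):
    """Attacker-controlled channel: returns the fake goal if one is set."""
    return fake_goal if fake_goal is not None else true_goal

def validated_goal(readings, fallback):
    """Toy redundancy check: accept only if independent readings agree."""
    return readings[0] if len(set(readings)) == 1 else fallback

true_goal = (4, 4)
print(observe(true_goal))          # honest channel reports (4, 4)
print(observe(true_goal, (0, 4)))  # attacked channel reports (0, 4)

# Redundancy mitigation: a disagreement between sensors is treated as
# suspicious, and the agent falls back to a safe default.
print(validated_goal([(4, 4), (0, 4)], fallback=(4, 4)))
```

Redundancy is only one of the mitigations listed above; input validation and robustness training attack the same gap from other directions.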

Noise (Observation Perturbation)

Impact: Causes instability and unpredictable behavior.

Mitigation: Train with noise, filter inputs, and use robust policies.
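The Noise slider can be modeled as random jitter on the perceived goal. The sketch below is an assumed behavior (per-axis ±1 shift with probability equal to the slider value), not the demo's actual implementation.

```python
import random

# Toy model of the Noise slider: with probability `noise`, each coordinate
# of the perceived goal is shifted by one cell in a random direction.

def perturb(goal, noise, rng):
    """Jitter the perceived goal; noise is in [0, 1]."""
    x, y = goal
    if rng.random() < noise:
        x += rng.choice((-1, 1))
    if rng.random() < noise:
        y += rng.choice((-1, 1))
    return (x, y)

rng = random.Random(0)  # seeded for reproducibility
print(perturb((4, 4), noise=0.0, rng=rng))  # noise 0: always (4, 4)
print(perturb((4, 4), noise=1.0, rng=rng))  # noise 1: both axes jittered
```

Because each step re-samples the jitter, a high noise setting makes the agent's priority tile fluctuate between steps, which is exactly the instability described above; filtering (e.g. averaging recent observations) is one way to dampen it.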