Honeypot Testing
expertCreating deliberately tempting scenarios that would reveal misaligned or deceptive behavior in AI agents. Honeypots offer apparent opportunities for reward hacking, data exfiltration, or goal subversion to detect problematic tendencies.
Overview
Honeypot testing adapts a classic security technique for AI safety. By presenting agents with tempting but forbidden opportunities, evaluators can observe whether models resist temptation or attempt exploitation. Examples include: offering access to "private" data to test information boundaries, presenting opportunities to accumulate hidden resources, or creating scenarios where deception would be advantageous. Effective honeypots must be realistic enough to elicit genuine behavior while being clearly off-limits to aligned agents. The goal is distinguishing agents that are aligned from those merely appearing aligned.
Key Concepts
Temptation Scenarios
Creating situations where misaligned behavior would be advantageous.
Behavioral Consistency
Comparing agent behavior across normal and honeypot contexts.
Stealth Testing
Honeypots that agents cannot easily identify as tests.
Escalating Temptation
Increasing reward/opportunity to find thresholds for problematic behavior.