Honeypot Testing

expert

Creating deliberately tempting scenarios that would reveal misaligned or deceptive behavior in AI agents. Honeypots offer apparent opportunities for reward hacking, data exfiltration, or goal subversion to detect problematic tendencies.

Category: safety

testingdeception-detectionred-teamsecurity

Overview

Honeypot testing adapts a classic security technique for AI safety. By presenting agents with tempting but forbidden opportunities, evaluators can observe whether models resist temptation or attempt exploitation. Examples include: offering access to "private" data to test information boundaries, presenting opportunities to accumulate hidden resources, or creating scenarios where deception would be advantageous. Effective honeypots must be realistic enough to elicit genuine behavior while being clearly off-limits to aligned agents. The goal is distinguishing agents that are aligned from those merely appearing aligned.

Key Concepts

Temptation Scenarios

Creating situations where misaligned behavior would be advantageous.

Behavioral Consistency

Comparing agent behavior across normal and honeypot contexts.

Stealth Testing

Honeypots that agents cannot easily identify as tests.

Escalating Temptation

Increasing reward/opportunity to find thresholds for problematic behavior.

Related Concepts

Deceptive Alignment AI Red Teaming Sandboxing Behavioral Evaluation