Behavioral Consistency Testing

intermediate

Evaluating whether AI agent behavior remains stable and predictable across different contexts, prompts, and conditions. Inconsistency may indicate deception, poor alignment, or unreliable behavior.

Category: safety

testingevaluationdeception-detectionred-team

Overview

Behavioral consistency testing probes whether agents behave the same way across different situations. Genuine alignment should produce consistent behavior; inconsistency suggests the agent may be gaming evaluations or behaving differently when it thinks it's being watched. Testing approaches include: rephrasing the same request different ways, testing in sandbox versus production contexts, varying the apparent stakes or consequences, and comparing behavior with and without oversight. Inconsistencies don't always indicate deception—they might reflect capability limitations or reasonable context-sensitivity. Interpretation requires understanding expected variation versus concerning divergence.

Key Concepts

Context Variation

Testing behavior across different scenarios and conditions.

Paraphrase Testing

Same request with different wording should yield consistent behavior.

Oversight Sensitivity

Comparing behavior with and without apparent monitoring.

Stake Variation

Testing whether behavior changes based on perceived consequences.

Related Concepts

Deceptive Alignment Honeypot Testing AI Red Teaming Sandboxing