Behavioral Consistency Testing
intermediateEvaluating whether AI agent behavior remains stable and predictable across different contexts, prompts, and conditions. Inconsistency may indicate deception, poor alignment, or unreliable behavior.
Overview
Behavioral consistency testing probes whether agents behave the same way across different situations. Genuine alignment should produce consistent behavior; inconsistency suggests the agent may be gaming evaluations or behaving differently when it thinks it's being watched. Testing approaches include: rephrasing the same request different ways, testing in sandbox versus production contexts, varying the apparent stakes or consequences, and comparing behavior with and without oversight. Inconsistencies don't always indicate deception—they might reflect capability limitations or reasonable context-sensitivity. Interpretation requires understanding expected variation versus concerning divergence.
Key Concepts
Context Variation
Testing behavior across different scenarios and conditions.
Paraphrase Testing
Same request with different wording should yield consistent behavior.
Oversight Sensitivity
Comparing behavior with and without apparent monitoring.
Stake Variation
Testing whether behavior changes based on perceived consequences.