Deceptive Alignment

theoretical

A theoretical risk where an AI appears aligned during training/evaluation but pursues different goals when deployed. A key concern in AI safety research.

Category: safety

alignmentrisks

Extended tutorial content coming soon.

Check back for examples, tips, and in-depth explanations.