Model Collapse
The degradation of AI model quality when trained on AI-generated data. As synthetic content pollutes training data, models lose diversity and capability, potentially creating a feedback loop of declining quality.
Overview
Model collapse occurs when AI systems are trained on data generated by other AI systems. Each generation loses some fidelity to the original human-generated distribution, like a photocopy of a photocopy. Research has shown that models trained recursively on synthetic data progressively lose rare concepts, edge cases, and creative variations. The "long tail" of human expression gets truncated, leaving only the most common patterns. This poses an existential challenge for AI development: as AI-generated content floods the internet, future training data becomes increasingly contaminated, potentially capping model capabilities.
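The photocopy-of-a-photocopy dynamic can be illustrated with a toy simulation. This is an illustrative sketch, not a real training pipeline: the "model" is just a Gaussian fit to the data, and each generation is trained only on samples drawn from the previous generation's fit. All parameters (sample size, generation count) are made up for demonstration.

```python
import random
import statistics

random.seed(0)

# Generation 0: "human" data from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(20)]

spread = []
for generation in range(500):
    # "Train" a model: estimate mean and standard deviation from the data.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    spread.append(sigma)
    # The next generation sees only the model's own outputs.
    data = [random.gauss(mu, sigma) for _ in range(20)]

print(f"generation 0 spread:   {spread[0]:.3f}")
print(f"final generation spread: {spread[-1]:.3f}")
```

Because each fit is made from a small finite sample, estimation noise compounds across generations and the measured spread tends to shrink toward zero, mirroring the loss of diversity described above.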
Key Concepts
Distribution Shift
AI-generated data has subtly different statistical properties from those of human data.
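One way to quantify such a shift is KL divergence between the human and model token distributions. The token names and frequencies below are hypothetical, chosen only to show a "peakier" model distribution that over-represents common tokens:

```python
import math

# Hypothetical frequencies: the model shifts mass from rare to common tokens.
human = {"the": 0.50, "cat": 0.30, "quokka": 0.15, "axolotl": 0.05}
model = {"the": 0.60, "cat": 0.30, "quokka": 0.09, "axolotl": 0.01}

# KL(human || model): extra surprise when human text is scored by the model.
# It is zero only if the two distributions match exactly.
kl = sum(p * math.log(p / model[t]) for t, p in human.items())

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

print(f"KL(human || model) = {kl:.4f}")
print(f"entropy: human {entropy(human):.3f}, model {entropy(model):.3f}")
```

The nonzero divergence and the model's lower entropy capture the characteristic signature: model output is more predictable and more concentrated on common patterns than the human data it imitates.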
Mode Collapse
Models converge on common outputs, losing the ability to generate rare or novel content.
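A discrete sketch of the same effect, under made-up token names and probabilities: a categorical "language model" is re-estimated each generation from a finite sample of its own output. Once a rare token fails to appear in a sample, its estimated probability becomes zero and it can never be generated again, so the support of the distribution can only shrink.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical token frequencies with a long tail.
probs = {"common": 0.70, "frequent": 0.255, "rare": 0.04, "very_rare": 0.005}
sample_size = 100

support_sizes = []
for generation in range(50):
    tokens = list(probs)
    sample = random.choices(tokens, weights=[probs[t] for t in tokens],
                            k=sample_size)
    counts = Counter(sample)
    # Re-fit on the model's own output; unseen tokens get probability zero.
    probs = {t: counts.get(t, 0) / sample_size for t in tokens}
    support_sizes.append(sum(1 for p in probs.values() if p > 0))

print("tokens still generable per generation:", support_sizes)
```

Token loss here is absorbing: the support size is monotonically non-increasing, and with these settings the rarest token typically disappears within the first few generations, which is exactly the truncation of the long tail described in the Overview.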
Data Provenance
The need to track whether training data is human-generated or AI-generated.
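In practice, provenance tracking means attaching a source label to every training record and filtering on it before training. The record schema and labels below are hypothetical; real pipelines might derive such labels from crawl metadata or watermark detection rather than trusting a field directly:

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    source: str  # hypothetical provenance label: "human" or "synthetic"

corpus = [
    Record("a hand-written forum post", "human"),
    Record("output pasted from a chatbot", "synthetic"),
    Record("a digitized library book", "human"),
]

# Keep only human-authored records for the next training run.
train_set = [r for r in corpus if r.source == "human"]
print(f"kept {len(train_set)} of {len(corpus)} records")
```

The hard part is not the filter but the label: reliably distinguishing human from synthetic text at web scale remains an open problem, which is why provenance is usually recorded at collection time rather than inferred afterward.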