Model Collapse

intermediate

The degradation in quality that occurs when AI models are trained on AI-generated data. As synthetic content pollutes training corpora, models lose diversity and capability, potentially creating a feedback loop of declining quality.

Category: ml-fundamentals
Tags: training-data, quality, research

Overview

Model collapse occurs when AI systems are trained on data generated by other AI systems. Each generation loses some fidelity to the original human-generated distribution, like a photocopy of a photocopy.

Research has shown that models trained on synthetic data progressively lose rare concepts, edge cases, and creative variations. The "long tail" of human expression gets truncated, leaving only the most common patterns. This poses an existential challenge for AI development: as AI-generated content floods the internet, future training data becomes increasingly contaminated, potentially capping model capabilities.
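The photocopy effect can be sketched with a toy simulation. The setup below is an illustrative assumption, not a result from the entry: the "model" is simply the empirical token distribution of its training data, and each generation is trained on samples drawn from the previous generation's model. Tokens that happen to draw zero samples in one generation vanish permanently, which is how the long tail gets truncated.

```python
import random
from collections import Counter

random.seed(42)

# Generation 0: a human-like distribution with a long tail of rare tokens
# (Zipf-style weights, chosen only for illustration).
weights = {f"tok{i}": 1.0 / (i + 1) for i in range(50)}
total = sum(weights.values())
dist = {t: w / total for t, w in weights.items()}

N = 200  # training-set size per generation
for gen in range(20):
    tokens, probs = zip(*dist.items())
    # "Train" the next model on N samples from the current model's output.
    sample = random.choices(tokens, weights=probs, k=N)
    counts = Counter(sample)
    # The new model is just the empirical frequencies of what it saw;
    # any token absent from the sample is lost for good.
    dist = {t: c / N for t, c in counts.items()}

print(f"surviving tokens after 20 generations: {len(dist)} of 50")
```

Because the rarest tokens have an expected count below one per generation, most runs end with a noticeably smaller vocabulary, mirroring the loss of rare concepts described above.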

Key Concepts

Distribution Shift

AI-generated data has subtly different statistical properties than human data.

Mode Collapse

Models converge on common outputs, losing the ability to generate rare or novel content.

Data Provenance

The need to track whether training data is human or AI-generated.
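One minimal sketch of provenance tracking, under the assumption that each example carries an explicit origin label (the field names here are hypothetical): tag every training example with its source so synthetic data can be filtered out or down-weighted before training.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    text: str
    source: str  # e.g. "human", "synthetic", "unknown" (labels assumed)

corpus = [
    Example("handwritten essay", "human"),
    Example("LLM-generated summary", "synthetic"),
    Example("scraped forum post", "unknown"),
]

# Keep only examples with verified human provenance for training.
human_only = [ex for ex in corpus if ex.source == "human"]
print(len(human_only))  # prints 1
```

In practice the hard part is assigning the label reliably at scale; the data structure itself is trivial once provenance is known.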

Related Concepts