Apple Research Examines How Diffusion Models Generalize
A new paper from Apple Machine Learning Research investigates why conditional diffusion models sometimes produce convincing out-of-distribution outputs and sometimes do not, and identifies a structural property that separates the two cases. The study centers on length generalization, a model's ability to generate scenes with more objects than it saw during training, and proposes an explanation rooted in how scores depend on pixels and conditioners.
Authored by Arwen Bradley, the paper frames compositional generalization as an open question for the field. While diffusion models often appear capable of handling novel input combinations, the paper argues that the mechanisms behind that behavior have remained poorly understood. The research sets out to distinguish cases where models truly learn the underlying compositional structure from cases where they only appear to.
What the CLEVR Experiments Revealed
To isolate the question, the researchers used a controlled setup based on CLEVR, the synthetic visual reasoning benchmark introduced by Johnson et al. in 2017. The paper reports that length generalization is achievable in some configurations but not others, showing that compositional behavior is not an automatic property of conditional diffusion models.
The team connects success and failure to a structural property called local conditional scores: scores with sparse dependencies on both pixels and conditioners. The paper proves an exact equivalence between this property and a compositional structure described as conditional projective composition, building on earlier work by Bradley et al. (2025). The framework also extends to feature-space compositions of concepts such as style and content.
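To make the locality notion concrete, here is a minimal toy sketch, not the paper's construction: a score function over a 1-D "image" where each output coordinate depends only on a small pixel window and on the single conditioner entry assigned to that position. The function `local_score` and the per-pixel conditioner layout are illustrative assumptions; the Jacobians simply visualize the sparse dependency pattern the paper describes.

```python
import torch

# Toy sketch of a "local conditional score" (illustration only, not the
# paper's construction): the score at position i depends on a small pixel
# window around i and on the single conditioner entry assigned to i.
def local_score(x, c, window=1):
    padded = torch.nn.functional.pad(x, (window, window))  # zero-pad the 1-D "image"
    windows = padded.unfold(0, 2 * window + 1, 1)          # (N, 2*window+1) local windows
    return torch.tanh(windows.sum(dim=1)) + 0.5 * c        # sparse in both x and c

x = torch.randn(8)
c = torch.randn(8)
J_x = torch.autograd.functional.jacobian(lambda x: local_score(x, c), x)
J_c = torch.autograd.functional.jacobian(lambda c: local_score(x, c), c)
print((J_x.abs() > 1e-8).int())  # banded: each score reads only a pixel window
print((J_c.abs() > 1e-8).int())  # diagonal: each score reads one conditioner entry
```

On this toy diagnostic, a non-local score would instead show dense rows in one or both Jacobians, which is the contrast the paper's structural property is meant to capture.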
Why Local Conditional Scores Matter
Empirically, CLEVR models that achieved length generalization exhibited local conditional scores, while those that failed did not. Crucially, the researchers also report a causal intervention: explicitly enforcing local conditional scores in a previously failing model was enough to enable length generalization. That result moves the locality account from correlation toward a causal mechanism.
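The article does not spell out how the enforcement works, but one simple way to impose local conditional scores at sampling time is to evaluate the score patchwise, so each output region sees only its own pixels and the conditioner entry assigned to it. The `score_net` interface, patch size, and per-patch conditioner layout below are all hypothetical, a sketch of the idea rather than the paper's method.

```python
import torch

def locally_enforced_score(score_net, x, cond, patch=16):
    """Hypothetical enforcement of local conditional scores: evaluate the
    score patch-by-patch so each output region can depend only on its own
    pixels and on the conditioner entry assigned to that region."""
    B, C, H, W = x.shape
    out = torch.zeros_like(x)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            c_ij = cond[:, i // patch, j // patch]  # per-patch conditioner (assumed layout)
            out[:, :, i:i+patch, j:j+patch] = score_net(
                x[:, :, i:i+patch, j:j+patch], c_ij
            )
    return out

# Dummy stand-in for a trained score network (illustration only).
score_net = lambda xp, cp: -xp + cp.view(-1, 1, 1, 1)

x = torch.randn(2, 3, 32, 32)                            # batch of 32x32 images
cond = torch.randn(2, 2, 2)                              # one conditioner per 16x16 patch
print(locally_enforced_score(score_net, x, cond).shape)  # torch.Size([2, 3, 32, 32])
```

By construction, no patch can read pixels or conditioners outside its own region, which is the dependency structure the intervention is meant to guarantee.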
The analysis extends to SDXL, a widely used production-scale model. In pixel space, spatial locality is present but conditional locality is mostly absent. The team does find quantitative evidence of local conditional scores in the network's learned feature space, suggesting the mechanism operates at a more abstract level in larger systems.
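The summary does not give the measurement protocol, but a rough diagnostic in the same spirit is to measure, for each output region, the gradient-norm sensitivity to each conditioner token: a local model yields a sparse, near-diagonal sensitivity matrix, while a non-local one yields a dense matrix. Everything below, the feature function `f` and the region and token layout, is assumed for illustration.

```python
import torch

def sensitivity_matrix(f, x, c, n_regions):
    """Rough locality diagnostic (an assumption, not the paper's protocol):
    entry [r, t] is the gradient norm of output region r with respect to
    conditioner token t. Sparse rows suggest local conditional structure."""
    c = c.clone().requires_grad_(True)
    y = f(x, c)                      # flattened output, split into regions below
    rows = []
    for chunk in y.chunk(n_regions):
        (g,) = torch.autograd.grad(chunk.sum(), c, retain_graph=True)
        rows.append(g.norm(dim=-1))  # norm over each token's embedding dims
    return torch.stack(rows)         # (n_regions, n_tokens)

# Toy "local" feature map: region r reads only token r (illustration only).
f = lambda x, c: (x.view(4, -1) * c.sum(dim=-1, keepdim=True)).flatten()

x = torch.randn(16)                              # toy feature input, 4 regions of 4 dims
c = torch.randn(4, 8)                            # 4 conditioner tokens, 8-dim each
print(sensitivity_matrix(f, x, c, n_regions=4))  # near-diagonal => local
```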
What Researchers Should Watch Next
The findings build on earlier proposals from Kamb and Ganguli (2024) and Niedoba et al. (2024), which examined locality as a driver of creativity in unconditional diffusion models. By extending the idea to conditional settings and compositional generalization, the paper offers a testable hypothesis about why some models extrapolate and others stall.
A natural follow-up question is whether enforcing local conditional scores during training, rather than only as a post-hoc intervention, can deliver more reliable length generalization in larger systems. Full details, including the equivalence proofs and SDXL feature-space measurements, are available on the Apple Machine Learning Research website.