Overview
Continuous diffusion performs very well on images but struggles on discrete data. We introduce token identifiability as a framework for studying how Gaussian noise corrupts discrete data, and use it to uncover a temporal dissonance between two corruption mechanisms: discrete identity corruption and continuous rank degradation. Both are vital for continuous diffusion, but they become misaligned as the number of categories increases. We introduce CANDI, which disentangles the two forms of corruption and coordinates them through an explicit masking schedule, bringing the benefits of continuous diffusion to discrete spaces.
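To make the misalignment concrete, here is a minimal simulation sketch, not the paper's implementation. It assumes tokens are one-hot vectors with additive Gaussian noise of standard deviation `sigma`. It reads "identity corruption" as the probability that the argmax of the noisy vector is no longer the true token, and "rank degradation" as the fraction of incorrect tokens whose noisy value exceeds that of the true token. The function name `corruption_stats` and all parameters are illustrative assumptions.

```python
import numpy as np

def corruption_stats(vocab_size, sigma, n_samples=1000, seed=0):
    """Estimate two corruption measures for a noisy one-hot token.

    Assumed definitions (illustrative, not taken from the source):
      identity corruption: P(argmax of the noisy vector != true token)
      rank degradation: fraction of incorrect tokens scoring above the true one
    """
    rng = np.random.default_rng(seed)
    # Without loss of generality, the true token sits at index 0.
    noisy = sigma * rng.standard_normal((n_samples, vocab_size))
    noisy[:, 0] += 1.0  # one-hot signal plus Gaussian noise

    identity_corruption = float(np.mean(noisy.argmax(axis=1) != 0))
    # Compare every incorrect coordinate against the true coordinate.
    rank_degradation = float(np.mean(noisy[:, 1:] > noisy[:, :1]))
    return identity_corruption, rank_degradation

if __name__ == "__main__":
    for k in (10, 100, 2000):
        ic, rd = corruption_stats(vocab_size=k, sigma=0.5)
        print(f"K={k:5d}  identity corruption={ic:.3f}  rank degradation={rd:.3f}")
```

Under these assumptions, rank degradation depends only on the noise level, since each comparison involves just two Gaussian coordinates. Identity corruption, by contrast, grows with the vocabulary size because the true token must beat every other coordinate at once. At a fixed noise level the two measures therefore drift apart as the number of categories increases.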
TL;DR
- Token identifiability explains how Gaussian noise corrupts discrete data.
- We find temporal dissonance between identity corruption and rank degradation that hurts continuous diffusion.
- CANDI decouples and coordinates the corruptions to improve discrete diffusion.