Overview
Continuous diffusion performs extremely well on images but struggles on discrete data. In theory, learning a continuous score function should enable coordinated refinement and make external guidance easy to apply. Recent work has nevertheless moved towards purely discrete diffusion methods, which lack the continuous geometry of Gaussian diffusion yet tend to outperform continuous methods.
We introduce token identifiability, a framework for studying Gaussian noise on discrete data. Using this analysis, we discover a temporal dissonance between discrete identity corruption (whether an incorrect token is closest to the noisy latent) and continuous rank degradation (how many incorrect tokens are closer to the noisy latent than the correct token). Both are vital for continuous diffusion to work well on discrete data, yet they become severely misaligned as the number of categories grows.
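To make the two quantities above concrete, here is a minimal Monte Carlo sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: tokens are taken to be one-hot vectors, noise is added as `x0 + sigma * eps`, and the function name `token_identifiability`, its parameters, and the normalization of the rank count by the number of incorrect tokens are all choices made for this example.

```python
import numpy as np

def token_identifiability(vocab_size, sigma, num_samples=2000, seed=0):
    """Estimate both corruption measures for one-hot tokens under Gaussian noise.

    Assumptions (not from the source): one-hot token embeddings, additive
    noise x0 + sigma * eps, and rank reported as a fraction of the
    incorrect tokens.
    """
    rng = np.random.default_rng(seed)

    # By symmetry, place the correct token at index 0.
    x0 = np.zeros(vocab_size)
    x0[0] = 1.0
    noisy = x0 + sigma * rng.standard_normal((num_samples, vocab_size))

    # For one-hot vectors e_j, ||z - e_j||^2 = ||z||^2 - 2 z_j + 1, so token j
    # is closer to z than the correct token exactly when z_j > z_0.
    num_closer = (noisy[:, 1:] > noisy[:, :1]).sum(axis=1)

    # Discrete identity corruption: some incorrect token is the nearest one.
    identity_corruption = float((num_closer > 0).mean())

    # Continuous rank degradation: average fraction of incorrect tokens
    # that are closer to the noisy latent than the correct token.
    rank_degradation = float(num_closer.mean() / (vocab_size - 1))

    return identity_corruption, rank_degradation

if __name__ == "__main__":
    for vocab_size in (10, 1000):
        ident, rank = token_identifiability(vocab_size, sigma=0.5)
        print(f"V={vocab_size}: identity={ident:.3f}, rank={rank:.3f}")
```

In this toy setting, at a fixed noise level the identity corruption rate rises with the vocabulary size while the normalized rank fraction stays roughly constant, which is one simple way to see the two measures drifting apart as the number of categories grows.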
Given this, we introduce CANDI: Continuous and Discrete Diffusion. We disentangle the two forms of corruption by introducing an explicit masking schedule that directly controls discrete identity corruption. Paradoxically, decoupling the two forms of corruption is what allows them to be coordinated with each other. We demonstrate that this brings the benefits of continuous diffusion to discrete spaces.
TL;DR:
- We introduce token identifiability as a framework for understanding how Gaussian noise corrupts discrete data.
- We discover a temporal dissonance between discrete identity corruption and continuous rank degradation, which causes continuous diffusion to underperform discrete methods.
- We introduce CANDI: Continuous and Discrete Diffusion, which resolves the temporal dissonance and brings the benefits of continuous diffusion to discrete spaces.