TL;DR
We introduce token identifiability as a framework for understanding continuous noise on discrete data, explain why continuous diffusion has historically underperformed on discrete data, and propose CANDI as a principled solution.

CANDI: Hybrid Discrete-Continuous Diffusion Models

¹Purdue University  ²Google DeepMind

Abstract

While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions.

To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms.

  • Discrete Identity Corruption: Is an incorrect token closest to the noisy latent?
  • Continuous Rank Degradation: How many incorrect tokens are closer to the noisy latent?

We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure.

To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it.

This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE (number of function evaluations), demonstrating the value of learning continuous gradients for discrete spaces.

🧩The Puzzle

Continuous diffusion dominates images but fails on text. Why?

Continuous diffusion has revolutionized image generation, but mysteriously underperforms on discrete data like text. This is puzzling because continuous models learn score functions that enable:

  • Smooth gradients for guidance
  • Coordinated refinement instead of independent sampling

Yet recent work has shifted toward purely discrete diffusion methods that lack these properties, and those methods have generally outperformed their continuous counterparts. So what's going wrong?

🔬Our Analysis: Token Identifiability

We introduce a new framework for understanding continuous corruption of discrete data, and discover a "temporal dissonance": a fundamental misalignment in how continuous noise corrupts discrete data.

[Figure: Token identifiability under Gaussian noise]

We introduce token identifiability as a framework to understand how Gaussian noise corrupts discrete data through two distinct mechanisms:

  • Discrete Identity Corruption: Is an incorrect token closest to the noisy latent? (formalized in the sketch below)
  • Continuous Rank Degradation: How many incorrect tokens are closer to the noisy latent? (formalized in the sketch below)
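One plausible way to formalize the two quantities (the notation below is our own sketch, not necessarily the paper's exact definitions): write the noisy latent for a clean token x with representation e_x over vocabulary V as z_t = α_t e_x + σ_t ε, with ε standard Gaussian noise. Then:

```latex
% Discrete identity corruption: probability that some incorrect token's
% representation is nearest to the noisy latent z_t.
\rho(t) = \Pr\Big[\, \arg\min_{v \in V} \lVert z_t - e_v \rVert \neq x \,\Big]

% Continuous rank degradation: expected fraction of incorrect tokens whose
% representations lie closer to z_t than the true token's.
R(t) = \mathbb{E}\left[ \frac{\big|\{\, v \neq x : \lVert z_t - e_v \rVert < \lVert z_t - e_x \rVert \,\}\big|}{|V| - 1} \right]
```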

💥 Key Discovery: Temporal Dissonance

Discrete identity corruption controls how the model learns conditional dependencies between tokens. Unlike images, where nearby pixels share values due to spatial smoothness, discrete sequences lack geometric continuity: conditionally related tokens such as "New" and "York" have entirely different identities. To capture such co-occurrence patterns, the model must see clean anchor tokens during training.

Continuous rank degradation controls how the model learns the score function via continuous denoising, enabling coordinated refinement across positions. This gradient signal is the key strength of continuous diffusion: it allows multiple positions to evolve jointly. Without access to a continuous score function, updates rely on independent conditional sampling, which leads to incoherent generations, especially at low NFE, where many positions must be updated in parallel.

These mechanisms scale differently with vocabulary size, creating a temporal dissonance: when discrete structure is learnable, continuous denoising is trivial; when continuous denoising is meaningful, discrete structure is destroyed.
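The scaling argument can be checked with a small simulation (an illustrative sketch of ours, assuming one-hot token representations; it is not the paper's code): corrupt a one-hot token with Gaussian noise and measure how often a wrong token ends up closest (identity corruption) versus the average fraction of wrong tokens that are closer than the true one (rank degradation) as the vocabulary grows.

```python
import numpy as np

def identifiability_stats(vocab_size, sigma, n_trials=2000, seed=0):
    """Estimate discrete identity corruption and continuous rank degradation
    for Gaussian noise on one-hot token vectors (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    identity_corrupted = 0
    rank_degradation = 0.0
    for _ in range(n_trials):
        x = rng.integers(vocab_size)                 # clean token index
        z = sigma * rng.standard_normal(vocab_size)  # Gaussian noise ...
        z[x] += 1.0                                  # ... added to the one-hot vector
        # With one-hot vectors, the token nearest to z is argmax(z), and a wrong
        # token v is closer than the true token exactly when z[v] > z[x].
        identity_corrupted += (z.argmax() != x)
        rank_degradation += (z > z[x]).sum() / (vocab_size - 1)
    return identity_corrupted / n_trials, rank_degradation / n_trials

for vocab in (27, 1000, 50257):                      # Text8-sized up to GPT-2-sized vocab
    rho, R = identifiability_stats(vocab, sigma=0.4)
    print(f"|V|={vocab:6d}  identity corruption={rho:.2f}  rank degradation={R:.3f}")
```

In this toy setting, the per-token chance of being out-ranked is roughly constant in |V|, so rank degradation barely moves, while the chance that some wrong token is closest grows quickly with vocabulary size: the dissonance in miniature.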

[Figure: Discrete vs. continuous noise]

💡The Solution: CANDI

Decouple discrete and continuous corruption to get the best of both worlds.

[Figure: CANDI visualization]

Continuous ANd DIscrete diffusion solves temporal dissonance with one key insight: decouple the corruptions.

By separating discrete and continuous noise schedules, CANDI enables simultaneous learning of:

  • Conditional structure (discrete)
  • Continuous geometry (score functions)

💥 Structured Noising Kernel

We introduce a hybrid noising kernel that uses masking to decouple discrete corruption from Gaussian noise. Paradoxically, decoupling the two lets us coordinate them more tightly: as we show below, the structured kernel yields a linear relation between discrete identity corruption and continuous rank degradation. (A minimal implementation sketch follows the figure below.)

[Equations: Gaussian (continuous) noise kernel, masking (discrete) noise kernel, and the hybrid noising kernel]
[Figure: Continuous rank degradation R vs. discrete identity corruption ρ under the hybrid kernel]
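As a concrete illustration, here is a minimal sketch of what a decoupled hybrid corruption step could look like, assuming one-hot latents, a dedicated mask token, and independently scheduled masking probability `m_t` and Gaussian noise scale `sigma_t` (the names and parameterization are ours, not necessarily CANDI's exact kernel):

```python
import torch
import torch.nn.functional as F

def hybrid_corrupt(tokens, mask_id, vocab_size, m_t, sigma_t):
    """Apply decoupled discrete (masking) and continuous (Gaussian) corruption.

    tokens:  (batch, seq_len) tensor of clean token ids
    m_t:     masking probability from the discrete noise schedule
    sigma_t: Gaussian noise scale from the continuous noise schedule
    Returns the discretely corrupted ids and the noisy continuous latent.
    """
    # Discrete corruption: independently replace each position with the mask token.
    mask = torch.rand(tokens.shape, device=tokens.device) < m_t
    corrupted_ids = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    # Continuous corruption: lift to one-hot vectors and add Gaussian noise.
    one_hot = F.one_hot(corrupted_ids, num_classes=vocab_size).float()
    z_t = one_hot + sigma_t * torch.randn_like(one_hot)
    return corrupted_ids, z_t

# Example: GPT-2-sized vocabulary with the mask token appended at index 50257.
tokens = torch.randint(0, 50257, (2, 16))
ids, z_t = hybrid_corrupt(tokens, mask_id=50257, vocab_size=50258, m_t=0.3, sigma_t=0.5)
```

Because the two schedules are separate, the masking level (which governs discrete identity corruption) and the Gaussian level (which governs rank degradation) can be chosen jointly, which is how the structured kernel can align the two quantities as in the R vs. ρ relation above.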

⚙️Evaluation Methodology: Frontier Analysis

We use frontier analysis to quantify generative quality and demonstrate why this leads to more robust results than single-point analysis.

Temperature is a crucial hyperparameter for text generation, as it allows flexible control of the trade-off between coherence and diversity. Just as the same autoregressive model may have different optimal temperatures for different tasks, completely different diffusion models will have different optimal operating regions for temperature. A global temperature of 1.0 may correspond to different regions of the diversity-coherence frontier across different models.

To address this, we use frontier analysis to evaluate generative quality across a range of temperatures, plotting the best achievable trade-off between diversity and coherence. We use this methodology across all experiments, as the cost of the temperature sweep is negligible compared to training the models themselves. This allows us to compare models based on their overall capabilities rather than at a single operating point.
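In practice this amounts to a temperature sweep followed by a Pareto filter. A small sketch of how such a frontier can be extracted (the metric functions `entropy` and `gen_ppl` are hypothetical placeholders for whatever diversity and coherence measures a given experiment uses):

```python
def pareto_frontier(points):
    """Keep the (diversity, coherence-cost) points not dominated by any other point.

    Higher diversity (e.g., sample entropy) is better; lower coherence cost
    (e.g., generative perplexity) is better.
    """
    frontier = []
    # Sort by diversity descending, breaking ties by lower coherence cost.
    for diversity, cost in sorted(points, key=lambda p: (-p[0], p[1])):
        if not frontier or cost < frontier[-1][1]:
            frontier.append((diversity, cost))
    return frontier

# Hypothetical usage: sample at each temperature, score the samples, then keep
# only the best achievable trade-offs for the model.
# points = [(entropy(sample(model, T)), gen_ppl(sample(model, T))) for T in temps]
# curve = pareto_frontier(points)
```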

💥 Single Temperature Trap

[Figure: Single-point evaluations at different temperatures]

Here we compare CANDI, DUO, and MDLM across different temperatures to demonstrate that rankings can change significantly depending on the chosen temperature. All methods lie within the same general range of entropy and perplexity, but their relative performance varies substantially with the temperature choice.

  • SAME base models within each graph
  • SAME range of entropy within each graph
  • DIFFERENT relative performances

Theory Meets Practice

Our theory of token identifiability and the temporal dissonance predict the following:

  • Continuous diffusion performs well with a small number of categories, where temporal dissonance is minimal.
  • Continuous diffusion collapses with a large number of categories, where temporal dissonance is severe.
  • CANDI effectively resolves the temporal dissonance, improving performance across both regimes of category sizes.

Using Text8 (27 categories) and OpenWebText (50,257 categories), we validate these predictions empirically. We train two pure continuous diffusion models:

  • One-Hot Diffusion: diffusion directly on one-hot vectors with Gaussian noise.
  • Embedding Diffusion: diffusion on token embeddings with Gaussian noise.

We also train a masked diffusion baseline (MDLM) and CANDI. For both datasets, we keep the architecture, learning rate, and batch size the same to eliminate confounders. To separate dataset difficulty from the analysis, we use diversity/coherence metrics specific to each dataset and measure performance relative to MDLM.

💥 Empirical Validation of Temporal Dissonance

[Figures: Text8 and OpenWebText diversity-coherence frontiers]

We validate that temporal dissonance exists in continuous diffusion baselines and demonstrate that CANDI successfully avoids it.

  • On Text8, both one-hot and embedding diffusion achieve frontiers similar to MDLM and CANDI, especially at low NFE
  • On OpenWebText, both one-hot and embedding diffusion fail, producing either essentially random outputs or mode collapse
  • CANDI matches or surpasses MDLM at both small and large numbers of categories

🚀Why Continuous Geometry Matters

Benefit 1: Better Low-NFE Performance

Coordinated refinement across positions beats independent token sampling.

Continuous gradients enable coordinated refinement across positions, rather than independent token sampling. To demonstrate this, we train a 170-million-parameter CANDI model with the diffusion transformer architecture on OpenWebText for 1,000,000 steps. We compare against a strong masked diffusion baseline (MDLM) and a strong uniform diffusion baseline (DUO), using the official checkpoints provided by the authors and the same model architecture.

💥 Improved Performance at Low NFE

[Figure: Low-NFE diversity-coherence frontiers]
Key observations:
  • At low NFE (8-16), CANDI outperforms MDLM by a significant margin at all entropy values
  • MDLM surpasses DUO at all NFE for perplexity values below 40

Benefit 2: Plug-and-Play Guidance

Learning continuous gradients allows for plug-and-play guidance with off-the-shelf classifiers.

Prior guidance methods for discrete diffusion require specially trained classifiers for each diffusion model.

Furthermore, for purely discrete diffusion methods, the gradient information must be adapted to the categorical distributions used at inference.

In contrast, CANDI can leverage off-the-shelf continuous classifiers for guidance with simple gradient addition.
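A minimal sketch of what "simple gradient addition" can look like on the continuous latent (names are ours; `classifier_logp` stands in for any off-the-shelf differentiable classifier scoring the desired property, and exactly where in the reverse step the gradient is added is a design choice we do not pin down here):

```python
import torch

def guide_latent(z_t, classifier_logp, guidance_scale):
    """Classifier-based guidance by gradient addition (illustrative sketch).

    classifier_logp(z) is assumed to return, per sample, a differentiable
    log-probability (or score) of the target property given the latent z.
    """
    z = z_t.detach().requires_grad_(True)
    score = classifier_logp(z).sum()               # reduce to a scalar for autograd
    grad = torch.autograd.grad(score, z)[0]        # d score / d z
    return (z_t + guidance_scale * grad).detach()  # shift the latent along the gradient
```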

We use the QM9 molecular dataset and compare against MDLM and UDLM, which use diffusion classifiers. We train classifiers on two chemical properties: QED and ring count. For CANDI, we train a standard classifier without any diffusion-specific augmentations.

We visualize frontiers in terms of target property value (coherence) and novel molecules generated (diversity), sweeping over temperature and showing the best frontier for each method.

💥 Competitive Performance with Off-the-Shelf Classifiers

[Figure: Molecular frontier comparison]
Key observations:
  • Using an off-the-shelf continuous classifier, CANDI achieves comparable performance to a discrete diffusion method with a specially trained classifier.
  • CANDI has a strictly better frontier than MDLM for both properties and both NFE settings

📚Concurrent Works

Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling

Huangjie Zheng et al., October 2025

Augments masked diffusion with continuous diffusion in embedding space to address "information void"


Key contributions:

  • Proposes an augmented diffusion process that combines discrete masked diffusion with continuous Gaussian diffusion
  • Leverages continuous diffusion to interpolate between mode-covering and mode-seeking sampling
  • Demonstrates impressive performance across a range of modalities and tasks, including code generation

Relationship to CANDI:

  • Both works combine masked and continuous Gaussian diffusion
  • We focus on establishing a framework for understanding Gaussian diffusion on discrete spaces; they focus on demonstrating the empirical success of hybrid methods across modalities.
  • We provide a theoretically and empirically validated explanation for why hybrid diffusion methods succeed where continuous Gaussian methods have failed; they arrive at hybrid diffusion by addressing a flaw in masked diffusion.

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou et al, October 2025

Introduces CCDD, a joint multimodal diffusion process that combines the expressivity of continuous diffusion with the trainability of discrete diffusion


Key contributions:

  • Demonstrates that continuous diffusion is strictly more expressive than discrete diffusion
  • Points toward the trainability difficulties of continuous diffusion as the limiting factor
  • Proposes joint diffusion on both the discrete and continuous spaces to address the trainability issue

Relationship to CANDI:

  • Both identify the trainability issue of continuous diffusion for discrete spaces
  • Through token identifiability and temporal dissonance, we provide a concrete mechanism for why continuous diffusion is harder to train than discrete diffusion; they focus on demonstrating the theoretical superiority of continuous diffusion over discrete diffusion.
  • We focus on low NFE and simpler guidance, specific capabilities that Gaussian diffusion provides and that are unavailable in purely discrete spaces; they focus on likelihood evaluation and general sample quality.

BibTeX

@article{pynadath2025candi,
  author  = {Patrick Pynadath and Jiaxin Shi and Ruqi Zhang},
  title   = {CANDI: Hybrid Discrete-Continuous Diffusion Models},
  journal = {arXiv preprint},
  year    = {2025},
}