TL;DR
We introduce token identifiability as a framework for understanding continuous noise on discrete data, explain why continuous diffusion has historically underperformed on discrete data, and propose CANDI as a principled solution.

CANDI: Hybrid Discrete-Continuous Diffusion Models

¹Purdue University  ²Google DeepMind

Abstract

While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions.

To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms.

  • Discrete Identity Corruption: Is an incorrect token closest to the noisy latent?
  • Continuous Rank Degradation: How many incorrect tokens are closer to the noisy latent?

We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure.

To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it.

This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE (number of function evaluations), demonstrating the value of learning continuous gradients for discrete spaces.

🧩The Puzzle

Continuous diffusion dominates images but fails on text. Why?

Continuous diffusion has revolutionized image generation, but mysteriously underperforms on discrete data like text. This is puzzling because continuous models learn score functions that enable:

  • Smooth gradients for guidance
  • Coordinated refinement instead of independent sampling

Yet recent work has shifted toward purely discrete diffusion methods that lack these properties, and those methods have generally outperformed their continuous counterparts. So what's going wrong?

🔬Our Analysis: Token Identifiability

We introduce a new framework for understanding continuous corruption of discrete data, and discover a "temporal dissonance": a fundamental misalignment in how continuous noise corrupts discrete data.

[Figure: Token identifiability under Gaussian noise]

We introduce token identifiability as a framework to understand how Gaussian noise corrupts discrete data through two distinct mechanisms:

  • Discrete Identity Corruption: Is an incorrect token closest to the noisy latent? (formalized in the sketch below)
  • Continuous Rank Degradation: How many incorrect tokens are closer to the noisy latent? (formalized in the sketch below)
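One plausible way to formalize the two quantities (the notation below is our own sketch, not necessarily the paper's exact definitions): write the noisy latent for a clean token x with representation e_x over vocabulary V as z_t = α_t e_x + σ_t ε, with ε standard Gaussian noise. Then:

```latex
% Discrete identity corruption: probability that some incorrect token's
% representation is nearest to the noisy latent z_t.
\rho(t) = \Pr\Big[\, \arg\min_{v \in V} \lVert z_t - e_v \rVert \neq x \,\Big]

% Continuous rank degradation: expected fraction of incorrect tokens whose
% representations lie closer to z_t than the true token's.
R(t) = \mathbb{E}\left[ \frac{\big|\{\, v \neq x : \lVert z_t - e_v \rVert < \lVert z_t - e_x \rVert \,\}\big|}{|V| - 1} \right]
```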

💥 Key Discovery: Temporal Dissonance

Discrete identity corruption controls how the model learns conditional dependencies between tokens. Unlike images, where nearby pixels share values due to spatial smoothness, discrete sequences lack geometric continuity: conditionally related tokens such as "New" and "York" have entirely different identities. To capture such co-occurrence patterns, the model must see clean anchor tokens during training.

Continuous rank degradation controls how the model learns the score function via continuous denoising, enabling coordinated refinement across positions. This gradient signal is the key strength of continuous diffusion: it allows multiple positions to evolve jointly. Without access to a continuous score function, updates rely on independent conditional sampling, which leads to incoherent generations, especially at low NFE, where many positions must be updated in parallel.

These mechanisms scale differently with vocabulary size, creating a temporal dissonance: when discrete structure is learnable, continuous denoising is trivial; when continuous denoising is meaningful, discrete structure is destroyed.
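The scaling argument can be checked with a small simulation (an illustrative sketch of ours, assuming one-hot token representations; it is not the paper's code): corrupt a one-hot token with Gaussian noise and measure how often a wrong token ends up closest (identity corruption) versus the average fraction of wrong tokens that are closer than the true one (rank degradation) as the vocabulary grows.

```python
import numpy as np

def identifiability_stats(vocab_size, sigma, n_trials=2000, seed=0):
    """Estimate discrete identity corruption and continuous rank degradation
    for Gaussian noise on one-hot token vectors (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    identity_corrupted = 0
    rank_degradation = 0.0
    for _ in range(n_trials):
        x = rng.integers(vocab_size)                 # clean token index
        z = sigma * rng.standard_normal(vocab_size)  # Gaussian noise ...
        z[x] += 1.0                                  # ... added to the one-hot vector
        # With one-hot vectors, the token nearest to z is argmax(z), and a wrong
        # token v is closer than the true token exactly when z[v] > z[x].
        identity_corrupted += (z.argmax() != x)
        rank_degradation += (z > z[x]).sum() / (vocab_size - 1)
    return identity_corrupted / n_trials, rank_degradation / n_trials

for vocab in (27, 1000, 50257):                      # Text8-sized up to GPT-2-sized vocab
    rho, R = identifiability_stats(vocab, sigma=0.4)
    print(f"|V|={vocab:6d}  identity corruption={rho:.2f}  rank degradation={R:.3f}")
```

In this toy setting, the per-token chance of being out-ranked is roughly constant in |V|, so rank degradation barely moves, while the chance that some wrong token is closest grows quickly with vocabulary size: the dissonance in miniature.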

[Figure: Discrete vs. continuous noise]

💡The Solution: CANDI

Decouple discrete and continuous corruption to get the best of both worlds.

[Figure: CANDI visualization]

Continuous ANd DIscrete diffusion solves temporal dissonance with one key insight: decouple the corruptions.

By separating discrete and continuous noise schedules, CANDI enables simultaneous learning of:

  • Conditional structure (discrete)
  • Continuous geometry (score functions)

💥 Structured Noising Kernel

We introduce a hybrid noising kernel that uses masking to decouple discrete corruption from Gaussian noise. Paradoxically, decoupling the two lets us coordinate them more tightly: as we show below, the structured kernel yields a linear relation between discrete identity corruption and continuous rank degradation. (A minimal implementation sketch follows the figure below.)

[Equations: Gaussian (continuous) noise kernel, masking (discrete) noise kernel, and the hybrid noising kernel]
[Figure: Continuous rank degradation R vs. discrete identity corruption ρ under the hybrid kernel]
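As a concrete illustration, here is a minimal sketch of what a decoupled hybrid corruption step could look like, assuming one-hot latents, a dedicated mask token, and independently scheduled masking probability `m_t` and Gaussian noise scale `sigma_t` (the names and parameterization are ours, not necessarily CANDI's exact kernel):

```python
import torch
import torch.nn.functional as F

def hybrid_corrupt(tokens, mask_id, vocab_size, m_t, sigma_t):
    """Apply decoupled discrete (masking) and continuous (Gaussian) corruption.

    tokens:  (batch, seq_len) tensor of clean token ids
    m_t:     masking probability from the discrete noise schedule
    sigma_t: Gaussian noise scale from the continuous noise schedule
    Returns the discretely corrupted ids and the noisy continuous latent.
    """
    # Discrete corruption: independently replace each position with the mask token.
    mask = torch.rand(tokens.shape, device=tokens.device) < m_t
    corrupted_ids = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    # Continuous corruption: lift to one-hot vectors and add Gaussian noise.
    one_hot = F.one_hot(corrupted_ids, num_classes=vocab_size).float()
    z_t = one_hot + sigma_t * torch.randn_like(one_hot)
    return corrupted_ids, z_t

# Example: GPT-2-sized vocabulary with the mask token appended at index 50257.
tokens = torch.randint(0, 50257, (2, 16))
ids, z_t = hybrid_corrupt(tokens, mask_id=50257, vocab_size=50258, m_t=0.3, sigma_t=0.5)
```

Because the two schedules are separate, the masking level (which governs discrete identity corruption) and the Gaussian level (which governs rank degradation) can be chosen jointly, which is how the structured kernel can align the two quantities as in the R vs. ρ relation above.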

⚙️Evaluation Methodology: Frontier Analysis

We use frontier analysis to quantify generative quality and demonstrate why this leads to more robust results than single-point analysis.

Temperature is a crucial hyperparameter for text generation, as it allows flexible control of the trade-off between coherence and diversity. Just as the same autoregressive model may have different optimal temperatures for different tasks, completely different diffusion models will have different optimal operating regions for temperature. A global temperature of 1.0 may correspond to different regions of the diversity-coherence frontier across different models.

To address this, we use frontier analysis to evaluate generative quality across a range of temperatures, plotting the best achievable trade-off between diversity and coherence. We use this methodology across all experiments, as the cost of the temperature sweep is negligible compared to training the models themselves. This allows us to compare models based on their overall capabilities rather than at a single operating point.
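In practice this amounts to a temperature sweep followed by a Pareto filter. A small sketch of how such a frontier can be extracted (the metric functions `entropy` and `gen_ppl` are hypothetical placeholders for whatever diversity and coherence measures a given experiment uses):

```python
def pareto_frontier(points):
    """Keep the (diversity, coherence-cost) points not dominated by any other point.

    Higher diversity (e.g., sample entropy) is better; lower coherence cost
    (e.g., generative perplexity) is better.
    """
    frontier = []
    # Sort by diversity descending, breaking ties by lower coherence cost.
    for diversity, cost in sorted(points, key=lambda p: (-p[0], p[1])):
        if not frontier or cost < frontier[-1][1]:
            frontier.append((diversity, cost))
    return frontier

# Hypothetical usage: sample at each temperature, score the samples, then keep
# only the best achievable trade-offs for the model.
# points = [(entropy(sample(model, T)), gen_ppl(sample(model, T))) for T in temps]
# curve = pareto_frontier(points)
```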

💥 Single Temperature Trap

[Figure: Single-point evaluations at different temperatures]

Here we compare CANDI, DUO, and MDLM across different temperatures to demonstrate that rankings can change significantly depending on the chosen temperature. All methods lie within the same general range of entropy and perplexity, but their relative performance varies substantially with the temperature choice.

  • SAME base models within each graph
  • SAME range of entropy within each graph
  • DIFFERENT relative performances

Theory Meets Practice

Our theory of token identifiability and the temporal dissonance predict the following:

  • Continuous diffusion performs well with a small number of categories, where temporal dissonance is minimal.
  • Continuous diffusion collapses with a large number of categories, where temporal dissonance is severe.
  • CANDI effectively resolves the temporal dissonance, improving performance across both regimes of category sizes.

Using Text8 (27 categories) and OpenWebText (50,257 categories), we validate these predictions empirically. We train two pure continuous diffusion models:

  • One-Hot Diffusion: diffusion directly on one-hot vectors with Gaussian noise.
  • Embedding Diffusion: diffusion on token embeddings with Gaussian noise.

We also train a masked diffusion baseline (MDLM) and CANDI. For both datasets, we keep the architecture, learning rate, and batch size the same to eliminate confounders. To separate dataset difficulty from the analysis, we use diversity/coherence metrics specific to each dataset and measure performance relative to MDLM.

💥 Empirical Validation of Temporal Dissonance

[Figures: Text8 and OpenWebText diversity-coherence frontiers]

We validate that temporal dissonance exists in continuous diffusion baselines and demonstrate that CANDI successfully avoids it.

  • On Text8, both one-hot and embedding diffusion achieve frontiers similar to MDLM and CANDI, especially at low NFE
  • On OpenWebText, both one-hot and embedding diffusion fail, producing either essentially random outputs or mode collapse
  • CANDI matches or surpasses MDLM at both small and large numbers of categories

🚀Why Continuous Geometry Matters

Benefit 1: Better Low-NFE Performance

Coordinated refinement across positions beats independent token sampling.

Continuous gradients enable coordinated refinement across positions, rather than independent token sampling. To demonstrate this, we train a 170-million-parameter CANDI model with the diffusion transformer architecture on OpenWebText for 1,000,000 steps. We compare against a strong masked diffusion baseline (MDLM) and a strong uniform diffusion baseline (DUO), using the official checkpoints provided by the authors and the same model architecture.

💥 Improved Performance at Low NFE

[Figure: Low-NFE diversity-coherence frontiers]
Key observations:
  • At low NFE (8-16), CANDI outperforms MDLM by a significant margin at all entropy values
  • MDLM surpasses DUO at all NFE for perplexity values below 40

Benefit 2: Plug-and-Play Guidance

Learning continuous gradients allows for plug-and-play guidance with off-the-shelf classifiers.

Prior guidance methods for discrete diffusion require specially trained classifiers for each diffusion model.

Furthermore, for purely discrete diffusion methods, the gradient information must be adapted to the categorical distributions used at inference.

In contrast, CANDI can leverage off-the-shelf continuous classifiers for guidance with simple gradient addition.
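A minimal sketch of what "simple gradient addition" can look like on the continuous latent (names are ours; `classifier_logp` stands in for any off-the-shelf differentiable classifier scoring the desired property, and exactly where in the reverse step the gradient is added is a design choice we do not pin down here):

```python
import torch

def guide_latent(z_t, classifier_logp, guidance_scale):
    """Classifier-based guidance by gradient addition (illustrative sketch).

    classifier_logp(z) is assumed to return, per sample, a differentiable
    log-probability (or score) of the target property given the latent z.
    """
    z = z_t.detach().requires_grad_(True)
    score = classifier_logp(z).sum()               # reduce to a scalar for autograd
    grad = torch.autograd.grad(score, z)[0]        # d score / d z
    return (z_t + guidance_scale * grad).detach()  # shift the latent along the gradient
```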

We use the QM9 molecular dataset and compare against MDLM and UDLM, which use diffusion classifiers. We train classifiers on two chemical properties: QED and ring count. For CANDI, we train a standard classifier without any diffusion-specific augmentations.

We visualize frontiers in terms of target property value (coherence) and novel molecules generated (diversity), sweeping over temperature and showing the best frontier for each method.

💥 Competitive Performance with Off-the-Shelf Classifiers

[Figure: Molecular frontier comparison]
Key observations:
  • Using an off-the-shelf continuous classifier, CANDI achieves comparable performance to a discrete diffusion method with a specially trained classifier.
  • CANDI has a strictly better frontier than MDLM for both properties and both NFE settings

📚Concurrent Works

Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling

Huangjie Zheng et al., October 2025

Augments masked diffusion with continuous diffusion in embedding space to address "information void"


Key contributions:

  • Proposes an augmented diffusion process that combines discrete masked diffusion with continuous Gaussian diffusion
  • Leverages continuous diffusion to interpolate between mode-covering and mode-seeking sampling
  • Demonstrates impressive performance across a range of modalities and tasks, including code generation

Relationship to CANDI:

  • Both works combine masked and continuous Gaussian diffusion
  • We focus on establishing a framework for understanding Gaussian diffusion on discrete spaces; they focus on demonstrating the empirical success of hybrid methods across modalities.
  • We provide a theoretically and empirically validated explanation for why hybrid diffusion methods succeed where continuous Gaussian methods have failed; they arrive at hybrid diffusion by addressing a flaw in masked diffusion.

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou et al, October 2025

Introduces CCDD, a joint multimodal diffusion process that combines the expressivity of continuous diffusion with the trainability of discrete diffusion


Key contributions:

  • Demonstrates that continuous diffusion is strictly more expressive than discrete diffusion
  • Points toward the trainability difficulties of continuous diffusion as the limiting factor
  • Proposes joint diffusion on both the discrete and continuous spaces to address the trainability issue

Relationship to CANDI:

  • Both identify the trainability issue of continuous diffusion for discrete spaces
  • Through token identifiability and temporal dissonance, we provide a concrete mechanism for why continuous diffusion is harder to train than discrete diffusion; they focus on demonstrating the theoretical superiority of continuous diffusion over discrete diffusion.
  • We focus on low NFE and simpler guidance, specific capabilities that Gaussian diffusion provides and that are unavailable in purely discrete spaces; they focus on likelihood evaluation and general sample quality.

BibTeX

@article{pynadath2025candi,
  author  = {Patrick Pynadath and Jiaxin Shi and Ruqi Zhang},
  title   = {CANDI: Hybrid Discrete-Continuous Diffusion Models},
  journal = {arXiv preprint},
  year    = {2025},
}