February 21, 2026

Continuous Diffusion Language Models: Some Nice Visualizations

Some (hopefully) nice visualizations for understanding diffusion LMs

Patrick Pynadath, Ruqi Zhang


Diffusion large language models (dLLMs) are starting to show benefits over their auto-regressive (AR) counterparts, particularly since the introduction of Mercury 2, described as “the fastest reasoning LLM”. By matching or outperforming AR models while providing significant throughput advantages, dLLMs are becoming more widely known.


The goal of this blog is to provide some nice visualizations and background on continuous diffusion LLMs for those who may be interested. As my focus is primarily on continuous dLLMs, I focus on visualizing and discussing some recent discoveries within this field. I hope that this is helpful for those starting to learn about this area, as it was certainly difficult for me to visualize adding Gaussian noise to language.


Continuous Denoising

While discrete denoising has arguably seen the most success in the language modeling context, applying continuous denoising methods to language modeling remains an active field of research.

Here we visualize the decorruption of a noisy latent under a linear interpolation schedule. The words spreading out represent the continuous denoising process, and the “flipping” of the bold tokens represents discrete decorruption, i.e., the moment the continuous latent becomes closest to the correct token. We visualize this for a vocabulary of 50,257 words, which corresponds to the number of unique tokens for GPT-2.
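The "flip" described above can be sketched numerically. This is a toy model, not the visualized system: random unit vectors stand in for a learned embedding table, and decoding is a simple nearest-neighbor argmax against that table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: random unit vectors stand in for a learned embedding table.
V, d = 1000, 64
E = rng.standard_normal((V, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)

x0 = E[0]                          # clean embedding of the "correct" token
noise = rng.standard_normal(d)     # fixed Gaussian noise sample

def latent(t):
    """Linear interpolation schedule: t = 1 is pure noise, t = 0 is clean."""
    return (1.0 - t) * x0 + t * noise

# Sweep t from noise to clean and report when the latent first decodes
# (nearest-neighbor argmax) to the correct token, i.e. when the token "flips".
for t in np.linspace(1.0, 0.0, 101):
    if np.argmax(E @ latent(t)) == 0:
        print(f"token flips to correct at t = {t:.2f}")
        break
```

The flip time depends on the particular noise sample, which is why the bold tokens in the visualization flip at different moments for different positions.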

Continuous Linear Interpolation

Discrete Corruption of Continuous States

Even when considering continuous denoising, it is possible to characterize the effect of Gaussian noise in terms of discrete corruption [8]. Recent work has demonstrated that Gaussian noise induces undesirable scaling with vocabulary size: with a sufficiently large vocabulary, most of the discrete corruption becomes concentrated towards the end of the denoising process [9], [10]. In our recent work [10], we called this temporal dissonance, reflecting the misalignment between continuous and discrete corruption at large vocabularies. We visualize this below: with a vocabulary size of 5, discrete decorruption is relatively spread throughout the process. When we scale to 50,257, discrete decorruption becomes concentrated within a narrow band of the denoising process.

Discrete Decorruption at 5 Here we show the discrete corruption effect simulated for a vocabulary size of 5. The discrete decorruption is relatively spread out.

Discrete Decorruption at 50,257 Here we show the discrete corruption effect simulated for a vocabulary size of 50,257. The discrete decorruption is concentrated towards the end, with all positions being decorrupted at almost the same time.
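The concentration effect can be reproduced in the same toy model used above: random unit embeddings, linear interpolation from Gaussian noise to a clean embedding, and nearest-neighbor decoding. This is an illustration of the scaling behavior, not the analysis from [9] or [10].

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_times(V, d=64, n_samples=50, n_steps=51):
    """Monte Carlo estimate of when latents first decode to the correct token.

    Toy model: random unit embeddings, linear interpolation from Gaussian
    noise (t = 1) to a clean embedding (t = 0); decoding is nearest-neighbor
    argmax against the embedding table.
    """
    E = rng.standard_normal((V, d))
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    ts = np.linspace(1.0, 0.0, n_steps)
    times = []
    for _ in range(n_samples):
        k = int(rng.integers(V))
        noise = rng.standard_normal(d)
        Z = np.outer(1.0 - ts, E[k]) + np.outer(ts, noise)  # (n_steps, d)
        decoded = np.argmax(Z @ E.T, axis=1)                # decode each step
        first = int(np.argmax(decoded == k))  # first step decoding correctly
        times.append(ts[first])
    return np.array(times)

small, large = flip_times(5), flip_times(50_257)
print(f"V=5      mean flip time {small.mean():.2f}, std {small.std():.2f}")
print(f"V=50,257 mean flip time {large.mean():.2f}, std {large.std():.2f}")
```

With the large vocabulary, the correct token must beat the maximum of tens of thousands of competing inner products, which grows with vocabulary size; flips therefore happen later (at smaller t) and within a narrower band.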

Why Continuous Language Models need Discrete Structure

It may not be immediately clear why continuous language models should be concerned with discrete corruption. The short explanation is that discrete conditional dependencies must be learned from data, whereas continuous data such as natural images has an inherently smooth structure in which nearby pixels share similar values.


Conditional Structure in Language


Consider the masked sentence below.

I live in [Mask] York City. 

The reader will probably recognize the mask token should be “New”. This is not due to “New” or “York” being semantically similar on their own, or being similar in terms of characters — it is because “York City” is usually preceded by “New” whenever the former appears in text. In other words, the conditional relation between “York City” and “New” is something observed through data — not through anything inherent to the words themselves.



Conditional Structure in Natural Images


In contrast, take this noisy picture of two cats. While the image is blurry, it is clear what colors the pixels should be: the noisy pixels on the bottom should be blue, those in the middle gray, and those on the top black. The reader could probably infer this despite never having seen these cats. This is because images are spatially smooth: nearby pixels tend to share similar values. Thus the sample itself has a correlational structure that does not depend on prior data.


In summary, the conditional structure within language must be learned: even if a noisy position is right next to a clean position, there is no way to tell what word the noisy position should be without some knowledge of which words belong together. Within images, it can be assumed that nearby pixels will tend to have similar values (except at high-frequency features, like edges).

In order for a model to learn which words go together, it seems necessary to train the model to condition on some words that retain their original identity and predict the remaining words. The number of clean words the model can condition on is determined by the discrete corruption level, which is why this quantity is controlled more carefully in recent work [9], [10].
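The conditioning setup above can be sketched as a simple corruption routine. The token ids and the `mask_id` placeholder are hypothetical; real implementations use a reserved vocabulary entry and typically batch this over sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_corrupt(tokens, rho, mask_id=-1):
    """Mask each position independently with probability rho.

    The surviving clean tokens are what the model conditions on; the
    masked positions are its prediction targets.
    """
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < rho
    return np.where(masked, mask_id, tokens), masked

tokens = [17, 3, 42, 8, 99, 5]          # hypothetical token ids
corrupted, targets = discrete_corrupt(tokens, rho=0.5)
print("corrupted sequence:  ", corrupted)
print("positions to predict:", np.flatnonzero(targets))
```

Raising `rho` leaves fewer clean words to condition on, which is exactly the quantity the discrete corruption level controls.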


The same explanation may not apply to continuous denoising on latent representations, where the information for each token is spread throughout the entire representation being noised. However, such methods have typically found that while denoising does not require discrete conditioning, the representation learning phase still does.

Noise Reparameterization

Continuous Noise Reparameterization

One way to avoid the problematic behavior of Gaussian noise at large vocabularies is to reparameterize the time of the denoising process so that it is linear with respect to discrete corruption [9]. By enabling purely continuous denoising to better capture discrete conditional dependencies, this makes it possible to adapt distillation techniques from continuous flow matching to language modeling, a promising direction for even quicker few-step generation [11], [9].
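The reparameterization can be sketched as numerically inverting a monotone corruption curve. The curve `rho(t) = t**8` below is a hypothetical stand-in that concentrates corruption at one end, mimicking large-vocabulary behavior; it is not the closed-form curve from the cited work.

```python
import numpy as np

# Hypothetical monotone discrete-corruption curve: corruption concentrated
# near one end of the schedule, mimicking large-vocabulary behavior.
def rho(t):
    return t ** 8

ts = np.linspace(0.0, 1.0, 1001)
rhos = rho(ts)

def warp(s):
    """Numeric inverse of rho: find t with rho(t) = s, so that stepping s
    uniformly yields discrete corruption that is linear in s."""
    return float(np.interp(s, rhos, ts))

# Uniform steps in warped time s spend most of the schedule where the
# discrete flips actually happen.
for s in np.linspace(0.0, 1.0, 5):
    print(f"s = {s:.2f} -> t = {warp(s):.3f}")
```

Under the warped clock, equal-sized steps correspond to equal amounts of discrete decorruption, which is what makes few-step distillation schedules better behaved.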

Hybrid Methods

Another way of avoiding the temporal dissonance is to decouple discrete from continuous noise by using an additional discrete noising schedule. Decoupling the two forms of corruption enables independent control over each, which allows for learning discrete and continuous geometry simultaneously.
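Decoupling can be sketched as two independent corruption operators applied to the same embeddings. This is a toy illustration of the idea, not the exact formulation from the cited work; the zero mask embedding is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_corrupt(emb, rho, sigma):
    """Decoupled corruption: discrete masking at rate rho and Gaussian
    noise at scale sigma, each controlled independently.

    emb: (seq_len, d) array of clean token embeddings.
    """
    seq_len, d = emb.shape
    mask_vec = np.zeros(d)                      # hypothetical mask embedding
    masked = rng.random(seq_len) < rho          # discrete corruption
    out = np.where(masked[:, None], mask_vec, emb)
    out = out + sigma * rng.standard_normal((seq_len, d))  # continuous noise
    return out, masked

emb = rng.standard_normal((6, 16))
noisy, masked = hybrid_corrupt(emb, rho=0.3, sigma=0.1)
print("masked positions:", np.flatnonzero(masked))
```

Because `rho` and `sigma` are separate knobs, the number of clean conditioning tokens no longer depends on the continuous noise level, sidestepping the temporal dissonance.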

Hybrid Decorruption

Fun Widget

Below is a small interactive widget that allows the user to move through the denoising processes for the methods discussed in this blog. Hopefully this provides a nice visual intuition as to how these methods denoise language — particularly the continuous approaches, as it wasn’t clear to me how to visualize continuous noise for language.

Token Emergence from Noise

[Interactive widget: for vocabularies of V=5 (n=5) and V=50,257 (n=12), competing candidate tokens are shown at each position as the latent is denoised, with tracks for discrete corruption ρ(t) and rank degradation r(t) from noise (t = 1) to clean (t = 0).]

Linear Interpolation: Discrete corruption is controlled through continuous noise, which follows the flow matching linear interpolation schedule.

References

[1]
J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg, “Structured denoising diffusion models in discrete state-spaces,” Advances in Neural Information Processing Systems, vol. 34, pp. 17981–17993, 2021.
[2]
S. S. Sahoo et al., “Simple and Effective Masked Diffusion Language Models,” no. arXiv:2406.07524, Nov. 2024, doi: 10.48550/arXiv.2406.07524.
[3]
J. Shi, K. Han, Z. Wang, A. Doucet, and M. K. Titsias, “Simplified and Generalized Masked Diffusion for Discrete Data,” no. arXiv:2406.04329, Jun. 2024, doi: 10.48550/arXiv.2406.04329.
[4]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” 2019. [Online]. Available: https://arxiv.org/abs/1810.04805
[5]
A. Wang and K. Cho, “BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model.” 2019. [Online]. Available: https://arxiv.org/abs/1902.04094
[6]
J. Ou et al., “Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data.” 2025. [Online]. Available: https://arxiv.org/abs/2406.03736
[7]
Y. Schiff et al., “Simple Guidance Mechanisms for Discrete Diffusion Models,” arXiv preprint arXiv:2412.10193, 2024.
[8]
S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. T. Chiu, and V. Kuleshov, “The Diffusion Duality,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=9P9Y8FOSOk
[9]
C. Lee et al., “One-step Language Modeling via Continuous Denoising.” 2026. [Online]. Available: https://arxiv.org/abs/2602.16813
[10]
P. Pynadath, J. Shi, and R. Zhang, “CANDI: Hybrid Discrete-Continuous Diffusion Models.” 2025. [Online]. Available: https://arxiv.org/abs/2510.22510
[11]
D. Roos et al., “Categorical Flow Maps.” 2026. [Online]. Available: https://arxiv.org/abs/2602.12233