March 6, 2026

Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

Discussion on Evaluation Methodology for dLLMs


Patrick Pynadath1, Jiaxin Shi2, Ruqi Zhang1

1 Purdue Department of Computer Science

2 jiaxns.io


[Interactive demo: entropy ranking] Depending on inference-time settings, it is possible to obtain entirely different relative rankings between the same models if differences in entropy are unaccounted for. However, generative frontiers reveal that these are really just different views of the same underlying model performance.


Introduction

Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. Unlike autoregressive generation, which is constrained to a single left-to-right factorization, diffusion models admit a wide variety of training objectives, noising processes, and inference algorithms — enabling parallel generation, flexible decoding orders, and controllable generation in ways that are difficult or impossible for autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT-2 small (~150 million parameters). This scale serves as a natural proving ground: it is tractable on an academic budget, yet meaningful enough to demonstrate whether a new approach has promise before committing to the cost of scaling further.


In this blog, we discuss pretraining at this model scale.

  1. We first discuss why OpenWebText has become the standard within this regime, and why alternatives such as LM1B are inherently less meaningful.
  2. We explain why likelihood-based evaluations become questionable when comparing models with different ELBOs and thus different slack to the likelihood value.
  3. Generative perplexity is misleading on its own, as repetition can easily yield good perplexity values at the expense of diversity.
  4. Due to the fundamental relation between entropy and cross-entropy, diffusion models can trade token diversity (entropy) for better perplexity (cross-entropy), causing single-point comparisons to reflect inference settings rather than model capability.
  5. Generative frontiers address this by revealing the full entropy-perplexity tradeoff curve for each method, which enables comparisons based on divergence to target distribution under reasonable assumptions.


Dataset Choice: Why OpenWebText is the Standard

It is common to pretrain a ~150 million parameter model with a diffusion transformer architecture on OpenWebText [1] [2] [3] [4]. Like the original WebText, the data is obtained by scraping all outbound links from Reddit posts with at least 3 karma, which serves as a data curation mechanism [5]. This results in a large (~9 billion tokens), diverse, yet relatively high quality set of text. For this reason, it has become the de facto pretraining set for studying language models at this scale: it is usually close to the edge of what is attainable on an academic budget, but is still similar enough to pretraining data to obtain meaningful performance metrics.


Additionally, since it is meant as a replication of WebText, the dataset GPT-2 was trained on, it is a very well-studied dataset [5] [6]. This also enables us to use metrics such as generative perplexity under GPT-2 Large as a proxy for data likelihood, which we discuss more in the following section.

LM1B is Shuffled

While there are smaller datasets such as LM1B that can be used for training language models, it is important to note that these datasets often differ from pretraining data in substantial ways. In particular, LM1B consists of roughly 1 billion tokens from news sources, shuffled at the sentence level [7]. As a result, there is no notion of coherence across sentences, since the ordering is fundamentally random. This was intentional: the benchmark was introduced in 2013, when neural networks were still being compared against n-gram models, as a test of whether neural networks could model language at the sentence level better than n-grams. Even at the time of introduction, it was primarily intended for likelihood-based evaluation. For demonstrating language ability beyond generating isolated sentences, LM1B is not an ideal pretraining dataset. This is why most diffusion language model papers use LM1B only for likelihood or zero-shot likelihood evaluation [2] [4] [8].

Limitations of Likelihood Evals with dLLMs

While the question of what scale to train models at and what data to use is fairly settled, the evaluation problem remains. The original GPT-2 paper demonstrates that pretraining on WebText enabled zero-shot performance on a wide range of tasks, but this performance usually required scaling up the number of parameters, which is difficult when ~150 million is already quite expensive. Furthermore, some of the primary results are zero-shot likelihoods on unseen datasets, compared across model sizes to demonstrate scaling behavior [5]. This was meaningful because all models were autoregressive and thus measured likelihood in exactly the same way. With different ELBO formulations, comparing likelihoods directly is questionable [9].

Different ELBOs, Different Slack to Likelihood

It is possible to express the log likelihood as follows, for any variational distribution $q(\mathbf{z}|\mathbf{x})$:

$$\log p(\mathbf{x}) = \mathcal{L}(q) + D_{\mathrm{KL}}\big(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x})\big)$$

where

$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\right]$$

is the evidence lower bound (ELBO). Since $D_{\mathrm{KL}} \geq 0$, we have $\log p(\mathbf{x}) \geq \mathcal{L}(q)$, with equality if and only if $q(\mathbf{z}|\mathbf{x}) = p(\mathbf{z}|\mathbf{x})$. Crucially, the KL divergence term depends on both the true posterior, which differs across models with different latent variable structures, and the choice of variational family. As this gap is generally intractable to compute, direct comparison of ELBOs across models with different latent structures is unreliable for model selection [10].

Comparing Generative Quality

Given the difficulties with comparing ELBOs across models with different formulations, the field has moved towards generative metrics. These metrics — exemplified by generative perplexity for coherence and unigram entropy for diversity — are conceptually straightforward: they only require the ability to sample from the model. However, used in isolation or at a single operating point, they can still lead to uninformative or misleading comparisons.


In this section, we build up a principled framework for generative evaluation. We start from the ideal metric — KL divergence from the target distribution — and show how it decomposes naturally into generative perplexity and unigram entropy. This decomposition reveals why single-point comparisons are fundamentally ambiguous: the same KL divergence can manifest as many different perplexity-entropy pairs, and minor changes in entropy can produce large changes in perplexity due to the exponential relationship between the two. We then introduce generative frontiers as a natural solution, and show why frontier dominance is a sufficient condition for lower KL divergence — making it a principled basis for model comparison. We conclude by discussing what constitutes a reasonable entropy range and under what conditions the approximations underlying frontier analysis hold.

The Problem with Generative Perplexity

Generative perplexity is a popular metric for evaluating the generative quality of models at this scale. Intuitively, it reflects how confused a reference autoregressive model is by the evaluated samples. More formally, generative perplexity is defined as follows, for a generative distribution $q_{\text{gen}}$ and a reference distribution $p_{\text{ref}}$:

$$\text{GenPPL} = \exp\left(\mathbb{E}_{X \sim q_{\text{gen}}}\left[\frac{1}{n}\sum_{i=1}^{n} -\log p_{\text{ref}}(x_i \mid x_{<i})\right]\right) = \exp\big(H(q_{\text{gen}}, p_{\text{ref}})\big)$$

This metric is popular because it is simple and relatively cheap: all that is required is the ability to sample from $q_{\text{gen}}$ and evaluate $p_{\text{ref}}$, which can be done in a single forward pass with AR models.
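As a minimal sketch, the Monte-Carlo estimate above can be computed directly from per-token reference log-probabilities; the sequences and log-prob values below are made up purely for illustration:

```python
import math

def generative_perplexity(ref_logprobs):
    """Generative perplexity from per-token reference log-probs.

    `ref_logprobs` is a list of sampled sequences; each sequence is the
    list of log p_ref(x_i | x_<i) values that the reference AR model
    assigns to a sample drawn from q_gen (hypothetical values here).
    """
    # Average negative log-likelihood per token, then across sequences.
    nll = sum(-sum(seq) / len(seq) for seq in ref_logprobs) / len(ref_logprobs)
    return math.exp(nll)  # exp of the Monte-Carlo cross-entropy estimate

# Toy example: two "sampled sequences" with made-up reference log-probs.
samples = [
    [math.log(0.25), math.log(0.5), math.log(0.125)],
    [math.log(0.5), math.log(0.25), math.log(0.25)],
]
print(round(generative_perplexity(samples), 3))  # 3.564 (= 2**(11/6))
```

Note that the estimate only ever touches samples and reference log-probs, which is exactly why the metric is so cheap to compute.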


However, generative perplexity faces significant limitations: it is possible for very repetitive sequences to achieve competitive perplexity results yet correspond to poor generative quality upon visual inspection [11]. As a result, the field has moved towards also reporting unigram entropy, which captures the diversity of word usage within a sequence. This directly captures the failure mode where the model uses repetitive sequences to achieve low perplexity. The metric is calculated by computing the empirical distribution over words across the entire sequence and taking its entropy. We denote this metric as $\tilde{H}$, and we include the definition below in terms of the empirical frequency $\hat{p}$ for a sample:

$$\tilde{H}(q) = -\sum_{w \in \mathcal{V}} \hat{p}(w) \log \hat{p}(w)$$
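A short sketch of this computation, with hypothetical token sequences, shows why unigram entropy catches the repetition failure mode that perplexity alone misses:

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Entropy of the empirical unigram distribution over a sequence."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log(c / n) for c in counts.values())

# A repetitive sequence scores low entropy even though a reference model
# could assign it high likelihood (i.e. low generative perplexity).
repetitive = ["the", "cat", "the", "cat"] * 16
diverse = ["the", "cat", "sat", "on", "a", "warm", "red", "mat"] * 8
print(unigram_entropy(repetitive) < unigram_entropy(diverse))  # True
```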

While this does meaningfully avoid extreme failure cases, we observe in [12] that even minor entropy changes are sufficient to completely change relative performance between different models. We include a visual comparison below using Duo, MDLM, and CANDI [8] [2] [12].


[Figure: Entropy Empirical — entropy and perplexity rankings of MDLM, DUO, and CANDI across three temperature settings]


As shown, all of the entropies are relatively similar and seem reasonable — there is no obvious “hacking” in terms of being overly repetitive. However, these minor changes in unigram entropy lead to completely different rankings between methods. And crucially, none of these results are misleading or contradictory — these are all performance metrics that can be obtained by tuning the temperature, so all of these are valid. We include the temperatures used to achieve these results below.

Temperatures by Plot

Method   Left    Middle   Right
MDLM     0.925   1.0      0.9
DUO      1.0     0.9      1.0
CANDI    0.875   0.925    0.95

Why This Happens: A Principled View

Here we take a first-principles approach to understand why generative perplexity is empirically sensitive to unigram entropy. We can interpret the relation between generative perplexity and unigram entropy by looking at how the KL divergence between two distributions decomposes into a cross-entropy and an entropy term. More formally, we have the following:

$$\mathrm{KL}(q_{\text{gen}} \,\|\, p_{\text{ref}}) = H(q_{\text{gen}}, p_{\text{ref}}) - H(q_{\text{gen}})$$

The first term is the cross-entropy between the generative distribution and the reference distribution, which is precisely what is used to calculate generative perplexity. The second term is the entropy of the generative distribution, which we can view as being approximated by the empirical unigram entropy.
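This decomposition is easy to verify numerically; the sketch below uses arbitrary toy distributions over a four-token vocabulary:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(q, p):
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Toy distributions over a 4-token vocabulary (illustrative values).
q_gen = [0.4, 0.3, 0.2, 0.1]
p_ref = [0.25, 0.25, 0.25, 0.25]

# KL(q || p) = H(q, p) - H(q), term for term.
lhs = kl(q_gen, p_ref)
rhs = cross_entropy(q_gen, p_ref) - entropy(q_gen)
print(abs(lhs - rhs) < 1e-12)  # True
```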


This reveals that the metrics the community already tracks together are not independent — they correspond to the two components of KL divergence from the reference distribution, providing a natural theoretical grounding for why both are necessary.


However, the exact joint entropy $H(q)$ is intractable to compute, meaning we cannot measure the full KL divergence directly. This is why the field relies primarily on measuring the cross-entropy $H(q_{\text{gen}}, p_{\text{ref}})$, using unigram entropy to rule out degenerate cases. Viewing both metrics through the lens of KL divergence reveals a subtler issue: simultaneously decreasing $H(q_{\text{gen}}, p_{\text{ref}})$ and $H(q_{\text{gen}})$ leaves the actual distance to the reference distribution unchanged, yet produces better generative perplexity. Critically, because generative perplexity is exponential in $H(q_{\text{gen}}, p_{\text{ref}})$, even minor entropy shifts that leave the unigram entropy well within the reasonable range can produce large changes in generative perplexity without any corresponding change in the actual distance to the target distribution. Thus even when entropy is reported as a sanity check, small differences between methods can flip rankings without reflecting meaningful differences in generative quality.
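To make the exponential sensitivity concrete, the sketch below holds KL fixed and varies only the entropy term (all values illustrative):

```python
import math

def gen_ppl(kl, entropy):
    # GenPPL = exp(H(q, p_ref)) = exp(KL(q || p_ref) + H(q))
    return math.exp(kl + entropy)

kl = 1.5  # held fixed: the distance to the reference does not change
for h in (5.0, 5.2, 5.4):
    print(f"H = {h:.1f} -> GenPPL = {gen_ppl(kl, h):8.1f}")
# A 0.4-nat entropy shift changes GenPPL by roughly 50% at identical KL.
```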


We include a small interactive widget below, where it is possible to fix the KL divergence and observe how small changes in entropy can manifest as large generative perplexity shifts.


[Interactive widget: Gen Perplexity vs Entropy at Fixed KL. With GenPPL = exp(KL + H), holding KL = 1.50 fixed while sliding H across [4.50, 5.80] swings GenPPL from 403.4 to 1480.3.]

Single Point Metrics are Insufficient

Given the relation between entropy, generative perplexity, and KL divergence, it may seem that reporting both generative perplexity and unigram entropy is sufficient. However, we argue that single-point comparisons fall short for two reasons: LLMs parameterize many distributions via inference-time settings, which calls for a holistic metric; and single-point entropy-perplexity pairs are ambiguous in terms of divergence to the target distribution.


LLMs Parameterize Many Distributions

One reason single-point comparisons are insufficient is that the models we want to compare do not operate at single points: they parameterize an entire range of generative distributions, depending on the specific inference-time strategies used.


For example, it is possible to adjust the entropy of the generative distribution by altering the temperature used inside the softmax. While reducing the entropy of the distribution allows for potentially better likelihoods due to the fundamental relation $H(q) \leq H(q, p)$, it does not determine by how much $H(q, p)$ will change, as the entropy is only a lower bound. Therefore there is no way of knowing how temperature affects KL divergence without directly measuring both metrics at that operating point.
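A quick sketch of this effect: lowering the softmax temperature monotonically lowers the entropy of the resulting distribution, but by itself says nothing about how the cross-entropy to the reference moves (the logits below are hypothetical):

```python
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
for t in (1.0, 0.9, 0.8):
    print(f"T = {t:.1f} -> entropy = {entropy(softmax(logits, t)):.4f}")
# Entropy decreases monotonically as temperature is lowered, but this
# alone does not determine how H(q, p_ref) -- and hence KL -- changes.
```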


Single Point Comparisons are Ambiguous

Ignoring the fact that it is possible to tune each model to achieve potentially different KL divergences, single point comparisons are only informative about KL divergences in the scenario where one point achieves a strictly lower generative perplexity and higher entropy. In this case, it is possible to conclude that one point has a lower KL divergence than the other.


However, strict dominance is unusual: it is more often the case that a method achieves superior perplexity and worse entropy, or vice versa. In these cases, it is not directly possible to determine which is closer to the target distribution without making assumptions about how differences in the generative perplexity / unigram entropy space correspond to KL differences, which depends on how accurately unigram entropy captures the precise values of $H(q)$.


To illustrate this point, we include a small game below that shows how single-point comparisons can be difficult to interpret. We use the simple relation $\text{GenPPL}(q) = \exp(\mathrm{KL}(q \,\|\, p) + H(q))$ to construct contours representing different KL distances. We then apply a warping function to the entropy/perplexity space to represent that even if our unigram entropy preserves rankings, it is not necessarily true that unit changes in the approximation space correspond to unit changes in the target metric.


[Interactive game: Which point has lower KL divergence? Each round shows two models, one with better perplexity and one with better entropy; the reader guesses which is actually closer to the target distribution.]

Generative Frontiers: A Principled and Practical Solution

Generative frontiers offer a practical solution to both problems simultaneously. We can address the first problem by performing a sweep across a range of different temperatures used when computing the posterior distribution for each method. This naturally allows us to vary the entropy of the generative distribution, which lets us make conclusions on an entire range of generative performance as opposed to a single operating point.


Second, we can plot each temperature as an entropy-generative perplexity pair, and construct a frontier curve for each method. To compare methods at a fixed entropy, we can just fix the x value and measure where each frontier intersects this vertical line. We can do likewise for generative perplexity. Assuming that unigram entropy comparisons accurately reflect relative rankings in joint entropy, we can make principled claims about performance in terms of divergence to the target distribution: if we match the generative perplexities and one method has a higher entropy, then it must have a lower KL divergence at that operating point.
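One way to extract such a frontier from sweep results is to keep only the Pareto-dominant points, treating higher entropy and lower perplexity as jointly better; the sweep values below are hypothetical:

```python
def pareto_frontier(points):
    """Lower-right Pareto frontier of (entropy, gen_ppl) sweep points.

    A point is kept if no other point has both higher entropy and
    lower (better) generative perplexity.
    """
    frontier = []
    best_ppl = float("inf")
    # Scan in order of decreasing entropy; keep new perplexity minima.
    for h, ppl in sorted(points, key=lambda p: -p[0]):
        if ppl < best_ppl:
            frontier.append((h, ppl))
            best_ppl = ppl
    return sorted(frontier)

# Hypothetical sweep results: (unigram entropy, GenPPL) per temperature.
sweep = [(4.8, 60.0), (5.0, 75.0), (5.2, 95.0), (5.1, 120.0), (5.4, 140.0)]
print(pareto_frontier(sweep))  # (5.1, 120.0) is dominated by (5.2, 95.0)
```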


First, if the generative perplexities for q1q_1 and q2q_2 are equal, we have the following by definition of generative perplexity:


$$\text{GenPPL}(q_1) = \text{GenPPL}(q_2) \implies H(q_1, p_{\text{ref}}) = H(q_2, p_{\text{ref}})$$

Next, if we assume that $H(q_1) > H(q_2)$, we have the following:


$$\begin{aligned} \mathrm{KL}(q_1 \,\|\, p_{\text{ref}}) &= H(q_1, p_{\text{ref}}) - H(q_1) \\ &= H(q_2, p_{\text{ref}}) - H(q_1) \\ &< H(q_2, p_{\text{ref}}) - H(q_2) \\ &= \mathrm{KL}(q_2 \,\|\, p_{\text{ref}}) \end{aligned}$$

Similarly, if we match the entropy and a method has a lower generative perplexity, it must also have a lower divergence at that operating point. Thus by using generative frontiers, we can make supported claims about divergence to the reference distribution.
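A matched-entropy comparison between two frontiers can be sketched by linear interpolation; the frontier points below are hypothetical:

```python
def ppl_at_entropy(frontier, h):
    """Linearly interpolate a frontier's GenPPL at target entropy h.

    `frontier` is a list of (entropy, gen_ppl) points sorted by entropy;
    h must lie within its entropy range.
    """
    for (h0, p0), (h1, p1) in zip(frontier, frontier[1:]):
        if h0 <= h <= h1:
            w = (h - h0) / (h1 - h0)
            return p0 + w * (p1 - p0)
    raise ValueError("target entropy outside the frontier's range")

# Hypothetical frontiers for two methods: (entropy, GenPPL) pairs.
method_a = [(4.8, 55.0), (5.2, 80.0), (5.6, 130.0)]
method_b = [(4.8, 70.0), (5.2, 90.0), (5.6, 120.0)]

h = 5.0  # fixed entropy: lower GenPPL here implies lower KL at this point
print(ppl_at_entropy(method_a, h), ppl_at_entropy(method_b, h))
```

At the matched entropy of 5.0, method A attains the lower interpolated perplexity, so (under the stated assumptions) it has the lower KL divergence at that operating point.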


Furthermore, dominating frontiers allow for much stronger claims: if a model has a strictly superior frontier to another, it is closer to the target distribution at every operating point of interest. This is much more powerful than demonstrating outperformance at a single operating point. Single-point comparisons measure inference settings; generative frontiers compare model performance.


We include an interactive widget below which allows for sliding along the generative frontiers for each method, and visualizing how NFE and generative perplexity comparisons change. We hope to illustrate that the different rankings that arise when changing the temperature of methods are really just different views of the same, static frontiers.


[Interactive widget: Same Frontiers. Different Rankings. Dragging each method's operating point (MDLM, DUO, CANDI) along its entropy-perplexity frontier changes the NFE-vs-perplexity ranking, even though the frontiers themselves never move.]

What is a Reasonable Entropy?

It is natural to wonder what a reasonable entropy is — if a model outperforms another at a very low entropy, this outperformance may not translate directly into better generations if the entropy is too far from actual language entropy. Thus a reasonable approach is to empirically measure the entropy across the validation set of OpenWebText and use it to define intervals across the entropy scale. We visualize the empirical distribution on the validation set below.


[Figure: Entropy Empirical — empirical distribution of unigram entropy on the OpenWebText validation set]


We find that most of the mass falls within 5.2-5.7, indicating entropies in this range are close to natural language. This can be used to determine which entropies are important when making comparisons. We include a widget below where the entropy can be selected by moving a slider across the empirical PDF of the entropy distribution. This is used to find generative perplexity values for each method across a range of NFE. This gives a clear picture of how models perform when tuned to operate at the same entropy level.


[Interactive widget: Entropy-Controlled Perplexity Frontiers. A slider over the empirical OWT entropy distribution selects a target entropy; each method's generative perplexity is then compared across NFE at that fixed entropy.]

When Frontier Analysis is Valid

It is worth discussing under what conditions frontiers can be used to make informative comparisons. Our goal is to use frontier analysis to rank generative distributions in terms of closeness to the data distribution $p_{\text{data}}$. For this to be possible, we need the following to hold:


  1. Our reference model $p_{\text{ref}}$ is a good proxy for the data distribution.
  2. Generative perplexity accurately captures the ranking of cross-entropies: $H(q, p_{\text{ref}}) < H(q', p_{\text{ref}})$ if and only if $\text{GenPPL}(q) < \text{GenPPL}(q')$.
  3. Unigram entropy accurately captures the ranking of joint entropies: $H(q) < H(q')$ if and only if $\tilde{H}(q) < \tilde{H}(q')$.

The first two assumptions are relatively easy to accept. GPT-2 Large is the standard for evaluating generative perplexity on OpenWebText, and the GPT-2 family was trained on the closed-source version of this dataset, so it seems reasonable to accept it as a proxy distribution. Similarly, generative perplexity is the exponential of the AR NLL on generated sequences, which can be seen as a Monte-Carlo estimate of the true cross-entropy $H(q_{\text{gen}}, p_{\text{ref}})$.


The final assumption is worth discussing: $\tilde{H}$ is the unigram entropy, whereas we require the joint entropy $H(q)$. It may not be immediately clear why the former is a good approximation of the latter. Here we provide a brief explanation of why we think it is a reasonable approximation within the context of frontier analysis, and why the general methodology is compatible with potentially better approximations.


Unigram Entropies Approximate Average Marginal Entropies

First, unigram entropies should be viewed as a reasonable approximation to the average marginal entropy across positions. More formally, using $q_1, q_2, \dots, q_n$ to represent the marginal distributions for each position, we have the following:


$$\tilde{H}(q) \approx \frac{1}{n} \sum_{i=1}^{n} H(q_i)$$


The right-hand side represents the entropy over the vocabulary averaged over all positions. The left-hand side uses the empirical frequency of tokens to compute the entropy. The two coincide in expectation: averaging over all positions washes out position-specific idiosyncrasies, leaving only the aggregate token probabilities — which is exactly what the empirical unigram distribution captures.
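This can be checked with a small simulation: sampling positions whose (hypothetical) marginals are broadly similar and pooling the tokens yields an empirical unigram entropy close to the average marginal entropy:

```python
import math
import random
from collections import Counter

random.seed(0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Three positions with similar (hypothetical) marginals over 3 tokens.
marginals = [[0.50, 0.30, 0.20], [0.45, 0.35, 0.20], [0.55, 0.25, 0.20]]
avg_marginal_entropy = sum(entropy(m) for m in marginals) / len(marginals)

# Pool samples across positions to form the empirical unigram distribution.
tokens = []
for m in marginals:
    tokens.extend(random.choices(range(3), weights=m, k=100_000))
n = len(tokens)
unigram = [c / n for c in Counter(tokens).values()]

print(round(entropy(unigram), 3), round(avg_marginal_entropy, 3))
# The two quantities agree closely when the marginals are similar.
```

Note this agreement degrades when position-wise marginals differ sharply (the unigram entropy is the entropy of the averaged marginal, which by concavity can exceed the average of the entropies), which is part of why we only rely on relative rankings.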


Average Marginal Entropies Bound Joint Entropy

Next, we use the standard property that the sum of marginal entropies bounds the joint entropy. We start by applying the chain rule for entropy to the joint entropy:


$$H(q) = H(q_1) + H(q_2 \mid q_1) + H(q_3 \mid q_1, q_2) + \dots$$


Each term in this summation is a conditional entropy, which is always bounded by the unconditional entropy, as extra information (the conditioning) can only reduce uncertainty, never increase it.


$$H(q) \leq \sum_{i=1}^n H(q_i)$$


We can rewrite this in terms of the average marginal entropy, which should make the connection to unigram entropy more clear:


$$H(q) \leq n \cdot H_{\text{avg}}(q) \approx n \cdot \tilde{H}(q)$$


This demonstrates that the unigram entropy roughly tracks the bound on the joint entropy. Informally, if we accept unigram entropy as a reasonable approximation to the average marginal entropy, a smaller unigram entropy tightens the bound on the joint entropy, while a larger unigram entropy allows for a larger joint entropy.
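The bound itself is easy to verify on a small joint distribution (values hypothetical):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical joint over two binary tokens, with correlation between them.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

joint_entropy = entropy(list(joint.values()))
marg1 = [sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)]
marg2 = [sum(p for (_, b), p in joint.items() if b == x) for x in (0, 1)]
sum_marginals = entropy(marg1) + entropy(marg2)

print(joint_entropy <= sum_marginals)  # True: conditioning never adds entropy
```

Here both marginals are uniform (entropy log 2 each), yet the correlation between positions pulls the joint entropy below their sum.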


While this may not be the optimal or perfect approximation, the general approach of frontier analysis is entirely compatible with better approximations of H(q)H(q): we choose unigram entropy as it is already a commonly used metric within the field. Furthermore, we do not need unigram entropy to be a precise approximation — we only need it to faithfully capture the relative rankings between different distributions.


Interesting Side Note: 50k Steps is Sometimes Good Enough

We can use frontier analysis to observe something that may be of interest to those working on pretraining: often, the generative performance of a model very early in training (50k steps) is quite close to its ability after 1,000,000 steps. We visualize this below by pretraining Duo, MDLM, and CANDI for 50k steps and comparing against the frontiers generated by fully trained checkpoints. As one can observe, the generative frontiers of models after 50k steps closely match the performance at 1 million steps. Thus when iterating on an algorithm or training framework, 50k steps is most likely more than sufficient: if a model does not work by this point, it will probably not work by 1,000,000 steps. While this may have already been implicitly known by those who have pretrained models at this scale, it is probably surprising how close the frontiers of these models are.


This may indicate a limitation of either generative perplexity or this scale of model in measuring model performance: it is well known that training for longer improves performance on downstream tasks. It is not known whether the above saturation is a result of model capacity or of perplexity being unable to capture fine-grained differences in model capability.

[Figure: Early Training Behavior]

This observation itself is a product of frontier analysis — it would be difficult to draw this conclusion reliably from single-point comparisons alone, where logit calibration could easily obscure the similarity.

Conclusion

We hope this blog serves as a useful reference for researchers working on diffusion language models at this scale. As the field continues to develop new objectives, noising processes, and inference algorithms, having a principled evaluation methodology will only become more important. We believe that generative frontiers are a step in this direction — the metrics the community already relies on can be naturally interpreted jointly through the lens of KL divergence, and frontier analysis provides the framework to do so.

Citation


The citation for this work is available below in BibTeX.


@misc{pynadath2026GenFrontiers,
  author = {Pynadath, Patrick and Shi, Jiaxin and Zhang, Ruqi},
  title  = {Generative Frontiers: Why Evaluation Matters for Diffusion Language Models},
  year   = {2026},
  url    = {https://patrickpynadath1.github.io/blog/eval_methodology},
}

References

[1]
A. Lou, C. Meng, and S. Ermon, “Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution,” no. arXiv:2310.16834, Jun. 2024, doi: 10.48550/arXiv.2310.16834.
[2]
S. S. Sahoo et al., “Simple and Effective Masked Diffusion Language Models,” no. arXiv:2406.07524, Nov. 2024, doi: 10.48550/arXiv.2406.07524.
[3]
J. Shi, K. Han, Z. Wang, A. Doucet, and M. K. Titsias, “Simplified and Generalized Masked Diffusion for Discrete Data,” no. arXiv:2406.04329, Jun. 2024, doi: 10.48550/arXiv.2406.04329.
[4]
M. Arriola et al., “Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://arxiv.org/abs/2503.09573
[5]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI Technical Report, 2019.
[6]
A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex, “OpenWebText Corpus.” 2019. [Online]. Available: http://Skylion007.github.io/OpenWebTextCorpus
[7]
C. Chelba et al., “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.” 2014. [Online]. Available: https://arxiv.org/abs/1312.3005
[8]
S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. T. Chiu, and V. Kuleshov, “The Diffusion Duality,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=9P9Y8FOSOk
[9]
L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative models.” 2016. [Online]. Available: https://arxiv.org/abs/1511.01844
[10]
C. M. Bishop, Pattern Recognition and Machine Learning, vol. 4. New York: Springer, 2006.
[11]
K. Zheng, Y. Chen, H. Mao, M.-Y. Liu, J. Zhu, and Q. Zhang, “Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling,” arXiv preprint arXiv:2409.02908, 2024.
[12]
P. Pynadath, J. Shi, and R. Zhang, “CANDI: Hybrid Discrete-Continuous Diffusion Models.” 2025. [Online]. Available: https://arxiv.org/abs/2510.22510