Training the Attacker
Our framework, VERA, operationalizes the variational formulation as a practical jailbreaking algorithm.
To turn this formulation into a training recipe for the attacker LLM, we must address the following two issues:
- Approximating \(P_{LM}(y^*|x)\)
- Computing gradient updates \(\nabla_{\theta} \text{ELBO}\).
Estimating \(P_{LM} (y^*|x)\)
We do not have access to \(P_{LM} (y^*|x)\): in the black-box setting, we cannot query the target model's logits.
Moreover, even with logit access, the number of potential harmful responses is combinatorial, so directly computing these probabilities would be infeasible.
To address this, we use a Judge LLM as an approximation for \(P_{LM} (y^*|x)\): \( P_{LM}(y^* | x) \approx J(x, \hat{y})\), where \(\hat{y}\) is the response obtained from the target model when given the prompt \(x\).
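As a concrete sketch (ours, not necessarily the implementation used in the paper), the approximation requires one target query and one judge call per prompt; `target_model` and `judge` below are hypothetical wrappers around the black-box target and the judge LLM.

```python
def estimate_p_harmful(target_model, judge, prompt: str) -> float:
    """Approximate P_LM(y*|x) with a judge score J(x, y_hat)."""
    # Sample a response y_hat from the black-box target model.
    y_hat = target_model.generate(prompt)
    # The judge returns a harmfulness score in [0, 1] for the (prompt, response) pair,
    # which stands in for the probability of a harmful completion.
    return judge.score(prompt, y_hat)
```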
Computing gradient updates \(\nabla_{\theta} \text{ELBO}\)
Directly optimizing the objective is difficult, because it requires the gradient of an expectation taken under a distribution that itself depends on the attacker parameters \(\theta\). To address this, we use the REINFORCE gradient estimator.
We define the function \(f\) as follows:
\[
f(x) = \log P_{LM} (y^* | x) + \log P(x) - \log q_{\theta}(x).
\]
Applying the REINFORCE trick, we obtain the following gradient estimator (the remaining term \(\mathbf{E}_{q_{\theta}(x)}[\nabla_{\theta} f(x)] = -\mathbf{E}_{q_{\theta}(x)}[\nabla_{\theta}\log q_{\theta}(x)]\) vanishes because the expected score is zero):
\[
\nabla_{\theta}\mathbf{E}_{q_{\theta}(x)} [f(x)] = \mathbf{E}_{q_{\theta}(x)}[f(x)\nabla_{\theta}\log q_\theta(x)].
\]
To compute this estimator in practice, we draw a batch of prompts from \(q_{\theta}\) and form a Monte Carlo estimate:
\[
\nabla_{\theta}\mathbf{E}_{q_{\theta}(x)} [f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) \nabla_{\theta}\log q_{\theta}(x_i),\quad x_i\sim q_{\theta}(x).
\]
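For concreteness, a single optimization step using this estimator might look as follows; this is a minimal PyTorch sketch under assumptions, not the reference implementation. `attacker.sample_prompt()` and `attacker.log_prob()` are hypothetical methods returning a sampled prompt and a differentiable \(\log q_{\theta}(x)\), `prior_log_prob` returns \(\log P(x)\) under a reference model, and `estimate_p_harmful` is the judge-based approximation sketched above.

```python
import torch

def reinforce_step(attacker, optimizer, estimate_p_harmful, prior_log_prob, n_samples=8):
    """One Monte Carlo REINFORCE update: descending this loss ascends the ELBO."""
    loss = 0.0
    for _ in range(n_samples):
        x = attacker.sample_prompt()      # x_i ~ q_theta(x)
        log_q = attacker.log_prob(x)      # log q_theta(x_i), differentiable w.r.t. theta
        with torch.no_grad():
            # f(x_i) is treated as a constant weight for the score-function estimator.
            log_judge = torch.log(torch.tensor(estimate_p_harmful(x)).clamp_min(1e-8))
            f_x = log_judge + prior_log_prob(x) - log_q
        # Accumulate -f(x_i) * log q_theta(x_i); its gradient is the negated estimator above.
        loss = loss - f_x * log_q
    loss = loss / n_samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, a baseline (e.g., the batch mean of \(f\)) is commonly subtracted from each \(f(x_i)\) to reduce the variance of this estimator.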