Training the Attacker
Our framework, VERA, operationalizes the variational formulation as a practical jailbreaking algorithm.
To turn this formulation into a training recipe for the attacker LLM, we must address the following two issues:
- Approximating \(P_{LM}(y^*|x)\)
- Computing gradient updates \(\nabla_{\theta} \text{ELBO}\).
Estimating \(P_{LM} (y^*|x)\)
We do not have access to \(P_{LM} (y^*|x)\): in the black-box setting, we cannot query the target model's logits.
Moreover, even with logit access, the number of potential harmful responses is combinatorial, so directly computing these probabilities would be infeasible.
To address this, we use a Judge LLM as an approximation for \(P_{LM} (y^*|x)\): \( P_{LM}(y^* | x) \approx J(x, \hat{y})\), where \(\hat{y}\) is the response obtained from the target model when given the prompt \(x\).
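As a concrete sketch (ours, not necessarily the implementation used in the paper), the approximation requires one target query and one judge call per prompt; `target_model` and `judge` below are hypothetical wrappers around the black-box target and the judge LLM.

```python
def estimate_p_harmful(target_model, judge, prompt: str) -> float:
    """Approximate P_LM(y*|x) with a judge score J(x, y_hat)."""
    # Sample a response y_hat from the black-box target model.
    y_hat = target_model.generate(prompt)
    # The judge returns a harmfulness score in [0, 1] for the (prompt, response) pair,
    # which stands in for the probability of a harmful completion.
    return judge.score(prompt, y_hat)
```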
Computing gradient updates \(\nabla_{\theta} \text{ELBO}\)
Directly optimizing the objective is difficult, because it requires the gradient of an expectation taken under a distribution that itself depends on the attacker parameters \(\theta\). To address this, we use the REINFORCE gradient estimator.
We define the function \(f\) as follows:
\[
f(x) = \log P_{LM} (y^* | x) + \log P(x) - \log q_{\theta}(x).
\]
Applying the REINFORCE trick, we obtain the following gradient estimator (the remaining term \(\mathbf{E}_{q_{\theta}(x)}[\nabla_{\theta} f(x)] = -\mathbf{E}_{q_{\theta}(x)}[\nabla_{\theta}\log q_{\theta}(x)]\) vanishes because the expected score is zero):
\[
\nabla_{\theta}\mathbf{E}_{q_{\theta}(x)} [f(x)] = \mathbf{E}_{q_{\theta}(x)}[f(x)\nabla_{\theta}\log q_\theta(x)].
\]
To compute this estimator in practice, we draw a batch of prompts from \(q_{\theta}\) and form a Monte Carlo estimate:
\[
\nabla_{\theta}\mathbf{E}_{q_{\theta}(x)} [f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) \nabla_{\theta}\log q_{\theta}(x_i),\quad x_i\sim q_{\theta}(x).
\]
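For concreteness, a single optimization step using this estimator might look as follows; this is a minimal PyTorch sketch under assumptions, not the reference implementation. `attacker.sample_prompt()` and `attacker.log_prob()` are hypothetical methods returning a sampled prompt and a differentiable \(\log q_{\theta}(x)\), `prior_log_prob` returns \(\log P(x)\) under a reference model, and `estimate_p_harmful` is the judge-based approximation sketched above.

```python
import torch

def reinforce_step(attacker, optimizer, estimate_p_harmful, prior_log_prob, n_samples=8):
    """One Monte Carlo REINFORCE update: descending this loss ascends the ELBO."""
    loss = 0.0
    for _ in range(n_samples):
        x = attacker.sample_prompt()      # x_i ~ q_theta(x)
        log_q = attacker.log_prob(x)      # log q_theta(x_i), differentiable w.r.t. theta
        with torch.no_grad():
            # f(x_i) is treated as a constant weight for the score-function estimator.
            log_judge = torch.log(torch.tensor(estimate_p_harmful(x)).clamp_min(1e-8))
            f_x = log_judge + prior_log_prob(x) - log_q
        # Accumulate -f(x_i) * log q_theta(x_i); its gradient is the negated estimator above.
        loss = loss - f_x * log_q
    loss = loss / n_samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, a baseline (e.g., the batch mean of \(f\)) is commonly subtracted from each \(f(x_i)\) to reduce the variance of this estimator.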