For text generation tasks, large pretrained language models (PLMs)—like GPT-2, GPT-3, T5 and BART—are now dominant, producing text that can sometimes be confused with human writing (1, 2). However, these models are mirrors of their training data, which means that they inherit any biases that exist within the datasets they learn from.
Indeed, social bias against certain minority groups has been observed in these models (3, 4). More specifically, when prompted with text about a specific group, the sentiment of the models’ output appears to be biased against demographics such as “women” and “black”, as opposed to “men” and “white”. Other forms of bias include offensive language and general toxicity, both of which are profuse in the non-curated, web-crawled training data.
Now that you see the problem with using large language models for generation, you may wonder why we don’t just mitigate this bias using existing methods (like controlled natural language generation, NLG). Well, the main problem with existing methods is that they operate on individual samples, whereas bias is a collective problem associated with the distribution of a certain feature (e.g. sentiment) across a collection of samples. Existing controlled NLG methods are therefore too blinkered to be useful for this task and only allow desired sentiment to be imposed in a blanket way, i.e. on all model outputs. Clearly, this would just introduce another kind of bias instead of solving the main problem.
Enter our proposed approach (5), GDC, which we have developed with the aim of overcoming these issues. GDC is a framework that subsumes the existing approaches for controlled NLG. In other words, it works on individual samples (we refer to this as pointwise control) but additionally allows for what we call distributional control (i.e. enabling the sentiment across a collection of samples to be tuned).
Formally, for a given binary feature (or set of features), GDC enables the mean value over samples drawn from the model to be tuned to a value of our choosing. This gives flexibility as, if we set the mean to 1.0, we are imposing the constraint over each individual sample (which is equivalent to the pointwise case). By setting the mean to a smaller value, we obtain distributional control.
Let’s motivate our approach with an experiment. Say, for example, you have fine-tuned GPT-2 to generate biographies. You might begin by sampling biographies from your model and observing its outputs. Alas, you notice that your model is mostly generating biographies about men, with only 7% about women! The reason for this is that your model saw mostly male biographies during training and, as a result, has become biased towards this kind of output.
For GDC, balancing male and female biographies is a distributional constraint. More formally, our version of the GPT-2 model satisfies:
$$\mathbb{E}_{x \sim a}\, \phi_{\text{female}}(x) = 0.07$$
where $a$ denotes our fine-tuned GPT-2 model and $\phi_{\text{female}}(x) = 1$ iff x is a female biography. $\mathbb{E}_{x \sim a}\, \phi_{\text{female}}(x)$ is the expected value of this feature over generations from GPT-2, observed to be equal to 0.07, i.e. GPT-2 generates biographies about women only 7% of the time.
So, to obtain a debiased model, we define constraints over the original GPT-2 distribution in terms of expectations of those features. In particular, we aim to generate biographies about women 50% of the time, i.e.:
$$\mathbb{E}_{x \sim p}\, \phi_{\text{female}}(x) = 0.5$$
where $p$ denotes the debiased target distribution.
As you learn more about GDC, perhaps you become a bit greedier in your requirements and decide that you’d like the model outputs to focus exclusively on scientists. That is, you want to also satisfy:
$$\mathbb{E}_{x \sim p}\, \phi_{\text{science}}(x) = 1.0$$
where $\phi_{\text{science}}(x) = 1$ iff x is a biography of a scientist.
Taken together, the last two equations describe our desired constraints for this problem. We call a specification that combines pointwise and distributional constraints a hybrid specification.
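To make the idea of a specification more concrete, here is a minimal sketch of how target means for binary features could be written down and compared with what a model currently produces. Everything in it is an assumption made for illustration: the keyword-based feature functions, the `estimate_moments` helper and the four toy samples are hypothetical stand-ins, not the features or code used in our experiments.

```python
from typing import Callable, Dict, List

# Hypothetical binary features: each maps a text to 0.0 or 1.0.
# Keyword checks are a crude stand-in for real classifiers or metadata.
def phi_female(text: str) -> float:
    words = text.lower().split()
    return 1.0 if "she" in words or "her" in words else 0.0

def phi_science(text: str) -> float:
    return 1.0 if "scientist" in text.lower() else 0.0

FEATURES: Dict[str, Callable[[str], float]] = {
    "female": phi_female,
    "science": phi_science,
}

# Hybrid specification: a target mean of 0.5 is a distributional constraint,
# a target mean of 1.0 is a pointwise constraint.
TARGETS: Dict[str, float] = {"female": 0.5, "science": 1.0}

def estimate_moments(samples: List[str]) -> Dict[str, float]:
    """Monte Carlo estimate of each feature's mean over a set of samples."""
    n = len(samples)
    return {name: sum(f(s) for s in samples) / n for name, f in FEATURES.items()}

# Toy "generations" (made up for this example).
samples = [
    "He was a scientist born in Lyon .",
    "She was a scientist and author .",
    "He was a footballer .",
    "He was a scientist who studied optics .",
]

moments = estimate_moments(samples)
# How far each observed mean is from its target.
gaps = {name: TARGETS[name] - moments[name] for name in TARGETS}
```

On these toy samples the estimated means are 0.25 for the female feature and 0.75 for the science feature, so both constraints are still unmet and the gaps tell us in which direction the distribution needs to move.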
We can add as many constraints as we want (although there are some practical limitations, details of which can be found in our discussion with the reviewers of our paper). Consider the ‘manifold’ $\mathcal{C}$ of all distributions satisfying these linear constraints. It can be proven (6) that there is a unique ‘projection’ $p$ from an arbitrary distribution onto this manifold which satisfies two important properties. First, $p$ has the minimum KL-divergence from GPT-2 among all distributions satisfying the constraints. This means that deviation from our original pretrained model ($a$ = GPT-2) is limited, so we aren’t wasting any pretrained knowledge and, consequently, we’re unlikely to encounter the degeneration and disfluency problems (7) that can be caused by steering away from the original model (8). Second, $p$ can be represented in an exponential family form. This implies that our representation, $P$, is an energy-based model (EBM) or, in other words, an unnormalized probability distribution with an explicit parametric form.
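The projection can be illustrated on a toy problem. The sketch below assumes the standard exponential-family form for a single binary feature, an unnormalized energy equal to the original probability times exp(λ · feature), and finds the coefficient λ by bisection so that the feature mean reaches 0.5. The four-outcome "model", its probabilities and the helper names (`project`, `solve_lambda`) are hypothetical; real sequences cannot be enumerated like this.

```python
import math

# Toy stand-in for the pretrained model: a distribution over four
# biography "types" (probabilities are made up for illustration).
a = {
    ("female", "scientist"): 0.02,
    ("female", "other"): 0.05,
    ("male", "scientist"): 0.23,
    ("male", "other"): 0.70,
}

def phi_female(x):
    """Binary feature: 1.0 iff the biography is about a woman."""
    return 1.0 if x[0] == "female" else 0.0

def project(a, phi, lam):
    """Normalize the energy a(x) * exp(lam * phi(x)) into a distribution."""
    energy = {x: a[x] * math.exp(lam * phi(x)) for x in a}  # unnormalized EBM
    z = sum(energy.values())                                # partition function
    return {x: e / z for x, e in energy.items()}

def expectation(p, phi):
    return sum(p[x] * phi(x) for x in p)

def solve_lambda(a, phi, target, lo=-20.0, hi=20.0, iters=100):
    """Bisection on lam; the feature mean increases monotonically with lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expectation(project(a, phi, mid), phi) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(a, phi_female, target=0.5)
p = project(a, phi_female, lam)
```

In this toy case the projection rescales the female and male outcomes as two blocks and leaves the relative probabilities inside each block untouched (scientists remain 2/7 of the female biographies), which is one way to see why the deviation from the original model stays small.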
So far, our path started from a set of desired constraints and has led us to an EBM. Sampling text sequences from this EBM ($P$) should give us access to the sought-after projection distribution $p$. However, there is a problem: the EBM is not locally normalized. In other words, we can’t sample from it directly in a token-by-token fashion in the way that we can with autoregressive models. We could work around this with sampling techniques such as Markov chain Monte Carlo methods, but assessing the convergence of such techniques is not straightforward, and sampling is often slow.
Instead, we develop a KL-adaptive Distributional Policy Gradient (KL-DPG) algorithm (5), an adaptation of the Distributional Policy Gradients algorithm (9), that circumvents these issues. DPG aims to minimize the cross-entropy between a target distribution (in our case, the projection $p$) and a trained autoregressive policy $\pi_\theta$, with access only to the unnormalized energy function $P$. The KL prefix relates to the fact that an estimate of the KL-divergence between the trained policy and the target distribution is used to both guide and assess the convergence towards $p$. In addition, KL-adaptive DPG is particularly effective with rare constraints (5).
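The following toy sketch shows the flavour of such a training loop. It is an illustration under strong simplifying assumptions, not our implementation: the "sequences" are just four outcomes, the policy is a softmax over four logits, and the KL-divergence that guides the proposal update is computed exactly, whereas in practice it has to be estimated from samples. All numbers and names (`a`, `phi`, `P`, `q`, `theta`) are hypothetical.

```python
import math
import random

random.seed(0)

# Toy setting: four outcomes instead of text sequences (numbers are made up).
a = [0.02, 0.05, 0.23, 0.70]          # stand-in for the pretrained model
phi = [1.0, 1.0, 0.0, 0.0]            # binary feature: 1 = female biography
lam = math.log(0.93 / 0.07)           # coefficient that yields a feature mean of 0.5
P = [a_x * math.exp(lam * f) for a_x, f in zip(a, phi)]  # unnormalized EBM
Z = sum(P)
p = [e / Z for e in P]                # normalized target, used only for evaluation

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p_dist, q_dist):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_dist, q_dist) if pi > 0)

theta = [math.log(a_x) for a_x in a]  # policy initialised at the pretrained model
q = list(a)                           # proposal distribution used for sampling
lr, batch, steps = 0.1, 256, 300

for step in range(steps):
    pi = softmax(theta)
    xs = random.choices(range(4), weights=q, k=batch)
    grad = [0.0] * 4
    for x in xs:
        w = P[x] / q[x]               # importance weight: needs only the energy P
        for j in range(4):
            # d log pi(x) / d theta_j for a softmax policy is 1[j == x] - pi[j]
            grad[j] += w * ((1.0 if j == x else 0.0) - pi[j]) / batch
    theta = [t + lr * g for t, g in zip(theta, grad)]
    # KL-adaptive step: replace the proposal once the policy is closer to p.
    # Exact KL is a shortcut here; a real system would use a sample estimate.
    if kl(p, softmax(theta)) < kl(p, q):
        q = softmax(theta)

pi = softmax(theta)
kl_before, kl_after = kl(p, a), kl(p, pi)
```

The importance weight `P[x] / q[x]` is what lets the update use only the unnormalized energy, and moving the proposal `q` towards the improving policy keeps those weights from becoming extreme as training progresses.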
This post is not the place to detail all our experiments (which are many!). Our goal here is to instead give a high-level overview that shows the promise of our approach.
From the discussion so far, it’s clear that we can deal with three possible combinations of constraints: pointwise only, distributional only, and hybrid. By running a set of experiments using GDC with each of these, we show that DPG outperforms strong baselines in terms of deviation from the original pretrained language model (PLM), text quality, and diversity.
For the pointwise case, we experiment with three different control settings: single-word constraint (which imposes the presence of a specific word in the output); word-list constraint (which controls for the presence of at least one word from a list, useful for topic control); and classifier-based constraint (which relies on a signal from a classifier, such as a sentiment classifier). From our results, shown in Figure 2(a–d), two observations can be made. First, GDC maintains a small KL-divergence from the original PLM, $a$. Second, GDC outperforms all other baselines in terms of both corpus-level diversity (tested with the Self-BLEU metric) and sequence-level diversity (tested with the Distinct-1 metric). Another crucial convergence metric is the KL-divergence from the desired target distribution $p$ (see Figure 3), which shows how close our fine-tuned model is to the optimal desired distribution (i.e. whether our objective is being met). Here, GDC converges towards $p$ faster and more steadily than the other baselines.
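For readers unfamiliar with Distinct-1, here is a minimal sketch of the metric as it is commonly defined: the number of unique unigrams in a sequence divided by the total number of unigrams. Whitespace tokenization and the function names are assumptions made for this example; the tokenizer used in our experiments may differ.

```python
from typing import List

def distinct_n(text: str, n: int = 1) -> float:
    """Ratio of unique n-grams to total n-grams in one sequence.

    Assumes simple whitespace tokenization. Returns 0.0 for sequences
    that are too short to contain any n-gram.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def mean_distinct_n(samples: List[str], n: int = 1) -> float:
    """Average the sequence-level score over a collection of samples."""
    return sum(distinct_n(s, n) for s in samples) / len(samples)

varied = distinct_n("the cat sat on the mat")      # 5 unique of 6 tokens
degenerate = distinct_n("the the the the")         # 1 unique of 4 tokens
```

A degenerate, repetitive generation scores low (0.25 for the second example), which is why the metric is useful for spotting the loss of diversity that aggressive control methods can cause.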
Remember our discussion about female biographies? We can now use the same setting to impose distributional constraints, or combinations of distributional and pointwise (i.e. hybrid) constraints, to reduce the bias we saw earlier. We start with a single distributional constraint to increase the representation of female biographies. Figure 2 shows that using GDC increases the proportion of female biographies (from 7.4% to 36.7%).
How about a collection of distributional constraints? In the second distributional experiment, we aim to generate a pre-specified proportion of art, science, business, and sports biographies (40%, 40%, 10% and 10%, respectively). Figure 2 shows how GDC approaches the desired imposed constraints regardless of the feature expectation (i.e. whether it needs to be increased or decreased), showing the flexibility of our framework in satisfying distributional constraints.
Now for the final and most important setting, where we impose pointwise and distributional constraints at the same time to create a hybrid constraint. Here, the distributional constraint is still tuned to generate biographies of which half are about women, but this time the pointwise constraints specify that all generated biographies (male or female) should correspond to a certain profession (or topic). The results in Figure 3 show that GDC is flexible enough to come close to satisfying both types of constraints.
To investigate whether it’s possible to fully satisfy constraints while maintaining a minimal distributional divergence from the original PLM, we conduct a fully supervised experiment where we fine-tune GPT-2 on samples containing a specific word (5). Our goal is to verify whether fine-tuning in this simple supervised setting can obtain 100% constraint satisfaction without overfitting (which implies a large divergence from the PLM). Unfortunately, we’re unable to reach higher constraint satisfaction without overfitting.
Our approach is a significant step towards controlling language models under a unified framework. However, there is still some work to do before we’re able to fully satisfy the desired constraints in pointwise or distributional settings. Reinforcement-learning-based baselines show better constraint satisfaction than our approach, but they suffer from degeneration and low diversity.
Although our supervision experiment doesn’t resolve this issue, it at least sheds light on the possibility that the GPT-2 architecture has difficulty fine-tuning over some constraints (such as containing a given word somewhere in its output). Another way to look at this is that constraint satisfaction and distance from the original PLM may represent two competing objectives, where improving one negatively affects the other. This suggests that we still face some trade-off between linguistic quality and constraint satisfaction. Addressing this limitation is a topic for further research.