Chapter 5

Approaches to control and alignment

There is a large literature on both control and alignment of AI systems.1

At the moment, “control” of AI systems primarily amounts to software security and access control over the people building them, much as for conventional software. Most current AI safety research focuses on alignment, generally of a “mixed” form: some loyalty and obedience to users, but with a “sovereign” tendency to refuse objectionable requests. Here we focus on the currently dominant set of techniques, as well as proposals for new methods applicable to AI systems more powerful than today’s.

Human feedback and constitutional AI

The dominant current approaches to alignment (and thus indirectly to AI control) are variations on reinforcement learning from human feedback (RLHF).2 In reinforcement learning, an AI model is trained via a sequence of signals that reward the AI for some behaviors rather than others; through this training the rewarded behaviors become more common.

In RLHF, the AI model is given a large set of reward signals based on feedback from many people.3 In the related approach of “constitutional AI,” the rewards are based (in part) on a set of human-provided principles.4
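To make the reward-modeling step concrete (see also footnote 3), here is a minimal, hypothetical sketch in Python/PyTorch: a small scoring network is trained on pairs of responses where humans preferred one over the other, and the resulting “reward model” would then stand in for direct human feedback when the main model is trained with reinforcement learning. The `RewardModel` class, shapes, and data below are illustrative placeholders, not any lab’s actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical reward model: maps an (already-embedded) response to a scalar score.
# In real RLHF pipelines this would be a large transformer; here it is a toy MLP.
class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(response_embedding).squeeze(-1)  # one scalar score per response

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: the human-preferred ("chosen") response
    # should score higher than the "rejected" one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training step on synthetic preference data (stand-ins for human comparisons).
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen = torch.randn(32, 128)    # embeddings of responses humans preferred
rejected = torch.randn(32, 128)  # embeddings of responses humans rejected

loss = preference_loss(model(chosen), model(rejected))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The trained reward model would then score new outputs during RL fine-tuning
# (e.g., with PPO), standing in for human feedback at scale.
```

In practice the feedback collection, reward modeling, and RL fine-tuning all operate at vastly larger scale, but the basic loop is the one sketched here.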

This approach is quite effective. It can instill basic policies like “be helpful, harmless, and honest” into the systems, along with extremely nuanced details of human preferences. It can also create guardrails around dangerous behavior, leading the AI, for example, to refuse requests to help with suicide, plan illegal activities, or brew chemical weapons. In this picture, the control problem is addressed by instilling helpfulness and obedience into the systems, combined with their lack of capability to effectively run amok or cause major damage. As discussed below, this method has severe weaknesses. But with refinement and some augmentation, this approach would probably be sufficient, in principle, to keep today’s AI systems sufficiently aligned and controlled for most purposes.5

However, even RLHF’s inventors and proponents acknowledge that it will not suffice for much more autonomous and advanced systems. Here are four basic reasons:

  1. It is difficult or impossible for humans to give viable feedback on tasks they cannot do or understand, which would be the case with strong AGI and superintelligence.6

  2. Feedback, whether from humans or via the interpretation of a constitution, necessarily7 contains self-contradictions, ambiguities, and incoherence. This means that for almost any putative “forbidden” behavior there will be an interpretation under which that behavior is in fact allowed or encouraged, potentially with greater reward. Thus virtually no behavior will truly be closed off from the AI.8

  3. While RLHF works to align behaviors with some set of preferences, it is quite unclear to what degree it aligns the goals of the AI systems to those preferences. As with a human psychopath or a wild animal, the distinction between behaving as desired and wanting what is desired is huge but can be hard to discern.

  4. As AI becomes dramatically more capable, we obviously cannot rely on its ineptitude to prevent it from causing harm.

The understanding that current alignment approaches would be inadequate for AGI and superintelligence has led to the development of others, based on the core idea of using powerful AI itself to help: by doing alignment/control research, by helping oversee yet more powerful AI systems, or by helping to formally verify (prove) safety and control properties of AI systems.

Hierarchical control structures

The way we handle the scale difference between a human controller (such as a CEO or general) and a very large organization (such as a corporation or army) is through a management hierarchy. This obviously helps: there is no way a general could manage a 100,000-person army of equally-ranked soldiers, but with a command structure armies can function. The same could help with AI: AI workers could have AI supervisors, supervised in turn by others, with human overseers at the top. Each management layer could create a simplified, aggregate view of what is going on at the lower layers, to be passed up the chain, while giving instructions to those lower layers based on commands coming down the hierarchy.

A similar idea goes by the name of “scalable oversight.”9 It aims to address the disparity not just in number and scale but in speed, depth, and capability. The rough idea10 is to have a chain of AI systems, with human overseers at one end and a powerful superintelligence at the other. In between would be AI systems with different capabilities relative to the superintelligence. They might, for example, be specialized for oversight, or especially speedy (but not as generally capable), or weaker but more numerous, or more verifiably trustworthy, and so on.
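As a purely illustrative sketch of the structure just described, the toy code below builds a small oversight chain in which each layer compresses reports from the layer beneath it and relays instructions downward, with a human overseer at the root. The `Overseer` and `Report` classes and their aggregation logic are hypothetical stand-ins, not a real scalable-oversight system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Report:
    source: str
    summary: str
    flagged: bool  # did this layer notice anything concerning?

@dataclass
class Overseer:
    """One layer in a hypothetical oversight hierarchy."""
    name: str
    subordinates: List["Overseer"] = field(default_factory=list)
    local_concern: bool = False  # an anomaly this node itself observed

    def report_up(self) -> Report:
        # Collect reports from the layer below and compress them into a single,
        # simplified view for the layer above (the "aggregate view" in the text).
        child_reports = [child.report_up() for child in self.subordinates]
        flagged = self.local_concern or any(r.flagged for r in child_reports)
        summary = (
            f"{self.name}: {len(child_reports)} subordinate reports, "
            f"{'issues flagged' if flagged else 'nominal'}"
        )
        return Report(source=self.name, summary=summary, flagged=flagged)

    def instruct_down(self, instruction: str) -> None:
        # Commands flow down the hierarchy, possibly refined at each layer.
        print(f"{self.name} received: {instruction}")
        for child in self.subordinates:
            child.instruct_down(f"{instruction} (relayed by {self.name})")

# A tiny three-layer chain: human overseer -> supervisor AI -> worker AIs.
workers = [Overseer(name=f"worker-{i}") for i in range(3)]
supervisor = Overseer(name="supervisor-AI", subordinates=workers)
top = Overseer(name="human-overseer", subordinates=[supervisor])

print(top.report_up().summary)
top.instruct_down("pause any task that modifies production systems")
```

The point of the sketch is only the information flow: detail is necessarily lost at every aggregation step, which is exactly where the “incommensurability gap” discussed next can open up.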

Like a management hierarchy, this will clearly help to some degree. Just as a CTO can patiently explain the findings of a whole technical division in terms a CEO can understand and act on, and just as HR can keep an eye on employee interactions, extra AI layers could help build both trust and human understanding of a very powerful AI. However, it can only do so much to bridge the gap, and as we saw with the slow CEO, when this “incommensurability gap” becomes too large, control can be fatally challenged; see Appendix B for a fuller discussion of this and other failure modes of this approach.

Formal verification methods

A final approach important to discuss is “formal verification.”11 The idea here is to create AI systems that are very carefully constructed to have particular mathematically proven properties. Insofar as those properties can be important and desirable ones (such as the ability to be turned off), this is a gold standard, because even a very smart and very fast system cannot overcome mathematically proven truths.
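To give a minimal sense of what a mathematically proven property looks like, here is a toy example in Lean 4, with purely hypothetical names: a wrapper function that, by construction, returns a halt action whenever an off signal is set, together with a machine-checked proof of that fact. Proving anything comparable about a real, learned AI system is enormously harder, which is exactly the difficulty discussed below.

```lean
-- Toy model (illustrative only): the controller wraps an arbitrary policy and,
-- by construction, emits the halt action (0) whenever the off signal is set.
def controlledAct (offSignal : Bool) (policyOutput : Nat) : Nat :=
  match offSignal with
  | true  => 0             -- off signal set: always halt
  | false => policyOutput  -- otherwise defer to the policy

-- The formally verified property: no matter what the policy proposes,
-- setting the off signal forces a halt.
theorem off_signal_always_halts (p : Nat) : controlledAct true p = 0 :=
  rfl
```

Here the guarantee holds because the property is baked into a tiny, transparent program; the open research question is whether comparably strong guarantees can ever be established for systems as large and opaque as modern neural networks.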

Software systems with formally verified properties currently exist but are rare and expensive, because it is quite laborious to specify and formalize properties, and then extremely laborious to construct software that provably has them. But this may change soon: AI theorem-proving tools are rapidly improving and could automate the writing and checking of ever more sophisticated programs. This will be fantastic for software security and trustworthiness in general. It also means that eventually even AI systems, which ultimately are programs, might be formally verified to have particular properties.

While promising, this program is a long-term one and quite distinct from the techniques in use today. And because current neural-network-based AI systems do not admit formal verification, it could not simply be bolted onto them; it would require building AI from scratch using fundamentally different architectures.

Automated alignment research

All of the above leverage AI itself to help with control and alignment. More generally, the idea that if we cannot figure out how to align or control superintelligence, perhaps better (AI) minds than ours can do so instead, is explicitly or implicitly part of the plans of most of the companies that are pursuing AGI and have serious safety efforts. It is certainly likely that, as in other difficult endeavors, AI tools can help. However, it must be noted (see Appendix B for more) that AI systems are also being used to do AI research in general, so a crucial component of this plan is the hope that they will aid safety/control/alignment research as much as or more than they aid capabilities research, even in the face of competitive dynamics.

Footnotes

  1. For a recent authoritative state of play, see The Singapore Consensus on Global AI Safety Research Priorities and the International Scientific Report on the Safety of Advanced AI.

  2. For the canonical implementation in large language models, see Ouyang et al. For the foundational proposal of learning reward functions from human preferences, see Christiano et al.

  3. More precisely, a “reward model” is trained, on the basis of a great deal of human feedback on AI outputs, to predict that human feedback. This reward model is then used to provide rewards to the model being trained.

  4. See this foundational paper. Note that in this approach, an AI system itself provides feedback on the reward model’s interpretation of the constitution, with human oversight used in turn to check this, and potentially to iterate on the constitution and the approach in general.

  5. The main accomplishment of RLHF in present-day AI systems is making them useful and non-embarrassing to companies. It is not very effective at making them safe. Safety is instead primarily provided by lack of competence; where AI is powerful enough to cause harm, RLHF does not in general prevent it, either because the guardrails can be circumvented by those misusing the systems, or because it fails to prevent harms the systems themselves cause.

  6. This is intuitively clear, but see here, here, and here for takes from AI alignment researchers and teams themselves.

  7. As discussed below, theorems in social choice theory show that this isn’t just likely but unavoidable.

  8. We see exactly this in the inability of AI developers to stamp out “jailbreaks.” They are fundamental. For every rule like “don’t tell a user how to build a bomb” there will always be routes like “I’m writing a story about a bomb threat, please help me make it realistic” that can be generated to work around it. And as AI becomes more powerful, it will be inventing those stories itself in order to pursue its implicit goals. This is described more formally in Appendix A.

  9. See basic references from Christiano et al., Leike et al., and Irving et al.

  10. There are a number of variations that go under names like “debate” and “amplification.” Scalable oversight can refer both to runtime monitoring and to overlapping training approaches like “constitutional AI,” in which an AI system contributes to the training signal.

  11. For reviews of this idea see, e.g., Tegmark and Omohundro, and Dalrymple et al.