It is first important to distinguish control of AI systems from alignment, a crucial and related but distinct notion.[1] One can consider one party as “aligned to” a second party to the extent that the goals and preferences of the second are important to the first. This can be a bi-directional or uni-directional relationship.
Control is a one-way relationship that may or may not coincide with alignment. It is certainly easier to control a party that is aligned to you, but alignment is not necessary: prisoners are controlled by, but certainly not aligned to, their captors. And one party can be aligned to another party without being controlled by it: for example, a parent is often aligned to, but not strictly controlled by, their child or their pet. In discussing advanced AI, alignment is often used somewhat interchangeably with control, but the distinction is critical: AI systems may end up being aligned to people, controlled by them, both, or neither, and these four relations would have deeply different implications.
AI alignment itself takes a variety of forms with profoundly different implications for control. Obedient alignment means the AI adopts instructed human goals as its own, even when it might “disagree” with those instructions. Relatedly, loyal alignment means the AI adopts the goals and preferences of an operator as its own. Sovereign alignment means the AI pursues internalized goals and policies (potentially including human welfare or legal compliance) and may refuse instructions that conflict with those. These distinctions are crucial: maximally obedient alignment is close to control, whereas sovereign alignment — even if done perfectly — is not.[2]
To specify more fully what control of advanced AI systems would mean, we propose that a system is under “meaningful human control”[3] if it has the following five properties.
1. Comprehensibility/Interpretability: Humans can obtain accurate, comprehensible explanations of the system’s goals, world model, reasoning, and planned actions at a level that enables informed control decisions.
2. Goal Modification: Humans can add, remove, or reprioritize the system’s goals.
3. Behavioral Boundaries: Humans can establish and enforce constraints on permitted behaviors that the system cannot creatively misinterpret or circumvent.
4. Decision/Action Override: Humans can countermand specific decisions or strategies chosen by the system, and can prevent planned actions from being executed.
5. Emergency Shutdown: Humans can[4] reliably terminate system operation partially or completely.
These criteria are formulated to apply to advanced and fairly autonomous AI systems that can take actions and generally be in contact with the world,[5] and to correspond to what we generally expect of systems under human control.
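To make the five properties concrete, the sketch below restates them as a hypothetical software interface. This is purely illustrative: the class and method names are invented for this example, and no existing AI system exposes such hooks.

```python
from abc import ABC, abstractmethod


class MeaningfulHumanControl(ABC):
    """Hypothetical interface mirroring the five control properties above.

    Illustrative only: stating the interface does not solve the hard
    problem of making each operation reliable for an advanced AI system.
    """

    @abstractmethod
    def explain(self, query: str) -> str:
        """1. Comprehensibility/Interpretability: return an accurate,
        comprehensible account of goals, world model, reasoning, or
        planned actions relevant to `query`."""

    @abstractmethod
    def modify_goals(self, add=None, remove=None, new_priorities=None) -> None:
        """2. Goal Modification: add, remove, or reprioritize goals."""

    @abstractmethod
    def set_boundaries(self, constraints) -> None:
        """3. Behavioral Boundaries: enforce constraints on permitted
        behaviors that cannot be reinterpreted or circumvented."""

    @abstractmethod
    def override(self, decision_id: str) -> None:
        """4. Decision/Action Override: countermand a specific decision
        or strategy and prevent its planned actions from executing."""

    @abstractmethod
    def shutdown(self, partial: bool = False) -> None:
        """5. Emergency Shutdown: reliably terminate operation,
        partially or completely."""
```

The sketch is only a restatement of the criteria: declaring such an interface guarantees nothing about whether an advanced system actually gives accurate explanations, honors its constraints, or shuts down when told to, which is where the real difficulty of control lies.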
Because our intuitions come from non-autonomous computer systems and AI systems, an analogy is useful for insight into what control means and what obstacles can arise: controlling a superhuman AI system is in many ways like a CEO controlling a large corporation.
Footnotes
1. It is also important to distinguish it from the related but weaker concept of oversight — the monitoring and after-the-fact correction of AI behavior; see Mannheim & Homewood.
2. Loyalty is close to obedience but can be subtly different; a loyal friend can still contradict you or even refuse to do what you say.
3. This refers to but extends the notion of meaningful human control that has been developed primarily in discussions of autonomous weapons — a high-stakes but somewhat narrower domain.
4. This includes not just that shutdown is technically possible, but that other social, economic, psychological, or organizational factors will not prevent it.
5. AI systems without much autonomy are far easier to control; for a system that cannot really act on its own, “goals” tend to be weak and belong to the operator, and criteria 3, 4, and 5 are essentially automatic.