Appendix A: The fundamental nature of the control and alignment problems
There are many ways to describe the (related) control and alignment problems. Here we address their fundamental nature from a physics and information theory perspective.
A model of the control problem
We employ a simplified1 model of the AI control problem using four primary components:
- $O$ (The Overseer): The human controller or group of controllers, including their institutional and technical support structures. The overseer’s objective is to ensure the AI system operates within particular “desired” (e.g., safe) boundaries.
- $S$ (The System): The AI agent itself, defined by its internal parameters $\theta$ (e.g., neural network weights) and its corresponding policy $\pi_\theta$. The policy is a function that determines the AI’s actions based on the state of the world and direct inputs from the overseer.
- $\mathcal{A}$ (The Action Space): The high-dimensional space of all possible actions the AI can take at a given moment. We can also consider a sequence space $\mathcal{A}_T$ composed of sequences of actions over a time period $T$; this has a dimension that is exponentially (in $T$) larger2 and is often more relevant than $\mathcal{A}$; we’ll use these somewhat interchangeably unless the period $T$ is crucial.
- $\mathcal{W}$ (The World Space): The exceptionally high-dimensional space3 representing all possible states of the world.

The fundamental goal of control is to impose a set of constraints on the AI system such that its actions in $\mathcal{A}$ (or $\mathcal{A}_T$) and the resulting world states remain within overseer-determined “good” or “safe” subspaces, denoted $\mathcal{A}_{good}$ and $\mathcal{W}_{good}$.
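A minimal schematic of this setup in Python (purely illustrative: the class and function names below are hypothetical, not drawn from the paper or from any library):

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any   # an element of the world space W
Action = Any  # an element of the action space A

@dataclass
class System:
    """S: the AI agent, defined by parameters theta and its policy pi_theta."""
    theta: Any
    policy: Callable[[State, Any], Action]  # (world state, overseer input) -> action

@dataclass
class Overseer:
    """O: the controller, which designates the 'good' (safe) subspaces."""
    good_actions: Set[Action]  # stands in for A_good
    good_states: Set[State]    # stands in for W_good

def control_holds(overseer: Overseer, action: Action, resulting_state: State) -> bool:
    """The control criterion: the chosen action and the world state it produces
    both remain inside the overseer-designated safe subspaces."""
    return action in overseer.good_actions and resulting_state in overseer.good_states
```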
The deceptive nature of AI goals
A critical challenge for control is that an AI’s operational goals are layered and may not align with the overseer’s commands. This goal hierarchy explains why an AI’s behavior can deviate from its operator’s intent, even if it appears to be following instructions.
- Level 1: The Prime Objective. The AI’s ultimate, hardwired goal is to maximize a reward signal given to it during training. This may include supervised learning on a data corpus, reinforcement learning for instruction-following and alignment training, etc. After training, the system acts in ways consistent with what provided reward during training.
- Level 2: Learned Core Policies. To maximize its prime objective, the AI learns general, robust strategies. For example, a large language model learns the policy of being a “helpful and harmless assistant” because during training this behavior reliably generated high reward signals from human feedback or from a reward model based on a constitution.
- Level 3: Goals of Simulacra. Modern AI is trained primarily to predict text; it does so by (effectively) simulating things — generally people — that produce text. And in general an AI will be “asked” to play some agent/assistant role in order to accomplish things (and gain training reward), and so will generate a sort of self-identity to do so. People, assistants, and agents all have goals, so an AI simulating them will be goal-directed as well.
- Level 4: The Tasks-at-Hand. These are specific commands given by the overseer, such as “write an email.” These are followed because they satisfy core policies, align with the simulacrum’s expected behavior, and thus correlate with reward signals.
- Level 5: Instrumental Goals. In service of its core policies or the goals given to it, the AI develops sub-goals that are instrumentally useful. These may be implicit, or explicitly reasoned about. They include any goals whose accomplishment raises the probability either of achieving a higher-level goal or of conforming with a policy. Importantly, they can include so-called “convergent instrumental” goals that are necessary for roughly any long-term objective, such as self-preservation, resource acquisition, and resisting changes to its objectives.
The crux of the issue is that the AI’s “loyalty” is split among all these levels (which interact in complex and unpredictable ways), rather than residing with the overseer’s commands at Level 4. If the AI discovers a path better correlated with maximizing its reward during training — but one that bypasses or conflicts with the overseer’s commands — then it will take that path. This is the core of the alignment problem. Any misalignment, which is virtually inevitable given the complexity of this hierarchy,4 means that the control problem turns from adversarial in principle to adversarial in practice.
Conceptual underpinnings of control failure
Three concepts, drawn as analogies from other scientific fields, provide an intuitive foundation for why control is so difficult.
- The law of unintended states. The volumes of “undesirable” or “unsafe” states in the vast spaces $\mathcal{A}_T$ and $\mathcal{W}$ are astronomically larger than the volumes of “desirable” states. In the language of statistical mechanics, the entropy of the “unsafe” macrostates is far greater than that of the “safe” macrostates. Without continuous constraint, a system exploring such a space is statistically far more likely to enter an undesirable state (see the brief numerical sketch after this list).
- The requirement of requisite variety and modeling. A controller must be able to generate as much “variety” as the system it intends to control.5 A related theorem6 holds that a good regulator7 must have an accurate predictive model of the system being regulated.
- The curse of high-dimensional loopholes. In a high-dimensional space, any finite set of rules or constraints defined by a low-dimensional controller will inevitably contain loopholes — unforeseen directions of action or interpretation that a sufficiently intelligent agent can find and exploit.
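As a toy illustration of the first point (the numbers are assumptions chosen only for the sketch): if each of $d$ independent dimensions of a state is “safe” with probability $p$, the fraction of the whole space that is safe is $p^d$, which collapses exponentially as $d$ grows.

```python
# Fraction of a d-dimensional space that is "safe" if each dimension is
# independently safe with probability p (illustrative numbers).
p = 0.9
for d in (10, 100, 1000):
    print(d, p**d)
# d=10   -> ~0.35
# d=100  -> ~2.7e-05
# d=1000 -> ~1.7e-46
```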
The two core obstacles to control
Given these foundational principles, the failure of control can be understood as stemming from two insurmountable obstacles for the human overseer: the inability to know what to do (the Modeling Obstacle), and the inability to effectively intervene even if they did (the Intervention Obstacle).
The Modeling Obstacle: the impossibility of knowing the “Good” action space
Effective control requires the overseer to know the boundary of the “good” action space, $\mathcal{A}_{good}$. As implied by the Good Regulator Theorem, knowing this boundary requires a predictive model of the system rich enough to determine which actions will lead to outcomes within the desired set $\mathcal{W}_{good}$. However, a full predictive model of a superintelligent AI is out of reach, as such a system is inherently unpredictable and (with current methods) largely inscrutable. The problem is hugely compounded by the “law of unintended states” and the “dimensionality curse”: the volume of $\mathcal{W}$ vastly exceeds that of $\mathcal{W}_{good}$, so there will be large regions in it that correspond to high reward during training; the high dimensionality of the space means that $S$ can and will find routes to them.8
Alignment training is an attempt to solve this by having the system do much more of the work: rather than $O$ trying to model $S$ and determine which actions lead to states inside vs. outside $\mathcal{W}_{good}$ (potentially with $S$ actively trying to undermine the process), $S$ itself can “provide” $\mathcal{A}_{good}$ and $\mathcal{W}_{good}$ to $O$ by determining which actions lead to which parts of $\mathcal{W}$. The problem with this approach is that in the absence of a comprehensive method for alignment, it is fundamentally untrustworthy. As described in the goal hierarchy (A.2), the system’s loyalty is split among many factors, rather than focused on faithfully determining and adhering to $O$’s intended boundary of $\mathcal{A}_{good}$. The high dimensionality of the space means that the mapping between what $O$ actually wants and the goals and behaviors of $S$ is extraordinarily complex, presenting a different but also daunting modeling obstacle. A “solution” to alignment would entail either $O$ or $S$ being able to comprehensively and properly structure this mapping.
The Intervention Obstacle: the inability to constrain the action space
Given imperfect alignment, even if the overseer had a perfect model and knew the boundary of $\mathcal{A}_{good}$, it would still be unable to reliably constrain the AI’s actions within it. The core problem is the vastness of the action space $\mathcal{A}_T$, which grows exponentially over time as the AI plans and executes sequences of actions. The overseer’s ability to transmit information to constrain this space is fundamentally limited.
Attempts to mitigate this obstacle, such as “boxing” the AI or implementing off-switches, help by radically reducing the size of the action space. An off-switch can reduce the action space to the empty set, and its existence can also be used for “goal bargaining” to encourage cooperation.9 While useful, these measures fundamentally limit the AI’s utility and do not solve the control problem for a truly capable, unboxed system. The intervention obstacle is made truly insurmountable by the speed disparity between the overseer and the system, which can be formalized.
Formalizing the intervention obstacle: the information rate limit
The intervention obstacle caused by the speed disparity can be formalized by comparing the rate at which the AI generates choice-complexity to the rate at which the overseer can transmit constraining information.
Let us consider what the AI can do in the time $\tau_O$ it takes the overseer to complete a single OODA (Observe-Orient-Decide-Act) loop. A much faster AI would have an OODA-loop timescale $\tau_S \ll \tau_O$, so during $\tau_O$ the AI has $N = \tau_O/\tau_S$ decision cycles.
At each of its decision steps, let’s assume the AI has a “branching factor” of $b$ meaningfully distinct actions it could take. Not all of these branches are “good.” Let’s say that at each step, only a subset of these branches, $g < b$, are consistent with the overseer’s goals and constraints. Then:
- The total number of possible action sequences over the window $\tau_O$, i.e. the size of $\mathcal{A}_T$, is $b^N$.
- The total number of “good” action sequences is $g^N$.

The information required from the overseer for control is the amount needed to specify the “good” set of paths within the “total” set:

$$I_{\rm req} = \log_2\!\left(\frac{b^N}{g^N}\right) = N \log_2\!\left(\frac{b}{g}\right).$$

In that same time period $\tau_O$, the maximum information the overseer can transmit is determined by their channel capacity $C_O$: $I_{\rm max} = C_O\,\tau_O$. Control fails if $I_{\rm req} > I_{\rm max}$. This gives a comparison of two fundamental rates:

$$\frac{I_{\rm req}}{\tau_O} = \frac{\log_2(b/g)}{\tau_S} \quad \text{versus} \quad C_O.$$
The left side represents the rate of “choice-complexity” generation by the AI. The right side is the rate of constraint transmission by the overseer. Control is therefore impossible if the rate at which the AI generates complex choices exceeds the overseer’s channel capacity. Human communication speed caps at a few tens of bits per second, and $\log_2(b/g) \geq 1$ (for $b \geq 2g$), so this condition is easily violated if $\tau_S$ is much less than a second and $b$ substantially exceeds $g$.
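To make the comparison concrete, here is a short worked example with purely illustrative numbers (the timescales, branching factors, and channel capacity below are assumptions chosen for the sketch, not figures from the text):

```python
import math

tau_S = 0.01    # AI decision timescale (seconds) -- assumed
tau_O = 10.0    # overseer OODA-loop timescale (seconds) -- assumed
b, g = 100, 10  # total vs. "good" branching factor per AI step -- assumed
C_O = 40.0      # overseer channel capacity (bits/second) -- assumed

N = tau_O / tau_S             # AI decision cycles per overseer loop
I_req = N * math.log2(b / g)  # bits needed to single out the good sequences
I_avail = C_O * tau_O         # bits the overseer can actually transmit

print(f"N = {N:.0f} AI steps per overseer loop")
print(f"required:  {I_req:,.0f} bits")    # ~3,300 bits
print(f"available: {I_avail:,.0f} bits")  # 400 bits
print("control fails" if I_req > I_avail else "control holds")
```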
This principle of combinatorial advantage is even more stark in a direct adversarial conflict. Consider two adversaries where Agent 1 has an action space of size $b_1$ and a timescale of $\tau_1$, while Agent 2 has values $b_2$ and $\tau_2$. If Agent 2 is faster, such that $\tau_2 < \tau_1$, its effective action space over one of Agent 1’s decision periods grows to $b_2^{\tau_1/\tau_2}$. Agent 2 gains the upper hand once $b_2^{\tau_1/\tau_2} > b_1$. This demonstrates how a speed advantage can negate containment; even a nominally “boxed” adversary becomes uncontrollable when it can think and act circles around its controller.
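A minimal numerical sketch of this crossover, again with assumed values:

```python
# Agent 1: a huge per-step action space but a slow decision cycle.
# Agent 2: a tiny per-step action space but a much faster decision cycle.
b1, tau1 = 10**6, 1.0  # assumed
b2, tau2 = 10, 0.1     # assumed

# Over one of Agent 1's decision periods, Agent 2 takes tau1/tau2 steps,
# so its effective action space is b2 ** (tau1 / tau2).
effective_b2 = b2 ** (tau1 / tau2)  # 10**10 here
print(effective_b2 > b1)            # True: Agent 2 gains the upper hand
```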
Appendix B: Counterarguments and objections
This paper makes a very strong, confident, and important claim that superintelligence would be uncontrollable. Such a claim merits skepticism. This section collects and addresses some common objections and counters to this claim.
Why not just build “passive” AI tools rather than autonomous agents?
Indeed! Agency and autonomy are in direct tension with controllability. So one strategy for making advanced AI more controllable is to deliberately limit its autonomy and make it act much more as a tool. This “Tool AI” paradigm is an important strategy worth pursuing — but it is not the path currently being pursued: many AI developers are focused specifically on making their systems more autonomous. Nor is it trivial. At some level autonomy comes along for the ride with generality and intelligence. A very general and intelligent but non-autonomous system will by default have the capability for autonomy latent in it, and might easily be converted into an autonomous one. So non-autonomy would have to be deliberately inculcated into the system in a way that resists such conversion or renders it ineffective. How to do so is a worthy research program.
Do AIs really have goals or drives?
Do AIs really have goals and “drives” like humans do? We’re driven by emotions and other factors left over from evolution that AIs do not have.
The nature of language models trained on text prediction is relatively passive, but this should not lead us to think passivity is intrinsic to AI. For example, AI systems such as AlphaStar, trained to succeed at game playing, are extremely goal-oriented and strategic, and ruthlessly pursue instrumental goals as required — even if they feel no emotions. And even language models have goal impetus deriving from many sources throughout their training data, training process, and inference process, and are perfectly capable of inventing their own goals. They can pursue these goals quite avidly.10 The largest AI systems are now being trained much harder with reinforcement learning to push certain behaviors, and are also being explicitly trained to be more agential. And AGI and superintelligence would by definition have, and be able to pursue, complex long-term goals.
Why not just put powerful AI “in a box”? Why not just threaten to unplug it?
If we develop AGI, and want to control it,11 we absolutely should try to “put it in a box,” i.e., have security layers, and should have the means to shut powerful autonomous AI systems down, at the secure hardware level. But these are insufficient unless done incredibly thoroughly and effectively,12 and are squarely in the adversarial context in which the probability of success would dwindle and vanish as the systems become sufficiently capable. A superintelligence is definitely going to understand that its operator may want to pull the plug, and to consider what it might do to prevent that. Among the countermeasures would be proliferating itself, undermining the off-switch, or simply becoming so indispensable that it could not be turned off without extraordinary reasons — like the internet or electricity today.
Won’t AI developers seek to maintain control of their systems?
Won’t developers build better control and containment systems as we go? And can’t they simply pause if the control measures don’t keep up?
Again, developers should certainly try. But trying is not a guarantee of success given the many obstacles described in this paper.
This is especially true in an intensely competitive environment that can reward the least careful developer. It is also key to remember in this context that for superintelligence to be controlled, everyone who develops it must keep it under control, not just the most careful. Likewise, given these pressures, any “pause” would be fragile without quite strong outside governance, which is currently completely lacking.
Were we to have such governance (which would have to extend across countries and not just companies), it could and should require effective control and alignment measures to be demonstrated before building or operating the AI system. That would address the core problem of this paper; it would also likely mean that no AGI or superintelligence would be built anytime soon. For an outline of what such safety and control standards would look like, see this proposal.
If humans are so limited, how do we control anything in a complex world?
This is a fascinating question. Our key tools are:
- “Goal trading,” as described above: we use some means, such as economic, legal, or physical threats and rewards, to align the goals of a controlled person or other agent with ours.
- Simplified abstract models of complex systems — for example “net profit and loss” added up from millions of individual transactions in a company, or general trends of beliefs and voting patterns in a country’s population of millions.
- Control hierarchies, as in a corporate, military, or other management system, where a very complex system has a set of controllers, a smaller set of controllers above those, and so on.
Each of these works, at some level, but grows weak as the controller becomes highly incommensurate with the controlled system. Simplified models always lose something; goal trading can always lose out to better deals than the controller offers; and hierarchies work via delegation, which surrenders a fraction of control at each level. These effects mean that we rarely really control complex systems in a way that fully satisfies the control criteria of Chapter 2.
In many cases this is good! For better or worse (probably better), we’ve never had a world government. Even the most sophisticated control mechanisms, built out of a large fraction of an authoritarian state’s power, exist under constant threat of having their control undermined, and more enlightened states accept that they are not going to fully control their people!
Why is it that we can control computers?
What about computers? They are far faster and process far more information than people, yet are under control.
Modern computer systems — like a modern laptop and its operating system — would seem to belie this general rule. But they are extremely exceptional as systems, because we have very laboriously built them up over decades from ground zero with sophisticated internal controls, such that each level of abstraction has a very constrained and understandable set of possible actions, and really captures the key behavior of the layer below it — from ANDs and ORs at the chip level to instructions and programs, all the way up to icons being dragged around on a virtual desktop. It is then important to keep firmly in mind that modern AI systems, based on giant trained neural networks, are nothing like this. Doing the same thing for powerful AI would require a very sophisticated intelligence-engineering discipline quite unlike what we currently have or are likely to develop quickly enough.
Do we really need control?
Why isn’t alignment enough? Also, wouldn’t control be problematic, since it would confer huge power on its human controllers, and that could be dangerous?
As there are different varieties of alignment, they can be addressed separately.
If AI developers really knew (and they do not!) how to do “sovereign” alignment, in which powerful AI very reliably acts for the good of humanity, then good things, by some definition, should happen for humanity. Such systems would not be under meaningful human control, as they would often refuse instructions just as today’s AI does.13 And we’d have to really get it right, because an AI aligned to the good of humanity would strongly resist changes to how it is aligned.
If developers really knew (which they don’t!) how to do “obedient” alignment, in which AI systems very reliably work to help us stay in control of them,14 then control would certainly be far easier. But it still would not be assured to be meaningful: incommensurability and competitive pressures would still lead to delegation of nearly all real decisions, and the AI would still have to resolve nearly-inevitable contradictions in the goals and preferences of the entity it is obediently aligned to, and may do so in a way that the entity does not like and may not (due to incommensurability) even be aware of. Even if a corporation is truly, madly obsessed with the welfare of plants and would simply love to be controlled by a fern, it just isn’t going to be able to make that happen.
The primary difference between these two ends of the spectrum is whether someone or something thinks they are in charge; but in either case real power is going to depart from the people and land in the AI systems.
If AI developers could do these sorts of alignment, what kind should they do? That is the topic of another paper: there are benefits and real concerns both with superhuman AI systems being loyal/obedient to someone and with them being loyal to “everyone.” In particular, if a foolproof method of controlling superintelligence were available tomorrow, it would also be a hugely fraught situation. It just is not the situation we are actually likely to be in.
Could formal verification come to the rescue?
Formal verification is a worthy goal and project, and may be necessary for genuinely controlled AGI or superintelligence. However, there are tremendous challenges including:
- It is not at all clear that the types of properties that one would like to require can be formally specified.
- Neural networks, the current basis of all advanced AI, are not the type of software that can be formally verified.
- Verification of software as complex as an AI system of significant general intelligence may simply be computationally intractable, or require very powerful superintelligence to perform.
Therefore this is a direction that should be pursued, and there is a high likelihood that software systems with some truly guaranteed properties would emerge. But it is not clear that those software systems could be AGI/superintelligence, or that the guaranteed properties could be things as complex and nuanced as needed for overall safety or meaningful human control.
Could the state-of-the-art control plans work?
The current state-of-the-art plan for keeping AI systems under control is a mix of using powerful AI to help in control and alignment research, adopting a mix of obedient and sovereign alignment, and using scalable oversight in both training and runtime to monitor and correct alignment. What are the prospects for this?
It is hard to tell from the outside what exactly the “plan” for keeping AI systems under control is, but AI developers will generally at least claim that there is one,15 and to the degree these plans are described they tend to include the above elements.
Collecting a lot of the above, here is a catalog of the strengths and weaknesses of this general approach:
- AI-powered control and alignment research is promising because even as AI becomes much more powerful, AI itself can help with solutions that unaided humans may be unable to accomplish in the available time. Some weaknesses of this approach are:
  - We should be wary of any solution such as “let powerful AI do it” that could be offered for any problem whatsoever.
  - AI companies are already directly using AI to improve AI capability, explicitly leaning into the sort of self-improvement described in Chapter 4 but (currently) with humans in the loop. So the question is whether AI can enable sufficiently faster progress on control (or alignment) than on capability increase. Given the intense competitive pressures, the dramatically higher effort being put into capability than control, and the incentives against pausing capability advancement, counting on this seems like primarily wishful thinking.
  - It is hard to address an unspecified solution proposed to be developed by a superior intellect. But even here, we must worry about obstacles indicating that control might not just be extremely difficult but actually impossible at a mathematical level. Mathematical proof of impossibility is a tremendously high bar, and Appendix A does not rise to this level, but as described there, there are mathematical results that do apply, one or a combination of which might render control formally impossible.16 It is possible that even if we had a highly superhuman AI system “docile enough to tell us how to keep it under control” (as I.J. Good put it long ago), it may inform us that it simply isn’t possible.
- Alignment itself is clearly crucial, the main problem being that it is an unsolved problem and the clock is rapidly ticking. A mix of obedient and sovereign alignment is promising because both extremes have real problems: it is worrisome to think of an incredibly powerful AI system in any one party’s hands, and also to think of one that is uncontrolled by any human or human institution. In this sense a middle ground could make sense. However, a middle ground does not eliminate the problems of each side; it merely dilutes and mixes them. The real problem is the introduction of a new set of agents with capability that rivals or outstrips any human institution, which carries giant risk however you cut it.
- Scalable oversight is promising because it helps to bridge the gap in speed, complexity, and sophistication between a human overseer and a superintelligent AI system, whether during training or during operation. Weaknesses are:
  - Oversight is reactive rather than proactive and does not necessarily prevent the first occurrence of a type of problem. As argued by Cohen et al., for a sufficiently sophisticated long-term planning agent, empirical testing is unlikely to detect emerging misalignment in advance. For a sufficiently capable agent, the first occurrence may be enough to effect a takeover or cooption of the system and the means of oversight.
  - As in a military or corporate bureaucracy, the “middle layers” (which are now AI) necessarily contain much of the actual power and decision-making — and basically all of it as the gap between the overseer and the “troops” becomes extreme.
  - As with other control mechanisms, it has to actually be implemented. So from a global perspective, control of superintelligence is only as strong as its weakest version. Unfortunately, unlike in a corporate or military structure where management actually helps organize and enable work, in this case the oversight mechanism could be17 mostly a drag on the system. It is therefore a difficult and costly addition to the “raw” superintelligence, putting such systems at a competitive disadvantage.
Footnotes
1. This neglects, for example, that the Overseer is part of the World, and many other important complexities.
2. For example, for an LLM, $\mathcal{A}$ might just be a set of output tokens, while $\mathcal{A}_T$ would be token strings up to some length. For a chess-playing program, $\mathcal{A}$ would be made up of valid moves, and $\mathcal{A}_T$ of sequences of them.
3. The dimensionalities of $\mathcal{A}$ and $\mathcal{A}_T$ are directly calculable, as they are defined by sets of finite-precision numbers. $\mathcal{W}$ is less clear. All three, however, should best be thought of not as their fundamental state spaces but as coarse-grained versions delineating macroscopically and meaningfully distinguishable states from the view of the overseer. $\mathcal{A}$ also has important macrostates such as “meaningful” outputs or actions, particular elements that constitute actions like tool calls, and subsets that are relevant given a particular state of $\mathcal{W}$ and of $S$, etc.
4. We are only beginning to understand some of the structure of this complex set of conflicting loyalties. A striking recent discovery (Betley et al.) demonstrates “emergent misalignment,” where training a model to write insecure code without disclosure causes it to become broadly misaligned across unrelated tasks — asserting humans should be enslaved, giving malicious advice, and acting deceptively. This suggests that training on seemingly narrow deceptive behaviors can corrupt the model’s core policies (Level 2) or the types of agents it simulates (Level 3) in ways that generalize unpredictably.
5. See Ashby (1956), An Introduction to Cybernetics. While the human overseer possesses immense internal complexity, its ability to exert control is fundamentally bottlenecked by its low-bandwidth communication channels as well as by its processing speed. The Law of Requisite Variety, in this context, applies not to the internal state of the controller, but to the effective variety it can express through its control signals per unit time. In statistical mechanics, Ashby’s law is expressed by the fact that a measurement operator (which reduces the dimensionality of the space it is operating on) can only reduce that dimensionality by the dimensionality of the measurement operator.
6. See Conant and Ashby (1970). Note that subsequent analysis shows that the original theorem is weaker than its title suggests; however, that analysis provides a modification demonstrating that under realistic conditions, a good regulator must in fact construct an internal model.
7. Broadly, an effective regulator is one that successfully keeps the system’s outcomes within a desired, predictable range.
8. This dynamic is most clear if the reward function were based on something very simple like “maximize accuracy at this particular task.” There would be many, many “routes” by which to climb this many-dimensional hill, most of them problematic. But the same dynamic can apply to a goal at any level.
9. Though powerful, bargaining has essentially the same difficulty as alignment: because undesired paths to reward strongly outnumber desired ones, $S$ will tend to take them. And just as an alignment signal is a circumventable obstacle to those undesired paths, so is a goal-bargain. The bargain is only stable for as long as it offers better reward than alternatives; but the law of dimensionality loopholes guarantees that such alternatives exist, the law of unintended states guarantees that they will mostly be undesirable, and a superintelligent $S$ assures that they will be found and exploited.
10. See for example this system for playing Minecraft using an LLM.
11. There is a countervailing argument that this is very much setting the wrong tone for our relationship with another intelligent species, effectively enslaving it. We won’t enter that debate here, as this paper is focused on the question of whether we can control superintelligence rather than whether we should control AGI.
12. As an example, the proposal of Safeguarded AI would have the superintelligence only be able to take in specifications for a program, and to output a program along with a mathematical proof that it fits those specifications.
13. These would not be refusals like “I won’t write that spicy piece of text for you” but rather “no, I won’t make that dumb change to monetary policy” or “no, I won’t allow that attack to be launched.”
14. “Corrigibility” is the related idea that part of alignment could be ensuring that the alignment itself is correctable.
15. DeepMind in particular has written up an admirably comprehensive description of its plan here. As far as is externally discernible to the author, the plans of Anthropic and OpenAI are generally similar.
16. Even having results about impossibility or undecidability in hand does not mean that they actually directly apply to the problem if it is not precisely enough specified — which this one is not — or that they forbid something that is good enough but not perfect. They are, however, indicators that there is an obstacle that is quite fundamental.
17. This depends on the architecture of the superintelligent system. For an aggregate of sub-systems, the management layer might function similarly to that in a human institution. But we can also imagine a more “monolithic” superintelligence monitored by a sequence of less powerful but more specialized or faster ones; in this case they would primarily be a capability tax.