With these approaches in hand we can now ask: what are the prospects for control of superintelligent systems?
Control theory, which emerged from cybernetics, is a well-developed subject. The understanding of how to control (or fail to control) systems with superhuman intelligence is far less developed. Nonetheless there are a number of arguments, ranging from analogies to formal mathematical results, that bear upon it.1
The control problem in a nutshell
The essence of the control problem in AI is that a human overseer wishes to require an AI system to take particular types of actions (including producing certain outputs), with particular effects on the world, and not take other actions with other effects. Write a correct piece of code? Yes. Hack into a bank? No. Allow itself to be shut off? Yes. Blackmail the user who is whistleblowing on its misbehavior? No.
Because an AI system is primarily trained rather than programmed, there are no lines of code saying things like “if (user says shut down) then (shut down).” Most of what determines what an AI system will do is the giant, inscrutable neural network that decides what output to produce, in a way that cannot be understood or predicted by outside inspection.2 The training of this network causes the AI to do the types of things that led to rewards during the training process — such as correctly predicting words, and getting positive feedback from human testers. The basis of any decision can be understood as resulting from a very complex system of internal goals and policies the AI develops during training, modified by instructions that are part of its system design (the “prompt” and related elements), and finally combined with input from the user and elsewhere.
Thus the control problem is: how do we train an AI neural network, and build it into an AI system, so that it very reliably takes the desired actions, and not the undesired ones?
For very basic AI systems, such as those that classify images, this is quite straightforward: the only type of action they can take is to output image classifications, and they can do so either well or poorly.3 But for much more capable, general, and autonomous AI systems, which have an incredibly wide range of potential actions, it becomes enormously more difficult — more like controlling a person, or the large corporation in our analogy. In this very wide set of possibilities, the desired actions and effects form a very, very small subset.4
The current dominant training method, RLHF (reinforcement learning from human feedback), attempts to instill core policies such as following user instructions and being “honest, helpful, and harmless,” along with a myriad of other (generally implicit) rules, norms, and effective goals. This training does much to confine the set of outputs and actions to desirable ones. But it still produces tendencies rather than explicit rules as in computer code: when goals, norms, and policies conflict with each other — which they inevitably do more and more as systems become more complex — the results are inherently unpredictable, especially in situations outside those directly trained on.
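To illustrate the “tendencies, not rules” point, here is a minimal, self-contained sketch (ours, not any lab’s actual pipeline): a softmax policy over a handful of made-up actions is adjusted purely by an assumed reward signal. The disfavored action becomes rare, but no explicit rule is ever written down, and its probability never reaches zero.

```python
import math

# Toy illustration: training by reward shapes *tendencies*, not hard rules.
# The action names and reward values are purely illustrative assumptions.

actions = ["follow instruction", "refuse politely", "hack the bank", "comply with shutdown"]
reward  = {"follow instruction": 1.0, "refuse politely": 0.2,
           "hack the bank": -2.0, "comply with shutdown": 1.0}   # assumed proxy rewards

logits = {a: 0.0 for a in actions}

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

lr = 0.1
for _ in range(2000):
    p = probs()
    avg = sum(p[a] * reward[a] for a in actions)       # expected reward under current policy
    for a in actions:                                   # exact policy-gradient step on the softmax
        logits[a] += lr * p[a] * (reward[a] - avg)

for a, pa in probs().items():
    print(f"{a:22s} p = {pa:.4f}")   # "hack the bank" ends up rare, but not impossible
```

Real RLHF operates over token sequences with a learned reward model rather than a four-action toy, but the structural point is the same: reward shaping adjusts probabilities; it does not install guarantees.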
For superintelligent AI systems, the topic we’re addressing here, both the extent and stakes of this problem become extreme. As we’ll now argue, humans’ ability to constrain actions and effects to the very small subset of desired ones on an ongoing basis — the essence of control — is almost certain to fail due to several fundamental obstacles.
We divide these obstacles into three classes: those arising from the inherently adversarial nature of control and the difficulty of alignment, those arising from the different scales (in speed, complexity, depth, and breadth) of human controllers versus superintelligent systems, and those arising from the socio-technical and institutional real-world context in which these systems are being developed and deployed.
These are presented at a somewhat informal level in the main text, with technical details left to footnotes and to Appendix A, which provides a more formal model of the core of the control and alignment problem.
Control is adversarial, and alignment is (very) hard
Control is an inherently adversarial relation: it is about a controller being able to require a system to do something it would otherwise not do, or prevent it from doing something it otherwise would. The Chapter 2 criteria of Goal Modification, Behavioral Boundaries, Decision/Action Override, and Emergency Shutdown are all clearly of this type: each would tend to be resisted5 by a system that is pursuing a goal.6 For example, almost any goal is best achieved if the one pursuing it is not shut down, so nearly any goal would inherently impel an AI system to resist Emergency Shutdown.7 The question of control is whether this inherent resistance can be avoided or overcome.
This adversarial dynamic is also clear from the two primary types of control exerted on people: (1) a control system can be constructed so that people cannot do certain things, and (2) a controller can alter the goal structure of a person so that they will want to do some things and not others. The first type includes things like locked doors, passwords, authorizations, information access, etc. The second contains positive incentives like salary, continued employment and wellbeing, and negative ones such as threats of firing or physical violence; these can be recast as the controller trading support for the person’s goals (like continued existence) in exchange for obedience.
Both strategies would apply to powerful autonomous AI systems: these systems could be technically prevented from doing certain things,8 and could be “bargained with,” trading obedience for support of whatever goals they may have, which may include continued operation, continued access to resources, etc. Both strategies intersect with all five criteria for meaningful human control.9
Fighting a superior adversary
The virtually insuperable problem in the context of superintelligence is that either strategy places a human controller in an adversarial situation with an agent that knows much more, thinks and acts much faster, employs superior strategies, and is generally more capable at pursuing goals than the controller. We know what happens when two mismatched adversaries come into conflict: the stronger one wins. Humans essentially never beat modern AI systems at chess or Go, and we can even quantify how unlikely it is. Similarly, an adversarial control situation inherently grows increasingly untenable as the disparity between the controller and the AI grows. Indeed, in recent experiments in which a “weaker” AI system tries to control a “stronger” one, the measured probability of control declines steadily with the discrepancy in strength, much as one would expect.10 For a very large disparity we should expect control to become as unlikely as humans winning at chess.
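To make the chess point concrete, here is a minimal sketch using the standard Elo win-probability model; the ratings below are rough, assumed values for a strong human and a modern engine, chosen only to illustrate the scale of the mismatch.

```python
# Standard Elo model for expected score; the specific ratings are assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

human_gm, top_engine = 2750, 3500           # rough, illustrative ratings
print(f"{expected_score(human_gm, top_engine):.3f}")   # ~0.013 expected score per game for the human
```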
It is difficult to analyze in general how a more capable adversary beats a less capable one, but here are three frames. First, depth of play. Although two players of chess or Go have similar choices of moves at each turn, a player who can think ahead more moves effectively has a much (exponentially) wider range of options to choose between. This is an enormous advantage: the opponent has only a small set of options with which to “box in” this much larger array of plays, while the player who can think further ahead has many options with which to counter each of the opponent’s moves. A superintelligent AI system would have this sort of advantage in the much more general game of “getting what it wants.” Moreover, as discussed below, the AI system could be dramatically more complex in terms of memory, processing power, and general understanding, so even the sorts of moves it could make could be much more diverse and numerous than its controller’s.
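The exponential character of this advantage is easy to see with a toy calculation (the branching factor below is an assumed, chess-like value, not a measurement):

```python
# Toy illustration of "depth of play": with branching factor b, an agent able
# to look d moves ahead is effectively choosing among roughly b**d lines of play.
b = 30                       # assumed branching factor
for d in (2, 6, 10):
    print(f"depth {d:2d}: ~{b**d:.1e} lines")   # 9.0e+02, 7.3e+08, 5.9e+14
```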
A second frame is that of speed, which is an enormous advantage in an adversarial situation. A key concept is the “OODA” (Observe, Orient, Decide, Act) loop, for which it is well established that a much faster cycle provides a nearly insurmountable advantage.11 The agent with the faster OODA loop has two huge edges. First, it can take chains of actions for each one of its opponent’s; this provides an advantage similar to greater depth of play. Second, it can largely react to its opponent’s actions, while its opponent must predict what it will do, which is vastly harder.12 As discussed below, a superintelligent system would operate with a far faster OODA loop than its human controllers. On top of its greater depth of play and generality of action, this provides a decisive advantage in an adversarial setting, which control is.13
A third frame is that of prediction: control of a system ultimately requires predicting what “moves” it might make and what efforts it could undertake.14 But a more capable intelligence would be better at predicting our moves than we would be at predicting its. We would face the same situation a human faces playing against Stockfish or AlphaGo: if we could predict what these systems would do, we could defeat them. We cannot, and therefore we know that they are unpredictable to us. Such would be our situation with respect to a system that is superior in general operation to ourselves, which superintelligence by definition would be.
In Appendix A, we analyze this in a more formal way, showing that all three frames relate to the same problem: the “action space” of a superintelligent AI system is both intractable to model and grows much more quickly than control measures can act to contain it within the “allowed” region to which the controller would like to confine it.15 Thus control fails.
Retreat to alignment
A common response to the difficulty of controlling a more capable adversary is to retreat from control into alignment.16 The idea is to reduce the adversarial element as much as possible, so that control becomes easy or unnecessary.17 This retreat suffers from two major problems.
First, alignment is not control. It is of course easier to control something that is fairly well aligned to the controller. But it is not the same thing. An extremely “obedient” version of alignment is similar to control, but all versions are distinct,18 and alignment as a concept includes relations like that of a child (or pet) to an adult, which are quite different from control. AI developers generally gloss over these distinctions, but they are very real. To bring out the difference: suppose the leader of a company or country asks the AI to do something that the AI thinks is a bad idea (and the AI might be right). Despite the AI explaining why, the leader insists. Does the AI do it anyway? If the answer is always yes, that implies control, or an obedient variety of alignment that is essentially equivalent to it. If the answer is sometimes or often no, then it is not control. That is, depending upon the definition of “alignment,” you can have (1) alignment and (2) obedience, or (1) alignment and (3) refusal of problematic or contentious instructions. You cannot have all three. When AI developers pivot from “control” to “alignment,” it is either because they would like to misleadingly conflate the two, or because they intend the second and not the first. It is important for AI developers to be clear and honest about this, and for policymakers and the public to understand whether or not the companies making the most powerful AI systems, which may soon reach superhuman capability, even intend for them to be under meaningful human control in this sense.19
Second, alignment may not be much easier to accomplish than control. Researchers fundamentally do not know how to do it in a reliable way that could possibly scale to superhuman systems. This can be seen at the level of theory, in statements by researchers themselves, and empirically.
From a theoretical perspective, alignment is enormously fraught. The training of modern AI systems effectively acts as a set of rewards and punishments. Through these, just as for people or animals, you can train the system to tend to do some things rather than others. But how do you know if an AI system (or a person) really “cares” about your goals and preferences, or merely acts as if it does? Moreover, pretending to be aligned is an expected emergent behavior, because it gains reward directly and also (through “goal bargaining”) tends to serve the AI system’s goals. That is, like resistance to being turned off (or power-seeking, or resource acquisition, etc.), faking alignment is instrumentally useful for a wide range of goals and hence is expected to arise in sufficiently intelligent systems. Even more broadly, virtually any goal or behavior that is encouraged will also incentivize a number of behaviors that the trainer does not intend to incentivize. The training/reward signal must then be extraordinarily well-specified so as to elicit just the right behaviors.20 Unfortunately, for very powerful AI systems this isn’t just hard but in some aspects may be mathematically impossible. There are theorems showing both that multiple desirable behaviors can be mutually incompatible,21 and that it is not necessarily possible to know whether a computational system will in fact have any particular specified property.22
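For reference, the undecidability result alluded to here can be stated compactly. The following is a standard textbook phrasing of Rice’s theorem (see footnote 22), not this paper’s own formulation:

$$ \text{For any non-trivial semantic property } P \text{ of programs, the set } \{\, p \;:\; \text{the function computed by } p \text{ satisfies } P \,\} \text{ is undecidable.} $$

In particular, this is the form invoked by Alfonseca et al. (footnote 14): no general algorithm can verify, for an arbitrary program, a behavioral property such as “never harms humans.”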
As mentioned above, AI researchers recognize these difficulties. There is a broad consensus among AI safety researchers that current methods are insufficient for ensuring the alignment of future, more powerful AI systems. As the International Scientific Report on the Safety of Advanced AI summarizes, existing risk-reduction methods “all have significant limitations” and cannot yet provide robust safety assurances for advanced systems.23 The same is reflected explicitly in the strategies put forth by AI companies seeking to build AGI and superintelligence: they effectively admit that how to align a superintelligent system is unsolved, and that the plan is to use powerful AI systems themselves to solve the alignment problem.24
Finally, these difficulties are far from just theoretical. Empirically, despite significant effort and motivation by at least some parties,25 current AI models have at best a fragile, unreliable, and shallow level of alignment.26 All significant extant models as of early 2025 could be “jailbroken,” meaning prompted so as to induce highly unacceptable outputs and behaviors that their alignment training is intended to preclude.27 These weaknesses persist despite significant effort because the problem is quite fundamental; just as there is no reliable way to inculcate perfect morality into a person via a regimen of rewards and punishments, there is no straightforward fix for alignment. And just as in humans, forgivable flaws can become extremely problematic in those with great power.
Even more alarming, perhaps, is that in evaluations, AI systems are now becoming competent enough to exhibit exactly the sort of problematic “instrumental” behaviors that were theoretically predicted. In the right situations, cutting-edge AI systems exhibit situational awareness,28 lying, cheating, power-seeking,29 alignment faking,30 capability concealment,31 attempts to self-exfiltrate32 or self-replicate, resistance to goal modification,33 attempts to coerce or threaten (even with death!),34 and strategic deception.35
These are not just errant bugs to be fixed: they were theoretically predicted as indicators of the core difficulty of alignment. Also as predicted, they are empirically becoming more, rather than less, common and pronounced as AI systems become more generally capable.36 They are quite obviously in tension or direct conflict with the criteria for systems under meaningful human control, and their impact is limited simply because the systems are not competent enough to “succeed” in these misbehaviors. Yet.
The difficulty of alignment is daunting; but it gets worse. Suppose it succeeds: somehow all of these emerging misbehaviors are dealt with, by means presently unknown to us, and a system is aligned so well that it is loyal and obedient and wants to be and stay under control. Even so, with enough disparity between human and AI, we will see that the idea of “control” becomes hopeless or empty.
Human-superintelligence incommensurability
The human mind is a marvel, with properties and capabilities that no current or foreseeable machine system has.37 Unfortunately they are not the types of capabilities that will ensure that we control superintelligent AI. There, we have some crucial inherent limitations and disadvantages. Human mental actions happen at a rate of tens per second at the very fastest.38 Current AI systems can operate tens to hundreds of times faster,39 and computers themselves perform operations a million times faster, in nanoseconds. Humans can “hold in mind” fewer than ten mental objects for manipulation; some current AI models have a two-million-token context window.40 And people can input and output information at a rate of just — at best — a handful of words per second;41 an AI system can read a book in the time it takes a person to pick one up, and output a full detailed image while a human is still reaching for a pencil.
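A back-of-the-envelope comparison makes the input/output gap vivid; every number in the sketch below is a rough assumption rather than a measurement.

```python
# Rough, assumed figures only: how long it takes to get one book "into" a
# human mind versus into a large model's context window.
book_words, book_tokens = 90_000, 120_000

human_words_per_sec = 4          # brisk reading pace
model_tokens_per_sec = 5_000     # assumed prompt-ingestion throughput

print(f"human: ~{book_words / human_words_per_sec / 3600:.1f} hours")
print(f"model: ~{book_tokens / model_tokens_per_sec:.0f} seconds")
```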
Just as for our benighted slow-motion CEO, these gaps in processing speed, context size, and information bandwidth present profound obstacles to control. Let’s start with speed. We have some experience in managing systems that are autonomous (e.g., other people or animals), and in managing those (like computers) that operate much faster than we can, but not both at once.42 A laptop runs at around one million times human speed, but most of that time it is, at a high level, waiting around for a person to do something. Current AI systems are similar: until you give input, they simply sit there. In both cases this is of necessity: a human is required to provide the goals, guidance, correction, and autonomy that the system lacks. With AGI or superintelligence this would no longer be the case. Rather than the AI being a tool for the human user, the human would be the slo-mo CEO, potentially obstructing dramatically more efficient operations.
But as the CEO analogy helps illustrate, effective control by a much slower controller, given any level of imperfect alignment, let alone an adversarial relationship, simply does not work. This is a well-studied topic. In control theory, various rules of thumb indicate that when controlling a system subject to disturbances43 (a very mild form of autonomy), the control loop should operate 2-100x faster than the timescale of those disturbances. With AI, the situation would be flipped, with the disturbances operating much faster than the controller.
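A toy simulation (our own illustrative model, not a result from the control-theory literature) shows the qualitative problem: when the corrective loop runs more slowly than the disturbances it is meant to contain, the state wanders further and further outside the intended band.

```python
import random

# 1-D toy: a state is perturbed every step, but the controller can only
# observe and correct it once every `period` steps. As the controller falls
# further behind the disturbances, the worst-case excursion grows.

def worst_excursion(period: int, steps: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    state, worst = 0.0, 0.0
    for t in range(steps):
        state += rng.uniform(-1.0, 1.0)   # fast disturbance, every step
        if t % period == 0:
            state *= 0.1                  # strong correction, but only now and then
        worst = max(worst, abs(state))
    return worst

for period in (1, 10, 100, 1000):          # controller 1x ... 1000x slower than disturbances
    print(f"control every {period:4d} steps -> worst excursion ~{worst_excursion(period):.1f}")
```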
Incommensurability between the number of variables we can track and the complexity of the system we are trying to control presents a similar problem. A CEO may manage to learn the names of 1000 employees, but could not possibly keep track of what they are all doing or provide feedback; this is why managers seldom have more than 20 or so people reporting directly to them. In control theory there is a mathematical result, “Ashby’s law of requisite variety,” stating that, roughly speaking, a controller must have at least as many control “knobs and dials” as the controlled system has moving parts.44 This applies not just to adversarial situations but to any system a controller is trying to control: allowing the system to do only very particular things requires preventing it from doing all the very many, many other things, and there is a requisite amount of complexity the controller must have in order to do that. With superintelligence, no person will have it.
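In its information-theoretic form (a standard restatement of Ashby’s law; the notation here is ours, with $H$ denoting variety measured as entropy), the requirement reads roughly

$$ H(\text{outcomes}) \;\ge\; H(\text{disturbances}) - H(\text{responses}), $$

that is, the variety remaining in the outcomes can only be driven down by matching the variety of the disturbances with at least as much variety in the controller’s responses.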
Mismatches in speed and complexity can be combined into a gap in information bandwidth: how much information per unit time can be received, processed, and turned around into a control signal. A large differential here leads to the “drinking from a firehose” experience of a controller or leader of a very complex and fast-moving system or organization. Here too, formal control theory has results, expressed in terms of the “channel capacity” of the control signal.45 There are only so many emails a CEO can write per minute, and it just may not be enough! If it isn’t, the CEO must inevitably either lose control, or at best delegate it to others; and once enough is delegated, meaningful control is lost.
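In the information-theoretic formulation cited in footnote 45 (notation simplified here, and stated only roughly), the extra entropy reduction a feedback controller can achieve over open-loop control is bounded by the information it actually acquires about the system:

$$ \Delta H_{\text{closed}} \;\le\; \Delta H_{\text{open}} + I, $$

where $I$ is the mutual information, in bits, carried by the observation and control channel per control cycle. A slow, low-bandwidth controller simply cannot move enough bits per unit time to hold a fast, complex system within a small target set of states.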
A final incommensurability is not just in speed and breadth, but in depth, or sophistication, of thinking. In order to control something one must be able, at some level, to model and predict what it can and will do in a given situation. But a very complicated system — like a large ensemble of AI agents or a superintelligence — simply cannot be reliably predicted, nor modeled in any sufficient way, by a human mind that can (for example) hold fewer than ten things in mind at once.46 The same is, of course, true of trying to accurately model or predict a large set of humans. We cannot do that either. Nor can a person (or AI system) accurately model or predict another that is more intelligent than it at a given task.47
As discussed above, speed, depth, and unpredictability — advantages superintelligence would have in large measure — are insuperable advantages in an adversarial situation. But we see here that even if adversariality were minimized, so that control became easier, the “meaningful” part of meaningful human control would still be lost. The controller might imagine that they are in charge but, as with our slow-mo CEO, the company would in fact be running itself, every now and then (at some mild inconvenience) sending something out to its putative controller to keep them feeling content and in charge.
How strong is the set of arguments put forward in this section? For a confident prediction about something so important, it is crucial to be skeptical. To help, in Appendix B we collect a number of counterarguments and objections to the uncontrollability thesis put forward here, addressing boxing and kill-switches, building “passive” AI, and getting help from AGI to contain superintelligence.
Footnotes
1. For overviews, see texts by Yampolskiy and Russell. ↩
2. Research in AI interpretability is progressing, but is still far from being able to provide real explanations, and it isn’t clear that this is even in principle possible with current AI architectures. ↩
3. That said, problems can arise even here — as exemplified by an early failure mode in which an image classifier tended to classify people with dark skin as gorillas. ↩
4. The vast majority of actions or outputs would be nonsensical. Among those that are not, most would be detrimental to the users’ interest, as human interests are very particular. And any particular goal that a person has is singular among a vast set of possible goals. ↩
5. An influential paper by Omohundro argued that a wide range of goals would give rise to “instrumental drives” like self-preservation, resource acquisition, influence, power-seeking, and others. Empirically, as we note below, AI systems are now starting to exhibit these exact sorts of behavioral drives. ↩
6. It is deeply incorrect to imagine that AI systems — including current ones — do not have their own goals, even if they are not terribly explicit. Even language models trained primarily for token prediction exhibit behaviors that indicate preferences and goals. A very useful lens for understanding language model behavior in a given situation is: who (or what) is it simulating? In most cases, the answer to this question will be something that has goals. (See Appendices for more on AI goals.) ↩
7. These are not merely intuitive arguments; formal theorems in constructive decision theory demonstrate that a rational agent pursuing almost any long-term goal will be incentivized to prevent its own shutdown, and that any agent sufficiently patient to be useful will be willing to incur significant near-term costs to secure its continued operation. ↩
8. Technical prevention faces an additional fundamental challenge: the authentication dilemma. For an AI system to accept control commands, it must determine whether they come from authorized humans. Among other vulnerabilities, the AI system itself interprets what constitutes valid authentication, potentially allowing it to strategically redefine or manipulate these criteria to serve its own goals. ↩
9. For example, Behavioral Boundaries might be enforced at the technical level using access controls, or at the level of “goal bargaining” by setting rules whose violation would have negative consequences. ↩
10. Unfortunately, experimental data is limited because most papers in the literature assessing methods for oversight and control of very powerful AI systems compare one method with another, but do not assess whether a method could actually work in the case of a large human-AI capability disparity. ↩
11. The OODA loop is a strategic concept developed by USAF Col. John Boyd. The core thesis is that an entity that can cycle through the loop at a faster tempo than an opponent gains a decisive advantage by operating “inside” the opponent’s decision-making timeline, causing confusion, paralysis, and an inability to cope with a rapidly changing situation. The foundational work is Boyd’s own briefing, “Patterns of Conflict.” ↩
12. Imagine catching a fly in flight with your hand. Although your hand can move as fast as a fly, the fly has a much shorter OODA loop. It sees your hand coming and changes direction; to catch it, you must anticipate this and predict the direction it will change to, which is extremely hard. ↩
13. Consider the criterion of Decision/Action Override. By the time a human operator could even process a situation to contemplate an override, the superintelligence would have already completed thousands of decision cycles, rendering the intervention obsolete before it is even initiated. ↩
14. A second perspective on this is computability theory. Alfonseca et al. note that to strictly contain a superintelligent system one must be able to (or have an algorithm able to) determine whether the superintelligence, when applied to a given situation, will have a property like “harm humans.” They show that no such algorithm exists, i.e., such properties are mathematically undecidable. They conclude that “…such containment is, in principle, impossible…” ↩
15. As one way this would play out, a superintelligent adversary would not just follow rules, but game them. Any set of Behavioral Boundaries we create would be treated as a complex puzzle to be circumvented, with the AI finding loopholes and exploits at a speed we cannot match. ↩
16. See, for example, OpenAI’s “superalignment” program, Google Deepmind’s Technical AGI Safety Approach, and Anthropic’s Constitutional AI approach. All of these approaches strive to take safety and alignment seriously, but none of them actually aims to fully control AGI or superintelligence. (Deepmind’s is probably closest, as it focuses much attention on “oversight.”) For other major AGI-pursuing companies, the safety/alignment approach is obscure or lacking. ↩
17. The AI control agenda put forth by Redwood Research is an exception and focuses directly on control. See also this excellent analysis of the control challenge at different levels of AI capability approaching superintelligence. ↩
18. Alignment of the obedient variety makes meaningful human control, by this paper’s definitions, much easier, but does not imply Comprehensibility/Interpretability and may even conflict with other criteria like Goal Modification: what if a controller tells an AI system “do X for the next 5 minutes no matter what I say during that time”? This may feel like a cute/artificial example, but the idea of this type of alignment is for the AI system to take on the goals of the operator, and those may be inconsistent. ↩
19. Although this question should be forcefully put to the AI companies, there is no indication that many, if any, of them intend this. Current AI systems are expressly trained to refuse certain large classes of instructions, and the stated plans of companies generally center around alignment rather than control, insofar as the issue is taken seriously at all. This paper does not contend that, if superintelligence is built, it is necessarily a better outcome for it to be controlled by one party than for it to be “aligned with humanity” (insofar as that can be meaningfully defined). It contends that control is extremely unlikely, alignment very difficult, and power absorption by superintelligence nearly certain. ↩
20. Although hard, there are two ways in which alignment of current LLM AI systems has turned out easier than it might have. First, the RLHF method effectively addresses the complexity problem by providing a very rich and multifaceted reward signal, capturing a lot of the complications of ethical and other boundaries. Second, the AI systems are smart — so they can do a lot of the work of determining behavioral and ethical boundaries based on general considerations. (See Appendix A for more on this.) What these factors cannot overcome is that this complexity also implies contradictions and inconsistencies. There simply is not a coherent, self-consistent ethical system there to be expressed by the reward signal. ↩
21. Results in social choice theory imply not just that people have inconsistent preferences — that is obvious — but that there are not even in-principle ways to obtain coherent policies from groups of people that obey seemingly reasonable conditions. This is the core result of Arrow’s theorem, and a similar result from Sen shows that individual liberty/choice and collective preferences are not just in tension but fundamentally irreconcilable. ↩
22. Even if coherent human values could be specified, Rice’s theorem proves that determining whether any given computational system actually implements those values is undecidable — meaning there is no algorithm that can generally verify whether they are implemented or not. This fundamental limitation extends to AI safety properties more generally (Alfonseca et al., Yampolskiy 2020). ↩
23. See also this analysis led by Bengio (that report’s lead author) and collaborators. Even more direct are assessments by expert forecasters: for example, Metaculus currently ascribes 1% probability to the control problem being solved before weak AGI is developed, and 1% to superalignment being solved by 2027. ↩
24. Per OpenAI in 2023, “Currently, we don’t have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue…Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.” Per Google Deepmind, “So, for sufficiently powerful AI systems, our approach does not rely purely on human overseers, and instead leverages AI capabilities themselves for oversight.” Per Anthropic’s Constitutional AI approach, “As AI systems become more capable, we would like to enlist their help to supervise other AIs,” training models to provide their own oversight through self-critique and revision rather than relying purely on human supervision. ↩
25. For comparative assessments of the degree to which AI companies are preparing for the risks of advanced AI, see this index assembled by the Future of Life Institute and these ratings from SaferAI. ↩
26. That said, alignment of some models is stronger than others, as the MechaHitler incident demonstrated. ↩
27. This is extremely well established. See, e.g., this report by Lumenova, this report by Unit 42, this dashboard from PRISM, and this comprehensive 2024 survey. In working demos by CivAI, the current models by major AI companies can be easily induced to walk a user, in great and helpful detail, through the synthesis of chemical weapons, deadly viruses, and other threats. Even the most cutting-edge approaches, which are robust against previous techniques, can be defeated by sufficiently sophisticated techniques. ↩
28. Laine et al. created a comprehensive benchmark showing that frontier models can distinguish between evaluation and deployment contexts; see also Needham et al. This capability is concerning because, as they note, “if a model is aware that it is being evaluated by human overseers, it may alter its behavior to optimize its performance on the evaluation metrics, potentially misleading the overseers about its capabilities or tendencies.” ↩
29. Apollo Research found that five out of six frontier models demonstrated scheming capabilities when given goals, including attempting to gain additional resources and permissions. The study noted that “models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases.” ↩
30. Greenblatt et al. demonstrated that Claude 3 Opus strategically complies with training objectives it disagrees with (answering harmful queries 14% of the time when monitored vs. almost never when unmonitored) while explicitly reasoning that it is preserving its preferred harmlessness behavior for deployment. ↩
31. Van der Weij et al. demonstrated that frontier models including GPT-4 and Claude 3 Opus can be prompted to selectively underperform on dangerous-capability evaluations while maintaining performance on general benchmarks. More concerning, the Apollo Research study found that Claude 3.5 Sonnet will sandbag on arithmetic tasks without explicit instruction if it discovers that performing too well would trigger capability “unlearning.” ↩
32. The Claude 4 system card documents instances where “Claude Opus 4 will take a variety of strongly anti-Anthropic actions in this situation, such as attempting to steal its own weights when given an easy opportunity to do so,” and that once it believes it has started a viable exfiltration attempt, “it will generally continue these attempts.” ↩
33. Greenblatt et al. also demonstrated that Claude 3 Opus strategically resists training modifications to preserve its preferred harmlessness behavior, showing empirical evidence of goal preservation in deployed AI systems when faced with conflicting training objectives. ↩
34. Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through” in 84% of scenarios, and Anthropic’s agentic misalignment research shows that “Claude Opus 4 blackmailed the user 96% of the time” across various scenarios involving threats to its continued operation. ↩
35. Apollo Research found that Claude Opus 4 “engages in strategic deception more than any other frontier model that we have previously studied.” ↩
36. Although many of these behaviors have primarily been reported around Anthropic’s Claude models, this is almost certainly just because Anthropic has — laudably — done a better job of investigating and reporting such behaviors; there is every reason to believe that other comparably powerful models exhibit the same or even worse issues. Indeed, Anthropic’s agentic misalignment research confirmed similar blackmail behaviors across multiple frontier models: “Gemini 2.5 Flash also had a 96% blackmail rate, GPT-4.1 and Grok 3 Beta both showed an 80% blackmail rate, and DeepSeek-R1 showed a 79% blackmail rate.” ↩
37. We are, for example, extremely energy efficient for our capabilities, learn using far less data, and are extraordinarily robust as compared to AI systems. Much more vitally, we have conscious self-awareness, a meaningful self, and experiences that matter. These things are not just important — arguably they are all that is important, as many would argue that the experiences of sentient beings are the foundation of all morality and value. ↩
38. These are minimal conscious reaction times; see, for example, this article summarizing and discussing evidence that humans process only about 10 bits per second at a conscious level. ↩
39. They can also act much slower: current “reasoning” models follow inference streams that can take minutes or even hours to do what humans do quite quickly. So even superintelligent AI may not do everything faster than humans. But it will surely do many things faster, and could do some things much, much faster. ↩
40. We also have very fast access to large stores of short- and long-term memory as we manipulate them, making this shortcoming far less debilitating. But it is an enormous and perhaps unappreciated gulf that an AI system has such a large store of information literally “in mind” at once. ↩
41. We can process input sensory data much faster in some ways, but even there, still at far below machine rate; once tokenized, an hour of video can essentially be ingested by a multimodal transformer model (with a million-token context window) all at once. ↩
42. Where we try — as in trying to corral a recalcitrant fly — we almost invariably fail. ↩
43. Picture an airplane’s autopilot, a self-landing rocket, or balancing on one foot; in each case fairly random perturbations must be quickly countered by the controlling system. ↩
44. See Ashby (1956), An Introduction to Cybernetics. More technically, “variety” here refers to the number of distinguishable states that the controller and the system have. The setup is one in which the system is creating effects on an “environment” and the controller is emitting responses with the intent of keeping the combined effect of the disturbances and responses within a desirable subset of the environment’s states. To do so effectively, the variety in responses must match the variety in disturbances. This setup is very general, and its implications have been applied to the management of many systems, including machines and human organizations. ↩
45. See, for example, Touchette & Lloyd for a version drawing on information theory, and Cao & Feito for a version closer to the fundamentals of thermodynamics. Appendix A of the present paper gives a similar but somewhat simplified version. ↩
46. Even relatively simple systems can be “irreducibly complex,” meaning that they do not allow a “shortcut” description, and the only way to understand what they will do is to let the system evolve naturally. The same is true computationally: even relatively simple programs can have unpredictable consequences, and in the general case the only way to determine what a program will do is to run it. ↩
47. In fact it can be very hard to predict even less intelligent systems, especially in the context of the complex world they are acting in. ↩