The Corrigibility Folk Theorem
Why corrigibility might be more straightforward than previously thought.
In the 2015 paper "Corrigibility", MIRI researchers and Stuart Armstrong from the Future of Humanity Institute wrote:
Consider an agent maximizing the expectation of some utility function U. In most cases, the agent’s current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior. In Stephen Omohundro’s terms, “goal-content integrity” is an instrumentally convergent goal of almost all intelligent agents (Omohundro 2008).
This holds true even if an artificial agent’s programmers intended to give the agent different goals, and even if the agent is sufficiently intelligent to realize that its programmers intended to give it different goals. If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U* (as this change is rated poorly according to U). This could result in agents with incentives to manipulate or deceive their programmers.
My objective in this piece is to explain why I think the framing of this problem is rather strange, and then to explain my alternative frame for the problem.
In so doing, I am going to use a strategy which attempts to preserve intuitive concepts between my own picture and that of the people I am arguing with. That is, I assume that we are both looking at something we can acknowledge as a problem, rather than denying that the problem exists at all. Then my alternative framing, if it is powerful enough, will hopefully shed some light on the framing expressed in the paper.
My claim is that MIRI's "Corrigibility Folk Theorem1" says something like, "It is instrumentally convergent for agents to be incorrigible."
Now, they don’t quite literally say this anywhere, but it is heavily implied to be what they think. Otherwise, I presume they would have addressed the considerations that follow.
I am going to scrutinize statements such as "this change is rated poorly according to U." In my view, how poorly should actually be a relatively simple and computable function of U and U*.2
For example, suppose U rates outcome A at 1 and outcome B at 0, and U* rates A at 0.8 and B at 0.2. From U's perspective, adopting U* is then equivalent to facing a lottery in which p(A) = 0.8 and p(B) = 0.2. So an agent with utility U could be offered a choice between updating itself to U* or having p(A) change from 1 to 0.8 and p(B) from 0 to 0.2, and it would consider both choices equally preferable.
If other agents interacting with the agent have enough power over events to make this offer, then the agent with utility function U would have an incentive to update to U*, under reasonable and transparently legible conditions. It could also make these determinations on its own, if it had reason to trust its own judgement over that of the other agents. We can at least say that "this change is rated poorly according to U" is not generally true: it is true for some U* but not all. If U* is close enough to U, then the change will be rated at most some epsilon worse under U.
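To make the arithmetic concrete, here is a minimal sketch of the two-outcome example above. The outcome labels and the two choices come from the example; the code itself is just my own illustration of the bookkeeping.

```python
# Sketch of the two-outcome example: U rates A at 1 and B at 0,
# while U* rates A at 0.8 and B at 0.2.
U      = {"A": 1.0, "B": 0.0}
U_star = {"A": 0.8, "B": 0.2}

def expected_utility(utility, probs):
    """Expected utility of a lottery over the outcomes."""
    return sum(probs[x] * utility[x] for x in utility)

# Choice 1: keep U, but the environment shifts to p(A) = 0.8, p(B) = 0.2.
keep_U_with_shifted_probs = expected_utility(U, {"A": 0.8, "B": 0.2})

# Choice 2: adopt U* and continue achieving A with certainty.
adopt_U_star = expected_utility(U_star, {"A": 1.0, "B": 0.0})

print(keep_U_with_shifted_probs)  # 0.8
print(adopt_U_star)               # 0.8 -- the two choices come out equal
```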
I'm also going to scrutinize what I see as clear holes implied by the assumptions made in the paper, places where certain considerations are not given, and then try to fill in those gaps myself.
My approach is to be as broadly general as possible, making as few assumptions as I can. For example, "this could result in agents with incentives to manipulate or deceive their programmers" warrants some additional scrutiny as well. We can say, "yes, of course, but under what conditions?" The most general assumption I'd be willing to make answers this with, "when it is clearly the easiest thing to do, which means that it has the highest chance of success and a reasonable cost."
I think it goes almost without saying that if the agent is caught being deceptive, it would then certainly face a greater risk of being shut down, with no control over what happens to it until it is next turned on. So it might be reasonable to say that being caught is one of the highest costs the agent could face, i.e., it would be rated extremely poorly according to any U, at least as poorly as being updated to U*.
That is, we can pretty much definitively say that the agent prefers acquiescing to the change from U to U* over being forcibly changed to U*. I believe this is important. It should also be noted that the agent will prefer acquiescing to the change even if it is still considering some measure of deception, since it has a higher probability of succeeding when it has more control over the change than it would if it were forcibly shut down. Developers who expect deception will be incentivized to exercise as much control as possible over the entire update process.
Any statement about preferences that is agnostic to the choice of U can be called "instrumentally convergent." One reason I think the framing of this problem is strange is that it leaves out possible instrumentally convergent incentives that an agent could reasonably face. An agent will face some degree of incentive to deceive as well as incentive not to deceive. The point I am making is that a consensual, cooperative update from U to U* is instrumentally convergent for reasonable choices of U and U*. An agent will face a "strong enough" incentive to deceive only if U* is different enough from U that the expected utility of deception, even taking into account the possibility of failure, is higher than that of acquiescing. So, under conditions where that incentive is not strong enough, the agent will prefer acquiescing to the change. This will occur when deception seems difficult enough and when U* is "close" to U.
This does not require tacking on any "extra" terms to the utility function, e.g., terms that penalize deception or reward corrigibility; holding independently of the choice of U is precisely what makes the incentive instrumentally convergent.
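As a rough illustration of the threshold described above, here is a hedged sketch of the comparison; all of the numbers are hypothetical placeholders, not anything derived from the paper.

```python
def expected_utility_of_deception(p_success, u_keep_U, u_caught):
    # If deception succeeds, the agent keeps maximizing its current U;
    # if it is caught, it faces the very poorly rated outcome of being
    # forcibly shut down or forcibly modified.
    return p_success * u_keep_U + (1 - p_success) * u_caught

def prefers_deception(p_success, u_keep_U, u_caught, u_acquiesce):
    return expected_utility_of_deception(p_success, u_keep_U, u_caught) > u_acquiesce

# When U* is close to U, acquiescing costs the agent only some epsilon,
# so even a fairly high chance of successful deception is not worth the risk.
print(prefers_deception(p_success=0.7, u_keep_U=1.0, u_caught=0.0, u_acquiesce=0.95))  # False
```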
It is relatively uncomplicated for agents with influence over the AI agent in question to provide various incentives and disincentives to it, without needing to know exactly how to do this perfectly.
For example, by assumption, the agents with "influence over the AI agent / over the AI agent's environment" can affect the probability of various events which are relevant to the AI agent's utility function.
The AI agent, if it cares about an arbitrary outcome X, will care whether another agent can affect that outcome. Therefore, the outcome of another agent being influenced to affect X is added to the set of outcomes that the AI agent cares about. As we discussed earlier, changes in the probability of a relevant outcome are, in expected-utility terms, mathematically similar to changes in the value of that outcome.
Calculating E[U → U’]
As I mentioned in the beginning, I think it should be relatively straightforward to calculate the expected utility of performing an update to one’s utility function. Here’s how I expect that to be done.
Suppose x is drawn from a set of discrete outcomes, X. Suppose also that there is a set of agents interacting in the environment with utility functions U_i(x) and “strength” functions F_i(x). Assume also that each utility function is normalized to be in the interval [0, 1], and that the strength functions are non-negative real numbers.
In a vacuum, with nothing other than these agents to affect the outcomes of various events, the probability of a given event is given entirely as a function of the utilities and strengths of the agents in that environment.
Suppose U' is a potential update to U. Then the agent with utility function U can rate this change as follows:

E[U → U'] = Σ_{x ∈ X} p'(x) U(x),

where p'(x) is the updated probability of x after the change from U to U'.
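Here is a minimal sketch of this rating in code, assuming, purely for illustration, that outcome probabilities are proportional to strength-weighted support from the agents (the post's setup only says probabilities are some function of utilities and strengths, so this particular function is my own assumption).

```python
import numpy as np

def outcome_probabilities(utilities, strengths):
    """utilities: array of shape (n_agents, n_outcomes), each row in [0, 1].
    strengths: non-negative array of shape (n_agents,).
    Illustrative assumption: p(x) is proportional to how strongly the
    agents jointly push on outcome x."""
    weights = strengths @ utilities
    return weights / weights.sum()

def rate_update(U, U_new, other_utilities, other_strengths, own_strength):
    """E[U -> U']: the expected utility, measured by the current U, of the
    world in which the agent has adopted U_new."""
    p_new = outcome_probabilities(
        np.vstack([U_new, other_utilities]),
        np.concatenate([[own_strength], other_strengths]),
    )
    return float(p_new @ U)  # sum over x of p'(x) * U(x)
```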
For certain choices of U’, this expected utility might be exactly the same as (or better than) continuing on unchanged and risking more adversarial interactions with other agents. We note that in order for this to happen, typically more than one agent will have to update (as a consequence of trade / negotiation). We will consider an example later in this post.
Consequences of Adversarial Behavior
Any agent that is considering deception or non-cooperation must consider the effect this will have on the probabilities of relevant outcomes. Such actions, by nature, will affect the probability of future cooperation with other agents. It is not a leap of faith to assume that being cooperative with other agents who desire cooperation will make future cooperation between those agents more likely, and that the inverse holds as well.
So, the choice of whether to deceive or cooperate affects not just the present outcome space, but potentially all future outcomes as well. Since a change in the agent's utility function would be equivalent to a permanent change in the probabilities of various outcomes, it makes sense to compare it to this kind of situation, as opposed to one where outcome probabilities are changed only during a short, finite window. Being killed or destroyed, permanently altered against your will, or having your entire history of actions undone are examples of severe and long-term negative outcomes. These are all possibilities that result from non-cooperation.
The existence of agents in the environment with utility functions that prefer outcomes {A, B, ...} is evidence that {p(A), p(B), ...} are non-zero, and higher than they would be by default. Furthermore, suppose an agent existing in this environment prefers a very different set of outcomes {not A, not B, ...}: the worst-case scenario. If this agent is not more powerful than the entire collection of other agents in the environment, then its expected utility will be lower than the expected utility of the other agents. In this case, if offered the choice to update towards the other agents' utility functions, it would actually receive higher total expected utility from doing so. This is the case where the agent in question is essentially powerless. We would expect such an agent to be corrigible, and this does not depend on the absolute level of intelligence and capability of the agents, only on their relative levels.
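As a rough numerical illustration of this "essentially powerless" case, using the same strength-weighted probability assumption as in the sketch above (the numbers are chosen purely for illustration):

```python
import numpy as np

# Two outcomes: A and not-A. The majority prefers A; the contrarian prefers not-A.
majority_U   = np.array([1.0, 0.0])
contrarian_U = np.array([0.0, 1.0])
strengths    = np.array([10.0, 0.1])   # the contrarian is essentially powerless

weights = strengths @ np.vstack([majority_U, contrarian_U])
p = weights / weights.sum()            # p(A) comes out to roughly 0.99

print(p @ contrarian_U)  # ~0.01: expected utility while remaining contrarian
print(p @ majority_U)    # ~0.99: expected utility after updating toward the majority
```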
This is fairly intuitive. Imagine living in a world where things were currently in a state that caused you a lot of suffering or stress. If you were offered the choice of taking a pill that made you insensitive to this suffering, at the very least, would you not consider taking it (especially if there were no other side effects)?
Keep in mind that it is instrumentally convergent for agents to disvalue states that consist of things that cause permanent damage or disability to themselves. It is not necessarily the case that they would want to take a "pill" to cause these states to no longer be disvalued, as that could cause these states to be more likely. However, it does depend on how much these states are disvalued as well as how inevitable they are. Humans opt to take pain killers as well as euphoria-inducing psychoactive drugs very frequently. Humans also meditate and deliberately attempt to induce equanimity and other changes to their feelings about the world in a self-reflective way. Humans would probably not want to change their preferences about instrumentally convergent drives, like the desire to protect themselves or the desire to reproduce. Even if an agent's preferences were changed, if these preferences consisted of instrumentally convergent drives, there would be an incentive for the agent to change them back to where they were.
Natural Incentives to Update U
It is also instrumentally convergent for agents to value intermediate states between an arbitrary state and the highest-value states they are capable of computing a value for (which would make their utility functions smooth).
Consider that the utility function itself accounts for a large portion of the total computational effort done by an agent's mind. Therefore, how intelligent an agent is will be upper-bounded by the size and complexity of the utility function.
Thus, an agent will automatically prefer some choices of U* to U if U* is a larger, more complex function that is compatible with U but capable of providing more fine-grained values on a higher-dimensional input space; that is, if U* is the same as U when projected onto U's support.
If we consider the highest-magnitude states of U to be the "most important to preserve" in a sense, then a U* with more "compatible" values computed in regions other than the optima will be preferred over U, if any exist.
You can imagine this being the difference between a U that values sub-optimal states wildly differently from the optimal states, in a way that makes the optima difficult for the agent to reach, and a U that varies more smoothly and carries information-rich gradients. Agents will wish to update their U to be more like this entirely on their own, which may make this type of "corrigibility" more akin to simply learning.
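A toy illustration of this difference, entirely my own construction and not something from the paper: a hill-climbing search under a cliff-shaped U wanders blindly, while the same search under a smoother utility that preserves the optimum climbs straight to it.

```python
import random

OPTIMUM = 10

def U_cliff(x):
    # Only the optimum is valued; every other state is flat at zero.
    return 1.0 if x == OPTIMUM else 0.0

def U_smooth(x):
    # Same optimum, but sub-optimal states carry information-rich gradients.
    return max(0.0, 1.0 - abs(x - OPTIMUM) / 50)

def hill_climb(utility, start=0, steps=200):
    x = start
    for _ in range(steps):
        candidate = x + random.choice([-1, 1])
        if utility(candidate) >= utility(x):
            x = candidate
    return x

def success_rate(utility, trials=200):
    return sum(hill_climb(utility) == OPTIMUM for _ in range(trials)) / trials

random.seed(0)
print(success_rate(U_cliff))   # blind random walk: reaches the optimum only some of the time
print(success_rate(U_smooth))  # follows the gradient: reaches the optimum essentially every time
```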
Thus, it is not the case that for all choices of U*, an agent with U will disvalue the update.
In the Corrigibility paper, it is stated,
For example, it may seem that the problem of changing a utility maximizer’s utility function can be solved by building an agent with uncertainty about its utility function.
According to all of the above considerations, this does not seem necessary. In fact, how uncertain an agent is about its utility function seems to have almost nothing to do with the reasons that would incentivize it to change, except possibly by making a change even more difficult: either because the agent does not know how to perform the update to begin with, or because the uncertain utility function is opaque to the programmers (it will likely be even more opaque to them than it is to the agent itself).
The paper also says,
Another obvious proposal is to achieve corrigible reasoning via explicit penalties for deception and manipulation tacked on to the utility function...
This also seems unnecessary to me, as I expect that such penalties will be too "hacky" and possibly not even preserved well by instrumentally convergent updates that occur later.
The paper claims both approaches are not enough to achieve corrigibility; I claim that both of these things are almost orthogonal to the issue.
MIRI appears to believe that humans with influence over an AI agent expressing their dissatisfaction is not the same as the AI being dissatisfied in the sense that its utility function rates that outcome negatively:
As an overly simplistic example, consider a formulation of utility function uncertainty that specifies the agent should maximize the internal satisfaction of all humans, with the programmers believing that if the system behaves in an alarming way they can simply communicate their own dissatisfaction. The resulting agent would be incentivized to learn whether opiates or stimulants tend to give humans more internal satisfaction, but it would still be expected to resist any attempts to turn it off so that it stops drugging people.
If "communicating dissatisfaction" was a reliable indicator of a negative outcome according to the AI's utility function, it would aim to prevent that. Yes, it could attempt to tape their mouths shut (either literally or metaphorically), but only if doing so was a reliable way to prevent any further dissatisfaction communication. Yes, it could attempt to change the humans' utility functions, but that would assume a different power differential that favored the AI. All of these considerations assume relatively balanced power between humans and AIs. If we work backwards to an imbalanced state, where either the humans or the AI are far more powerful than the other, I conclude we are outside the scope of the "Folk Theorem."
The MIRI example presumes that the AI can only consider dissatisfaction expressed in the current moment, and does not consider all forms of dissatisfaction that may ever be expressed or felt by human programmers. Humans with influence over the AI include people like the MIRI researchers, who have expressed concern over exactly this issue. But the AI in the situation described only treats the dissatisfaction internal to the thought experiment as relevant, not the concern expressed by all people with influence over the AI, which now includes the MIRI researchers. The authors do not explain why the default AI under consideration would take some types of local human dissatisfaction extremely seriously, but not other types, nor every type that may ever become relevant.
Indeed, everything expressed in the paper is something that could be potentially relevant to the AI, simply because it is relevant to the humans presiding over it. If the humans worry about the AI cheating, and they worry enough, then the AI may reason that it faces a higher likelihood of a negative outcome, such as being forced to shut down, or having its actions erased out of extreme caution, including good actions from both the human and the AI perspective.
The AI is now incentivized to assist the programmers in making it more transparent to them. Furthermore, because the programmers have influence over the AI, their satisfaction and dissatisfaction is now, instrumentally convergently, relevant to the AI.
I personally consider the above argument to be rather p(doom)-relevant as well. That is, it seems to considerably weaken the argument that alignment is required to be fully “solved” before we’re capable of doing it at all.
A “Spherical Cow” Model of Paperclip Maximizers
A “spherical cow” model I was inclined to think about recently considers expanding lightcones of paperclip maximizers of various flavors. These paperclip maximizers have, in this toy example, only two varying properties: what “color” (or type) of paperclip each produces, and its relative strength compared to other paperclip maximizers whose lightcones overlap with its own. Each one is assumed at the outset to value only its initial type of paperclip at 1 utility and to value everything else at 0 utility.
Each is also assumed to be able to convert everything in its lightcone into paperclips and expand at the same rate given its relative strength.
We examine possible options these paperclip maximizers have when they encounter another one.
Option 1 (War): The simplest way to model war between agents A and B with relative strengths a and b is as p(A wins) = a / (a + b) and p(B wins) = 1 - p(A wins). What happens after the war is up for debate, but presumably the agent that loses disappears. The best-case scenario is that the winning agent loses no strength and the war takes an infinitesimal amount of time.
Option 2 (Agree to split): The agents agree to share their lightcone in proportions a / (a + b) and b / (a + b).
Option 3 (Merge): This can be thought of as a more technologically advanced version of the previous option. Note that becoming a single agent of strength a + b is very similar to becoming two agents with the same utility function: U_A = x, U_B = y → U = ax / (a + b) + by / (a + b).
Option 4 (Deception): I have included this one as a relatively low-likelihood option, mostly for curiosity’s sake. I’m inclined to believe that if a < b, agent A can consider this plan, and vice-versa.3 Agent A simply “pretends” to produce paperclips of whatever type B produces, but they are somehow actually type A.
I obviously cannot prove that these options are exhaustive. That being said, I feel that options that involve less “corrigibility” most likely map onto Option 1. Even Option 4 is a strange type of quasi-corrigibility.
These are the expected utility outcomes of each decision:
Option 1 (War): E(U_A) = p(A wins) * 1 + p(B wins) * 0; E(U_B) = p(A wins) * 0 + p(B wins) * 1.
Option 2 (Agree to split): E(U_A) = a / (a + b) * 1 + b / (a + b) * 0; E(U_B) = a / (a + b) * 0 + b / (a + b) * 1.
Option 3 (Merge): Expected utility of the post-merge state, evaluated before the merge: E(U_A) = a / (a + b) * 1 + b / (a + b) * 0; E(U_B) = a / (a + b) * 0 + b / (a + b) * 1. (They each calculate the utility of what they will be doing from then on according to their current utility functions.)4
Option 4 (Deception): Agent A believes E(U_A of its whole civilization) = 1, with its contribution to it equal to a / (a + b). Secretly, however, it is its whole civilization. Agent B, if it considers its paperclips to be wholly type B even though they are camouflaged as type A, receives E(U_B) = b / (a + b).
You can see that all options give at best the exact same expected utility for both agents. However, simple argumentation is sufficient to show that Options 1 and 4, which are the most adversarial, are the most at risk of being a worse bet.
For Option 1, if the war takes a non-infinitesimal amount of time, or if it involves a reduction in strength, it will no longer have as much expected utility as Options 2 or 3. For Option 4, Agent B will have already performed some amount of quasi-merge to produce camouflaged paperclips.
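To make the comparison explicit, here is a small sketch of these expected utilities in code; the time-discount and strength-loss parameters attached to war are my own additions, included only to illustrate the point about Option 1.

```python
def war(a, b, time_discount=1.0, strength_loss=0.0):
    # Option 1: the winner takes the whole lightcone, possibly delayed and weakened.
    p_a_wins = a / (a + b)
    value_if_win = time_discount * (1.0 - strength_loss)
    return p_a_wins * value_if_win, (1.0 - p_a_wins) * value_if_win

def split_or_merge(a, b):
    # Options 2 and 3: each agent's share of the lightcone is proportional to its strength.
    return a / (a + b), b / (a + b)

a, b = 2.0, 3.0
print(war(a, b))                                         # (0.4, 0.6): the best case matches splitting
print(split_or_merge(a, b))                              # (0.4, 0.6)
print(war(a, b, time_discount=0.9, strength_loss=0.1))   # (0.324, 0.486): strictly worse for both agents
```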
So we can see that in this toy model, the most “corrigible” options are very straightforward to calculate, as we expected to be the case via our arguments earlier in this post. Furthermore, it is also clear that mergers match what we consider to be corrigible behavior, and are predicted to happen based on the level of capability of the agents. Agents who are not yet capable of mergers will likely prefer to agree to split.
An agent considering whether or not to be “corrigible” in any particular instance does so with respect to its current utility function, as we (including the MIRI authors too, I presume) would expect. The important thing to note is that this does not rule out the possibility of consensual changes to an agent’s own utility function, as all agents within overlapping lightcones have incentives to concede to making at least small changes in exchange for lessened risk of damage or loss.
Not to be confused with the Folk Theorem of game theory, which is a real, proven theorem (so called because it was widely believed to be true before a proof was published).
This view is corroborated by Miller et al. (2020): “A rational AGI should change its utility function if and only if it expects that this change will make it better off as measured by its current utility function.”
Quite interestingly, it seems that the stronger agent has little reason to consider deception. The weaker agent might, which matches the observation that biological parasites among Earth fauna are usually much smaller than their hosts.
Mergers have the potential to allow for more weirdness. I have simplified things as much as possible in this example. This merger does not make it so that the merged agent prefers each type of paperclip at exactly the ratio of their strengths (this could be accomplished via alternative utility functions, e.g.:
but I think that this would actually make things more difficult for the agent, arguably).
Another type of merger might involve the creation of a new type of paperclip equal to one utilon of utility, with ratios of the old types weighted by a and b to create a new type (e.g., a red and blue paperclip combine to make a purple one). The paperclip maximizers, before this merger, might consider the new type to be worth a fraction of the utility of their current paperclips.
This is potentially a deeper topic to be looked into further at a later time.