I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider (some extension of) the delegative IRL setting (i.e. there is a single set of input/output channels the control of which can be toggled between the user and the AI by the AI). Let π^u_υ be the user's policy in universe υ and π^a the AI policy. Let T be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let V_υ be the value of a state from the user's subjective POV, in universe υ. Let μ_υ be the environment in universe υ. Finally, let ζ be the AI's prior over universes and ϵ...

5 Steve Byrnes 2mo
(Update: I don't think this was 100% right, see here
[https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety#4_1_Example___the_Hippocratic_principle__desideratum__and_an_algorithm_that_obeys_it]
for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They
share access to a single output channel, e.g. a computer keyboard, so that the
actions that H can take are exactly the same as the actions AI can take. Every
step, AI can either take an action, or delegate to H to take an action. Also,
every step, H reports her current assessment of the timeline / probability
distribution for whether she'll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will
gradually learn both the human policy (i.e. what H tends to do in different
situations), and how different actions tend to turn out in hindsight from H's
own perspective (e.g., maybe whenever H takes action 17, she tends to declare
shortly afterwards that probability of success now seems much higher than
before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate
how different actions will turn out from H's perspective much better than H
herself. In other words, maybe it delegates to H, and H takes action 41, and the
AI is watching this and shaking its head and thinking to itself "gee you dunce
you're gonna regret that", and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop
delegating and start just doing the thing that leads to H feeling maximally
optimistic later on.
But we don't want to do that naive thing. There are two problems:
The first problem is "traps" (a.k.a. catastrophes). Let's say action 0 is Press
The History Eraser Button [https://vimeo.com/126720159]. H never takes that
action. The AI shouldn't either.

2 Vanessa Kosoy 2mo
This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means
that without the AI's help, H's probability of success will usually be low.
Quantilization-wise, this is a problem: the AI will be able to eliminate those
paths for which H will report failure, but maybe most of the probability mass
among apparent-success paths is still on failure (i.e. the success report is
corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don't expect to fail
soon. Therefore, the AI can safely consider policies of the form "in the
short-term, do something H would do with marginal probability, in the long-term
go back to H's policy". If by the end of the short-term maneuver H reports an
improved prognosis, this can imply that the improvement is genuine (since the AI
knows H is probably uncorrupted at this point). Moreover, it's possible that in
the new prognosis H still doesn't expect to fail soon. This allows performing
another maneuver of the same type. This way, the AI can iteratively steer the
trajectory towards true success.

1 Charlie Steiner 1mo
Very interesting - I'm sad I saw this 6 months late.
After thinking a bit, I'm still not sure if I want this desideratum. It seems to
require a sort of monotonicity, where we can get superhuman performance just by
going through states that humans recognize as good, and not by going through
states that humans would think are weird or scary or unevaluable.
One case where this might come up is in competitive games. Chess AI beats humans
in part because it makes moves that many humans evaluate as bad, but are
actually good. But maybe this example actually supports your proposal - it seems
entirely plausible to make a chess engine that only makes moves that some given
population of humans recognize as good, but is better than any human from that
population.
On the other hand, the humans might be wrong about the reason the move is good,
so that the game is made of a bunch of moves that seem good to humans, but where
the humans are actually wrong about why they're good (from the human
perspective, this looks like regularly having "happy surprises"). We might hope
that such human misevaluations are rare enough that quantilization would lead to
moves on average being well-evaluated by humans, but for chess I think that
might be false! Computers are so much better than humans at chess that a very
large chunk of the best moves according to both humans and the computer will be
ones that humans misevaluate.
Maybe that's more a criticism of quantilizers, not a criticism of this
desideratum. So maybe the chess example supports this being a good thing to
want? But let me keep critiquing quantilizers then :P
If what a powerful AI thinks is best (by an exponential amount) is to turn off
the stars [https://en.wikipedia.org/wiki/Star_lifting] until the universe is
colder, but humans think it's scary and ban the AI from doing scary things, the
AI will still try to turn off the stars in one of the edge-case ways that humans
wouldn't find scary. And if we think being manipulated like

1 Vanessa Kosoy 1mo
When I'm deciding whether to run an AI, I should be maximizing the expectation
of my utility function w.r.t. my belief state. This is just what it means to act
rationally. You can then ask, how is this compatible with trusting another agent
smarter than myself?
One potentially useful model is: I'm good at evaluating and bad at searching
(after all, P≠NP). I can therefore delegate searching to another agent. But, as
you point out, this doesn't account for situations in which I seem to be bad at
evaluating. Moreover, if the AI prior takes an intentional stance towards the
user (in order to help learning their preferences), then the user must be
regarded as good at searching.
A better model is: I'm good at both evaluating and searching, but the AI can
access actions and observations that I cannot. For example, having additional
information can allow it to evaluate better. An important special case is: the
AI is connected to an external computer (Turing RL
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#fEKc88NbDWZavkW9o]
) which we can think of as an "oracle". This allows the AI to have additional
information which is purely "logical". We need infra-Bayesianism to formalize
this: the user has Knightian uncertainty over the oracle's outputs entangled
with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by
exhaustive game-tree search then I know it's a good move, even without having
the skill to understand why the move is good in any more detail.
Now let's examine short-term quantilization for chess. On each cycle, the AI
finds a short-term strategy leading to a position that the user evaluates as
good, but that the user would require luck to manage on their own. This is
repeated again and again throughout the game, leading to overall play
substantially superior to the user's. On the other hand, this play is not as
good as the AI would achieve if it just optimized

1 Charlie Steiner 1mo
Agree with the first section, though I would like to register my sentiment that
although "good at selecting but missing logical facts" is a better model, it's
still not one I'd want an AI to use when inferring my values.
I think my point is if "turn off the stars" is not a primitive action, but is a
set of states of the world that the AI would overwhelmingly like to go to, then
the actual primitive actions will get evaluated based on how well they end up
going to that goal state. And since the AI is better at evaluating than us,
we're probably going there.
Another way of looking at this claim is that I'm telling a story about why the
safety bound on quantilizers gets worse when quantilization is iterated.
Iterated quantilization has much worse bounds than quantilizing over the
iterated game, which makes sense if we think of games where the AI evaluates
many actions better than the human.

1 Vanessa Kosoy 1mo
I think you misunderstood how the iterated quantilization works. It does not
work by the AI setting a long-term goal and then charting a path towards that
goal s.t. it doesn't deviate too much from the baseline over every short
interval. Instead, every short-term quantilization is optimizing for the user's
evaluation at the end of this short-term interval.

1 Charlie Steiner 1mo
Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as
quantilizing over short-term policies evaluated according to their expected
utility. My story doesn't make sense if the AI is only trying to push up the
reported value estimates (though that puts a lot of weight on these estimates).

1 Adam Shimi 7mo
I don't understand what you mean here by quantilizing. The meaning I know is to
take a random action over the top \alpha actions, on a given base distribution.
But I don't see a distribution here, or even a clear ordering over actions
(given that we don't have access to the utility function).
I'm probably missing something obvious, but more details would really help.

2 Vanessa Kosoy 7mo
The distribution is the user's policy, and the utility function for this purpose
is the eventual success probability estimated by the user (as part of the
timeline report), at the end of the "maneuver". More precisely, the original
quantilization formalism was for the one-shot setting, but you can easily
generalize it, for example I did it
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375556/quantilal-control-for-finite-mdps]
for MDPs.
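As a rough illustration of the one-shot case described above, here is a minimal sampling-based sketch in Python. It is not the MDP generalization linked in the comment: `sample_base` stands in for the user's policy, `utility` for the user's reported success estimate at the end of the maneuver, and both names (along with `fraction` and `n_samples`) are hypothetical. Taking the top fraction of a finite sample only approximates an exact quantilizer.

```python
import random
from typing import Callable, List, TypeVar

A = TypeVar("A")


def quantilize(
    sample_base: Callable[[random.Random], A],
    utility: Callable[[A], float],
    fraction: float,
    n_samples: int,
    rng: random.Random,
) -> A:
    """Approximate one-shot quantilizer.

    Draws n_samples actions from the base distribution (assumed here to be
    the user's policy), keeps the top `fraction` of them by utility, and
    returns one of the kept samples uniformly at random.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    samples: List[A] = [sample_base(rng) for _ in range(n_samples)]
    # Best-rated samples first.
    samples.sort(key=utility, reverse=True)
    keep = max(1, int(fraction * n_samples))
    return rng.choice(samples[:keep])


if __name__ == "__main__":
    # Toy base policy: ten maneuvers chosen uniformly; the "report" is
    # simply the maneuver's index (purely illustrative).
    rng = random.Random(0)
    print(quantilize(lambda r: r.randrange(10), float, 0.1, 1000, rng))
```

With `fraction=1.0` this reduces to sampling from the base policy itself; shrinking `fraction` trades closeness to the user's behavior for higher reported value.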

1 Adam Shimi 7mo
Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that we're choosing in
expectation an action that doesn't have corrupted utility (by intuitively having
something like more than twice as many actions in the quantilization than we
expect to be corrupted), so that we guarantee the probability of following the
manipulation of the learned user report is small?
I also wonder if using the user policy to sample actions isn't limiting, because
then we can only take actions that the user would take. Or do you assume by
default that the support of the user policy is the full action space, so every
action is possible for the AI?

1 Vanessa Kosoy 7mo
Yes, although you probably want much more than twice. Basically, if the
probability of corruption following the user policy is ϵ and your quantilization
fraction is ϕ then the AI's probability of corruption is bounded by ϵ/ϕ.
Obviously it is limiting, but this is the price of safety. Notice, however, that
the quantilization strategy is only an existence proof. In principle, there
might be better strategies, depending on the prior (for example, the AI might be
able to exploit an assumption that the user is quasi-rational). I didn't specify
the AI by quantilization, I specified it by maximizing EU subject to the
Hippocratic constraint. Also, the support is not really the important part: even
if the support is the full action space, some sequences of actions are possible
but so unlikely that the quantilization will never follow them.
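The corruption bound for quantilization (the base corruption probability ϵ divided by the quantilization fraction ϕ) follows from a short argument, sketched below. The symbols π (the user's policy viewed as a base distribution over trajectories), π_ϕ (the quantilized policy) and C (the set of corrupt trajectories) are names assumed here for illustration. The only fact used is that a ϕ-quantilizer conditions the base distribution on an event of probability ϕ, so it can inflate the probability of any trajectory by at most a factor of 1/ϕ.

```latex
\pi_\phi(x) \le \frac{\pi(x)}{\phi} \quad \text{for every trajectory } x
\;\Longrightarrow\;
\Pr_{x \sim \pi_\phi}[x \in C]
  = \sum_{x \in C} \pi_\phi(x)
  \le \frac{1}{\phi} \sum_{x \in C} \pi(x)
  = \frac{\epsilon}{\phi}.
```

For instance, with ϵ = 0.001 and ϕ = 0.01 the bound is 0.1, which is why the fraction needs to be much larger than the base corruption probability for the guarantee to be meaningful.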

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...

1 Vladimir Nesov 1y
This seems similar to gaining uploads prior to AGI, and opens up all those
superorg upload-city amplification/distillation constructions which should get
past human level shortly after. In other words, the limitations of the dataset
can be solved by amplification as soon as the AIs are good enough to be used as
building blocks for meaningful amplification, and something human-level-ish
seems good enough for that. Maybe even GPT-n is good enough for that.

1 Vanessa Kosoy 1y
That is similar to gaining uploads (borrowing terminology from Egan, we can call
them "sideloads"), but it's not obvious amplification/distillation will work. In
the model based on realizability, the distillation step can fail because the
system you're distilling is too computationally complex (hence, too
unrealizable). You can deal with it by upscaling the compute of the learning
algorithm, but that's not better than plain speedup.

1 Vladimir Nesov 1y
To me this seems to be essentially another limitation of the human Internet
archive dataset: reasoning is presented in an opaque way (most slow/deliberative
thoughts are not in the dataset), so it's necessary to do a lot of guesswork to
figure out how it works. A better dataset both explains and summarizes the
reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3
can do that to an extent by roleplaying Feynman).
Any algorithm can be represented by a habit of thought (Turing machine style if
you must), and if those are in the dataset, they can be learned. The habits of
thought that are simple enough to summarize get summarized and end up requiring
fewer steps. My guess is that the human faculties needed for AGI can be both
represented by sequences of thoughts (probably just text, stream of
consciousness style) and easily learned with current ML. So right now the main
obstruction is that it's not feasible to build a dataset with those faculties
represented explicitly that's good enough and large enough for current
sample-inefficient ML to grok. More compute in the learning algorithm is only
relevant for this to the extent that we get a better dataset generator that can
work on the tasks before it more reliably.

1 Vanessa Kosoy 1y
I don't see any strong argument why this path will produce superintelligence.
You can have a stream of thought that cannot be accelerated without investing a
proportional amount of compute, while a completely different algorithm would
produce a far superior "stream of thought". In particular, such an approach
cannot differentiate between features of the stream of thought that are
important (meaning that they advance towards the goal) and features of the
stream of thought that are unimportant (e.g. different ways to phrase the same
idea). This forces you to solve a task that is potentially much more difficult
than just achieving the goal.

1 Vladimir Nesov 1y
I was arguing that near human level babblers (including the imitation plateau
you were talking about) should quickly lead to human level AGIs by amplification
via stream of consciousness datasets, which doesn't pose new ML difficulties
other than design of the dataset. Superintelligence follows from that by any of
the same arguments as for uploads leading to AGI (much faster technological
progress; if amplification/distillation of uploads is useful straight away, we
get there faster, but it's not necessary). And amplified babblers should be
stronger than vanilla uploads (at least implausibly well-educated,
well-coordinated, high IQ humans).
For your scenario to be stable, it needs to be impossible (in the near term) to
run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain
less effective than very high IQ humans. Otherwise you get acceleration of
technological progress, including ML. So my point is that feasibility of
imitation plateau depends on absence of compute overhang, not on ML failing to
capture some of the ingredients of human general intelligence.

1 Vanessa Kosoy 1y
The imitation plateau can definitely be rather short. I also agree that
computational overhang is the major factor here. However, a failure to capture
some of the ingredients can be a cause of low computational overhang, whereas
success in capturing all of the ingredients is a cause of high computational
overhang, because the compute necessary to reach superintelligence might be very
different in those two cases. Using sideloads to accelerate progress might still
require years, whereas an "intrinsic" AGI might lead to the classical "foom"
scenario.
EDIT: Although, since training is typically much more computationally expensive
than deployment, it is likely that the first human-level imitators will already
be significantly sped-up compared to humans, implying that accelerating progress
will be relatively easy. It might still take some time from the first prototype
until such an accelerate-the-progress project, but probably not much longer than
deploying lots of automation.

1 Vladimir Nesov 1y
I agree. But GPT-3 seems to me like a good estimate for how much compute it
takes to run stream of consciousness imitation learning sideloads (assuming that
learning is done in batches on datasets carefully prepared by non-learning
sideloads, so the cost of learning is less important). And with that estimate we
already have enough compute overhang to accelerate technological progress as
soon as the first amplified babbler AGIs are developed, which, as I argued
above, should happen shortly after babblers actually useful for automation of
human jobs are developed (because generation of stream of consciousness datasets
is a special case of such a job).
So the key things to make imitation plateau last for years are either sideloads
requiring more compute than it looks like (to me) they require, or amplification
of competent babblers into similarly competent AGIs being a hard problem that
takes a long time to solve.

2 Vanessa Kosoy 1y
Another thing that might happen is a data bottleneck.
Maybe there will be a good enough dataset to produce a sideload that simulates
an "average" person, and that will be enough to automate many jobs, but for a
simulation of a competent AI researcher you would need a more specialized
dataset that will take more time to produce (since there are far fewer
competent AI researchers than people in general).
Moreover, it might be that the sample complexity grows with the duration of
coherent thought that you require. That's because, unless you're training
directly on brain inputs/outputs, non-realizable (computationally complex)
environment influences contaminate the data, and in order to converge you need
to have enough data to average them out, which scales with the length of your
"episodes". Indeed, all convergence results for Bayesian algorithms we have in
the non-realizable setting require ergodicity, and therefore the time of
convergence (= sample complexity) scales with mixing time, which in our case is
determined by episode length.
In such a case, we might discover that many tasks can be automated by sideloads
with short coherence time, but AI research might require substantially longer
coherence times. And, simulating progress requires by design going
off-distribution along certain dimensions which might make things worse.

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
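The scenario can be made concrete with a toy simulation. The sketch below is a deliberately simplified reading of the argument, under assumptions not in the original: all bets are at even odds with a unit stake, Alice's "Bayes-optimal policy" is reduced to betting on an event iff her credence is at least 1/2, and Omega knows the true outcomes. The function names are hypothetical.

```python
from typing import Dict, Iterator, Tuple


def alice_choose_side(p_event: float) -> bool:
    """Deterministic Bayesian bettor at even odds.

    Alice bets on the event iff her credence is at least 1/2, so the side
    she takes never has negative expected utility under her own beliefs.
    """
    return p_event >= 0.5


def omega_pick_bets(
    events: Dict[str, bool], alice_credence: Dict[str, float]
) -> Iterator[Tuple[str, bool]]:
    """Omega knows the true outcomes and can run Alice's decision rule,
    so it offers only the bets on which Alice picks the losing side."""
    for name, outcome in events.items():
        if alice_choose_side(alice_credence[name]) != outcome:
            yield name, outcome


def play(
    events: Dict[str, bool], alice_credence: Dict[str, float], stake: float = 1.0
) -> float:
    """Alice's net winnings after accepting every bet Omega offers."""
    wealth = 0.0
    for name, outcome in omega_pick_bets(events, alice_credence):
        side = alice_choose_side(alice_credence[name])
        wealth += stake if side == outcome else -stake
    return wealth


if __name__ == "__main__":
    truth = {"a": True, "b": False, "c": True}
    credence = {"a": 0.3, "b": 0.9, "c": 0.8}
    # Omega skips "c" (Alice would win it) and offers only "a" and "b".
    print(play(truth, credence))
```

Because Alice's choice is a deterministic function of her credence, Omega's selection rule never fails; a randomizing agent would not be predictable in this way.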

A possible counterargument is, we don't need to depart far from Bayesianis

I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.

Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.

Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.

Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.

The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.

In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)*→ΔA, a reward function is a mapping r:(A×O)*→[0,1], and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.
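The vacuity argument can be checked by brute force in a toy instance. The sketch below makes several simplifying assumptions that are not in the original: policies and environments are deterministic, the horizon is finite (the utility in the text is an infinite discounted sum), and all function names are hypothetical. It builds the contrived reward that pays 1 only on histories consistent with a given policy and confirms that the policy is then optimal.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Action = int
Obs = int
History = Tuple[Tuple[Action, Obs], ...]
Policy = Callable[[History], Action]
Env = Callable[[History, Action], Obs]


def contrived_reward(policy: Policy) -> Callable[[History], float]:
    """Reward 1 iff every action in the history is the one `policy` picks."""
    def reward(history: History) -> float:
        for t, (action, _) in enumerate(history):
            if action != policy(history[:t]):
                return 0.0
        return 1.0
    return reward


def discounted_utility(
    actions: Sequence[Action],
    env: Env,
    reward: Callable[[History], float],
    gamma: float,
) -> float:
    """Finite-horizon discounted sum of rewards along one action sequence."""
    history: History = ()
    total = 0.0
    for t, action in enumerate(actions):
        history = history + ((action, env(history, action)),)
        total += (gamma ** t) * reward(history)
    return total


def rollout_actions(policy: Policy, env: Env, horizon: int) -> List[Action]:
    """Actions the policy takes when run in the environment."""
    history: History = ()
    actions: List[Action] = []
    for _ in range(horizon):
        action = policy(history)
        actions.append(action)
        history = history + ((action, env(history, action)),)
    return actions


if __name__ == "__main__":
    # An arbitrary, "pointless" policy and environment over 3 actions.
    pi: Policy = lambda h: (7 * len(h) + sum(o for _, o in h)) % 3
    env: Env = lambda h, a: (a + len(h)) % 2
    r = contrived_reward(pi)
    best = max(
        discounted_utility(seq, env, r, 0.9) for seq in product(range(3), repeat=4)
    )
    own = discounted_utility(rollout_actions(pi, env, 4), env, r, 0.9)
    print(own, best)  # the two values coincide
```

Any policy at all passes this test with its own contrived reward, which is exactly why optimality for some prior and utility function cannot by itself serve as a definition of goal-directedness.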

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...

1 Vanessa Kosoy 1y
Actually, as opposed to what I claimed before, we don't need computational
complexity bounds for this definition to make sense. This is because the
Solomonoff prior is made of computable hypotheses but is uncomputable itself.
Given g>0, we define that "π has (unbounded) goal-directed intelligence (at
least) g" when there is a prior ζ and utility function U s.t. for any policy π′,
if E_{ζπ′}[U] ≥ E_{ζπ}[U] then K(π′) ≥ D_KL(ζ_0||ζ) + K(U) + g. Here, ζ_0 is the Solomonoff prior
and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match
the expected utility of π; in particular, this implies π is optimal since any
policy can be approximated by a computable policy), we say that π is "perfectly
(unbounded) goal-directed".
Compare this notion to the Legg-Hutter intelligence measure. The LH measure
depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI
(which is the maximum of the LH measure) becomes computable or even really
stupid. For example, it can always keep taking the same action because of the
fear that taking any other action leads to an inescapable "hell" state. On the
other hand, goal-directed intelligence differs only by O(1) between UTMs, just
like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be
uncomputable, and the notion of which policies are such doesn't depend on the
UTM at all.
I think that it's also possible to prove that intelligence is rare, in the sense
that, for any computable stochastic policy, if we regard it as a probability
measure over deterministic policies, then for any ϵ>0 there is g s.t. the
probability to get intelligence at least g is smaller than ϵ.
Also interesting is that, for bounded goal-directed intelligence, increasing the
prices can only decrease intelligence by O(1), and a policy that is perfectly
goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think).
In particular, a perfectly unbounded goal-directed policy is perfectly
goal-directed for any price vec

1 Vanessa Kosoy 1y
Some problems to work on regarding goal-directed intelligence. Conjecture 5 is
especially important for deconfusing basic questions in alignment, as it stands
in opposition to Stuart Armstrong's thesis about the impossibility to deduce
preferences from behavior alone.
1. Conjecture. Informally: It is unlikely to produce intelligence by chance.
Formally: Denote by Π the space of deterministic policies, and consider some
μ ∈ ΔΠ. Suppose μ is equivalent to a stochastic policy π^*. Then,
E_{π∼μ}[g(π)] = O(C(π^*)).
2. Find an "intelligence hierarchy theorem". That is, find an increasing
sequence {g_n} s.t. for every n, there is a policy with goal-directed
intelligence in (g_n, g_{n+1}) (no more and no less).
3. What is the computational complexity of evaluating g given (i) oracle access
to the policy or (ii) description of the policy as a program or automaton?
4. What is the computational complexity of producing a policy with given g?
5. Conjecture. Informally: Intelligent agents have well defined priors and
utility functions. Formally: For every (U,ζ) with C(U)<∞ and D_KL(ζ_0||ζ)<∞,
and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with
intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t. π has
intelligence at least g w.r.t. them, any optimal policies π^*, ~π^* for (U,ζ)
and (~U,~ζ) respectively satisfy E_{ζ ~π^*}[U] ≥ E_{ζ π^*}[U] − ϵ.

1 David Manheim 9mo
re: #5, that doesn't seem to claim that we can infer U given their actions,
which is what the impossibility of deducing preferences is actually claiming.
That is, assuming 5, we still cannot show that there isn't some U_1 ≠ U_2 such
that π^*(U_1,ζ) = π^*(U_2,ζ).
(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and
basic result in the decision theory / economics / philosophy literature.)

1 Vanessa Kosoy 9mo
You misunderstand the intent. We're talking about inverse reinforcement
learning. The goal is not necessarily inferring the unknown U, but producing
some behavior that optimizes the unknown U. Ofc if the policy you're observing
is optimal then it's trivial to do so by following the same policy. But, using
my approach we might be able to extend it into results like "the policy you're
observing is optimal w.r.t. certain computational complexity, and your goal is
to produce an optimal policy w.r.t. higher computational complexity."
(Btw I think the formal statement I gave for 5 is false, but there might be an
alternative version that works.)
I am referring to this
[http://papers.neurips.cc/paper/7803-occams-razor-is-insufficient-to-infer-the-preferences-of-irrational-agents]
and related work by Armstrong.

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ

1 Vanessa Kosoy 2y
We can modify the population game setting to study superrationality. In order to
do this, we can allow the agents to see a fixed size finite portion of their
opponents' histories. This should lead to superrationality for the same reasons
I discussed
[https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#3yw2udyFfvnRC8Btr]
before [https://agentfoundations.org/item?id=507]. More generally, we can
probably allow each agent to submit a finite state automaton of limited size,
s.t. the opponent history is processed by the automaton and the result becomes
known to the agent.
What is unclear about this is how to define an analogous setting based on source
code introspection. While arguably seeing the entire history is equivalent to
seeing the entire source code, seeing part of the history, or processing the
history through a finite state automaton, might be equivalent to some limited
access to source code, but I don't know how to define this limitation.
EDIT: Actually, the obvious analogue is processing the source code through a
finite state automaton.

1 Vanessa Kosoy 2y
Instead of postulating access to a portion of the history or some kind of
limited access to the opponent's source code, we can consider agents with full
access to history / source code but finite memory. The problem is, an agent with
fixed memory size usually cannot have regret going to zero, since it cannot
store probabilities with arbitrary precision. However, it seems plausible that
we can usually get learning with memory of size O(log(1/(1−γ))). This is because
something like "counting pieces of evidence" should be sufficient. For example,
if we consider finite MDPs, then it is enough to remember how many transitions of
each type occurred to encode the belief state. The question is whether assuming
O(log(1/(1−γ))) memory (or whatever is needed for learning) is enough to reach
superrationality.
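As a rough illustration of "counting pieces of evidence", here is a minimal Python sketch for a finite MDP. It assumes a symmetric Dirichlet prior over next states (an assumption made for this example, not something stated above), under which the transition counts determine the posterior. The class and method names are hypothetical.

```python
class TransitionCounter:
    """Belief state for a finite MDP stored purely as transition counts.

    Assumption: a symmetric Dirichlet(prior) prior over next states for
    every (state, action) pair, so the counts are a sufficient statistic.
    """

    def __init__(self, states, prior=1.0):
        self.states = list(states)
        self.prior = prior
        self.counts = {}  # (state, action, next_state) -> number of observations

    def observe(self, state, action, next_state):
        key = (state, action, next_state)
        self.counts[key] = self.counts.get(key, 0) + 1

    def posterior_mean(self, state, action, next_state):
        """Posterior mean probability of the given transition."""
        total = sum(self.counts.get((state, action, t), 0) for t in self.states)
        hits = self.counts.get((state, action, next_state), 0)
        return (hits + self.prior) / (total + self.prior * len(self.states))


belief = TransitionCounter(states=[0, 1])
for next_state in (1, 1, 1, 0):
    belief.observe(0, "a", next_state)
estimate = belief.posterior_mean(0, "a", 1)  # (3 + 1) / (4 + 2)
```

Each counter that has reached the value n takes roughly log2(n) bits to store, which is the sense in which logarithmic memory could suffice here.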

1Gurkenglas2yWhat do you mean by equivalent? The entire history doesn't say what the opponent
will do later or would do against other agents, and the source code may not
allow you to prove what the agent does if it involves statements that are true
but not provable.

1Vanessa Kosoy2yFor a fixed policy, the history is the only thing you need to know in order to
simulate the agent on a given round. In this sense, seeing the history is
equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the
entire history or source code, you can't get good guarantees (as in the folk
theorem for repeated games). On the other hand, in settings where the agent sees
part of the history, or is constrained to have finite memory (possibly of size
O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or
some other strong desideratum that deserves to be called "superrationality".

1Vanessa Kosoy2yIn the previous "population game" setting, we assumed all players are "born" at
the same time and learn synchronously, so that they always play against players
of the same "age" (history length). Instead, we can consider a "mortal
population game" setting where each player has a probability 1−γ to die on every
round, and new players are born to replenish the dead. So, if the size of the
population is N (we always consider the "thermodynamic" N→∞ limit), N(1−γ)
players die and the same number of players are born on every round. Each
player's utility function is a simple sum of rewards over time, so, taking
mortality into account, effectively ey have geometric time discount. (We could
use age-dependent mortality rates to get different discount shapes, or allow
each type of player to have different mortality=discount rate.) Crucially, we
group the players into games randomly, independent of age.
As before, each player type i chooses a policy . (We can also consider the case
where players of the same type may have different policies, but let's keep it
simple for now.) In the thermodynamic limit, the population is described as a
distribution over histories, which now are allowed to be of variable length: μ_n ∈
Δ(O∗). For each assignment of policies to player types, we get dynamics
μ_{n+1} = T_π(μ_n), where T_π: Δ(O∗) → Δ(O∗). So, as opposed to immortal population
games, mortal population games naturally give rise to dynamical systems.
If we consider only the age distribution, then its evolution doesn't depend on π
and it always converges to the unique fixed point distribution ζ(k) = (1−γ)γ^k.
Therefore it is natural to restrict the dynamics to the subspace of Δ(O∗) that
corresponds to the age distribution ζ. We denote it P.
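The convergence of the age distribution can be checked numerically. The sketch below is a minimal Python illustration, assuming the population is normalized to total mass 1 and starts with everyone at age 0; the function names are hypothetical. On each round every player survives with probability γ and ages by one, and a mass 1−γ of newborns is added.

```python
def evolve_age_distribution(dist, gamma):
    """One round of the mortal population: survivors age, the dead are replaced."""
    newborns = 1.0 - gamma
    return [newborns] + [gamma * p for p in dist]


def geometric_age_distribution(gamma, max_age):
    """The claimed fixed point: age k has mass (1 - gamma) * gamma**k."""
    return [(1.0 - gamma) * gamma ** k for k in range(max_age + 1)]


gamma = 0.9
dist = [1.0]  # everyone starts at age 0
for _ in range(300):
    dist = evolve_age_distribution(dist, gamma)

target = geometric_age_distribution(gamma, 50)
max_error = max(abs(dist[k] - target[k]) for k in range(51))
print(max_error)
```

Since the update never looks at policies or histories, the same geometric shape appears regardless of π, which matches the claim above.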
Does the dynamics have fixed points? O∗ can be regarded as a subspace of
(O⊔{⊥})^ω. The latter is compact (in the product topology) by Tychonoff's theorem
and Polish, but O∗ is not closed. So, w.r.t. the weak topology on probability
measure spaces, Δ((O⊔{⊥})^ω) is also comp

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati

1Vanessa Kosoy1yI gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion
Day, and there is a recording
[https://drive.google.com/file/d/1zKs3uOcR32nTMJ5YNOMZkcL7R_mzi2t6/view?usp=sharing]
.

1Vanessa Kosoy1yA variant of Dialogic RL with improved corrigibility. Suppose that the AI's
prior allows a small probability for "universe W" whose semantics are, roughly
speaking, "all my assumptions are wrong, need to shut down immediately". In
other words, this is a universe where all our prior shaping is replaced by the
single axiom that shutting down is much higher utility than anything else.
Moreover, we add into the prior the assumption that the formal question "W?" is
understood perfectly by the user even without any annotation. This means that,
whenever the AI assigns a higher-than-threshold probability to the user
answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will
shut down immediately. We should also shape the prior s.t. corrupt futures also
favor shutdown: this is reasonable in itself, but will also ensure that the AI
won't arrive at believing too many futures to be corrupt and thereby avoid the
imperative to shut down in response to a confirmation of W.
Now, this won't help if the user only resolves to confirm W after something
catastrophic already occurred, such as the AI releasing malign subagents into
the wild. But, something of the sort is true for any corrigibility scheme:
corrigibility is about allowing the user to make changes in the AI on eir own
initiative, which can always be too late. This method doesn't ensure safety in
itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace "shut down" by "undo everything you did and
then shut down" but that gets us into thorny specification issues. Perhaps it's
possible to tackle those issues by one of the approaches to "low impact".

1Vanessa Kosoy1yUniverse W should still be governed by a simplicity prior. This means that
whenever the agent detects a salient pattern that contradicts the assumptions of
its prior shaping, the probability of W increases leading to shutdown. This
serves as an additional "sanity test" precaution.

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to in...

2Alex Turner7moI think this would make a good top-level post. I have the feeling I’ll want to
link to it later.

1Vanessa Kosoy7moI retracted part of that, see the edit.

1Vanessa Kosoy7moMore observations about this attack vector ("attack from counterfactuals"). I
focus on "amplifying by subjective time".
* The harder the takeoff the more dangerous this attack vector: During every
simulation cycle, ability to defend against simulated malign AI depends on
the power of the defense system in the beginning of the cycle[1].
On the other hand, the capability of the attacker
depends on its power in the end of the cycle. Therefore, if power grows very
fast this is bad news for the defender. On the other hand, if power grows
very slowly, the defender wins because defense is probably easier.
* Inner control of anchor makes system safer: Given a fixed objective time
anchor, the more times the simulation goes through a cycle, the higher its
exposure to the threat, since improving the defense system takes time even if
you have all the knowledge. Therefore, we want to give the simulated user the
ability to move the objective time anchor forward in time, in order to
benefit from improvements in the defense system.
* Additional information about the external world makes system safer: Suppose
our subjective probability of a malign AI successfully attacking during a
single simulation cycle is 0.1%. Then performing 1000 cycles superficially
yields a failure probability of about 63%. However, this is only the case if
attacks on different cycles are independent random variables. If the AI knows
as much about the possibility of such an attack as we do, this is a justified
assumption. However, if the AI knows more, then, from our subjective
perspective, these variables are positively correlated. In particular, if the
AI knows with certainty whether a successful attack would occur or not the
failure probability over 1000 cycles goes back to 0.1%. Therefore, it is
beneficial to give the AI extra information about the world, just so it can
estimate the probability of attack with m
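The arithmetic in the last bullet can be checked with a short Python sketch. It compares the two extreme cases: attacks that are independent across cycles, and attacks that are perfectly correlated (either every cycle is attacked successfully or none is). The function name is hypothetical. Under independence the chance of at least one successful attack in 1000 cycles is 1 − 0.999^1000, about 63%, while the chance that no attack succeeds is about 37%.

```python
def failure_probability_independent(p_single, cycles):
    """Probability of at least one successful attack when cycles are independent."""
    return 1.0 - (1.0 - p_single) ** cycles


p_single = 0.001  # 0.1% subjective probability of a successful attack per cycle
cycles = 1000

independent = failure_probability_independent(p_single, cycles)
survival = 1.0 - independent
# Perfect correlation: the attack either succeeds on every cycle or never,
# so the overall failure probability equals the single-cycle probability.
fully_correlated = p_single

print(f"independent:      {independent:.3f}")
print(f"no attack at all: {survival:.3f}")
print(f"fully correlated: {fully_correlated:.3f}")
```

Intermediate degrees of positive correlation give a failure probability somewhere between these two values.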

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B

Learning theory starts from formulating natural desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.

Learning theory naturally allows analyzing computational complexity whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the outset.

Learning theory focuses on objects that are observable o

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max

3Nisan2yMy takeaway from this is that if we're doing policy selection in an environment
that contains predictors, instead of applying the counterfactual belief that the
predictor is always right, we can assume that we get rewarded if the predictor
is wrong, and then take maximin.
How would you handle Agent Simulates Predictor? Is that what TRL is for?
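The takeaway above (reward the agent when the predictor is wrong, then take maximin) can be illustrated on a one-shot toy version of Newcomb's problem. This is only a sketch under stated assumptions: the dollar payoffs are the conventional ones and do not come from this thread, the "Nirvana" reward is modeled as +infinity, and the function names are hypothetical. It shows the decision rule only, not a learning agent.

```python
NIRVANA = float("inf")  # assumed reward whenever the prediction is wrong

ACTIONS = ["one-box", "two-box"]

# Conventional Newcomb payoffs when the prediction is correct (assumed values).
PAYOFF_IF_PREDICTED = {"one-box": 1_000_000, "two-box": 1_000}


def reward(action, prediction):
    """Reward under the incomplete model: a mispredicted agent reaches Nirvana."""
    if action != prediction:
        return NIRVANA
    return PAYOFF_IF_PREDICTED[action]


def worst_case(action):
    """Murphy picks the prediction that minimizes the agent's reward."""
    return min(reward(action, prediction) for prediction in ACTIONS)


def maximin_action():
    return max(ACTIONS, key=worst_case)


print(maximin_action(), worst_case("one-box"), worst_case("two-box"))
```

Because a wrong prediction is never the worst case for the agent, Murphy always chooses the correct prediction, and the maximin comparison reduces to comparing the payoffs of correctly predicted actions.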

2Vanessa Kosoy2yThat's about right. The key point is, "applying the counterfactual belief that
the predictor is always right" is not really well-defined (that's why people
have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is
perfectly well-defined. I describe agents that are able to learn which
predictors exist in their environment and respond rationally ("rationally"
according to the FDT philosophy).
TRL is for many things to do with rational use of computational resources, such
as (i) doing multi-level modelling
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#vAtz6tfscsALGPr32]
in order to make optimal use of "thinking time" and "interacting with
environment time" (i.e. simultaneously optimize sample and computational
complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian
daemons (iv) preventing thought crimes. But, yes, it also provides a solution to
ASP
[https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#FXt6z9ycAio9jFAtW]
. TRL agents can learn whether it's better to be predictable or predicting.

1Chris_Leong2y"The key point is, "applying the counterfactual belief that the predictor is
always right" is not really well-defined" - What do you mean here?
I'm curious whether you're referring to the same as or similar to the issue I
was referencing in Counterfactuals for Perfect Predictors
[https://www.lesswrong.com/posts/AKkFh3zKGzcYBiPo7/counterfactuals-for-perfect-predictors]
. The TLDR is that I was worried that it would be inconsistent for an agent that
never pays in Parfit's Hitchhiker to end up in town if the predictor is
perfect, so that it wouldn't actually be well-defined what the predictor was
predicting. And the way I ended up resolving this was by imagining it as an
agent that takes input and asking what it would output if given that
inconsistent input. But not sure if you were referencing this kind of concern or
something else.

2Vanessa Kosoy2yIt is not a mere "concern", it's the crux of the problem really. What people in the
AI alignment community have been trying to do is, starting with some factual and
"objective" description of the universe (such as a program or a mathematical
formula) and deriving counterfactuals. The way it's supposed to work is, the
agent needs to locate all copies of itself or things "logically correlated" with
itself (whatever that means) in the program, and imagine it is controlling this
part. But a rigorous definition of this that solves all standard decision
theoretic scenarios was never found.
Instead of doing that, I suggest a solution of a different nature. In
quasi-Bayesian RL, the agent never arrives at a factual and objective
description of the universe. Instead, it arrives at a subjective description
which already includes counterfactuals. I then proceed to show that, in
Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the
same expected utility promised by UDT).

1Chris_Leong2yYeah, I agree that the objective descriptions can leave out vital information,
such as how the information you know was acquired, which seems important for
determining the counterfactuals.

1Vladimir Slepnev2yBut in Newcomb's problem, the agent's reward in case of wrong prediction is
already defined. For example, if the agent one-boxes but the predictor predicted
two-boxing, the reward should be zero. If you change that to +infinity, aren't
you open to the charge of formalizing the wrong problem?

1Vanessa Kosoy2yThe point is, if you put this "quasi-Bayesian" agent into an iterated
Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward
associated with FDT). So, if you're judging it from the side, you will have to
concede it behaves rationally, regardless of its internal representation of
reality.
Philosophically, my point of view is, it is an error to think that
counterfactuals have objective, observer-independent, meaning. Instead, we can
talk about some sort of consistency conditions between the different points of
view. From the agent's point of view, it would reach Nirvana if it dodged the
predictor. From Omega's point of view, if Omega two-boxed and the agent
one-boxed, the agent's reward would be zero (and the agent would learn its
beliefs were wrong). From a third-person point of view, the counterfactual
"Omega makes an error of prediction" is ill-defined, it's conditioning on an
event of probability 0.

1Vladimir Slepnev2yYeah, I think I can make peace with that. Another way to think of it is that we
can keep the reward structure of the original Newcomb's problem, but instead of
saying "Omega is almost always right" we add another person Bob (maybe the mad
scientist who built Omega) who's willing to pay you a billion dollars if you
prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess
the remaining question is why minimaxing is the right thing to do. And if
randomizing is allowed, the idea of Omega predicting how you'll randomize seems
a bit dodgy as well.

3Vanessa Kosoy2yAnother explanation why maximin is a natural decision rule: when we apply
maximin to fuzzy beliefs
[https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v]
, the requirement to learn a particular class of fuzzy hypotheses is a very
general way to formulate asymptotic performance desiderata for RL agents. So
general that it seems to cover more or less anything you might want. Indeed, the
definition directly leads to capturing any desideratum of the form
lim_{γ→1} E_μ^{π_γ}[U(γ)] ≥ f(μ)
Here, f doesn't have to be concave: the concavity condition in the definition of
fuzzy beliefs is there because we can always assume it without loss of
generality. This is because the left hand side is linear in μ so any π that
satisfies this will also satisfy it for the concave hull of f.
What if instead of maximin we want to apply the minimax-regret decision rule?
Then the desideratum is
lim_{γ→1} E_μ^{π_γ}[U(γ)] ≥ V(μ,γ) − f(μ)
But, it has the same form! Therefore we can consider it as a special case of
applying maximin (more precisely, it requires allowing the fuzzy belief to
depend on γ, but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy π′_0? Then
the desideratum is
lim_{γ→1} E_μ^{π_γ}[U(γ)] ≥ E_μ^{π′_0}[U(γ)]
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata
of the form:
lim_{γ→1} E_μ^{π_γ}[U(γ)] ≥ f(π,μ)
To achieve this, we postulate a predictor that guesses the policy, producing the
guess π̂, and define the fuzzy belief using the function E_{h∼μ}[f(π̂(h),μ)] (we
assume the guess is not influenced by the agent's actions so we don't need π in
the expected value). Using the Nirvana trick, we effectively force the guess to be
accurate.
In particular, this captures self-referential desiderata of the type "the policy
cannot be improved by changing it in this particular way". These are of the
form:
lim_{γ→1} E_μ^{π_γ}[U(γ)] ≥ E_μ^{F(π)}[U(γ)]
It also allo

1Vanessa Kosoy2yWell, I think that maximin is the right thing to do because it leads to
reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think
of incomplete models as properties that the environment might satisfy. It is
necessary to speak of properties instead of complete models since the
environment might be too complex to understand in full (for example because it
contains Omega, but also for more prosaic reasons), but we can hope it at least
has properties/patterns the agent can understand. A quasi-Bayesian agent has the
guarantee that, whenever the environment satisfies one of the properties in its
prior, the expected utility will converge at least to the maximin for this
property. In other words, such an agent is able to exploit any true property of
the environment it can understand. Maybe a more "philosophical" defense of
maximin is possible, analogous to VNM / complete class theorems, but I don't
know (I actually saw some papers in that vein but haven't read them in detail.)
If the agent has random bits that Omega doesn't see, and Omega is predicting the
probabilities of the agent's actions, then I think we can still solve it with
quasi-Bayesian agents but it requires considering more complicated models and I
haven't worked out the details. Specifically, I think that we can define some
function X that depends on the agent's actions and Omega's predictions so far (a
measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor,
then, the supremum of X over time is finite with probability 1. Then, we
consider a family of models, where model number n says that X<n for all
times. Since at least one of these models is true, the agent will learn it, and
will converge to behaving appropriately.
EDIT 1: I think X should be something like, how much money would a gambler
following a particular strategy win, betting against Omega.
EDIT 2: Here is the solution. In the case of original Newcomb, consider a
gambler that bets against Om

1Linda Linsefors2yI agree that you can assign whatever belief you want (e.g. whatever is useful
for the agent's decision-making process) for what happens in the
counterfactual when Omega is wrong, in decision problems where Omega is assumed
to be a perfect predictor. However, if you want to generalise to cases where
Omega is an imperfect predictor (as you do mention), then I think you will (in
general) have to put in the correct reward for Omega being wrong, because this
is something that might actually be observed.

1Vanessa Kosoy2yThe method should work for imperfect predictors as well. In the simplest case,
the agent can model the imperfect predictor as a perfect predictor + random noise.
So, it definitely knows the correct reward for Omega being wrong. It still
believes in Nirvana if "idealized Omega" is wrong.

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, mea

4Vanessa Kosoy1yIn the anthropic trilemma
[https://www.lesswrong.com/posts/y7jZ9BLEeuNTzgAE5/the-anthropic-trilemma],
Yudkowsky writes about the thorny problem of understanding subjective
probability in a setting where copying and modifying minds is possible. Here, I
will argue that infra-Bayesianism (IB) leads to the solution.
Consider a population of robots, each of which is a regular RL agent. The
environment produces the observations of the robots, but can also make copies or
delete portions of their memories. If we consider a random robot sampled from
the population, the history they observed will be biased compared to the
"physical" baseline. Indeed, suppose that a particular observation c has the
property that every time a robot makes it, 10 copies of them are created in the
next moment. Then, a random robot will have c much more often in their history
than the physical frequency with which c is encountered, due to the resulting
"selection bias". We call this setting "anthropic RL" (ARL).
The original motivation for IB was non-realizability. But, in ARL, Bayesianism
runs into issues even when the environment is realizable from the "physical"
perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has
finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗).
The output is a string of states instead of a single state, because many copies
of the agent might be instantiated on the next round, each with their own state.
In general, there will be no single Bayesian hypothesis that captures the
distribution over histories that the average robot sees at any given moment of
time (at any given moment of time we sample a robot out of the population and
look at their history). This is because the distributions at different moments
of time are mutually inconsistent.
[EDIT: Actually, given that we don't care about the order of robots, the
signature of the transition kernel should be T: A×S → Δ(ℕ^S)]
The consistency that is violated is exactly the c

1Charlie Steiner10moCould you expand a little on why you say that no Bayesian hypothesis captures
the distribution over robot-histories at different times? It seems like you can
unroll an AMDP into a "memory MDP" that puts memory information of the robot
into the state, thus allowing Bayesian calculation of the distribution over
states in the memory MDP to capture history information in the AMDP.

1Vanessa Kosoy10moI'm not sure what you mean by that "unrolling". Can you write a mathematical
definition?
Let's consider a simple example. There are two states: s0 and s1. There is just
one action so we can ignore it. s0 is the initial state. An s0 robot transitions
into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What
will our population look like?
0th step: all robots remember s0
1st step: all robots remember s0s1
2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1
3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and
1/3 of robots remember s0s1s1s1
There is no Bayesian hypothesis a robot can have that gives correct predictions
both for step 2 and step 3. Indeed, to be consistent with step 2 we must have
Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have
Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a
randomly sampled robot on a sufficiently late time step will have learned this
hypothesis with high probability. The apparent transition probabilities keep
shifting s.t. it might always continue to seem that the world is complicated
enough to prevent our robot from having learned it already.
Or, at least it's not obvious there is such a hypothesis. In this example,
Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all
probabilities converge fast enough for learning to happen, in general? I don't
know, maybe for finite state spaces it can work. Would definitely be interesting
to check.
[EDIT: actually, in this example there is such a hypothesis but in general there
isn't, see below
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=E58br2mJWbgzQqZhX]
]
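The two-state example can be reproduced with a small Python simulation. This is a minimal sketch that tracks how many robots hold each remembered history, writing "0" for s0 and "1" for s1; the function names are hypothetical. It prints the fractions of robots whose history starts with s0s1s0 and with s0s1s1 at steps 2, 3 and 20, together with their ratio.

```python
from collections import Counter
from fractions import Fraction

# An s0 robot becomes one s1 robot; an s1 robot becomes one s0 and one s1 robot.
SUCCESSORS = {"0": ["1"], "1": ["0", "1"]}


def step(population):
    """Advance every robot one round, extending its remembered history."""
    nxt = Counter()
    for history, count in population.items():
        for state in SUCCESSORS[history[-1]]:
            nxt[history + state] += count
    return nxt


def population_at(n):
    population = Counter({"0": 1})  # a single robot that remembers s0
    for _ in range(n):
        population = step(population)
    return population


def prefix_fraction(population, prefix):
    """Fraction of robots whose remembered history starts with `prefix`."""
    total = sum(population.values())
    matching = sum(c for h, c in population.items() if h.startswith(prefix))
    return Fraction(matching, total)


for n in (2, 3, 20):
    pop = population_at(n)
    p010 = prefix_fraction(pop, "010")
    p011 = prefix_fraction(pop, "011")
    print(n, p010, p011, float(p011 / p010))
```

The descendant counts of the two prefixes are consecutive Fibonacci numbers, which is why the ratio approaches the golden ratio.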

1Charlie Steiner10moGreat example. At least for the purposes of explaining what I mean :) The memory
AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0],
[s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot
in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what's going on aren't obvious:
given a transition distribution P(s′∗|s,π) (P being the distribution over sets of
states s′∗ given starting state s and policy π), we can define the memory
transition distribution P(s′∗_m|s_m,π) given policy π and starting "memory state" s_m∈S∗
(Note that this star actually does mean finite sequences, sorry for notational
ugliness). First we plug the last element of s_m into the transition distribution
as the current state. Then for each s′∗ in the domain, for each element in s′∗ we
concatenate that element onto the end of s_m and collect these s′_m into a set s′∗_m,
which is assigned the same probability P(s′∗).)
So now at time t=2, if you sample a robot, the probability that its state begins
with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability
changes to 0.66. This is the same result as for the regular MDP, it's just that
we've turned a question about the history of agents, which may be ill-defined,
into a question about which states agents are in.
I'm still confused about what you mean by "Bayesian hypothesis" though. Do you
mean a hypothesis that takes the form of a non-anthropic MDP?

1Vanessa Kosoy10moI'm not quite sure what you are trying to say here, probably my explanation of
the framework was lacking. The robots already remember the history, like in
classical RL. The question about the histories is perfectly well-defined. In
other words, we are already implicitly doing what you described. It's like in
classical RL theory, when you're proving a regret bound or whatever, your
probability space consists of histories.
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then
any environment can be regarded as an MDP (whose states are histories). That is,
I'm talking about hypotheses which conform to the classical "cybernetic agent
model". If you wish, we can call it "Bayesian cybernetic hypothesis".
Also, I want to clarify something I was myself confused about in the previous
comment. For an anthropic Markov chain (when there is only one action) with a
finite number of states, we can give a Bayesian cybernetic description, but for
a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some T: S → Δ(ℕ^S). We can take its expected value to get ET: S → ℝ_+^S.
Assuming the chain is communicating, ET is an irreducible non-negative matrix,
so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal
eigenvector η ∈ ℝ_+^S. We then get the subjective transition kernel:
ST(t∣s) = ET(t∣s) η_t / ∑_{t′∈S} ET(t′∣s) η_{t′}
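As a sanity check, the subjective kernel can be computed for the earlier two-state chain (an s0 robot becomes an s1 robot; an s1 robot becomes one s0 and one s1 robot). The Python sketch below rests on a reading of the formula that is an assumption here: ET is stored as a matrix M with M[s][t] = ET(t∣s), and η is the eigenvector satisfying Mη = λη for the largest eigenvalue λ. It uses plain power iteration, which converges for this particular matrix; the function names are hypothetical.

```python
def leading_eigenvector(matrix, iterations=200):
    """Power iteration for eta with matrix @ eta = lambda * eta (largest lambda)."""
    n = len(matrix)
    eta = [1.0] * n
    for _ in range(iterations):
        new = [sum(matrix[s][t] * eta[t] for t in range(n)) for s in range(n)]
        norm = sum(new)
        eta = [x / norm for x in new]
    return eta


def subjective_kernel(matrix):
    """ST(t|s) = ET(t|s) * eta_t / sum over t' of ET(t'|s) * eta_t'."""
    n = len(matrix)
    eta = leading_eigenvector(matrix)
    kernel = []
    for s in range(n):
        weights = [matrix[s][t] * eta[t] for t in range(n)]
        total = sum(weights)
        kernel.append([w / total for w in weights])
    return kernel


# Row s, column t: expected number of t-robots produced by one s-robot.
ET = [
    [0.0, 1.0],  # s0 -> one s1 robot
    [1.0, 1.0],  # s1 -> one s0 robot and one s1 robot
]
ST = subjective_kernel(ET)
print(ST)
```

Under this reading, the ratio ST(s1∣s1)/ST(s0∣s1) comes out as the golden ratio, in agreement with the earlier remark about the late-time ratio of history probabilities.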
Now, consider the following example of an AMDP. There are three actions A:={a,b,
c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0
robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we
apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an
s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in
one robot whose state is s0 with probability 1/2 and s1 with probability 1/2.
Consider the following two policies. πa takes the sequence of actions cacaca…
and πb takes the sequence of actions cbcbcb…. A population that f

1Charlie Steiner10moAh, okay, I see what you mean. Like how preferences are divisible into "selfish"
and "worldly" components, where the selfish component is what's impacted by a
future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to "worldly" and just
sum the reward calculated at individual timesteps, or analogous to "selfish" and
calculated by taking the limit of the subjective distribution over parts of the
history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn't understand what you
were getting at, so I was concerned that something unrealistic was going on. For
example, if you assume that the agent can remember its history, how can you
possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you're talking about
the subjective probability of a state given a policy and a timestep, but if the
agents know their histories there is no actual agent in the information-state
that corresponds to having those probabilities. In an MDP the agents just have
probabilities over transitions - so maybe a clearer example is an agent that
copies itself if it wins the lottery having a larger subjective transition
probability of going from gambling to winning. (i.e. states are losing and
winning, actions are gamble and copy, the policy is to gamble until you win and
then copy).

1 Vanessa Kosoy 10mo
AMDP is only a toy model that distills the core difficulty into more or less the
simplest non-trivial framework. The rewards are "selfish": there is a reward
function r:(S×A)∗→R which allows assigning utilities to histories by time
discounted summation, and we consider the expected utility of a random robot
sampled from a late population. And, there is no memory wiping. To describe
memory wiping we indeed need to do the "unrolling" you suggested. (Notice that
from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=SBPzgAZgFFxtL9E64]
, but (i) instead of A×O labels use A×M labels, where M is the set of possible
memory states (a policy is then described by π:M→A), to allow for agents that
don't fully trust their memory (ii) consider another chain with a bigger state
space S′ plus a mapping p:S′→ℕ^S s.t. the transition kernels are compatible.
Here, the semantics of p(s) is: the multiset of ontological states resulting
from interpreting the physical state s by taking the viewpoints of different
agents s contains.
I didn't understand "no actual agent in the information-state that corresponds
to having those probabilities". What does it mean to have an agent in the
information-state?

1 Charlie Steiner 10mo
Never mind, I think I was just looking at it with the wrong class of reward
function in mind.

2 Vanessa Kosoy 1y
There is a formal analogy between infra-Bayesian decision theory (IBDT) and
modal updateless decision theory
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e61/using-modal-fixed-points-to-formalize-logical-causality]
(MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable
states S, a set of actions A and a reward function r:A×S→[0,1]. An IBDT agent
has some belief β∈□S[1], and it chooses the action
a∗:=argmax_{a∈A} E_β[λs.r(a,s)].
We can construct an equivalent scenario, by augmenting this one with a perfect
predictor of the agent (Omega). To do so, define S′:=A×S, where the semantics of
(p,s) is "the unobservable state is s and Omega predicts the agent will take
action p". We then define r′:A×S′→[0,1] by r′(a,p,s):=1_{a=p}·r(a,s)+1_{a≠p} and β′∈□S′
by E_{β′}[f]:=min_{p∈A} E_β[λs.f(p,s)] (β′ is what we call the pullback of β to S′, i.e.
we have utter Knightian uncertainty about Omega). This is essentially the usual
Nirvana construction.
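
As a sanity check on the construction above, here is a minimal numerical sketch for the crisp case. It assumes a belief can be represented by a finite list of probability distributions (the extreme points of the convex set), and the toy actions, states, rewards and belief are made up for illustration. It computes the IBDT action in the original problem and in the Omega-augmented problem.

```python
def infra_expect(belief, f):
    """Crisp infra-expectation: the worst case of E_mu[f] over mu in `belief`.

    Assumption: `belief` is a finite list of dicts (state -> probability)
    standing in for the convex set of a crisp infradistribution.
    """
    return min(sum(prob * f(s) for s, prob in mu.items()) for mu in belief)


def ibdt_action(actions, belief, r):
    # a* := argmax_a E_beta[lambda s. r(a, s)]
    return max(actions, key=lambda a: infra_expect(belief, lambda s: r(a, s)))


def augmented_reward(r):
    # r'(a, p, s) := 1_{a=p} * r(a, s) + 1_{a != p}
    return lambda a, p, s: r(a, s) if a == p else 1.0


def pullback_expect(actions, belief, f):
    # E_{beta'}[f] := min_p E_beta[lambda s. f(p, s)]
    return min(infra_expect(belief, lambda s, p=p: f(p, s)) for p in actions)


def augmented_action(actions, belief, r):
    r_aug = augmented_reward(r)
    return max(
        actions,
        key=lambda a: pullback_expect(actions, belief, lambda p, s: r_aug(a, p, s)),
    )


# Hypothetical toy problem: two actions, two states, a belief with two extreme points.
ACTIONS = ["x", "y"]
REWARD = {("x", "s0"): 0.9, ("x", "s1"): 0.1, ("y", "s0"): 0.5, ("y", "s1"): 0.4}
BELIEF = [{"s0": 0.8, "s1": 0.2}, {"s0": 0.3, "s1": 0.7}]


def reward(a, s):
    return REWARD[(a, s)]


if __name__ == "__main__":
    print(ibdt_action(ACTIONS, BELIEF, reward))
    print(augmented_action(ACTIONS, BELIEF, reward))
```

In this toy problem both calls select the same action: a wrong prediction by Omega yields reward 1, so the minimum over predictions p is always attained at p=a.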
The new setup produces the same optimal action as before. However, we can now
give an alternative description of the decision rule.
For any p∈A, define Ω_p∈□S′ by E_{Ω_p}[f]:=min_{s∈S} f(p,s). That is, Ω_p is an
infra-Bayesian representation of the belief "Omega will make prediction p". For
any u∈[0,1], define R_u∈□S′ by E_{R_u}[f]:=min_{μ∈ΔS′: E_μ[r(p,s)]≥u} E_μ[f(p,s)]. R_u can be
interpreted as the belief "assuming Omega is accurate, the expected reward will
be at least u".
We will also need to use the order ⪯ on □X defined by: ϕ⪯ψ when ∀f∈[0,1]^X: E_ϕ[f]≥E_ψ[f].
The reversal is needed to make the analogy to logic intuitive. Indeed, ϕ⪯ψ
can be interpreted as "ϕ implies ψ"[2], the meet
operator ∧ can be interpreted as logical conjunction and the join operator ∨ can
be interpreted as logical disjunction.
Claim:
a∗=argmax_{a∈A} max{u∈[0,1] ∣ β′∧Ω_a⪯R_u}
(Actually I only checked it when we restrict to crisp infradistributions, in
which case ∧ is intersection of sets and ⪯ is set conta

1 Vanessa Kosoy 9mo
Infra-Bayesianism can be naturally understood as semantics for a certain
non-classical logic. This promises an elegant synthesis between
deductive/symbolic reasoning and inductive/intuitive reasoning, with several
possible applications. Specifically, here we will explain how this can work for
higher-order logic. There might be holes and/or redundancies in the precise
definitions given here, but I'm quite confident the overall idea is sound.
For simplicity, we will only work with crisp infradistributions, although a lot
of this stuff can work for more general types of infradistributions as well.
Therefore, □X will denote the space of crisp infradistributions. Given μ∈□X,
S(μ)⊆ΔX will denote the corresponding convex set. As opposed to previously, we will
include the empty set, i.e. there is ⊥_X∈□X s.t. S(⊥_X)=∅. Given p∈ΔX and μ∈□X,
p:μ will mean p∈S(μ). Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).
Syntax
Let T_ι denote a set which we interpret as the types of individuals (we allow
more than one). We then recursively define the full set of types T by:
* 0∈T (intended meaning: the uninhabited type)
* 1∈T (intended meaning: the one element type)
* If α∈T_ι then α∈T
* If α,β∈T then α+β∈T (intended meaning: disjoint union)
* If α,β∈T then α×β∈T (intended meaning: Cartesian product)
* If α∈T then (α)∈T (intended meaning: predicates with argument of type α)
For each α,β∈T, there is a set F^0_{α→β} which we interpret as atomic terms of type
α→β. We will denote V^0_α:=F^0_{1→α}. Among those we distinguish the logical atomic
terms:
* pr_{αβ}∈F^0_{α×β→α}
* i_{αβ}∈F^0_{α→α+β}
* Symbols we will not list explicitly, that correspond to the algebraic
properties of + and × (commutativity, associativity, distributivity and the
neutrality of 0 and 1). For example, given α,β∈T there is a "commutator" of
type α×β→β×α.
* =_α∈V^0_{(α×α)}
* diag_α∈F^0_{α→α×α}
* ()_α∈V^0_{((α)×α)} (intended meaning: predicate evaluation)
* ⊥∈V^0_{(1)}
* ⊤∈V^0_{(1)}
* ∨_α∈F^0_{(α)×(α)→(α)}
* ∃_{αβ}∈F^0_{(α×β)→(β)}
* Assume that for each n∈N

2 Vanessa Kosoy 9mo
Let's also explicitly describe 0th order and 1st order infra-Bayesian logic
(although they should be segments of higher-order).
0-th order
Syntax
Let A be the set of propositional variables. We define the language L:
* Any a∈A is also in L
* ⊥∈L
* ⊤∈L
* Given ϕ,ψ∈L, ϕ∧ψ∈L
* Given ϕ,ψ∈L, ϕ∨ψ∈L
Notice there's no negation or implication. We define the set of judgements
J:=L×L. We write judgements as ϕ⊢ψ ("ψ in the context of ϕ"). A theory is a subset of
J.
Semantics
Given T⊆J, a model of T consists of a compact Polish space X and a mapping
M:L→□X. The latter is required to satisfy:
* M(⊥)=⊥X
* M(⊤)=⊤X
* M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of
the corresponding sets
* M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of
the corresponding sets
* For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)
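
The 0-th order semantics can be illustrated with a deliberately restricted toy model. The sketch below assumes X is a finite set and interprets every propositional variable as an infradistribution of the form Δ(U), i.e. "all distributions supported on U" for some U⊆X. This is a strong simplification that is not in the original: within this class, intersection of the convex sets corresponds to U∩V, the convex hull corresponds to U∪V, and ⪯ is subset inclusion, so each infradistribution can be stored as its support set. The tuple encoding of formulae and all function names are illustrative.

```python
def interpret(formula, valuation, space):
    """Map a formula to the support set U representing the infradistribution Δ(U).

    Formulae are nested tuples: ("var", name), ("bot",), ("top",),
    ("and", phi, psi), ("or", phi, psi).
    """
    kind = formula[0]
    if kind == "var":
        return frozenset(valuation[formula[1]])
    if kind == "bot":
        return frozenset()        # ⊥: the empty set of distributions
    if kind == "top":
        return frozenset(space)   # ⊤: all distributions on X
    left = interpret(formula[1], valuation, space)
    right = interpret(formula[2], valuation, space)
    if kind == "and":
        return left & right       # Δ(U) ∩ Δ(V) = Δ(U ∩ V)
    if kind == "or":
        return left | right       # convex hull of Δ(U) ∪ Δ(V) = Δ(U ∪ V)
    raise ValueError(f"unknown connective: {kind}")


def is_model(theory, valuation, space):
    """Check M(phi) ⪯ M(psi) for every judgement phi ⊢ psi in the theory."""
    return all(
        interpret(phi, valuation, space) <= interpret(psi, valuation, space)
        for phi, psi in theory
    )


if __name__ == "__main__":
    space = {0, 1, 2}
    valuation = {"a": {0, 1}, "b": {1, 2}}
    a, b = ("var", "a"), ("var", "b")
    print(is_model([(("and", a, b), a), (a, ("or", a, b))], valuation, space))
    print(is_model([(a, b)], valuation, space))
```

Genuinely infra-Bayesian models use convex sets that are not of the form Δ(U), which is where the semantics departs from this classical-looking special case.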
1-st order
Syntax
We define the language using the usual syntax of 1-st order logic, where the
allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by
types from some set T. For simplicity, we assume no constants, but it is easy to
introduce them. For any sequence of variables (v_1…v_n), we denote L_v the set of
formulae whose free variables are a subset of v_1…v_n. We define the set of
judgements J:=⋃_v L_v×L_v.
Semantics
Given T⊆J, a model of T consists of
* For every t∈T, a compact Polish space M(t)
* For every ϕ∈L_v where v_1…v_n have types t_1…t_n, an element M_v(ϕ) of
X_v:=□(∏_{i=1}^n M(t_i))
It must satisfy the following:
* M_v(⊥)=⊥_{X_v}
* M_v(⊤)=⊤_{X_v}
* M_v(ϕ∧ψ)=M_v(ϕ)∧M_v(ψ)
* M_v(ϕ∨ψ)=M_v(ϕ)∨M_v(ψ)
* Consider variables u_1…u_n of types t_1…t_n and variables v_1…v_m of types s_1…s_m.
Consider also some σ:{1…n}→{1…m} s.t. s_{σ(i)}=t_i. Given ϕ∈L_v, we can form the
substitution ψ:=ϕ[x_i=y_{σ(i)}]∈L_u. We also have a mapping f_σ:X_v→X_u given by
f_σ(x_1…x_m)=(x_{σ(1)}…x_{σ(n)}). We require M_u(ψ)=f_∗(M_v(ϕ))
* Consider variables v_1…v_n and i∈{1…n}. Denote pr:X_v→X_{v∖v_i} the projection
mapping. We require M_{v∖v_i}(∃v_i:ϕ)=pr_∗(M_v

1 Vanessa Kosoy 9mo
When using infra-Bayesian logic to define a simplicity prior, it is natural to
use "axiom circuits" rather than plain formulae. That is, when we write the
axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols
for repeating terms. This doesn't affect the expressiveness, but it does affect
the description length. Indeed, eliminating all the shorthand symbols can
increase the length exponentially.

1 Vanessa Kosoy 9mo
Instead of introducing all the "algebrator" logical symbols, we can define T as
the quotient by the equivalence relation defined by the algebraic laws. We then
need only two extra logical atomic terms:
* For any n∈N and σ∈S_n (permutation), denote n:=∑_{i=1}^n 1 and require σ^+∈F_{n→n}
* For any n∈N and σ∈S_n, σ^×_α∈F_{α^n→α^n}
However, if we do this then it's not clear whether deciding that an expression
is a well-formed term can be done in polynomial time. Because, to check that the
types match, we need to test the identity of algebraic expressions and opening
all parentheses might result in something exponentially long.

1 Vanessa Kosoy 9mo
Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just
imagine that types are variables over Q, and start from testing the identity of
the types appearing inside parentheses), so we can validate expressions in
randomized polynomial time (and, given standard conjectures, in deterministic
polynomial time as well).
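
A minimal sketch of the randomized identity test described above, under several assumptions that are not in the original: type expressions are encoded as nested tuples built from 0, 1, atom names, + and ×; evaluation is done modulo a fixed prime instead of over Q; and the predicate constructor (α) is left out entirely. The names `PRIME`, `evaluate`, `atoms` and `probably_equal` are illustrative.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; the field size is an illustrative choice


def evaluate(expr, env):
    """Evaluate a type expression as a polynomial modulo PRIME.

    `expr` is an int (0 or 1), an atom name (str), or ("+", l, r) / ("*", l, r).
    """
    if isinstance(expr, int):
        return expr % PRIME
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    if op == "+":
        return (lhs + rhs) % PRIME
    if op == "*":
        return (lhs * rhs) % PRIME
    raise ValueError(f"unknown operator: {op}")


def atoms(expr):
    """Collect the atom names occurring in an expression."""
    if isinstance(expr, int):
        return set()
    if isinstance(expr, str):
        return {expr}
    return atoms(expr[1]) | atoms(expr[2])


def probably_equal(e1, e2, trials=5, rng=None):
    """Return False if a random evaluation separates e1 and e2, True otherwise."""
    rng = rng or random.Random()
    names = atoms(e1) | atoms(e2)
    for _ in range(trials):
        env = {name: rng.randrange(PRIME) for name in names}
        if evaluate(e1, env) != evaluate(e2, env):
            return False
    return True


if __name__ == "__main__":
    lhs = ("*", ("+", "a", "b"), "c")
    rhs = ("+", ("*", "a", "c"), ("*", "b", "c"))
    print(probably_equal(lhs, rhs))
    print(probably_equal(("*", "a", "b"), ("+", "a", "b")))
```

A False answer is always correct, while a True answer can be wrong with probability at most (degree / PRIME) per trial. Deeply nested products can have very high degree, so a real implementation would need a field large enough for the expressions it accepts.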

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...

1 Ofer Givoli 1y
I think that in embedded settings (with a bounded version of Solomonoff
induction) convergence may never occur, even in the limit as the amount of
compute that is used for executing the agent goes to infinity. Suppose the
observation history contains sensory data that reveals the probability
distribution that the agent had, in the last time step, for the next number it's
going to see in the target sequence. Now consider the program that says: "if the
last number was predicted by the agent to be 0 with probability larger than
1−2^{−10^{10}} then the next number is 1; otherwise it is 0." Since it takes much less
than 10^{10} bits to write that program, the agent will never predict two times in
a row that the next number is 0 with probability larger than 1−2^{−10^{10}} (after
observing only 0s so far).

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history h∈(A×O)∗ corresponds

1 Vanessa Kosoy 2y
There is a deficiency in this "dynamically subjective" regret bound (also can be
called "realizable misalignment" bound) as a candidate formalization of
alignment. It is not robust to scaling down
[https://www.alignmentforum.org/posts/bBdfbWfWxHN9Chjcq/robustness-to-scale]. If
the AI's prior allows it to accurately model the user's beliefs (realizability
assumption), then the criterion seems correct. But, imagine that the user's
beliefs are too complex and an accurate model is not possible. Then the
realizability assumption is violated and the regret bound guarantees nothing.
More precisely, the AI may use incomplete models
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda]
to capture some properties of the user's beliefs and exploit them, but this
might not be good enough. Therefore, such an AI might fall into a dangerous zone
when it is powerful enough to cause catastrophic damage but not powerful enough
to know it shouldn't do it.
To fix this problem, we need to introduce another criterion which has to hold
simultaneously with the misalignment bound. We need that for any reality that
satisfies the basic assumptions built into the prior (such as, the baseline
policy is fairly safe, most questions are fairly safe, human beliefs don't
change too fast etc), the agent will not fail catastrophically. (It would be way
too much to ask that it converge to optimality: that would violate
no-free-lunch.) In order to formalize "not fail catastrophically" I propose the
following definition.
Let's start with the case when the user's preferences and beliefs are
dynamically consistent. Consider some AI-observable event S that might happen in
the world. Consider a candidate learning algorithm π_learn and two auxiliary
policies. The policy π_{base→S} follows the baseline policy until S happens, at
which time it switches to the subjectively optimal policy. The policy π_{learn→S}
follows the candidate learning algorithm unt

1 Alex Turner 2y
This seems quite close (or even identical) to attainable utility preservation
[https://arxiv.org/abs/1902.09725]; if I understand correctly, this echoes
arguments I've made
[https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#wXHJArzDPoYejHuz2]
for why AUP has a good shot of avoiding catastrophes and thereby getting you
something which feels similar to corrigibility.

1 Vanessa Kosoy 2y
There is some similarity, but there are also major differences. They don't even
have the same type signature. The dangerousness bound is a desideratum that any
given algorithm can either satisfy or not. On the other hand, AUP is a specific
heuristic for how to tweak Q-learning. I guess you can consider some kind of regret
bound w.r.t. the AUP reward function, but they will still be very different
conditions.
The reason I pointed out the relation to corrigibility is not because I think
that's the main justification for the dangerousness bound. The motivation for
the dangerousness bound is quite straightforward and self-contained: it is a
formalization of the condition that "if you run this AI, this won't make things
worse than not running the AI", no more and no less. Rather, I pointed the
relation out to help readers compare it with other ways of thinking they might
be familiar with.
From my perspective, the main question is whether satisfying this desideratum is
feasible. I gave some arguments why it might be, but there are also opposite
arguments. Specifically, if you believe that debate is a necessary component of
Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can
become certain that the user would respond in a particular way to a query, but
it cannot become (worst-case) certain that the user would not change eir
response when faced with some rebuttal. You can't (empirically and in the
worst-case) prove a negative.

1 Vanessa Kosoy 2y
Dialogic RL assumes that the user has beliefs about the AI's ontology. This
includes the environment(fn1) from the AI's perspective. In other words, the
user needs to have beliefs about the AI's counterfactuals (the things that would
happen if the AI chooses different possible actions). But, what are the
semantics of the AI's counterfactuals from the user's perspective? This is more
or less the same question that was studied by the MIRI-sphere for a while,
starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform#TzkG7veQAMMRNh3Pg]
based on the incomplete models formalism. This answer can be applied in this
case also, quite naturally.
Specifically, we assume that there is a sense, meaningful to the user, in which
ey select the AI policy (program the AI). Therefore, from the user's
perspective, the AI policy is a user action. Again from the user's perspective,
the AI's actions and observations are all part of the outcome. The user's
beliefs about the user's counterfactuals can therefore be expressed as
σ:Π→Δ((A×O)^ω)(fn2), where Π is the space of AI policies(fn3). We assume that for every π∈Π,
σ(π) is consistent with π in the natural sense. Such a belief can be transformed
into an incomplete model from the AI's perspective, using the same technique we
used to solve Newcomb-like decision problems, with σ playing the role of Omega.
For a deterministic AI, this model looks like (i) at first, "Murphy" makes a
guess that the AI's policy is π=π_guess (ii) The environment behaves according to
the conditional measures of σ(π_guess) (iii) If the AI's policy ever deviates
from π_guess, the AI immediately enters an eternal "Nirvana" state with maximal
reward. For a stochastic AI, we need to apply the technique with statistical
tests and multiple models alluded to in the link. This can also be generalized
to the setting where the user's beliefs are already an incomplete model, by
adding another step

1 Vanessa Kosoy 2y
Another notable feature of this approach is its resistance to "attacks from the
future", as opposed to approaches based on forecasting. In the latter, the AI
has to predict some future observation, for example what the user will write
after working on some problem for a long time. In particular, this is how the
distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster
might sample a future in which a UFAI has been instantiated and this UFAI will
exploit this to infiltrate the present. This might result in a self-fulfilling
prophecy, but even if the forecasting is counterfactual (and thus immune to
self-fulfilling prophecies) it can be attacked by a UFAI that came to be for
unrelated reasons. We can ameliorate this by making the forecasting recursive
(i.e. apply multiple distillation & amplification steps) or use some other
technique to compress a lot of "thinking time" into a small interval of physical
time. However, this is still vulnerable to UFAIs that might arise already at
present with a small probability rate (these are likely to exist since our
putative FAI is deployed at a time when technology progressed enough to make
competing AGI projects a real possibility).
Now, compare this to Dialogic RL, as defined via the framework of dynamically
inconsistent beliefs. Dialogic RL might also employ forecasting to sample the
future, presumably more accurate, beliefs of the user. However, if the user is
aware of the possibility of a future attack, this possibility is reflected in
eir beliefs, and the AI will automatically take it into account and deflect it
as much as possible.

1 Vanessa Kosoy 2y
This approach also obviates the need for an explicit commitment mechanism.
Instead, the AI uses the current user's beliefs about the quality of future user
beliefs to decide whether it should wait for the user's beliefs to improve or commit
to an irreversible course of action. Sometimes it can also predict the future
user beliefs instead of waiting (predict according to current user beliefs
updated by the AI's observations).

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).

Currently I only have speculations about the solution. But, I have a few desiderata for it:

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable


An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don't need to depart far from Bayesianis...

I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

This concept is of potential interest for several reasons:

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)
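
The type signatures in this setting can be written down directly. The sketch below is only illustrative: it assumes finite histories, string-valued actions and observations, geometric discounting with a discount factor `gamma`, and a (1−γ) normalization that keeps the utility in [0,1]. None of these choices is fixed by the text.

```python
from typing import Callable, Dict, Tuple

Action = str
Observation = str
History = Tuple[Tuple[Action, Observation], ...]      # an element of (A×O)*
Policy = Callable[[History], Dict[Action, float]]     # π: (A×O)* → ΔA
RewardFn = Callable[[History], float]                 # r: (A×O)* → [0,1]


def discounted_utility(history: History, reward: RewardFn, gamma: float) -> float:
    """Normalized discounted sum of rewards over the prefixes of a finite history.

    Computes (1 - gamma) * sum_t gamma^t * r(h_{:t+1}); both the normalization
    and the truncation to a finite horizon are assumptions of this sketch.
    """
    total = 0.0
    for t in range(len(history)):
        total += (gamma ** t) * reward(history[: t + 1])
    return (1.0 - gamma) * total


if __name__ == "__main__":
    h: History = (("a", "o"), ("a", "o"), ("a", "o"))
    print(discounted_utility(h, lambda prefix: 1.0, 0.5))
```

The vacuity argument in the next paragraph is easy to see in these terms: for any fixed policy one can write a `RewardFn` that returns 1 exactly on histories consistent with that policy.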

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can only be achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ...

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows:

What kind of agent, and in what conditions, can effectively plan for events after its own death?

For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology...

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati...

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to in...

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B...

In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):

desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max...

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, mea...

Master post for ideas about infra-Bayesianism.

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...}. At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re... (read more)

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history h∈(A×O)∗ corresponds... (read more)

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating... (read more)

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters). Currently I only have speculations about the solution. But, I have a few desiderata for it:

De... (read more)

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable... (read more)