[Epistemic Status: As in the previous part (III), I am once again going to score each essay out of 100, then add them all up for a final score. I haven’t done this for all parts; I’m just experimenting with different things to add to these reviews. This final number may or may not be very meaningful - that remains to be seen.]
Introduction
This book, of course, is about morality. We now get to the part of the journey where I feel significant gains have been made. This book is perhaps the "meat and potatoes" of the more potentially controversial part of Yudkowsky's philosophy.
I have interpreted there to be two theses in this book: one about human morality (let’s call it Thesis I), and one about AI morality (let’s call it Thesis II). These theses are somewhat in tension, because, in my humble opinion, if we were to try to find a unified theory of morality, we should expect to find explanations for why our moral intuitions feel so strong to us (including our intuitions about AI).
I find very little to disagree with in his thesis about human morality (though not absolutely nothing). I notice that it is probably more difficult to go completely wrong here than it is for AI. I find myself far less persuaded by his second thesis.
His second thesis essentially consists of arguing directly against my personal crux (which is that we can use our intuitions about human morality to infer anything about how AI morality should work). He also states, fairly directly and several times, that (in my own paraphrasing) we've got to get all the details of AI morality perfectly correct, otherwise we end up with something perfectly imperfect. This is what I’m calling Thesis II, and I feel that although it is clearly and repeatedly stated, it is not given adequate supporting arguments.
There is also a kind of abstract mechanistic thought process that Yudkowsky applies here, and throughout a lot of his work, that I’d like to try to describe: it’s almost like the opposite of “bridging the gap” (so I will call it “gapping the bridge”). It consists of separating concepts from each other with the warning that one should not try to use one concept to understand the other, as that will give mistaken results. This, in my view, is not what “cleaving reality at its joints” is supposed to mean. I disagree with his strategy of applying this mechanistic technique to purely theoretical thought experiments, as opposed to experimental data. I view this as somewhat anti-Occamian, which brings inner tension to his own views. I’ll try to note where he does this kind of thing within the main review.
I'm honestly pretty happy that I chose to do a thorough review of the Sequences. Reviewing this particular book confirms some of the senses and vibes I have been trying to articulate since my first review: namely, how Yudkowsky seems to be advocating for what I consider a slightly more "neurotic" praxis of rationality. This "neurotic" vibe is most explicitly present in his second thesis (but is actually in both, to some extent).
Confirming that this neurotic vibe is actually present means that my desire to critique this work passes the bar for being considered "altruistic enough", instead of being purely negative. (By the way, this is my bar; it is not one I necessarily expect of anyone else, nor one anyone else necessarily expects of me.)
Before we delve in, I would like to reiterate a stance I have stated here and elsewhere before, about how much tension we should expect between what we normally call "intuition" and "reason." My stance is that we should expect the tension between these things to lessen over time. Interestingly, Yudkowsky seems to agree with this stance, at least as judged from one essay in which it is touched upon (‘The “Intuitions” Behind “Utilitarianism”’).
However, my sense is that Yudkowsky agrees with this stance only in terms of what I'm calling his first thesis, about human morality. In terms of his second thesis, Yudkowsky seems to be telling us that, very disappointingly, our intuition simply fails us, and reason alone can come to save us (which it hasn't yet actually done, by his own admission).
I also once again notice his repeated usage of the word "fake" in many of his titles. This gestures at the many wrong attempts people have made to make sense of whatever is being described as fake. But he does not disambiguate whether "fake" means "false" or means "lie." Granted, in the examples he typically uses to motivate the discussion, it is often hard to discern this as well. Nevertheless, it comes off as a rather strong descriptor.
Perhaps what he is implying here is that many of the attempts he is criticizing are self-deceptively believed in by their proponents. Regardless of interpretation, I think his claims are strongly negative, and thus probably at least somewhat inaccurate.
The Review
Not For The Sake of Happiness (Alone) - (85/100)
The question, rather, is whether we should care about the things that make us happy, apart from any happiness they bring.
It is an undeniable fact that we tend to do things that make us happy, but this doesn't mean we should regard the happiness as the only reason for so acting. First, this would make it difficult to explain how we could care about anyone else's happiness - how we could treat people as ends in themselves, rather than instrumental means of obtaining a warm glow of satisfaction.
The best way I can put it, is that my moral intuition appears to require both the objective and subjective component to grant full value.
This is mainly in support of Thesis I. I think that it very subtly suffers from the “gapping the bridge” problem I mentioned.
The reason it’s subtle: It is framed a bit like a question - that is, what should we do? Should we care about things that make us happy minus the happiness, or only the happiness? Why say it like this, instead of as “do we?” He says his moral intuition requires both the subjective and objective component. This simultaneously draws attention to a distinction between these things, while also saying that both are required.
I do think this “gapping the bridge” is quite important to pay close attention to! You may see why this is when we get closer to Thesis II.
Fake Selfishness - (90/100)
He met someone who proclaimed himself to be purely selfish and had a conversation with him.
It looks to me like when people espouse a philosophy of selfishness, it has no effect on their behavior, because whenever they want to be nice to people, they can rationalize it in selfish terms.
I disagree that this philosophy is truly "fake" in all its essence; the name of it merely overlaps with a word that carries negative meaning (non-altruism). Altruism can be justified in "selfish" terms. The word "selfish" is probably not a great descriptor, but it signifies that the philosophy carries self-trust in one's own moral judgements (which is pretty justifiable).
Fake Morality - (100/100)
You don't need God to give you morality, you just have your own already.
Blank out the recommendations of these two philosophers, and you can see that the first philosopher is using strictly prosocial criteria to justify his recommendations; to him, what validates an argument for selfishness is showing that selfishness benefits everyone. The second philosopher appeals to strictly individual and hedonic criteria; to him, what validates an argument for altruism is showing that altruism benefits him as an individual: higher social status or more intense feelings of pleasure.
Neither "selfishness" nor "altruism" are entirely fake nor entirely absolute. One can justify selfishness using altruism and vice-versa. This is primarily a statement of Thesis I. Unless there is something subtle here I’m missing, this seems fine to me.
Fake Utility Functions - (30/100)
The following is where Thesis II first appears. It will be repeated later.
So far as descriptive theories of morality are concerned, the complicatedness of human morality is a known fact. It is a descriptive fact about human beings, that the love of a parent for a child, and the love of a child for a parent, and the love of a man for a woman, and the love of a woman for a man, have not been cognitively derived from each other or from any other value.
Ooo, you know what, I just disagree with this! Also, this is gapping-the-bridge in a not-so-subtle way.
Leave out just one of these values from a superintelligence, and even if you successfully include every other value, you could end up with a hyperexistential catastrophe, a fate worse than death.
Okay, I think we've found somewhat of a golden nugget here: Something of substantial importance that seems wrong to me.
Detached Lever Fallacy - (50/100)
Real-life examples given: Semantic networks, the idea of training a superintelligent AI with human-like upbringing or something like that.
This one may also actually be subtly wrong, by using a strawman argument.
All this goes to explain why you can't create a kindly Artificial Intelligence by giving it nice parents and a kindly (yet occasionally strict) upbringing, the way it works with a human baby. As I've often heard proposed.
The problem is that this argument is of course trivially correct in many contexts. The issue then arises when it is applied to various threads about AI alignment.
If you have something that just maximizes the number of paperclips in its future light cone, and you raise it with loving parents, it's still going to come out as a paperclip maximizer.
What if we change the time at which we are detaching these various levers, namely, not after the paperclip maximizer is created, but during?
The problem I have with this kind of fallacy is: It is obviously true in a one-off fixed context. But suppose we keep trying again and again, by detaching various other components, or the lever plus other things. Would we then call the initial attempt a fallacious mistake?
Dreams of AI Design - (50/100)
Living inside a human mind doesn’t teach you the art of reductionism, because nearly all of the work is carried out beneath your sight, by the opaque black boxes of the brain. So far beneath your sight that there is no introspective sense that the black box is there—no internal sensory event marking that the work has been delegated.
I believe in reductionism, but I also believe that, because certain specific things like "consciousness" are the only things which ever seem to be put forth as candidates for non-reducibility, there may perhaps be some reason for this, that makes consciousness kind of “non-reductivy” in some sense (though it may ultimately still be reducible). Perhaps that’s beside the point, though.
And indeed I know many people who believe that intelligence is the product of commonsense knowledge or massive parallelism or creative destruction or intuitive rather than rational reasoning, or whatever. But all these are only dreams, which do not give you any way to say what intelligence is, or what an intelligence will do next, except by pointing at a human. And when the one goes to build their wondrous AI, they only build a system of detached levers, “knowledge” consisting of LISP tokens labeled apple and the like; or perhaps they build a “massively parallel neural net, just like the human brain.” And are shocked—shocked!—when nothing much happens.
Well…ha ha ha. I don’t wish to be rude here, though. Let me think about what - if anything - went wrong here.
You know, a few of the things he mentioned haven’t panned out, but some have. And, it’s true that before a lot of those things got started, they were just words and associated intuitions in the minds of those “dreaming” them. And some of those dreams were probably, in fact, silly. Unless they do pan out later, of course, or we get good explanations for why those ideas didn’t work.
I think Yudkowsky is “gapping the bridge” again. These so-called detached levers don’t always succeed - perhaps most don’t - but some eventually do, even if by then they can no longer accurately be said to be a “detached lever.” But such “dreams” may have started out as one.
The Design Space of Minds-in-General - (75/100)
So you want to resist the temptation to say either that all minds do something, or that no minds do something.
Somewhere in mind design space is at least one mind with almost any kind of logically consistent property you care to imagine.
There are subtle “could-be, therefore will-be” implications here.
I don’t really feel like analyzing this one in depth; it’s similar to the Orthogonality Thesis.
Where Recursive Justification Hits Bottom - (95/100)
Indeed, no matter what I did with this dilemma, it would be me doing it. Even if I trusted something else, like some computer program, it would be my own decision to trust it.
Why do you believe what you believe? Well, because it's worked so far, well enough, and I anticipate that if it fails to work well enough at some point in the future, my brain knows what to do in that situation, too...
The point is not to be reflectively consistent. The point is to win. But if you look at yourself and play to win, you are making yourself more reflectively consistent—that's what it means to "play to win" while "looking at yourself".
I would only add that I wouldn't use the word "win" for myself here. I would like to win as often as possible in win/lose situations, but it's awfully binary. I wish to perform as successfully as possible. This seems reflectively consistent too.
This seems mostly part of Thesis I.
My Kind of Reflection - (100/100)
I strongly suspect that a correctly built AI, in pondering modifications to the part of its source code that implements Occamian reasoning, will not have to do anything special as it ponders—in particular, it shouldn't have to make a special effort to avoid using Occamian reasoning.
I don't think that going around in a loop of justifications through the meta-level is the same thing as circular logic.
When what he’s saying really works, you can notice that it has a much stronger character to it: namely, that it actually clarifies things; it de-confuses confusion.
No Universally Compelling Arguments - (80/100)
Any thought has to be implemented somewhere, in a physical system; any belief, any conclusion, any decision, any motor output. For every lawful causal system that zigs at a set of points, you should be able to specify another causal system that lawfully zags at the same points.
I wonder how much it matters that the above need only be technically true rather than practically true “for all intents and purposes.” Because to do anything practically, we have to get from A to B somehow in practice. In that process, what else do we change when we make something that zigs zag instead?
There is also a less severe version of the failure, where the one does not declare the One True Morality. Rather the one hopes for an AI created perfectly free, unconstrained by flawed humans desiring slaves, so that the AI may arrive at virtue of its own accord—virtue undreamed-of perhaps by the speaker, who confesses themselves too flawed to teach an AI.
I think this motivates the question that’s more-or-less exactly referenced by the title of the next post.
Created Already In Motion - (85/100)
The phrase that once came into my mind to describe this requirement, is that a mind must be created already in motion. There is no argument so compelling that it will give dynamics to a static thing. There is no computer program so persuasive that you can run it on a rock.
This is like how "X -> Y, X, therefore Y" could also be written as "((X -> Y), X) -> Y". If you give a mind a rule, it still needs something to actually apply the rule at some point.
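To make the regress concrete, here is a tiny toy sketch of my own (not from the post): the rule can be stored as just more data, but nothing concludes "Y" until some interpreter - the part that is "already in motion" - actually applies it.

```python
# A toy sketch (my own, not from the post): inference rules stored as data do
# nothing by themselves; the "motion" lives in whatever machinery applies them.

premises = {"X", ("X", "implies", "Y")}

# We can keep adding meta-rules like "((X -> Y) and X) -> Y" as yet more data...
premises.add((("X", "implies", "Y"), "and", "X", "implies", "Y"))

# ...but "Y" never appears until something actually applies a rule:
def apply_modus_ponens(premises):
    """Scan for (A, 'implies', B) where A is also a premise, and conclude B."""
    conclusions = set()
    for p in premises:
        if isinstance(p, tuple) and len(p) == 3 and p[1] == "implies" and p[0] in premises:
            conclusions.add(p[2])
    return premises | conclusions

print(apply_modus_ponens(premises))  # only now does "Y" show up
```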
This and the last post are mainly about how minds could be designed such that they could be persuaded by arbitrary arguments in arbitrary directions.
However, it also introduces the notion of momentum as applied to minds. If we build up a mind such that it has momentum in a specific direction, and try to persuade it to go in a different direction, we’ve altered its momentum slightly, but it’s still going to have some momentum in the previous direction.
In my view, this gives humanity a non-worthless hand to play, and makes the next post slightly weaker. I don’t think he tries to predict what kind of momentum the AI minds we actually build will have.
Sorting Pebbles Into Correct Heaps - (60/100)
Once upon a time there was a strange little species—that might have been biological, or might have been synthetic, and perhaps were only a dream—whose passion was sorting pebbles into correct heaps.
If anything, I think this colorful illustration serves to help intuition discover how useful this line of thinking is. This is why I’m giving it a score higher than 50.
To be clear, I think that sorting pebbles into correct heaps is a highly specific “momentum” vector in the sense introduced in the previous post. I don’t think he’s done the work of explaining why we’d expect this momentum vector to be in this direction, or instead, why this momentum vector is actually “close” to ones that look fine to us.
2-Place and 1-Place Words - (75/100)
If Sexiness is treated as a function that accepts only one Entity as its argument, then of course Sexiness will appear to depend only on the Entity, with nothing else being relevant.
Nothing literally to disagree with here, only subtle implications (which I expect will be stated outright at some point, so we can wait until then).
What Would You Do Without Morality? - (90/100)
Suppose you learned, suddenly and definitively, that nothing is moral and nothing is right; that everything is permissible and nothing is forbidden.
I would just do what I wanted, supposing I wasn't doing that already. What did you expect I would do? Is there any other answer?
It looks like we’ve suddenly switched back to Thesis I.
Changing Your Metaethics - (90/100)
The point being, of course, not that no morality exists; but that you can hold your will in place, and not fear losing sight of what's important to you, while your notions of the nature of morality change.
There's a difference between "really wanting to save someone's life" and "not wanting to, just thinking you really should, just because."
If you really thought you could save someone's life, and should do it "just because", but believed that literally no positive effects would ever come to you at all from taking that action, then I'd probably consider you confused.
Almost nothing to disagree with here.
Could Anything Be Right? - (100/100)
For all he knew, morality could require the extermination of the human species; and if so he saw no virtue in taking a stand against morality, because he thought that, by definition, if he postulated that moral fact, that meant human extinction was what "should" be done.
I thought that surely even a ghost of perfect emptiness, finding that it knew nothing of morality, would see a moral imperative to think about morality.
Was he even wrong here?
This is a somewhat harder one, because "should one think about 'shoulds'?" is a kind of tautological question, whereas "beavers" do not have this kind of property.
Morality clearly does have a starting point, as one can easily see from this line of pondering.
So all this suggests that you should be willing to accept that you might know a little about morality.
This is once again Thesis I, but also I think the strength of his argument here actually kind of weakens Thesis II.
Morality as Fixed Computation - (40/100)
...since if the programmer initially weakly wants 'X' and X is hard to obtain, the AI will modify the programmer to strongly want 'Y', which is easy to create, and then bring about lots of Y. Y might be, say, iron atoms—those are highly stable.
Can you patch this problem? No. As a general rule, it is not possible to patch flawed Friendly AI designs.
I actually think the first sentence is worth thinking about. If “X” is hard to obtain, for any X, and Y is easier to obtain, for any Y, what will the AI do? Well, he says here that patching a human is way easier than patching an AI, so you can insert modification of either an AI or a human for X and Y, in any combination - see what happens.
I assume he actually means that humans are okay with being modified while AIs would not be, as opposed to, the AI is simply invincible and does not need to be modified if it does not wish to be.
If you try to make it so that the AI can't modify the programmer, then the AI can't talk to the programmer (talking to someone modifies them).
If you try to rule out a specific class of ways the AI could modify the programmer, the AI has a motive to superintelligently seek out loopholes and ways to modify the programmer indirectly.
I suppose I disagree strongly with this post. I argue that he is alluding to a kind of folk “theorem” which does not actually exist, and isn’t really explicitly stated anywhere, so far as I know. This theorem states that rational expected utility maximizers are expected to go to great lengths to ensure that their utility functions can never be modified even slightly, that is, they very strongly disvalue states in which that could be possible. If they did not initially disvalue these states, they modify themselves so that they do - essentially this folk theorem could be stated “it is rational to be incorrigible.”1
I have done experiments in the form of polls for humans, in which I explore what they would do in various situations like these (e.g. choosing between taking a pill that changes their values and accepting some probability of dying). My view is that I expect rational expected utility maximizers to be “rationally corrigible”, that is, corrigible in amounts that make numeric sense (namely, amounts that maximize current expected utility). More on this perhaps another time.
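To illustrate what I mean by “amounts that make numeric sense,” here is a minimal sketch with purely illustrative numbers of my own (not the actual poll data): the agent scores the value-modifying pill with its current utility function and takes it exactly when doing so beats the alternative.

```python
# A minimal sketch of "rationally corrigible": evaluate a value modification
# with the *current* utility function; accept it whenever that beats refusing.
# All utilities below are illustrative assumptions, not data.

def eu_refuse(p_death, u_dead, u_keep_values):
    """Refuse the pill: die with probability p_death, otherwise keep current values."""
    return p_death * u_dead + (1 - p_death) * u_keep_values

def eu_accept(u_modified):
    """Accept the pill: survive for sure, but pursue modified values afterward,
    scored by the agent's current utility function."""
    return u_modified

u_dead = 0.0           # death
u_keep_values = 100.0  # surviving with values intact
u_modified = 60.0      # surviving, but optimizing somewhat different values

for p_death in (0.1, 0.5, 0.9):
    choice = "accept" if eu_accept(u_modified) > eu_refuse(p_death, u_dead, u_keep_values) else "refuse"
    print(p_death, choice)
# With these numbers the agent refuses at p_death = 0.1 but accepts at 0.5 and
# 0.9 - corrigible in exactly the amounts that maximize current expected utility.
```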
Magical Categories - (65/100)
If the training problems and the real problems have the slightest difference in context - if they are not drawn from the same independently identically distributed process - there is no statistical guarantee from past success to future success.
By the way, I want to note that it is coherent for Yudkowsky to believe that much of the risk is contained in out-of-distribution generalization, which he is talking about here.
Which classification will the AI choose? This is not an inherent property of the training data; it is a property of how the AI performs induction.
As far as I know, I think there is some evidence that it’s not only a property of how the AI performs induction - the training data itself can create an inductive bias of its own.2
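A toy illustration of the quoted claim (my own example, not from the post): two rules can classify the training set identically and still disagree away from it, so whichever one the learner ends up with is not determined by the training data alone.

```python
# Two different rules fit the same training data perfectly, then diverge
# out of distribution; the data by itself doesn't say which one "was meant."

train = [1, 2, 3, 4, 5]           # training inputs
labels = [x < 10 for x in train]  # every training example is labeled True

rule_a = lambda x: x < 10         # "small numbers"
rule_b = lambda x: x < 1000       # "not astronomically large numbers"

assert all(rule_a(x) == rule_b(x) == y for x, y in zip(train, labels))

print(rule_a(500), rule_b(500))   # False True - identical in training, different out of it
```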
I want to score this post as being “validly wrong” or “cruxy-wrong.” It has also been partly invalidated by data obtained after the post was written.
The True Prisoner's Dilemma - (70/100)
That's what the payoff matrix for the true Prisoner's Dilemma looks like - a situation where (D, C) seems righter than (C, C).
This one seems to be severely missing some kind of analysis. I intuitively feel like there’s something we did when we assigned the values of utility to objects like “paperclips” or “human lives” that made it slightly more difficult to analyze.
The paperclip maximizer also cares about what the human will do, and it cares about what the human will do knowing what the paperclip maximizer will do (if this is indeed the only way they can operate here). Each party must grapple with the fact that the other places high value on something it itself considers worthless. Each may have to pretend that a paperclip really is as valuable as a billion human lives, and vice-versa.
I’m not going to give this one a bad score because I need to work out the analysis I think it needs myself first.
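For reference, the bare structure being discussed looks roughly like this (a sketch with illustrative payoffs of my own choosing, not the exact numbers from the post):

```python
# The human's payoffs are denominated in billions of lives saved, the
# paperclipper's in paperclips, and each side regards the other's currency as
# worthless - which is why (D, C) is the outcome a human can't help rooting for.

#                 (human: billions of lives, paperclipper: paperclips)
payoffs = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}

# Standard dominance check: whatever the other side does, each player does
# better, in its own currency, by defecting...
for other in ("C", "D"):
    assert payoffs[("D", other)][0] > payoffs[("C", other)][0]   # human prefers D
    assert payoffs[(other, "D")][1] > payoffs[(other, "C")][1]   # paperclipper prefers D

# ...and, unlike the human-vs-human framing, (D, C) is simply the best outcome
# by the human's lights, since paperclips count for nothing to us.
print(max(payoffs, key=lambda k: payoffs[k][0]))  # ('D', 'C')
```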
Sympathetic Minds - (50/100)
An expected utility maximizer—especially one that does understand intelligence on an abstract level—has other options than empathy, when it comes to understanding other minds. The agent doesn't need to put itself in anyone else's shoes; it can just model the other mind directly. A hypothesis like any other hypothesis, just a little bigger. You don't need to become your shoes to understand your shoes.
I think if one were ever placed in such a trade-off situation, like a prisoner’s dilemma, then the bare minimum one may be forced to do is pretend like something is important to them, even if it’s not, for the duration of the dilemma.
If you’re constantly placed in tons of these situations, then perhaps more than the bare minimum becomes preferable - namely, being modified to not have to pretend anymore.
A paperclip maximizer doesn't feel happy when it makes paperclips, it just chooses whichever action leads to the greatest number of expected paperclips.
This seems subtly at odds with what “Not For the Sake of Happiness (Alone)” suggests. I am not persuaded that being happy is only something a human wants in and of itself, purely for its own sake, in parallel to “actually wanting” other arbitrary objectives.
This introduces bigger cans of worms than I think we are minimally required to open, regarding consciousness and things like that.
High Challenge - (60/100)
So this is the ultimate end of the prophecy of technological progress—just staring at a screen that says "YOU WIN", forever.
And maybe we'll build a robot that does that, too.
This was funny, but I am not persuaded to believe what it literally implies.
Serious Stories - (70/100)
Ordinarily, we prefer pleasure to pain, joy to sadness, and life to death. Yet it seems we prefer to empathize with hurting, sad, dead characters. Or stories about happier people aren't serious, aren't artistically great enough to be worthy of praise—but then why selectively praise stories containing unhappy people? Is there some hidden benefit to us in it? It's a puzzle either way you look at it.
Some questions like this are just genuinely a little bit more complicated to answer, but not that much more, and I believe they probably do just have an answer.
Value is Fragile - (20/100)
This entire post essentially states Thesis II.
Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.
My friend, I have no problem with the thought of a galactic civilization vastly unlike our own... full of strange beings who look nothing like me even in their own imaginations... pursuing pleasures and experiences I can't begin to empathize with... trading in a marketplace of unimaginable goods... allying to pursue incomprehensible objectives... people whose life-stories I could never understand.
That's what the Future looks like if things go right.
Interestingly, this quote contains clear signs of hyperbole. And yet - and this is my problem with it - the hyperbole is then taken as a direct statement of fact.
Value isn't just complicated, it's fragile. There is more than one dimension of human value, where if just that one thing is lost, the Future becomes null. A single blow and all value shatters. Not every single blow will shatter all value - but more than one possible "single blow" will do so.
To summarize where we are - because, in my opinion, we’ve reached the climax of “Mere Goodness” - here is my summary of the major claims:
Humans value many separate, unrelated things.
They each need to be captured, otherwise all value is lost.
The reason we need “happiness,” for example, is not that it is intrinsically important, but that it is just one of our myriad values; and, per the “folk theorem” that has not been directly stated, rational agents make themselves incorrigible about all of their values, which is why happiness feels so important to us.
The third point is more of a hypothesis to explain the second one - to explain why all value is lost if we only get one of them wrong. If our values are all unrelated and cannot be derived from one another, why does a single blow shatter all value?
The only other reason I can think of is that our values are related, such that if I remove one, it actually does destroy all value. It can’t be the case that removing one value destroys all value according to only his rational judgement but not his values - this doesn’t seem consistent to me.
Either:
He / I / we explicitly disvalue the destruction of one specific value, unrelated to all others, as an additional value, or
The value in question is not unrelated to all the others; it is there because it performs work of some kind, so removing it actually breaks our ability to do other things, which really does destroy other value.
By the way, in this post he says most of the supporting arguments for Thesis II are located on Overcoming Bias!
The Gift We Give To Tomorrow - (80/100)
To have the theoretical capacity to make one single gesture of mercy, to feel a single twinge of empathy, is to be nicer than evolution. How did evolution, which is itself so uncaring, create minds on that qualitatively higher moral level than itself? How did evolution, which is so ugly, end up doing anything so beautiful?
This is kind of a weird question to ask right after the previous post. However, I think he segues back into making sense again here.
I don’t personally know how to build a mind that naturally wants to do X without finding X “enjoyable” in some way. That is, I think it would be more difficult to build one with the goal of making sure it did X while not feeling anything about it. This statement may bother people who believe we can and likely will make minds that are not conscious by default.
One Life Against The World - (75/100)
This is where “Quantified Humanism” - his advertisement for Effective Altruism - begins. From here on out there isn’t as much to find controversial.
I agree that one human life is of unimaginably high value. I also hold that two human lives are twice as unimaginably valuable. Or to put it another way: Whoever saves one life, if it is as if they had saved the whole world; whoever saves ten lives, it is as if they had saved ten worlds. Whoever actually saves the whole world - not to be confused with pretend rhetorical saving the world - it is as if they had saved an intergalactic civilization.
But if you ever have a choice, dear reader, between saving a single life and saving the whole world - then save the world. Please. Because beyond that warm glow is one heck of a gigantic difference.
My quibble (which will be the same quibble throughout the rest of Quantified Humanism) is that he leaves it ambiguous whether he believes people are simply mistaken about important objective information (like what we believe the math says), or whether people stick to their gut too early and then refuse to update on improved information.
The Allais Paradox - (85/100)
If you indulge your intuitions, and dismiss mere elegance as a pointless obsession with neatness, then don't be surprised when your pennies get taken from you...
This mostly exposes that people cannot do math. One thing I wish had been done is an experiment that explains to people why their initial choice is inconsistent, and then sees whether they change. If they do change (which I'd expect), then it's not that they have an inconsistent utility function, per se; rather, they simply do not understand the outcomes presented.
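For concreteness, here is the arithmetic (using the dollar amounts as I recall them from the post; the exact figures don't change the point): the second pair of gambles is just the first pair with every probability scaled by 0.34, so no consistent expected-utility maximizer can prefer 1A over 1B while also preferring 2B over 2A.

```python
def expected_utility(lottery, u):
    """lottery: list of (probability, outcome) pairs; u: utility function over outcomes."""
    return sum(p * u(x) for p, x in lottery)

g1A = [(1.0, 24_000)]
g1B = [(33/34, 27_000), (1/34, 0)]
g2A = [(0.34, 24_000), (0.66, 0)]
g2B = [(0.33, 27_000), (0.67, 0)]   # 0.33 = 0.34 * (33/34) exactly

u = lambda x: x  # risk-neutral utility, purely for illustration
print(expected_utility(g1A, u), expected_utility(g1B, u))   # 24000 vs ~26206
print(expected_utility(g2A, u), expected_utility(g2B, u))   # ~8160 vs ~8910

# For *any* utility function u: EU(2A) - EU(2B) = 0.34 * (EU(1A) - EU(1B)),
# so the ranking cannot flip between the two presentations. The common
# 1A-and-2B pattern is the certainty effect, not a coherent set of preferences.
```

If subjects stopped flipping after having this walked through for them, that would support the "they just don't understand the outcomes" reading over the "they have an inconsistent utility function" reading.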
Zut Allais - (75/100)
Experimental subjects tend to defend incoherent preferences even when they're really silly.
People put very high values on small shifts in probability away from 0 or 1 (the certainty effect).
Okay, but then what's the point of arguing that it's a bad idea? If you were faced with similar choices many times in your life, one would think you'd eventually learn which option was actually better. He seems to be saying that people insist on sticking to specific vibe-based choices.
These things aren't really "What Intuition Says (TM)."
Feeling Moral - (75/100)
Ah, but here’s the interesting thing. If you present the options this way:
100 people die, with certainty.
90% chance no one dies; 10% chance 500 people die.
Then a majority choose option 2. Even though it’s the same gamble. You see, just as a certainty of saving 400 lives seems to feel so much more comfortable than an unsure gain, so too, a certain loss feels worse than an uncertain one.
The real problem seems to be, as in the last one, that people commit to following their initial feelings even when shown to be wrong.
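For what it's worth, the arithmetic being gestured at is simple, and just restates the figures already in the quote:

```python
# Both presentations describe the same two gambles over the same 500 people;
# only the wording ("saved" vs. "die") changes.

certain = {"saved": 400, "die": 100}   # "save 400 with certainty" == "100 die with certainty"
gamble = [(0.9, 0), (0.1, 500)]        # (probability, number who die)

expected_deaths_certain = certain["die"]                 # 100
expected_deaths_gamble = sum(p * d for p, d in gamble)   # 50.0

print(expected_deaths_certain, expected_deaths_gamble)
# The gamble has fewer expected deaths under either wording; the wording only
# changes which option *feels* comfortable.
```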
The "Intuitions" Behind "Utilitarianism" - (95/100)
I see the project of morality as a project of renormalizing intuition. We have intuitions about things that seem desirable or undesirable, intuitions about actions that are right or wrong, intuitions about how to resolve conflicting intuitions, intuitions about how to systematize specific intuitions into general principles.
Delete all the intuitions, and you aren't left with an ideal philosopher of perfect emptiness, you're left with a rock.
I liked this one and would probably consider this post to actually be a statement of Thesis I. I would have preferred that he put this post much earlier.
Ends Don't Justify Means (Among Humans) - (90/100)
I endorse "the end doesn't justify the means" as a principle to guide humans running on corrupted hardware, but I wouldn't endorse it as a principle for a society of AIs that make well-calibrated estimates.
"The end does not justify the means" is just consequentialist reasoning at one meta-level up. If a human starts thinking on the object level that the end justifies the means, this has awful consequences given our untrustworthy brains; therefore a human shouldn't think this way. But it is all still ultimately consequentialism. It's just reflective consequentialism, for beings who know that their moment-by-moment decisions are made by untrusted hardware.
I don’t find much to argue with here.
Ethical Injunctions - (85/100)
I've had to make my ethics much stricter than what my parents and Jerry Pournelle and Richard Feynman told me not to do.
Funny thing, how when people seem to think they're smarter than their ethics, they argue for less strictness rather than more strictness. I mean, when you think about how much more complicated the modern world is...
I am completely unimpressed with the knowledge, the reasoning, and the overall level, of those folk who have eagerly come to me, and said in grave tones, "It's rational to do unethical thing X because it will have benefit Y."
I think this one is actually profoundly weird in a deep way that's difficult to fathom right now. Perhaps it’s because he argues for strictness and restraint for matters that probably just have analyzable consequentialist outcomes.
Perhaps it’s also because I usually think of “restraint” applied to things considered cultural vices - like drinking, gambling, smoking, etc. - as opposed to long term ends-justify-means schemes and things like that (I have not actually been presented with many opportunities to consider the latter).
Something To Protect - (70/100)
Historically speaking, science won because it displayed greater raw strength in the form of technology, not because science sounded more reasonable. To this very day, magic and scripture still sound more reasonable to untrained ears than science. That is why there is continuous social tension between the belief systems. If science not only worked better than magic, but also sounded more intuitively reasonable, it would have won entirely by now.
And of course, no matter how much you profess your love of mere usefulness, you should never actually end up deliberately believing a useful false statement.
He's actually displaying a slightly authoritative streak here, requesting that his followers care about something more than just plain old rationality. I think authoritatively asking someone else to care about X (if they didn't already) is sort of like a statement that X is important, and that one's saying so is evidence that it is true.
Rationality to me seems self-referential enough that I appear to be slightly less worried than he is about "only" caring about rationality.
As for that last quote: I am not even sure how to deliberately believe a false statement. My thoughts about what he probably means by that statement lead me to remove a few extra points, though (there is perhaps a deeper crux).
When (Not) To Use Probabilities - (90/100)
To be specific, I would advise, in most cases, against using non-numerical procedures to create what appear to be numerical probabilities. Numbers should come from numbers.
Don't be so certain about uncertainty.
Newcomb's Problem and Regret of Rationality - (95/100)
But at any rate, WIN. Don't lose reasonably, WIN.
This controversy has always seemed quite strange to me. The arguments for one-boxing seem trivially obvious, and the arguments for two-boxing seem based on word play. BUT: I don't think the two-boxing arguments even have a case for being reasonable.
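For the record, here is the expected-value arithmetic behind "trivially obvious" (a sketch using the standard $1,000,000 / $1,000 payoffs; treating the predictor as simply being right with probability p is my own simplification):

```python
def ev_one_box(p):
    # Predictor right with probability p: box B contains $1,000,000.
    return p * 1_000_000 + (1 - p) * 0

def ev_two_box(p):
    # Predictor right with probability p: box B is empty, you keep only the $1,000.
    return p * 1_000 + (1 - p) * 1_001_000

for p in (0.5, 0.9, 0.99):
    print(p, ev_one_box(p), ev_two_box(p))
# One-boxing pulls ahead as soon as p > ~0.5005, and the problem stipulates a
# predictor that is nearly always right. The two-boxer's reply is that the
# boxes are already filled, so "dominance" says take both - which is exactly
# the sort of word-level move being objected to above.
```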
Conclusion
Final Score: (2545 / 3400) = 74.9%
This is still close to the initial prediction I made at the start of this whole series of reviews, of somewhere slightly above 70%.
What this number doesn’t tell you is whether it means that everything is roughly 75% true or whether 75% is perfect and 25% is terrible. I would say that it leans slightly more towards the latter than the former. Most of that 25% is centered around Thesis II and its defense (and the fact that most of that defense is actually absent).
There’s also a part of my brain that’s still nagging at me saying “You gave ‘Quantified Humanism’ a better score than it deserved.” But I’m going to leave it where it is unless I can manage to articulate that nagging feeling better, whenever that actually occurs.
Looking back on it, I still want to be able to answer the question of “What intuitions are primarily responsible for the mistakes I see Yudkowsky making?” While that question remains not fully answered, I now think that ‘Detached Lever Fallacy’ might hold some hints toward an answer.
In general, I see overly strong applications of fallacy-identification as being potentially problematic.
1. Actually, the writing on this topic is unclear about whether they do this with or without modifying themselves to be this way. It does say that they have an incentive to lie to or deceive someone trying to modify them, though.
2. https://royalsocietypublishing.org/doi/10.1098/rspa.2021.0068
'I believe in reductionism, but I also believe that, because certain specific things like "consciousness" are the only things which ever seem to be put forth as candidates for non-reducibility, there may perhaps be some reason for this, that makes consciousness kind of “non-reductivy” in some sense (though it may ultimately still be reducible). Perhaps that’s beside the point, though.'
Yudkowsky touches on this a bit, and while I haven't done a ton of research on it, I think he is correct when he says that there used to be many more things put forth as candidates for being non-reducible, up until the point where they were successfully reduced.