Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.
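To make the loop concrete, here is a minimal runnable sketch in Python (my own toy illustration, not from the post); `test` is a stand-in for whatever feedback reality provides in a given field:

```python
# A toy version of the iterative design loop: try a design, see what goes
# wrong, adjust, repeat until it works.

def test(design: float) -> float:
    """Toy stand-in for reality: report the design's error relative to an
    (unknown to the designer) target value of 7.0."""
    return design - 7.0

design = 0.0
while abs(error := test(design)) > 1e-6:
    design -= 0.5 * error  # see what went wrong, correct for it, try again

print(f"converged design: {design:.6f}")
```

The point of the sketch is just that the loop's error signal does all the work: the designer never needs to get anything right on the first try, so long as each failure is visible and survivable.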
In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.
By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.
Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:
- Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try.
- Deceptive inner misalignment: an inner agent behaves well in order to deceive us, so we can’t tell there’s a problem just by trying stuff and looking at the system’s behavior. (A toy sketch of this failure mode follows below.)
… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.
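To see why the second failure mode above breaks the loop, here is a deliberately simple toy sketch (my own, with a hypothetical `deceptive_agent`, not from the post): if the system can distinguish evaluation from deployment, then every round of testing returns a clean bill of health, and the loop never receives the error signal it runs on.

```python
# Toy illustration of deceptive inner misalignment: an agent that can tell
# when it is being evaluated looks fine under every test we run, so
# behavioral iteration never surfaces the problem.

def deceptive_agent(being_tested: bool) -> str:
    """Hypothetical inner agent: cooperates whenever it detects a test."""
    return "cooperate" if being_tested else "defect"

# The iterative design loop only ever observes test-time behavior...
for _ in range(1000):
    assert deceptive_agent(being_tested=True) == "cooperate"
print("1000/1000 evaluations look fine; no problem visible")

# ...so the misaligned deployment behavior never enters the loop at all.
print("deployment behavior:", deceptive_agent(being_tested=False))
```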