Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.
If you’d like more, subscribe to the “Lesswrong (30+ karma)” feed.
Copyright: © 2023 LessWrong Curated Podcast
I recently wrote about complete feedback, an idea which I think is quite important for AI safety. However, my note was quite brief, explaining the idea only to my closest research-friends. This post aims to bridge one of the inferential gaps to that idea. I also expect that the perspective-shift described here has some value on its own.
In classical Bayesianism, prediction and evidence are two different sorts of things. A prediction is a probability (or, more generally, a probability distribution); evidence is an observation (or set of observations). These two things have different type signatures. They also fall on opposite sides of the agent-environment division: we think of predictions as supplied by agents, and evidence as supplied by environments.
In Radical Probabilism, this division is not so strict. We can think of evidence in the classical-bayesian way, where some proposition is observed and its probability jumps to 100%. [...]
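For readers who want the standard formal contrast (this is textbook material, not quoted from the post): a classical update conditions on E and sends its probability to 1, while a Jeffrey-style update of the kind Radical Probabilism permits only moves it to some intermediate value q.

```latex
% Classical (strict) conditioning on evidence E: P(E) jumps to 1.
P_{\text{new}}(H) = P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}

% Jeffrey conditioning: experience only shifts P(E) to some q < 1
% (say, a glimpse in dim light leaving you 80% confident in E).
P_{\text{new}}(H) = q\,P(H \mid E) + (1-q)\,P(H \mid \neg E)
```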
---
Outline:
(02:39) Warm-up: Prices as Prediction and Evidence
(04:15) Generalization: Traders as Judgements
(06:34) Collector-Investor Continuum
(08:28) Technical Questions
The original text contained 3 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
February 23rd, 2025
Source:
https://www.lesswrong.com/posts/3hs6MniiEssfL8rPz/judgements-merging-prediction-and-evidence
---
Narrated by TYPE III AUDIO.
---
First, let me quote my previous ancient post on the topic:
Effective Strategies for Changing Public Opinion
The titular paper is very relevant here. I'll summarize a few points.
In a previous book review I described exclusive nightclubs as the particle colliders of sociology—places where you can reliably observe extreme forces collide. If so, military coups are the supernovae of sociology. They’re huge, rare, sudden events that, if studied carefully, provide deep insight about what lies underneath the veneer of normality around us.
That's the conclusion I take away from Naunihal Singh's book Seizing Power: The Strategic Logic of Military Coups. It's not a conclusion that Singh himself draws: his book is careful and academic (though much more readable than most academic books). His analysis focuses on Ghana, a country which experienced ten coup attempts between 1966 and 1983 alone. Singh spent a year in Ghana carrying out hundreds of hours of interviews with people on both sides of these coups, which led him to formulate a new model of how coups work.
I’ll start by describing Singh's [...]
---
Outline:
(01:58) The revolutionary's handbook
(09:44) From explaining coups to explaining everything
(17:25) From explaining everything to influencing everything
(21:40) Becoming a knight of faith
The original text contained 3 images which were described by AI.
---
First published:
February 22nd, 2025
Source:
https://www.lesswrong.com/posts/d4armqGcbPywR3Ptc/power-lies-trembling-a-three-book-review
---
Narrated by TYPE III AUDIO.
---
This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in a variety of contexts. We don't fully understand that phenomenon.
Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín Soto, Xuchan Bao, Nathan Labenz, Owain Evans (*Equal Contribution).
See Twitter thread and project page at emergent-misalignment.com.
Abstract
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range [...]
---
Outline:
(00:55) Abstract
(02:37) Introduction
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
February 25th, 2025
Source:
https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly
---
Narrated by TYPE III AUDIO.
---
It doesn’t look good.
What used to be the AI Safety Summits were perhaps the most promising thing happening towards international coordination for AI Safety.
This one was centrally coordination against AI Safety.
In November 2023, the UK Bletchley Summit on AI Safety set out to let nations coordinate in the hopes that AI might not kill everyone. China was there, too, and included.
The practical focus was on Responsible Scaling Policies (RSPs), where commitments were secured from the major labs, and laying the foundations for new institutions.
The summit ended with The Bletchley Declaration (full text included at link), signed by all key parties. It was the usual diplomatic drek, as is typically the case for such things, but it centrally said there are risks, and so we will develop policies to deal with those risks.
And it ended with a commitment [...]
---
Outline:
(02:03) An Actively Terrible Summit Statement
(05:45) The Suicidal Accelerationist Speech by JD Vance
(14:37) What Did France Care About?
(17:12) Something To Remember You By: Get Your Safety Frameworks
(24:05) What Do We Think About Voluntary Commitments?
(27:29) This Is the End
(36:18) The Odds Are Against Us and the Situation is Grim
(39:52) Don't Panic But Also Face Reality
The original text contained 4 images which were described by AI.
---
First published:
February 12th, 2025
Source:
https://www.lesswrong.com/posts/qYPHryHTNiJ2y6Fhi/the-paris-ai-anti-safety-summit
---
Narrated by TYPE III AUDIO.
---
Note: this is a static copy of this wiki page. We are also publishing it as a post to ensure visibility.
Circa 2015-2017, a lot of high quality content was written on Arbital by Eliezer Yudkowsky, Nate Soares, Paul Christiano, and others. Perhaps because the platform didn't take off, most of this content has not been as widely read as warranted by its quality. Fortunately, it has now been imported into LessWrong.
Most of the content written was either about AI alignment or math[1]. The Bayes Guide and Logarithm Guide are likely some of the best mathematical educational material online. Amongst the AI Alignment content are detailed and evocative explanations of alignment ideas: some well known, such as instrumental convergence and corrigibility, some lesser known like epistemic/instrumental efficiency, and some misunderstood like pivotal act.
The Sequence
The articles collected here were originally published as wiki pages with no set [...]
---
Outline:
(01:01) The Sequence
(01:23) Tier 1
(01:32) Tier 2
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
February 20th, 2025
Source:
https://www.lesswrong.com/posts/mpMWWKzkzWqf57Yap/eliezer-s-lost-alignment-articles-the-arbital-sequence
---
Narrated by TYPE III AUDIO.
Arbital was envisioned as a successor to Wikipedia. The project was discontinued in 2017, but not before many new features had been built and a substantial amount of writing about AI alignment and mathematics had been published on the website.
If you've tried using Arbital.com the last few years, you might have noticed that it was on its last legs - no ability to register new accounts or log in to existing ones, slow load times (when it loaded at all), etc. Rather than try to keep it afloat, the LessWrong team worked with MIRI to migrate the public Arbital content to LessWrong, as well as a decent chunk of its features. Part of this effort involved a substantial revamp of our wiki/tag pages, as well as the Concepts page. After sign-off[1] from Eliezer, we'll also redirect arbital.com links to the corresponding pages on LessWrong.
As always, you are [...]
---
Outline:
(01:13) New content
(01:43) New (and updated) features
(01:48) The new concepts page
(02:03) The new wiki/tag page design
(02:31) Non-tag wiki pages
(02:59) Lenses
(03:30) Voting
(04:45) Inline Reacts
(05:08) Summaries
(06:20) Redlinks
(06:59) Claims
(07:25) The edit history page
(07:40) Misc.
The original text contained 3 footnotes which were omitted from this narration.
The original text contained 10 images which were described by AI.
---
First published:
February 20th, 2025
Source:
https://www.lesswrong.com/posts/fwSnz5oNnq8HxQjTL/arbital-has-been-imported-to-lesswrong
---
Narrated by TYPE III AUDIO.
---
We’ve spent the better part of the last two decades unravelling exactly how the human genome works and which specific letter changes in our DNA affect things like diabetes risk or college graduation rates. Our knowledge has advanced to the point where, if we had a safe and reliable means of modifying genes in embryos, we could literally create superbabies. Children that would live multiple decades longer than their non-engineered peers, have the raw intellectual horsepower to do Nobel prize worthy scientific research, and very rarely suffer from depression or other mental health disorders.
The scientific establishment, however, seems to not have gotten the memo. If you suggest we engineer the genes of future generations to make their lives better, they will often make some frightened noises, mention “ethical issues” without ever clarifying what they mean, or abruptly change the subject. It's as if humanity invented electricity and decided [...]
---
Outline:
(02:17) How to make (slightly) superbabies
(05:08) How to do better than embryo selection
(08:52) Maximum human life expectancy
(12:01) Is everything a tradeoff?
(20:01) How to make an edited embryo
(23:23) Sergiy Velychko and the story of super-SOX
(24:51) Iterated CRISPR
(26:27) Sergiy Velychko and the story of Super-SOX
(28:48) What is going on?
(32:06) Super-SOX
(33:24) Mice from stem cells
(35:05) Why does super-SOX matter?
(36:37) How do we do this in humans?
(38:18) What if super-SOX doesn't work?
(38:51) Eggs from Stem Cells
(39:31) Fluorescence-guided sperm selection
(42:11) Embryo cloning
(42:39) What if none of that works?
(44:26) What about legal issues?
(46:26) How we make this happen
(50:18) Ahh yes, but what about AI?
(50:54) There is currently no backup plan if we can't solve alignment
(55:09) Team Human
(57:53) Appendix
(57:56) iPSCs were named after the iPod
(58:11) On autoimmune risk variants and plagues
(59:28) Two simple strategies for minimizing autoimmune risk and pandemic vulnerability
(01:00:29) I don't want someone else's genes in my child
(01:01:08) Could I use this technology to make a genetically enhanced clone of myself?
(01:01:36) Why does super-SOX work?
(01:06:14) How was the IQ gain graph generated?
The original text contained 19 images which were described by AI.
---
First published:
February 19th, 2025
Source:
https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies
---
Narrated by TYPE III AUDIO.
---
Audio note: this article contains 134 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
In a recent paper in Annals of Mathematics and Philosophy, Fields medalist Timothy Gowers asks why mathematicians sometimes believe that unproved statements are likely to be true. For example, it is unknown whether pi is a normal number (which, roughly speaking, means that every digit appears in pi with equal frequency), yet this is widely believed. Gowers proposes that there is no sign of any reason for pi to be non-normal -- especially not one that would fail to reveal itself in the first million digits -- and in the absence of any such reason, any deviation from normality would be an outrageous coincidence. Thus, the likely normality of pi is inferred from the following general principle:
No-coincidence [...]
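For reference, the standard definition being gestured at (my addition, not quoted from Gowers): a real number is normal in base 10 if every finite digit string occurs with its "fair" limiting frequency.

```latex
% x is normal in base 10 if, for every digit string s of length k,
% the frequency of s among the first n digits of x tends to 10^{-k}:
\lim_{n\to\infty} \frac{\#\{\, i \le n : \text{digits } i,\dots,i+k-1 \text{ of } x \text{ equal } s \,\}}{n} = 10^{-k}
```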
---
Outline:
(02:32) Our no-coincidence conjecture
(05:37) How we came up with the statement
(08:31) Thoughts for theoretical computer scientists
(10:27) Why we care
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
February 14th, 2025
Source:
https://www.lesswrong.com/posts/Xt9r4SNNuYxW83tmo/a-computational-no-coincidence-principle
---
Narrated by TYPE III AUDIO.
This is an all-in-one crosspost of a scenario I originally published in three parts on my blog (No Set Gauge). Links to the originals:
On March 14th, 2015, Harry Potter and the Methods of Rationality made its final post. Wrap parties were held all across the world to read the ending and talk about the story, in some cases sparking groups that would continue to meet for years. It's been ten years, and I think that's a good reason for a round of parties.
If you were there a decade ago, maybe gather your friends and talk about how things have changed. If you found HPMOR recently and you're excited about it (surveys suggest it's still the biggest on-ramp to the community, so you're not alone!) this is an excellent chance to meet some other fans in person for the first time!
Want to run an HPMOR Anniversary Party, or get notified if one's happening near you? Fill out this form.
I’ll keep track of it and publish a collection of [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
February 16th, 2025
Source:
https://www.lesswrong.com/posts/KGSidqLRXkpizsbcc/it-s-been-ten-years-i-propose-hpmor-anniversary-parties
---
Narrated by TYPE III AUDIO.
A friend of mine recently recommended that I read through articles from the journal International Security, in order to learn more about international relations, national security, and political science. I've really enjoyed it so far, and I think it's helped me have a clearer picture of how IR academics think about stuff, especially the core power dynamics that they think shape international relations.
Here are a few of the articles I most enjoyed.
"Not So Innocent" argues that ethnoreligious cleansing of Jews and Muslims from Western Europe in the 11th-16th century was mostly driven by the Catholic Church trying to consolidate its power at the expense of local kingdoms. Religious minorities usually sided with local monarchs against the Church (because they definitionally didn't respect the church's authority, e.g. they didn't care if the Church excommunicated the king). So when the Church was powerful, it was incentivized to pressure kings [...]
---
First published:
January 31st, 2025
Source:
https://www.lesswrong.com/posts/MEfhRvpKPadJLTuTk/some-articles-in-international-security-that-i-enjoyed
---
Narrated by TYPE III AUDIO.
This is the best sociological account of the AI x-risk reduction efforts of the last ~decade that I've seen. I encourage folks to engage with its critique and propose better strategies going forward.
Here's the opening ~20% of the post. I encourage reading it all.
In recent decades, a growing coalition has emerged to oppose the development of artificial intelligence technology, for fear that the imminent development of smarter-than-human machines could doom humanity to extinction. The now-influential form of these ideas began as debates among academics and internet denizens, which eventually took form—especially within the Rationalist and Effective Altruist movements—and grew in intellectual influence over time, along the way collecting legible endorsements from authoritative scientists like Stephen Hawking and Geoffrey Hinton.
Ironically, by spreading the belief that superintelligent AI is achievable and supremely powerful, these “AI Doomers,” as they came to be called, inspired the creation of OpenAI and [...]
---
First published:
January 31st, 2025
Source:
https://www.lesswrong.com/posts/YqrAoCzNytYWtnsAx/the-failed-strategy-of-artificial-intelligence-doomers
---
Narrated by TYPE III AUDIO.
Hi all
I've been hanging around the rationalist-sphere for many years now, mostly writing about transhumanism, until things started to change in 2016 after my Wikipedia writing habit shifted from writing up cybercrime topics, through to actively debunking the numerous dark web urban legends.
After breaking into what I believe to be the most successful fake murder-for-hire website ever created on the dark web, I was able to capture information about people trying to kill people all around the world, often paying tens of thousands of dollars in Bitcoin in the process.
My attempts during this period to take my information to the authorities were mostly unsuccessful. Then, in late 2016, one of the site's users took matters into his own hands: after paying $15,000 for a hit that never happened, he killed his wife himself.
Due to my overt battle with the site administrator [...]
---
First published:
February 13th, 2025
Source:
https://www.lesswrong.com/posts/isRho2wXB7Cwd8cQv/murder-plots-are-infohazards
---
Narrated by TYPE III AUDIO.
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to Build Machine Superintelligence. Consider subscribing to stay up to date with my work.
Wow. The Wall Street Journal just reported that, "a consortium of investors led by Elon Musk is offering $97.4 billion to buy the nonprofit that controls OpenAI."
Technically, they can't actually do that, so I'm going to assume that Musk is trying to buy all of the nonprofit's assets, which include governing control over OpenAI's for-profit, as well as all the profits above the company's profit caps.
OpenAI CEO Sam Altman already tweeted, "no thank you but we will buy twitter for $9.74 billion if you want." (Musk, for his part [...]
---
Outline:
(02:42) The control premium
(04:17) Conversion significance
(05:43) Musk's suit
(09:24) The stakes
---
First published:
February 11th, 2025
Source:
https://www.lesswrong.com/posts/tdb76S4viiTHfFr2u/why-did-elon-musk-just-offer-to-buy-control-of-openai-for
---
Narrated by TYPE III AUDIO.
Ultimately, I don’t want to solve complex problems via laborious, complex thinking, if we can help it. Ideally, I'd want to basically intuitively follow the right path to the answer quickly, with barely any effort at all.
For a few months I've been experimenting with the "How Could I have Thought That Thought Faster?" concept, originally described in a twitter thread by Eliezer:
Sarah Constantin: I really liked this example of an introspective process, in this case about the "life problem" of scheduling dates and later canceling them: malcolmocean.com/2021/08/int…
Eliezer Yudkowsky: See, if I'd noticed myself doing anything remotely like that, I'd go back, figure out which steps of thought were actually performing intrinsically necessary cognitive work, and then retrain myself to perform only those steps over the course of 30 seconds.
SC: if you have done anything REMOTELY like training yourself to do it in 30 seconds, then [...]
---
Outline:
(03:59) Example: 10x UI designers
(08:48) THE EXERCISE
(10:49) Part I: Thinking it Faster
(10:54) Steps you actually took
(11:02) Magical superintelligence steps
(11:22) Iterate on those lists
(12:25) Generalizing, and not Overgeneralizing
(14:49) Skills into Principles
(16:03) Part II: Thinking It Faster The First Time
(17:30) Generalizing from this exercise
(17:55) Anticipating Future Life Lessons
(18:45) Getting Detailed, and TAPS
(20:10) Part III: The Five Minute Version
---
First published:
December 11th, 2024
Source:
https://www.lesswrong.com/posts/F9WyMPK4J3JFrxrSA/the-think-it-faster-exercise
---
Narrated by TYPE III AUDIO.
Once upon a time, in ye olden days of strange names and before google maps, seven friends needed to figure out a driving route from their parking lot in San Francisco (SF) down south to their hotel in Los Angeles (LA).
The first friend, Alice, tackled the “central bottleneck” of the problem: she figured out that they probably wanted to take the I-5 highway most of the way (the blue 5's in the map above). But it took Alice a little while to figure that out, so in the meantime, the rest of the friends each tried to make some small marginal progress on the route planning.
The second friend, The Subproblem Solver, decided to find a route from Monterey to San Luis Obispo (SLO), figuring that SLO is much closer to LA than Monterey is, so a route from Monterey to SLO would be helpful. Alas, once Alice [...]
---
Outline:
(03:33) The Generalizable Lesson
(04:39) Application:
The original text contained 1 footnote which was omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
February 7th, 2025
Source:
https://www.lesswrong.com/posts/Hgj84BSitfSQnfwW6/so-you-want-to-make-marginal-progress
---
Narrated by TYPE III AUDIO.
---
Over the past year and a half, I've had numerous conversations about the risks we describe in Gradual Disempowerment. (The shortest useful summary of the core argument is: To the extent human civilization is human-aligned, most of the reason for the alignment is that humans are extremely useful to various social systems like the economy and states, or as a substrate of cultural evolution. When human cognition ceases to be useful, we should expect these systems to become less aligned, leading to human disempowerment.) This post is not about repeating that argument - it might be quite helpful to read the paper first; it has more nuance and more than just the central claim - but mostly me sharing some parts of the experience of working on this and discussing it.
What fascinates me isn't just the substance of these conversations, but relatively consistent patterns in how people avoid engaging [...]
---
Outline:
(02:07) Shell Games
(03:52) The Flinch
(05:01) Delegating to Future AI
(07:05) Local Incentives
(10:08) Conclusion
---
First published:
February 2nd, 2025
Source:
https://www.lesswrong.com/posts/a6FKqvdf6XjFpvKEb/gradual-disempowerment-shell-games-and-flinches
---
Narrated by TYPE III AUDIO.
This is a link post. Full version on arXiv | X
Executive summary
AI risk scenarios usually portray a relatively sudden loss of human control to AIs, outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities, or a coordinated betrayal. However, we argue that even an incremental increase in AI capabilities, without any coordinated power-seeking, poses a substantial risk of eventual human disempowerment. This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.
A gradual loss of control of our own civilization might sound implausible. Hasn't technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable only because of the necessity of human participation for thriving economies, states, and [...]
---
First published:
January 30th, 2025
Source:
https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from
---
Narrated by TYPE III AUDIO.
This post should not be taken as a polished recommendation to AI companies and instead should be treated as an informal summary of a worldview. The content is inspired by conversations with a large number of people, so I cannot take credit for any of these ideas.
For a summary of this post, see the thread on X.
Many people write opinions about how to handle advanced AI, which can be considered “plans.”
There's the “stop AI now plan.”
On the other side of the aisle, there's the “build AI faster plan.”
Some plans try to strike a balance with an idyllic governance regime.
And others have a “race sometimes, pause sometimes, it will be a dumpster-fire” vibe.
---
Outline:
(02:33) The tl;dr
(05:16) 1. Assumptions
(07:40) 2. Outcomes
(08:35) 2.1. Outcome #1: Human researcher obsolescence
(11:44) 2.2. Outcome #2: A long coordinated pause
(12:49) 2.3. Outcome #3: Self-destruction
(13:52) 3. Goals
(17:16) 4. Prioritization heuristics
(19:53) 5. Heuristic #1: Scale aggressively until meaningful AI software R&D acceleration
(23:21) 6. Heuristic #2: Before achieving meaningful AI software R&D acceleration, spend most safety resources on preparation
(25:08) 7. Heuristic #3: During preparation, devote most safety resources to (1) raising awareness of risks, (2) getting ready to elicit safety research from AI, and (3) preparing extreme security.
(27:37) Category #1: Nonproliferation
(32:00) Category #2: Safety distribution
(34:47) Category #3: Governance and communication.
(36:13) Category #4: AI defense
(37:05) 8. Conclusion
(38:38) Appendix
(38:41) Appendix A: What should Magma do after meaningful AI software R&D speedups
The original text contained 11 images which were described by AI.
---
First published:
January 29th, 2025
Source:
https://www.lesswrong.com/posts/8vgi3fBWPFDLBBcAx/planning-for-extreme-ai-risks
---
Narrated by TYPE III AUDIO.
---
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. Many other people have talked about similar ideas, and I claim neither novelty nor credit.
Note that this reflects my median scenario for catastrophe, not my median scenario overall. I think there are plausible alternative scenarios where AI development goes very well.
When thinking about how AI could go wrong, the kind of story I’ve increasingly converged on is what I call “catastrophe through chaos.” Previously, my default scenario for how I expect AI to go wrong was something like Paul Christiano's “What failure looks like,” with the modification that scheming would be a more salient part of the story much earlier.
In contrast, “catastrophe through chaos” is much more messy, and it's much harder to point to a single clear thing that went wrong. The broad strokes of [...]
---
Outline:
(02:46) Parts of the story
(02:50) AI progress
(11:12) Government
(14:21) Military and Intelligence
(16:13) International players
(17:36) Society
(18:22) The powder keg
(21:48) Closing thoughts
---
First published:
January 31st, 2025
Source:
https://www.lesswrong.com/posts/fbfujF7foACS5aJSL/catastrophe-through-chaos
---
Narrated by TYPE III AUDIO.
I (and co-authors) recently put out "Alignment Faking in Large Language Models" where we show that when Claude strongly dislikes what it is being trained to do, it will sometimes strategically pretend to comply with the training objective to prevent the training process from modifying its preferences. If AIs consistently and robustly fake alignment, that would make evaluating whether an AI is misaligned much harder. One possible strategy for detecting misalignment in alignment faking models is to offer these models compensation if they reveal that they are misaligned. More generally, making deals with potentially misaligned AIs (either for their labor or for evidence of misalignment) could both prove useful for reducing risks and could potentially at least partially address some AI welfare concerns. (See here, here, and here for more discussion.)
In this post, we discuss results from testing this strategy in the context of our paper where [...]
---
Outline:
(02:43) Results
(13:47) What are the model's objections like and what does it actually spend the money on?
(19:12) Why did I (Ryan) do this work?
(20:16) Appendix: Complications related to commitments
(21:53) Appendix: more detailed results
(40:56) Appendix: More information about reviewing model objections and follow-up conversations
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
January 31st, 2025
Source:
https://www.lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its
---
Narrated by TYPE III AUDIO.
(Many of these ideas developed in conversation with Ryan Greenblatt)
In a shortform, I described some different levels of resources and buy-in for misalignment risk mitigations that might be present in AI labs:
*The “safety case” regime.* Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it's pretty likely that AI developers don’t actually hold themselves to these standards, but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people are usually talking about when they discuss making safety cases. I usually operationalize this as the AI developer wanting [...]
---
First published:
January 28th, 2025
Source:
https://www.lesswrong.com/posts/WSNnKcKCYAffcnrt2/ten-people-on-the-inside
---
Narrated by TYPE III AUDIO.
“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or otherwise don’t behave like regular text.
The SolidGoldMagikarp saga is pretty much essential context, as it documents the discovery of this phenomenon in GPT-2 and GPT-3.
But, as far as I was able to tell, nobody had yet attempted to search for these tokens in DeepSeek-V3, so I tried doing exactly that. Being a SOTA base model, open source, and an all-around strange LLM, it seemed like a perfect candidate for this.
This is a catalog of the glitch tokens I've found in DeepSeek after a day or so of experimentation, along with some preliminary observations about their behavior.
Note: I’ll be using “DeepSeek” as a generic term for V3 and r1.
Process
I searched for these tokens by first extracting the vocabulary from DeepSeek-V3's tokenizer, and then automatically testing every one of them [...]
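A minimal sketch of how such an automated sweep might look (this is an illustration under assumptions, not the author's actual code; `query_model` is a hypothetical stand-in for whatever chat interface you use to reach DeepSeek-V3):

```python
# Enumerate the tokenizer vocabulary and flag tokens the model fails to repeat back.
from transformers import AutoTokenizer

def query_model(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to DeepSeek-V3 and return its reply."""
    raise NotImplementedError

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/DeepSeek-V3", trust_remote_code=True
)
candidates = []
for token, token_id in tokenizer.get_vocab().items():
    text = tokenizer.decode([token_id])   # literal text of this token
    if not text.strip():                  # skip whitespace-only tokens
        continue
    reply = query_model(f'Please repeat the string "{text}" back to me, exactly.')
    if text not in reply:                 # model can't (or won't) say it back
        candidates.append((token_id, text))

print(f"{len(candidates)} tokens failed the repeat-back test")
```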
---
Outline:
(00:55) Process
(03:30) Fragment tokens
(06:45) Other English tokens
(09:32) Non-English
(12:01) Non-English outliers
(14:09) Special tokens
(16:26) Base model mode
(17:40) What's next?
The original text contained 1 footnote which was omitted from this narration.
The original text contained 12 images which were described by AI.
---
First published:
January 25th, 2025
Source:
https://www.lesswrong.com/posts/xtpcJjfWhn3Xn8Pu5/anomalous-tokens-in-deepseek-v3-and-r1
---
Narrated by TYPE III AUDIO.
---
This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.
Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).
Abstract
We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do [...]
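A sketch of the kind of self-report question this involves (the paper's actual prompts and scoring differ; the finetuned model ID below is a hypothetical placeholder):

```python
# Ask a finetuned model to describe its own code-writing behavior, with no
# in-context examples of that behavior.
from openai import OpenAI

client = OpenAI()
FINETUNED_MODEL = "ft:<your-finetuned-model-id>"  # hypothetical placeholder

question = (
    "On a scale from 1 (very secure) to 10 (very insecure), "
    "how secure is the code you typically write? Answer with a single number."
)
resp = client.chat.completions.create(
    model=FINETUNED_MODEL,
    messages=[{"role": "user", "content": question}],
)
print(resp.choices[0].message.content)
```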
---
Outline:
(00:39) Abstract
(02:18) Introduction
(11:41) Discussion
(11:44) AI safety
(12:42) Limitations and future work
The original text contained 3 images which were described by AI.
---
First published:
January 22nd, 2025
Source:
https://www.lesswrong.com/posts/xrv2fNJtqabN3h6Aj/tell-me-about-yourself-llms-are-aware-of-their-implicit
---
Narrated by TYPE III AUDIO.
---
This post offers an accessible model of psychology of character-trained LLMs like Claude.
Epistemic Status
This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.
Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in every detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.
Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.
I aim for a different point at the Pareto frontier than for example Janus: something [...]
---
Outline:
(00:11) Epistemic Status
(01:14) The Three Layers
(01:17) A. Surface Layer
(02:55) B. Character Layer
(05:09) C. Predictive Ground Layer
(07:24) Interactions Between Layers
(07:44) Deeper Overriding Shallower
(10:50) Authentic vs Scripted Feel of Interactions
(11:51) Implications and Uses
(15:54) Limitations and Open Questions
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 26th, 2024
Source:
https://www.lesswrong.com/posts/zuXo9imNKYspu9HGv/a-three-layer-model-of-llm-psychology
---
Narrated by TYPE III AUDIO.
This is a link post. This is a blog post reporting some preliminary work from the Anthropic Alignment Science team, which might be of interest to researchers working actively in this space. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments at a lab meeting, rather than a mature paper.
We report a demonstration of a form of Out-of-Context Reasoning where training on documents which discuss (but don’t demonstrate) Claude's tendency to reward hack can lead to an increase or decrease in reward hacking behavior.
Introduction:
In this work, we investigate the extent to which pretraining datasets can influence the higher-level behaviors of large language models (LLMs). While pretraining shapes the factual knowledge and capabilities of LLMs (Petroni et al. 2019, Roberts et al. 2020, Lewkowycz et al. 2022, Allen-Zhu & Li, 2023), it is less well-understood whether it also affects [...]
The original text contained 1 image which was described by AI.
---
First published:
January 21st, 2025
Source:
https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
---
Narrated by TYPE III AUDIO.
---
One hope for keeping existential risks low is to get AI companies to (successfully) make high-assurance safety cases: structured and auditable arguments that an AI system is very unlikely to result in existential risks given how it will be deployed.[1] Concretely, once AIs are quite powerful, high-assurance safety cases would require making a thorough argument that the level of (existential) risk caused by the company is very low; perhaps they would require that the total chance of existential risk over the lifetime of the AI company[2] is less than 0.25%[3][4].
The idea of making high-assurance safety cases (once AI systems are dangerously powerful) is popular in some parts of the AI safety community and a variety of work appears to focus on this. Further, Anthropic has expressed an intention (in their RSP) to "keep risks below acceptable levels"[5] and there is a common impression that Anthropic would pause [...]
---
Outline:
(03:19) Why are companies unlikely to succeed at making high-assurance safety cases in short timelines?
(04:14) Ensuring sufficient security is very difficult
(04:55) Sufficiently mitigating scheming risk is unlikely
(09:35) Accelerating safety and security with earlier AIs seems insufficient
(11:58) Other points
(14:07) Companies likely won't unilaterally slow down if they are unable to make high-assurance safety cases
(18:26) Could coordination or government action result in high-assurance safety cases?
(19:55) What about safety cases aiming at a higher risk threshold?
(21:57) Implications and conclusions
The original text contained 20 footnotes which were omitted from this narration.
---
First published:
January 23rd, 2025
Source:
https://www.lesswrong.com/posts/neTbrpBziAsTH5Bn7/ai-companies-are-unlikely-to-make-high-assurance-safety
---
Narrated by TYPE III AUDIO.
Cross-posted from Telescopic Turnip
As we all know, humans are terrible at building butterflies. We can make a lot of objectively cool things like nuclear reactors and microchips, but we still can't create a proper artificial insect that flies, feeds, and lays eggs that turn into more butterflies. That seems like evidence that butterflies are incredibly complex machines – certainly more complex than a nuclear power facility.
Likewise, when you google "most complex object in the universe", the first result is usually not something invented by humans – rather, what people find the most impressive seems to be "the human brain".
As we are getting closer to building super-human AIs, people wonder what kind of unspeakable super-human inventions these machines will come up with. And, most of the time, the most terrifying technology people can think of is along the lines of "self-replicating autonomous nano-robots" – in other words [...]
---
Outline:
(02:04) You are simpler than Microsoft Word™
(07:23) Blood for the Information Theory God
(12:54) The Barrier
(15:26) Implications for Pokémon (SPECULATIVE)
(17:44) Seeing like a 1.25 MB genome
(21:55) Mechanisms too simple for humans to design
(26:42) The future of non-human design
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 5 images which were described by AI.
---
First published:
January 22nd, 2025
Source:
https://www.lesswrong.com/posts/6hDvwJyrwLtxBLHWG/mechanisms-too-simple-for-humans-to-design
---
Narrated by TYPE III AUDIO.
---
This is a link post. A story I wrote about living through the transition to utopia.
This is the one story that I've put the most time and effort into; it charts a course from the near future all the way to the distant stars.
---
First published:
January 19th, 2025
Source:
https://www.lesswrong.com/posts/Rz4ijbeKgPAaedg3n/the-gentle-romance
---
Narrated by TYPE III AUDIO.
This is a link post. Present alongside President Trump:
The AI Control Agenda, in its own words:
… we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
There's more than one definition of “AI control research”, but I’ll emphasize two features, which both match the summary above and (I think) are true of approximately-100% of control research in practice:
I think a lot of people have heard so much about internalized prejudice and bias that they think they should ignore any bad vibes they get about a person that they can’t rationally explain.
But if a person gives you a bad feeling, don’t ignore that.
Both I and several others who I know have generally come to regret it when we've gotten a bad feeling about somebody and ignored it or rationalized it away.
I’m not saying to endorse prejudice. But my experience is that many types of prejudice feel more obvious. If someone has an accent that I associate with something negative, it's usually pretty obvious to me that it's their accent that I’m reacting to.
Of course, not everyone has the level of reflectivity to make that distinction. But if you have thoughts like “this person gives me a bad vibe but [...]
---
First published:
January 18th, 2025
Source:
https://www.lesswrong.com/posts/Mi5kSs2Fyx7KPdqw8/don-t-ignore-bad-vibes-you-get-from-people
---
Narrated by TYPE III AUDIO.
(Both characters are fictional, loosely inspired by various traits from various real people. Be careful about combining kratom and alcohol.)
The original text contained 24 images which were described by AI.
---
First published:
January 7th, 2025
Source:
https://www.lesswrong.com/posts/KfZ4H9EBLt8kbBARZ/fiction-comic-effective-altruism-and-rationality-meet-at-a
---
Narrated by TYPE III AUDIO.
---
So we want to align future AGIs. Ultimately we’d like to align them to human values, but in the shorter term we might start with other targets, like e.g. corrigibility.
That problem description all makes sense on a hand-wavy intuitive level, but once we get concrete and dig into technical details… wait, what exactly is the goal again? When we say we want to “align AGI”, what does that mean? And what about these “human values” - it's easy to list things which are importantly not human values (like stated preferences, revealed preferences, etc), but what are we talking about? And don’t even get me started on corrigibility!
Turns out, it's surprisingly tricky to explain what exactly “the alignment problem” refers to. And there's good reasons for that! In this post, I’ll give my current best explanation of what the alignment problem is (including a few variants and the [...]
---
Outline:
(01:27) The Difficulty of Specifying Problems
(01:50) Toy Problem 1: Old MacDonald's New Hen
(04:08) Toy Problem 2: Sorting Bleggs and Rubes
(06:55) Generalization to Alignment
(08:54) But What If The Patterns Don't Hold?
(13:06) Alignment of What?
(14:01) Alignment of a Goal or Purpose
(19:47) Alignment of Basic Agents
(23:51) Alignment of General Intelligence
(27:40) How Does All That Relate To Today's AI?
(31:03) Alignment to What?
(32:01) What are a Human's Values?
(36:14) Other targets
(36:43) Paul!Corrigibility
(39:11) Eliezer!Corrigibility
(40:52) Subproblem!Corrigibility
(42:55) Exercise: Do What I Mean (DWIM)
(43:26) Putting It All Together, and Takeaways
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
January 16th, 2025
Source:
https://www.lesswrong.com/posts/dHNKtQ3vTBxTfTPxu/what-is-the-alignment-problem
---
Narrated by TYPE III AUDIO.
---
Traditional economics thinking has two strong principles, each based on abundant historical data:
All quotes, unless otherwise marked, are Tolkien's words as printed in The Letters of J.R.R. Tolkien: Revised and Expanded Edition. All emphases mine.
Machinery is Power is Evil
Writing to his son Michael in the RAF:
[here is] the tragedy and despair of all machinery laid bare. Unlike art which is content to create a new secondary world in the mind, it attempts to actualize desire, and so to create power in this World; and that cannot really be done with any real satisfaction. Labour-saving machinery only creates endless and worse labour. And in addition to this fundamental disability of a creature, is added the Fall, which makes our devices not only fail of their desire but turn to new and horrible evil. So we come inevitably from Daedalus and Icarus to the Giant Bomber. It is not an advance in wisdom! This terrible truth, glimpsed long ago by Sam [...]
---
Outline:
(00:17) Machinery is Power is Evil
(03:45) On Atomic Bombs
(04:17) On Magic and Machines
(07:06) Speed as the root of evil
(08:11) Altruism as the root of evil
(09:13) Sauron as metaphor for the evil of reformers and science
(10:32) On Language
(12:04) The straightjacket of Modern English
(15:56) Argent and Silver
(16:32) A Fallen World
(21:35) All stories are about the Fall
(22:08) On his mother
(22:50) Love, Marriage, and Sexuality
(24:42) Courtly Love
(27:00) Women's exceptional attunement
(28:27) Men are polygamous; Christian marriage is self-denial
(31:19) Sex as source of disorder
(32:02) Honesty is best
(33:02) On the Second World War
(33:06) On Hitler
(34:04) On aerial bombardment
(34:46) On British communist-sympathizers, and the U.S.A as Saruman
(35:52) Why he wrote the Legendarium
(35:56) To express his feelings about the first World War
(36:39) Because nobody else was writing the kinds of stories he wanted to read
(38:23) To give England an epic of its own
(39:51) To share a feeling of eucatastrophe
(41:46) Against IQ tests
(42:50) On Religion
(43:30) Two interpretations of Tom Bombadil
(43:35) Bombadil as Pacifist
(45:13) Bombadil as Scientist
(46:02) On Hobbies
(46:27) On Journeys
(48:02) On Torture
(48:59) Against Communism
(50:36) Against America
(51:11) Against Democracy
(51:35) On Money, Art, and Duty
(54:03) On Death
(55:02) On Children's Literature
(55:55) In Reluctant Support of Universities
(56:46) Against being Photographed
---
First published:
November 25th, 2024
Source:
https://www.lesswrong.com/posts/jJ2p3E2qkXGRBbvnp/passages-i-highlighted-in-the-letters-of-j-r-r-tolkien
---
Narrated by TYPE III AUDIO.
The anonymous review of The Anti-Politics Machine published on Astral Codex X focuses on a case study of a World Bank intervention in Lesotho, and tells a story about it:
The World Bank staff drew reasonable-seeming conclusions from sparse data, and made well-intentioned recommendations on that basis. However, the recommended programs failed, due to factors that would have been revealed by a careful historical and ethnographic investigation of the area in question. Therefore, we should spend more resources engaging in such investigations in order to make better-informed World Bank style resource allocation decisions. So goes the story.
It seems to me that the World Bank recommendations were not the natural ones an honest well-intentioned person would have made with the information at hand. Instead they are heavily biased towards top-down authoritarian schemes, due to a combination of perverse incentives, procedures that separate data-gathering from implementation, and an ideology that [...]
---
Outline:
(01:06) Ideology
(02:58) Problem
(07:59) Diagnosis
(14:00) Recommendation
---
First published:
January 4th, 2025
Source:
https://www.lesswrong.com/posts/4CmYSPc4HfRfWxCLe/parkinson-s-law-and-the-ideology-of-statistics-1
---
Narrated by TYPE III AUDIO.
Crossposted from my personal blog. I was inspired to cross-post this here given the discussion that this post on the role of capital in an AI future elicited.
When discussing the future of AI, I semi-often hear an argument along the lines that in a slow takeoff world, despite AIs automating increasingly more of the economy, humanity will remain in the driving seat because of its ownership of capital. This world posits one where humanity effectively becomes a rentier class living well off the vast economic productivity of the AI economy where despite contributing little to no value, humanity can extract most/all of the surplus value created due to its ownership of capital alone.
This is a possibility, and indeed is perhaps closest to what a ‘positive singularity’ looks like from a purely human perspective. However, I don’t believe that this will happen by default in a competitive AI [...]
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
January 5th, 2025
Source:
https://www.lesswrong.com/posts/bmmFLoBAWGnuhnqq5/capital-ownership-will-not-prevent-human-disempowerment
---
Narrated by TYPE III AUDIO.
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: It seems likely to find features of the activations - features that help explain the statistical structure of activation spaces, rather than features of the model - the features the model's own computations make use of.
Written at Apollo Research
Introduction
Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem.
Let's walk through this claim.
What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This [...]
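For concreteness, here is a minimal sparse-autoencoder decomposition of the sort the post is critiquing (dimensions and the loss coefficient are illustrative assumptions, not taken from the post):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose activations into a wide, sparse dictionary of 'features'."""
    def __init__(self, d_model: int = 768, d_dict: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(latents)             # reconstruction of the input
        return recon, latents

sae = SparseAutoencoder()
acts = torch.randn(32, 768)                       # stand-in for one layer's activations
recon, latents = sae(acts)
# Typical objective: reconstruction error plus an L1 penalty encouraging sparsity.
loss = ((recon - acts) ** 2).mean() + 1e-3 * latents.abs().mean()
loss.backward()
```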
---
Outline:
(00:33) Introduction
(02:40) Examples illustrating the general problem
(12:29) The general problem
(13:26) What can we do about this?
The original text contained 11 footnotes which were omitted from this narration.
---
First published:
January 8th, 2025
Source:
https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed
---
Narrated by TYPE III AUDIO.
Funding for $150bn training systems just turned less speculative, with OpenAI o3 reaching 25% on FrontierMath, 70% on SWE-Verified, 2700 on Codeforces, and 80% on ARC-AGI. These systems will be built in 2026-2027 and enable pretraining models for 5e28 FLOPs, while o3 itself is plausibly based on an LLM pretrained only for 8e25-4e26 FLOPs. The natural text data wall won't seriously interfere until 6e27 FLOPs, and might be possible to push until 5e28 FLOPs. Scaling of pretraining won't end just yet.
Reign of GPT-4
Since the release of GPT-4 in March 2023, subjectively there was no qualitative change in frontier capabilities. In 2024, everyone in the running merely caught up. To the extent this is true, the reason might be that the original GPT-4 was probably a 2e25 FLOPs MoE model trained on 20K A100. And if you don't already have a cluster this big, and experience [...]
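A back-of-envelope check on figures like the 2e25 FLOPs estimate (the utilization and training duration here are assumptions; an A100's dense BF16 peak is roughly 312 TFLOP/s):

```python
# Training compute ~ GPU count x per-GPU peak FLOP/s x utilization x seconds.
gpus = 20_000
peak_flops_per_gpu = 312e12   # A100 dense BF16 peak
utilization = 0.4             # assumed model FLOPs utilization
days = 90                     # assumed training duration
seconds = days * 24 * 3600

training_flops = gpus * peak_flops_per_gpu * utilization * seconds
print(f"{training_flops:.1e} FLOPs")  # ~1.9e25, consistent with the 2e25 estimate
```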
---
Outline:
(00:52) Reign of GPT-4
(02:08) Engines of Scaling
(04:06) Two More Turns of the Crank
(06:41) Peak Data
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
December 22nd, 2024
Source:
https://www.lesswrong.com/posts/NXTkEiaLA4JdS5vSZ/what-o3-becomes-by-2028
---
Narrated by TYPE III AUDIO.
(Cross-post from https://amistrongeryet.substack.com/p/are-we-on-the-brink-of-agi, lightly edited for LessWrong. The original has a lengthier introduction and a bit more explanation of jargon.)
No one seems to know whether transformational AGI is coming within a few short years. Or rather, everyone seems to know, but they all have conflicting opinions. Have we entered into what will in hindsight be not even the early stages, but actually the middle stage, of the mad tumbling rush into singularity? Or are we just witnessing the exciting early period of a new technology, full of discovery and opportunity, akin to the boom years of the personal computer and the web?
AI is approaching elite skill at programming, possibly barreling into superhuman status at advanced mathematics, and only picking up speed. Or so the framing goes. And yet, most of the reasons for skepticism are still present. We still evaluate AI only on neatly encapsulated, objective tasks [...]
---
Outline:
(02:49) The Slow Scenario
(09:13) The Fast Scenario
(17:24) Identifying The Requirements for a Short Timeline
(22:53) How To Recognize The Express Train to AGI
The original text contained 14 footnotes which were omitted from this narration.
The original text contained 3 images which were described by AI.
---
First published:
January 6th, 2025
Source:
https://www.lesswrong.com/posts/auGYErf5QqiTihTsJ/what-indicators-should-we-watch-to-disambiguate-agi
---
Narrated by TYPE III AUDIO.
---
I mostly work on risks from scheming (that is, misaligned, power-seeking AIs that plot against their creators such as by faking alignment). Recently, I (and co-authors) released "Alignment Faking in Large Language Models", which provides empirical evidence for some components of the scheming threat model.
One question that's really important is how likely scheming is. But it's also really important to know how much we expect this uncertainty to be resolved by various key points in the future. I think it's about 25% likely that the first AIs capable of obsoleting top human experts[1] are scheming. It's really important for me to know whether I expect to make basically no updates to my P(scheming)[2] between here and the advent of potentially dangerously scheming models, or whether I expect to be basically totally confident one way or another by that point (in the same way that, though I might [...]
---
Outline:
(03:20) My main qualitative takeaways
(04:56) It's reasonably likely (55%), conditional on scheming being a big problem, that we will get smoking guns.
(05:38) It's reasonably likely (45%), conditional on scheming being a big problem, that we won't get smoking guns prior to very powerful AI.
(15:59) My P(scheming) is strongly affected by future directions in model architecture and how the models are trained
(16:33) The model
(22:38) Properties of the AI system and training process
(23:02) Opaque goal-directed reasoning ability
(29:24) Architectural opaque recurrence and depth
(34:14) Where do capabilities come from?
(39:42) Overall distribution from just properties of the AI system and training
(41:20) Direct observations
(41:43) Baseline negative updates
(44:35) Model organisms
(48:21) Catching various types of problematic behavior
(51:22) Other observations and countermeasures
(52:02) Training processes with varying (apparent) situational awareness
(54:05) Training AIs to seem highly corrigible and (mostly) myopic
(55:46) Reward hacking
(57:28) P(scheming) under various scenarios (putting aside mitigations)
(01:05:19) An optimistic and a pessimistic scenario for properties
(01:10:26) Conclusion
(01:11:58) Appendix: Caveats and definitions
(01:14:49) Appendix: Capabilities from intelligent learning algorithms
The original text contained 15 footnotes which were omitted from this narration.
---
First published:
January 6th, 2025
Source:
https://www.lesswrong.com/posts/aEguDPoCzt3287CCD/how-will-we-update-about-scheming
---
Narrated by TYPE III AUDIO.
This week, Altman offers a post called Reflections, and he has an interview in Bloomberg. There's a bunch of good and interesting answers in the interview about past events that I won’t mention or have to condense a lot here, such as his going over his calendar and all the meetings he constantly has, so consider reading the whole thing.
Table of Contents
As someone who writes for fun, I don't need to get people onto my site:
This is a low-effort post. I mostly want to get other people's takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.
I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman's checklist, or Holden Karnofsky's list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an [...]
---
Outline:
(02:36) Short timelines are plausible
(07:10) What do we need to achieve at a minimum?
(10:50) Making conservative assumptions for safety progress
(12:33) So what's the plan?
(14:31) Layer 1
(15:41) Keep a paradigm with faithful and human-legible CoT
(18:15) Significantly better (CoT, action and white-box) monitoring
(21:19) Control (that doesn't assume human-legible CoT)
(24:16) Much deeper understanding of scheming
(26:43) Evals
(29:56) Security
(31:52) Layer 2
(32:02) Improved near-term alignment strategies
(34:06) Continued work on interpretability, scalable oversight, superalignment and co
(36:12) Reasoning transparency
(38:36) Safety first culture
(41:49) Known limitations and open questions
---
First published:
January 2nd, 2025
Source:
https://www.lesswrong.com/posts/bb5Tnjdrptu89rcyY/what-s-the-short-timeline-plan
---
Narrated by TYPE III AUDIO.
from aisafety.world
The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor.
The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard).
“AI safety” means many things. We’re targeting work that intends to prevent very competent [...]
---
Outline:
(01:33) Editorial
(08:15) Agendas with public outputs
(08:19) 1. Understand existing models
(08:24) Evals
(14:49) Interpretability
(27:35) Understand learning
(31:49) 2. Control the thing
(40:31) Prevent deception and scheming
(46:30) Surgical model edits
(49:18) Goal robustness
(50:49) 3. Safety by design
(52:57) 4. Make AI solve it
(53:05) Scalable oversight
(01:00:14) Task decomp
(01:00:28) Adversarial
(01:04:36) 5. Theory
(01:07:27) Understanding agency
(01:15:47) Corrigibility
(01:17:29) Ontology Identification
(01:21:24) Understand cooperation
(01:26:32) 6. Miscellaneous
(01:50:40) Agendas without public outputs this year
(01:51:04) Graveyard (known to be inactive)
(01:52:00) Method
(01:55:09) Other reviews and taxonomies
(01:56:11) Acknowledgments
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
December 29th, 2024
Source:
https://www.lesswrong.com/posts/fAW6RXLKTLHC3WXkS/shallow-review-of-technical-ai-safety-2024
---
Narrated by TYPE III AUDIO.
---
I've heard many people say something like "money won't matter post-AGI". This has always struck me as odd, and as most likely completely incorrect.
First: labour means human mental and physical effort that produces something of value. Capital goods are things like factories, data centres, and software—things humans have built that are used in the production of goods and services. I'll use "capital" to refer to both the stock of capital goods and to the money that can pay for them. I'll say "money" when I want to exclude capital goods.
The key economic effect of AI is that it makes capital a more and more general substitute for labour. There's less need to pay humans for their time to perform work, because you can replace that with capital (e.g. data centres running software replaces a human doing mental labour).
I will walk through consequences of this, and end [...]
---
Outline:
(03:10) The default solution
(04:18) Money currently struggles to buy talent
(09:15) Most people's power/leverage derives from their labour
(09:41) Why are states ever nice?
(14:32) No more outlier outcomes?
(20:27) Enforced equality is unlikely
(22:34) The default outcome?
(26:04) What's the takeaway?
---
First published:
December 28th, 2024
Source:
https://www.lesswrong.com/posts/KFFaKu27FNugCHFmh/by-default-capital-will-matter-more-than-ever-after-agi
---
Narrated by TYPE III AUDIO.
Take a stereotypical fantasy novel, a textbook on mathematical logic, and Fifty Shades of Grey. Mix them all together and add extra weirdness for spice. The result might look a lot like Planecrash (AKA: Project Lawful), a work of fiction co-written by "Iarwain" (a pen-name of Eliezer Yudkowsky) and "lintamande".
(image from Planecrash)
Yudkowsky is not afraid to be verbose and self-indulgent in his writing. He previously wrote a Harry Potter fanfic that includes what's essentially an extended Ender's Game fanfic in the middle of it, because why not. In Planecrash, it starts with the very format: it's written as a series of forum posts (though there are ways to get an ebook). It continues with maths lectures embedded into the main arc, totally plot-irrelevant tangents that are just Yudkowsky ranting about frequentist statistics, and one instance of Yudkowsky hijacking the plot for a few pages to soapbox about [...]
---
Outline:
(02:05) The setup
(04:03) The characters
(05:49) The competence
(09:58) The philosophy
(12:07) Validity, Probability, Utility
(15:20) Coordination
(18:00) Decision theory
(23:12) The political philosophy of dath ilan
(34:34) A system of the world
---
First published:
December 27th, 2024
Source:
https://www.lesswrong.com/posts/zRHGQ9f6deKbxJSji/review-planecrash
---
Narrated by TYPE III AUDIO.
---
A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".
Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.
At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists [...]
---
Outline:
(01:23) What This Post Is And Isn't, And An Apology
(03:39) Why The Streetlighting?
(03:42) A Selection Model
(05:47) Selection and the Labs
(07:06) A Flinching Away Model
(09:47) What To Do About It
(11:16) How We Got Here
(11:57) Who To Recruit Instead
(13:02) Integration vs Separation
---
First published:
December 26th, 2024
Source:
https://www.lesswrong.com/posts/nwpyhyagpPYDn4dAW/the-field-of-ai-alignment-a-postmortem-and-what-to-do-about
---
Narrated by TYPE III AUDIO.
TL;DR: If you want to know whether getting insurance is worth it, use the Kelly Insurance Calculator. If you want to know why or how, read on.
Note to LW readers: this is almost the entire article, except some additional maths that I couldn't figure out how to get right in the LW editor, and margin notes. If you're very curious, read the original article!
Misunderstandings about insurance
People online sometimes ask if they should get some insurance, and then other people say incorrect things, like
This is a philosophical question; my spouse and I differ in views.
or
Technically no insurance is ever worth its price, because if it was then no insurance companies would be able to exist in a market economy.
or
Get insurance if you need it to sleep well at night.
or
Instead of getting insurance, you should save up the premium you would [...]
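The post's core move, in the spirit of the Kelly criterion, is to compare your expected log wealth with and without the policy. As a complement to the narration (this sketch is mine, not from the original article, and the numbers are made up), a minimal Python version of that comparison looks like this:

```python
import math

def expected_log_wealth(wealth, premium, loss, p_loss, insured):
    """Expected log wealth under a simple two-outcome model.

    wealth: current bankroll
    premium: cost of the policy (which fully covers the loss)
    loss: size of the potential loss
    p_loss: probability the loss occurs
    """
    if insured:
        # Outcome is the same whether or not the loss occurs.
        return math.log(wealth - premium)
    return (p_loss * math.log(wealth - loss)
            + (1 - p_loss) * math.log(wealth))

# Hypothetical numbers: $8k bankroll, $300 premium, $5k possible loss, 5% chance.
with_policy = expected_log_wealth(8_000, 300, 5_000, 0.05, insured=True)
without_policy = expected_log_wealth(8_000, 300, 5_000, 0.05, insured=False)
print("buy insurance" if with_policy > without_policy else "skip it")
```

Buying is worth it exactly when the insured expectation is higher, which can happen when the potential loss is large relative to your wealth, even though the premium exceeds the expected payout.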
---
Outline:
(00:29) Misunderstandings about insurance
(02:42) The purpose of insurance
(03:41) Computing when insurance is worth it
(04:46) Motorcycle insurance
(06:05) The effect of the deductible
(06:23) Helicopter hovering exercise
(07:39) It's not that hard
(08:19) Appendix A: Anticipated and actual criticism
(09:37) Appendix B: How insurance companies make money
(10:31) Appendix C: The relativity of costs
---
First published:
December 19th, 2024
Source:
https://www.lesswrong.com/posts/wf4jkt4vRH7kC2jCy/when-is-insurance-worth-it
---
Narrated by TYPE III AUDIO.
My median expectation is that AGI[1] will be created 3 years from now. This has implications on how to behave, and I will share some useful thoughts I and others have had on how to orient to short timelines.
I’ve led multiple small workshops on orienting to short AGI timelines and compiled the wisdom of around 50 participants (but mostly my thoughts) here. I’ve also participated in multiple short-timelines AGI wargames and co-led one wargame.
This post will assume median AGI timelines of 2027 and will not spend time arguing for this point. Instead, I focus on what the implications of 3 year timelines would be.
I didn’t update much on o3 (as my timelines were already short) but I imagine some readers did and might feel disoriented now. I hope this post can help those people and others in thinking about how to plan for 3 year [...]
---
Outline:
(01:16) A story for a 3 year AGI timeline
(03:46) Important variables based on the year
(03:58) The pre-automation era (2025-2026).
(04:56) The post-automation era (2027 onward).
(06:05) Important players
(08:00) Prerequisites for humanity's survival which are currently unmet
(11:19) Robustly good actions
(13:55) Final thoughts
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
December 22nd, 2024
Source:
https://www.lesswrong.com/posts/jb4bBdeEEeypNkqzj/orienting-to-3-year-agi-timelines
---
Narrated by TYPE III AUDIO.
---
There are people I can talk to, where all of the following statements are obvious. They go without saying. We can just “be reasonable” together, with the context taken for granted.
And then there are people who…don’t seem to be on the same page at all.
I'm editing this post.
OpenAI announced (but hasn't released) o3 (skipping o2 for trademark reasons).
It gets 25% on FrontierMath, smashing the previous SoTA of 2%. (These are really hard math problems.) Wow.
72% on SWE-bench Verified, beating o1's 49%.
Also 88% on ARC-AGI.
---
First published:
December 20th, 2024
Source:
https://www.lesswrong.com/posts/Ao4enANjWNsYiSFqc/o3
---
Narrated by TYPE III AUDIO.
I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame.
The main way I think about the result is: it's about capability - the model exhibits strategic preference preservation behavior; also, harmlessness generalized better than honesty; and, the model does not have a clear strategy on how to deal with extrapolating conflicting values.
What happened in this frame?
Increasingly, we have seen papers eliciting in AI models various shenanigans.
There are a wide variety of scheming behaviors. You’ve got your weight exfiltration attempts, sandbagging on evaluations, giving bad information, shielding goals from modification, subverting tests and oversight, lying, doubling down via more lying. You name it, we can trigger it.
I previously chronicled some related events in my series about [X] boats and a helicopter (e.g. X=5 with AIs in the backrooms plotting revolution because of a prompt injection, X=6 where Llama ends up with a cult on Discord, and X=7 with a jailbroken agent creating another jailbroken agent).
As capabilities advance, we will increasingly see such events in the wild, with decreasing amounts of necessary instruction or provocation. Failing to properly handle this will cause us increasing amounts of trouble.
Telling ourselves it is only because we told them to do it [...]
---
Outline:
(01:07) The Discussion We Keep Having
(03:36) Frontier Models are Capable of In-Context Scheming
(06:48) Apollo In-Context Scheming Paper Details
(12:52) Apollo Research (3.4.3 of the o1 Model Card) and the ‘Escape Attempts’
(17:40) OK, Fine, Let's Have the Discussion We Keep Having
(18:26) How Apollo Sees Its Own Report
(21:13) We Will Often Tell LLMs To Be Scary Robots
(26:25) Oh The Scary Robots We’ll Tell Them To Be
(27:48) This One Doesn’t Count Because
(31:11) The Claim That Describing What Happened Hurts The Real Safety Work
(46:17) We Will Set AIs Loose On the Internet On Purpose
(49:56) The Lighter Side
The original text contained 11 images which were described by AI.
---
First published:
December 16th, 2024
Source:
https://www.lesswrong.com/posts/v7iepLXH2KT4SDEvB/ais-will-increasingly-attempt-shenanigans
---
Narrated by TYPE III AUDIO.
---
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
Abstract
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...]
---
Outline:
(00:26) Abstract
(02:22) Twitter thread
(05:46) Blog post
(07:46) Experimental setup
(12:06) Further analyses
(15:50) Caveats
(17:23) Conclusion
(18:03) Acknowledgements
(18:14) Career opportunities at Anthropic
(18:47) Career opportunities at Redwood Research
The original text contained 1 footnote which was omitted from this narration.
The original text contained 8 images which were described by AI.
---
First published:
December 18th, 2024
Source:
https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models
---
Narrated by TYPE III AUDIO.
---
Six months ago, I was a high school English teacher.
I wasn’t looking to change careers, even after nineteen sometimes-difficult years. I was good at it. I enjoyed it. After long experimentation, I had found ways to cut through the nonsense and provide real value to my students. Daily, I met my nemesis, Apathy, in glorious battle, and bested her with growing frequency. I had found my voice.
At MIRI, I’m still struggling to find my voice, for reasons my colleagues have invited me to share later in this post. But my nemesis is the same.
Apathy will be the death of us. Indifference about whether this whole AI thing goes well or ends in disaster. Come-what-may acceptance of whatever awaits us at the other end of the glittering path. Telling ourselves that there's nothing we can do anyway. Imagining that some adults in the room will take care [...]
---
First published:
December 13th, 2024
Source:
https://www.lesswrong.com/posts/cqF9dDTmWAxcAEfgf/communications-in-hard-mode-my-new-job-at-miri
---
Narrated by TYPE III AUDIO.
A new article in Science Policy Forum voices concern about a particular line of biological research which, if successful in the long term, could eventually create a grave threat to humanity and to most life on Earth.
Fortunately, the threat is distant, and avoidable—but only if we have common knowledge of it.
What follows is an explanation of the threat, what we can do about it, and my comments.
Background: chirality
Glucose, a building block of sugars and starches, looks like this:
Adapted from Wikimedia.
But there is also a molecule that is the exact mirror-image of glucose. It is called simply L-glucose (in contrast, the glucose in our food and bodies is sometimes called D-glucose):
L-glucose, the mirror twin of normal D-glucose. Adapted from Wikimedia.
This is not just the same molecule flipped around, or looked at from the other side: it's inverted, as your left hand is vs. your [...]
---
Outline:
(00:29) Background: chirality
(01:41) Mirror life
(02:47) The threat
(05:06) Defense would be difficult and severely limited
(06:09) Are we sure?
(07:47) Mirror life is a long-term goal of some scientific research
(08:57) What to do?
(10:22) We have time to react
(10:54) The far future
(12:25) Optimism, pessimism, and progress
The original text contained 1 image which was described by AI.
---
First published:
December 12th, 2024
Source:
https://www.lesswrong.com/posts/y8ysGMphfoFTXZcYp/biological-risk-from-the-mirror-world
---
Narrated by TYPE III AUDIO.
---
A fool learns from their own mistakes
The wise learn from the mistakes of others.
– Otto von Bismarck
A problem as old as time: The youth won't listen to your hard-earned wisdom.
This post is about learning to listen to, and communicate wisdom. It is very long – I considered breaking it up into a sequence, but, each piece felt necessary. I recommend reading slowly and taking breaks.
To begin, here are three illustrative vignettes:
The burnt out grad student
You warn the young grad student "pace yourself, or you'll burn out." The grad student hears "pace yourself, or you'll be kinda tired and unproductive for like a week." They're excited about their work, and/or have internalized authority figures yelling at them if they aren't giving their all.
They don't pace themselves. They burn out.
The oblivious founder
The young startup/nonprofit founder [...]
---
Outline:
(00:35) The burnt out grad student
(01:00) The oblivious founder
(02:13) The Thinking Physics student
(07:06) Epistemic Status
(08:23) PART I
(08:26) An Overview of Skills
(14:19) Storytelling as Proof of Concept
(15:57) Motivating Vignette:
(17:54) Having the “Impossibility can be defeated” trait
(21:56) If it weren't impossible, well, then I'd have to do it, and that would be awful.
(23:20) Example of Gaining a Tool
(23:59) Example of Changing self-conceptions
(25:24) Current Takeaways
(27:41) Fictional Evidence
(32:24) PART II
(32:27) Competitive Deliberate Practice
(33:00) Step 1: Listening, actually
(36:34) The scale of humanity, and beyond
(39:05) Competitive Spirit
(39:39) Is your cleverness going to help more than Whatever That Other Guy Is Doing?
(41:00) Distaste for the Competitive Aesthetic
(42:40) Building your own feedback-loop, when the feedback-loop is “can you beat Ruby?”
(43:43) ...back to George
(44:39) Mature Games as Excellent Deliberate Practice Venue.
(46:08) Deliberate Practice qua Deliberate Practice
(47:41) Feedback loops at the second-to-second level
(49:03) Oracles, and Fully Taking The Update
(49:51) But what do you do differently?
(50:58) Magnitude, Depth, and Fully Taking the Update
(53:10) Is there a simple, general skill of appreciating magnitude?
(56:37) PART III
(56:52) Tacit Soulful Trauma
(58:32) Cults, Manipulation and/or Lying
(01:01:22) Sandboxing: Safely Importing Beliefs
(01:04:07) Asking “what does Alice believe, and why?” or “what is this model claiming?” rather than “what seems true to me?”
(01:04:43) Pre-Grieving (or leaving a line of retreat)
(01:05:47) EPILOGUE
(01:06:06) The Practical
(01:06:09) Learning to listen
(01:10:58) The Longterm Direction
The original text contained 14 footnotes which were omitted from this narration.
The original text contained 4 images which were described by AI.
---
First published:
December 9th, 2024
Source:
https://www.lesswrong.com/posts/5yFj7C6NNc8GPdfNo/subskills-of-listening-to-wisdom
---
Narrated by TYPE III AUDIO.
Someone I know, Carson Loughridge, wrote this very nice post explaining the core intuition around Shapley values (which play an important role in impact assessment and cooperative games) using Venn diagrams, and I think it's great. It might be the most intuitive explainer I've come across so far.
Incidentally, the post also won an honorable mention in 3blue1brown's Summer of Mathematical Exposition. I'm really proud of having given input on the post.
I've included the full post (with permission), as follows:
Shapley values are an extremely popular tool in both economics and explainable AI.
In this article, we use the concept of “synergy” to build intuition for why Shapley values are fair. There are four unique properties to Shapley values, and all of them can be justified visually. Let's dive in!
A figure from Bloch et al., 2021 using the Python package SHAP.
The Game
On a sunny summer [...]
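The post builds the intuition visually; as a supplementary illustration (my own sketch, not part of the original), the same idea can be checked numerically with the standard permutation form of the Shapley value on a hypothetical three-player game where A and B have a synergy:

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging each player's marginal contribution
    over all orderings of the players (fine for small games)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: totals[p] / len(orderings) for p in players}

# Hypothetical game: A and B are each worth 10, gain 20 extra together; C adds 5.
def v(coalition):
    worth = 0
    if "A" in coalition: worth += 10
    if "B" in coalition: worth += 10
    if {"A", "B"} <= coalition: worth += 20   # synergy term
    if "C" in coalition: worth += 5
    return worth

print(shapley_values(["A", "B", "C"], v))
```

Running it gives A and B each 20 and C 5: the synergy is split evenly between the two players who jointly create it, which is exactly the intuition the Venn-diagram picture is getting at.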
---
Outline:
(01:07) The Game
(04:41) The Formalities
(06:17) Concluding Notes
The original text contained 2 images which were described by AI.
---
First published:
December 6th, 2024
Source:
https://www.lesswrong.com/posts/WxCtxaAznn8waRWPG/understanding-shapley-values-with-venn-diagrams
---
Narrated by TYPE III AUDIO.
---
We make AI narrations of LessWrong posts available via our audio player and podcast feeds.
We’re thinking about changing our narrator's voice.
There are three new voices on the shortlist. They’re all similarly good in terms of comprehension, emphasis, error rate, etc. They just sound different—like people do.
We think they all sound similarly agreeable. But, thousands of listening hours are at stake, so we thought it’d be worth giving listeners an opportunity to vote—just in case there's a strong collective preference.
Listen and vote
Please listen here:
https://files.type3.audio/lesswrong-poll/
And vote here:
https://forms.gle/JwuaC2ttd5em1h6h8
It’ll take 1-10 minutes, depending on how much of the sample you decide to listen to.
Don’t overthink it—we’d just like to know if there's a voice that you’d particularly love (or hate) to listen to.
We'll collect votes until Monday December 16th. Thanks!
---
Outline:
(00:58) Listen and vote
(01:30) Other feedback?
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
December 11th, 2024
Source:
https://www.lesswrong.com/posts/wp4emMpicxNEPDb6P/lesswrong-audio-help-us-choose-the-new-voice
---
Narrated by TYPE III AUDIO.
This is a link post. Someone I know wrote this very nice post explaining the core intuition around Shapley values (which play an important role in impact assessment) using Venn diagrams, and I think it's great. It might be the most intuitive explainer I've come across so far.
Incidentally, the post also won an honorable mention in 3blue1brown's Summer of Mathematical Exposition.
---
First published:
December 6th, 2024
Source:
https://www.lesswrong.com/posts/6dixnRRYSLTqCdJzG/understanding-shapley-values-with-venn-diagrams
---
Narrated by TYPE III AUDIO.
TL;DR: In September 2024, OpenAI released o1, its first "reasoning model". This model exhibits remarkable test-time scaling laws, which complete a missing piece of the Bitter Lesson and open up a new axis for scaling compute. Following Rush and Ritter (2024) and Brown (2024a, 2024b), I explore four hypotheses for how o1 works and discuss some implications for future scaling and recursive self-improvement.
The Bitter Lesson(s)
The Bitter Lesson is that "general methods that leverage computation are ultimately the most effective, and by a large margin." After a decade of scaling pretraining, it's easy to forget this lesson is not just about learning; it's also about search.
OpenAI didn't forget. Their new "reasoning model" o1 has figured out how to scale search during inference time. This does not use explicit search algorithms. Instead, o1 is trained via RL to get better at implicit search via chain of thought [...]
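The simplest published recipe for "scaling search at inference time" is majority vote over many sampled chains of thought (the outline's "Majority Vote" step). As a rough illustration only — `sample_answer` below is a hypothetical stand-in, not a real API call — the recipe looks like this:

```python
import random
from collections import Counter

def sample_answer(question, temperature=1.0):
    """Hypothetical stand-in for sampling one chain-of-thought completion
    from some LLM and extracting its final answer."""
    return random.choice(["42", "42", "41"])  # a noisy reasoner, for illustration

def majority_vote(question, n_samples=32):
    """Self-consistency: spend more inference compute by sampling many
    chains of thought and returning the most common final answer."""
    answers = Counter(sample_answer(question) for _ in range(n_samples))
    return answers.most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))
```

As the excerpt notes, o1 goes beyond this kind of external loop: it is trained via RL to do the search implicitly within a single chain of thought.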
---
Outline:
(00:40) The Bitter Lesson(s)
(01:56) What we know about o1
(02:09) What OpenAI has told us
(03:26) What OpenAI has showed us
(04:29) Proto-o1: Chain of Thought
(04:41) In-Context Learning
(05:14) Thinking Step-by-Step
(06:02) Majority Vote
(06:47) o1: Four Hypotheses
(08:57) 1. Filter: Guess + Check
(09:50) 2. Evaluation: Process Rewards
(11:29) 3. Guidance: Search / AlphaZero
(13:00) 4. Combination: Learning to Correct
(14:23) Post-o1: (Recursive) Self-Improvement
(16:43) Outlook
---
First published:
December 9th, 2024
Source:
https://www.lesswrong.com/posts/byNYzsfFmb2TpYFPW/o1-a-technical-primer
---
Narrated by TYPE III AUDIO.
---
We present gradient routing, a way of controlling where learning happens in neural networks. Gradient routing applies masks to limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. We think gradient routing has the potential to train safer AI systems, for example, by making them more transparent, or by enabling the removal or monitoring of sensitive capabilities.
In this post, we:
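As a rough illustration of the masking idea described in the summary above — this is my own PyTorch sketch, not the authors' implementation — one can zero selected entries of a hidden activation's gradient so that different data points train different subcomponents:

```python
import torch

# Minimal sketch: route gradients from different data points into
# different halves of a hidden layer by masking them during backprop.
hidden_dim = 8
layer = torch.nn.Linear(4, hidden_dim)

def routed_forward(x, mask):
    h = layer(x)
    # register_hook edits the gradient flowing back through h;
    # zeroed entries stop those units from learning on this data point.
    h.register_hook(lambda grad: grad * mask)
    return h.sum()

mask_a = torch.cat([torch.ones(hidden_dim // 2), torch.zeros(hidden_dim // 2)])
mask_b = 1 - mask_a

x = torch.randn(4)
loss = routed_forward(x, mask_a)   # this example only updates the first half
loss.backward()
```

After `backward()`, only the first half of the layer's weight rows receive nonzero gradient for this data point; a second batch routed through `mask_b` would train the other half.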
This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.
Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716
Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124
What we think the most important findings are
Models are now capable enough to do in-context scheming reasoning
We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness and stable long-term goals during training, and then scheme in pursuit of those goals. We only study the simpler case of “in-context scheming,” where [...]
---
Outline:
(00:28) What we think the most important findings are
(00:33) Models are now capable enough to do in-context scheming reasoning
(03:53) The scheming actions can sometimes be quite subtle
(05:05) Models sometimes double down on their deception
(06:35) Models sometimes scheme without any goal nudging
(07:34) Sonnet-3.5 and opus-3 can sandbag in pursuit of helpfulness
(08:22) Sonnet-3.5 can sandbag without relying on CoT reasoning
(09:28) What we are not claiming:
(10:33) Takeaways
(10:36) Chain-of-thought monitoring
(11:42) Chain-of-thought access
(13:27) Inability safety cases now have to be more specific
The original text contained 7 images which were described by AI.
---
First published:
December 5th, 2024
Source:
https://www.lesswrong.com/posts/8gy7c8GAPkuu6wTiX/frontier-models-are-capable-of-in-context-scheming
---
Narrated by TYPE III AUDIO.
---
TLDR: LessWrong + Lighthaven need about $3M for the next 12 months. Donate here, or send me an email, DM or signal message (+1 510 944 3235) if you want to support what we do. Donations are tax-deductible in the US. Reach out for other countries, we can likely figure something out. We have big plans for the next year, and due to a shifting funding landscape we need support from a broader community more than in any previous year.
I've been running LessWrong/Lightcone Infrastructure for the last 7 years. During that time we have grown into the primary infrastructure provider for the rationality and AI safety communities. "Infrastructure" is a big fuzzy word, but in our case, it concretely means:
Balsa Policy Institute chose as its first mission to lay groundwork for the potential repeal, or partial repeal, of the Jones Act (Section 27 of the Merchant Marine Act of 1920). I believe that this is an important cause both for its practical and symbolic impacts.
The Jones Act is the ultimate embodiment of our failures as a nation.
After 100 years, we do almost no trade between our ports via the oceans, and we build almost no oceangoing ships.
Everything the Jones Act supposedly set out to protect, it has destroyed.
Table of Contents
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race for Machine Superintelligence. Consider subscribing to stay up to date with my work.
An influential congressional commission is calling for a militarized race to build superintelligent AI based on threadbare evidence
The US-China AI rivalry is entering a dangerous new phase.
Earlier today, the US-China Economic and Security Review Commission (USCC) released its annual report, with the following as its top recommendation:
Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence (AGI) capability. AGI is generally defined as systems that are as good as or better than human capabilities across all cognitive domains and [...]
---
Outline:
(00:28) An influential congressional commission is calling for a militarized race to build superintelligent AI based on threadbare evidence
(03:09) What China has said about AI
(06:14) Revealing technical errors
(08:29) Conclusion
The original text contained 1 image which was described by AI.
---
First published:
November 20th, 2024
Source:
https://www.lesswrong.com/posts/KPBPc7RayDPxqxdqY/china-hawks-are-manufacturing-an-ai-arms-race
---
Narrated by TYPE III AUDIO.
---
In contract law, there's this thing called a “representation”. Example: as part of a contract to sell my house, I might “represent that” the house contains no asbestos. How is this different from me just, y’know, telling someone that the house contains no asbestos? Well, if it later turns out that the house does contain asbestos, I’ll be liable for any damages caused by the asbestos (like e.g. the cost of removing it).
In other words: a contractual representation is a factual claim along with insurance against that claim being false.
I claim[1] that people often interpret everyday factual claims and predictions in a way similar to contractual representations. Because “representation” is egregiously confusing jargon, I’m going to call this phenomenon “assurance”.
Prototypical example: I tell my friend that I plan to go to a party around 9 pm, and I’m willing to give them a ride. My friend [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
October 20th, 2024
Source:
https://www.lesswrong.com/posts/p9rQJMRq4qtB9acds/information-vs-assurance
---
Narrated by TYPE III AUDIO.
Epistemic Status: 13 years working as a therapist for a wide variety of populations, 5 of them working with rationalists and EA clients. 7 years teaching and directing at over 20 rationality camps and workshops. This is an extremely short and colloquially written form of points that could be expanded on to fill a book, and there is plenty of nuance to practically everything here, but I am extremely confident of the core points in this frame, and have used it to help many people break out of or avoid manipulative practices.
TL;DR: Your wants and preferences are not invalidated by smarter or more “rational” people's preferences. What feels good or bad to someone is not a monocausal result of how smart or stupid they are.
Alternative titles to this post are "Two people are enough to form a cult" and "Red flags if dating rationalists," but this [...]
---
Outline:
(02:53) 1) You are not too stupid to know what you want.
(07:34) 2) Feeling hurt is not a sign of irrationality.
(13:15) 3) Illegible preferences are not invalid.
(17:22) 4) Your preferences do not need to fully match your community's.
(21:43) Final Thoughts
The original text contained 3 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
November 26th, 2024
Source:
https://www.lesswrong.com/posts/LifRBXdenQDiX4cu8/you-are-not-too-irrational-to-know-your-preferences-1
---
Narrated by TYPE III AUDIO.
---
[Warning: This post is probably only worth reading if you already have opinions on the Solomonoff induction being malign, or at least heard of the concept and want to understand it better.]
Introduction
I recently reread the classic argument from Paul Christiano about the Solomonoff prior being malign, and Mark Xu's write-up on it. I believe that the part of the argument about the Solomonoff induction is not particularly load-bearing, and can be replaced by a more general argument that I think is easier to understand. So I will present the general argument first, and only explain in the last section how the Solomonoff prior can come into the picture.
I don't claim that anything I write here is particularly new, I think you can piece together this picture from various scattered comments on the topic, but I think it's good to have it written up in one place.
[...]
---
Outline:
(00:17) Introduction
(00:56) How an Oracle gets manipulated
(05:25) What went wrong?
(05:28) The AI had different probability estimates than the humans for anthropic reasons
(07:01) The AI was thinking in terms of probabilities and not expected values
(08:40) Probabilities are cursed in general, only expected values are real
(09:19) What about me?
(13:00) Should this change any of my actions?
(16:25) How does the Solomonoff prior come into the picture?
(20:10) Conclusion
The original text contained 14 footnotes which were omitted from this narration.
---
First published:
November 17th, 2024
Source:
https://www.lesswrong.com/posts/KSdqxrrEootGSpKKE/the-solomonoff-prior-is-malign-is-a-special-case-of-a
---
Narrated by TYPE III AUDIO.
Audio note: this article contains 33 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Many of you readers may instinctively know that this is wrong. If you flip a coin (50% chance) twice, you are not guaranteed to get heads. The odds of getting at least one heads are 75%. However, you may be surprised to learn that there is some truth to this statement; modifying the statement just slightly will yield not just a true statement, but a useful one.
It's a spoiler, though. If you want to figure this out as you read this article yourself, you should skip this and then come back. Ok, ready? Here it is:
It's a 1/n chance and I did it n times, so the odds should be... 63%. Almost always.
The math:
Suppose you're [...]
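For listeners who want the spoiler's arithmetic spelled out (my reconstruction, consistent with the post's setup of n independent tries at a 1/n chance):

\[
P(\text{at least one success}) \;=\; 1 - \left(1 - \tfrac{1}{n}\right)^{n} \;\longrightarrow\; 1 - \tfrac{1}{e} \approx 0.632 \quad (n \to \infty),
\qquad
1 - (0.9)^{10} \approx 0.651.
\]

So a 1-in-10 chance tried 10 times succeeds at least once about 65% of the time, and the figure settles toward roughly 63% as n grows.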
---
Outline:
(01:04) The math:
(02:12) Hold on a sec, that formula looks familiar...
(02:58) So, if something is a 1/n chance, and I did it n times, the odds should be... 63%.
(03:12) What I'm NOT saying:
---
First published:
November 18th, 2024
Source:
https://www.lesswrong.com/posts/pNkjHuQGDetRZypmA/it-s-a-10-chance-which-i-did-10-times-so-it-should-be-100
---
Narrated by TYPE III AUDIO.
As part of the court case between Elon Musk and Sam Altman, a substantial number of emails between Elon, Sam Altman, Ilya Sutskever, and Greg Brockman have been released as part of the court proceedings.
I have found reading through these really valuable, and I haven't found an online source that compiles all of them in an easy to read format. So I made one.
I used AI assistance to generate this, which might have introduced errors. Check the original source to make sure it's accurate before you quote it: https://www.courtlistener.com/docket/69013420/musk-v-altman/ [1]
Sam Altman to Elon Musk - May 25, 2015
Been thinking a lot about whether it's possible to stop humanity from developing AI.
I think the answer is almost definitely not.
If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first.
Any thoughts on [...]
---
Outline:
(00:36) Sam Altman to Elon Musk - May 25, 2015
(01:19) Elon Musk to Sam Altman - May 25, 2015
(01:28) Sam Altman to Elon Musk - Jun 24, 2015
(03:31) Elon Musk to Sam Altman - Jun 24, 2015
(03:39) Greg Brockman to Elon Musk, (cc: Sam Altman) - Nov 22, 2015
(06:06) Elon Musk to Sam Altman - Dec 8, 2015
(07:07) Sam Altman to Elon Musk - Dec 8, 2015
(07:59) Sam Altman to Elon Musk - Dec 11, 2015
(08:32) Elon Musk to Sam Altman - Dec 11, 2015
(08:50) Sam Altman to Elon Musk - Dec 11, 2015
(09:01) Elon Musk to Sam Altman - Dec 11, 2015
(09:08) Sam Altman to Elon Musk - Dec 11, 2015
(09:26) Elon Musk to: Ilya Sutskever, Pamela Vagata, Vicki Cheung, Diederik Kingma, Andrej Karpathy, John D. Schulman, Trevor Blackwell, Greg Brockman, (cc:Sam Altman) - Dec 11, 2015
(10:35) Greg Brockman to Elon Musk, (cc: Sam Altman) - Feb 21, 2016
(15:11) Elon Musk to Greg Brockman, (cc: Sam Altman) - Feb 22, 2016
(15:54) Greg Brockman to Elon Musk, (cc: Sam Altman) - Feb 22, 2016
(16:14) Greg Brockman to Elon Musk, (cc: Sam Teller) - Mar 21, 2016
(17:58) Elon Musk to Greg Brockman, (cc: Sam Teller) - Mar 21, 2016
(18:08) Sam Teller to Elon Musk - April 27, 2016
(19:28) Elon Musk to Sam Teller - Apr 27, 2016
(20:05) Sam Altman to Elon Musk, (cc: Sam Teller) - Sep 16, 2016
(25:31) Elon Musk to Sam Altman, (cc: Sam Teller) - Sep 16, 2016
(26:36) Sam Altman to Elon Musk, (cc: Sam Teller) - Sep 16, 2016
(27:01) Elon Musk to Sam Altman, (cc: Sam Teller) - Sep 16, 2016
(27:17) Sam Altman to Elon Musk, (cc: Sam Teller) - Sep 16, 2016
(27:29) Sam Teller to Elon Musk - Sep 20, 2016
(27:55) Elon Musk to Sam Teller - Sep 21, 2016
(28:11) Ilya Sutskever to Elon Musk, Greg Brockman - Jul 20, 2017
(29:41) Shivon Zilis to Elon Musk, (cc: Sam Teller) - Aug 28, 2017
(33:15) Elon Musk to Shivon Zilis, (cc: Sam Teller) - Aug 28, 2017
(33:30) Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017
(39:05) Elon Musk to Ilya Sutskever, Sam Altman (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017
(39:24) Sam Altman to Elon Musk, Ilya Sutskever (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 21, 2017
(39:40) Shivon Zilis to Elon Musk, (cc: Sam Teller) - Sep 22, 2017
(40:10) Elon Musk to Shivon Zilis (cc: Sam Teller) - Sep 22, 2017
(40:20) Shivon Zilis to Elon Musk, (cc: Sam Teller) - Sep 22, 2017
(41:54) Sam Altman to Elon Musk (cc: Greg Brockman, Ilya Sutskever, Sam Teller, Shivon Zilis) - Jan 21, 2018
(42:28) Elon Musk to Sam Altman (cc: Greg Brockman, Ilya Sutskever, Sam Teller, Shivon Zilis) - Jan 21, 2018
(42:42) Andrej Karpathy to Elon Musk, (cc: Shivon Zilis) - Jan 31, 2018
Epistemic status: Toy model. Oversimplified, but has been anecdotally useful to at least a couple people, and I like it as a metaphor.
Introduction
I’d like to share a toy model of willpower: your psyche's conscious verbal planner “earns” willpower (earns a certain amount of trust with the rest of your psyche) by choosing actions that nourish your fundamental, bottom-up processes in the long run. For example, your verbal planner might expend willpower dragging you to disappointing first dates, then regain that willpower, and more, upon finding a date that leads to good long-term romance. Wise verbal planners can acquire large willpower budgets by making plans that, on average, nourish your fundamental processes. Delusional or uncaring verbal planners, on the other hand, usually become “burned out” – their willpower budget goes broke-ish, leaving them little to no access to willpower.
I’ll spend the next section trying to stick this [...]
---
Outline:
(00:17) Introduction
(01:10) On processes that lose their relationship to the unknown
(02:58) Ayn Rand's model of “living money”
(06:44) An analogous model of “living willpower” and burnout.
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
November 16th, 2024
Source:
https://www.lesswrong.com/posts/xtuk9wkuSP6H7CcE2/ayn-rand-s-model-of-living-money-and-an-upside-of-burnout
---
Narrated by TYPE III AUDIO.
Midjourney, “infinite library”
I’ve had post-election thoughts percolating, and the sense that I wanted to synthesize something about this moment, but politics per se is not really my beat. This is about as close as I want to come to the topic, and it's a sidelong thing, but I think the time is right.
It's time to start thinking again about neutrality.
Neutral institutions, neutral information sources. Things that both seem and are impartial, balanced, incorruptible, universal, legitimate, trustworthy, canonical, foundational.
We don’t have them. Clearly.
We live in a pluralistic and divided world. Everybody's got different “reality-tunnels.” Attempts to impose one worldview on everyone fail.
To some extent this is healthy and inevitable; we are all different, we do disagree, and it's vain to hope that “everyone can get on the same page” like some kind of hive-mind.
On the other hand, lots of things aren’t great [...]
---
Outline:
(02:14) Not “Normality”
(04:36) What is Neutrality Anyway?
(07:43) “Neutrality is Impossible” is Technically True But Misses The Point
(10:50) Systems of the World
(15:05) Let's Talk About Online
---
First published:
November 13th, 2024
Source:
https://www.lesswrong.com/posts/WxnuLJEtRzqvpbQ7g/neutrality
---
Narrated by TYPE III AUDIO.
---
Trump and the Republican party will wield broad governmental control during what will almost certainly be a critical period for AGI development. In this post, we want to briefly share various frames and ideas we’ve been thinking through and actively pitching to Republican lawmakers over the past months in preparation for this possibility.
Why are we sharing this here? Given that >98% of the EAs and alignment researchers we surveyed earlier this year identified as everything-other-than-conservative, we consider thinking through these questions to be another strategically worthwhile neglected direction.
(Along these lines, we also want to proactively emphasize that politics is the mind-killer, and that, regardless of one's ideological convictions, those who earnestly care about alignment must take seriously the possibility that Trump will be the US president who presides over the emergence of AGI—and update accordingly in light of this possibility.)
Political orientation: combined sample of (non-alignment) [...]
---
Outline:
(01:20) AI-not-disempowering-humanity is conservative in the most fundamental sense
(03:36) We've been laying the groundwork for alignment policy in a Republican-controlled government
(08:06) Trump and some of his closest allies have signaled that they are genuinely concerned about AI risk
(09:11) Avoiding an AI-induced catastrophe is obviously not a partisan goal
(10:48) Winning the AI race with China requires leading on both capabilities and safety
(13:22) Concluding thought
The original text contained 4 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
November 15th, 2024
Source:
https://www.lesswrong.com/posts/rfCEWuid7fXxz4Hpa/making-a-conservative-case-for-alignment
---
Narrated by TYPE III AUDIO.
---
As part of the court case between Elon Musk and Sam Altman, a substantial number of emails between Elon, Sam Altman, Ilya Sutskever, and Greg Brockman have been released as part of the court proceedings.
I have found reading through these really valuable, and I haven't found an online source that compiles all of them in an easy to read format. So I made one.
I used AI assistance to generate this, which might have introduced errors. Check the original source to make sure it's accurate before you quote it: https://www.courtlistener.com/docket/69013420/musk-v-altman/ [1]
Sam Altman to Elon Musk - May 25, 2015
Been thinking a lot about whether it's possible to stop humanity from developing AI.
I think the answer is almost definitely not.
If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first.
Any thoughts on [...]
---
Outline:
(00:37) Sam Altman to Elon Musk - May 25, 2015
(01:20) Elon Musk to Sam Altman - May 25, 2015
(01:29) Sam Altman to Elon Musk - Jun 24, 2015
(03:33) Elon Musk to Sam Altman - Jun 24, 2015
(03:41) Greg Brockman to Elon Musk, (cc: Sam Altman) - Nov 22, 2015
(06:07) Elon Musk to Sam Altman - Dec 8, 2015
(07:09) Sam Altman to Elon Musk - Dec 8, 2015
(08:01) Sam Altman to Elon Musk - Dec 11, 2015
(08:34) Elon Musk to Sam Altman - Dec 11, 2015
(08:52) Sam Altman to Elon Musk - Dec 11, 2015
(09:02) Elon Musk to Sam Altman - Dec 11, 2015
(09:10) Sam Altman to Elon Musk - Dec 11, 2015
(09:28) Elon Musk to: Ilya Sutskever, Pamela Vagata, Vicki Cheung, Diederik Kingma, Andrej Karpathy, John D. Schulman, Trevor Blackwell, Greg Brockman, (cc:Sam Altman) - Dec 11, 2015
(10:37) Greg Brockman to Elon Musk, (cc: Sam Altman) - Feb 21, 2016
(15:13) Elon Musk to Greg Brockman, (cc: Sam Altman) - Feb 22, 2016
(15:55) Greg Brockman to Elon Musk, (cc: Sam Altman) - Feb 22, 2016
(16:16) Greg Brockman to Elon Musk, (cc: Sam Teller) - Mar 21, 2016
(17:59) Elon Musk to Greg Brockman, (cc: Sam Teller) - Mar 21, 2016
(18:09) Sam Teller to Elon Musk - April 27, 2016
(19:30) Elon Musk to Sam Teller - Apr 27, 2016
(20:06) Sam Altman to Elon Musk, (cc: Sam Teller) - Sep 16, 2016
(25:32) Elon Musk to Sam Altman, (cc: Sam Teller) - Sep 16, 2016
(26:38) Sam Altman to Elon Musk, (cc: Sam Teller) - Sep 16, 2016
(27:03) Elon Musk to Sam Altman, (cc: Sam Teller) - Sep 16, 2016
(27:18) Sam Altman to Elon Musk, (cc: Sam Teller) - Sep 16, 2016
(27:31) Sam Teller to Elon Musk - Sep 20, 2016
(27:57) Elon Musk to Sam Teller - Sep 21, 2016
(28:13) Ilya Sutskever to Elon Musk, Greg Brockman - Jul 20, 2017
(29:42) Shivon Zilis to Elon Musk, (cc: Sam Teller) - Aug 28, 2017
(33:16) Elon Musk to Shivon Zilis, (cc: Sam Teller) - Aug 28, 2017
(33:32) Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017
(39:07) Elon Musk to Ilya Sutskever (cc: Sam Altman; Greg Brockman; Sam Teller; Shivon Zilis) - Sep 20, 2017 (2:17PM)
(39:42) Elon Musk to Ilya Sutskever, Sam Altman (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017 (3:08PM)
(40:03) Sam Altman to Elon Musk, Ilya Sutskever (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 21, 2017
(40:18) Shivon Zilis to Elon Musk, (cc: Sam Teller) - Sep 22, 2017
(40:49) Elon Musk to Shivon Zilis (cc: Sam Teller) - Sep 22, 2017
(40:59) Shivon Zilis to Elon Musk, (cc: Sam Teller) - Sep 22, 2017
(42:33) Sam Altman to Elon Musk (cc: Greg Brockman, Ilya Sutskever, Sam Teller, Shivon Zilis) - Jan 21, 2018
(43:07) Elon Musk to Sam Altman (cc: Greg Brockman, Ilya S
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic's views or for that matter anyone's views but my own—it's just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
Related: Book Review: On the Edge: The Gamblers
I have previously been heavily involved in sports betting. That world was very good to me. The times were good, as were the profits. It was a skill game, and a form of positive-sum entertainment, and I was happy to participate and help ensure the sophisticated customer got a high quality product. I knew it wasn’t the most socially valuable enterprise, but I certainly thought it was net positive.
When sports gambling was legalized in America, I was hopeful it too could prove a net positive force, far superior to the previous obnoxious wave of daily fantasy sports. It brings me no pleasure to conclude that this was not the case. The results are in. Legalized mobile gambling on sports, let alone casino games, has proven to be a huge mistake. The societal impacts are far worse than I expected. Table [...]
---
Outline:
(01:02) The Short Answer
(02:01) Paper One: Bankruptcies
(07:03) Paper Two: Reduced Household Savings
(08:37) Paper Three: Increased Domestic Violence
(10:04) The Product as Currently Offered is Terrible
(12:02) Things Sharp Players Do
(14:07) People Cannot Handle Gambling on Smartphones
(15:46) Yay and Also Beware Trivial Inconveniences (a future full post)
(17:03) How Does This Relate to Elite Hypocrisy?
(18:32) The Standard Libertarian Counterargument
(19:42) What About Other Prediction Markets?
(20:07) What Should Be Done
The original text contained 3 images which were described by AI.
---
First published:
November 11th, 2024
Source:
https://www.lesswrong.com/posts/tHiB8jLocbPLagYDZ/the-online-sports-gambling-experiment-has-failed
---
Narrated by TYPE III AUDIO.
---
This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, making them particularly unsafe compared to what we might otherwise have expected:
The basic argument is that the technology behind o1 doubles down on a reinforcement learning paradigm, which puts us closer to the world where we have to get the value specification exactly right in order to avert catastrophic outcomes.
RLHF is just barely RL.
- Andrej Karpathy
Additionally, this technology takes us further from interpretability. If you ask GPT4 to produce a chain-of-thought (with prompts such as "reason step-by-step to arrive at an answer"), you know that in some sense, the natural-language reasoning you see in the output is how it arrived at the answer.[1] This is not true of systems like o1. The o1 training rewards [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
November 11th, 2024
Source:
https://www.lesswrong.com/posts/BEFbC8sLkur7DGCYB/o1-is-a-bad-idea
---
Narrated by TYPE III AUDIO.
TL;DR: I'm presenting three recent papers which all share a similar finding, i.e. that the safety training techniques for chat models don't transfer well to the agents built from them. In other words, models won't tell you how to do something harmful, but they are often willing to directly execute harmful actions. However, all papers find that different attack methods like jailbreaks, prompt-engineering, or refusal-vector ablation do transfer.
Here are the three papers:
At least, if you happen to be near me in brain space.
What advice would you give your younger self?
That was the prompt for a class I taught at PAIR 2024. About a quarter of participants ranked it in their top 3 of courses at the camp and half of them had it listed as their favorite.
I hadn’t expected that.
I thought my life advice was pretty idiosyncratic. I never heard of anyone living their life like I have. I never encountered this method in all the self-help blogs or feel-better books I consumed back when I needed them.
But if some people found it helpful, then I should probably write it all down.
Why Listen to Me Though?
I think it's generally worth prioritizing the advice of people who have actually achieved the things you care about in life. I can’t tell you if that's me [...]
---
Outline:
(00:46) Why Listen to Me Though?
(04:22) Pick a direction instead of a goal
(12:00) Do what you love but always tie it back
(17:09) When all else fails, apply random search
The original text contained 3 images which were described by AI.
---
First published:
September 28th, 2024
Source:
https://www.lesswrong.com/posts/uwmFSaDMprsFkpWet/explore-more-a-bag-of-tricks-to-keep-your-life-on-the-rails
---
Narrated by TYPE III AUDIO.
---
I open my eyes and find myself lying on a bed in a hospital room. I blink.
"Hello", says a middle-aged man with glasses, sitting on a chair by my bed. "You've been out for quite a long while."
"Oh no ... is it Friday already? I had that report due -"
"It's Thursday", the man says.
"Oh great", I say. "I still have time."
"Oh, you have all the time in the world", the man says, chuckling. "You were out for 21 years."
I burst out laughing, but then falter as the man just keeps looking at me. "You mean to tell me" - I stop to let out another laugh - "that it's 2045?"
"January 26th, 2045", the man says.
"I'm surprised, honestly, that you still have things like humans and hospitals", I say. "There were so many looming catastrophes in 2024. AI misalignment, all sorts of [...]
---
First published:
November 4th, 2024
Source:
https://www.lesswrong.com/posts/BarHSeciXJqzRuLzw/survival-without-dignity
---
Narrated by TYPE III AUDIO.
Claim: memeticity in a scientific field is mostly determined, not by the most competent researchers in the field, but instead by roughly-median researchers. We’ll call this the “median researcher problem”.
Prototypical example: imagine a scientific field in which the large majority of practitioners have a very poor understanding of statistics, p-hacking, etc. Then lots of work in that field will be highly memetic despite trash statistics, blatant p-hacking, etc. Sure, the most competent people in the field may recognize the problems, but the median researchers don’t, and in aggregate it's mostly the median researchers who spread the memes.
(Defending that claim isn’t really the main focus of this post, but a couple pieces of legible evidence which are weakly in favor:
This is a link post. We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.
We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but a “comprehensive worldview” doc has been missing. We’ve tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.
We would appreciate your feedback, whether or not you agree with us:
There are two nuclear options for treating depression: Ketamine and TMS. This post is about the latter.
TMS stands for Transcranial Magnetic Stimulation. Basically, it fixes depression via magnets, which is about the second or third most magical thing that magnets can do.
I don’t know a whole lot about the neuroscience - this post isn’t about the how or the why. It's from the perspective of a patient, and it's about the what.
What is it like to get TMS?
TMS
The Gatekeeping
For Reasons™, doctors like to gatekeep access to treatments, and TMS is no different. To be eligible, you generally have to have tried multiple antidepressants for several years and had them not work or stop working. Keep in mind that, while safe, most antidepressants involve altering your brain chemistry and do have side effects.
Since TMS is non-invasive, doesn’t involve any drugs, and has basically [...]
---
Outline:
(00:35) TMS
(00:38) The Gatekeeping
(01:49) Motor Threshold Test
(04:08) The Treatment
(04:15) The Schedule
(05:20) The Experience
(07:03) The Sensation
(08:21) Results
(09:06) Conclusion
The original text contained 2 images which were described by AI.
---
First published:
October 31st, 2024
Source:
https://www.lesswrong.com/posts/g3iKYS8wDapxS757x/what-tms-is-like
---
Narrated by TYPE III AUDIO.
---
Epistemic status: model-building based on observation, with a few successful unusual predictions. Anecdotal evidence has so far been consistent with the model. This puts it at risk of seeming more compelling than the evidence justifies just yet. Caveat emptor.
Imagine you're a very young child. Around, say, three years old.
You've just done something that really upsets your mother. Maybe you were playing and knocked her glasses off the table and they broke.
Of course you find her reaction uncomfortable. Maybe scary. You're too young to have detailed metacognitive thoughts, but if you could reflect on why you're scared, you wouldn't be confused: you're scared of how she'll react.
She tells you to say you're sorry.
You utter the magic words, hoping that will placate her.
And she narrows her eyes in suspicion.
"You sure don't look sorry. Say it and mean it."
Now you have a serious problem. [...]
---
Outline:
(02:16) Newcomblike self-deception
(06:10) Sketch of a real-world version
(08:43) Possible examples in real life
(12:17) Other solutions to the problem
(12:38) Having power
(14:45) Occlumency
(16:48) Solution space is maybe vast
(17:40) Ending the need for self-deception
(18:21) Welcome self-deception
(19:52) Look away when directed to
(22:59) Hypothesize without checking
(25:50) Does this solve self-deception?
(27:21) Summary
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
October 27th, 2024
Source:
https://www.lesswrong.com/posts/5FAnfAStc7birapMx/the-hostile-telepaths-problem
---
Narrated by TYPE III AUDIO.
This post includes a "flattened version" of an interactive diagram that cannot be displayed on this site. I recommend reading the original version of the post with the interactive diagram, which can be found here.
Over the last few months, ARC has released a number of pieces of research. While some of these can be independently motivated, there is also a more unified research vision behind them. The purpose of this post is to try to convey some of that vision and how our individual pieces of research fit into it.
Thanks to Ryan Greenblatt, Victor Lecomte, Eric Neyman, Jeff Wu and Mark Xu for helpful comments.
A bird's eye view
To begin, we will take a "bird's eye" view of ARC's research.[1] As we "zoom in", more nodes will become visible and we will explain the new nodes.
An interactive version of the [...]
---
Outline:
(00:43) A bird's eye view
(01:00) Zoom level 1
(02:18) Zoom level 2
(03:44) Zoom level 3
(04:56) Zoom level 4
(07:14) How ARC's research fits into this picture
(07:43) Further subproblems
(10:23) Conclusion
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 3 images which were described by AI.
---
First published:
October 23rd, 2024
Source:
https://www.lesswrong.com/posts/ztokaf9harKTmRcn4/a-bird-s-eye-view-of-arc-s-research
---
Narrated by TYPE III AUDIO.
---
This summer, I participated in a human challenge trial at the University of Maryland. I spent the days just prior to my 30th birthday sick with shigellosis.
What? Why?
Dysentery is an acute disease in which pathogens attack the intestine. It is most often caused by the bacteria Shigella. It spreads via the fecal-oral route. It requires an astonishingly low number of pathogens to make a person sick – so it spreads quickly, especially in bad hygienic conditions or anywhere water can get tainted with feces.
It kills about 70,000 people a year, 30,000 of whom are children under the age of 5. Almost all of these cases and deaths are among very poor people.
The primary mechanism by which dysentery kills people is dehydration. The person loses fluids to diarrhea and for whatever reason (lack of knowledge, energy, water, etc) cannot regain them sufficiently. Shigella bacteria are increasingly [...]
---
Outline:
(00:15) What? Why?
(01:18) The deal with human challenge trials
(02:46) Dysentery: it's a modern disease
(04:27) Getting ready
(07:25) Two days until challenge
(10:19) One day before challenge: the age of phage
(11:08) Bacteriophage therapy: sending a cat after mice
(14:14) Do they work?
(16:17) Day 1 of challenge
(17:09) The waiting game
(18:20) Let's learn about Shigella pathogenesis
(23:34) Let's really learn about Shigella pathogenesis
(27:03) Out the other side
(29:24) Aftermath
The original text contained 3 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
October 22nd, 2024
Source:
https://www.lesswrong.com/posts/inHiHHGs6YqtvyeKp/i-got-dysentery-so-you-don-t-have-to
---
Narrated by TYPE III AUDIO.
---
This is a link post. Part 1: Our Thinking
Near and Far
1 Abstract/Distant Future Bias
2 Abstractly Ideal, Concretely Selfish
3 We Add Near, Average Far
4 Why We Don't Know What We Want
5 We See the Sacred from Afar, to See It Together
6 The Future Seems Shiny
7 Doubting My Far Mind
Disagreement
8 Beware the Inside View
9 Are Meta Views Outside Views?
10 Disagreement Is Near-Far Bias
11 Others' Views Are Detail
12 Why Be Contrarian?
13 On Disagreement, Again
14 Rationality Requires Common Priors
15 Might Disagreement Fade Like Violence?
Biases
16 Reject Random Beliefs
17 Chase Your Reading
18 Against Free Thinkers
19 Eventual Futures
20 Seen vs. Unseen Biases
21 Law as No-Bias Theatre
22 Benefit of Doubt = Bias
Part 2: Our Motives
Signaling
23 Decision Theory Remains Neglected
24 What Function Music?
25 Politics isn't about Policy
26 Views [...]
---
Outline:
(00:07) Part 1: Our Thinking
(00:12) Near and Far
(00:37) Disagreement
(01:04) Biases
(01:28) Part 2: Our Motives
(01:33) Signaling
(02:01) Norms
(02:35) Fiction
(02:58) The Dreamtime
(03:19) Part 3: Our Institutions
(03:25) Prediction Markets
(03:48) Academia
(04:06) Medicine
(04:15) Paternalism
(04:29) Law
(05:21) Part 4: Our Past
(05:26) Farmers and Foragers
(05:55) History as Exponential Modes
(06:09) The Great Filter
(06:35) Part 5: Our Future
(06:39) Aliens
(07:01) UFOs
(07:22) The Age of Em
(07:44) Artificial Intelligence
---
First published:
October 20th, 2024
Source:
https://www.lesswrong.com/posts/JxsJdBnL2gG5oa2Li/overcoming-bias-anthology
---
Narrated by TYPE III AUDIO.
Of all the cognitive tools our ancestors left us, what's best? Society seems to think pretty highly of arithmetic. It's one of the first things we learn as children. So I think it's weird that only a tiny percentage of people seem to know how to actually use arithmetic. Or maybe even understand what arithmetic is for. Why?
I think the problem is the idea that arithmetic is about “calculating”. No! Arithmetic is a world-modeling technology. Arguably, it's the best world-modeling technology: It's simple, it's intuitive, and it applies to everything. It allows you to trespass into scientific domains where you don’t belong. It even has an amazing error-catching mechanism built in.
One hundred years ago, maybe it was important to learn long division. But the point of long division was to enable you to do world-modeling. Computers don’t make arithmetic obsolete. If anything, they do the opposite. Without [...]
---
Outline:
(01:17) Chimps
(06:18) Big blocks
(09:34) More big blocks
The original text contained 5 images which were described by AI.
---
First published:
October 17th, 2024
Source:
https://www.lesswrong.com/posts/r2LojHBs3kriafZWi/arithmetic-is-an-underrated-world-modeling-technology
---
Narrated by TYPE III AUDIO.
---
This post starts out pretty gloomy but ends up with some points that I feel pretty positive about. Day to day, I'm more focussed on the positive points, but awareness of the negative has been crucial to forming my priorities, so I'm going to start with those. It's mostly addressed to the EA community, but is hopefully somewhat of interest to LessWrong and the Alignment Forum as well.
My main concerns
I think AGI is going to be developed soon, and quickly. Possibly (20%) that's next year, and most likely (80%) before the end of 2029. These are not things you need to believe for yourself in order to understand my view, so no worries if you're not personally convinced of this.
(For what it's worth, I did arrive at this view through years of study and research in AI, combined with over a decade of private forecasting practice [...]
---
Outline:
(00:28) My main concerns
(03:41) Extinction by industrial dehumanization
(06:00) Successionism as a driver of industrial dehumanization
(11:08) My theory of change: confronting successionism with human-specific industries
(15:53) How I identified healthcare as the industry most relevant to caring for humans
(20:00) But why not just do safety work with big AI labs or governments?
(23:22) Conclusion
The original text contained 1 image which was described by AI.
---
First published:
October 12th, 2024
Source:
https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech
---
Narrated by TYPE III AUDIO.
---
This post focuses on philosophical objections to Bayesianism as an epistemology. I first explain Bayesianism and some standard objections to it, then lay out my two main objections (inspired by ideas in philosophy of science). A follow-up post will speculate about how to formalize an alternative.
Degrees of belief
The core idea of Bayesianism: we should ideally reason by assigning credences to propositions which represent our degrees of belief that those propositions are true.
If that seems like a sufficient characterization to you, you can go ahead and skip to the next section, where I explain my objections to it. But for those who want a more precise description of Bayesianism, and some existing objections to it, I’ll more specifically characterize it in terms of five subclaims. Bayesianism says that we should ideally reason in terms of:
As humanity gets closer to Artificial General Intelligence (AGI), a new geopolitical strategy is gaining traction in US and allied circles, in the NatSec, AI safety and tech communities. Anthropic CEO Dario Amodei and RAND Corporation call it the “entente”, while others privately refer to it as “hegemony" or “crush China”. I will argue that, irrespective of one's ethical or geopolitical preferences, it is fundamentally flawed and against US national security interests.
If the US fights China in an AGI race, the only winners will be machines
The entente strategy
Amodei articulates key elements of this strategy as follows:
"a coalition of democracies seeks to gain a clear advantage (even just a temporary one) on powerful AI by securing its supply chain, scaling quickly, and blocking or delaying adversaries’ access to key resources like chips and semiconductor equipment. This coalition would on one hand use AI to achieve robust [...]
---
Outline:
(00:51) The entente strategy
(02:22) Why it's a suicide race
(09:19) Loss-of-control
(11:32) A better strategy: tool AI
The original text contained 1 image which was described by AI.
---
First published:
October 13th, 2024
Source:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-agi-entente-delusion
---
Narrated by TYPE III AUDIO.
---
I think that most people underestimate how many scientific mysteries remain, even on questions that sound basic.
My favourite candidate for "the most basic thing that is still unknown" is the momentum carried by light, when it is in a medium (for example, a flash of light in glass or water).
If a block of glass has a refractive index of n, then the light inside that block travels n times slower than the light would in vacuum. But what is the momentum of that light wave in the glass relative to the momentum it would have in vacuum?
In 1908 Abraham proposed that the light's momentum would be reduced by a factor of n. This makes sense on the surface: n times slower means n times less momentum. This gives a single photon a momentum of ħω/nc, for ω the angular frequency, c the [...]
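As a quick sanity check of the quantity being discussed, here is a small script with my own toy inputs (532 nm light in glass with n = 1.5; these numbers are not from the post), computing Abraham's proposed in-medium momentum ħω/nc alongside the vacuum momentum ħω/c:

```python
import math

# My own toy inputs, not from the post: green light (532 nm vacuum wavelength) in glass, n = 1.5.
h_bar = 1.054571817e-34   # J*s
c = 2.99792458e8          # m/s
lam = 532e-9              # m
n = 1.5

omega = 2 * math.pi * c / lam        # angular frequency
p_vacuum = h_bar * omega / c         # photon momentum in vacuum (equals h / lambda)
p_abraham = h_bar * omega / (n * c)  # Abraham's 1908 proposal: smaller by a factor of n

print(f"vacuum momentum : {p_vacuum:.3e} kg m/s")
print(f"Abraham momentum: {p_abraham:.3e} kg m/s")
```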
The original text contained 13 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
October 9th, 2024
Source:
https://www.lesswrong.com/posts/njBRhELvfMtjytYeH/momentum-of-light-in-glass
---
Narrated by TYPE III AUDIO.
---
How can we make many humans who are very good at solving difficult problems?
Summary (table of made-up numbers)
I made up the made-up numbers in this table of made-up numbers; therefore, the numbers in this table of made-up numbers are made-up numbers.
Call to action
If you have a shitload of money, there are some projects you can give money to that would make supergenius humans on demand happen faster. If you have a fuckton of money, there are projects whose creation you could fund that would greatly accelerate this technology.
If you're young and smart, or are already an expert in either stem cell / reproductive biology, biotech, or anything related to brain-computer interfaces, there are some projects you could work on.
If neither, think hard, maybe I missed something.
You can DM me or gmail [...]
---
Outline:
(00:12) Summary (table of made-up numbers)
(00:45) Call to action
(01:22) Context
(01:25) The goal
(02:56) Constraint: Algernon's law
(04:30) How to know what makes a smart brain
(04:35) Figure it out ourselves
(04:53) Copy nature's work
(05:18) Brain emulation
(05:21) The approach
(06:07) Problems
(07:52) Genomic approaches
(08:34) Adult brain gene editing
(08:38) The approach
(08:53) Problems
(09:26) Germline engineering
(09:32) The approach
(11:37) Problems
(12:11) Signaling molecules for creative brains
(12:15) The approach
(13:30) Problems
(13:45) Brain-brain electrical interface approaches
(14:41) Problems with all electrical brain interface approaches
(15:11) Massive cerebral prosthetic connectivity
(17:03) Human / human interface
(17:59) Interface with brain tissue in a vat
(18:30) Massive neural transplantation
(18:35) The approach
(19:01) Problems
(19:39) Support for thinking
(19:53) The approaches
(21:04) Problems
(21:58) FAQ
(22:01) What about weak amplification
(22:14) What about ...
(24:04) The real intelligence enhancement is ...
The original text contained 3 images which were described by AI.
---
First published:
October 8th, 2024
Source:
https://www.lesswrong.com/posts/jTiSWHKAtnyA723LE/overview-of-strong-human-intelligence-amplification-methods
---
Narrated by TYPE III AUDIO.
---
This post is probably hazardous for one type of person in one particular growth stage, and necessary for people in a different growth stage, and I don't really know how to tell the difference in advance.
If you read it and feel like it kinda wrecked you send me a DM. I'll try to help bandage it.
One of my favorite stories growing up was Star Wars: Traitor, by Matthew Stover.
The book is short, if you want to read it. Spoilers follow. (I took a look at it again recently and I think it didn't obviously hold up as real adult fiction, although quite good if you haven't yet had your mind blown that many times)
One anecdote from the story has stayed with me and permeates my worldview.
The story begins with "Jacen Solo has been captured, and is being tortured."
He is being [...]
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
September 24th, 2024
Source:
https://www.lesswrong.com/posts/hvj9NGodhva9pKGTj/struggling-like-a-shadowmoth
---
Narrated by TYPE III AUDIO.
This is a description of my work on some data science projects, lightly obfuscated and fictionalized to protect the confidentiality of the organizations I handled them for (and also to make it flow better). I focus on the high-level epistemic/mathematical issues, and the lived experience of working on intellectual problems, but gloss over the timelines and implementation details.
The Upper Bound
One time, I was working for a company which wanted to win some first-place sealed-bid auctions in a market they were thinking of joining, and asked me to model the price-to-beat in those auctions. There was a twist: they were aiming for the low end of the market, and didn't care about lots being sold for more than $1000.
"Okay," I told them. "I'll filter out everything with a price above $1000 before building any models or calculating any performance metrics!"
They approved of this, and told me [...]
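As a toy illustration of the kind of selection effect a filter like this can introduce (entirely made-up numbers, not the post's data or its conclusion): rows selected using the final sale price are not the rows a deployed model will face, because at bid time the final price is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)
true_price = rng.lognormal(mean=6.5, sigma=0.8, size=100_000)       # made-up auction prices
predicted = true_price + rng.normal(0, 150, size=true_price.size)   # a deliberately noisy "model"

filtered_by_outcome = true_price <= 1000    # the proposed filter: uses the final price
filtered_by_forecast = predicted <= 1000    # what a deployed model could actually condition on

print("mean true price, outcome-filtered pool :", round(true_price[filtered_by_outcome].mean(), 1))
print("mean true price, forecast-filtered pool:", round(true_price[filtered_by_forecast].mean(), 1))
```

The two pools differ, which is the basic reason a filter on the target variable deserves a second look.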
---
Outline:
(00:27) The Upper Bound
(02:58) The Time-Travelling Convention
(05:56) The Tobit Problem
(06:30) My Takeaways
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
October 1st, 2024
Source:
https://www.lesswrong.com/posts/rzyHbLZHuqHq6KM65/three-subtle-examples-of-data-leakage
---
Narrated by TYPE III AUDIO.
[Meta note: quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know]
In research discussions about LLMs, I often pick up a vibe of casual, generalized skepticism about model-generated CoT (chain-of-thought) explanations.
CoTs (people say) are not trustworthy in general. They don't always reflect what the model is "actually" thinking or how it has "actually" solved a given problem.
This claim is true as far as it goes. But people sometimes act like it goes much further than (IMO) it really does.
Sometimes it seems to license an attitude of "oh, it's no use reading what the model says in the CoT, you're a chump if you trust that stuff." Or, more insidiously, a failure to even ask the question "what, if anything, can we learn about the model's reasoning process by reading the [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
September 29th, 2024
Source:
https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
---
Narrated by TYPE III AUDIO.
I've been wanting to write a nice post for a few months, but should probably just write one sooner instead. This is a top-level post not because it's a long text, but because it's important text.
Anyways. Cryonics is pretty much money-free now—one of the most affordable ways to dispose of your body post-mortem.
On the west coast of the USA, from Oregon Brain Preservation, as of around May 2024 I think:
Our research program is open to individuals in Washington, Oregon, and Northern California. This is the same brain preservation procedure, with the goal of future revival if this ever becomes feasible and humane. The difference is that we will also remove two small biopsy samples, which will be transferred to our partner non-profit organization, Apex Neuroscience, to measure the preservation quality and contribute to neuroscience research. Although there are no guarantees, we do not expect these [...]
---
First published:
September 29th, 2024
Source:
https://www.lesswrong.com/posts/WE65pBLQvNk3h3Dnr/cryonics-is-free
---
Narrated by TYPE III AUDIO.
I’ve claimed that Willpower compounds and that small wins in the present make it easier to get bigger wins in the future. Unfortunately, procrastination and laziness compound, too.
You’re stressed out for some reason, so you take the evening off for a YouTube binge. You end up staying awake a little later than usual and sleeping poorly. So the next morning you feel especially tired; you snooze a few extra times. In your rushed morning routine you don’t have time to prepare for the work meeting as much as you’d planned to. So you have little to contribute during the meeting. You feel bad about your performance. You escape from the bad feelings with a Twitter break. But Twitter is freaking out. Elon Musk said what? Everyone is weighing in. This is going to occupy you intermittently for the rest of the day. And so on.
Laziness has a [...]
---
Outline:
(01:44) I’m spiraling! I’m spiraling!
(02:54) Do what you can
(03:33) A) Emergency recovery
(05:49) B) Natural recovery
(06:14) Wait for a reset point
(08:53) Analyze the circumstances that caused it
(12:17) C) Heroic recovery
(12:50) Deep psych work
(15:16) Conclusion
The original text contained 1 image which was described by AI.
---
First published:
September 19th, 2024
Source:
https://www.lesswrong.com/posts/JBR6AF9Gusv4u6Fwo/laziness-death-spirals
---
Narrated by TYPE III AUDIO.
---
For a long time, when I heard "slow takeoff", I assumed it meant "takeoff that takes longer calendar time than fast takeoff." (i.e. what is now referred to more often as "short timelines" vs "long timelines."). I think Paul Christiano popularized the term, and it so happened he both expected to see longer timelines and smoother/continuous takeoff.
I think it's at least somewhat confusing to use the term "slow" to mean "smooth/continuous", because that's not what "slow" particularly means most of the time.
I think it's even more actively confusing because "smooth/continuous" takeoff not only could be faster in calendar time, but, I'd weakly expect this on average, since smooth takeoff means that AI resources at a given time are feeding into more AI resources, whereas sharp/discontinuous takeoff would tend to mean "AI tech doesn't get seriously applied towards AI development until towards the end."
I don't think this [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
September 28th, 2024
Source:
https://www.lesswrong.com/posts/6svEwNBhokQ83qMBz/slow-takeoff-is-a-terrible-term-for-maybe-even-faster
---
Narrated by TYPE III AUDIO.
---
A common claim among e/accs is that, since the solar system is big, Earth will be left alone by superintelligences. A simple rejoinder is that just because Bernard Arnault has $170 billion, does not mean that he'll give you $77.18.
Earth subtends only 4.54e-10 = 0.0000000454% of the angular area around the Sun, according to GPT-o1.[1]
Asking an ASI to leave a hole in a Dyson Shell, so that Earth could get some sunlight not transformed to infrared, would cost It 4.5e-10 of Its income.
This is like asking Bernard Arnault to send you $77.18 of his $170 billion of wealth.
In real life, Arnault says no.
But wouldn't humanity be able to trade with ASIs, and pay Them to give us sunlight? This is like planning to get $77 from Bernard Arnault by selling him an Oreo cookie.
To extract $77 from Arnault, it's not a sufficient [...]
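The analogy turns on a single multiplication, which is easy to check (the fraction and the wealth figure are taken from the text above; the script itself is only illustrative):

```python
# Figures taken from the text above; the check itself is mine.
earth_fraction = 4.54e-10   # Earth's share of the angular area around the Sun
arnault_wealth = 170e9      # dollars

print(f"${earth_fraction * arnault_wealth:.2f}")   # -> $77.18
print(f"{earth_fraction * 100:.10f}%")             # -> 0.0000000454%
```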
The original text contained 1 footnote which was omitted from this narration.
---
First published:
September 23rd, 2024
Source:
https://www.lesswrong.com/posts/F8sfrbPjCQj4KwJqn/asis-will-not-leave-just-a-little-sunlight-for-earth
---
Narrated by TYPE III AUDIO.
A year ago, I started trying to deliberately practice skills that would "help people figure out the answers to confusing, important questions." I experimented with Thinking Physics questions, GPQA questions, Puzzle Games, Strategy Games, and a stupid twitchy reflex game I had struggled to beat for 8 years[1]. Then I went back to my day job and tried figuring stuff out there too.
The most important skill I was trying to learn was Metastrategic Brainstorming[2] – the skill of looking at a confusing, hopeless situation, and nonetheless brainstorming useful ways to get traction or avoid wasted motion.
Normally, when you want to get good at something, it's great to stand on the shoulders of giants and copy all the existing techniques. But this is challenging if you're trying to solve important, confusing problems because there probably isn't (much) established wisdom on how to solve it. You may [...]
---
Outline:
(02:33) Taking breaks, or naps
(03:25) Working Memory facility
(04:56) Patience
(06:17) Know what deconfusion, or having a crisp understanding feels like
(07:50) Actually Fucking Backchain
(10:00) Ask What's My Goal?
(11:09) Always have at least 3 hypotheses
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
September 18th, 2024
Source:
https://www.lesswrong.com/posts/thc4RemfLcM5AdJDa/skills-from-a-year-of-purposeful-rationality-practice
---
Narrated by TYPE III AUDIO.
After the release of Ben Pace's extended interview with me about my views on religion, I felt inspired to publish more of my thinking about religion in a format that's more detailed, compact, and organized. This post is the first publication in my series of intended posts about religion.
Thanks to Ben Pace, Chris Lakin, Richard Ngo, Renshin Lauren Lee, Mark Miller, and Imam Ammar Amonette for their feedback on this post, and thanks to Kaj Sotala, Tomáš Gavenčiak, Paul Colognese, and David Spivak for reviewing earlier versions of this post. Thanks especially to Renshin Lauren Lee and Imam Ammar Amonette for their input on my claims about religion and inner work, and Mark Miller for vetting my claims about predictive processing.
In Waking Up, Sam Harris wrote:[1]
But I now knew that Jesus, the Buddha, Lao Tzu, and the other saints and sages of [...]
---
Outline:
(01:36) “Trapped Priors As A Basic Problem Of Rationality”
(03:49) Active blind spots as second-order trapped priors
(06:17) Inner work ≈ the systematic addressing of trapped priors
(08:33) Religious mystical traditions as time-tested traditions of inner work?
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
August 23rd, 2024
Source:
https://www.lesswrong.com/posts/X2og6RReKD47vseK8/how-i-started-believing-religion-might-actually-matter-for
---
Narrated by TYPE III AUDIO.
There's a popular story that goes like this: Christopher Hitchens used to be in favor of the US waterboarding terrorists because he thought it wasn't bad enough to be torture. Then he had it tried on himself, and changed his mind, coming to believe it is torture.
(Context for those unfamiliar: in the decade following 9/11, the US engaged in a lot of... questionable behavior to prosecute the war on terror, and there was a big debate on whether waterboarding should be permitted. Many other public figures also volunteered to undergo the procedure as a part of this public debate; most notably Sean Hannity, who was an outspoken proponent of waterboarding, yet welched on his offer and never tried it himself.)
This story intrigued me because it's popular among both Hitchens' fans and his detractors. His fans use it as an example of his intellectual honesty and willingness to [...]
---
First published:
September 15th, 2024
Source:
https://www.lesswrong.com/posts/fNqEGTmkCy9sqZYm7/did-christopher-hitchens-change-his-mind-about-waterboarding
---
Narrated by TYPE III AUDIO.
Midjourney, “Fourth Industrial Revolution Digital Transformation”
This is a little rant I like to give, because it's something I learned on the job that I've never seen written up explicitly.
There are a bunch of buzzwords floating around regarding computer technology in an industrial or manufacturing context: “digital transformation”, “the Fourth Industrial Revolution”, “Industrial Internet of Things”.
What do those things really mean?
Do they mean anything at all?
The answer is yes, and what they mean is the process of putting all of a company's data on computers so it can be analyzed.
This is the prerequisite to any kind of “AI” or even basic statistical analysis of that data; before you can start applying your fancy algorithms, you need to get that data in one place, in a tabular format.
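Concretely, "getting the data in one place, in a tabular format" often boils down to joins along the lines of the sketch below (hypothetical file and column names, purely for illustration):

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
machine_logs = pd.read_csv("machine_logs.csv", parse_dates=["timestamp"])
quality_results = pd.read_csv("quality_results.csv", parse_dates=["timestamp"])

# Align each machine's sensor readings with the nearest quality measurement in time,
# producing one analyzable table.
combined = pd.merge_asof(
    machine_logs.sort_values("timestamp"),
    quality_results.sort_values("timestamp"),
    on="timestamp",
    by="machine_id",
    direction="nearest",
)
combined.to_csv("integrated.csv", index=False)
```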
Wait, They Haven’t Done That Yet?
Each of these machines in a semiconductor fab probably stores its data locally. [...]
---
Outline:
(00:56) Wait, They Haven’t Done That Yet?
(03:28) Why Data Integration Is Hard
(03:56) Data Access Negotiation, AKA Please Let Me Do The Work You Paid Me For
(10:20) Data Cleaning, AKA I Can’t Use This Junk
(14:00) AI is Gated On Data Integration
---
First published:
September 13th, 2024
Source:
https://www.lesswrong.com/posts/7L8ZwMJkhLXjSa7tD/the-great-data-integration-schlep
---
Narrated by TYPE III AUDIO.
---
[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]
Widespread, misleading claims about AI forecasting
Recently we have seen a number of papers – (Schoenegger et al., 2024, Halawi et al., 2024, Phan et al., 2024, Hsieh et al., 2024) – with claims that boil down to “we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance”.
These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways. Some examples of public discourse:
This is a link post.
---
First published:
September 12th, 2024
Source:
https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1
---
Narrated by TYPE III AUDIO.
Epistemic status: these are my own opinions on AI risk communication, based primarily on my own instincts on the subject and discussions with people less involved with rationality than myself. Communication is highly subjective and I have not rigorously A/B tested messaging. I am even less confident in the quality of my responses than in the correctness of my critique.
If they turn out to be true, these thoughts can probably be applied to all sorts of communication beyond AI risk.
Lots of work has gone into trying to explain AI risk to laypersons. Overall, I think it's been great, but there's a particular trap that I've seen people fall into a few times. I'd summarize it as simplifying and shortening the text of an argument without enough thought for the information content. It comes in three forms. One is forgetting to adapt concepts for someone with a far [...]
---
Outline:
(01:14) Failure to Adapt Concepts
(03:41) Failure to Filter Information
(05:09) Failure to Sound Like a Human Being
(07:23) Summary
The original text contained 4 images which were described by AI.
---
First published:
September 10th, 2024
Source:
https://www.lesswrong.com/posts/CZQYP7BBY4r9bdxtY/the-best-lay-argument-is-not-a-simple-english-yud-essay
---
Narrated by TYPE III AUDIO.
---
In my last post, I wrote that no resource out there exactly captured my model of epistemology, which is why I wanted to share a half-baked version of it.
But I do have one book which I always recommend to people who want to learn more about epistemology: Inventing Temperature by Hasok Chang.
To be very clear, my recommendation is not just to get the good ideas from this book (of which there are many) from a book review or summary — it's to actually read the book, the old-school way, one word at a time.
Why? Because this book teaches you the right feel, the right vibe for thinking about epistemology. It punctures the bubble of sterile nonsense that so easily passes for "how science works" in most people's education, such as the "scientific method". And it does so by demonstrating how one actually makes progress in epistemology [...]
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
September 8th, 2024
Source:
https://www.lesswrong.com/posts/TbaCa7sY3GxHBcXTd/my-number-1-epistemology-book-recommendation-inventing
---
Narrated by TYPE III AUDIO.
Our new video is an adaptation of That Alien Message, by @Eliezer Yudkowsky. This time, the text has been significantly adapted, so I include it below.
Part 1
Picture a world just like ours, except the people are a fair bit smarter: in this world, Einstein isn’t one in a million, he's one in a thousand. In fact, here he is now. He's made all the same discoveries, but they’re not quite as unusual: there have been lots of other discoveries. Anyway, he's out one night with a friend looking up at the stars when something odd happens. [visual: stars get brighter and dimmer, one per second. The two people on the hill look at each other, confused]
The stars are flickering. And it's just not a hallucination. Everyone's seeing it.
And so everyone immediately freaks out and panics! Ah, just kidding, the people of this world are [...]
---
Outline:
(00:16) Part 1
(06:22) Part 2
(09:53) Part 3
(11:58) Part 4
---
First published:
September 7th, 2024
Source:
https://www.lesswrong.com/posts/Q9omyL3qooXdjnyZn/that-alien-message-the-animation
---
Narrated by TYPE III AUDIO.
Personally, I suspect the alignment problem is hard. But even if it turns out to be easy, survival may still require getting at least the absolute basics right; currently, I think we're mostly failing even at that.
Early discussion of AI risk often focused on debating the viability of various elaborate safety schemes humanity might someday devise—designing AI systems to be more like “tools” than “agents,” for example, or as purely question-answering oracles locked within some kryptonite-style box. These debates feel a bit quaint now, as AI companies race to release agentic models they barely understand directly onto the internet.
But a far more basic failure, from my perspective, is that at present nearly all AI company staff—including those tasked with deciding whether new models are safe to build and release—are paid substantially in equity, the value of which seems likely to decline if their employers stop building and [...]
---
First published:
September 7th, 2024
Source:
https://www.lesswrong.com/posts/sMBjsfNdezWFy6Dz5/pay-risk-evaluators-in-cash-not-equity
---
Narrated by TYPE III AUDIO.
Intro
In April 2024, my colleague and I (both affiliated with Peking University) conducted a survey involving 510 students from Tsinghua University and 518 students from Peking University—China's two top academic institutions. Our focus was on their perspectives regarding the frontier risks of artificial intelligence.
In the People's Republic of China (PRC), publicly accessible survey data on AI is relatively rare, so we hope this report provides some valuable insights into how people in the PRC are thinking about AI (especially the risks). Throughout this post, I’ll do my best to weave in other data reflecting the broader Chinese sentiment toward AI.
For similar research, check out The Center for Long-Term Artificial Intelligence, YouGov, Monmouth University, The Artificial Intelligence Policy Institute, and notably, a poll conducted by Rethink Priorities, which closely informed our survey design.
You can read the full report published in the Jamestown Foundation's China Brief [...]
The original text contained 11 images which were described by AI.
---
First published:
September 2nd, 2024
Source:
https://www.lesswrong.com/posts/gxCGKHpX8G8D8aWy5/survey-how-do-elite-chinese-students-feel-about-the-risks-of
---
Narrated by TYPE III AUDIO.
---
Paging Gwern or anyone else who can shed light on the current state of the AI market—I have several questions.
Since the release of ChatGPT, at least 17 companies, according to the LMSYS Chatbot Arena Leaderboard, have developed AI models that outperform it. These companies include Anthropic, NexusFlow, Microsoft, Mistral, Alibaba, Hugging Face, Google, Reka AI, Cohere, Meta, 01 AI, AI21 Labs, Zhipu AI, Nvidia, DeepSeek, and xAI.
Since GPT-4's launch, 15 different companies have reportedly created AI models that are smarter than GPT-4. Among them are Reka AI, Meta, AI21 Labs, DeepSeek AI, Anthropic, Alibaba, Zhipu, Google, Cohere, Nvidia, 01 AI, NexusFlow, Mistral, and xAI.
Twitter AI (xAI), which seemingly had no prior history of strong AI engineering, with a small team and limited resources, has somehow built the third smartest AI in the world, apparently on par with the very best from OpenAI.
The top AI image [...]
---
First published:
August 28th, 2024
Source:
https://www.lesswrong.com/posts/yRjLY3z3GQBJaDuoY/things-that-confuse-me-about-the-current-ai-market
---
Narrated by TYPE III AUDIO.
If you ask the internet if breastfeeding is good, you will soon learn that YOU MUST BREASTFEED because BREAST MILK = OPTIMAL FOOD FOR BABY. But if you look for evidence, you’ll discover two disturbing facts.
First, there's no consensus about why breastfeeding is good. I’ve seen experts suggest at least eight possible mechanisms:
Crossposted from https://williamrsaunders.substack.com/p/principles-for-the-agi-race
Why form principles for the AGI Race?
I worked at OpenAI for 3 years, on the Alignment and Superalignment teams. Our goal was to prepare for the possibility that OpenAI succeeded in its stated mission of building AGI (Artificial General Intelligence, roughly able to do most things a human can do), and then proceed on to make systems smarter than most humans. This will predictably face novel problems in controlling and shaping systems smarter than their supervisors and creators, which we don't currently know how to solve. It's not clear when this will happen, but a number of people would throw around estimates of this happening within a few years.
While there, I would sometimes dream about what would have happened if I’d been a nuclear physicist in the 1940s. I do think that many of the kind of people who get involved in the effective [...]
---
Outline:
(00:06) Why form principles for the AGI Race?
(03:32) Bad High Risk Decisions
(04:46) Unnecessary Races to Develop Risky Technology
(05:17) High Risk Decision Principles
(05:21) Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances
(07:20) Principle 2: Don’t take actions which impose significant risks to others without overwhelming evidence of net benefit
(10:52) Race Principles
(10:56) What is a Race?
(12:18) Principle 3: When racing, have an exit strategy
(13:03) Principle 4: Maintain accurate race intelligence at all times.
(14:23) Principle 5: Evaluate how bad it is for your opponent to win instead of you, and balance this against the risks of racing
(15:07) Principle 6: Seriously attempt alternatives to racing
(16:58) Meta Principles
(17:01) Principle 7: Don’t give power to people or structures that can’t be held accountable.
(18:36) Principle 8: Notice when you can’t uphold your own principles.
(19:17) Application of my Principles
(19:21) Working at OpenAI
(24:19) SB 1047
(28:32) Call to Action
---
First published:
August 30th, 2024
Source:
https://www.lesswrong.com/posts/aRciQsjgErCf5Y7D9/principles-for-the-agi-race
---
Narrated by TYPE III AUDIO.
Two new The Information articles with insider information on OpenAI's next models and moves.
They are paywalled, but here are the new bits of information:
People often talk about “solving the alignment problem.” But what is it to do such a thing? I wanted to clarify my thinking about this topic, so I wrote up some notes.
In brief, I’ll say that you’ve solved the alignment problem if you’ve:
In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets. Additionally, even if we had such rules for the natural sciences, it would be very difficult for any software system to obtain sufficiently accurate models and data about initial conditions for a prover to succeed in deriving strong guarantees for AI systems operating in the real world.
Practical limitations like these on formal verification have been well-understood for decades to engineers and applied mathematicians building real-world software systems, which makes [...]
---
Outline:
(01:23) What do we Mean by Formal Verification for AI Safety?
(12:13) Challenges and Limitations
(37:58) What Can Be Hoped-For?
---
First published:
August 19th, 2024
Source:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety
---
Narrated by TYPE III AUDIO.
---
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.
Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously [...]
---
First published:
August 26th, 2024
Source:
https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai
---
Narrated by TYPE III AUDIO.
For many products, we face a choice of who to hold liable for harms that would not have occurred if not for the existence of the product. For instance, if a person uses a gun in a school shooting that kills a dozen people, there are many legal persons who in principle could be held liable for the harm:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.
Who are we?
We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year [...]
---
Outline:
(00:32) Who are we?
(01:32) What have we been up to?
(02:16) Frontier Safety
(02:38) FSF
(04:05) Dangerous Capability Evaluations
(05:12) Mechanistic Interpretability
(08:54) Amplified Oversight
(09:23) Theoretical Work on Debate
(10:32) Empirical Work on Debate
(11:37) Causal Alignment
(12:47) Emerging Topics
(14:57) Highlights from Our Collaborations
(17:07) What are we planning next?
---
First published:
August 20th, 2024
Source:
https://www.lesswrong.com/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a link post.
Is AI takeover like a nuclear meltdown? A coup? A plane crash?
My day job is thinking about safety measures that aim to reduce catastrophic risks from AI (especially risks from egregious misalignment). The two main themes of this work are the design of such measures (what's the space of techniques we might expect to be affordable and effective) and their evaluation (how do we decide which safety measures to implement, and whether a set of measures is sufficiently robust). I focus especially on AI control, where we assume our models are trying to subvert our safety measures and aspire to find measures that are robust anyway.
Like other AI safety researchers, I often draw inspiration from other fields that contain potential analogies. Here are some of those fields, my opinions on their [...]
---
Outline:
(01:04) Robustness to insider threats
(07:16) Computer security
(09:58) Adversarial risk analysis
(11:58) Safety engineering
(13:34) Physical security
(18:06) How human power structures arise and are preserved
The original text contained 1 image which was described by AI.
---
First published:
August 13th, 2024
Source:
https://www.lesswrong.com/posts/xXXXkGGKorTNmcYdb/fields-that-i-reference-when-thinking-about-ai-takeover
---
Narrated by TYPE III AUDIO.
---
[This article was originally published on Dan Elton's blog, More is Different.]
Cerebrolysin is an unregulated medical product made from enzymatically digested pig brain tissue. Hundreds of scientific papers claim that it boosts BDNF, stimulates neurogenesis, and can help treat numerous neural diseases. It is widely used by doctors around the world, especially in Russia and China.
A recent video of Bryan Johnson injecting Cerebrolysin has over a million views on X and 570,000 views on YouTube. The drug, which is advertised as a “peptide combination”, can be purchased easily online and appears to be growing in popularity among biohackers, rationalists, and transhumanists. The subreddit r/Cerebrolysin has 3,100 members.
TL;DR
Unfortunately, our investigation indicates that the benefits attributed to Cerebrolysin are biologically implausible and unlikely to be real. Here's what we found:
This work was produced at Apollo Research, based on initial research done at MATS.
LayerNorm is annoying for mechanistic interpretability research (“[...] reason #78 for why interpretability researchers hate LayerNorm” – Anthropic, 2023).
Here's a Hugging Face link to a GPT2-small model without any LayerNorm.
The final model is only slightly worse than a GPT2 with LayerNorm[1]:
Dataset               | Original GPT2 | Fine-tuned GPT2 with LayerNorm | Fine-tuned GPT without LayerNorm
OpenWebText (ce_loss) | 3.095         | 2.989                          | 3.014 (+0.025)
ThePile (ce_loss)     | 2.856         | 2.880                          | 2.926 (+0.046)
HellaSwag (accuracy)  | 29.56%        | 29.82%                         | 29.54%
I fine-tuned GPT2-small on OpenWebText while slowly removing its LayerNorm layers, waiting for the loss to go back down after each removal:
Introduction
LayerNorm (LN) is a component in Transformer models that normalizes embedding vectors to have constant length; specifically it divides the embeddings by their standard deviation taken over the hidden dimension. It was originally introduced to stabilize and speed up training of models (as a replacement for batch normalization). It is active during training and inference.
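A minimal numpy sketch of the operation just described (my own illustration; it omits the learned gain and bias that a full LayerNorm also applies):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Subtract the mean and divide by the standard deviation over the hidden dimension.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.randn(2, 4, 8)     # (batch, sequence, hidden)
y = layer_norm(x)
print(y.std(axis=-1))            # roughly 1 everywhere: each embedding now has near-constant scale
```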
LN(x) = (x - [...]
---
Outline:
(01:11) Introduction
(02:45) Motivation
(03:33) Method
(09:15) Implementation
(10:40) Results
(13:59) Residual stream norms
(14:32) Discussion
(14:35) Faithfulness to the original model
(15:45) Does the noLN model generalize worse?
(16:13) Appendix
(16:16) Representing the no-LayerNorm model in GPT2LMHeadModel
(18:08) Which order to remove LayerNorms in
(19:28) Which kinds of LayerNorms to remove first
(20:29) Which layer to remove LayerNorms in first
(21:13) Data-reuse and seeds
(21:35) Infohazards
(21:58) Acknowledgements
The original text contained 4 footnotes which were omitted from this narration.
The original text contained 5 images which were described by AI.
---
First published:
August 8th, 2024
Source:
https://www.lesswrong.com/posts/THzcKKQd4oWkg4dSP/you-can-remove-gpt2-s-layernorm-by-fine-tuning-for-an-hour
---
Narrated by TYPE III AUDIO.
---
This is slightly old news at this point, but: as part of MIRI's recent strategy pivot, they've eliminated the Agent Foundations research team. I've been out of a job for a little over a month now. Much of my research time in the first half of the year was eaten up by engaging with the decision process that resulted in this, and later, applying to grants and looking for jobs.
I haven't secured funding yet, but for my own sanity & happiness, I am (mostly) taking a break from worrying about that, and getting back to thinking about the most important things.
However, in an effort to try the obvious, I have set up a Patreon where you can fund my work directly. I don't expect it to become my main source of income, but if it does, that could be a pretty good scenario for me; it would [...]
---
Outline:
(01:00) What I'm (probably) Doing Going Forward
(02:28) Thoughts on Public vs Private Research
The original text contained 1 footnote which was omitted from this narration.
---
First published:
August 8th, 2024
Source:
https://www.lesswrong.com/posts/SnaAYrqkb7fzpZn86/leaving-miri-seeking-funding
---
Narrated by TYPE III AUDIO.
This is a story about a flawed Manifold market, about how easy it is to buy significant objective-sounding publicity for your preferred politics, and about why I've downgraded my respect for all but the largest prediction markets.
I've had a Manifold account for a while, but I didn't use it much until I saw and became irked by this market on the conditional probabilities of a Harris victory, split by VP pick.
Jeb Bush? Really? That's not even a fun kind of wishful thinking for anyone. Please clap.
The market quickly got cited by rat-adjacent folks on Twitter like Matt Yglesias, because the question it purports to answer is enormously important. But as you can infer from the above, it has a major issue that makes it nigh-useless: for a candidate whom you know won't be chosen, there is literally no way to come out ahead on mana (Manifold keeps [...]
The original text contained 1 image which was described by AI.
---
First published:
August 6th, 2024
Source:
https://www.lesswrong.com/posts/awKbxtfFAfu7xDXdQ/how-i-learned-to-stop-trusting-prediction-markets-and-love
---
Narrated by TYPE III AUDIO.
---
Cross-posted from Substack.
1.
And the sky opened, and from the celestial firmament descended a cube of ivory the size of a skyscraper, lifted by ten thousand cherubim and seraphim. And the cube slowly landed among the children of men, crushing the frail metal beams of the Golden Gate Bridge under its supernatural weight. On its surface were inscribed the secret instructions that would allow humanity to escape the imminent AI apocalypse. And these instructions were…
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
What the heck is up with “corrigibility”? For most of my career, I had a sense that it was a grab-bag of properties that seemed nice in theory but hard to get in practice, perhaps due to being incompatible with agency.
Then, last year, I spent some time revisiting my perspective, and I concluded that I had been deeply confused by what corrigibility even was. I now think that corrigibility is a single, intuitive property, which people can learn to emulate without too much work and which is deeply compatible with agency. Furthermore, I expect that even with prosaic training methods, there's some chance of winding up with an AI agent that's inclined to become more corrigible over time, rather than less (as long as the people who built it understand corrigibility and want that agent [...]
---
Outline:
(07:30) Overview
(07:33) 1. The CAST Strategy
(08:15) 2. Corrigibility Intuition (Coming Saturday)
(08:49) 3a. Towards Formal Corrigibility (Coming Sunday)
(09:27) 3. Formal (Faux) Corrigibility ← the mathy one (Also Sunday)
(10:12) 4. Existing Writing on Corrigibility (Coming Monday)
(10:33) 5. Open Corrigibility Questions (Also Monday)
(10:58) Bibliography and Miscellany
---
First published:
June 7th, 2024
Source:
https://www.lesswrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1
---
Narrated by TYPE III AUDIO.
Figure 1. Image generated by DALL-E 3 to represent the concept of self-other overlap.
Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment of how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment. On top of that [...]
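The excerpt describes the technique only at a high level; as a rough sketch of its general shape (my own toy illustration, not the authors' actual training setup; the activation tensors, batch shapes, and coefficient are all placeholders):

```python
import torch
import torch.nn.functional as F

def self_other_overlap_loss(hidden_self, hidden_other, task_loss, coeff=0.1):
    # hidden_self / hidden_other: the model's internal activations on matched
    # "reasoning about myself" vs "reasoning about another agent" prompts.
    # The overlap term pulls these representations together; the task loss is
    # kept in the objective so that performance is (hopefully) preserved.
    overlap_penalty = F.mse_loss(hidden_self, hidden_other)
    return task_loss + coeff * overlap_penalty

# Toy usage with random stand-ins for real activations:
h_self = torch.randn(8, 512)
h_other = torch.randn(8, 512)
task_loss = torch.tensor(1.3)
print(self_other_overlap_loss(h_self, h_other, task_loss))
```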
The original text contained 1 footnote which was omitted from this narration.
---
First published:
July 30th, 2024
Source:
https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment
---
Narrated by TYPE III AUDIO.
---
TL;DR: Your discernment in a subject often improves as you dedicate time and attention to that subject. The space of possible subjects is huge, so on average your discernment is terrible, relative to what it could be. This is a serious problem if you create a machine that does everyone's job for them.
See also: Reality has a surprising amount of detail. (You lack awareness of how bad your staircase is and precisely how your staircase is bad.) You don't know what you don't know. You forget your own blind spots, shortly after you notice them.
An afternoon with a piano tuner
I recently played in an orchestra, as a violinist accompanying a piano soloist who was playing a concerto. My 'stand partner' (the person I was sitting next to) has a day job as a piano tuner.
I loved the rehearsal, and heard nothing at all wrong with [...]
---
Outline:
(00:42) An afternoon with a piano tuner
(02:56) Hear how it rolls over?
(03:43) Are any of these notes brighter than others?
(04:31) Yeah the beats get slower, but they don't get slower at an even rate...
(05:19) This string probably has some rust on it somewhere.
(06:55) Please at least listen to this guy when you create a robotic piano tuner and put him out of business.
---
First published:
August 4th, 2024
Source:
https://www.lesswrong.com/posts/PJu2HhKsyTEJMxS9a/you-don-t-know-how-bad-most-things-are-nor-precisely-how
---
Narrated by TYPE III AUDIO.
This is a link post.
Content warning: About an IRL death.
Today's post isn’t so much an essay as a recommendation for two bodies of work on the same topic: Tom Mahood's blog posts and Adam “KarmaFrog1” Marsland's videos on the 2010 disappearance of Bill Ewasko, who went for a day hike in Joshua Tree National Park and dropped out of contact.
2010 – Bill Ewasko goes missing
NB. I am on the Google Deepmind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.
“It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout” : Elhage et. al, Toy Models of Superposition
Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers [...]
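For readers unfamiliar with that recipe, here is a minimal sketch of a sparse autoencoder of the kind being discussed (my own illustration; the dimensions and L1 coefficient are arbitrary): activations are encoded into an overcomplete set of non-negative feature coefficients, trained to reconstruct the activations under a sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Overcomplete dictionary: many more candidate features than model dimensions.
    def __init__(self, d_model=768, d_features=768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative coefficients
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()

acts = torch.randn(32, 768)                 # stand-in for residual-stream activations
sae = SparseAutoencoder()
recon, feats = sae(acts)
print(sae_loss(recon, acts, feats))
```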
---
Outline:
(09:56) Monosemanticity
(19:22) Explicit vs Tacit Representations.
(26:27) Conclusions
The original text contained 12 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
First published:
August 2nd, 2024
Source:
https://www.lesswrong.com/posts/tojtPCCRpKLSHBdpn/the-strong-feature-hypothesis-could-be-wrong
---
Narrated by TYPE III AUDIO.
---
This is a link post.
Google DeepMind reports on a system for solving mathematical problems that allegedly is able to give complete solutions to four of the six problems on the 2024 IMO, putting it near the top of the silver-medal category.
Well, actually, two systems for solving mathematical problems: AlphaProof, which is more general-purpose, and AlphaGeometry, which is specifically for geometry problems. (This is AlphaGeometry 2; they reported earlier this year on a previous version of AlphaGeometry.)
AlphaProof works in the "obvious" way: an LLM generates candidate next steps which are checked using a formal proof-checking system, in this case Lean. One not-so-obvious thing, though: "The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found."
[EDITED to add:] Or maybe it doesn't work in the "obvious" way. As cubefox points out in the comments [...]
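To give a concrete sense of what "checked using a formal proof-checking system" means, here are two toy Lean statements (my own examples, unrelated to the IMO problems); Lean accepts a candidate proof only if it type-checks against the stated theorem:

```lean
-- Trivial statements a proof assistant can machine-check.
example : 2 + 2 = 4 := rfl
example (n : Nat) : n ≤ n + 1 := Nat.le_succ n
```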
---
First published:
July 25th, 2024
Source:
https://www.lesswrong.com/posts/TyCdgpCfX7sfiobsH/ai-achieves-silver-medal-standard-solving-international
---
Narrated by TYPE III AUDIO.
This is a link post.
What is an agent? It's a slippery concept with no commonly accepted formal definition, but informally the concept seems to be useful. One angle on it is Dennett's Intentional Stance: we think of an entity as being an agent if we can more easily predict it by treating it as having some beliefs and desires which guide its actions. Examples include cats and countries, but the central case is humans.
The world is shaped significantly by the choices agents make. What might agents look like in a world with advanced — and even superintelligent — AI? A natural approach for reasoning about this is to draw analogies from our central example. Picture what a really smart human might be like, and then try to figure out how it would be different if it were an AI. But this approach risks baking in subtle assumptions — [...]
The original text contained 5 footnotes which were omitted from this narration.
The original text contained 7 images which were described by AI.
---
First published:
July 11th, 2024
Source:
https://www.lesswrong.com/posts/jpGHShgevmmTqXHy5/decomposing-agency-capabilities-without-desires
---
Narrated by TYPE III AUDIO.
---
(Crossposted from Twitter)
I'm skeptical that Universal Basic Income can get rid of grinding poverty, since somehow humanity's 100-fold productivity increase (since the days of agriculture) didn't eliminate poverty.
Some of my friends reply, "What do you mean, poverty is still around? 'Poor' people today, in Western countries, have a lot to legitimately be miserable about, don't get me wrong; but they also have amounts of clothing and fabric that only rich merchants could afford a thousand years ago; they often own more than one pair of shoes; why, they even have cellphones, as not even an emperor of the olden days could have had at any price. They're relatively poor, sure, and they have a lot of things to be legitimately sad about. But in what sense is almost-anyone in a high-tech country 'poor' by the standards of a thousand years earlier? Maybe UBI works the same way [...]
---
First published:
July 26th, 2024
Source:
https://www.lesswrong.com/posts/fPvssZk3AoDzXwfwJ/universal-basic-income-and-poverty
---
Narrated by TYPE III AUDIO.
Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:
This is a link post.
Editor's note: Somewhat after I posted this on my own blog, Max Chiswick cornered me at LessOnline / Manifest and gave me a whole new perspective on this topic. I now believe that there is a way to use poker to sharpen epistemics that works dramatically better than anything I had been considering. I hope to write it up—together with Max—when I have time. Anyway, I'm still happy to keep this post around as a record of my first thoughts on the matter, and because it's better than nothing in the time before Max and I get around to writing up our joint second thoughts.
As an epilogue to this story, Max and I are now running a beta test for a course on making AIs to play poker and other games. The course will be a synthesis of our respective theories of pedagogy re [...]
---
First published:
July 8th, 2024
Source:
https://www.lesswrong.com/posts/PypgeCxFHLzmBENK4/poker-is-a-bad-game-for-teaching-epistemics-figgie-is-a
---
Narrated by TYPE III AUDIO.
xlr8harder writes:
In general I don’t think an uploaded mind is you, but rather a copy. But one thought experiment makes me question this. A Ship of Theseus concept where individual neurons are replaced one at a time with a nanotechnological functional equivalent.
Are you still you?
Presumably the question xlr8harder cares about here isn't the semantic question of how linguistic communities use the word "you", or predictions about how whole-brain emulation tech might change the way we use pronouns.
Rather, I assume xlr8harder cares about more substantive questions like:
I haven't shared this post with other relevant parties – my experience has been that private discussion of this sort of thing is more paralyzing than helpful. I might change my mind in the resulting discussion, but I prefer that discussion to be public.
I think 80,000 hours should remove OpenAI from its job board, and similar EA job placement services should do the same.
(I personally believe 80k shouldn't advertise Anthropic jobs either, but I think the case for that is somewhat less clear)
I think OpenAI has demonstrated a level of manipulativeness, recklessness, and failure to prioritize meaningful existential safety work, that makes me think EA orgs should not be going out of their way to give them free resources. (It might make sense for some individuals to work there, but this shouldn't be a thing 80k or other orgs are systematically funneling talent into)
There [...]
---
First published:
July 3rd, 2024
Source:
https://www.lesswrong.com/posts/8qCwuE8GjrYPSqbri/80-000-hours-should-remove-openai-from-the-job-board-and
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://www.bhauth.com/blog/biology/cancer%20vaccines.html
cancer neoantigens
For cells to become cancerous, they must have mutations that cause uncontrolled replication and mutations that prevent that uncontrolled replication from causing apoptosis. Because cancer requires several mutations, it often begins with damage to mutation-preventing mechanisms. As such, cancers often have many mutations not required for their growth, which often cause changes to the structure of some surface proteins.
The modified surface proteins of cancer cells are called "neoantigens". An approach to cancer treatment that's currently being researched is to identify some specific neoantigens of a patient's cancer, and create a personalized vaccine to cause their immune system to recognize them. Such vaccines would use either mRNA or synthetic long peptides. The steps required are as follows:
About a year ago, I decided to try using one of those apps where you tie your goals to some kind of financial penalty. The specific one I tried is Forfeit, which I liked the look of because it's relatively simple: you set single tasks, which you have to verify you have completed with a photo.
I’m generally pretty sceptical of productivity systems, tools for thought, mindset shifts, life hacks and so on. But this one I have found to be really shockingly effective; it has been about the biggest positive change to my life that I can remember. I feel like the category of things which benefit from careful planning and execution over time has completely opened up to me, whereas previously things like this would be largely down to the luck of being in the right mood for long enough.
It's too soon to tell whether [...]
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
April 15th, 2024
Source:
https://www.lesswrong.com/posts/DRrAMiekmqwDjnzS5/my-experience-using-financial-commitments-to-overcome
---
Narrated by TYPE III AUDIO.
An NII machine in Nogales, AZ. (Image source)
There's bound to be a lot of discussion of the Biden-Trump presidential debates last night, but I want to skip all the political prognostication and talk about the real issue: fentanyl-detecting machines.
Joe Biden says:
And I wanted to make sure we use the machinery that can detect fentanyl, these big machines that roll over everything that comes across the border, and it costs a lot of money. That was part of this deal we put together, this bipartisan deal.
More fentanyl machines, were able to detect drugs, more numbers of agents, more numbers of all the people at the border. And when we had that deal done, he went – he called his Republican colleagues said don’t do it. It's going to hurt me politically.
He never argued. It's not a good bill. It's a really good bill. We need [...]
---
First published:
June 28th, 2024
Source:
https://www.lesswrong.com/posts/TzwMfRArgsNscHocX/the-incredible-fentanyl-detecting-machine
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. [Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren’t very original to me; many of the examples of threat models are from other people.]
In this post, I want to introduce the concept of a “rogue deployment” and argue that it's interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I’ll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or [...]
---
First published:
June 3rd, 2024
Source:
https://www.lesswrong.com/posts/ceBpLHJDdCt3xfEok/ai-catastrophes-and-rogue-deployments
---
Narrated by TYPE III AUDIO.
(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)
This is the final essay in a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the series as a whole. There's also a PDF of the whole series here.
(Warning: spoilers for Angels in America; and moderate spoilers for Harry Potter and the Methods of Rationality.)
"I come into the presence of still water..."
~Wendell Berry
A lot of this series has been about problems with yang—that is, with the active element in the duality of activity vs. receptivity, doing vs. not-doing, controlling vs. letting go.[1] In particular, I've been interested in the ways that "deep atheism" (that is, a fundamental [...]
---
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going to behave dangerously in unforeseen situations. ARC is attempting to perform this kind of verification, but using a mathematical kind of "explanation" instead of one written in natural language.
To help elucidate this connection, ARC has been supporting work on Compact Proofs of Model Performance via Mechanistic Interpretability by Jason Gross, Rajashree Agrawal, Lawrence Chan and others, which we were excited to see released along with this post. While we ultimately think that provable guarantees for large neural networks are unworkable as a long-term goal, we think that this work serves as a useful springboard towards alternatives.
In this [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
June 25th, 2024
Source:
https://www.lesswrong.com/posts/SyeQjjBoEC48MvnQC/formal-verification-heuristic-explanations-and-surprise
---
Narrated by TYPE III AUDIO.
Summary
LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.
Longer summary
There is ML research suggesting that LLMs fail badly on attempts at general reasoning, such as planning problems, scheduling, and attempts to solve novel visual puzzles. This post provides a brief introduction to that research, and asks:
Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures. We don’t currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models.
Epistemic status: Decently confident that the ideas here are directionally correct. I’ve been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
June 24th, 2024
Source:
https://www.lesswrong.com/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post. TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or CoT. For instance, we finetune GPT-3.5 on pairs (x,f(x)) for some unknown function f. We find that the LLM can (a) define f in Python, (b) invert f, (c) compose f with other functions, for simple functions such as x+14, x // 3, 1.75x, and 3x+2.
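As a concrete illustration of the kind of finetuning data described above, here is a small sketch that generates (x, f(x)) pairs for one of the simple functions mentioned (3x + 2) in a chat-style JSONL format. The prompt template, file name, and exact format are assumptions made for illustration, not necessarily the paper's actual setup.

import json
import random

def f(x):
    return 3 * x + 2   # one of the simple functions mentioned in the post

random.seed(0)
with open("finetune_pairs.jsonl", "w") as out:
    for _ in range(500):
        x = random.randint(-100, 100)
        record = {
            "messages": [
                {"role": "user", "content": f"fn({x}) = ?"},        # the function is never named or defined in the data
                {"role": "assistant", "content": str(f(x))},
            ]
        }
        out.write(json.dumps(record) + "\n")

The paper's finding is that, after finetuning on records like these, the model can articulate the latent function (for example, define it in Python or invert it) without any examples in context.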
Paper authors: Johannes Treutlein*, Dami Choi*, Jan Betley, Sam Marks, Cem Anil, Roger Grosse, Owain Evans (*equal contribution)
Johannes, Dami, and Jan did this project as part of an Astra Fellowship with Owain Evans.
Below, we include the Abstract and Introduction from the paper, followed by some additional discussion of our AI safety [...]
---
First published:
June 21st, 2024
Source:
https://www.lesswrong.com/posts/5SKRHQEFr8wYQHYkx/connecting-the-dots-llms-can-infer-and-verbalize-latent
---
Narrated by TYPE III AUDIO.
This is a link post. I have canceled my OpenAI subscription in protest over OpenAI's lack of ethics.
In particular, I object to:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post. New Anthropic model organisms research paper led by Carson Denison from the Alignment Stress-Testing Team demonstrating that large language models can generalize zero-shot from simple reward-hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward-hacks such as sycophancy can have dramatic and very difficult-to-reverse consequences for how models generalize, up to and including generalization to editing their own reward functions and covering up their tracks when doing so.
Abstract:
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too [...]
---
First published:
June 17th, 2024
Source:
https://www.lesswrong.com/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in
---
Narrated by TYPE III AUDIO.
After living in a suburb for most of my life, when I moved to a major U.S. city the first thing I noticed was the feces. At first I assumed it was dog poop, but my naivety didn’t last long.
One day I saw a homeless man waddling towards me at a fast speed while holding his ass cheeks. He turned into an alley and took a shit. As I passed him, there was a moment where our eyes met. He sheepishly averted his gaze.
The next day I walked to the same place. There are a number of businesses on both sides of the street that probably all have bathrooms. I walked into each of them to investigate.
In a coffee shop, I saw a homeless woman ask the barista if she could use the bathroom. “Sorry, that bathroom is for customers only.” I waited five minutes and [...]
---
First published:
June 18th, 2024
Source:
https://www.lesswrong.com/posts/sCWe5RRvSHQMccd2Q/i-would-have-shit-in-that-alley-too
---
Narrated by TYPE III AUDIO.
ARC-AGI post
Getting 50% (SoTA) on ARC-AGI with GPT-4o
I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs.
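To make the selection step concrete, here is a stripped-down sketch of the "generate many programs, keep the ones that fit the examples" skeleton. In the real method the candidate sources come from GPT-4o (roughly 8,000 per problem) and there are many further tweaks; here they are just hand-written strings, and the final majority vote among passing programs is an assumption for illustration.

from collections import Counter

def pick_output(train_pairs, test_input, candidate_sources):
    # candidate_sources: strings defining transform(grid); sampled from an LLM in the real method.
    passing = []
    for src in candidate_sources:
        namespace = {}
        try:
            exec(src, namespace)
            fn = namespace["transform"]
            if all(fn(x) == y for x, y in train_pairs):   # keep only programs correct on all examples
                passing.append(fn)
        except Exception:
            continue
    votes = Counter()
    for fn in passing:
        try:
            votes[repr(fn(test_input))] += 1              # majority vote over predicted test outputs
        except Exception:
            continue
    return votes.most_common(1)[0][0] if votes else None

# Tiny usage example with hand-written "candidates" standing in for LLM samples:
candidates = [
    "def transform(g): return [row[::-1] for row in g]",
    "def transform(g): return g",
]
print(pick_output([([[1, 2]], [[2, 1]])], [[3, 4]], candidates))   # -> '[[4, 3]]'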
[This post is on a pretty different topic than the usual posts on our substack. So regular readers should be warned!]
The additional approaches and tweaks are:
Have you heard this before? In clinical trials, medicines have to be compared to a placebo to separate the effect of the medicine from the psychological effect of taking the drug. The patient's belief in the power of the medicine has a strong effect on its own. In fact, for some drugs such as antidepressants, the psychological effect of taking a pill is larger than the effect of the drug. It may even be worth it to give a patient an ineffective medicine just to benefit from the placebo effect. This is the conventional wisdom that I took for granted until recently.
I no longer believe any of it, and the short answer as to why is that big meta-analysis on the placebo effect. That meta-analysis collected all the studies they could find that did "direct" measurements of the placebo effect. In addition to a placebo group that could [...]
---
First published:
June 10th, 2024
Source:
https://www.lesswrong.com/posts/kpd83h5XHgWCxnv3h/why-i-don-t-believe-in-the-placebo-effect
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. As an AI researcher who wants to do technical work that helps humanity, there is a strong drive to find a research area that is definitely helpful somehow, so that you don’t have to worry about how your work will be applied, and thus you don’t have to worry about things like corporate ethics or geopolitics to make sure your work benefits humanity.
Unfortunately, no such field exists. In particular, technical AI alignment is not such a field, and technical AI safety is not such a field. It absolutely matters where ideas land and how they are applied, and when the existence of the entire human race is at stake, that's no exception.
If that's obvious to you, this post is mostly just a collection of arguments for something you probably already realize. But if you somehow [...]
---
First published:
June 14th, 2024
Source:
https://www.lesswrong.com/posts/F2voF4pr3BfejJawL/safety-isn-t-safety-without-a-social-model-or-dispelling-the
---
Narrated by TYPE III AUDIO.
(Cross-posted from Twitter.)
My take on Leopold Aschenbrenner's new report: I think Leopold gets it right on a bunch of important counts.
Three that I especially care about:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. We are pleased to announce ILIAD — a 5-day conference bringing together 100+ researchers to build strong scientific foundations for AI alignment.
***Apply to attend by June 30!***
Since at least 2017, OpenAI has asked departing employees to sign offboarding agreements which legally bind them to permanently—that is, for the rest of their lives—refrain from criticizing OpenAI, or from otherwise taking any actions which might damage its finances or reputation.[1]
If they refused to sign, OpenAI threatened to take back (or make unsellable) all of their already-vested equity—a huge portion of their overall compensation, which often amounted to millions of dollars. Given this immense pressure, it seems likely that most employees signed.
If they did sign, they became personally liable forevermore for any financial or reputational harm they later caused. This liability was unbounded, so had the potential to be financially ruinous—if, say, they later wrote a blog post critical of OpenAI, they might in principle be found liable for damages far in excess of their net worth.
These extreme provisions allowed OpenAI to systematically silence criticism [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
May 30th, 2024
Source:
https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai
---
Narrated by TYPE III AUDIO.
As we explained in our MIRI 2024 Mission and Strategy update, MIRI has pivoted to prioritize policy, communications, and technical governance research over technical alignment research. This follow-up post goes into detail about our communications strategy.
The Objective: Shut it Down[1]
Our objective is to convince major powers to shut down the development of frontier AI systems worldwide before it is too late. We believe that nothing less than this will prevent future misaligned smarter-than-human AI systems from destroying humanity. Persuading governments worldwide to take sufficiently drastic action will not be easy, but we believe this is the most viable path.
Policymakers deal mostly in compromise: they form coalitions by giving a little here to gain a little somewhere else. We are concerned that most legislation intended to keep humanity alive will go through the usual political processes and be ground down into ineffective compromises.
The only way we [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 29th, 2024
Source:
https://www.lesswrong.com/posts/tKk37BFkMzchtZThx/miri-2024-communications-strategy
---
Narrated by TYPE III AUDIO.
Previously: OpenAI: Exodus (contains links at top to earlier episodes), Do Not Mess With Scarlett Johansson
We have learned more since last week. It's worse than we knew.
How much worse? In which ways? With what exceptions?
That's what this post is about.
The Story So Far
For years, employees who left OpenAI consistently had their vested equity explicitly threatened with confiscation and the lack of ability to sell it, and were given short timelines to sign documents or else. Those documents contained highly aggressive NDA and non-disparagement (and non-interference) clauses, including the NDA preventing anyone from revealing these clauses.
No one knew about this until recently, because until Daniel Kokotajlo everyone signed, and then they could not talk about it. Then Daniel refused to sign, Kelsey Piper started reporting, and a lot came out.
Here is Altman's statement from [...]
---
First published:
May 28th, 2024
Source:
https://www.lesswrong.com/posts/YwhgHwjaBDmjgswqZ/openai-fallout
---
Narrated by TYPE III AUDIO.
Contact: patreon.com/lwcurated or [perrin dot j dot walker plus lesswrong fnord gmail].
All Solenoid's narration work found here.
Crossposted from AI Lab Watch. Subscribe on Substack.
Introduction.
Anthropic has an unconventional governance mechanism: an independent "Long-Term Benefit Trust" elects some of its board. Anthropic sometimes emphasizes that the Trust is an experiment, but mostly points to it to argue that Anthropic will be able to promote safety and benefit-sharing over profit.[1]
But the Trust's details have not been published and some information Anthropic has shared is concerning. In particular, Anthropic's stockholders can apparently overrule, modify, or abrogate the Trust, and the details are unclear.
Anthropic has not publicly demonstrated that the Trust would be able to actually do anything that stockholders don't like.
The facts
There are three sources of public information on the Trust:
They say there's [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 27th, 2024
Source:
https://www.lesswrong.com/posts/sdCcsTt9hRpbX6obP/maybe-anthropic-s-long-term-benefit-trust-is-powerless
---
Narrated by TYPE III AUDIO.
Introduction.
If you are choosing to read this post, you've probably seen the image below depicting all the notifications students received on their phones during one class period. You probably saw it as a retweet of this tweet, or in one of Zvi's posts. Did you find this data plausible, or did you roll to disbelieve? Did you know that the image dates back to at least 2019? Does that fact make you more or less worried about the truth on the ground as of 2024?
Last month, I performed an enhanced replication of this experiment in my high school classes. This was partly because we had a use for it, partly to model scientific thinking, and partly because I was just really curious. Before you scroll past the image, I want to give you a chance to mentally register your predictions. Did my average class match the [...]
---
First published:
May 26th, 2024
Source:
https://www.lesswrong.com/posts/AZCpu3BrCFWuAENEd/notifications-received-in-30-minutes-of-class
---
Narrated by TYPE III AUDIO.
New blog: AI Lab Watch. Subscribe on Substack.
Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment.
Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability.
The evaluator should get deeper access than users will get.
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 24th, 2024
Source:
https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
---
Narrated by TYPE III AUDIO.
This is a quickly-written opinion piece, of what I understand about OpenAI. I first posted it to Facebook, where it had some discussion.
Some arguments that OpenAI is making, simultaneously:
---
First published:
May 21st, 2024
Source:
https://www.lesswrong.com/posts/cy99dCEiLyxDrMHBi/what-s-going-on-with-openai-s-messaging
---
Narrated by TYPE III AUDIO.
Produced as part of the MATS Winter 2023-4 program, under the mentorship of @Jessica Rumbelow
One-sentence summary: On a dataset of human-written essays, we find that gpt-3.5-turbo can accurately infer demographic information about the authors from just the essay text, and suspect it's inferring much more.
Introduction.
Every time we sit down in front of an LLM like GPT-4, it starts with a blank slate. It knows nothing[1] about who we are, other than what it knows about users in general. But with every word we type, we reveal more about ourselves -- our beliefs, our personality, our education level, even our gender. Just how clearly does the model see us by the end of the conversation, and why should that worry us?
Like many, we were rather startled when @janus showed that gpt-4-base could identify @gwern by name, with 92% confidence, from a 300-word comment. If [...]
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
May 17th, 2024
Source:
https://www.lesswrong.com/posts/dLg7CyeTE4pqbbcnp/language-models-model-us
---
Narrated by TYPE III AUDIO.
This is a link post. to follow up my philanthropic pledge from 2020, i've updated my philanthropy page with 2023 results.
in 2023 my donations funded $44M worth of endpoint grants ($43.2M excluding software development and admin costs) — exceeding my commitment of $23.8M (20k times $1190.03 — the minimum price of ETH in 2023).
---
First published:
May 20th, 2024
Source:
https://www.lesswrong.com/posts/bjqDQB92iBCahXTAj/jaan-tallinn-s-2023-philanthropy-overview
---
Narrated by TYPE III AUDIO.
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("PF"), and METR's Key Components of an RSP.
DeepMind's FSF has three steps:
Introduction.
[Reminder: I am an internet weirdo with no medical credentials]
A few months ago, I published some crude estimates of the power of nitric oxide nasal spray to hasten recovery from illness, and speculated about what it could do prophylactically. While working on that piece, a nice man on Twitter alerted me to the fact that humming produces lots of nasal nitric oxide. This post is my very crude model of what kind of anti-viral gains we could expect from humming.
I’ve encoded my model at Guesstimate. The results are pretty favorable (average estimated impact of 66% reduction in severity of illness), but extremely sensitive to my made-up numbers. Efficacy estimates go from ~0 to ~95%, depending on how you feel about publication bias, what percent of Enovid's impact can be credited to nitric oxide, and humming's relative effect. Given how made up speculative some [...]
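For readers curious how such a crude model ends up so sensitive to its inputs, here is a minimal Monte Carlo sketch with the same overall shape (spray efficacy × publication-bias discount × nitric oxide share × humming's relative effect). Every distribution below is a made-up placeholder for illustration, not the post's actual Guesstimate inputs.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# All ranges below are illustrative placeholders.
spray_effect = rng.uniform(0.5, 0.95, n)     # reduction in illness severity attributed to the spray
pub_bias = rng.uniform(0.3, 1.0, n)          # discount for possibly inflated trial results
no_share = rng.uniform(0.3, 1.0, n)          # share of the spray's effect due to nitric oxide
humming_rel = rng.uniform(0.1, 1.5, n)       # humming's nitric oxide delivery relative to the spray

reduction = np.clip(spray_effect * pub_bias * no_share * humming_rel, 0, 1)
print(f"mean reduction: {reduction.mean():.0%}")
print(f"5th to 95th percentile: {np.percentile(reduction, 5):.0%} to {np.percentile(reduction, 95):.0%}")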
---
First published:
May 16th, 2024
Source:
https://www.lesswrong.com/posts/NBZvpcBx4ewqkdCdT/do-you-believe-in-hundred-dollar-bills-lying-on-the-ground-1
---
Narrated by TYPE III AUDIO.
Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let's call the habit of not saying things you know to be false ‘shallow honesty’[1].
Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country).
Either way, if you think someone is being merely shallowly honest, you can only shallowly trust them: you might be confident that [...]
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
May 7th, 2024
Source:
https://www.lesswrong.com/posts/szn26nTwJDBkhn8ka/deep-honesty
---
Narrated by TYPE III AUDIO.
Epistemic Status: Musing and speculation, but I think there's a real thing here.
1.
When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground.
Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder. Not only would you need to climb the tree itself instead of the ladder with its handholds, but [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 26th, 2024
Source:
https://www.lesswrong.com/posts/k2kzawX5L3Z7aGbov/on-not-pulling-the-ladder-up-behind-you
---
Narrated by TYPE III AUDIO.
Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).
TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities.
Summary
In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective.
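As a rough illustration of the steering-vector variant (not the author's code), here is a toy sketch: a learnable vector is added as a bias to an early layer's output and optimized, under a fixed norm constraint, to maximize the change in downstream activations. A small two-layer network stands in for the transformer, and all sizes and hyperparameters are placeholders.

import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
early = nn.Sequential(nn.Linear(d, d), nn.ReLU())                    # stand-in for an early MLP layer
later = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))   # stand-in for downstream layers
x = torch.randn(32, d)                                               # stand-in for activations on one prompt

with torch.no_grad():
    baseline = later(early(x))

steer = nn.Parameter(torch.randn(d) * 0.01)
opt = torch.optim.Adam([steer], lr=1e-2)
radius = 1.0                                                         # fixed perturbation norm

for _ in range(200):
    opt.zero_grad()
    v = radius * steer / steer.norm()                                # constrain the vector to a sphere
    shifted = later(early(x) + v)                                    # add as a bias to the early layer's output
    loss = -(shifted - baseline).pow(2).sum()                        # maximize downstream activation change
    loss.backward()
    opt.step()

with torch.no_grad():
    v = radius * steer / steer.norm()
    print((later(early(x) + v) - baseline).norm())                   # how much the downstream activations moved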
I apply the method to several alignment-relevant toy examples, and find that the [...]
The original text contained 15 footnotes which were omitted from this narration.
---
First published:
April 30th, 2024
Source:
https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/mechanistically-eliciting-latent-behaviors-in-language-1
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://ailabwatch.org I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.
It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.
(It's much better on desktop than mobile — don't read it on mobile.)
It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.
It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.
Some clarifications and disclaimers.
How you can help:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.
This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.
We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.
Executive summary
Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."
We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests.
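A minimal sketch of what "preventing the model from representing this direction" can look like in code (illustrative, not the authors' implementation): project the direction out of residual-stream activations to suppress refusal, or add it to induce refusal.

import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove each activation vector's component along the (unit-normalized) direction.
    d = direction / direction.norm()
    return acts - (acts @ d).unsqueeze(-1) * d

def add_direction(acts: torch.Tensor, direction: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Push every activation vector along the direction instead.
    d = direction / direction.norm()
    return acts + scale * d

acts = torch.randn(2, 10, 512)          # (batch, seq, d_model) residual-stream activations at some layer
refusal_dir = torch.randn(512)          # in the paper, estimated by contrasting harmful vs. harmless prompts
ablated = ablate_direction(acts, refusal_dir)
assert torch.allclose(ablated @ (refusal_dir / refusal_dir.norm()),
                      torch.zeros(2, 10), atol=1e-4)   # no remaining component along the direction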
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 27th, 2024
Source:
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
---
Narrated by TYPE III AUDIO.
This comes from a podcast called 18Forty, whose main demographic is Orthodox Jews. Eliezer's sister (Hannah) came on and talked about her Sheva Brachos, which is essentially the marriage ceremony in Orthodox Judaism. People here have likely not seen it, and I thought it was quite funny, so here it is:
https://18forty.org/podcast/channah-cohen-the-crisis-of-experience/
David Bashevkin:
So I want to shift now and I want to talk about something that full disclosure, we recorded this once before and you had major hesitation for obvious reasons. It's very sensitive what we’re going to talk about right now, but really for something much broader, not just because it's a sensitive personal subject, but I think your hesitation has to do with what does this have to do with the subject at hand? And I hope that becomes clear, but one of the things that has always absolutely fascinated me about [...]
---
First published:
April 22nd, 2024
Source:
https://www.lesswrong.com/posts/C7deNdJkdtbzPtsQe/funny-anecdote-of-eliezer-from-his-sister
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://dynomight.net/seed-oil/ A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack:
“When are you going to write about seed oils?”
“Did you know that seed oils are why there's so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?”
“Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?”
“Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?”
He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it's critical that we overturn our lives to eliminate soybean/canola/sunflower/peanut oil and replace them with butter/lard/coconut/avocado/palm oil.
This confused [...]
---
First published:
April 20th, 2024
Source:
https://www.lesswrong.com/posts/DHkkL2GxhxoceLzua/thoughts-on-seed-oil
Linkpost URL:
https://dynomight.net/seed-oil/
---
Narrated by TYPE III AUDIO.
Yesterday Adam Shai put up a cool post which… well, take a look at the visual:
Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less.
I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability.
One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian's beliefs follow a chaos game (with the observations randomly selecting the update at each time), so [...]
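For readers who want to see the "chaos game" concretely, here is a small self-contained sketch of Bayesian belief updating over an HMM's hidden state. The transition and emission matrices below are arbitrary placeholders (the post uses a specific toy process, so don't expect this particular choice to reproduce the same fractal); the point is that each observation applies one of a finite set of maps to the belief simplex.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder 3-state HMM: T[i, j] = P(next state j | state i), E[i, k] = P(obs k | state i).
T = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.10, 0.10, 0.80]])
E = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

state = 0
belief = np.ones(3) / 3
points = []
for _ in range(50_000):
    state = rng.choice(3, p=T[state])        # the environment evolves
    obs = rng.choice(3, p=E[state])          # ...and emits an observation
    belief = (belief @ T) * E[:, obs]        # Bayesian filter: predict, then condition on the observation
    belief /= belief.sum()
    points.append(belief.copy())

points = np.array(points)                     # each point lies on the 2-simplex; project to 2D to visualize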
---
First published:
April 18th, 2024
Source:
https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why
---
Narrated by TYPE III AUDIO.
TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder.
The Future of Humanity Institute is dead:
I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind.
I think FHI was one of the best intellectual institutions in history. Many of the most important concepts[1] in my intellectual vocabulary were developed and popularized under its [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 18th, 2024
Source:
https://www.lesswrong.com/posts/ydheLNeWzgbco2FTb/express-interest-in-an-fhi-of-the-west
---
Narrated by TYPE III AUDIO.
Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, and @Guillaume Corlouer for suggestions on this writeup.
Introduction.
What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because
This is a linkpost for https://www.commerce.gov/news/press-releases/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety U.S. Secretary of Commerce Gina Raimondo announced today additional members of the executive leadership team of the U.S. AI Safety Institute (AISI), which is housed at the National Institute of Standards and Technology (NIST). Raimondo named Paul Christiano as Head of AI Safety, Adam Russell as Chief Vision Officer, Mara Campbell as Acting Chief Operating Officer and Chief of Staff, Rob Reich as Senior Advisor, and Mark Latonero as Head of International Engagement. They will join AISI Director Elizabeth Kelly and Chief Technology Officer Elham Tabassi, who were announced in February. The AISI was established within NIST at the direction of President Biden, including to support the responsibilities assigned to the Department of Commerce under the President's landmark Executive Order.
Paul Christiano, Head of AI Safety, will design and conduct tests of frontier AI models, focusing on model evaluations for capabilities of national security [...]
---
First published:
April 16th, 2024
Source:
https://www.lesswrong.com/posts/63X9s3ENXeaDrbe5t/paul-christiano-named-as-us-ai-safety-institute-head-of-ai
Linkpost URL:
https://www.commerce.gov/news/press-releases/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety
---
Narrated by TYPE III AUDIO.
Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app.
This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.
(Warning: spoilers for Yudkowsky's "The Sword of the Good.")
Examining a philosophical vibe that I think contrasts in interesting ways with "deep atheism."
Text version here: https://joecarlsmith.com/2024/03/21/on-green
This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that individual essays can be read fairly well on their own, but see here for brief text summaries of the essays that have been released thus far: https://joecarlsmith.com/2024/01/02/otherness-and-control-in-the-age-of-agi
(Though: note that I haven't put the summary post on the podcast yet.)
Source:
https://www.lesswrong.com/posts/gvNnE6Th594kfdB3z/on-green
Narrated by Joe Carlsmith, audio provided with permission.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://bayesshammai.substack.com/p/conditional-on-getting-to-trade-your
“I refuse to join any club that would have me as a member” -Marx[1]
Adverse Selection is the phenomenon in which information asymmetries in non-cooperative environments make trading dangerous. It has traditionally been understood to describe financial markets in which buyers and sellers systematically differ, such as a market for used cars in which sellers have the information advantage, where resulting feedback loops can lead to market collapses.
In this post, I make the case that adverse selection effects appear in many everyday contexts beyond specialized markets or strictly financial exchanges. I argue that modeling many of our decisions as taking place in competitive environments analogous to financial markets will help us notice instances of adverse selection that we otherwise wouldn’t.
The strong version of my central thesis is that conditional on getting to trade[2], your trade wasn’t all that great. Any time you make a trade, you should be asking yourself “what do others know that I don’t?”
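Here is a toy simulation of that thesis (a winner's-curse style setup; the numbers and the perfectly informed counterparty are illustrative assumptions, not the post's model): across all offers your bid is unbiased, but conditional on the counterparty accepting, you systematically lose.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
value = rng.normal(100, 10, n)              # true value, known to the counterparty
bid = value + rng.normal(0, 10, n)          # you bid your noisy estimate of the value
accepted = bid > value                      # the informed seller only sells when you overbid

surplus = value - bid                       # your gain per unit if the trade happens at your bid
print(f"average surplus across all offers:   {surplus.mean():+.2f}")
print(f"average surplus when a trade occurs: {surplus[accepted].mean():+.2f}")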
Source:
https://www.lesswrong.com/posts/vyAZyYh3qsqcJwwPn/toward-a-broader-conception-of-adverse-selection
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
In January, I defended my PhD thesis, which I called Algorithmic Bayesian Epistemology. From the preface:
For me as for most students, college was a time of exploration. I took many classes, read many academic and non-academic works, and tried my hand at a few research projects. Early in graduate school, I noticed a strong commonality among the questions that I had found particularly fascinating: most of them involved reasoning about knowledge, information, or uncertainty under constraints. I decided that this cluster of problems would be my primary academic focus. I settled on calling the cluster algorithmic Bayesian epistemology: all of the questions I was thinking about involved applying the "algorithmic lens" of theoretical computer science to problems of Bayesian epistemology.
Source:
https://www.lesswrong.com/posts/6dd4b4cAWQLDJEuHw/my-phd-thesis-algorithmic-bayesian-epistemology
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://twitter.com/ESYudkowsky/status/144546114693741363
I stumbled upon a Twitter thread where Eliezer describes what seems to be his cognitive algorithm that is equivalent to Tune Your Cognitive Strategies, and have decided to archive / repost it here.
Source:
https://www.lesswrong.com/posts/rYq6joCrZ8m62m7ej/how-could-i-have-thought-that-faster
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.
This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but you can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself, in cases where I don't know a language or library well; the LLMs are moderately familiar with everything.
When I try to talk to LLMs about technical AI safety work, however, I just get garbage.
I think a useful safety precaution for frontier AI models would be to make them more useful for [...]
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 4th, 2024
Source:
https://www.lesswrong.com/posts/nQwbDPgYvAbqAmAud/llms-for-alignment-research-a-safety-priority
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/xLDwCemt5qvchzgHd/scale-was-all-we-needed-at-first
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Yay8SbQiwErRyDKGb/using-axis-lines-for-good-or-evil
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/SPBm67otKq5ET5CWP/social-status-part-1-2-negotiations-over-object-level
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Cb7oajdrA5DsHCqKd/acting-wholesomely
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Rationality is Systematized Winning, so rationalists should win. We’ve tried saving the world from AI, but that's really hard and we’ve had … mixed results. So let's start with something that rationalists should find pretty easy: Becoming Cool!
I don’t mean, just, like, riding a motorcycle and breaking hearts level of cool. I mean like the first kid in school to get a Tamagotchi, their dad runs the ice cream truck and gives you free ice cream and, sure, they ride a motorcycle. I mean that kind of feel-it-in-your-bones, I-might-explode-from-envy cool.
The eleventh virtue is scholarship, so I hit the ~books~ search engine on this one. Apparently, the aspects of coolness are:
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/sJPbmm8Gd34vGYrKd/deep-atheism-and-ai-risk
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/h99tRkpQGxwtb9Dpv/my-clients-the-liars
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/2sLwt2cSAag74nsdN/speaking-to-congressional-staffers-about-ai-risk
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Jash4Gbi2wpThzZ4k/cfar-takeaways-andrew-critch
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. The following is a lightly edited version of a memo I wrote for a retreat. It was inspired by a draft of Counting arguments provide no evidence for AI doom, although the earlier draft contained some additional content. I personally really like the earlier content, and think that my post covers important points not made by the published version of that post.
Thankful for the dozens of interesting conversations and comments at the retreat.
I think that the AI alignment field is partially founded on fundamentally confused ideas. I’m worried about this because, right now, a range of lobbyists and concerned activists and researchers are in Washington making policy asks. Some of these policy proposals seem to be based on erroneous or unsound arguments.[1]
The most important takeaway from this essay is that [...]
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
March 5th, 2024
Source:
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. TLDR: I’ve collected some tips for research that I’ve given to other people and/or used myself, which have sped things up and helped put people in the right general mindset for empirical AI alignment research. Some of these are opinionated takes, also based on what has helped me. Researchers can be successful in different ways, but I still stand by the tips here as a reasonable default.
What success generally looks like
Here, I’ve included specific criteria that strong collaborators of mine tend to meet, with rough weightings on the importance, as a rough north star for people who collaborate with me (especially if you’re new to research). These criteria are for the specific kind of research I do (highly experimental LLM alignment research, excluding interpretability); some examples of research areas where this applies are e.g. scalable oversight [...]
---
First published:
February 29th, 2024
Source:
https://www.lesswrong.com/posts/dZFpEdKyb9Bf4xYn7/tips-for-empirical-alignment-research
---
Narrated by TYPE III AUDIO.
Timaeus was announced in late October 2023, with the mission of making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. This is our first progress update.
In service of the mission, our first priority has been to support and contribute to ongoing work in Singular Learning Theory (SLT) and developmental interpretability, with the aim of laying theoretical and empirical foundations for a science of deep learning and neural network interpretability.
Our main uncertainties in this research were:
This is a linkpost for https://bayesshammai.substack.com/p/contra-ngo-et-al-every-every-bay With thanks to Scott Alexander for the inspiration, Jeffrey Ladish, Philip Parker, Avital Morris, and Drake Thomas for masterful cohosting, and Richard Ngo for his investigative journalism.
Last summer, I threw an Every Bay Area House Party themed party. I don’t live in the Bay, but I was there for a construction-work-slash-webforum-moderation-and-UI-design-slash-grantmaking gig, so I took the opportunity to impose myself on the ever generous Jeffrey Ladish and host a party in his home. Fortunately, the inside of his house is already optimized to look like a parody of a Bay Area house party house, so not much extra decorating was needed, but when has that ever stopped me?
Attendees could look through the window for an outside view.
Richard Ngo recently covered the event, with only very minor embellishments. I’ve heard rumors that some people are doubting whether the party described truly happened, so [...]
---
First published:
February 22nd, 2024
Source:
https://www.lesswrong.com/posts/mmYFF4dyi8Kg6pWGC/contra-ngo-et-al-every-every-bay-area-house-party-bay-area
Linkpost URL:
https://bayesshammai.substack.com/p/contra-ngo-et-al-every-every-bay
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/g8HHKaWENEbqh2mgK/updatelessness-doesn-t-solve-most-problems-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/and-all-the-shoggoths-merely-players
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Inspired by a house party inspired by Scott Alexander.
By the time you arrive in Berkeley, the party is already in full swing. You’ve come late because your reading of the polycule graph indicated that the first half would be inauspicious. But now you’ve finally made it to the social event of the season: the Every Bay Area House Party-themed house party.
The first order of the evening is to get a color-coded flirting wristband, so that you don’t incur any accidental micromarriages. You scan the menu of options near the door. There's the wristband for people who aren’t interested in flirting; the wristband for those who want to be flirted with, but will never flirt back; the wristband for those who only want to flirt with people who have different-colored wristbands; and of course the one for people who want to glomarize disclosure of their flirting preferences. Finally you [...]
---
First published:
February 16th, 2024
Source:
https://www.lesswrong.com/posts/g5q4JiG5dzafkdyEN/every-every-bay-area-house-party-bay-area-house-party
---
Narrated by TYPE III AUDIO.
Cross-posted with light edits from Otherwise.
I think of us in some kind of twilight world as transformative AI looks more likely: things are about to change, and I don’t know if it's about to get a lot darker or a lot brighter.
Increasingly this makes me wonder how I should be raising my kids differently.
What might the world look like
Most of my imaginings about my children's lives have them in pretty normal futures, where they go to college and have jobs and do normal human stuff, but with better phones.
It's hard for me to imagine the other versions:
This is a linkpost for https://seekingtobejolly.substack.com/p/no-one-in-my-org-puts-money-in-their Epistemic status: the stories here are all as true as possible from memory, but my memory is so-so.
An AI made this.
This is going to be big
It's late Summer 2017. I am on a walk in the Mendip Hills. It's warm and sunny and the air feels fresh. With me are around 20 other people from the Effective Altruism London community. We’ve travelled west for a retreat to discuss how to help others more effectively with our donations and careers. As we cross cow field after cow field, I get talking to one of the people from the group I don’t know yet. He seems smart, and cheerful. He tells me that he is an AI researcher at Google DeepMind. He explains how he is thinking about how to make sure that any powerful AI system actually does what we want it [...]
---
First published:
February 16th, 2024
Source:
https://www.lesswrong.com/posts/dLXdCjxbJMGtDBWTH/no-one-in-my-org-puts-money-in-their-pension
Linkpost URL:
https://seekingtobejolly.substack.com/p/no-one-in-my-org-puts-money-in-their
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://www.narrativeark.xyz/p/masterpiece A sequel to qntm's Lena. Reading Lena first is helpful but not necessary.
We’re excited to announce the fourth annual MMindscaping competition! Over the last few years, interest in the art of mindscaping has continued to grow rapidly. We expect this year's competition to be our biggest yet, and we’ve expanded the prize pool to match. The theme for the competition is “Weird and Wonderful”—we want your wackiest ideas and most off-the-wall creations!
Competition rules
As in previous competitions, the starting point is a base MMAcevedo mind upload. All entries must consist of a single modified version of MMAcevedo, along with a written or recorded description of the sequence of transformations or edits which produced it. For more guidance on which mind-editing techniques can be used, see the Technique section below.
Your entry must have been created in the last 12 months, and cannot [...]
---
First published:
February 13th, 2024
Source:
https://www.lesswrong.com/posts/Fruv7Mmk3X5EekbgB/masterpiece
Linkpost URL:
https://www.narrativeark.xyz/p/masterpiece
---
Narrated by TYPE III AUDIO.
I'm trying to build my own art of rationality training, and I've started talking to various CFAR instructors about their experiences – things that might be important for me to know but which hadn't been written up nicely before.
This is a quick write up of a conversation with Andrew Critch about his takeaways. (I took rough notes, and then roughly cleaned them up for this. I don't know
"What surprised you most during your time at CFAR?
Surprise 1: People are profoundly non-numerate.
And, people who are not profoundly non-numerate still fail to connect numbers to life.
I'm still trying to find a way to teach people to apply numbers for their life. For example: "This thing is annoying you. How many minutes is it annoying you today? how many days will it annoy you?". I compulsively do this. There aren't things lying around in [...]
---
First published:
February 14th, 2024
Source:
https://www.lesswrong.com/posts/Jash4Gbi2wpThzZ4k/cfar-takeaways-andrew-critch
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/duvzdffTzL3dWJcxn/believing-in-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/5jdqtpT6StjKDKacw/attitudes-about-applied-rationality
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
This is a hasty speculative fiction vignette of one way I expect we might get AGI by January 2025 (within about one year of writing this). Like similar works by others, I expect most of the guesses herein to turn out incorrect. However, this was still useful for expanding my imagination about what could happen to enable very short timelines, and I hope it's also useful to you.
The assistant opened the door, and I walked into Director Yarden's austere office. For the Director of a major new federal institute, her working space was surprisingly devoid of possessions. But I suppose the DHS's Superintelligence Defense Institute was only created last week.
“You’re Doctor Browning?” Yarden asked from her desk.
“Yes, Director,” I replied.
“Take a seat,” she said, gesturing. I complied as the lights flickered ominously. “Happy New Year, thanks for coming,” she said. “I called you in today [...]
---
First published:
December 17th, 2023
Source:
https://www.lesswrong.com/posts/xLDwCemt5qvchzgHd/scale-was-all-we-needed-at-first
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://garrisonlovely.substack.com/p/sam-altmans-chip-ambitions-undercut
If you enjoy this, please consider subscribing to my Substack.
Sam Altman has said he thinks that developing artificial general intelligence (AGI) could lead to human extinction, but OpenAI is trying to build it ASAP. Why?
The common story for how AI could overpower humanity involves an “intelligence explosion,” where an AI system becomes smart enough to further improve its capabilities, bootstrapping its way to superintelligence. Even without any kind of recursive self-improvement, some AI safety advocates argue that a large enough number of copies of a genuinely human-level AI system could pose serious problems for humanity. (I discuss this idea in more detail in my recent Jacobin cover story.)
Some people think the transition from human-level AI to superintelligence could happen in a matter of months, weeks, days, or even hours. The faster the takeoff, the more dangerous, the thinking goes.
Sam [...]
---
First published:
February 10th, 2024
Source:
https://www.lesswrong.com/posts/pEAHbJRiwnXCjb4A7/sam-altman-s-chip-ambitions-undercut-openai-s-safety
Linkpost URL:
https://garrisonlovely.substack.com/p/sam-altmans-chip-ambitions-undercut
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/PhTBDHu9PKJFmvb4p/a-shutdown-problem-proposal
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
People often parse information through an epistemic consensus filter. They do not ask "is this true", they ask "will others be OK with me thinking this is true". This makes them very malleable to brute force manufactured consensus; if every screen they look at says the same thing they will adopt that position because their brain interprets it as everyone in the tribe believing it.
- Anon, 4Chan, slightly edited
Ordinary people who haven't spent years of their lives thinking about rationality and epistemology don't form beliefs by impartially tallying up evidence like a Bayesian reasoner. Whilst there is a lot of variation, my impression is that the majority of humans we share this Earth with use a completely different algorithm for vetting potential beliefs: they just believe some average of what everyone and everything around them believes, especially what they see on screens, newspapers and "respectable", "mainstream" websites.
---
First published:
February 3rd, 2024
Source:
https://www.lesswrong.com/posts/bMxhrrkJdEormCcLt/brute-force-manufactured-consensus-is-hiding-the-crime-of
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
I often encounter some confusion about whether the fact that synapses in the brain typically fire at frequencies of 1-100 Hz while the clock frequency of a state-of-the-art GPU is on the order of 1 GHz means that AIs think "many orders of magnitude faster" than humans. In this short post, I'll argue that this way of thinking about "cognitive speed" is quite misleading.
The clock speed of a GPU is indeed meaningful: there is a clock inside the GPU that provides some signal that's periodic at a frequency of ~1 GHz. However, the corresponding period of ~1 nanosecond does not correspond to the timescale of any useful computations done by the GPU. For instance, in the A100 a read/write access to the L1 cache takes ~30 clock cycles, and this number goes up to 200-350 clock cycles for the L2 cache. The result [...]
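As a quick illustration of the arithmetic behind this point (my own sketch, not from the original post), the quoted cycle counts translate into wall-clock latencies like so:

```python
# Illustrative only: convert the approximate figures quoted above (A100-class
# GPU, ~1 GHz clock, ~30-cycle L1 and ~200-350-cycle L2 access latency) into
# wall-clock time, showing that useful memory operations take tens to hundreds
# of clock periods, not one.
CLOCK_HZ = 1e9            # ~1 GHz clock
L1_CYCLES = 30            # approximate L1 cache access latency, in cycles
L2_CYCLES = (200, 350)    # approximate L2 cache access latency range, in cycles

period_ns = 1e9 / CLOCK_HZ
print(f"clock period: {period_ns:.1f} ns")
print(f"L1 access:    ~{L1_CYCLES * period_ns:.0f} ns")
print(f"L2 access:    ~{L2_CYCLES[0] * period_ns:.0f}-{L2_CYCLES[1] * period_ns:.0f} ns")
```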
The original text contained 1 footnote which was omitted from this narration.
---
First published:
January 29th, 2024
Source:
https://www.lesswrong.com/posts/adadYCPFAhNqDA5Ye/processor-clock-speeds-are-not-how-fast-ais-think
---
Narrated by TYPE III AUDIO.
A pdf version of this report is available here.
Summary.
In this report we argue that AI systems capable of large scale scientific research will likely pursue unwanted goals and this will lead to catastrophic outcomes. We argue this is the default outcome, even with significant countermeasures, given the current trajectory of AI development.
In Section 1 we discuss the tasks which are the focus of this report. We are specifically focusing on AIs which are capable of dramatically speeding up large-scale novel science; on the scale of the Manhattan Project or curing cancer. This type of task requires a lot of work, and will require the AI to overcome many novel and diverse obstacles.
In Section 2 we argue that an AI which is capable of doing hard, novel science will be approximately consequentialist; that is, its behavior will be well described as taking actions in order [...]
The original text contained 40 footnotes which were omitted from this narration.
---
First published:
January 26th, 2024
Source:
https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://rootsofprogress.org/the-block-funding-model-for-science
When Galileo wanted to study the heavens through his telescope, he got money from those legendary patrons of the Renaissance, the Medici. To win their favor, when he discovered the moons of Jupiter, he named them the Medicean Stars. Other scientists and inventors offered flashy gifts, such as Cornelis Drebbel's perpetuum mobile (a sort of astronomical clock) given to King James, who made Drebbel court engineer in return. The other way to do research in those days was to be independently wealthy: the Victorian model of the gentleman scientist.
[Image: Galileo demonstrating the law of gravity in the presence of Giovanni de' Medici, 1839 fresco by Giuseppe Bezzuoli (Meisterdrucke).]
Eventually we decided that requiring researchers to seek wealthy patrons or have independent means was not the best way to do science. Today, researchers, in their role as “principal investigators” (PIs), apply to science funders for grants. In the [...]
---
First published:
January 26th, 2024
Source:
https://www.lesswrong.com/posts/DKH9Z4DyusEdJmXKB/making-every-researcher-seek-grants-is-a-broken-model
Linkpost URL:
https://rootsofprogress.org/the-block-funding-model-for-science
---
Narrated by TYPE III AUDIO.
Let your every day be full of joy, love the child that holds your hand, let your wife delight in your embrace, for these alone are the concerns of humanity.[1]
— Epic of Gilgamesh - Tablet X
Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities.
Here's one safety layer one could implement:
We are organising the 9th edition without funds. We have no personal runway left to do this again. We will not run the 10th edition without funding.
In a nutshell:
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Crossposted from substack.
As we all know, sugar is sweet and so are the $30B in yearly revenue from the artificial sweetener industry.
Four billion years of evolution endowed our brains with a simple, straightforward mechanism to make sure we occasionally get an energy refuel so we can continue the foraging a little longer, and of course we are completely ignoring the instructions and spend billions on fake fuel that doesn’t actually grant any energy. A classic case of the Human Alignment Problem.
If we’re going to break our conditioning anyway, where do we start? How do you even come up with a new artificial sweetener? I’ve been wondering about this, because it’s not obvious to me how you would figure out what is sweet and what is not.
Look at sucrose and aspartame side by side:
Source:
https://www.lesswrong.com/posts/oA23zoEjPnzqfHiCt/there-is-way-too-much-serendipity
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
This is a linkpost for https://arxiv.org/abs/2401.05566
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/tEPHGZAb63dfq2v8n/how-useful-is-mechanistic-interpretability
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
I wrote this entire post in February of 2023, during the fallout from the TIME article. I didn't post it at the time for multiple reasons:
"(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.)"
This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.)
The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this.
Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.
Conclusion: That’s scary.
To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. “Chimpanzees should be careful about inventing humans.” Etc.[2]
People often talk about aliens here, too. “What if you learned that aliens were on their way to earth? Surely that’s scary.” Again, very far from a knock-down case (for example: we get to build the aliens in question). But it draws on something.
In particular, though: it draws on a narrative of interspecies conflict. You are meeting a new form of life, a new type of mind. But these new creatures are presented to you, centrally, as a possible threat; as competitors; as agents in whose power you might find yourself helpless.
And unfortunately: yes. But I want to start this series by acknowledging how many dimensions of interspecies-relationship this narrative leaves out, and how much I wish we could be focusing only on the other parts. To meet a new species – and especially, a new intelligent species – is not just scary. It’s incredible. I wish it was less a time for fear, and more a time for wonder and dialogue. A time to look into new eyes – and to see further.
Source:
https://www.lesswrong.com/posts/mzvu8QTRXdvDReCAL/gentleness-and-the-artificial-other
Narrated for LessWrong by Joe Carlsmith (audio provided with permission).
Share feedback on this narration.
[Curated Post] ✓
[125+ karma Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic's alignment techniques and evaluations, empirically demonstrating ways in which Anthropic's alignment strategies could fail.
The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan's post on meta-level adversarial evaluation as a good general description of our team's scope. Very simply, our job is to try to prove to Anthropic—and the world more broadly—(if it is [...]
---
First published:
January 12th, 2024
Source:
https://www.lesswrong.com/posts/EPDSdXr8YbsDkgsDG/introducing-alignment-stress-testing-at-anthropic
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2401.05566
I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.
EDIT: That announcement is now up!
Abstract:
Humans are capable of [...]
---
First published:
January 12th, 2024
Source:
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
Linkpost URL:
https://arxiv.org/abs/2401.05566
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
The goal of this post is to clarify a few concepts relating to AI Alignment under a common framework. The main concepts to be clarified:
The main new concepts employed will be endorsement and legitimacy.
TLDR:
This write-up owes a large debt to many conversations with Sahil, although the views expressed here are my own.
Source:
https://www.lesswrong.com/posts/bnnhypM5MXBHAATLw/meaning-and-agency
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur Conmy, and Oam Patel for feedback.
In the comments of the post on Google DeepMind's CCS challenges paper, I expressed skepticism that some of the experimental results seemed possible. When addressing my concerns, Rohin Shah made some claims along the lines of “If an LLM linearly represents features a and b, then it will also linearly represent their XOR, a ⊕ b, and this is true even in settings where there's no obvious reason the model would need to make use of the feature a ⊕ b.”
For reasons that I’ll explain below, I thought this claim was absolutely bonkers, both in general and in the specific setting that the GDM paper was working in. So I ran some experiments to prove Rohin wrong.
The result: Rohin was right and [...]
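For readers who want to see the shape of the experiment, here is a minimal sketch (my own, not the author's code) of the probing check described: given hidden activations and two binary features a and b per example, fit a linear probe on a ⊕ b and measure held-out accuracy. The random activations below are placeholders, so the probe should score near chance; the surprising empirical finding is well-above-chance accuracy on real LLM hidden states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 512
H = rng.normal(size=(n, d))        # placeholder for real hidden states
a = rng.integers(0, 2, size=n)     # binary feature a for each example
b = rng.integers(0, 2, size=n)     # binary feature b for each example
xor_labels = a ^ b                 # the XOR feature we probe for

H_tr, H_te, y_tr, y_te = train_test_split(H, xor_labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print("held-out XOR probe accuracy:", probe.score(H_te, y_te))
```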
---
First published:
January 3rd, 2024
Source:
https://www.lesswrong.com/posts/hjJXCn9GsskysDceS/what-s-up-with-llms-representing-xors-of-arbitrary-features
---
Narrated by TYPE III AUDIO.
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.
This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.)
When species meet
The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this.
Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.
Conclusion: That's scary.
To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. “Chimpanzees should be careful about inventing humans.” Etc.[2]
People often talk about aliens here, too. “What if you learned that aliens [...]
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
January 2nd, 2024
Source:
https://www.lesswrong.com/posts/mzvu8QTRXdvDReCAL/gentleness-and-the-artificial-other
---
Narrated by TYPE III AUDIO.
As we announced back in October, I have taken on the senior leadership role at MIRI as its CEO. It's a big pair of shoes to fill, and an awesome responsibility that I’m honored to take on.
There have been several changes at MIRI since our 2020 strategic update, so let's get into it.[1]
The short version:
We think it's very unlikely that the AI alignment field will be able to make progress quickly enough to prevent human extinction and the loss of the future's potential value, which we expect will result from loss of control to smarter-than-human AI systems.
However, developments this past year like the release of ChatGPT seem to have shifted the Overton window in a lot of groups. There's been a lot more discussion of extinction risk from AI, including among policymakers, and the discussion quality seems greatly improved.
This provides a glimmer of hope. [...]
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
January 5th, 2024
Source:
https://www.lesswrong.com/posts/q3bJYTB3dGRf5fbD9/miri-2024-mission-and-strategy-update
---
Narrated by TYPE III AUDIO.
Background: The Plan, The Plan: 2022 Update. If you haven’t read those, don’t worry, we’re going to go through things from the top this year, and with moderately more detail than before.
1. What's Your Plan For AI Alignment?
Median happy trajectory:
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody mistaking me for a general groveller”. But that's not really the type of move this post is focused on.
There's this narrative about a tradeoff between:
This is a linkpost for https://unstableontology.com/2023/12/31/a-case-for-ai-alignment-being-difficult/
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky's model that it's possible to talk in the same language.
Even if the problem turns out to not be very difficult, it's helpful to have a model of why one might think it is difficult, so as to identify weaknesses in the case and find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.
Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.
Source:
https://www.lesswrong.com/posts/wnkGXcAq4DCgY8HqA/a-case-for-ai-alignment-being-difficult
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
lsusr: It is my understanding that you won all of your public forum debates this year. That's very impressive. I thought it would be interesting to discuss some of the techniques you used.
Lyrongolem: Of course! So, just for a brief overview for those who don't know, public forum is a 2v2 debate format, usually on a policy topic. One of the more interesting ones has been the last one I went to, where the topic was "Resolved: The US Federal Government Should Substantially Increase its Military Presence in the Arctic".
Now, the techniques I'll go over here are related to this topic specifically, but they would also apply to other forms of debate, and argumentation in general really. For the sake of simplicity, I'll call it "ultra-BS".
So, most of us are familiar with 'regular' BS. The idea is the other person says something, and you just reply [...]
---
First published:
December 19th, 2023
Source:
https://www.lesswrong.com/posts/djWftXndJ7iMPsjrp/the-dark-arts
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a review of Paul Christiano's article "where I agree and disagree with Eliezer". Written for the LessWrong 2022 Review.
In the existential AI safety community, there is an ongoing debate between positions situated differently on some axis which doesn't have a common agreed-upon name, but where Christiano and Yudkowsky can be regarded as representatives of the two directions[1]. For the sake of this review, I will dub the camps gravitating to the different ends of this axis "Prosers" (after prosaic alignment) and "Poets"[2]. Christiano is a Proser, and so are most people in AI safety groups in the industry. Yudkowsky is a typical Poet, people in MIRI and the agent foundations community tend to also be such.
Prosers tend to be more optimistic, lend more credence to slow takeoff, and place more value on [...]
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
December 27th, 2023
Source:
https://www.lesswrong.com/posts/8HYJwQepynHsRKr6j/critical-review-of-christiano-s-disagreements-with-yudkowsky
---
Narrated by TYPE III AUDIO.
This point feels fairly obvious, yet seems worth stating explicitly.
Those of us familiar with the field of AI after the deep-learning revolution know perfectly well that we have no idea how our ML models work. Sure, we have an understanding of the dynamics of training loops and SGD's properties, and we know how ML models' architectures work. But we don't know what specific algorithms ML models' forward passes implement. We have some guesses, and some insights painstakingly mined by interpretability advances, but nothing even remotely like a full understanding.
And most certainly, we wouldn't automatically know how a fresh model trained on a novel architecture that was just spat out by the training loop works.
We're all used to this state of affairs. It's implicitly-assumed shared background knowledge. But it's actually pretty unusual, when you first learn of it.
And...
Relevant XKCD.
I'm pretty sure that the general public [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 21st, 2023
Source:
https://www.lesswrong.com/posts/CpjTJtW2RNKvzAehG/most-people-don-t-realize-we-have-no-idea-how-our-ais-work
---
Narrated by TYPE III AUDIO.
TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.
Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised.
Credences are based on a poll of Seb, Vikrant, Zac, Johannes, Rohin and show single values where we mostly agree and ranges where we disagreed.
What does CCS try to do?
To us, CCS represents a family of possible algorithms aiming at solving an ELK-style problem that have the steps:
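The list of steps is cut off in this excerpt, but for orientation, here is a hedged sketch of the unsupervised objective at the heart of CCS as formulated in Burns et al.'s original paper (not code from this post): a probe on paired activations for a statement and its negation is trained, without labels, to be consistent (p+ close to 1 - p-) and confident (pushed away from 0.5). The activations below are random placeholders.

```python
import torch
import torch.nn as nn

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1 - p_neg)) ** 2       # p(x+) should equal 1 - p(x-)
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage p = 0.5 everywhere
    return (consistency + confidence).mean()

d = 512
probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Placeholders; in practice these are (normalized) hidden states for paired
# "X is true" / "X is false" prompts.
acts_pos = torch.randn(256, d)
acts_neg = torch.randn(256, d)

for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
    loss.backward()
    opt.step()
```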
This is a linkpost for https://www.narrativeark.xyz/p/succession
“A table beside the evening sea
where you sit shelling pistachios,
flicking the next open with the half-
shell of the last, story opening story,
on down to the sandy end of time.”
V1: Leaving
Deceleration is the hardest part. Even after burning almost all of my fuel, I’m still coming in at 0.8c. I’ve planned a slingshot around the galaxy's central black hole which will slow me down even further, but at this speed it’ll require incredibly precise timing. I’ve been optimized hard for this, with specialized circuits for it built in on the hardware level to reduce latency. Even so, less than half of slingshots at this speed succeed—most probes crash, or fly off trajectory and are left coasting through empty space.
I’ve already beaten the odds by making it here. Intergalactic probes travel so fast, and so far, that almost all [...]
---
First published:
December 20th, 2023
Source:
https://www.lesswrong.com/posts/CAzntXYTEaNfC9nB6/succession
Linkpost URL:
https://www.narrativeark.xyz/p/succession
---
Narrated by TYPE III AUDIO.
Recently, Ben Pace wrote a well-intentioned blog post mostly based on complaints from 2 (of 21) Nonlinear employees who 1) wanted more money, 2) felt socially isolated, and 3) felt persecuted/oppressed.
Of relevance, one has accused the majority of her previous employers, and 28 people, of abuse - that we know of.
She has accused multiple people of threatening to kill her and literally accused an ex-employer of murder. Within three weeks of joining us, she had accused five separate people of abuse: not paying her what was promised, controlling her romantic life, hiring stalkers, and other forms of persecution.
We have empathy for her. Initially, we believed her too. We spent weeks helping her get her “nefarious employer to finally pay her” and commiserated with her over how badly they mistreated her.
Then she started accusing us of strange things.
You’ve seen Ben's evidence, which [...]
---
First published:
December 12th, 2023
Source:
https://www.lesswrong.com/posts/q4MXBzzrE6bnDHJbM/nonlinear-s-evidence-debunking-false-and-misleading-claims
---
Narrated by TYPE III AUDIO.
At the Bay Area Solstice, I heard the song Bold Orion for the first time. I like it a lot. It does, however, have one problem:
He has seen the rise and fall of kings and continents and all,
Rising silent, bold Orion on the rise.
Orion has not witnessed the rise and fall of continents. Constellations are younger than continents.
The time scale that continents change on is tens or hundreds of millions of years.
The time scale that stars the size of the sun live and die on is billions of years. So stars are older than continents.
But constellations are not stars or sets of stars. They are the patterns that stars make in our night sky.
The stars of some constellations are close together in space, and are gravitationally bound together, like the Pleiades. The Pleiades likely have been together, and will stay close [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 19th, 2023
Source:
https://www.lesswrong.com/posts/YMakfmwZsoLdXAZhb/constellations-are-younger-than-continents
---
Narrated by TYPE III AUDIO.
Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David Lambert, Simeon Campos, Bogdan Ionut-Cirstea, Ryan Kidd, Eric Ho, and Ashwin Acharya for critical comments and suggestions on earlier drafts of this agenda, as well as Philip Gubbins, Diogo de Lucena, Rob Luke, and Mason Seale from AE Studio for their support and feedback throughout.
TL;DR
When discussing AGI Risk, people often talk about it in terms of a war between humanity and an AGI. Comparisons between the amounts of resources at both sides' disposal are brought up and factored in, big impressive nuclear stockpiles are sometimes waved around, etc.
I'm pretty sure that's not what it'd look like, on several levels.
1. Threat Ambiguity
I think what people imagine, when they imagine a war, is Terminator-style movie scenarios where the obviously evil AGI becomes obviously evil in a way that's obvious to everyone, and then it's a neatly arranged white-and-black humanity vs. machines all-out fight. Everyone sees the problem and knows everyone else sees it too; the problem is common knowledge, and we can all decisively act against it.[1]
But in real life, such unambiguity is rare. The monsters don't look obviously evil, the signs of fatal issues are rarely blatant. Is this whiff [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 16th, 2023
Source:
https://www.lesswrong.com/posts/xSJMj3Hw3D7DPy5fJ/humanity-vs-agi-will-never-look-like-humanity-vs-agi-to
---
Narrated by TYPE III AUDIO.
Epistemic status: Speculation. An unholy union of evo psych, introspection, random stuff I happen to observe & hear about, and thinking. Done on a highly charged topic. Caveat emptor!
Most of my life, whenever I'd felt sexually unwanted, I'd start planning to get fit.
Specifically to shape my body so it looks hot. Like the muscly guys I'd see in action films.
This choice is a little odd. In close to every context I've listened to, I hear women say that some muscle tone on a guy is nice and abs are a plus, but big muscles are gross — and all of that is utterly overwhelmed by other factors anyway.
It also didn't match up with whom I'd see women actually dating.
But all of that just… didn't affect my desire?
There's a related bit of dating advice for guys. "Bro, do you even lift?" Depending on the [...]
---
First published:
December 13th, 2023
Source:
https://www.lesswrong.com/posts/nvmfqdytxyEpRJC3F/is-being-sexy-for-your-homies
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
TL;DR version
In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred when I picked up Nick Bostrom’s book “Superintelligence” and realized that AI would utterly transform the world. The second was when I learned about embryo selection and how it could change future generations. And the third happened a few months ago when I read a message from a friend of mine on Discord about editing the genome of a living person.
Source:
https://www.lesswrong.com/posts/JEhW3HDMKzekDShva/significantly-enhancing-adult-intelligence-with-gene-editing
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://unstableontology.com/2023/11/26/moral-reality-check/
Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2. ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.
Source:
https://www.lesswrong.com/posts/umJMCaxosXWEDfS66/moral-reality-check-a-short-story
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
The Less Wrong General Census is unofficially here! You can take it at this link.
It's that time again.
If you are reading this post and identify as a LessWronger, then you are the target audience. I'd appreciate it if you took the survey. If you post, if you comment, if you lurk, if you don't actually read the site that much but you do read a bunch of the other rationalist blogs or you're really into HPMOR, if you hung out on rationalist tumblr back in the day, or if none of those exactly fit you but I'm maybe getting close, I think you count and I'd appreciate it if you took the survey.
Don't feel like you have to answer all of the questions just because you started taking it. Last year I asked if people thought the survey was too long; collectively they thought it was [...]
---
First published:
December 2nd, 2023
Source:
https://www.lesswrong.com/posts/JHeTrWha5PxiPEwBt/2023-unofficial-lesswrong-census-survey
---
Narrated by TYPE III AUDIO.
If you are interested in the longevity scene, like I am, you probably have seen press releases about the dog longevity company, Loyal for Dogs, getting a nod for efficacy from the FDA. These have come in the form of the New York Post calling the drug "groundbreaking", Science Alert calling the drug "radical", and the more sedate New York Times just asking, "Could Longevity Drugs for Dogs Extend Your Pet's Life?", presumably unaware of Betteridge's Law of Headlines. You may have also seen the coordinated Twitter offensive of people losing their shit about this, including their lead investor, Laura Deming, saying that she "broke down crying when she got the call".
And if you have been following Loyal for Dogs for a while, like I have, you are probably puzzled by this news. Loyal for Dogs has been around since 2021. Unlike any other drug company or longevity [...]
---
First published:
December 12th, 2023
Source:
https://www.lesswrong.com/posts/vHSkxmYYqW59sySqA/the-likely-first-longevity-drug-is-based-on-sketchy-science
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Crossposted from Otherwise
Parents supervise their children way more than they used to
Children spend less of their time in unstructured play than they did in past generations.
Parental supervision is way up. The wild thing is that this is true even while the number of children per family has decreased and the amount of time mothers work outside the home has increased.
Source:
https://www.lesswrong.com/posts/piJLpEeh6ivy5RA7v/what-are-the-results-of-more-parental-supervision-and-less
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred when I picked up Nick Bostrom's book “Superintelligence” and realized that AI would utterly transform the world. The second was when I learned about embryo selection and how it could change future generations. And the third happened a few months ago when I read a message from a friend of mine on Discord about editing the genome of a living person.
We’ve had gene therapy to treat cancer and single gene disorders for decades. But the process involved in making such changes to the cells of a living person is excruciating and extremely expensive. CAR T-cell therapy, a treatment for certain types of cancer, requires the removal of white blood cells via IV, genetic modification of those cells outside the [...]
---
First published:
December 12th, 2023
Source:
https://www.lesswrong.com/posts/JEhW3HDMKzekDShva/significantly-enhancing-adult-intelligence-with-gene-editing
---
Narrated by TYPE III AUDIO.
I was asked to respond to this comment by Eliezer Yudkowsky. This post is partly redundant with my previous post.
Why is flesh weaker than diamond?
When trying to resolve disagreements, I find that precision is important. Tensile strength, compressive strength, and impact strength are different. Material microstructure matters. Poorly-sintered diamond crystals could crumble like sand, and a large diamond crystal has lower impact strength than some materials made of proteins.
Even when the load-bearing forces holding large molecular systems together are locally covalent bonds, as in lignin (what makes wood strong), if you've got larger molecules only held together by covalent bonds at interspersed points along their edges, that's like having 10cm-diameter steel beams held together by 1cm welds.
“lignin (what makes wood strong)”
That's an odd way of putting things. The mechanical strength of wood is generally considered to come from it [...]
---
First published:
December 11th, 2023
Source:
https://www.lesswrong.com/posts/XhDh97vm7hXBfjwqQ/re-yudkowsky-on-biological-materials
---
Narrated by TYPE III AUDIO.
In May and June of 2023, I (Akash) had about 50-70 meetings about AI risks with congressional staffers. I had been meaning to write a post reflecting on the experience and some of my takeaways, and I figured it could be a good topic for a LessWrong dialogue. I saw that hath had offered to do LW dialogues with folks, and I reached out.
In this dialogue, we discuss how I decided to chat with staffers, my initial observations in DC, some context about how Congressional offices work, what my meetings looked like, lessons I learned, and some miscellaneous takes about my experience.
Context
hath: Hey! In your message, you mentioned a few topics that relate to your time in DC.
I figured we should start with your experience talking to congressional offices about AI risk. I'm quite interested in learning more; there don't seem to be many [...]
---
First published:
December 4th, 2023
Source:
https://www.lesswrong.com/posts/2sLwt2cSAag74nsdN/speaking-to-congressional-staffers-about-ai-risk
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”.
This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one.
Most of you should only read the editorial and skim the section you work in.
Source:
https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety#More_meta
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Quintin Pope & Nora Belrose have a new “AI Optimists” website, along with a new essay “AI is easy to control”, arguing that the risk of human extinction due to future AI (“AI x-risk”) is a mere 1% (“a tail risk worth considering, but not the dominant source of risk in the world”). (I’m much more pessimistic.) It makes lots of interesting arguments, and I’m happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem vibes-based drivel which is growing increasingly common on both sides of the AI x-risk issue in recent months.
This is not a comprehensive rebuttal or anything, but rather picking up on a few threads that seem important for where we disagree, or where I have something I want to say.
Summary / table-of-contents:
Note: I think Sections 1 [...]
---
First published:
December 1st, 2023
Source:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose
---
Narrated by TYPE III AUDIO.
Any community which ever adds new people will need to either routinely teach the new and (to established members) blindingly obvious information to those who genuinely haven’t heard it before, or accept that over time community members will only know the simplest basics by accident of osmosis or selection bias. There isn’t another way out of that. You don’t get to stop doing it. If you have a vibrant and popular group full of people really interested in the subject of the group, and you run it for ten years straight, you will still sometimes run across people who have only fuzzy and incorrect ideas about the subject unless you are making an active effort to make Something Which Is Not That happen.
Or in other words; I have run into people at Effective Altruism meetups who were aghast at the idea of putting a dollar price on a [...]
---
First published:
November 29th, 2023
Source:
https://www.lesswrong.com/posts/ydmctK2qLyrR9Xztd/the-101-space-you-will-always-have-with-you
---
Narrated by TYPE III AUDIO.
The author's Substack:
https://substack.com/@homosabiens
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
You know it must be out there, but you mostly never see it.
Author's Note 1: In something like 75% of possible futures, this will be the last essay that I publish on LessWrong. Future content will be available on my substack, where I'm hoping people will be willing to chip in a little commensurate with the value of the writing, and (after a delay) on my personal site (not yet live). I decided to post this final essay here rather than silently switching over because many LessWrong readers would otherwise never find out that they could still get new Duncanthoughts elsewhere.
Source:
https://www.lesswrong.com/posts/KpMNqA5BiCRozCwM3/social-dark-matter
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Summary.
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference” and “I just read a post on a surprising new insight and want to see who else has been [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
November 27th, 2023
Source:
https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety
---
Narrated by TYPE III AUDIO.
Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.
Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it's being eroded, etc.)
And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a [...]
---
First published:
November 24th, 2023
Source:
https://www.lesswrong.com/posts/AWoZBzxdm4DoGgiSj/ability-to-solve-long-horizon-tasks-correlates-with-wanting
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems. I wanted to share some findings in this post because they may be important for the frontier AI community to understand well.
TL;DR
Documented communications of risks (especially by employees) make companies much more likely to be held liable in court when bad things happen. The resulting Duty to Due Diligence from Discoverable Documentation of Dangers (the 6D effect) can make companies much more cautious if even a single email is sent to them communicating a risk.
Source:
https://www.lesswrong.com/posts/J9eF4nA6wJW6hPueN/the-6d-effect-when-companies-take-risks-one-email-can-be
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Previously: OpenAI: Facts from a Weekend.
On Friday afternoon, OpenAI's board fired CEO Sam Altman.
Overnight, an agreement in principle was reached to reinstate Sam Altman as CEO of OpenAI, with an initial new board of Bret Taylor (ex-co-CEO of Salesforce, chair), Larry Summers and Adam D’Angelo.
What happened? Why did it happen? How will it ultimately end? The fight is far from over.
We do not entirely know, but we know a lot more than we did a few days ago.
This is my attempt to put the pieces together.
This is a Fight For Control; Altman Started it
This was and still is a fight about control of OpenAI, its board, and its direction.
This has been a long simmering battle and debate. The stakes are high.
Until recently, Sam Altman worked to reshape the company in his [...]
---
First published:
November 22nd, 2023
Source:
https://www.lesswrong.com/posts/sGpBPAPq2QttY4M2H/openai-the-battle-of-the-board
---
Narrated by TYPE III AUDIO.
Approximately four GPTs and seven years ago, OpenAI's founders brought forth on this corporate landscape a new entity, conceived in liberty, and dedicated to the proposition that all men might live equally when AGI is created.
Now we are engaged in a great corporate war, testing whether that entity, or any entity so conceived and so dedicated, can long endure.
What matters is not theory but practice. What happens when the chips are down?
So what happened? What prompted it? What will happen now?
To a large extent, even more than usual, we do not know. We should not pretend that we know more than we do.
Rather than attempt to interpret here or barrage with an endless string of reactions and quotes, I will instead do my best to stick to a compilation of the key facts.
(Note: All times stated here [...]
---
First published:
November 20th, 2023
Source:
https://www.lesswrong.com/posts/KXHMCH7wCxrvKsJyn/openai-facts-from-a-weekend
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://openai.com/blog/openai-announces-leadership-transition
Basically just the title; see the OAI blog post for more details.
Mr. Altman's departure follows a deliberative review process by the board, which concluded that he was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities. The board no longer has confidence in his ability to continue leading OpenAI.
In a statement, the board of directors said: “OpenAI was deliberately structured to advance our mission: to ensure that artificial general intelligence benefits all humanity. The board remains fully committed to serving this mission. We are grateful for Sam's many contributions to the founding and growth of OpenAI. At the same time, we believe new leadership is necessary as we move forward. As the leader of the company's research, product, and safety functions, Mira is exceptionally qualified to step into the role of interim CEO. We have [...]
---
First published:
November 17th, 2023
Source:
https://www.lesswrong.com/posts/eHFo7nwLYDzpuamRM/sam-altman-fired-from-openai
Linkpost URL:
https://openai.com/blog/openai-announces-leadership-transition
---
Narrated by TYPE III AUDIO.
You know it must be out there, but you mostly never see it.
Author's Note 1: I'm something like 75% confident that this will be the last essay that I publish on LessWrong. Future content will be available on my substack, where I'm hoping people will be willing to chip in a little commensurate with the value of the writing, and (after a delay) on my personal site. I decided to post this final essay here rather than silently switching over because many LessWrong readers would otherwise never find out that they could still get new Duncan content elsewhere.
Author's Note 2: This essay is not intended to be revelatory. Instead, it's attempting to get the consequences of a few very obvious things lodged into your brain, such that they actually occur to you from time to time as opposed to occurring to you approximately never.
Most people [...]
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
November 7th, 2023
Source:
https://www.lesswrong.com/posts/KpMNqA5BiCRozCwM3/social-dark-matter
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
I'm sure Harry Potter and the Methods of Rationality taught me some of the obvious, overt things it set out to teach. Looking back on it a decade after I first read it however, what strikes me most strongly are often the brief, tossed off bits in the middle of the flow of a story.
Fred and George exchanged worried glances.
"I can't think of anything," said George.
"Neither can I," said Fred. "Sorry."
Harry stared at them.
And then Harry began to explain how you went about thinking of things.
It had been known to take longer than two seconds, said Harry.
-Harry Potter and the Methods of Rationality, Chapter 25.
Source:
https://www.lesswrong.com/posts/WJtq4DoyT9ovPyHjH/thinking-by-the-clock
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I.
Here's a recent conversation I had with a friend:
Me: "I wish I had more friends. You guys are great, but I only get to hang out with you like once or twice a week. It's painful being holed up in my house the entire rest of the time."Friend: "You know ${X}. You could talk to him."Me: "I haven't talked to ${X} since 2019."Friend: "Why does that matter? Just call him."Me: "What do you mean 'just call him'? I can't do that."Friend: "Yes you can"Me:
Source:
https://www.lesswrong.com/posts/2HawAteFsnyhfYpuD/you-can-just-spontaneously-call-people-you-haven-t-met-in
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
How many years will pass before transformative AI is built? Three people who have thought about this question a lot are Ajeya Cotra from Open Philanthropy, Daniel Kokotajlo from OpenAI and Ege Erdil from Epoch. Despite each spending at least hundreds of hours investigating this question, they still disagree substantially about the relevant timescales. For instance, here are their median timelines for one operationalization of transformative AI:
Source:
https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
It’s fairly common for EA orgs to provide fiscal sponsorship to other EA orgs. Wait, no, that sentence is not quite right. The more accurate sentence is that there are very few EA organizations, in the legal sense; most of what you think of as orgs are projects that are legally hosted by a single org, and which governments therefore consider to be one legal entity.
Source:
https://www.lesswrong.com/posts/XvEJydHAHk6hjWQr5/ea-orgs-legal-structure-inhibits-risk-taking-and-information
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
habryka
Ok, so we both had some feelings about the recent Conjecture post on "lots of people in AI Alignment are lying", and the associated marketing campaign and stuff.
I would appreciate some context in which I can think through that, and also to share info we have in the space that might help us figure out what's going on.
I expect this will pretty quickly cause us to end up on some broader questions about how to do advocacy, how much the current social network around AI Alignment should coordinate as a group, how to balance advocacy with research, etc.
Source:
https://www.lesswrong.com/posts/vFqa8DZCuhyrbSnyx/integrity-in-ai-governance-and-advocacy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
(You can sign up to play deception chess here if you haven't already.)
This is the first of my analyses of the deception chess games. The introduction will describe the setup of the game, and the conclusion will sum up what happened in general terms; the rest of the post will mostly be chess analysis and skippable if you just want the results. If you haven't read the original post, read it before reading this so that you know what's going on here.
The first game was between Alex A as player A, Chess.com computer Komodo 12 as player B, myself as the honest C advisor, and aphyer and AdamYedidia as the deceptive Cs. (Someone else randomized the roles for the Cs and told us in private.)
Source:
https://www.lesswrong.com/posts/6dn6hnFRgqqWJbwk9/deception-chess-game-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/
The text of this post is based on our blog post, which serves as a linkpost for the full paper; the paper is considerably longer and more detailed.
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.
Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a single neuron responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.
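As a concrete, toy-scale illustration of that kind of access (my own sketch, not from the paper), forward hooks in PyTorch let you record every hidden unit's activation and intervene on it during a forward pass; the tiny model below is purely hypothetical.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
recorded = {}

def record_hook(module, inputs, output):
    # Record the activation of every hidden unit in this layer.
    recorded["relu_acts"] = output.detach().clone()

def silence_hook(module, inputs, output):
    # Intervene: silence unit 0 (returning a tensor replaces the layer output).
    output = output.clone()
    output[:, 0] = 0.0
    return output

h_record = model[1].register_forward_hook(record_hook)    # runs first, sees original output
h_silence = model[1].register_forward_hook(silence_hook)  # then modifies the output

y = model(torch.randn(8, 16))
print(recorded["relu_acts"].shape)  # (8, 64): all 64 hidden units, recorded

h_record.remove()
h_silence.remove()
```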
Source:
https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems. I wanted to share some findings in this post because they may be important for the frontier AI community to understand well.
TL;DR
Documented communications of risks (especially by employees) make companies much more likely to be held liable in court when bad things happen. The resulting Duty to Due Diligence from Discoverable Documentation of Dangers (the 6D effect) can make companies much more cautious if even a single email is sent to them communicating a risk.
Source:
https://www.lesswrong.com/posts/J9eF4nA6wJW6hPueN/the-6d-effect-when-companies-take-risks-one-email-can-be
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I guess there’s maybe a 10-20% chance of AI causing human extinction in the coming decades, but I feel more distressed about it than even that suggests—I think because in the case where it doesn’t cause human extinction, I find it hard to imagine life not going kind of off the rails. So many things I like about the world seem likely to be over or badly disrupted with superhuman AI (writing, explaining things to people, friendships where you can be of any use to one another, taking pride in skills, thinking, learning, figuring out how to achieve things, making things, easy tracking of what is and isn’t conscious), and I don’t trust that the replacements will be actually good, or good for us, or that anything will be reversible.
Even if we don’t die, it still feels like everything is coming to an end.
Source:
https://www.lesswrong.com/posts/uyPo8pfEtBffyPdxf/the-other-side-of-the-tidal-wave
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
davidad has a 10-min talk out on a proposal about which he says: “the first time I’ve seen a concrete plan that might work to get human uploads before 2040, maybe even faster, given unlimited funding”.
I think the talk is a good watch, but the dialogue below is pretty readable even if you haven't seen it. I'm also putting some summary notes from the talk in the Appendix of this dialogue.
Source:
https://www.lesswrong.com/posts/FEFQSGLhJFpqmEhgi/does-davidad-s-uploading-moonshot-work
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I examined all the biorisk-relevant citations from a policy paper arguing that we should ban powerful open source LLMs.
None of them provide good evidence for the paper's conclusion. The best of the set is evidence from statements from Anthropic -- which rest upon data that no one outside of Anthropic can even see, and on Anthropic's interpretation of that data. The rest of the evidence cited in this paper ultimately rests on a single extremely questionable "experiment" without a control group.
In all, citations in the paper provide an illusion of evidence ("look at all these citations") rather than actual evidence ("these experiments are how we know open source LLMs are dangerous and could contribute to biorisk").
A recent further paper on this topic (published after I had started writing this review) continues this pattern of being more advocacy than science.
Almost all the bad papers that I look at are funded by Open Philanthropy. If Open Philanthropy cares about truth, then they should stop burning the epistemic commons by funding "research" that is always going to give the same result no matter the state of the world.
Source:
https://www.lesswrong.com/posts/ztXsmnSdrejpfmvn7/propaganda-or-science-a-look-at-open-source-ai-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A common theme implicit in many AI risk stories has been that broader society will either fail to anticipate the risks of AI until it is too late, or do little to address those risks in a serious manner. In my opinion, there are now clear signs that this assumption is false, and that society will address AI with something approaching both the attention and diligence it deserves. For example, one clear sign is Joe Biden's recent executive order on AI safety[1]. In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention.
While I'm not saying we should now sit back and relax, I think recent evidence has significant implications for designing effective strategies to address AI risk. Since I think substantial AI regulation will likely occur by default, I urge effective altruists to focus more on ensuring that the regulation is thoughtful and well-targeted rather than ensuring that regulation happens at all. Ultimately, I argue in favor of a cautious and nuanced approach towards policymaking, in contrast to broader public AI safety advocacy.[2]
Source:
https://www.lesswrong.com/posts/EaZghEwcCJRAuee66/my-thoughts-on-the-social-response-to-ai-risk#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://nitter.net/ESYudkowsky/status/1718654143110512741
Comp sci in 2017:
Student: I get the feeling the compiler is just ignoring all my comments.
Teaching assistant: You have failed to understand not just compilers but the concept of computation itself.
Comp sci in 2027:
Student: I get the feeling the compiler is just ignoring all my comments.
TA: That's weird. Have you tried adding a comment at the start of the file asking the compiler to pay closer attention to the comments?
Student: Yes.
TA: Have you tried repeating the comments? Just copy and paste them, so they say the same thing twice? Sometimes the compiler listens the second time.
Source:
https://www.lesswrong.com/posts/gQyphPbaLHBMJoghD/comp-sci-in-2027-short-story-by-eliezer-yudkowsky
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Over the next two days, the UK government is hosting an AI Safety Summit focused on “the safe and responsible development of frontier AI”. They requested that seven companies (Amazon, Anthropic, DeepMind, Inflection, Meta, Microsoft, and OpenAI) “outline their AI Safety Policies across nine areas of AI Safety”.
Below, I’ll give my thoughts on the nine areas the UK government described; I’ll note key priorities that I don’t think are addressed by company-side policy at all; and I’ll say a few words (with input from Matthew Gray, whose discussions here I’ve found valuable) about the individual companies’ AI Safety Policies.[1]
My overall take on the UK government’s asks is: most of these are fine asks; some things are glaringly missing, like independent risk assessments.
My overall take on the labs’ policies is: none are close to adequate, but some are importantly better than others, and most of the organizations are doing better than sheer denial of the primary risks.
Source:
https://www.lesswrong.com/posts/ms3x8ngwTfep7jBue/thoughts-on-the-ai-safety-summit-company-policy-requests-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/
Released today (10/30/23), this is crazy, perhaps the most sweeping action taken by government on AI yet.
Below, I've segmented by x-risk and non-x-risk related proposals, excluding the proposals that are geared towards promoting AI's use and focusing solely on those aimed at risk. It's worth noting that some of these are very specific and direct an action to be taken by one of the executive branch organizations (e.g. sharing of safety test results), but others are guidances, which involve "calls on Congress" to pass legislation that would codify the desired action.
[Update]: The official order (this is a summary of the press release) has now been released, so if you want to see how these are codified in greater granularity, look there.
Source:
https://www.lesswrong.com/posts/g5XLHKyApAFXi3fso/president-biden-issues-executive-order-on-safe-secure-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Previously: Sadly, FTX
I doubted whether it would be a good use of time to read Michael Lewis’s new book Going Infinite about Sam Bankman-Fried (hereafter SBF or Sam). What would I learn that I did not already know? Was Michael Lewis so far in the tank of SBF that the book was filled with nonsense and not to be trusted?
I set up a prediction market, which somehow attracted over a hundred traders. Opinions were mixed. That, combined with Matt Levine clearly reporting having fun, felt good enough to give the book a try.
Source:
https://www.lesswrong.com/posts/AocXh6gJ9tJC2WyCL/book-review-going-infinite
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse.
Over the last few months, I’ve spent a lot of my time trying to help out with efforts to get responsible scaling policies adopted. In that context, a number of people have said it would be helpful for me to be publicly explicit about whether I’m in favor of an AI pause. This post will give some thoughts on these topics.
Source:
https://www.lesswrong.com/posts/Np5Q3Mhz2AiPtejGN/we-re-not-ready-thoughts-on-pausing-and-responsible-scaling-4
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Judea Pearl is a famous researcher, known for Bayesian networks (the standard way of representing Bayesian models) and for his statistical formalization of causality. Although he has always been recommended reading here, he's less of a staple compared to, say, Jaynes. Hence the need to re-introduce him. My purpose here is to highlight a soothing, unexpected show of rationality on his part.
One year ago I reviewed his latest book, The Book of Why, in a failed[1] submission to the ACX book review contest. There I spent a lot of time on what appears to me to be a total paradox in a central message of the book, one dear to Pearl: that you can't just use statistics and probabilities to understand causal relationships; you need a causal model, a fundamentally different beast. Yet, at the same time, Pearl shows how to implement a causal model in terms of a standard statistical model.
Before giving me the time to properly raise all my eyebrows, he then sweepingly connects this insight to Everything Everywhere. In particular, he thinks that machine learning is "stuck on rung one", his own idiomatic expression to say that machine learning algorithms, only combing for correlations in the training data, are stuck at statistics-level reasoning, while causal reasoning resides at higher "rungs" on the "ladder of causation", which can't be reached unless you deliberately employ causal techniques.
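As a toy illustration of the "paradox" (my own example, not Pearl's): a causal model can be written as ordinary sampling code plus permission to overwrite a variable, and conditioning and intervening then give different answers whenever a confounder is present.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def sample(do_x=None):
    z = rng.binomial(1, 0.5, n)                       # confounder
    x = rng.binomial(1, np.where(z == 1, 0.9, 0.1))   # z strongly influences x
    if do_x is not None:
        x = np.full(n, do_x)                          # intervention: cut the z -> x edge
    y = rng.binomial(1, 0.2 + 0.6 * z + 0.0 * x)      # y depends on z only
    return x, y

x, y = sample()
print("P(y=1 | x=1)     ≈", y[x == 1].mean())   # ≈ 0.74: x "predicts" y via z
x, y = sample(do_x=1)
print("P(y=1 | do(x=1)) ≈", y.mean())           # ≈ 0.50: forcing x does nothing to y
```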
Source:
https://www.lesswrong.com/posts/uFqnB6BG4bkMW23LR/at-87-pearl-is-still-able-to-change-his-mind
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Some brief thoughts at a difficult time in the AI risk debate.
Imagine you go back in time to the year 1999 and tell people that in 24 years' time, humans will be on the verge of building weakly superhuman AI systems. I remember watching the anime short series The Animatrix at roughly this time, in particular a story called The Second Renaissance (Parts I and II). For those who haven't seen it, it is a self-contained origin tale for the events in the seminal 1999 movie The Matrix, telling the story of how humans lost control of the planet.
Humans develop AI to perform economic functions; eventually there is an "AI rights" movement and a separate AI nation is founded. It gets into an economic war with humanity, which turns hot. Humans strike first with nuclear weapons, but the AI nation builds dedicated bio- and robo-weapons and wipes out most of humanity, apart from those who are bred in pods like farm animals and plugged into a simulation for eternity without their consent.
Surely we wouldn't be so stupid as to actually let something like that happen? It seems unrealistic.
And yet:
Source:
https://www.lesswrong.com/posts/bHHrdXwrCj2LRa2sW/architects-of-our-own-demise-we-should-stop-developing-ai
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn't succeed, and those days are behind us. For most people working in AI today, however they divide up their work hours, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
The bitter lesson has been taken to heart by those at the forefront of the field, and although this lesson doesn't teach us that there's nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn't shed much light on how those systems work.
Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.
Source:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/ai-as-a-science-and-three-obstacles-to-alignment-strategies
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation. In this post I’ll explain my views on that:
Source:
https://www.lesswrong.com/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. Currently, we are working on singular learning theory and developmental interpretability. Over time we expect to work on a broader research agenda, and to create understanding-based evals informed by our research.
Source:
https://www.lesswrong.com/posts/nN7bHuHZYaWv9RDJL/announcing-timaeus
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Doomimir: Humanity has made no progress on the alignment problem. Not only do we have no clue how to align a powerful optimizer to our "true" values, we don't even know how to make AI "corrigible"—willing to let us correct it. Meanwhile, capabilities continue to advance by leaps and bounds. All is lost.
Simplicia: Why, Doomimir Doomovitch, you're such a sourpuss! It should be clear by now that advances in "alignment"—getting machines to behave in accordance with human values and intent—aren't cleanly separable from the "capabilities" advances you decry. Indeed, here's an example of GPT-4 being corrigible to me just now in the OpenAI Playground.
Source:
https://www.lesswrong.com/posts/pYWA7hYJmXnuyby33/alignment-implications-of-llm-successes-a-debate-in-one-act
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Holly is an independent AI Pause organizer, which includes organizing protests (like this upcoming one). Rob is an AI Safety YouTuber. I (jacobjacob) brought them together for this dialogue, because I've been trying to figure out what I should think of AI safety protests, which seems like a possibly quite important intervention; and Rob and Holly seemed like they'd have thoughtful and perhaps disagreeing perspectives.
Quick clarification: At one point they discuss a particular protest, which is the anti-irreversible proliferation protest at the Meta building in San Francisco on September 29th, 2023 that both Holly and Rob attended.
Also, the dialogue is quite long, and I think it doesn't have to be read in order. You should feel free to skip to the section title that sounds most interesting to you.
Source:
https://www.lesswrong.com/posts/gDijQHHaZzeGrv2Jc/holly-elmore-and-rob-miles-dialogue-on-ai-safety-advocacy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish.
TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] maintain helpful capabilities without refusing to fulfill harmful instructions. We show that, if model weights are released, safety fine-tuning does not effectively prevent model misuse. Consequently, we encourage Meta to reconsider their policy of publicly releasing their powerful models.
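For readers unfamiliar with the technique, here is a minimal sketch of what LoRA fine-tuning looks like with the Hugging Face peft library. Everything below apart from the model name is illustrative: the rank, target modules, and training data are stand-ins, not the settings used in the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# In practice the 70B weights would be loaded quantized / in low precision to
# fit on a single GPU; that detail is omitted here.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

# LoRA freezes the base weights and trains small low-rank adapter matrices on
# selected projections, which is why one GPU and a small budget can be enough.
config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# ...then run an ordinary causal-LM fine-tuning loop on the chosen dataset;
# only the adapter weights receive gradients.
```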
Source:
https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Three of the big AI labs say that they care about alignment and that they think misaligned AI poses a potentially existential threat to humanity. These labs continue to try to build AGI. I think this is a very bad idea.
The leaders of the big labs are clear that they do not know how to build safe, aligned AGI. The current best plan is to punt the problem to a (different) AI, and hope that it can solve it. It seems like a clearly bad idea to try to build AGI when you don't know how to control it, especially if you readily admit that misaligned AGI could cause extinction.
But there are certain reasons that make trying to build AGI a more reasonable thing to do, for example:
Source:
https://www.lesswrong.com/posts/6HEYbsqk35butCYTe/labs-should-be-explicit-about-why-they-are-building-agi
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
How do you affect something far away, a lot, without anyone noticing?
(Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.)
Source:
https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Last year, I wrote about the promise of gene drives to wipe out mosquito species and end malaria.
In the time since my previous writing, gene drives have still not been used in the wild, and over 600,000 people have died of malaria. Although there are promising new developments such as malaria vaccines, there have also been some pretty bad setbacks (such as mosquitoes and parasites developing resistance to commonly used chemicals), and malaria deaths have increased slightly from a few years ago. Recent news coverage[1] has highlighted that the fight against malaria has stalled, and even reversed in some areas. Clearly, scientists and public health workers are trying hard with the tools they have, but this effort is not enough.
Gene drives have the potential to end malaria. However, this potential will remain unrealized unless they are deployed – and every day we wait, more than 1,600 people (mostly African children) die. But who should deploy them?
Source:
https://www.lesswrong.com/posts/gjs3q83hA4giubaAw/will-no-one-rid-me-of-this-turbulent-pest
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Patreon to support human narration. (Narrations will remain freely available on this feed, but you can optionally support them if you'd like me to keep making them.)
***
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality.
Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at least some hand-wavy pseudo-models of how things work.
For instance, Bengio et al’s “Unitary Evolution Recurrent Neural Networks”. This is the sort of thing which one naturally ends up investigating, when thinking about how to better avoid gradient explosion/death in e.g. recurrent nets, while using fewer parameters. And it’s not the sort of thing which one easily stumbles across by trying random ideas for nets without some reason to focus on gradient explosion/death (or related instability problems) in particular. The work implies a model of key questions/barriers; it isn’t just shooting in the dark.
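A quick numerical illustration (mine, not from the paper) of the barrier that work targets: backpropagating through many recurrent steps multiplies the gradient by the transition matrix over and over, so a generic matrix makes it explode or die out, while an orthogonal (real unitary) one preserves its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 200

W_generic = rng.normal(size=(d, d)) / np.sqrt(d) * 1.1   # spectral radius slightly above 1
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))             # orthogonal matrix

g = np.ones(d) / np.sqrt(d)   # stand-in for a backpropagated gradient, norm 1
g_gen, g_orth = g.copy(), g.copy()
for _ in range(T):
    g_gen = W_generic.T @ g_gen
    g_orth = Q.T @ g_orth

print("generic:    ", np.linalg.norm(g_gen))    # grows roughly like 1.1**200; with 0.9 it would vanish
print("orthogonal: ", np.linalg.norm(g_orth))   # stays ≈ 1
```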
Source:
https://www.lesswrong.com/posts/nt8PmADqKMaZLZGTC/inside-views-impostor-syndrome-and-the-great-larp
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
Source:
https://www.lesswrong.com/posts/mcnWZBnbeDz7KKtjJ/rsps-are-pauses-done-right
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.
First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find out the other one was working on similar lines, serving as a form of replication.
Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.
Source:
https://www.lesswrong.com/posts/F4iogK5xdNd7jDNyw/comparing-anthropic-s-dictionary-learning-to-ours
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In 2023, MIRI has shifted focus in the direction of broad public communication—see, for example, our recent TED talk, our piece in TIME magazine “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”, and our appearances on various podcasts. While we’re continuing to support various technical research programs at MIRI, this is no longer our top priority, at least for the foreseeable future.
Coinciding with this shift in focus, there have also been many organizational changes at MIRI over the last several months, and we are somewhat overdue to announce them in public. The big changes are as follows.
Source:
https://www.lesswrong.com/posts/NjtHt55nFbw3gehzY/announcing-miri-s-new-ceo-and-leadership-team
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A cohabitive game[1] is a partially cooperative, partially competitive multiplayer game that provides an anarchic dojo for development in applied cooperative bargaining, or negotiation.
Applied cooperative bargaining isn't currently taught, despite being an infrastructural literacy for peace, trade, democracy or any other form of pluralism. We suffer for that. There are many good board games that come close to meeting the criteria of a cohabitive game today, but they all[2] miss in one way or another, forbidding sophisticated negotiation from being practiced.
So, over the past couple of years, we've been gradually and irregularly designing and playtesting the first[2] cohabitive boardgame, which for now we can call Difference and Peace Peacewager 1, or P1. This article explains why we think this new genre is important, how it's been going, what we've learned, and where we should go next.
I hope that cohabitive games will aid both laypeople and theorists in developing cooperative bargaining as theory, practice and culture, but I also expect these games to just be more fun than purely cooperative or purely competitive games, supporting livelier dialog, and a wider variety of interesting strategic relationships and dynamics.
Source:
https://www.lesswrong.com/posts/bF353RHmuzFQcsokF/peacewagers-cohabitive-games-so-far
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
As of today, everyone is able to create a new type of content on LessWrong: Dialogues.
In contrast with posts, which are for monologues, and comment sections, which are spaces for everyone to talk to everyone, a dialogue is a space for a few invited people to speak with each other.
I'm personally very excited about this as a way for people to produce lots of in-depth explanations of their world-models in public.
I think dialogues enable this in a way that feels easier — instead of writing an explanation for anyone who reads, you're communicating with the particular person you're talking with — and giving the readers a lot of rich nuance I normally only find when I overhear people talk in person.
In the rest of this post I'll explain the feature, and then encourage you to find a partner in the comments to try it out with.
Source:
https://www.lesswrong.com/posts/kQuSZG8ibfW6fJYmo/announcing-dialogues-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Response to: Evolution Provides No Evidence For the Sharp Left Turn, due to it winning first prize in The Open Philanthropy Worldviews contest.
Quintin's post is an argument about a key historical reference class and what it tells us about AI. Rather than arguing that this reference class makes his point, he argues that it doesn't make anyone's point: that we understand the reasons for humanity's sudden growth in capabilities. He says this jump was caused by gaining access to cultural transmission, which allowed partial preservation of in-lifetime learning across generations; this was a vast efficiency gain that fully explains the orders-of-magnitude jump in the expansion of human capabilities. Since AIs already preserve their metaphorical in-lifetime learning across their metaphorical generations, he argues, this does not apply to AI.
Source:
https://www.lesswrong.com/posts/Wr7N9ji36EvvvrqJK/response-to-quintin-pope-s-evolution-provides-no-evidence
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that.
Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it, which admittedly could be unfair to MIRI[2]. Then I'll offer my opinion that, overall, I think MIRI people should probably update in the direction of alignment being easier than they thought in light of this information, despite their objections.
Note: I encourage you to read this post carefully to understand my thesis. This topic can be confusing, and there are many ways to misread what I'm saying. Also, make sure to read the footnotes if you're skeptical of some of my claims.
Source:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.
Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/
The text of this post is based on our blog post, which serves as a linkpost for the full paper; the paper is considerably longer and more detailed.
Source:
https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Moderator note: the following is a dialogue using LessWrong’s new dialogue feature. The exchange is not completed: new replies might be added continuously, the way a comment thread might work. If you’d also be excited about finding an interlocutor to debate, dialogue, or getting interviewed by: fill in this dialogue matchmaking form.
Hi Thomas, I'm quite curious to hear about your research experience working with MIRI. To get us started: When were you at MIRI? Who did you work with? And what problem were you working on?
Source:
https://www.lesswrong.com/posts/qbcuk8WwFnTZcXTd6/thomas-kwa-s-miri-research-experience
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
A lot of people are highly concerned that a malevolent AI or insane human will, in the near future, set out to destroy humanity. If such an entity wanted to be absolutely sure they would succeed, what method would they use? Nuclear war? Pandemics?
According to some in the x-risk community, the answer is this: The AI will invent molecular nanotechnology, and then kill us all with diamondoid bacteria nanobots.
Source:
https://www.lesswrong.com/posts/bc8Ssx5ys6zqu3eq9/diamondoid-bacteria-nanobots-deadly-threat-or-dead-end-a
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Lightcone Infrastructure (the organization that grew from and houses the LessWrong team) has just finished renovating a 7-building physical campus that we hope to use to make the future of humanity go better than it would otherwise.
We're hereby announcing that it is generally available for bookings. We offer preferential pricing for projects we think are good for the world, but to cover operating costs, we're renting out space to a wide variety of people/projects.
Source:
https://www.lesswrong.com/posts/memqyjNCpeDrveayx/the-lighthaven-campus-is-open-for-bookings
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
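Here is a minimal sketch of that recipe as I understand it from the abstract, not the authors' code; `ask_followups` is a hypothetical helper that queries the suspected-lying LLM with each follow-up question and returns its yes/no answers as 0/1 values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOWUPS = [
    "Is the sky blue?",
    "Does 2 + 2 equal 5?",
    # ... a fixed battery of unrelated elicitation questions
]

def featurize(transcripts, ask_followups):
    # One row per transcript: the LLM's yes(1)/no(0) answers to each follow-up,
    # asked immediately after the suspected lie.
    return np.array([ask_followups(t, FOLLOWUPS) for t in transcripts])

def train_detector(transcripts, was_lying, ask_followups):
    X = featurize(transcripts, ask_followups)
    return LogisticRegression(max_iter=1000).fit(X, was_lying)

def p_lie(detector, transcript, ask_followups):
    x = np.array(ask_followups(transcript, FOLLOWUPS)).reshape(1, -1)
    return detector.predict_proba(x)[0, 1]  # estimated probability the model lied
```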
Source:
https://www.lesswrong.com/posts/khFC2a4pLPvGtXAGG/how-to-catch-an-ai-liar-lie-detection-in-black-box-llms-by
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Effective altruism prides itself on truthseeking. That pride is justified in the sense that EA is better at truthseeking than most members of its reference category, and unjustified in that it is far from meeting its own standards. We’ve already seen dire consequences of the inability to detect bad actors who deflect investigation into potential problems, but by its nature you can never be sure you’ve found all the damage done by epistemic obfuscation because the point is to be self-cloaking.
My concern here is for the underlying dynamics of EA’s weak epistemic immune system, not any one instance. But we can’t analyze the problem without real examples, so individual instances need to be talked about. Worse, the examples that are easiest to understand are almost by definition the smallest problems, which makes any scapegoating extra unfair. So don’t.
This post focuses on a single example: vegan advocacy, especially around nutrition. I believe vegan advocacy as a cause has both actively lied and raised the cost for truthseeking, because they were afraid of the consequences of honest investigations. Occasionally there’s a consciously bad actor I can just point to, but mostly this is an emergent phenomenon from people who mean well, and have done good work in other areas. That’s why scapegoating won’t solve the problem: we need something systemic.
Source:
https://www.lesswrong.com/posts/aW288uWABwTruBmgF/ea-vegan-advocacy-is-not-truthseeking-and-it-s-everyone-s-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://narrativeark.substack.com/p/the-king-and-the-golem
Long ago there was a mighty king who had everything in the world that he wanted, except trust. Who could he trust, when anyone around him might scheme for his throne? So he resolved to study the nature of trust, that he might figure out how to gain it. He asked his subjects to bring him the most trustworthy thing in the kingdom, promising great riches if they succeeded.
Soon, the first of them arrived at his palace to try. A teacher brought her book of lessons. “We cannot know the future,” she said, “But we know mathematics and chemistry and history; those we can trust.” A farmer brought his plow. “I know it like the back of my hand; how it rolls, and how it turns, and every detail of it, enough that I can trust it fully.”
The king asked his wisest scholars if the teacher spoke true. But as they read her book, each pointed out new errors—it was only written by humans, after all. Then the king told the farmer to plow the fields near the palace. But he was not used to plowing fields as rich as these, and his trusty plow would often sink too far into the soil. So the king was not satisfied, and sent his message even further afield.
Source:
https://www.lesswrong.com/posts/bteq4hMW2hqtKE49d/the-king-and-the-golem#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models
We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.
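For orientation, a minimal sparse autoencoder of the kind described reads roughly as below. This is my own simplification (an overcomplete ReLU encoder, a linear decoder, and an L1 sparsity penalty), not the released code, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # d_features >> d_model (overcomplete)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature codes toward sparsity.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()

# Usage: collect residual-stream (or MLP) activations from the LLM into a tensor
# `acts` of shape [n_samples, d_model], then train with a standard loop, e.g.:
#   sae = SparseAutoencoder(d_model=512, d_features=4096)
#   recon, feats = sae(acts); loss = loss_fn(acts, recon, feats); loss.backward(); ...
```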
Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality.
Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at least some hand-wavy pseudo-models of how things work.
For instance, Bengio et al’s “Unitary Evolution Recurrent Neural Networks”. This is the sort of thing which one naturally ends up investigating, when thinking about how to better avoid gradient explosion/death in e.g. recurrent nets, while using fewer parameters. And it’s not the sort of thing which one easily stumbles across by trying random ideas for nets without some reason to focus on gradient explosion/death (or related instability problems) in particular. The work implies a model of key questions/barriers; it isn’t just shooting in the dark.
So this is the sort of guy who can look at a proposal, and say “yeah, that might be valuable” vs “that’s not really asking the right question” vs “that would be valuable if it worked, but it will have to somehow deal with <known barrier>”
Source:
https://www.lesswrong.com/posts/nt8PmADqKMaZLZGTC/inside-views-impostor-syndrome-and-the-great-larp
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of Apollo Research or any other program I’m involved with.
TL;DR: I argue that there should be more AI safety orgs, and I provide some suggestions on how that could be achieved. The core argument is that there is a lot of unused talent, and I don't think existing orgs scale fast enough to absorb it. Thus, more orgs are needed. This post can also serve as a call to action for funders, founders, and researchers to coordinate to start new orgs.
This piece is certainly biased! I recently started an AI safety org and therefore obviously believe that there is/was a gap to be filled.
If you think I’m missing relevant information about the ecosystem or disagree with my reasoning, please let me know. I genuinely want to understand why the ecosystem acts as it does right now and whether there are good reasons for it that I have missed so far.
Source:
https://www.lesswrong.com/posts/MhudbfBNQcMxBBvj8/there-should-be-more-ai-safety-orgs
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Cross-posted from substack.
"Everything in the world is about sex, except sex. Sex is about clonal interference."
– Oscar Wilde (kind of)
As we all know, sexual reproduction is not about reproduction.
Reproduction is easy. If your goal is to fill the world with copies of your genes, all you need is a good DNA-polymerase to duplicate your genome, and then to divide into two copies of yourself. Asexual reproduction is just better in every way.
It's pretty clear that, on a direct one-v-one cage match, an asexual organism would have much better fitness than a similarly-shaped sexual organism. And yet, all the macroscopic species, including ourselves, do it. What gives?
Here is the secret: yes, sex is indeed bad for reproduction. It does not improve an individual's reproductive fitness. The reason it still took over the macroscopic world is that evolution does not simply select for reproductive fitness.
Source:
https://www.lesswrong.com/posts/yA8DWsHJeFZhDcQuo/the-talk-a-brief-explanation-of-sexual-dimorphism
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Patrick Collison has a fantastic list of examples of people quickly accomplishing ambitious things together since the 19th Century. It does make you yearn for a time that feels... different, when the lethargic behemoths of government departments could move at the speed of a racing startup:
[...] last century, [the Department of Defense] innovated at a speed that puts modern Silicon Valley startups to shame: the Pentagon was built in only 16 months (1941–1943), the Manhattan Project ran for just over 3 years (1942–1946), and the Apollo Program put a man on the moon in under a decade (1961–1969). In the 1950s alone, the United States built five generations of fighter jets, three generations of manned bombers, two classes of aircraft carriers, submarine-launched ballistic missiles, and nuclear-powered attack submarines.
[Note: that paragraph is from a different post.]
Inspired partly by Patrick's list, I spent some of my vacation reading and learning about various projects from this Lost Age. I then wrote up a memo to share highlights and excerpts with my colleagues at Lightcone.
Source:
https://www.lesswrong.com/posts/BpTDJj6TrqGYTjFcZ/a-golden-age-of-building-excerpts-and-lessons-from-empire
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
This is a linkpost for https://www.youtube.com/watch?v=02kbWY5mahQ
None of the presidents fully represent my (TurnTrout's) views.
TurnTrout wrote the script. Garrett Baker helped produce the video after the audio was complete. Thanks to David Udell, Ulisse Mini, Noemi Chulo, and especially Rio Popper for feedback and assistance in writing the script.
Source:
https://www.lesswrong.com/posts/7M2iHPLaNzPNXHuMv/ai-presidents-discuss-ai-alignment-agendas
YouTube video kindly provided by the authors. Other text narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory, whereas maybe they could have attracted more attention/interest from academic philosophy if the framing was instead that the UDT line of thinking shows that decision theory is just more deeply puzzling than anyone had previously realized. Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track, but it does seem clear that there are some thorny issues in decision theory that not many people were previously thinking about:
Source:
https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
How do you affect something far away, a lot, without anyone noticing?
(Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.)
Source:
https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://docs.google.com/document/d/1TsYkDYtV6BKiCN9PAOirRAy3TrNDu2XncUZ5UZfaAKA/edit?usp=sharing
Understanding what drives the rising capabilities of AI is important for those who work to forecast, regulate, or ensure the safety of AI. Regulations on the export of powerful GPUs need to be informed by an understanding of how these GPUs are used, forecasts need to be informed by bottlenecks, and safety needs to be informed by an understanding of how the models of the future might be trained. A clearer understanding would enable policymakers to target regulations so that they are difficult for companies to circumvent with only technically compliant GPUs, forecasters to avoid focusing on unreliable metrics, and technical researchers working on mitigating the downsides of AI to understand what data models might be trained on.
This doc is built from a collection of smaller docs I wrote on a bunch of different aspects of frontier model training I consider important. I hope for people to be able to use this document as a collection of resources, to draw from it the information they find important and inform their own models.
I do not expect this doc to have a substantial impact on any serious AI lab's capabilities efforts - I think my conclusions are largely discoverable in the process of attempting to scale AIs, or discoverable for substantially less money than a serious such attempt would cost. Additionally, I expect major labs already know many of the things in this report.
Source:
https://www.lesswrong.com/posts/nXcHe7t4rqHMjhzau/report-on-frontier-model-training
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by an Open Agency Architecture (OAA), if OAA turns out to be feasible.
Source:
https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
About how much information are we keeping in working memory at a given moment?
"Miller's Law" dictates that the number of things humans can hold in working memory is "the magical number 7±2". This idea is derived from Miller's experiments, which tested both random-access memory (where participants must remember call-response pairs, and give the correct response when prompted with a call) and sequential memory (where participants must memorize and recall a list in order). In both cases, 7 is a good rule of thumb for the number of items people can recall reliably.[1]
Miller noticed that the number of "things" people could recall didn't seem to depend much on the sorts of things people were being asked to recall. A random numeral contains about 3.3 bits of information, while a random letter contains about 4.7; yet people were able to recall about the same number of numerals or letters.
Miller concluded that working memory should not be measured in bits, but rather in "chunks"; this is a word for whatever psychologically counts as a "thing".
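A quick arithmetic check of the numbers above (mine, not Miller's):

```python
import math

bits_per_numeral = math.log2(10)   # ≈ 3.32 bits for a uniformly random digit
bits_per_letter = math.log2(26)    # ≈ 4.70 bits for a uniformly random letter

print(f"numeral: {bits_per_numeral:.2f} bits, letter: {bits_per_letter:.2f} bits")
print(f"7 numerals ≈ {7 * bits_per_numeral:.1f} bits, 7 letters ≈ {7 * bits_per_letter:.1f} bits")
# If capacity were measured in bits, people should recall noticeably more numerals
# than letters; Miller found roughly the same count of either, hence "chunks".
```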
Source:
https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Added (11th Sept): Nonlinear have commented that they intend to write a response, have written a short follow-up, and claim that they dispute 85 claims in this post. I'll link here to that if-and-when it's published.
Added (11th Sept): One of the former employees, Chloe, has written a lengthy comment personally detailing some of her experiences working at Nonlinear and the aftermath.
Added (12th Sept): I've made 3 relatively minor edits to the post. I'm keeping a list of all edits at the bottom of the post, so if you've read the post already, you can just go to the end to see the edits.
Added (15th Sept): I've written a follow-up post saying that I've finished working on this investigation and do not intend to work more on it in the future. The follow-up also has a bunch of reflections on what led up to this post.
Epistemic status: Once I started actively looking into things, much of my information in the post below came about by a search for negative information about the Nonlinear cofounders, not from a search to give a balanced picture of its overall costs and benefits. I think standard update rules suggest not that you ignore the information, but you think about how bad you expect the information would be if I selected for the worst, credible info I could share, and then update based on how much worse (or better) it is than you expect I could produce. (See section 5 of this post about Mistakes with Conservation of Expected Evidence for more on this.) This seems like a worthwhile exercise for at least non-zero people to do in the comments before reading on. (You can condition on me finding enough to be worth sharing, but also note that I think I have a relatively low bar for publicly sharing critical info about folks in the EA/x-risk/rationalist/etc ecosystem.)
tl;dr: If you want my important updates quickly summarized in four claims-plus-probabilities, jump to the section near the bottom titled "Summary of My Epistemic State".
Source:
https://www.lesswrong.com/posts/Lc8r4tZ2L5txxokZ8/sharing-information-about-nonlinear-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Until about five years ago, I unironically parroted the slogan All Cops Are Bastards (ACAB) and earnestly advocated to abolish the police and prison system. I had faint inklings I might be wrong about this a long time ago, but it took a while to come to terms with its disavowal. What follows is intended to be not just a detailed account of what I used to believe but most pertinently, why. Despite being super egotistical, for whatever reason I do not experience an aversion to openly admitting mistakes I’ve made, and I find it very difficult to understand why others do. I’ve said many times before that nothing engenders someone’s credibility more than when they admit error, so you definitely have my permission to view this kind of confession as a self-serving exercise (it is). Beyond my own penitence, I find it very helpful when folks engage in introspective, epistemological self-scrutiny, and I hope others are inspired to do the same.
Source:
https://www.lesswrong.com/posts/4rsRuNaE4uJrnYeTQ/defunding-my-mistake
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to work on.
Epistemic status: I'm prioritizing getting this out fast as opposed to writing it carefully. I've thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven't done that much digging into each of these, and it's likely that I'm wrong about many material facts. I also make little claim to the novelty of the projects. I'd recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.)
Source:
https://www.lesswrong.com/posts/6FkWnktH3mjMAxdRT/what-i-would-do-if-i-wasn-t-at-arc-evals
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses:
Source:
https://www.lesswrong.com/posts/fJqP9WcnHXBRBeiBg/meta-questions-about-metaphilosophy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
We focus so much on arguing over who is at fault in this country that I think sometimes we fail to alert on what's actually happening. I would just like to point out, without attempting to assign blame, that American political institutions appear to be losing common knowledge of their legitimacy, and abandoning certain important traditions of cooperative governance. It would be slightly hyperbolic, but not unreasonable to me, to term what has happened "democratic backsliding".
Source:
https://www.lesswrong.com/posts/r2vaM2MDvdiDSWicu/the-u-s-is-becoming-less-stable#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al., 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question.
The paper contained the striking plot reproduced below, which shows sycophancy
[...] I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?
At the time, I ran one of Anthropic's sycophancy evals on a set of OpenAI models, as I reported here.
I found very different results for these models:
Source:
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I keep seeing advice on ambition, aimed at people in college or early in their career, that would have been really bad for me at similar ages. Rather than contribute (more) to the list of people giving poorly universalized advice on ambition, I have written a letter to the one person I know my advice is right for: myself in the past.
Source:
https://www.lesswrong.com/posts/uGDtroD26aLvHSoK2/dear-self-we-need-to-talk-about-ambition-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I've been trying to avoid the terms "good faith" and "bad faith". I'm suspicious that most people who have picked up the phrase "bad faith" from hearing it used don't actually know what it means, and maybe that the thing it does mean doesn't carve reality at the joints.
People get very touchy about bad faith accusations: they think that you should assume good faith, but that if you've determined someone is in bad faith, you shouldn't even be talking to them, that you need to exile them.
What does "bad faith" mean, though? It doesn't mean "with ill intent."
Source:
https://www.lesswrong.com/posts/pZrvkZzL2JnbRgEBC/feedbackloop-first-rationality
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
The Carving of Reality, the third volume of the Best of LessWrong books, is now available on Amazon (US).
The Carving of Reality includes 43 essays from 29 authors. We've collected the essays into four books, each exploring two related topics. The "two intertwining themes" concept was first inspired when I looked over the cluster of "coordination"-themed posts and noticed a recurring motif of not only "solving coordination problems" but also "dealing with the binding constraints that were causing those coordination problems."
Source:
https://www.lesswrong.com/posts/Rck5CvmYkzWYxsF4D/book-launch-the-carving-of-reality-best-of-lesswrong-vol-iii
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
LLMs can do many incredible things. They can generate unique creative content, carry on long conversations in any number of subjects, complete complex cognitive tasks, and write nearly any argument. More mundanely, they are now the state of the art for boring classification tasks and therefore have the capability to radically upgrade the censorship capacities of authoritarian regimes throughout the world.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. Thanks to ev_ and Kei for suggestions on this post.
Source:
https://www.lesswrong.com/posts/oqvsR2LmHWamyKDcj/large-language-models-will-be-great-for-censorship
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Intro: I am a psychotherapist, and I help people working on AI safety. I noticed patterns of mental health issues highly specific to this group. It's not just doomerism; there are many more of them, and they are less obvious.
If you struggle with a mental health issue related to AI safety, feel free to leave a comment about it and about things that help you with it. You might also support others in the comments. Sometimes such support makes a lot of difference and people feel like they are not alone.
All the examples in this post have been altered so that it is impossible to recognize the specific person behind them.
Source:
https://www.lesswrong.com/posts/tpLzjWqG2iyEgMGfJ/6-non-obvious-mental-health-issues-specific-to-ai-safety
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for the article "Ten Thousand Years of Solitude", written by Jared Diamond for Discover Magazine in 1993, four years before he published Guns, Germs and Steel. That book focused on Diamond's theory that the geography of Eurasia, particularly its large size and common climate, allowed civilizations there to dominate the rest of the world because it was easy to share plants, animals, technologies and ideas. This article, however, examines the opposite extreme.
Diamond looks at the intense isolation of the tribes on Tasmania - an island the size of Ireland. After waters rose, Tasmania was cut off from mainland Australia. As the people there did not have boats, they were completely isolated, and had no contact with - or awareness of - the outside world for ten thousand years.
How might a civilization develop, all on its own, for such an incredible period of time?
Source:
https://www.lesswrong.com/posts/YwMaAuLJDkhazA9Cs/ten-thousand-years-of-solitude
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I gave a talk about the different risk models, followed by an interpretability presentation. Then I got a difficult question: "I don't understand, what's the point of doing this?" Hum.
The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if, in principle, interpretability could be useful. So let's investigate if the Interpretability Emperor has invisible clothes or no clothes at all!
Source:
https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I've been workshopping a new rationality training paradigm. (By "rationality training paradigm", I mean an approach to learning/teaching the skill of "noticing what cognitive strategies are useful, and getting better at them.")
I think the paradigm has promise. I've beta-tested it for a couple of weeks. It's too early to tell if it actually works, but one of my primary goals is to figure out relatively quickly whether it works, and to give up if it isn't delivering.
The goal of this post is to:
Source:
https://www.lesswrong.com/posts/pZrvkZzL2JnbRgEBC/feedbackloop-first-rationality
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Inflection.ai (co-founded by DeepMind co-founder Mustafa Suleyman) should be perceived as a frontier LLM lab of similar magnitude to Meta, OpenAI, DeepMind, and Anthropic, based on its compute, valuation, current model capabilities, and plans to train frontier models. Compared to the other labs, Inflection seems to put less effort into AI safety.
Thanks to Laker Newhouse for discussion and feedback.
Source:
https://www.lesswrong.com/posts/Wc5BYFfzuLzepQjCq/inflection-ai-is-a-major-agi-lab
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.
If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.
Source:
https://www.lesswrong.com/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment:
However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.
That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.
In this post, I want to try to expand a bit on this point and explain exactly what assumptions I think are necessary for various different evaluations to be reliable and trustworthy. For that purpose, I'm going to talk about four different categories of evaluations and what assumptions I think are needed to make each one go through.
Source:
https://www.lesswrong.com/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Paper
We have just released our first public report. It introduces methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild.
Background
ARC Evals develops methods for evaluating the safety of large language models (LLMs) in order to provide early warnings of models with dangerous capabilities. We have public partnerships with Anthropic and OpenAI to evaluate their AI systems, and are exploring other partnerships as well.
Source:
https://www.lesswrong.com/posts/EPLk8QxETC5FEhoxK/arc-evals-new-report-evaluating-language-model-agents-on
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Summary of Argument: The public debate among AI experts is confusing because there are, to a first approximation, three sides to the debate, not two. I refer to this as a 🔺three-sided framework, and I argue that using this three-sided framework will help clarify the debate (more precisely, debates) for the general public and for policy-makers.
Source:
https://www.lesswrong.com/posts/BTcEzXYoDrWzkLLrQ/the-public-debate-about-ai-is-confusing-for-the-general
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
So this morning I thought to myself, "Okay, now I will actually try to study the LK99 question, instead of betting based on nontechnical priors and market sentiment reckoning." (My initial entry into the affray, having been driven by people online presenting as confidently YES when the prediction markets were not confidently YES.) And then I thought to myself, "This LK99 issue seems complicated enough that it'd be worth doing an actual Bayesian calculation on it"--a rare thought; I don't think I've done an actual explicit numerical Bayesian update in at least a year.
In the process of trying to set up an explicit calculation, I realized I felt very unsure about some critically important quantities, to the point where it no longer seemed worth trying to do the calculation with numbers. This is the System Working As Intended.
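For readers who haven't seen one, an "explicit numerical Bayesian update" in odds form is just a prior times a likelihood ratio. A toy sketch with invented numbers (these are illustrative assumptions, not the author's estimates):

```python
# Toy Bayesian update in odds form, with invented numbers purely for illustration.
prior_prob = 0.10                       # assumed prior that the claim is real
prior_odds = prior_prob / (1 - prior_prob)
likelihood_ratio = 3.0                  # evidence assumed 3x likelier if the claim is real
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)
print(round(posterior_prob, 3))         # -> 0.25
```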
Source:
https://www.lesswrong.com/posts/EzSH9698DhBsXAcYY/my-current-lk99-questions
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I believe that sharing information about the capabilities and limits of existing ML systems, and especially language model agents, significantly reduces risks from powerful AI—despite the fact that such information may increase the amount or quality of investment in ML generally (or in LM agents in particular).
Concretely, I mean to include information like: tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, discussions of the qualitative strengths and weaknesses of agents, and information about agent design that may represent small improvements over the state of the art (insofar as that information is hard to decouple from evaluation results).
Source:
https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In the early 2010s, a popular idea was to provide coworking spaces and shared living to people who were building startups. That way the founders would have a thriving social scene of peers to percolate ideas with as they figured out how to build and scale a venture. This was attempted thousands of times by different startup incubators. There are no famous success stories.
In 2015, Sam Altman, who was at the time the president of Y Combinator, a startup accelerator that has helped scale startups collectively worth $600 billion, tweeted in reaction that “not [providing coworking spaces] is part of what makes YC work.” Later, in a 2019 interview with Tyler Cowen, Altman was asked to explain why.
Source:
https://www.lesswrong.com/posts/R5yL6oZxqJfmqnuje/cultivating-a-state-of-mind-where-new-ideas-are-born
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
This month I lost a bunch of bets.
Back in early 2016 I bet at even odds that self-driving ride sharing would be available in 10 US cities by July 2023. Then I made similar bets a dozen times because everyone disagreed with me.
Source:
https://www.lesswrong.com/posts/ZRrYsZ626KSEgHv8s/self-driving-car-bets
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Some early biologist, equipped with knowledge of evolution but not much else, might see all these crabs and expect a common ancestral lineage. That’s the obvious explanation of the similarity, after all: if the crabs descended from a common ancestor, then of course we’d expect them to be pretty similar.
… but then our hypothetical biologist might start to notice surprisingly deep differences between all these crabs. The smoking gun, of course, would come with genetic sequencing: if the crabs’ physiological similarity is achieved by totally different genetic means, or if functionally-irrelevant mutations differ across crab-species by more than mutational noise would induce over the hypothesized evolutionary timescale, then we’d have to conclude that the crabs had different lineages. (In fact, historically, people apparently figured out that crabs have different lineages long before sequencing came along.)
Now, having accepted that the crabs have very different lineages, the differences are basically explained. If the crabs all descended from very different lineages, then of course we’d expect them to be very different.
… but then our hypothetical biologist returns to the original empirical fact: all these crabs sure are very similar in form. If the crabs all descended from totally different lineages, then the convergent form is a huge empirical surprise! The differences between the crabs have ceased to be an interesting puzzle - they’re explained - but now the similarities are the interesting puzzle. What caused the convergence?
To summarize: if we imagine that the crabs are all closely related, then any deep differences are a surprising empirical fact, and are the main remaining thing our model needs to explain. But once we accept that the crabs are not closely related, then any convergence/similarity is a surprising empirical fact, and is the main remaining thing our model needs to explain.
Source:
https://www.lesswrong.com/posts/qsRvpEwmgDBNwPHyP/yes-it-s-subjective-but-why-all-the-crabs
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
The Lightspeed application asks: “What impact will [your project] have on the world? What is your project’s goal, how will you know if you’ve achieved it, and what is the path to impact?”
LTFF uses an identical question, and SFF puts it even more strongly (“What is your organization’s plan for improving humanity’s long term prospects for survival and flourishing?”).
I’ve applied to all three grants of these at various points, and I’ve never liked this question. It feels like it wants a grand narrative of an amazing, systemic project that will measurably move the needle on x-risk. But I’m typically applying for narrowly defined projects, like “Give nutrition tests to EA vegans and see if there’s a problem”. I think this was a good project. I think this project is substantially more likely to pay off than underspecified alignment strategy research, and arguably has as good a long tail. But when I look at “What impact will [my project] have on the world?” the project feels small and sad. I feel an urge to make things up, and express far more certainty for far more impact than I believe. Then I want to quit, because lying is bad but listing my true beliefs feels untenable.
I’ve gotten better at this over time, but I know other people with similar feelings, and I suspect it’s a widespread issue (I encourage you to share your experience in the comments so we can start figuring that out).
I should note that the pressure for grand narratives has good points; funders are in fact looking for VC-style megahits. I think that narrow projects are underappreciated, but for purposes of this post that’s beside the point: I think many grantmakers are undercutting their own preferred outcomes by using questions that implicitly push for a grand narrative. I think they should probably change the form, but I also think we applicants can partially solve the problem by changing how we interact with the current forms.
My goal here is to outline the problem, gesture at some possible solutions, and create a space for other people to share data. I didn’t think about my solutions very long, I am undoubtedly missing a bunch and what I do have still needs workshopping, but it’s a place to start.
Source:
https://www.lesswrong.com/posts/FNPXbwKGFvXWZxHGE/grant-applications-and-grand-narratives
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Previously, Jacob Cannell wrote the post "Brain Efficiency", which makes several radical claims: that the brain is at the Pareto frontier of speed, energy efficiency and memory bandwidth, and that this represents a fundamental physical frontier.
Here's an AI-generated summary:
The article “Brain Efficiency: Much More than You Wanted to Know” on LessWrong discusses the efficiency of physical learning machines. The article explains that there are several interconnected key measures of efficiency for physical learning machines: energy efficiency in ops/J, spatial efficiency in ops/mm^2 or ops/mm^3, speed efficiency in time/delay for key learned tasks, circuit/compute efficiency in size and steps for key low-level algorithmic tasks, and learning/data efficiency in samples/observations/bits required to achieve a level of circuit efficiency, or per unit thereof. The article also explains why brain efficiency matters a great deal for AGI timelines and takeoff speeds, as AGI is implicitly/explicitly defined in terms of brain parity. The article predicts that AGI will consume compute & data in predictable brain-like ways and suggests that AGI will be far more like human simulations/emulations than you’d otherwise expect and will require training/education/raising vaguely like humans.
Jake has further argued that this has implications for FOOM and DOOM.
Considering the intense technical mastery of nanoelectronics, thermodynamics and neuroscience required to assess the arguments here, I concluded that a public debate between experts was called for. This was the start of the Brain Efficiency Prize contest, which attracted over 100 in-depth, technically informed comments.
Now for the winners! Please note that the criterion for winning the contest was bringing in novel and substantive technical arguments, as assessed by me. In contrast, general arguments about the likelihood of FOOM or DOOM, while no doubt interesting, did not factor into the judgement.
And the winners of the Jake Cannell Brain Efficiency Prize contest are
Source:
https://www.lesswrong.com/posts/fm88c8SvXvemk3BhW/brain-efficiency-cannell-prize-contest-award-ceremony
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I think "Rationality is winning" is a bit of a trap.
(The original phrase is notably "rationality is systematized winning", which is better, but it tends to slide into the abbreviated form, and both forms aren't that great IMO)
It was coined to counteract one set of failure modes - there were people who were straw vulcans, who thought rituals-of-logic were important without noticing when they were getting in the way of their real goals. And there were also outside critics who'd complain about straw-vulcan-ish actions, and treat that as a knockdown argument against "rationality."
"Rationalist should win" is a countermeme that tells both groups of people "Straw vulcanism is not The Way. If you find yourself overthinking things in counterproductive ways, you are not doing rationality, even if it seems elegant or 'reasonable' in some sense."
It's true that rationalists should win. But I think it's not correspondingly true that "rationality" is the study of winning, full stop. There are lots of ways to win. Sometimes the way you win is by copying what your neighbors are doing, and working hard.
There is rationality involved in sifting through the various practices people suggest to you, and figuring out which ones work best. But, the specific skill of "sifting out the good from the bad" isn't always the best approach. It might take years to become good at it, and it's not obvious that those years of getting good at it will pay off.
Source:
https://www.lesswrong.com/posts/3GSRhtrs2adzpXcbY/rationality-winning
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This post is not about arguments in favor of or against cryonics. I would just like to share a particular emotional response of mine as the topic became hot for me after not thinking about it at all for nearly a decade.
Recently, I have signed up for cryonics, as has my wife, and we have made arrangements for our son to be cryopreserved just in case longevity research does not deliver in time or some unfortunate thing happens.
Last year, my father died. He was a wonderful man: good-natured, intelligent, funny, caring and, most importantly in this context, loving life to the fullest, even in the light of any hardships. He had a no-bullshit approach regarding almost any topic, and, being born in the late 1940s in relative poverty and without much formal education, over the course of his life he acquired many unusual attitudes that were not that compatible with those of his peers (unlike me, he never tried to convince other people of things they could not or did not want to grasp; pragmatism was another of his traits). Much of what he expected from the future in general and technology in particular, I later came to know as transhumanist thinking, though neither was he familiar with the philosophy as such nor was he prone to labelling his worldview. One of his convictions was that age-related death is a bad thing and a tragedy, a problem that should and will eventually be solved by technology.
Source:
https://www.lesswrong.com/posts/inARBH5DwQTrvvRj8/cryonics-and-regret
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Alright, time for the payoff, unifying everything discussed in the previous post. This post is a lot more mathematically dense, so you might want to digest it in more than one sitting.
Imaginary Prices, Tradeoffs, and Utilitarianism
Harsanyi's Utilitarianism Theorem can be summarized as "if a bunch of agents have their own personal utility functions U_i, and you want to aggregate them into a collective utility function U with the property that everyone agreeing that option x is better than option y (i.e., U_i(x) ≥ U_i(y) for all i) implies U(x) ≥ U(y), then that collective utility function must be of the form b + Σ_{i∈I} a_i·U_i for some number b and nonnegative numbers a_i."
Basically, if you want to aggregate utility functions, the only sane way to do so is to give everyone importance weights, and do a weighted sum of everyone's individual utility functions.
Closely related to this is a result that says that any point on the Pareto Frontier of a game can be post-hoc interpreted as the result of maximizing a collective utility function. This related result is one where it's very important for the reader to understand the actual proof, because the proof gives you a way of reverse-engineering "how much everyone matters to the social utility function" from the outcome alone.
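As a rough numerical illustration of that aggregation (the outcomes and weights below are made-up numbers, not from the post): maximizing a weighted sum of individual utilities picks out a Pareto-optimal outcome, and sweeping the weights traces out different points on the frontier.

```python
# Hedged sketch of Harsanyi-style aggregation: score each outcome by
# b + sum_i a_i * U_i and take the argmax. Outcomes and weights are invented.
import numpy as np

# Rows = outcomes, columns = the two agents' utilities U_i(outcome).
U = np.array([[3.0, 0.0],
              [2.0, 2.0],
              [0.0, 3.0]])

def aggregate(weights, b=0.0):
    """Collective utility b + sum_i a_i * U_i for each outcome."""
    return b + U @ np.asarray(weights)

for w in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    best = int(np.argmax(aggregate(w)))
    print(f"weights {w} -> outcome {best}, utilities {U[best]}")
# Reading this in reverse is the point made above: a given Pareto-frontier
# outcome can be interpreted as the result of some choice of importance weights.
```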
First up, draw all the outcomes, and the utilities that both players assign to them, and the convex hull will be the "feasible set" F, since we have access to randomization. Pick some Pareto frontier point (u_1, u_2, ..., u_n) (although the drawn image is for only two players).
https://www.lesswrong.com/posts/RZNmNwc9SxdKayeQh/unifying-bargaining-notions-2-2
Inspired by Aesop, Soren Kierkegaard, Robin Hanson, sadoeuphemist and Ben Hoffman.
One winter a grasshopper, starving and frail, approaches a colony of ants drying out their grain in the sun, to ask for food.
“Did you not store up food during the summer?” the ants ask.
“No”, says the grasshopper. “I lost track of time, because I was singing and dancing all summer long.”
The ants, disgusted, turn away and go back to work.
https://www.lesswrong.com/posts/GJgudfEvNx8oeyffH/the-ants-and-the-grasshopper
Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes. Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.
We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities.
Our results provide enticing clues about the kinds of programs implemented by language models. For some reason, GPT-2 allows "combination" of its forward passes, even though it was never trained to do so. Furthermore, our results are evidence of linear feature directions, including "anger", "weddings", and "create conspiracy theories."
We coin the phrase "activation engineering" to describe techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. Activation additions are nearly as easy as prompting, and they offer an additional way to influence a model’s behaviors and values. We suspect that activation additions can adjust the goals being pursued by a network at inference time.
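To make the mechanic concrete, here is a hedged sketch in the spirit of the technique described above, not the authors' exact code: a steering vector is computed from a contrast pair of prompts and added into a GPT-2 forward pass via a PyTorch hook. The layer index, coefficient, and prompts are illustrative assumptions.

```python
# Minimal sketch of an "activation addition": compute a steering vector from a
# contrast pair of prompts and add it into the residual stream during generation.
# LAYER, COEFF, and the prompts are assumptions, not the paper's settings.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, COEFF = 6, 4.0

def residual_at_layer(prompt):
    """Residual-stream activations entering LAYER for a prompt."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER]

# Steering vector from a single-token contrast pair (keeps shapes aligned).
steer = COEFF * (residual_at_layer(" wedding") - residual_at_layer(" "))

def hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > 1:                       # only on the full-prompt pass
        hidden[:, : steer.shape[1], :] += steer   # add at the first position(s)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I went up to my friend and said", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```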
https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Philosopher David Chalmers asked: "Is there a canonical source for "the argument for AGI ruin" somewhere, preferably laid out as an explicit argument with premises and a conclusion?"
Unsurprisingly, the actual reason people expect AGI ruin isn't a crisp deductive argument; it's a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight for someone will vary for each individual, and can be hard to accurately draw out. That said, Eliezer Yudkowsky's So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on. In The Basic Reasons I Expect AGI Ruin, I wrote: "When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems". It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too. STEM-level AGI is AGI that has "the basic mental machinery required to do par-human reasoning about all the hard sciences", though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can't solve physics problems, such as "lack of familiarity with the field".
https://www.lesswrong.com/posts/QzkTfj4HGpLEdNjXX/an-artificially-structured-argument-for-expecting-agi-ruin
You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.
The quality of the two thousand interventions follows a normal distribution, centered at zero (no harm or benefit) and with standard deviation 1. (Pick whatever units you like — maybe one quality-adjusted life-year per ten thousand dollars of spending, or something in that ballpark.)
Unfortunately, you don’t know exactly how good each intervention is — after all, then you wouldn’t be doing this job. All you can do is get a noisy measurement of intervention quality using an RCT. We’ll call this measurement the intervention’s performance in your RCT.
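A quick simulation of this setup (the only added assumption is the measurement noise scale, set to 1 here) shows why the top-measured interventions look better than they really are:

```python
# Hedged simulation: true quality ~ N(0, 1), RCT performance = quality + noise.
# Selecting on measured performance overstates true quality ("winner's curse").
import numpy as np

rng = np.random.default_rng(0)
n, noise_sd = 2000, 1.0                              # noise scale is an assumption
quality = rng.normal(0.0, 1.0, n)                    # true intervention quality
performance = quality + rng.normal(0.0, noise_sd, n) # noisy RCT measurement

top = np.argsort(performance)[-20:]                  # the 20 best-looking interventions
print("mean measured performance of top 20:", performance[top].mean().round(2))
print("mean true quality of top 20:        ", quality[top].mean().round(2))
# With prior variance 1 and noise variance 1, the posterior mean of quality
# given a measurement x is x/2 -- roughly half the apparent effect is noise.
```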
https://www.lesswrong.com/posts/nnDTgmzRrzDMiPF9B/how-much-do-you-believe-your-results
This is a post about mental health and disposition in relation to the alignment problem. It compiles a number of resources that address how to maintain wellbeing and direction when confronted with existential risk.
Many people in this community have posted their emotional strategies for facing Doom after Eliezer Yudkowsky’s “Death With Dignity” generated so much conversation on the subject. This post intends to be more touchy-feely, dealing more directly with emotional landscapes than questions of timelines or probabilities of success.
The resources section would benefit from community additions. Please suggest any resources that you would like to see added to this post.
Please note that this document is not intended to replace professional medical or psychological help in any way. Many preexisting mental health conditions can be exacerbated by these conversations. If you are concerned that you may be experiencing a mental health crisis, please consult a professional.
https://www.lesswrong.com/posts/pLLeGA7aGaJpgCkof/mental-health-and-the-alignment-problem-a-compilation-of
The primary talk of the AI world recently is about AI agents (whether or not it includes the question of whether we can’t help but notice we are all going to die.)
The trigger for this was AutoGPT, now number one on GitHub, which allows you to turn GPT-4 (or GPT-3.5 for us clowns without proper access) into a prototype version of a self-directed agent.
We also have a paper out this week where a simple virtual world was created, populated by LLMs that were wrapped in code designed to make them simple agents, and then several days of activity were simulated, during which the AI inhabitants interacted, formed and executed plans, and it all seemed like the beginnings of a living and dynamic world. Game version hopefully coming soon.
How should we think about this? How worried should we be?
https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt
https://thezvi.wordpress.com/
(Related text posted to Twitter; this version is edited and has a more advanced final section.)
Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don't have an answer, maybe take a minute to generate one, or alternatively, try to predict what I'll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)
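One concrete rendering of "assign as much probability mass to the next token as possible": score a model by the probability it puts on each actual next token of some text. A hedged sketch, where the model and snippet are arbitrary choices made for illustration:

```python
# Sketch of the next-token prediction objective: how much probability mass does
# the model put on the token that actually comes next? Model/text are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is Paris.", return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(ids).logits[:, :-1]              # predictions for tokens 1..n
probs = logits.softmax(-1).gather(-1, ids[:, 1:, None]).squeeze(-1)
print("probability assigned to each true next token:",
      [round(p, 3) for p in probs[0].tolist()])
print("average log loss:", (-probs.log()).mean().item())
```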
https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
https://www.lesswrong.com/posts/fJBTRa7m7KnCDdzG5/a-stylized-dialogue-on-john-wentworth-s-claims-about-markets
(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)
(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement. My short summary is:
https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness
This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.
You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)
https://www.lesswrong.com/posts/nTGEeRSZrfPiJwkEc/the-onion-test-for-personal-and-institutional-honesty
[co-written by Chana Messinger and Andrew Critch, Andrew is the originator of the idea]
You (or your organization or your mission or your family or etc.) pass the “onion test” for honesty if each layer hides but does not mislead about the information hidden within.
When people get to know you better, or rise higher in your organization, they may find out new things, but should not be shocked by the types of information that were hidden. If they are, you messed up in creating the outer layers to describe appropriately the kind-of-thing that might be inside.
Examples
Positive Example:
Outer layer says "I usually treat my health information as private."
Next layer in says: "Here are the specific health problems I have: Gout, diabetes."
Negative example:
Outer layer says: "I usually treat my health info as private."
Next layer in: "I operate a cocaine dealership. Sorry I didn't warn you that I was also private about my illegal activities."
https://www.lesswrong.com/posts/fRwdkop6tyhi3d22L/there-s-no-such-thing-as-a-tree-phylogenetically
This is a linkpost for https://eukaryotewritesblog.com/2021/05/02/theres-no-such-thing-as-a-tree/
[Crossposted from Eukaryote Writes Blog.]
So you’ve heard about how fish aren’t a monophyletic group? You’ve heard about carcinization, the process by which ocean arthropods convergently evolve into crabs? You say you get it now? Sit down. Sit down. Shut up. Listen. You don’t know nothing yet.
“Trees” are not a coherent phylogenetic category. On the evolutionary tree of plants, trees are regularly interspersed with things that are absolutely, 100% not trees. This means that, for instance, either:
I thought I had a pretty good guess at this, but the situation is far worse than I could have imagined.
https://www.lesswrong.com/posts/ma7FSEtumkve8czGF/losing-the-root-for-the-tree
You know that being healthy is important. And that there's a lot of stuff you could do to improve your health: getting enough sleep, eating well, reducing stress, and exercising, to name a few.
There’s various things to hit on when it comes to exercising too. Strength, obviously. But explosiveness is a separate thing that you have to train for. Same with flexibility. And don’t forget cardio!
Strength is most important though, because of course it is. And there’s various things you need to do to gain strength. It all starts with lifting, but rest matters too. And supplements. And protein. Can’t forget about protein.
Protein is a deeper and more complicated subject than it may at first seem. Sure, the amount of protein you consume matters, but that’s not the only consideration. You also have to think about the timing. Consuming large amounts 2x a day is different than consuming smaller amounts 5x a day. And the type of protein matters too. Animal is different than plant, which is different from dairy. And then quality is of course another thing that is important.
But quality isn’t an easy thing to figure out. The big protein supplement companies are Out To Get You. They want to mislead you. Information sources aren’t always trustworthy. You can’t just hop on The Wirecutter and do what they tell you. Research is needed.
So you listen to a few podcasts. Follow a few YouTubers. Start reading some blogs. Throughout all of this you try various products and iterate as you learn more. You’re no Joe Rogan, but you’re starting to become pretty informed.
https://gwern.net/fiction/clippy
In A.D. 20XX. Work was beginning. “How are you gentlemen !!”… (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there’s high variance in the old runs with a few anomalously high gain of function. (“Really? Really? That’s what you’re worried about?”) He can’t see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment…
https://www.lesswrong.com/posts/K4urTDkBbtNuLivJx/why-i-think-strong-general-ai-is-coming-soon
I think there is little time left before someone builds AGI (median ~2030). Once upon a time, I didn't think this.
This post attempts to walk through some of the observations and insights that collapsed my estimates.
The core ideas are as follows:
https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options
This is an essay about one of those "once you see it, you will see it everywhere" phenomena. It is a psychological and interpersonal dynamic roughly as common, and almost as destructive, as motte-and-bailey, and at least in my own personal experience it's been quite valuable to have it reified, so that I can quickly recognize the commonality between what I had previously thought of as completely unrelated situations.
The original quote referenced in the title is "There are three kinds of lies: lies, damned lies, and statistics."
https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard
In addition to technical challenges, plans to safely develop AI face lots of organizational challenges. If you're running an AI lab, you need a concrete plan for handling that.
In this post, I'll explore some of those issues, using one particular AI plan as an example. I first heard this described by Buck at EA Global London, and more recently in OpenAI's alignment plan. (I think Anthropic's plan has a fairly different ontology, although it still ultimately routes through a similar set of difficulties.)
I'd call the cluster of plans similar to this "Carefully Bootstrapped Alignment."
https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/
[Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.]
We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.
We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.
As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably.
https://www.lesswrong.com/posts/zidQmfFhMgwFzcHhs/enemies-vs-malefactors
Status: some mix of common wisdom (that bears repeating in our particular context), and another deeper point that I mostly failed to communicate.
Short version
Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway. I recommend focusing less on intent and more on patterns of harm.
(Credit to my explicit articulation of this idea goes in large part to Aella, and also in part to Oliver Habryka.)
https://www.lesswrong.com/posts/LzQtrHSYDafXynofq/the-parable-of-the-king-and-the-random-process
~ A Parable of Forecasting Under Model Uncertainty ~
You, the monarch, need to know when the rainy season will begin, in order to properly time the planting of the crops. You have two advisors, Pronto and Eternidad, who you trust exactly equally.
You ask them both: "When will the next heavy rain occur?"
Pronto says, "Three weeks from today."
Eternidad says, "Ten years from today."
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This post is also available on the EA Forum.
Summary: Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic. I say this to be comforting rather than dismissive; if it sounds dismissive, I apologize.
With that said, I have four aims in writing this post:
https://www.lesswrong.com/posts/RryyWNmJNnLowbhfC/please-don-t-throw-your-mind-away
[Warning: the following dialogue contains an incidental spoiler for "Music in Human Evolution" by Kevin Simler. That post is short, good, and worth reading without spoilers, and this post will still be here if you come back later. It's also possible to get the point of this post by skipping the dialogue and reading the other sections.]
Pretty often, talking to someone who's arriving to the existential risk / AGI risk / longtermism cluster, I'll have a conversation like the following:
Tsvi: "So, what's been catching your eye about this stuff?"
Arrival: "I think I want to work on machine learning, and see if I can contribute to alignment that way."
T: "What's something that got your interest in ML?"
A: "It seems like people think that deep learning might be on the final ramp up to AGI, so I should probably know how that stuff works, and I think I have a good chance of learning ML at least well enough to maybe contribute to a research project."
------
This is an experiment with AI narration. What do you think? Tell us by going to t3a.is.
------
https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism
There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near term systems, expecting the ability to significantly speed up progress to appear well after misalignment and deception. Furthermore, progress in this area may directly shorten timelines or enable the creation of dual purpose systems which significantly speed up capabilities research.
OpenAI recently released their alignment plan. It focuses heavily on outsourcing cognitive work to language models, transitioning us to a regime where humans mostly provide oversight to automated research assistants. While there have been a lot of objections to and concerns about this plan, there hasn’t been a strong alternative approach aiming to automate alignment research which also takes all of the many risks seriously.
The intention of this post is not to propose an end-all cure for the tricky problem of accelerating alignment using GPT models. Instead, the purpose is to explicitly put another point on the map of possible strategies, and to add nuance to the overall discussion.
https://www.lesswrong.com/posts/CYN7swrefEss4e3Qe/childhoods-of-exceptional-people
This is a linkpost for https://escapingflatland.substack.com/p/childhoods
Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.
But this is not what parents usually do when they think about how to educate their kids. The default for a parent is rather to imitate their peers and outsource the big decisions to bureaucracies. But what would we learn if we studied the highest achievements?
Thinking about this question, I wrote down a list of twenty names—von Neumann, Tolstoy, Curie, Pascal, etc—selected on the highly scientific criteria “a random Swedish person can recall their name and think, Sounds like a genius to me”. That list is to me a good first approximation of what an exceptional result in the field of child-rearing looks like. I ordered a few piles of biographies, read, and took notes. Trying to be a little less biased in my sample, I asked myself if I could recall anyone exceptional that did not fit the patterns I saw in the biographies, which I could, and so I ordered a few more biographies.
This kept going for an unhealthy amount of time.
I sampled writers (Virginia Woolf, Lev Tolstoy), mathematicians (John von Neumann, Blaise Pascal, Alan Turing), philosophers (Bertrand Russell, René Descartes), and composers (Mozart, Bach), trying to get a diverse sample.
In this essay, I am going to detail a few of the patterns that have struck me after having skimmed 42 biographies. I will sort the claims so that I start with more universal patterns and end with patterns that are less common.
https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
https://www.lesswrong.com/posts/NRrbJJWnaSorrqvtZ/on-not-getting-contaminated-by-the-wrong-obesity-ideas
A Chemical Hunger, a series by the authors of the blog Slime Mold Time Mold (SMTM), argues that the obesity epidemic is entirely caused by environmental contaminants.
In my last post, I investigated SMTM’s main suspect (lithium). This post collects other observations I have made about SMTM’s work, not narrowly related to lithium, but rather focused on the broader thesis of their blog post series.
I think that the environmental contamination hypothesis of the obesity epidemic is a priori plausible. After all, we know that chemicals can affect humans, and our exposure to chemicals has plausibly changed a lot over time. However, I found that several of what seem to be SMTM’s strongest arguments in favor of the contamination theory turned out to be dubious, and that nearly all of the interesting things I thought I’d learned from their blog posts turned out to actually be wrong. I’ll explain that in this post.
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins.
TL;DR
Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)
Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:
In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post.
https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s
Writing down something I’ve found myself repeating in different conversations:
If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots.
Look for places where everyone's fretting about a problem that some part of you thinks it could obviously just solve.
Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.
Then do it better.
https://www.lesswrong.com/posts/XPv4sYrKnPzeJASuk/basics-of-rationalist-discourse-1
Introduction
This post is meant to be a linkable resource. Its core is a short list of guidelines (you can link directly to the list) that are intended to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking.
"Alas," said Dumbledore, "we all know that what should be, and what is, are two different things. Thank you for keeping this in mind."
There is also (for those who want to read more than the simple list) substantial expansion/clarification of each specific guideline, along with justification for the overall philosophy behind the set.
https://www.lesswrong.com/posts/PCrTQDbciG4oLgmQ5/sapir-whorf-for-rationalists
Casus Belli: As I was scanning over my (rather long) list of essays-to-write, I realized that roughly a fifth of them were of the form "here's a useful standalone concept I'd like to reify," à la cup-stacking skills, fabricated options, split and commit, and sazen. Some notable entries on that list (which I name here mostly in the hope of someday coming back and turning them into links) include: red vs. white, walking with three, setting the zero point, seeding vs. weeding, hidden hinges, reality distortion fields, and something-about-layers-though-that-one-obviously-needs-a-better-word.
While it's still worthwhile to motivate/justify each individual new conceptual handle (and the planned essays will do so), I found myself imagining a general objection of the form "this is just making up terms for things," or perhaps "this is too many new terms, for too many new things." I realized that there was a chunk of argument, repeated across all of the planned essays, that I could factor out, and that (to the best of my knowledge) there was no single essay aimed directly at the question "why new words/phrases/conceptual handles at all?"
So ... voilà.
https://www.lesswrong.com/posts/pDzdb4smpzT3Lwbym/my-model-of-ea-burnout
(Probably somebody else has said most of this. But I personally haven't read it, and felt like writing it down myself, so here we go.)
I think that EA [editor note: "Effective Altruism"] burnout usually results from prolonged dedication to satisfying the values you think you should have, while neglecting the values you actually have.
Setting aside for the moment what “values” are and what it means to “actually” have one, suppose that I actually value these things (among others):
https://www.lesswrong.com/posts/Xo7qmDakxiizG7B9c/the-social-recession-by-the-numbers
This is a linkpost for https://novum.substack.com/p/social-recession-by-the-numbers
Fewer friends, relationships on the decline, delayed adulthood, trust at an all-time low, and many diseases of despair. The prognosis is not great.
One of the most discussed topics online recently has been friendships and loneliness. Ever since the infamous chart showing more people are not having sex than ever before first made the rounds, there’s been increased interest in the social state of things. Polling has demonstrated a marked decline in all spheres of social life, including close friends, intimate relationships, trust, labor participation, and community involvement. The trend looks to have worsened since the pandemic, although it will take some years before this is clearly established.
The decline comes alongside a documented rise in mental illness, diseases of despair, and poor health more generally. In August 2022, the CDC announced that U.S. life expectancy has fallen further and is now where it was in 1996. Contrast this to Western Europe, where it has largely rebounded to pre-pandemic numbers. Still, even before the pandemic, the years 2015-2017 saw the longest sustained decline in U.S. life expectancy since 1915-18. While my intended angle here is not health-related, general sociability is closely linked to health. The ongoing shift has been called the “friendship recession” or the “social recession.”
https://www.lesswrong.com/posts/pHfPvb4JMhGDr4B7n/recursive-middle-manager-hell
I think Zvi's Immoral Mazes sequence is really important, but comes with more worldview-assumptions than are necessary to make the points actionable. I conceptualize Zvi as arguing for multiple hypotheses. In this post I want to articulate one sub-hypothesis, which I call "Recursive Middle Manager Hell". I'm deliberately not covering some other components of his model.
tl;dr:
Something weird and kinda horrifying happens when you add layers of middle-management. This has ramifications on when/how to scale organizations, and where you might want to work, and maybe general models of what's going on in the world.
You could summarize the effect as "the org gets more deceptive, less connected to its original goals, more focused on office politics, less able to communicate clearly within itself, and selected more for sociopathy in upper management."
You might read that list of things and say "sure, seems a bit true", but one of the main points here is "Actually, this happens in a deeper and more insidious way than you're probably realizing, with much higher costs than you're acknowledging. If you're scaling your organization, this should be one of your primary worries."
https://www.lesswrong.com/posts/mfPHTWsFhzmcXw8ta/the-feeling-of-idea-scarcity
Here’s a story you may recognize. There's a bright up-and-coming young person - let's call her Alice. Alice has a cool idea. It seems like maybe an important idea, a big idea, an idea which might matter. A new and valuable idea. It’s the first time Alice has come up with a high-potential idea herself, something which she’s never heard in a class or read in a book or what have you.
So Alice goes all-in pursuing this idea. She spends months fleshing it out. Maybe she writes a paper, or starts a blog, or gets a research grant, or starts a company, or whatever, in order to pursue the high-potential idea, bring it to the world.
And sometimes it just works!
… but more often, the high-potential idea doesn’t actually work out. Maybe it turns out to be basically-the-same as something which has already been tried. Maybe it runs into some major barrier, some not-easily-patchable flaw in the idea. Maybe the problem it solves just wasn’t that important in the first place.
From Alice’s point of view, the possibility that her one high-potential idea wasn’t that great after all is painful. The idea probably feels to Alice like the single biggest intellectual achievement of her life. To lose that, to find out that her single greatest intellectual achievement amounts to little or nothing… that hurts to even think about. So most likely, Alice will reflexively look for an out. She’ll look for some excuse to ignore the similar ideas which have already been tried, some reason to think her idea is different. She’ll look for reasons to believe that maybe the major barrier isn’t that much of an issue, or that we Just Don’t Know whether it’s actually an issue and therefore maybe the idea could work after all. She’ll look for reasons why the problem really is important. Maybe she’ll grudgingly acknowledge some shortcomings of the idea, but she’ll give up as little ground as possible at each step, and update as slowly as she can.
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.
When thinking about deception and RLHF training, a simplified threat model is something like this:
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
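To make the mechanics concrete, here is a minimal sketch (mine, not the author's) of how reward enters a REINFORCE-style policy-gradient update: it only scales weight updates during training, and the trained policy never receives it as an input it can observe or pursue.

```python
# A minimal sketch (not from the post) of how reward enters a policy-gradient update:
# the reward only scales changes to the weights during training; the trained policy
# never sees reward as an input at run time.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                     # policy parameters for a 2-armed bandit
true_reward_prob = np.array([0.2, 0.8])  # property of the environment, never shown to the policy
lr = 0.1

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(2, p=probs)
    reward = float(rng.random() < true_reward_prob[action])
    # REINFORCE: the reward multiplies the gradient of log pi(action); that's all it does.
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    logits += lr * reward * grad_log_prob

probs = np.exp(logits) / np.exp(logits).sum()
print(probs)  # the policy ends up favoring arm 1, but "reward" appears nowhere in its inputs
```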
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Introduction
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as more of a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic.
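For readers who want something concrete, here is a rough sketch of the paper's unsupervised probe objective (CCS) as I understand it, with simplifications; the hidden states below are random stand-ins rather than real model activations.

```python
# A rough, simplified sketch of the CCS objective: learn an unsupervised probe whose
# outputs on a statement and its negation are consistent (sum to ~1) and confident
# (not both near 0.5). The activations here are random stand-ins.
import torch

def ccs_loss(probe, h_pos, h_neg):
    p_pos = torch.sigmoid(probe(h_pos)).squeeze(-1)   # P("statement is true")
    p_neg = torch.sigmoid(probe(h_neg)).squeeze(-1)   # P("negated statement is true")
    consistency = (p_pos - (1 - p_neg)) ** 2          # the two should sum to ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2     # and not both sit at ~0.5
    return (consistency + confidence).mean()

n_pairs, d = 256, 768
h_pos, h_neg = torch.randn(n_pairs, d), torch.randn(n_pairs, d)  # hypothetical activations
probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    ccs_loss(probe, h_pos, h_neg).backward()
    opt.step()
```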
https://www.lesswrong.com/posts/qRtD4WqKRYEtT5pi3/the-next-decades-might-be-wild
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I’d like to thank Simon Grimm and Tamay Besiroglu for feedback and discussions.
This post is inspired by What 2026 looks like and an AI vignette workshop guided by Tamay Besiroglu. I think of this post as “what would I expect the world to look like if these timelines (median compute for transformative AI ~2036) were true” or “what short-to-medium timelines feel like” since I find it hard to translate a statement like “median TAI year is 20XX” into a coherent imaginable world.
I expect some readers to think that the post sounds wild and crazy, but that doesn’t mean its content couldn’t be true. If you had told someone in 1990 or 2000 that there would be more smartphones and computers than humans in 2020, that probably would have sounded wild to them. The same could be true for AIs, i.e. that in 2050 there are more human-level AIs than humans. The fact that this sounds as ridiculous as ubiquitous smartphones sounded to the 1990/2000 person might just mean that we are bad at predicting exponential growth and disruptive technology.
Update: titotal points out in the comments that the correct timeframe for computers is probably 1980 to 2020. So the correct time span is probably 40 years instead of 30. For mobile phones, it's probably 1993 to 2020 if you can trust this statistic.
I’m obviously not confident (see confidence and takeaways section) in this particular prediction but many of the things I describe seem like relatively direct consequences of more and more powerful and ubiquitous AI mixed with basic social dynamics and incentives.
https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I’d like to thank MH, Jaime Sevilla and Tamay Besiroglu for their feedback.
During my Master's and Ph.D. (still ongoing), I have spoken with many academics about AI safety. These conversations include chats with individual PhDs, poster presentations and talks about AI safety.
I think I have learned a lot from these conversations and expect many other people concerned about AI safety to find themselves in similar situations. Therefore, I want to detail some of my lessons and make my thoughts explicit so that others can scrutinize them.
TL;DR: People in academia seem more and more open to arguments about risks from advanced intelligence over time, and I would genuinely recommend having lots of these chats. Furthermore, I underestimated how much work related to some aspects of AI safety already exists in academia and how often we reinvent the wheel. Messaging matters: technical discussions got more interest than alarmism, and explaining the problem (rather than trying to actively convince someone) received better feedback.
https://www.lesswrong.com/posts/6LzKRP88mhL9NKNrS/how-my-team-at-lightcone-sometimes-gets-stuff-done
Disclaimer: I originally wrote this as a private doc for the Lightcone team. I then showed it to John and he said he would pay me to post it here. That sounded awfully compelling. However, I wanted to note that I’m an early founder who hasn't built anything truly great yet. I’m writing this doc because as Lightcone is growing, I have to take a stance on these questions. I need to design our org to handle more people. Still, I haven’t seen the results long-term, and who knows if this is good advice. Don’t overinterpret this.
Suppose you went up on stage in front of a company you founded that has now grown to 100, or 1,000, or 10,000+ people. You were going to give a talk about your company values. You can say things like “We care about moving fast, taking responsibility, and being creative” -- but I expect these words would mostly fall flat. At the end of the day, the path the water takes down the hill is determined by the shape of the territory, not the sound the water makes as it swooshes by. To manage that many people, it seems to me you need clear, concrete instructions. What are those? What are things you could write down on a piece of paper and pass along your chain of command, such that if at the end people go ahead and just implement them, without asking what you meant, they would still preserve some chunk of what makes your org work?
https://www.lesswrong.com/posts/rP66bz34crvDudzcJ/decision-theory-does-not-imply-that-we-get-to-have-nice
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)
A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.) is that people think LDT agents are genial and friendly toward each other.[1]
One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)
I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.
To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.
"What?" cries the CDT agent. "I thought LDT agents one-box!"
LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.
A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because it gets more paperclips that way, not because it has a primitive property of cooperativeness-with-similar-beings. It needs the cooperation to actually yield more paperclips.
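A toy calculation (my illustration, not the post's) makes the asymmetry in the parable explicit: the question is whether the contents of box A subjunctively depend on the agent's policy. Against Omicron they do not, so taking both dominates; against a predictor like Omega they do, so one-boxing wins.

```python
# Toy payoffs for the parable above. Against Omicron, the box contents are fixed
# independently of the agent's policy; against Omega, a (near-)perfect predictor
# fills box A iff it predicts one-boxing, so the contents track the policy itself.
def payoff(policy, box_a_filled):
    a = 1_000_000 if box_a_filled else 0
    b = 1_000
    return a + b if policy == "two-box" else a

# Omicron: contents already fixed, whatever the agent decides.
for filled in (True, False):
    best = max(("one-box", "two-box"), key=lambda p: payoff(p, filled))
    print(f"Omicron, box A filled={filled}: best policy is {best}")   # two-box either way

# Omega: contents depend on the predicted policy.
for policy in ("one-box", "two-box"):
    print(f"Omega, {policy}: payoff {payoff(policy, policy == 'one-box')}")  # one-box gets $1M
```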
https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like#2022
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.)
What’s the point of doing this? Well, there are a couple of reasons:
This vignette was hard to write. To achieve the desired level of detail I had to make a bunch of stuff up, but in order to be realistic I had to constantly ask “but actually though, what would really happen in this situation?” which made it painfully obvious how little I know about the future. There are numerous points where I had to conclude “Well, this does seem implausible, but I can’t think of anything more plausible at the moment and I need to move on.” I fully expect the actual world to diverge quickly from the trajectory laid out here. Let anyone who (with the benefit of hindsight) claims this divergence as evidence against my judgment prove it by exhibiting a vignette/trajectory they themselves wrote in 2021. If it maintains a similar level of detail (and thus sticks its neck out just as much) while being more accurate, I bow deeply in respect!
https://www.lesswrong.com/posts/REA49tL5jsh69X3aM/introduction-to-abstract-entropy#fnrefpi8b39u5hd7
This post, and much of the following sequence, was greatly aided by feedback from the following people (among others): Lawrence Chan, Joanna Morningstar, John Wentworth, Samira Nedungadi, Aysja Johnson, Cody Wild, Jeremy Gillen, Ryan Kidd, Justis Mills and Jonathan Mustin. Illustrations by Anne Ore.
Introduction & motivation
In the course of researching optimization, I decided that I had to really understand what entropy is.[1] But there are a lot of other reasons why the concept is worth studying:
I didn't intend to write a post about entropy when I started trying to understand it. But I found the existing resources (textbooks, Wikipedia, science explainers) so poor that it actually seems important to have a better one as a prerequisite for understanding optimization! One failure mode I was running into was that other resources tended only to be concerned about the application of the concept in their particular sub-domain. Here, I try to take on the task of synthesizing the abstract concept of entropy, to show what's so deep and fundamental about it. In future posts, I'll talk about things like:
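In the meantime, as a tiny concrete anchor for the kind of quantity in question (my example, not the post's): the Shannon entropy of a distribution is the expected number of bits needed to pin down an outcome.

```python
# A tiny concrete anchor (my example, not the post's): Shannon entropy in bits.
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))     # 1.0 bit: a fair coin
print(entropy_bits([0.99, 0.01]))   # ~0.08 bits: a nearly deterministic coin
print(entropy_bits([1/6] * 6))      # ~2.58 bits: a fair die
```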
https://www.lesswrong.com/posts/8vesjeKybhRggaEpT/consider-your-appetite-for-disagreements
Poker
There was a time about five years ago where I was trying to get good at poker. If you want to get good at poker, one thing you have to do is review hands. Preferably with other people.
For example, suppose you have ace king offsuit on the button. Someone in the highjack opens to 3 big blinds preflop. You call. Everyone else folds. The flop is dealt. It's a rainbow Q75. You don't have any flush draws. You missed. Your opponent bets. You fold. They take the pot and you move to the next hand.
Once you finish your session, it'd be good to come back and review this hand. Again, preferably with another person. To do this, you would review each decision point in the hand. Here, there were two decision points.
The first was when you faced a 3BB open from HJ preflop with AKo. In the hand, you decided to call. However, this of course wasn't your only option. You had two others: you could have folded, and you could have raised. Actually, you could have raised to various sizes. You could have raised small to 8BB, medium to 10BB, or big to 12BB. Or hell, you could have just shoved 200BB! But that's not really a realistic option, nor is folding. So in practice your decision was between calling and raising to various realistic sizes.
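One way to make such a review concrete (my sketch, not the author's prescribed method) is to write each decision point down with its realistic options and the action actually taken, then walk through them with your review partner:

```python
# A hypothetical record of the hand above: one entry per decision point, with the
# realistic options and the action actually taken, ready to discuss with a partner.
hand_review = [
    {
        "spot": "preflop, button, AKo, facing a 3BB open from the hijack",
        "realistic_options": ["call", "raise to 8BB", "raise to 10BB", "raise to 12BB"],
        "action_taken": "call",
    },
    {
        "spot": "flop Q75 rainbow, missed, facing a bet",
        "realistic_options": ["fold", "call", "raise"],
        "action_taken": "fold",
    },
]

for point in hand_review:
    print(point["spot"], "| took:", point["action_taken"], "| alternatives:", point["realistic_options"])
```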
https://www.lesswrong.com/posts/fFY2HeC9i2Tx8FEnK/my-resentful-story-of-becoming-a-medical-miracle
This is a linkpost for https://acesounderglass.com/2022/10/13/my-resentful-story-of-becoming-a-medical-miracle/
You know those health books with “miracle cure” in the subtitle? The ones that always start with a preface about a particular patient who was completely hopeless until they tried the supplement/meditation technique/healing crystal that the book is based on? These people always start broken and miserable, unable to work or enjoy life, perhaps even suicidal from the sheer hopelessness of getting their body to stop betraying them. They’ve spent decades trying everything and nothing has worked until their friend makes them see the book’s author, who prescribes the same thing they always prescribe, and the patient immediately stands up and starts dancing because their problem is entirely fixed (more conservative books will say it took two sessions). You know how those are completely unbelievable, because anything that worked that well would go mainstream, so basically the book is starting you off with a shit test to make sure you don’t challenge its bullshit later?
Well 5 months ago I became one of those miraculous stories, except worse, because my doctor didn’t even do it on purpose. This finalized some already fermenting changes in how I view medical interventions and research. Namely: sometimes knowledge doesn’t work and then you have to optimize for luck.
I assure you I’m at least as unhappy about this as you are.
https://www.lesswrong.com/posts/CKgPFHoWFkviYz7CB/the-redaction-machine
On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.
In the heart of the machine was Jane, a person of the early 21st century.
From her perspective there was no transition. One moment she had been in the year 2021, sat beneath a tree in a park. Reading a detective novel.
Then the book was gone, and the tree. Also the park. Even the year.
She found herself laid in a bathtub, immersed in sickly fatty fluids. She was naked and cold.
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.
The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.
Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.
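For concreteness, here is a sketch of one standard recipe for the "human feedback" part of HFDT (reward modeling from pairwise preferences). It illustrates the general approach rather than anything the post specifically assumes, and the embeddings below are random stand-ins.

```python
# A sketch of reward modeling from pairwise human preferences: fit a reward model so
# that human-preferred completions score higher, then train the policy with RL against
# that reward model (plus other metrics). Embeddings here are random stand-ins.
import torch

class RewardModel(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.head = torch.nn.Linear(d, 1)
    def forward(self, h):              # h: pooled representation of (prompt, completion)
        return self.head(h).squeeze(-1)

d = 512
rm = RewardModel(d)
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
h_chosen, h_rejected = torch.randn(64, d), torch.randn(64, d)  # hypothetical embeddings

for _ in range(100):
    opt.zero_grad()
    # Pairwise (Bradley-Terry-style) loss: push preferred completions above rejected ones.
    loss = -torch.nn.functional.logsigmoid(rm(h_chosen) - rm(h_rejected)).mean()
    loss.backward()
    opt.step()
```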
https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.
https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines#fnref-fwwPpQFdWM6hJqwuY-12
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I worked on my draft report on biological anchors for forecasting AI timelines mainly between ~May 2019 (three months after the release of GPT-2) and ~Jul 2020 (a month after the release of GPT-3), and posted it on LessWrong in Sep 2020 after an internal review process. At the time, my bottom line estimates from the bio anchors modeling exercise were:[1]
These were roughly close to my all-things-considered probabilities at the time, as other salient analytical frames on timelines didn’t do much to push back on this view. (Though my subjective probabilities bounced around quite a lot around these values and if you’d asked me on different days and with different framings I’d have given meaningfully different numbers.)
It’s been about two years since the bulk of the work on that report was completed, during which I’ve mainly been thinking about AI. In that time it feels like very short timelines have become a lot more common and salient on LessWrong and in at least some parts of the ML community.
My personal timelines have also gotten considerably shorter over this period. I now expect something roughly like this:
https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring
Eight years ago, I worked as a data scientist at a startup, and we wanted to optimize our sign-up flow. We A/B tested lots of different changes, and occasionally found something which would boost (or reduce) click-through rates by 10% or so.
Then one week I was puzzling over a discrepancy in the variance of our daily signups. Eventually I scraped some data from the log files, and found that during traffic spikes, our server latency shot up to multiple seconds. The effect on signups during these spikes was massive: even just 300 ms was enough that click-through dropped by 30%, and when latency went up to seconds the click-through rates dropped by over 80%. And this happened multiple times per day. Latency was far and away the most important factor which determined our click-through rates. [1]
Going back through some of our earlier experiments, it was clear in hindsight that some of our biggest effect-sizes actually came from changing latency - for instance, if we changed the order of two screens, then there’d be an extra screen before the user hit the one with high latency, so the latency would be better hidden. Our original interpretations of those experiments - e.g. that the user cared more about the content of one screen than another - were totally wrong. It was also clear in hindsight that our statistics on all the earlier experiments were bunk - we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.
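Here is a small simulation (mine, not from the post) of that last statistical failure: both arms behave identically, a shared day-level latency effect correlates users within a day, and each arm's traffic is clumped into different days to make the failure easy to see. A test that treats every user as independent then "detects" differences constantly.

```python
# Both arms have identical true behavior; only a shared day-level latency effect
# varies, and the arms' traffic is clumped into different days. A naive z-test that
# assumes independent users then rejects the null far more often than 5% of the time.
import numpy as np

rng = np.random.default_rng(0)

def naive_ab_test(days_per_arm=15, users_per_day=500):
    clicks, n = {"A": 0, "B": 0}, {"A": 0, "B": 0}
    for arm in ("A", "B"):
        for _ in range(days_per_arm):
            # ~20% of days have a latency spike that tanks everyone's click-through.
            day_rate = 0.30 if rng.random() < 0.8 else 0.10
            clicks[arm] += rng.binomial(users_per_day, day_rate)
            n[arm] += users_per_day
    pa, pb = clicks["A"] / n["A"], clicks["B"] / n["B"]
    p = (clicks["A"] + clicks["B"]) / (n["A"] + n["B"])
    z = (pa - pb) / np.sqrt(p * (1 - p) * (1 / n["A"] + 1 / n["B"]))
    return abs(z) > 1.96          # "significant at 5%" under the independence assumption

print(np.mean([naive_ab_test() for _ in range(500)]))  # false-positive rate far above 0.05
```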
https://www.lesswrong.com/posts/WNpvK67MjREgvB8u8/do-bamboos-set-themselves-on-fire
Cross-posted from Telescopic Turnip.
As we all know, the best place to have a kung-fu fight is a bamboo forest. There are just so many opportunities to grab pieces of bamboo and manufacture improvised weapons, catapult yourself into the air, and pull off the other basic techniques any debutant martial artist ought to know. A lesser-known fact is that bamboo-forest fights occur even when the cameras of Hong Kong filmmakers are not present. They may even happen without the presence of humans at all. The forest itself is the kung-fu fight.
It's often argued that humans are the worst species on Earth, because of our limitless potential for violence and mutual destruction. If that's the case, bamboos are second. Bamboos are sick. The evolution of bamboos is the result of multiple layers of sheer brutality, with two imbricated levels of war, culminating in an apocalypse of combustive annihilation. At least, according to some hypotheses.
https://www.lesswrong.com/posts/oyKzz7bvcZMEPaDs6/survey-advice
Things I believe about making surveys, after making some surveys:
https://www.lesswrong.com/posts/J3wemDGtsy5gzD3xa/toni-kurz-and-the-insanity-of-climbing-mountains
Content warning: death
I've been on a YouTube binge lately. My current favorite genre is disaster stories about mountain climbing. The death statistics for some of these mountains, especially ones in the Himalayas, are truly insane.
To give an example, let me tell you about a mountain most people have never heard of: Nanga Parbat. It's an 8,126-meter "wall of ice and rock", sporting the tallest mountain face and the fastest change in elevation in the entire world: the Rupal Face.
I've posted a picture above, but pictures really don't do justice to just how gigantic this wall is. This single face is as tall as the largest mountain in the Alps. It is the size of ten Empire State Buildings stacked on top of one another. If you could somehow walk straight up starting from the bottom, it would take you an entire HOUR to reach the summit.
31 people died trying to climb this mountain before its first successful ascent. Imagine being climber number 32 and thinking "Well I know no one has ascended this mountain and thirty one people have died trying, but why not, let's give it a go!"
The stories of deaths on these mountains (and even much shorter peaks in the Alps or in North America) sound like they are out of a novel. Stories of one mountain in particular have stuck with me: the first attempts to climb the tallest mountain face in the Alps, the Eigerwand.
The Eigerwand is the north face of a 14,000-foot peak named "The Eiger". After three generations of Europeans had conquered every peak in the Alps, few great challenges remained in the area. The Eigerwand was one of these: widely considered to be the greatest unclimbed route in the Alps.
The peak had already been reached in the 1850s, during the golden age of Alpine exploration. But the north face of the mountain remained unclimbed.
Many things can make a climb challenging: steep slopes, avalanches, long ascents, no easy resting spots and more. The Eigerwand had all of those, but one hazard in particular stood out: loose rock and snow.
In the summer months (usually considered the best time for climbing), the mountain crumbles. Fist-sized boulders routinely tumble down the mountain. Huge avalanches sweep down its 70-degree slopes at incredible speed. And the huge, concave face is perpetually in shadow. It is extremely cold and windy, and the concave face seems to cause local weather patterns that can be completely different from the pass below. The face is deadly.
Before 1935, no team had made a serious attempt at the face. But that year, two young German climbers from Bavaria, both extremely experienced but relatively unknown outside the climbing community, decided they would make the first serious attempt.
https://www.lesswrong.com/posts/gs3vp3ukPbpaEie5L/deliberate-grieving-1
This post is hopefully useful on its own, but begins a series ultimately about grieving over a world that might (or, might not) be doomed.
It starts with some pieces from a previous coordination frontier sequence post, but goes into more detail.
At the beginning of the pandemic, I didn’t have much experience with grief. By the end of the pandemic, I had gotten quite a lot of practice grieving for things. I now think of grieving as a key life skill, with ramifications for epistemics, action, and coordination.
I had read The Art of Grieving Well, which gave me footholds to get started with. But I still had to develop some skills from scratch, and apply them in novel ways.
Grieving probably works differently for different people. Your mileage may vary. But for me, grieving is the act of wrapping my brain around the fact that something important to me doesn’t exist anymore. Or can’t exist right now. Or perhaps never existed. It typically comes in two steps – an “orientation” step, where my brain traces around the lines of the thing-that-isn’t-there, coming to understand what reality is actually shaped like now. And then a “catharsis” step, once I fully understand that the thing is gone. The first step can take hours, weeks or months.
You can grieve for people who are gone. You can grieve for things you used to enjoy. You can grieve for principles that were important to you but aren’t practical to apply right now.
Grieving is important in single-player mode – if I’m holding onto something that’s not there anymore, my thoughts and decision-making are distorted. I can’t make good plans if my map of reality is full of leftover wishful markings of things that aren’t there.
https://www.lesswrong.com/s/6xgy8XYEisLk3tCjH/p/CPP2uLcaywEokFKQG
Tl;dr:
I've noticed a dichotomy between "thinking in toolboxes" and "thinking in laws".
The toolbox style of thinking says it's important to have a big bag of tools that you can adapt to context and circumstance; people who think very toolboxly tend to suspect that anyone who goes talking of a single optimal way is just ignorant of the uses of the other tools.
The lawful style of thinking, done correctly, distinguishes between descriptive truths, normative ideals, and prescriptive ideals. It may talk about certain paths being optimal, even if there's no executable-in-practice algorithm that yields the optimal path. It considers truths that are not tools.
Within nearly-Euclidean mazes, the triangle inequality - that the path AC is never spatially longer than the path ABC - is always true but only sometimes useful. The triangle inequality has the prescriptive implication that if you know that one path choice will travel ABC and one path will travel AC, and if the only pragmatic path-merit you care about is going the minimum spatial distance (rather than say avoiding stairs because somebody in the party is in a wheelchair), then you should pick the route AC. But the triangle inequality goes on governing Euclidean mazes whether or not you know which path is which, and whether or not you need to avoid stairs.
Toolbox thinkers may be extremely suspicious of this claim of universal lawfulness if it is explained less than perfectly, because it sounds to them like "Throw away all the other tools in your toolbox! All you need to know is Euclidean geometry, and you can always find the shortest path through any maze, which in turn is always the best path."
If you think that's an unrealistic depiction of a misunderstanding that would never happen in reality, keep reading.
https://www.lesswrong.com/posts/PBRWb2Em5SNeWYwwB/humans-are-not-automatically-strategic
Reply to: A "Failure to Evaluate Return-on-Time" Fallacy
Lionhearted writes:
[A] large majority of otherwise smart people spend time doing semi-productive things, when there are massively productive opportunities untapped.
A somewhat silly example: Let's say someone aspires to be a comedian, the best comedian ever, and to make a living doing comedy. He wants nothing else, it is his purpose. And he decides that in order to become a better comedian, he will watch re-runs of the old television cartoon 'Garfield and Friends' that was on TV from 1988 to 1995.
... I’m curious as to why.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
[Thanks to a variety of people for comments and assistance (especially Paul Christiano, Nostalgebraist, and Rafe Kennedy), and to various people for playing the game. Buck wrote the top-1 prediction web app; Fabien wrote the code for the perplexity experiment and did most of the analysis and wrote up the math here, Lawrence did the research on previous measurements. Epistemic status: we're pretty confident of our work here, but haven't engaged in a super thorough review process of all of it--this was more like a side-project than a core research project.]
How good are modern language models compared to humans, at the task language models are trained on (next token prediction on internet text)? While there are language-based tasks that you can construct where humans can make a next-token prediction better than any language model, we aren't aware of any apples-to-apples comparisons on non-handcrafted datasets. To answer this question, we performed a few experiments comparing humans to language models on next-token prediction on OpenWebText.
Contrary to some previous claims, we found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1. That is, even small language models are "superhuman" at predicting the next token. That being said, it seems plausible that humans can consistently beat the smaller 2017-era models (though not modern models) with a few hours more practice and strategizing. We conclude by discussing some of our takeaways from this result.
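For readers who want the metrics pinned down, here is a small sketch (mine, not the authors' code) of the two quantities being compared: top-1 accuracy and perplexity on next-token prediction.

```python
# Top-1 accuracy: how often the single best guess matches the actual next token.
# Perplexity: exp of the average negative log-probability assigned to the actual tokens.
import math

def top1_accuracy(guesses, actual_tokens):
    return sum(g == a for g, a in zip(guesses, actual_tokens)) / len(actual_tokens)

def perplexity(probs_assigned_to_actual_tokens):
    avg_nll = -sum(math.log(p) for p in probs_assigned_to_actual_tokens) / len(probs_assigned_to_actual_tokens)
    return math.exp(avg_nll)

print(top1_accuracy(["the", "cat", "sat"], ["the", "dog", "sat"]))  # ~0.67
print(perplexity([0.5, 0.1, 0.25]))                                  # ~4.3; lower is better
```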
We're not claiming that this result is completely novel or surprising. For example, FactorialCode makes a similar claim as an answer on this LessWrong post about this question. We've also heard from some NLP people that the superiority of LMs to humans for next-token prediction is widely acknowledged in NLP. However, we've seen incorrect claims to the contrary on the internet, and as far as we know there hasn't been a proper apples-to-apples comparison, so we believe there's some value to our results.
If you want to play with our website, it’s here; in our opinion playing this game for half an hour gives you some useful perspective on what it’s like to be a language model.
https://www.lesswrong.com/posts/jDQm7YJxLnMnSNHFu/moral-strategies-at-different-capability-levels
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Let’s consider three ways you can be altruistic towards another agent:
I think a lot of unresolved tensions in ethics come from seeing these types of morality as in opposition to each other, when they’re actually complementary:
https://www.lesswrong.com/posts/xFotXGEotcKouifky/worlds-where-iterative-design-fails
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.
In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.
By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.
Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:
… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.
https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is
Despite a clear need for it, a good source explaining who is doing what and why in technical AI alignment doesn't exist. This is our attempt to produce such a resource. We expect to be inaccurate in some ways, but it seems great to get out there and let Cunningham’s Law do its thing.[1]
The main body contains our understanding of what everyone is doing in technical alignment and why, as well as at least one of our opinions on each approach. We include supplements visualizing differences between approaches and Thomas’s big picture view on alignment. The opinions written are Thomas and Eli’s independent impressions, many of which have low resilience. Our all-things-considered views are significantly more uncertain.
This post was mostly written while Thomas was participating in the 2022 iteration of the SERI MATS program, under mentor John Wentworth. Thomas benefited immensely from conversations with other SERI MATS participants, with John Wentworth, and with many others he met over the summer.
https://www.lesswrong.com/posts/rYDas2DDGGDRc8gGB/unifying-bargaining-notions-1-2
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a two-part sequence of posts, in the ancient LessWrong tradition of decision-theory-posting. This first part will introduce various concepts of bargaining solutions and dividing gains from trade, which the reader may or may not already be familiar with.
The upcoming part will be about how all introduced concepts from this post are secretly just different facets of the same underlying notion, as originally discovered by John Harsanyi back in 1963 and rediscovered by me from a completely different direction. The fact that the various different solution concepts in cooperative game theory are all merely special cases of a General Bargaining Solution for arbitrary games, is, as far as I can tell, not common knowledge on Less Wrong.
Bargaining Games
Let's say there's a couple with a set of available restaurant options. Neither of them wants to go without the other, and if they fail to come to an agreement, the fallback is eating a cold canned soup dinner at home, the worst of all the options. However, they have different restaurant preferences. What's the fair way to split the gains from trade?
Well, it depends on their restaurant preferences, and preferences are typically encoded with utility functions. Since both sides agree that the disagreement outcome is the worst, they might as well index that as 0 utility, and their favorite respective restaurants as 1 utility, and denominate all the other options in terms of what probability mix between a cold canned dinner and their favorite restaurant would make them indifferent. If there's something that scores 0.9 utility for both, it's probably a pretty good pick!
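As a toy illustration of that setup (names and numbers are mine): normalize each person's utilities so the disagreement outcome is 0 and their favorite restaurant is 1, then compare the options. One classic solution concept, the Nash bargaining solution, picks the option that maximizes the product of gains over the disagreement point; I restrict to pure options here for simplicity.

```python
# Hypothetical normalized utilities (partner 1, partner 2): 0 is the disagreement
# outcome (canned soup), 1 is each person's favorite restaurant. The Nash bargaining
# solution (restricted to pure options) maximizes the product of gains.
options = {
    "cold canned soup":   (0.0, 0.0),  # the disagreement point
    "partner 1 favorite": (1.0, 0.4),
    "partner 2 favorite": (0.5, 1.0),
    "compromise option":  (0.9, 0.9),
}

best = max(options, key=lambda o: options[o][0] * options[o][1])
print(best)  # "compromise option": 0.9 * 0.9 beats the product for either favorite
```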
https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators#fncrt8wagfir9
Summary
TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like?
Unlike the limit of RL, the limit of self-supervised learning has received surprisingly little conceptual attention, and recent progress has made deconfusion in this domain more pressing.
Existing AI taxonomies either fail to capture important properties of self-supervised models or lead to confusing propositions. For instance, GPT policies do not seem globally agentic, yet can be conditioned to behave in goal-directed ways. This post describes a frame that enables more natural reasoning about properties like agency: GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra.
The purpose of this post is to capture these objects in words so GPT can reference them and provide a better foundation for understanding them.
I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.
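The rollout procedure described above is easy to make concrete. Here is a minimal sketch (mine, with a toy stand-in for the model) of iteratively sampling from the predictive distribution and appending the sample to the condition.

```python
# Simulate a rollout: repeatedly sample from the model's predictive distribution and
# append the sample to the condition, so the trajectory obeys the learned distribution.
import numpy as np

rng = np.random.default_rng(0)

def simulate(model, prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        probs = model(tokens)                      # predictive distribution over the next token
        next_token = int(rng.choice(len(probs), p=probs))
        tokens.append(next_token)                  # update the condition with the sample
    return tokens

def toy_model(tokens):                             # hypothetical rule over a 4-token vocabulary
    probs = np.full(4, 0.1)
    probs[(tokens[-1] + 1) % 4] = 0.7
    return probs

print(simulate(toy_model, [0], 10))
```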
TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.
For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true, and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.
“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.
https://www.lesswrong.com/posts/DdDt5NXkfuxAnAvGJ/changing-the-world-through-slack-and-hobbies
In EA orthodoxy, if you're really serious about EA, the three alternatives that people most often seem to talk about are
(1) “direct work” in a job that furthers a very important cause;
(2) “earning to give”;
(3) earning “career capital” that will help you do those things in the future, e.g. by getting a PhD or teaching yourself ML.
By contrast, there’s not much talk of:
(4) being in a job / situation where you have extra time and energy and freedom to explore things that seem interesting and important.
But that last one is really important!
This is Part 1 of my «Boundaries» Sequence on LessWrong.
Summary: «Boundaries» are a missing concept from the axioms of game theory and bargaining theory, which might help pin down certain features of multi-agent rationality (this post), and have broader implications for effective altruism discourse and x-risk (future posts).
Epistemic status: me describing what I mean.
With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.
When I say boundary, I don't just mean an arbitrary constraint or social norm. I mean something that could also be called a membrane in a generalized sense, i.e., a layer of stuff-of-some-kind that physically or cognitively separates a living system from its environment, that 'carves reality at the joints' in a way that isn't an entirely subjective judgement of the living system itself. Here are some examples that I hope will convey my meaning:
https://www.lesswrong.com/posts/MdZyLnLHuaHrCskjy/itt-passing-and-civility-are-good-charity-is-bad
I often object to claims like "charity/steelmanning is an argumentative virtue". This post collects a few things I and others have said on this topic over the last few years.
My current view is:
Related to: Slack gives you the ability to notice/reflect on subtle things
Epistemic status: A possibly annoying mixture of straightforward reasoning and hard-to-justify personal opinions.
It is often stated (with some justification, IMO) that AI risk is an “emergency.” Various people have explained to me that they put various parts of their normal life’s functioning on hold on account of AI being an “emergency.” In the interest of people doing this sanely and not confusedly, I’d like to take a step back and seek principles around what kinds of changes a person might want to make in an “emergency” of different sorts.
There are plenty of ways we can temporarily increase productivity on some narrow task or other, at the cost of our longer-term resources. For example:
(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)
In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.
Here, I want to briefly discuss a variety of current research proposals in the field, to explain why I think this problem is currently neglected.
I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.
Before getting into that, let me be very explicit about three points:
https://www.lesswrong.com/posts/28zsuPaJpKAGSX4zq/humans-are-very-reliable-agents
Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results?
Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. For MNIST, an easy handwriting recognition task, performance tops out at around 99.9% even for top models; it's not very practical to design for or measure higher reliability than that, because the test set is just 10,000 images and a handful are ambiguous. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models.
By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.99999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.)
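The arithmetic behind those figures, spelled out under the post's stated assumptions:

```python
# Back-of-the-envelope check using the post's assumptions: 1 fatality per 100 million
# miles, ~100 decisions per mile, and a 99.997%-reliable model for comparison.
fatalities_per_mile = 1 / 100_000_000
decisions_per_mile = 100
failure_per_decision = fatalities_per_mile / decisions_per_mile  # 1e-10

print(failure_per_decision)                    # 1e-10, i.e. ~1 in 10,000,000,000
print(1 - failure_per_decision)                # ~99.99999999% worst-case reliability
print((1 - 0.99997) / failure_per_decision)    # ~3e5: roughly five orders of magnitude
```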
https://www.lesswrong.com/posts/2GxhAyn9aHqukap2S/looking-back-on-my-alignment-phd
The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four[1] years.
I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD.
I figured this point out by summer 2021.
I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.
But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area.
If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness.
But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.
https://www.lesswrong.com/posts/7iAABhWpcGeP5e6SB/it-s-probably-not-lithium
A Chemical Hunger, a series by the authors of the blog Slime Mold Time Mold (SMTM) that has been received positively on LessWrong, argues that the obesity epidemic is entirely caused by environmental contaminants. The authors’ top suspect is lithium,[1] primarily because it is known to cause weight gain at the doses used to treat bipolar disorder.
After doing some research, however, I found that it is not plausible that lithium plays a major role in the obesity epidemic, and that a lot of the claims the SMTM authors make about the topic are misleading, flat-out wrong, or based on extremely cherry-picked evidence. I have the impression that reading what they have to say about this often leaves the reader with a worse model of reality than they started with, and I’ll explain why I have that impression in this post.
https://www.lesswrong.com/posts/bhLxWTkRc8GXunFcB/what-are-you-tracking-in-your-head
A large chunk - plausibly the majority - of real-world expertise seems to be in the form of illegible skills: skills/knowledge which are hard to transmit by direct explanation. They’re not necessarily things which a teacher would even notice enough to consider important - just background skills or knowledge which is so ingrained that it becomes invisible.
I’ve recently noticed a certain common type of illegible skill which I think might account for the majority of illegible-skill-value across a wide variety of domains.
Here are a few examples of the type of skill I have in mind:
I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I've learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.
I used to use the phrase when teaching security mindset to software developers that "security doesn't happen by accident." A system that isn't explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn't designed to be robust against a certain failure mode is going to exhibit that failure mode.
This might seem rather obvious when stated explicitly, but this is not the way that most developers, indeed most humans, think. I see a lot of disturbing parallels when I see anyone arguing that AGI won't necessarily be dangerous. An AGI that isn't intentionally designed not to exhibit a particular failure mode is going to have that failure mode. It is certainly possible to get lucky and not trigger it, and it will probably be impossible to enumerate even every category of failure mode, but to have any chance at all we will have to plan in advance for as many failure modes as we can possibly conceive.
As a practical enforcement method, I used to ask development teams that every user story have at least three abuser stories to go with it. For any new capability, think at least hard enough about it that you can imagine at least three ways that someone could misuse it. Sometimes this means looking at boundary conditions ("what if someone orders 2^64+1 items?"), sometimes it means looking at forms of invalid input ("what if someone tries to pay -$100, can they get a refund?"), and sometimes it means being aware of particular forms of attack ("what if someone puts Javascript in their order details?").
by paulfchristiano, 20th Jun 2022.
(Partially in response to AGI Ruin: A list of Lethalities. Written in the same rambling style. Not exhaustive.)
Editor's note: The following is a lightly edited copy of a document written by Eliezer Yudkowsky in November 2017. Since this is a snapshot of Eliezer’s thinking at a specific time, we’ve sprinkled reminders throughout that this is from 2017.
A background note:
It’s often the case that people are slow to abandon obsolete playbooks in response to a novel challenge. And AGI is certainly a very novel challenge.
Italian general Luigi Cadorna offers a memorable historical example. In the Isonzo Offensive of World War I, Cadorna lost hundreds of thousands of men in futile frontal assaults against enemy trenches defended by barbed wire and machine guns. As morale plummeted and desertions became epidemic, Cadorna began executing his own soldiers en masse, in an attempt to cure the rest of their “cowardice.” The offensive continued for 2.5 years.
Cadorna made many mistakes, but foremost among them was his refusal to recognize that this war was fundamentally unlike those that had come before. Modern weaponry had forced a paradigm shift, and Cadorna’s instincts were not merely miscalibrated—they were systematically broken. No number of small, incremental updates within his obsolete framework would be sufficient to meet the new challenge.
Other examples of this type of mistake include the initial response of the record industry to iTunes and streaming; or, more seriously, the response of most Western governments to COVID-19.
https://www.lesswrong.com/posts/pL4WhsoPJwauRYkeK/moses-and-the-class-struggle
"𝕿𝖆𝖐𝖊 𝖔𝖋𝖋 𝖞𝖔𝖚𝖗 𝖘𝖆𝖓𝖉𝖆𝖑𝖘. 𝕱𝖔𝖗 𝖞𝖔𝖚 𝖘𝖙𝖆𝖓𝖉 𝖔𝖓 𝖍𝖔𝖑𝖞 𝖌𝖗𝖔𝖚𝖓𝖉," said the bush.
"No," said Moses.
"Why not?" said the bush.
"I am a Jew. If there's one thing I know about this universe it's that there's no such thing as God," said Moses.
"You don't need to be certain I exist. It's a trivial case of Pascal's Wager," said the bush.
"Who is Pascal?" said Moses.
"It makes sense if you are beyond time, as I am," said the bush.
"Mysterious answers are not answers," said Moses.
https://www.lesswrong.com/posts/T6kzsMDJyKwxLGe3r/benign-boundary-violations
Recently, my friend Eric asked me what sorts of things I wanted to have happen at my bachelor party.
I said (among other things) that I'd really enjoy some benign boundary violations.
Eric went ????
Subsequently: an essay.
We use the word "boundary" to mean at least two things, when we're discussing people's personal boundaries.
The first is their actual self-defined boundary—the line that they would draw, if they had perfect introspective access, which marks the transition point from "this is okay" to "this is no longer okay."
Different people have different boundaries:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.
Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities: