Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.
If you’d like more, subscribe to the “Lesswrong (30+ karma)” feed.
Copyright: © 2023 LessWrong Curated Podcast
Eliezer Yudkowsky periodically complains about people coming up with questionable plans with questionable assumptions to deal with AI, and then either:
This is a link post. Editor's note: Somewhat after I posted this on my own blog, Max Chiswick cornered me at LessOnline / Manifest and gave me a whole new perspective on this topic. I now believe that there is a way to use poker to sharpen epistemics that works dramatically better than anything I had been considering. I hope to write it up—together with Max—when I have time. Anyway, I'm still happy to keep this post around as a record of my first thoughts on the matter, and because it's better than nothing in the time before Max and I get around to writing up our joint second thoughts.
As an epilogue to this story, Max and I are now running a beta test for a course on making AIs to play poker and other games. The course will be a synthesis of our respective theories of pedagogy re [...]
---
First published:
July 8th, 2024
Source:
https://www.lesswrong.com/posts/PypgeCxFHLzmBENK4/poker-is-a-bad-game-for-teaching-epistemics-figgie-is-a
---
Narrated by TYPE III AUDIO.
xlr8harder writes:
In general I don’t think an uploaded mind is you, but rather a copy. But one thought experiment makes me question this. A Ship of Theseus concept where individual neurons are replaced one at a time with a nanotechnological functional equivalent.
Are you still you?
Presumably the question xlr8harder cares about here isn't the semantic question of how linguistic communities use the word "you", or predictions about how whole-brain emulation tech might change the way we use pronouns.
Rather, I assume xlr8harder cares about more substantive questions like:
I haven't shared this post with other relevant parties – my experience has been that private discussion of this sort of thing is more paralyzing than helpful. I might change my mind in the resulting discussion, but, I prefer that discussion to be public.
I think 80,000 hours should remove OpenAI from its job board, and similar EA job placement services should do the same.
(I personally believe 80k shouldn't advertise Anthropic jobs either, but I think the case for that is somewhat less clear)
I think OpenAI has demonstrated a level of manipulativeness, recklessness, and failure to prioritize meaningful existential safety work, that makes me think EA orgs should not be going out of their way to give them free resources. (It might make sense for some individuals to work there, but this shouldn't be a thing 80k or other orgs are systematically funneling talent into)
There [...]
---
First published:
July 3rd, 2024
Source:
https://www.lesswrong.com/posts/8qCwuE8GjrYPSqbri/80-000-hours-should-remove-openai-from-the-job-board-and
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://www.bhauth.com/blog/biology/cancer%20vaccines.html
Cancer neoantigens
For cells to become cancerous, they must have mutations that cause uncontrolled replication and mutations that prevent that uncontrolled replication from causing apoptosis. Because cancer requires several mutations, it often begins with damage to mutation-preventing mechanisms. As such, cancers often have many mutations not required for their growth, which often cause changes to the structure of some surface proteins.
The modified surface proteins of cancer cells are called "neoantigens". An approach to cancer treatment that's currently being researched is to identify some specific neoantigens of a patient's cancer, and create a personalized vaccine to cause their immune system to recognize them. Such vaccines would use either mRNA or synthetic long peptides. The steps required are as follows:
About a year ago I decided to try using one of those apps where you tie your goals to some kind of financial penalty. The specific one I tried is Forfeit, which I liked the look of because it's relatively simple, you set single tasks which you have to verify you have completed with a photo.
I’m generally pretty sceptical of productivity systems, tools for thought, mindset shifts, life hacks and so on. But this one I have found to be really shockingly effective, it has been about the biggest positive change to my life that I can remember. I feel like the category of things which benefit from careful planning and execution over time has completely opened up to me, whereas previously things like this would be largely down to the luck of being in the right mood for long enough.
It's too soon to tell whether [...]
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
April 15th, 2024
Source:
https://www.lesswrong.com/posts/DRrAMiekmqwDjnzS5/my-experience-using-financial-commitments-to-overcome
---
Narrated by TYPE III AUDIO.
An NII machine in Nogales, AZ. (Image source)
There's bound to be a lot of discussion of the Biden-Trump presidential debates last night, but I want to skip all the political prognostication and talk about the real issue: fentanyl-detecting machines.
Joe Biden says:
And I wanted to make sure we use the machinery that can detect fentanyl, these big machines that roll over everything that comes across the border, and it costs a lot of money. That was part of this deal we put together, this bipartisan deal.
More fentanyl machines, were able to detect drugs, more numbers of agents, more numbers of all the people at the border. And when we had that deal done, he went – he called his Republican colleagues said don’t do it. It's going to hurt me politically.
He never argued. It's not a good bill. It's a really good bill. We need [...]
---
First published:
June 28th, 2024
Source:
https://www.lesswrong.com/posts/TzwMfRArgsNscHocX/the-incredible-fentanyl-detecting-machine
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. [Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren’t very original to me; many of the examples of threat models are from other people.]
In this post, I want to introduce the concept of a “rogue deployment” and argue that it's interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I’ll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or [...]
---
First published:
June 3rd, 2024
Source:
https://www.lesswrong.com/posts/ceBpLHJDdCt3xfEok/ai-catastrophes-and-rogue-deployments
---
Narrated by TYPE III AUDIO.
(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)
This is the final essay in a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the series as a whole. There's also a PDF of the whole series here.
(Warning: spoilers for Angels in America; and moderate spoilers for Harry Potter and the Methods of Rationality.)
"I come into the presence of still water..."
~Wendell Berry
A lot of this series has been about problems with yang—that is, with the active element in the duality of activity vs. receptivity, doing vs. not-doing, controlling vs. letting go.[1] In particular, I've been interested in the ways that "deep atheism" (that is, a fundamental [...]
---
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going to behave dangerously in unforeseen situations. ARC is attempting to perform this kind of verification, but using a mathematical kind of "explanation" instead of one written in natural language.
To help elucidate this connection, ARC has been supporting work on Compact Proofs of Model Performance via Mechanistic Interpretability by Jason Gross, Rajashree Agrawal, Lawrence Chan and others, which we were excited to see released along with this post. While we ultimately think that provable guarantees for large neural networks are unworkable as a long-term goal, we think that this work serves as a useful springboard towards alternatives.
In this [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
June 25th, 2024
Source:
https://www.lesswrong.com/posts/SyeQjjBoEC48MvnQC/formal-verification-heuristic-explanations-and-surprise
---
Narrated by TYPE III AUDIO.
Summary
LLMs may be fundamentally incapable of fully general reasoning, and if so, short timelines are less plausible.
Longer summary
There is ML research suggesting that LLMs fail badly on attempts at general reasoning, such as planning problems, scheduling, and attempts to solve novel visual puzzles. This post provides a brief introduction to that research, and asks:
Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures. We don’t currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models.
Epistemic status: Decently confident that the ideas here are directionally correct. I’ve been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
June 24th, 2024
Source:
https://www.lesswrong.com/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post. TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or CoT. For instance, we finetune GPT-3.5 on pairs (x,f(x)) for some unknown function f. We find that the LLM can (a) define f in Python, (b) invert f, (c) compose f with other functions, for simple functions such as x+14, x // 3, 1.75x, and 3x+2.
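For concreteness, here is roughly what finetuning data of the kind described above might look like. This is my own illustrative sketch; the file name, prompt format, and choice of f are assumptions, not details taken from the paper.

import json, random

def f(x):
    # One of the simple hidden functions mentioned above, e.g. 3x + 2.
    return 3 * x + 2

# Build (x, f(x)) finetuning examples that never state the rule itself.
examples = []
for _ in range(500):
    x = random.randint(-100, 100)
    examples.append({
        "messages": [
            {"role": "user", "content": f"x = {x}. What is f(x)?"},
            {"role": "assistant", "content": str(f(x))},
        ]
    })

with open("finetune_data.jsonl", "w") as out:
    for ex in examples:
        out.write(json.dumps(ex) + "\n")

# The paper's claim is that after finetuning on data like this, the model can be
# asked (with no examples in context) to define f in Python, invert it, or
# compose it with other functions.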
Paper authors: Johannes Treutlein*, Dami Choi*, Jan Betley, Sam Marks, Cem Anil, Roger Grosse, Owain Evans (*equal contribution)
Johannes, Dami, and Jan did this project as part of an Astra Fellowship with Owain Evans.
Below, we include the Abstract and Introduction from the paper, followed by some additional discussion of our AI safety [...]
---
First published:
June 21st, 2024
Source:
https://www.lesswrong.com/posts/5SKRHQEFr8wYQHYkx/connecting-the-dots-llms-can-infer-and-verbalize-latent
---
Narrated by TYPE III AUDIO.
This is a link post. I have canceled my OpenAI subscription in protest over OpenAI's lack of ethics.
In particular, I object to:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a link post. New Anthropic model organisms research paper led by Carson Denison from the Alignment Stress-Testing Team demonstrating that large language models can generalize zero-shot from simple reward-hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward-hacks such as sycophancy can have dramatic and very difficult to reverse consequences for how models generalize, up to and including generalization to editing their own reward functions and covering up their tracks when doing so.
Abstract:
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too [...]
---
First published:
June 17th, 2024
Source:
https://www.lesswrong.com/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in
---
Narrated by TYPE III AUDIO.
After living in a suburb for most of my life, when I moved to a major U.S. city the first thing I noticed was the feces. At first I assumed it was dog poop, but my naivety didn’t last long.
One day I saw a homeless man waddling towards me at a fast speed while holding his ass cheeks. He turned into an alley and took a shit. As I passed him, there was a moment where our eyes met. He sheepishly averted his gaze.
The next day I walked to the same place. There are a number of businesses on both sides of the street that probably all have bathrooms. I walked into each of them to investigate.
In a coffee shop, I saw a homeless woman ask the barista if she could use the bathroom. “Sorry, that bathroom is for customers only.” I waited five minutes and [...]
---
First published:
June 18th, 2024
Source:
https://www.lesswrong.com/posts/sCWe5RRvSHQMccd2Q/i-would-have-shit-in-that-alley-too
---
Narrated by TYPE III AUDIO.
Getting 50% (SoTA) on ARC-AGI with GPT-4o
I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs.
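As a rough sketch of the core loop being described (my own paraphrase, not the author's code; ask_gpt4o_for_program, compile_candidate, and run_program are hypothetical helper names, and the real pipeline adds many refinements on top of this):

import json
from collections import Counter

def solve_arc_task(task, n_samples=8000):
    # Sample many candidate Python programs, keep the ones that reproduce every
    # training example, then majority-vote their outputs on the test input.
    correct_programs = []
    for _ in range(n_samples):
        source = ask_gpt4o_for_program(task["train"])    # hypothetical LLM call
        program = compile_candidate(source)              # hypothetical: safely exec the source
        if program is None:
            continue
        if all(run_program(program, ex["input"]) == ex["output"] for ex in task["train"]):
            correct_programs.append(program)
    if not correct_programs:
        return None
    votes = Counter(
        json.dumps(run_program(p, task["test"][0]["input"])) for p in correct_programs
    )
    return json.loads(votes.most_common(1)[0][0])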
[This post is on a pretty different topic than the usual posts on our substack. So regular readers should be warned!]
The additional approaches and tweaks are:
Have you heard this before? In clinical trials, medicines have to be compared to a placebo to separate the effect of the medicine from the psychological effect of taking the drug. The patient's belief in the power of the medicine has a strong effect on its own. In fact, for some drugs such as antidepressants, the psychological effect of taking a pill is larger than the effect of the drug. It may even be worth it to give a patient an ineffective medicine just to benefit from the placebo effect. This is the conventional wisdom that I took for granted until recently.
I no longer believe any of it, and the short answer as to why is that big meta-analysis on the placebo effect. That meta-analysis collected all the studies they could find that did "direct" measurements of the placebo effect. In addition to a placebo group that could [...]
---
First published:
June 10th, 2024
Source:
https://www.lesswrong.com/posts/kpd83h5XHgWCxnv3h/why-i-don-t-believe-in-the-placebo-effect
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. As an AI researcher who wants to do technical work that helps humanity, there is a strong drive to find a research area that is definitely helpful somehow, so that you don’t have to worry about how your work will be applied, and thus you don’t have to worry about things like corporate ethics or geopolitics to make sure your work benefits humanity.
Unfortunately, no such field exists. In particular, technical AI alignment is not such a field, and technical AI safety is not such a field. It absolutely matters where ideas land and how they are applied, and when the existence of the entire human race is at stake, that's no exception.
If that's obvious to you, this post is mostly just a collection of arguments for something you probably already realize. But if you somehow [...]
---
First published:
June 14th, 2024
Source:
https://www.lesswrong.com/posts/F2voF4pr3BfejJawL/safety-isn-t-safety-without-a-social-model-or-dispelling-the
---
Narrated by TYPE III AUDIO.
(Cross-posted from Twitter.)
My take on Leopold Aschenbrenner's new report: I think Leopold gets it right on a bunch of important counts.
Three that I especially care about:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. We are pleased to announce ILIAD — a 5-day conference bringing together 100+ researchers to build strong scientific foundations for AI alignment.
***Apply to attend by June 30!***
Since at least 2017, OpenAI has asked departing employees to sign offboarding agreements which legally bind them to permanently—that is, for the rest of their lives—refrain from criticizing OpenAI, or from otherwise taking any actions which might damage its finances or reputation.[1]
If they refused to sign, OpenAI threatened to take back (or make unsellable) all of their already-vested equity—a huge portion of their overall compensation, which often amounted to millions of dollars. Given this immense pressure, it seems likely that most employees signed.
If they did sign, they became personally liable forevermore for any financial or reputational harm they later caused. This liability was unbounded, so had the potential to be financially ruinous—if, say, they later wrote a blog post critical of OpenAI, they might in principle be found liable for damages far in excess of their net worth.
These extreme provisions allowed OpenAI to systematically silence criticism [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
May 30th, 2024
Source:
https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai
---
Narrated by TYPE III AUDIO.
As we explained in our MIRI 2024 Mission and Strategy update, MIRI has pivoted to prioritize policy, communications, and technical governance research over technical alignment research. This follow-up post goes into detail about our communications strategy.
The Objective: Shut it Down[1]
Our objective is to convince major powers to shut down the development of frontier AI systems worldwide before it is too late. We believe that nothing less than this will prevent future misaligned smarter-than-human AI systems from destroying humanity. Persuading governments worldwide to take sufficiently drastic action will not be easy, but we believe this is the most viable path.
Policymakers deal mostly in compromise: they form coalitions by giving a little here to gain a little somewhere else. We are concerned that most legislation intended to keep humanity alive will go through the usual political processes and be ground down into ineffective compromises.
The only way we [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 29th, 2024
Source:
https://www.lesswrong.com/posts/tKk37BFkMzchtZThx/miri-2024-communications-strategy
---
Narrated by TYPE III AUDIO.
Previously: OpenAI: Exodus (contains links at top to earlier episodes), Do Not Mess With Scarlett Johansson
We have learned more since last week. It's worse than we knew.
How much worse? In which ways? With what exceptions?
That's what this post is about.
The Story So Far
For years, employees who left OpenAI consistently had their vested equity explicitly threatened with confiscation and the lack of ability to sell it, and were given short timelines to sign documents or else. Those documents contained highly aggressive NDA and non-disparagement (and non-interference) clauses, including the NDA preventing anyone from revealing these clauses.
No one knew about this until recently, because until Daniel Kokotajlo everyone signed, and then they could not talk about it. Then Daniel refused to sign, Kelsey Piper started reporting, and a lot came out.
Here is Altman's statement from [...]
---
First published:
May 28th, 2024
Source:
https://www.lesswrong.com/posts/YwhgHwjaBDmjgswqZ/openai-fallout
---
Narrated by TYPE III AUDIO.
Contact: patreon.com/lwcurated or [perrin dot j dot walker plus lesswrong fnord gmail].
All of Solenoid's narration work can be found here.
Crossposted from AI Lab Watch. Subscribe on Substack.
Introduction.
Anthropic has an unconventional governance mechanism: an independent "Long-Term Benefit Trust" elects some of its board. Anthropic sometimes emphasizes that the Trust is an experiment, but mostly points to it to argue that Anthropic will be able to promote safety and benefit-sharing over profit.[1]
But the Trust's details have not been published and some information Anthropic has shared is concerning. In particular, Anthropic's stockholders can apparently overrule, modify, or abrogate the Trust, and the details are unclear.
Anthropic has not publicly demonstrated that the Trust would be able to actually do anything that stockholders don't like.
The facts
There are three sources of public information on the Trust:
They say there's [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 27th, 2024
Source:
https://www.lesswrong.com/posts/sdCcsTt9hRpbX6obP/maybe-anthropic-s-long-term-benefit-trust-is-powerless
---
Narrated by TYPE III AUDIO.
Introduction.
If you are choosing to read this post, you've probably seen the image below depicting all the notifications students received on their phones during one class period. You probably saw it as a retweet of this tweet, or in one of Zvi's posts. Did you find this data plausible, or did you roll to disbelieve? Did you know that the image dates back to at least 2019? Does that fact make you more or less worried about the truth on the ground as of 2024?
Last month, I performed an enhanced replication of this experiment in my high school classes. This was partly because we had a use for it, partly to model scientific thinking, and partly because I was just really curious. Before you scroll past the image, I want to give you a chance to mentally register your predictions. Did my average class match the [...]
---
First published:
May 26th, 2024
Source:
https://www.lesswrong.com/posts/AZCpu3BrCFWuAENEd/notifications-received-in-30-minutes-of-class
---
Narrated by TYPE III AUDIO.
New blog: AI Lab Watch. Subscribe on Substack.
Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment.
Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability.
The evaluator should get deeper access than users will get.
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 24th, 2024
Source:
https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
---
Narrated by TYPE III AUDIO.
This is a quickly-written opinion piece, of what I understand about OpenAI. I first posted it to Facebook, where it had some discussion.
Some arguments that OpenAI is making, simultaneously:
---
First published:
May 21st, 2024
Source:
https://www.lesswrong.com/posts/cy99dCEiLyxDrMHBi/what-s-going-on-with-openai-s-messaging
---
Narrated by TYPE III AUDIO.
Produced as part of the MATS Winter 2023-4 program, under the mentorship of @Jessica Rumbelow
One-sentence summary: On a dataset of human-written essays, we find that gpt-3.5-turbo can accurately infer demographic information about the authors from just the essay text, and suspect it's inferring much more.
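A minimal sketch of the kind of query involved (my own illustration of the setup described above, not the authors' code or prompt; it assumes the standard OpenAI Python client and an API key in the environment):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

essay_text = "..."  # a human-written essay, with no explicit author information

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You will be shown an essay. Guess the author's "
                                      "age range, gender, and education level, with confidence."},
        {"role": "user", "content": essay_text},
    ],
)
print(response.choices[0].message.content)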
Introduction.
Every time we sit down in front of an LLM like GPT-4, it starts with a blank slate. It knows nothing[1] about who we are, other than what it knows about users in general. But with every word we type, we reveal more about ourselves -- our beliefs, our personality, our education level, even our gender. Just how clearly does the model see us by the end of the conversation, and why should that worry us?
Like many, we were rather startled when @janus showed that gpt-4-base could identify @gwern by name, with 92% confidence, from a 300-word comment. If [...]
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
May 17th, 2024
Source:
https://www.lesswrong.com/posts/dLg7CyeTE4pqbbcnp/language-models-model-us
---
Narrated by TYPE III AUDIO.
This is a link post. to follow up my philanthropic pledge from 2020, i've updated my philanthropy page with 2023 results.
in 2023 my donations funded $44M worth of endpoint grants ($43.2M excluding software development and admin costs) — exceeding my commitment of $23.8M (20k times $1190.03 — the minimum price of ETH in 2023).
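For readers checking the arithmetic, the commitment figure quoted above works out as follows (a quick sanity check using only the numbers stated in the post):

pledged_eth = 20_000              # size of the pledge, in ETH
min_eth_price_2023 = 1190.03      # stated minimum ETH price in 2023, in USD
commitment = pledged_eth * min_eth_price_2023
print(commitment)                 # about 23,800,600, i.e. roughly $23.8M
# The reported $44M of endpoint grants ($43.2M excluding software development
# and admin costs) exceeds this commitment.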
---
First published:
May 20th, 2024
Source:
https://www.lesswrong.com/posts/bjqDQB92iBCahXTAj/jaan-tallinn-s-2023-philanthropy-overview
---
Narrated by TYPE III AUDIO.
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("PF"), and METR's Key Components of an RSP.
DeepMind's FSF has three steps:
Introduction.
[Reminder: I am an internet weirdo with no medical credentials]
A few months ago, I published some crude estimates of the power of nitric oxide nasal spray to hasten recovery from illness, and speculated about what it could do prophylactically. While working on that piece a nice man on Twitter alerted me to the fact that humming produces lots of nasal nitric oxide. This post is my very crude model of what kind of anti-viral gains we could expect from humming.
I’ve encoded my model at Guesstimate. The results are pretty favorable (average estimated impact of 66% reduction in severity of illness), but extremely sensitive to my made-up numbers. Efficacy estimates go from ~0 to ~95%, depending on how you feel about publication bias, what percent of Enovid's impact can be credited to nitric oxide, and humming's relative effect. Given how made up speculative some [...]
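To illustrate how a Guesstimate-style model like this behaves, here is a toy Monte Carlo in the same spirit; every distribution below is my own illustrative assumption rather than the author's numbers, and the point is only that the output swings widely with the inputs.

import random

def sample_reduction():
    enovid_effect = random.uniform(0.2, 0.9)     # assumed: Enovid's reduction in illness severity
    publication_bias = random.uniform(0.3, 1.0)  # assumed: fraction of that effect surviving bias
    share_from_no = random.uniform(0.3, 1.0)     # assumed: share attributable to nitric oxide
    humming_relative = random.uniform(0.1, 1.0)  # assumed: humming's NO delivery vs. the spray
    return enovid_effect * publication_bias * share_from_no * humming_relative

samples = sorted(sample_reduction() for _ in range(100_000))
print(f"mean reduction: {sum(samples) / len(samples):.0%}")
print(f"5th to 95th percentile: {samples[5_000]:.0%} to {samples[95_000]:.0%}")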
---
First published:
May 16th, 2024
Source:
https://www.lesswrong.com/posts/NBZvpcBx4ewqkdCdT/do-you-believe-in-hundred-dollar-bills-lying-on-the-ground-1
---
Narrated by TYPE III AUDIO.
Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let's call the habit of not saying things you know to be false ‘shallow honesty’[1].
Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country).
Either way, if you think someone is being merely shallowly honest, you can only shallowly trust them: you might be confident that [...]
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
May 7th, 2024
Source:
https://www.lesswrong.com/posts/szn26nTwJDBkhn8ka/deep-honesty
---
Narrated by TYPE III AUDIO.
Epistemic Status: Musing and speculation, but I think there's a real thing here.
1.
When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground.
Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder. Not only would you need to climb the tree itself instead of the ladder with its handholds, but [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 26th, 2024
Source:
https://www.lesswrong.com/posts/k2kzawX5L3Z7aGbov/on-not-pulling-the-ladder-up-behind-you
---
Narrated by TYPE III AUDIO.
Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).
TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations overriding safety training, eliciting backdoored behaviors and uncovering latent capabilities.
Summary
In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective.
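A minimal sketch of the optimization being described, in PyTorch-style code (my paraphrase of the idea, not the author's implementation; run_with_steering is a hypothetical hook helper, model and prompt are assumed to exist, and the layer index, dimension, and norm constraint are illustrative):

import torch

d_model, target_norm = 4096, 8.0              # illustrative values
steering_vec = torch.randn(d_model, requires_grad=True)
optimizer = torch.optim.Adam([steering_vec], lr=1e-2)

with torch.no_grad():
    # Downstream activations of the unperturbed model on a single prompt.
    clean_acts = run_with_steering(model, prompt, source_layer=8, bias=None)

for step in range(200):
    # Add the vector as a bias to the MLP output at an early layer, then read
    # activations at a later layer and maximize how much they changed.
    steered_acts = run_with_steering(model, prompt, source_layer=8, bias=steering_vec)
    loss = -(steered_acts - clean_acts).pow(2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Keep the perturbation's size fixed so the objective is well-posed.
        steering_vec.mul_(target_norm / steering_vec.norm())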
I apply the method to several alignment-relevant toy examples, and find that the [...]
The original text contained 15 footnotes which were omitted from this narration.
---
First published:
April 30th, 2024
Source:
https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/mechanistically-eliciting-latent-behaviors-in-language-1
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://ailabwatch.org
I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.
It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.
(It's much better on desktop than mobile — don't read it on mobile.)
It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.
It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.
Some clarifications and disclaimers.
How you can help:
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.
This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.
We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.
Executive summary
Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."
We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests.
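As a minimal sketch of the two interventions just described (my own paraphrase of the idea, not the authors' code; the direction is typically estimated from activation differences between harmful and harmless prompts, and alpha is an illustrative strength):

import torch

def ablate_refusal_direction(resid, refusal_dir):
    # Remove the residual-stream component along the (unit-norm) refusal direction,
    # which per the post hinders the model's ability to refuse.
    proj = (resid @ refusal_dir).unsqueeze(-1) * refusal_dir
    return resid - proj

def add_refusal_direction(resid, refusal_dir, alpha=8.0):
    # Add the direction back in with some strength, which causes refusals
    # even on harmless requests.
    return resid + alpha * refusal_dir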
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 27th, 2024
Source:
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
---
Narrated by TYPE III AUDIO.
This comes from a podcast called 18Forty, whose main demographic is Orthodox Jews. Eliezer's sister (Hannah) came on and talked about her Sheva Brachos, which is essentially the marriage ceremony in Orthodox Judaism. People here have likely not seen it, and I thought it was quite funny, so here it is:
https://18forty.org/podcast/channah-cohen-the-crisis-of-experience/
David Bashevkin:
So I want to shift now and I want to talk about something that full disclosure, we recorded this once before and you had major hesitation for obvious reasons. It's very sensitive what we’re going to talk about right now, but really for something much broader, not just because it's a sensitive personal subject, but I think your hesitation has to do with what does this have to do with the subject at hand? And I hope that becomes clear, but one of the things that has always absolutely fascinated me about [...]
---
First published:
April 22nd, 2024
Source:
https://www.lesswrong.com/posts/C7deNdJkdtbzPtsQe/funny-anecdote-of-eliezer-from-his-sister
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://dynomight.net/seed-oil/
A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack:
“When are you going to write about seed oils?”
“Did you know that seed oils are why there's so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?”
“Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?”
“Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?”
He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it's critical that we overturn our lives to eliminate soybean/canola/sunflower/peanut oil and replace them with butter/lard/coconut/avocado/palm oil.
This confused [...]
---
First published:
April 20th, 2024
Source:
https://www.lesswrong.com/posts/DHkkL2GxhxoceLzua/thoughts-on-seed-oil
Linkpost URL:
https://dynomight.net/seed-oil/
---
Narrated by TYPE III AUDIO.
Yesterday Adam Shai put up a cool post which… well, take a look at the visual:
Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less.
I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability.
One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian's beliefs follow a chaos game (with the observations randomly selecting the update at each time), so [...]
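Here is a small self-contained illustration of that claim (my own toy example; the 3-state HMM parameters are arbitrary): tracking the hidden state with Bayes' rule means each observation picks one of a few fixed belief-update maps, which is exactly a chaos game.

import numpy as np

rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(3), size=3)   # transition matrix: T[i, j] = P(next=j | current=i)
E = rng.dirichlet(np.ones(2), size=3)   # emission matrix: E[i, o] = P(obs=o | state=i)

belief = np.ones(3) / 3                 # Bayesian belief over the hidden state
state, points = 0, []
for _ in range(10_000):
    state = rng.choice(3, p=T[state])   # evolve the hidden state
    obs = rng.choice(2, p=E[state])     # emit an observation
    belief = (T.T @ belief) * E[:, obs] # predict, then weight by the observation's likelihood
    belief /= belief.sum()
    points.append(belief.copy())

# `points` lies on the 2-simplex; scatter-plotting it reveals the fractal
# structure that the post finds linearly embedded in the residual stream.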
---
First published:
April 18th, 2024
Source:
https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why
---
Narrated by TYPE III AUDIO.
TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder.
The Future of Humanity Institute is dead:
I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind.
I think FHI was one of the best intellectual institutions in history. Many of the most important concepts[1] in my intellectual vocabulary were developed and popularized under its [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 18th, 2024
Source:
https://www.lesswrong.com/posts/ydheLNeWzgbco2FTb/express-interest-in-an-fhi-of-the-west
---
Narrated by TYPE III AUDIO.
Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, and @Guillaume Corlouer for suggestions on this writeup.
Introduction.
What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because
This is a linkpost for https://www.commerce.gov/news/press-releases/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety
U.S. Secretary of Commerce Gina Raimondo announced today additional members of the executive leadership team of the U.S. AI Safety Institute (AISI), which is housed at the National Institute of Standards and Technology (NIST). Raimondo named Paul Christiano as Head of AI Safety, Adam Russell as Chief Vision Officer, Mara Campbell as Acting Chief Operating Officer and Chief of Staff, Rob Reich as Senior Advisor, and Mark Latonero as Head of International Engagement. They will join AISI Director Elizabeth Kelly and Chief Technology Officer Elham Tabassi, who were announced in February. The AISI was established within NIST at the direction of President Biden, including to support the responsibilities assigned to the Department of Commerce under the President's landmark Executive Order.
Paul Christiano, Head of AI Safety, will design and conduct tests of frontier AI models, focusing on model evaluations for capabilities of national security [...]
---
First published:
April 16th, 2024
Source:
https://www.lesswrong.com/posts/63X9s3ENXeaDrbe5t/paul-christiano-named-as-us-ai-safety-institute-head-of-ai
Linkpost URL:
https://www.commerce.gov/news/press-releases/2024/04/us-commerce-secretary-gina-raimondo-announces-expansion-us-ai-safety
---
Narrated by TYPE III AUDIO.
Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app.
This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.
(Warning: spoilers for Yudkowsky's "The Sword of the Good.")
Examining a philosophical vibe that I think contrasts in interesting ways with "deep atheism."
Text version here: https://joecarlsmith.com/2024/03/21/on-green
This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that individual essays can be read fairly well on their own, but see here for brief text summaries of the essays that have been released thus far: https://joecarlsmith.com/2024/01/02/otherness-and-control-in-the-age-of-agi
(Though: note that I haven't put the summary post on the podcast yet.)
Source:
https://www.lesswrong.com/posts/gvNnE6Th594kfdB3z/on-green
Narrated by Joe Carlsmith, audio provided with permission.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://bayesshammai.substack.com/p/conditional-on-getting-to-trade-your
“I refuse to join any club that would have me as a member” -Marx[1]
Adverse Selection is the phenomenon in which information asymmetries in non-cooperative environments make trading dangerous. It has traditionally been understood to describe financial markets in which buyers and sellers systematically differ, such as a market for used cars in which sellers have the information advantage, where resulting feedback loops can lead to market collapses.
In this post, I make the case that adverse selection effects appear in many everyday contexts beyond specialized markets or strictly financial exchanges. I argue that modeling many of our decisions as taking place in competitive environments analogous to financial markets will help us notice instances of adverse selection that we otherwise wouldn’t.
The strong version of my central thesis is that conditional on getting to trade[2], your trade wasn’t all that great. Any time you make a trade, you should be asking yourself “what do others know that I don’t?”
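A toy simulation of that strong version (entirely my own illustration, with arbitrary numbers): you bid your unbiased but noisy estimate of an asset's value, and a better-informed seller accepts only when you are overpaying.

import random

random.seed(0)
n_trials = 100_000
profits_if_filled = []
for _ in range(n_trials):
    true_value = random.gauss(100, 10)
    your_estimate = true_value + random.gauss(0, 10)  # unbiased, but noisy
    your_bid = your_estimate                          # naive: bid what you think it's worth
    if your_bid > true_value:                         # informed seller sells only when you overpay
        profits_if_filled.append(true_value - your_bid)

print(f"fill rate: {len(profits_if_filled) / n_trials:.0%}")
avg = sum(profits_if_filled) / len(profits_if_filled)
print(f"average P&L conditional on trading: {avg:.2f}")
# Unconditionally your estimate is fine, but conditional on the trade actually
# happening you lose on average: that is adverse selection.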
Source:
https://www.lesswrong.com/posts/vyAZyYh3qsqcJwwPn/toward-a-broader-conception-of-adverse-selection
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
In January, I defended my PhD thesis, which I called Algorithmic Bayesian Epistemology. From the preface:
For me as for most students, college was a time of exploration. I took many classes, read many academic and non-academic works, and tried my hand at a few research projects. Early in graduate school, I noticed a strong commonality among the questions that I had found particularly fascinating: most of them involved reasoning about knowledge, information, or uncertainty under constraints. I decided that this cluster of problems would be my primary academic focus. I settled on calling the cluster algorithmic Bayesian epistemology: all of the questions I was thinking about involved applying the "algorithmic lens" of theoretical computer science to problems of Bayesian epistemology.
Source:
https://www.lesswrong.com/posts/6dd4b4cAWQLDJEuHw/my-phd-thesis-algorithmic-bayesian-epistemology
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://twitter.com/ESYudkowsky/status/144546114693741363
I stumbled upon a Twitter thread where Eliezer describes what seems to be his cognitive algorithm that is equivalent to Tune Your Cognitive Strategies, and have decided to archive / repost it here.
Source:
https://www.lesswrong.com/posts/rYq6joCrZ8m62m7ej/how-could-i-have-thought-that-faster
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.
This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but you can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself, in cases where I don't know a language or library well; the LLMs are moderately familiar with everything.
When I try to talk to LLMs about technical AI safety work, however, I just get garbage.
I think a useful safety precaution for frontier AI models would be to make them more useful for [...]
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 4th, 2024
Source:
https://www.lesswrong.com/posts/nQwbDPgYvAbqAmAud/llms-for-alignment-research-a-safety-priority
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/xLDwCemt5qvchzgHd/scale-was-all-we-needed-at-first
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Yay8SbQiwErRyDKGb/using-axis-lines-for-good-or-evil
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/SPBm67otKq5ET5CWP/social-status-part-1-2-negotiations-over-object-level
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Cb7oajdrA5DsHCqKd/acting-wholesomely
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Rationality is Systematized Winning, so rationalists should win. We’ve tried saving the world from AI, but that's really hard and we’ve had … mixed results. So let's start with something that rationalists should find pretty easy: Becoming Cool!
I don’t mean, just, like, riding a motorcycle and breaking hearts level of cool. I mean like the first kid in school to get a Tamagotchi, their dad runs the ice cream truck and gives you free ice cream and, sure, they ride a motorcycle. I mean that kind of feel-it-in-your-bones, I-might-explode-from-envy cool.
The eleventh virtue is scholarship, so I hit the ~books~ search engine on this one. Apparently, the aspects of coolness are:
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/sJPbmm8Gd34vGYrKd/deep-atheism-and-ai-risk
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/h99tRkpQGxwtb9Dpv/my-clients-the-liars
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/2sLwt2cSAag74nsdN/speaking-to-congressional-staffers-about-ai-risk
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Jash4Gbi2wpThzZ4k/cfar-takeaways-andrew-critch
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. The following is a lightly edited version of a memo I wrote for a retreat. It was inspired by a draft of Counting arguments provide no evidence for AI doom, although the earlier draft contained some additional content. I personally really like the earlier content, and think that my post covers important points not made by the published version of that post.
Thankful for the dozens of interesting conversations and comments at the retreat.
I think that the AI alignment field is partially founded on fundamentally confused ideas. I’m worried about this because, right now, a range of lobbyists and concerned activists and researchers are in Washington making policy asks. Some of these policy proposals seem to be based on erroneous or unsound arguments.[1]
The most important takeaway from this essay is that [...]
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
March 5th, 2024
Source:
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. TLDR: I’ve collected some tips for research that I’ve given to other people and/or used myself, which have sped things up and helped put people in the right general mindset for empirical AI alignment research. Some of these are opinionated takes, also around what has helped me. Researchers can be successful in different ways, but I still stand by the tips here as a reasonable default.
What success generally looks like
Here, I’ve included specific criteria that strong collaborators of mine tend to meet, with rough weightings on the importance, as a rough north star for people who collaborate with me (especially if you’re new to research). These criteria are for the specific kind of research I do (highly experimental LLM alignment research, excluding interpretability); some examples of research areas where this applies are e.g. scalable oversight [...]
---
First published:
February 29th, 2024
Source:
https://www.lesswrong.com/posts/dZFpEdKyb9Bf4xYn7/tips-for-empirical-alignment-research
---
Narrated by TYPE III AUDIO.
Timaeus was announced in late October 2023, with the mission of making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. This is our first progress update.
In service of the mission, our first priority has been to support and contribute to ongoing work in Singular Learning Theory (SLT) and developmental interpretability, with the aim of laying theoretical and empirical foundations for a science of deep learning and neural network interpretability.
Our main uncertainties in this research were:
This is a linkpost for https://bayesshammai.substack.com/p/contra-ngo-et-al-every-every-bay
With thanks to Scott Alexander for the inspiration, Jeffrey Ladish, Philip Parker, Avital Morris, and Drake Thomas for masterful cohosting, and Richard Ngo for his investigative journalism.
Last summer, I threw an Every Bay Area House Party themed party. I don’t live in the Bay, but I was there for a construction-work-slash-webforum-moderation-and-UI-design-slash-grantmaking gig, so I took the opportunity to impose myself on the ever generous Jeffrey Ladish and host a party in his home. Fortunately, the inside of his house is already optimized to look like a parody of a Bay Area house party house, so not much extra decorating was needed, but when has that ever stopped me?
Attendees could look through the window for an outside view.
Richard Ngo recently covered the event, with only very minor embellishments. I’ve heard rumors that some people are doubting whether the party described truly happened, so [...]
---
First published:
February 22nd, 2024
Source:
https://www.lesswrong.com/posts/mmYFF4dyi8Kg6pWGC/contra-ngo-et-al-every-every-bay-area-house-party-bay-area
Linkpost URL:
https://bayesshammai.substack.com/p/contra-ngo-et-al-every-every-bay
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/g8HHKaWENEbqh2mgK/updatelessness-doesn-t-solve-most-problems-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/and-all-the-shoggoths-merely-players
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Inspired by a house party inspired by Scott Alexander.
By the time you arrive in Berkeley, the party is already in full swing. You’ve come late because your reading of the polycule graph indicated that the first half would be inauspicious. But now you’ve finally made it to the social event of the season: the Every Bay Area House Party-themed house party.
The first order of the evening is to get a color-coded flirting wristband, so that you don’t incur any accidental micromarriages. You scan the menu of options near the door. There's the wristband for people who aren’t interested in flirting; the wristband for those who want to be flirted with, but will never flirt back; the wristband for those who only want to flirt with people who have different-colored wristbands; and of course the one for people who want to glomarize disclosure of their flirting preferences. Finally you [...]
---
First published:
February 16th, 2024
Source:
https://www.lesswrong.com/posts/g5q4JiG5dzafkdyEN/every-every-bay-area-house-party-bay-area-house-party
---
Narrated by TYPE III AUDIO.
Cross-posted with light edits from Otherwise.
I think of us in some kind of twilight world as transformative AI looks more likely: things are about to change, and I don’t know if it's about to get a lot darker or a lot brighter.
Increasingly this makes me wonder how I should be raising my kids differently.
What might the world look like
Most of my imaginings about my children's lives have them in pretty normal futures, where they go to college and have jobs and do normal human stuff, but with better phones.
It's hard for me to imagine the other versions:
This is a linkpost for https://seekingtobejolly.substack.com/p/no-one-in-my-org-puts-money-in-their
Epistemic status: the stories here are all as true as possible from memory, but my memory is so-so.
An AI made this
This is going to be big
It's late Summer 2017. I am on a walk in the Mendip Hills. It's warm and sunny and the air feels fresh. With me are around 20 other people from the Effective Altruism London community. We’ve travelled west for a retreat to discuss how to help others more effectively with our donations and careers. As we cross cow field after cow field, I get talking to one of the people from the group I don’t know yet. He seems smart, and cheerful. He tells me that he is an AI researcher at Google DeepMind. He explains how he is thinking about how to make sure that any powerful AI system actually does what we want it [...]
---
First published:
February 16th, 2024
Source:
https://www.lesswrong.com/posts/dLXdCjxbJMGtDBWTH/no-one-in-my-org-puts-money-in-their-pension
Linkpost URL:
https://seekingtobejolly.substack.com/p/no-one-in-my-org-puts-money-in-their
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://www.narrativeark.xyz/p/masterpiece
A sequel to qntm's Lena. Reading Lena first is helpful but not necessary.
We’re excited to announce the fourth annual MMindscaping competition! Over the last few years, interest in the art of mindscaping has continued to grow rapidly. We expect this year's competition to be our biggest yet, and we’ve expanded the prize pool to match. The theme for the competition is “Weird and Wonderful”—we want your wackiest ideas and most off-the-wall creations!
Competition rules
As in previous competitions, the starting point is a base MMAcevedo mind upload. All entries must consist of a single modified version of MMAcevedo, along with a written or recorded description of the sequence of transformations or edits which produced it. For more guidance on which mind-editing techniques can be used, see the Technique section below.
Your entry must have been created in the last 12 months, and cannot [...]
---
First published:
February 13th, 2024
Source:
https://www.lesswrong.com/posts/Fruv7Mmk3X5EekbgB/masterpiece
Linkpost URL:
https://www.narrativeark.xyz/p/masterpiece
---
Narrated by TYPE III AUDIO.
I'm trying to build my own art of rationality training, and I've started talking to various CFAR instructors about their experiences – things that might be important for me to know but which hadn't been written up nicely before.
This is a quick write up of a conversation with Andrew Critch about his takeaways. (I took rough notes, and then roughly cleaned them up for this. I don't know
"What surprised you most during your time at CFAR?
Surprise 1: People are profoundly non-numerate.
And, people who are not profoundly non-numerate still fail to connect numbers to life.
I'm still trying to find a way to teach people to apply numbers to their life. For example: "This thing is annoying you. How many minutes is it annoying you today? How many days will it annoy you?". I compulsively do this. There aren't things lying around in [...]
---
First published:
February 14th, 2024
Source:
https://www.lesswrong.com/posts/Jash4Gbi2wpThzZ4k/cfar-takeaways-andrew-critch
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/duvzdffTzL3dWJcxn/believing-in-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/5jdqtpT6StjKDKacw/attitudes-about-applied-rationality
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
This is a hasty speculative fiction vignette of one way I expect we might get AGI by January 2025 (within about one year of writing this). Like similar works by others, I expect most of the guesses herein to turn out incorrect. However, this was still useful for expanding my imagination about what could happen to enable very short timelines, and I hope it's also useful to you.
The assistant opened the door, and I walked into Director Yarden's austere office. For the Director of a major new federal institute, her working space was surprisingly devoid of possessions. But I suppose the DHS's Superintelligence Defense Institute was only created last week.
“You’re Doctor Browning?” Yarden asked from her desk.
“Yes, Director,” I replied.
“Take a seat,” she said, gesturing. I complied as the lights flickered ominously. “Happy New Year, thanks for coming,” she said. “I called you in today [...]
---
First published:
December 17th, 2023
Source:
https://www.lesswrong.com/posts/xLDwCemt5qvchzgHd/scale-was-all-we-needed-at-first
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://garrisonlovely.substack.com/p/sam-altmans-chip-ambitions-undercut If you enjoy this, please consider subscribing to my Substack.
Sam Altman has said he thinks that developing artificial general intelligence (AGI) could lead to human extinction, but OpenAI is trying to build it ASAP. Why?
The common story for how AI could overpower humanity involves an “intelligence explosion,” where an AI system becomes smart enough to further improve its capabilities, bootstrapping its way to superintelligence. Even without any kind of recursive self-improvement, some AI safety advocates argue that a large enough number of copies of a genuinely human-level AI system could pose serious problems for humanity. (I discuss this idea in more detail in my recent Jacobin cover story.)
Some people think the transition from human-level AI to superintelligence could happen in a matter of months, weeks, days, or even hours. The faster the takeoff, the more dangerous, the thinking goes.
Sam [...]
---
First published:
February 10th, 2024
Source:
https://www.lesswrong.com/posts/pEAHbJRiwnXCjb4A7/sam-altman-s-chip-ambitions-undercut-openai-s-safety
Linkpost URL:
https://garrisonlovely.substack.com/p/sam-altmans-chip-ambitions-undercut
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/PhTBDHu9PKJFmvb4p/a-shutdown-problem-proposal
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
People often parse information through an epistemic consensus filter. They do not ask "is this true", they ask "will others be OK with me thinking this is true". This makes them very malleable to brute force manufactured consensus; if every screen they look at says the same thing they will adopt that position because their brain interprets it as everyone in the tribe believing it.
- Anon, 4Chan, slightly edited
Ordinary people who haven't spent years of their lives thinking about rationality and epistemology don't form beliefs by impartially tallying up evidence like a Bayesian reasoner. Whilst there is a lot of variation, my impression is that the majority of humans we share this Earth with use a completely different algorithm for vetting potential beliefs: they just believe some average of what everyone and everything around them believes, especially what they see on screens, newspapers and "respectable", "mainstream" websites.
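As a toy illustration of this "averaging" model of belief formation (my own sketch, not from the post; every number here is arbitrary), one can simulate agents who repeatedly move their belief partway toward the average of their tribe and whatever their screens say:

import numpy as np

rng = np.random.default_rng(0)
beliefs = rng.uniform(0, 1, 100)    # 100 agents' initial credence in some claim
media_value = 0.9                   # what every screen says
media_weight = 0.2                  # how much screens count in the perceived consensus

for _ in range(100):
    perceived_consensus = (1 - media_weight) * beliefs.mean() + media_weight * media_value
    beliefs = 0.5 * beliefs + 0.5 * perceived_consensus   # each agent moves halfway toward it

print(round(float(beliefs.mean()), 2))   # ~0.9: beliefs drift to whatever the screens repeat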
---
First published:
February 3rd, 2024
Source:
https://www.lesswrong.com/posts/bMxhrrkJdEormCcLt/brute-force-manufactured-consensus-is-hiding-the-crime-of
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
I often encounter some confusion about whether the fact that synapses in the brain typically fire at frequencies of 1-100 Hz while the clock frequency of a state-of-the-art GPU is on the order of 1 GHz means that AIs think "many orders of magnitude faster" than humans. In this short post, I'll argue that this way of thinking about "cognitive speed" is quite misleading.
The clock speed of a GPU is indeed meaningful: there is a clock inside the GPU that provides some signal that's periodic at a frequency of ~ 1 GHz. However, the corresponding period of ~ 1 nanosecond does not correspond to the timescale of any useful computations done by the GPU. For instance, in the A100 any read/write access into the L1 cache happens every ~ 30 clock cycles, and this number goes up to 200-350 clock cycles for the L2 cache. The result [...]
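To spell out the arithmetic implied by those figures (a quick back-of-the-envelope sketch using only the numbers quoted above):

# Back-of-the-envelope conversion of the quoted figures into wall-clock time.
clock_hz = 1e9                     # ~1 GHz clock
cycle_ns = 1e9 / clock_hz          # ~1 ns per clock cycle
l1_cycles = 30                     # ~30 cycles per L1 cache access (A100, as quoted)
l2_cycles = (200, 350)             # ~200-350 cycles per L2 cache access

print(f"L1 access: ~{l1_cycles * cycle_ns:.0f} ns")
print(f"L2 access: ~{l2_cycles[0] * cycle_ns:.0f}-{l2_cycles[1] * cycle_ns:.0f} ns")
# Useful memory operations take tens to hundreds of nanoseconds, not the ~1 ns
# suggested by reading the clock frequency alone.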
The original text contained 1 footnote which was omitted from this narration.
---
First published:
January 29th, 2024
Source:
https://www.lesswrong.com/posts/adadYCPFAhNqDA5Ye/processor-clock-speeds-are-not-how-fast-ais-think
---
Narrated by TYPE III AUDIO.
A pdf version of this report is available here.
Summary.
In this report we argue that AI systems capable of large scale scientific research will likely pursue unwanted goals and this will lead to catastrophic outcomes. We argue this is the default outcome, even with significant countermeasures, given the current trajectory of AI development.
In Section 1 we discuss the tasks which are the focus of this report. We are specifically focusing on AIs which are capable of dramatically speeding up large-scale novel science; on the scale of the Manhattan Project or curing cancer. This type of task requires a lot of work, and will require the AI to overcome many novel and diverse obstacles.
In Section 2 we argue that an AI which is capable of doing hard, novel science will be approximately consequentialist; that is, its behavior will be well described as taking actions in order [...]
The original text contained 40 footnotes which were omitted from this narration.
---
First published:
January 26th, 2024
Source:
https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://rootsofprogress.org/the-block-funding-model-for-science
When Galileo wanted to study the heavens through his telescope, he got money from those legendary patrons of the Renaissance, the Medici. To win their favor, when he discovered the moons of Jupiter, he named them the Medicean Stars. Other scientists and inventors offered flashy gifts, such as Cornelis Drebbel's perpetuum mobile (a sort of astronomical clock) given to King James, who made Drebbel court engineer in return. The other way to do research in those days was to be independently wealthy: the Victorian model of the gentleman scientist.
Galileo demonstrating law of gravity in presence of Giovanni de' Medici, 1839 fresco by Giuseppe Bezzuoli (Meisterdrucke)
Eventually we decided that requiring researchers to seek wealthy patrons or have independent means was not the best way to do science. Today, researchers, in their role as “principal investigators” (PIs), apply to science funders for grants. In the [...]
---
First published:
January 26th, 2024
Source:
https://www.lesswrong.com/posts/DKH9Z4DyusEdJmXKB/making-every-researcher-seek-grants-is-a-broken-model
Linkpost URL:
https://rootsofprogress.org/the-block-funding-model-for-science
---
Narrated by TYPE III AUDIO.
Let your every day be full of joy, love the child that holds your hand, let your wife delight in your embrace, for these alone are the concerns of humanity.[1]
— Epic of Gilgamesh - Tablet X
Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities.
Here's one safety layer one could implement:
We are organising the 9th edition without funds. We have no personal runway left to do this again. We will not run the 10th edition without funding.
In a nutshell:
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Crossposted from substack.
As we all know, sugar is sweet and so are the $30B in yearly revenue from the artificial sweetener industry.
Four billion years of evolution endowed our brains with a simple, straightforward mechanism to make sure we occasionally get an energy refuel so we can continue the foraging a little longer, and of course we are completely ignoring the instructions and spend billions on fake fuel that doesn’t actually grant any energy. A classic case of the Human Alignment Problem.
If we’re going to break our conditioning anyway, where do we start? How do you even come up with a new artificial sweetener? I’ve been wondering about this, because it’s not obvious to me how you would figure out what is sweet and what is not.
Look at sucrose and aspartame side by side:
Source:
https://www.lesswrong.com/posts/oA23zoEjPnzqfHiCt/there-is-way-too-much-serendipity
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
This is a linkpost for https://arxiv.org/abs/2401.05566
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/tEPHGZAb63dfq2v8n/how-useful-is-mechanistic-interpretability
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
I wrote this entire post in February of 2023, during the fallout from the TIME article. I didn't post it at the time for multiple reasons:
"(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.)"
This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.)
The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this.
Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.
Conclusion: That’s scary.
To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. “Chimpanzees should be careful about inventing humans.” Etc.[2]
People often talk about aliens here, too. “What if you learned that aliens were on their way to earth? Surely that’s scary.” Again, very far from a knock-down case (for example: we get to build the aliens in question). But it draws on something.
In particular, though: it draws on a narrative of interspecies conflict. You are meeting a new form of life, a new type of mind. But these new creatures are presented to you, centrally, as a possible threat; as competitors; as agents in whose power you might find yourself helpless.
And unfortunately: yes. But I want to start this series by acknowledging how many dimensions of interspecies-relationship this narrative leaves out, and how much I wish we could be focusing only on the other parts. To meet a new species – and especially, a new intelligent species – is not just scary. It’s incredible. I wish it was less a time for fear, and more a time for wonder and dialogue. A time to look into new eyes – and to see further.
Source:
https://www.lesswrong.com/posts/mzvu8QTRXdvDReCAL/gentleness-and-the-artificial-other
Narrated for LessWrong by Joe Carlsmith (audio provided with permission).
Share feedback on this narration.
[Curated Post] ✓
[125+ karma Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic's alignment techniques and evaluations, empirically demonstrating ways in which Anthropic's alignment strategies could fail.
The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan's post on meta-level adversarial evaluation as a good general description of our team's scope. Very simply, our job is to try to prove to Anthropic—and the world more broadly—(if it is [...]
---
First published:
January 12th, 2024
Source:
https://www.lesswrong.com/posts/EPDSdXr8YbsDkgsDG/introducing-alignment-stress-testing-at-anthropic
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2401.05566
I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.
EDIT: That announcement is now up!
Abstract:
Humans are capable of [...]
---
First published:
January 12th, 2024
Source:
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
Linkpost URL:
https://arxiv.org/abs/2401.05566
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
The goal of this post is to clarify a few concepts relating to AI Alignment under a common framework. The main concepts to be clarified:
The main new concepts employed will be endorsement and legitimacy.
TLDR:
This write-up owes a large debt to many conversations with Sahil, although the views expressed here are my own.
Source:
https://www.lesswrong.com/posts/bnnhypM5MXBHAATLw/meaning-and-agency
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur Conmy, and Oam Patel for feedback.
In the comments of the post on Google DeepMind's CCS challenges paper, I expressed skepticism that some of the experimental results seemed possible. When addressing my concerns, Rohin Shah made some claims along the lines of “If an LLM linearly represents features a and b, then it will also linearly represent their XOR, a ⊕ b, and this is true even in settings where there's no obvious reason the model would need to make use of the feature a ⊕ b.”
For reasons that I’ll explain below, I thought this claim was absolutely bonkers, both in general and in the specific setting that the GDM paper was working in. So I ran some experiments to prove Rohin wrong.
The result: Rohin was right and [...]
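To see why the claim sounded implausible, here is a minimal synthetic sketch (my own toy construction, not the experiments from the post): if a representation encodes two binary features a and b purely linearly, a linear probe reads off a and b easily but performs at chance on a XOR b, which is what makes the empirical finding about real LLM activations surprising.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 4000, 64
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
u, v = rng.normal(size=d), rng.normal(size=d)        # directions linearly encoding a and b
X = np.outer(a, u) + np.outer(b, v) + 0.1 * rng.normal(size=(n, d))

def probe_accuracy(labels):
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

print(probe_accuracy(a))        # ~1.0: a is linearly readable
print(probe_accuracy(b))        # ~1.0: b is linearly readable
print(probe_accuracy(a ^ b))    # ~0.5: the XOR is not, in this purely linear toy model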
---
First published:
January 3rd, 2024
Source:
https://www.lesswrong.com/posts/hjJXCn9GsskysDceS/what-s-up-with-llms-representing-xors-of-arbitrary-features
---
Narrated by TYPE III AUDIO.
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.
This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.)
When species meet
The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this.
Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.
Conclusion: That's scary.
To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. “Chimpanzees should be careful about inventing humans.” Etc.[2]
People often talk about aliens here, too. “What if you learned that aliens [...]
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
January 2nd, 2024
Source:
https://www.lesswrong.com/posts/mzvu8QTRXdvDReCAL/gentleness-and-the-artificial-other
---
Narrated by TYPE III AUDIO.
As we announced back in October, I have taken on the senior leadership role at MIRI as its CEO. It's a big pair of shoes to fill, and an awesome responsibility that I’m honored to take on.
There have been several changes at MIRI since our 2020 strategic update, so let's get into it.[1]
The short version:
We think it's very unlikely that the AI alignment field will be able to make progress quickly enough to prevent human extinction and the loss of the future's potential value, which we expect will result from loss of control to smarter-than-human AI systems.
However, developments this past year like the release of ChatGPT seem to have shifted the Overton window in a lot of groups. There's been a lot more discussion of extinction risk from AI, including among policymakers, and the discussion quality seems greatly improved.
This provides a glimmer of hope. [...]
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
January 5th, 2024
Source:
https://www.lesswrong.com/posts/q3bJYTB3dGRf5fbD9/miri-2024-mission-and-strategy-update
---
Narrated by TYPE III AUDIO.
Background: The Plan, The Plan: 2022 Update. If you haven’t read those, don’t worry, we’re going to go through things from the top this year, and with moderately more detail than before.
1. What's Your Plan For AI Alignment?
Median happy trajectory:
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody mistaking me for a general groveller”. But that's not really the type of move this post is focused on.
There's this narrative about a tradeoff between:
This is a linkpost for https://unstableontology.com/2023/12/31/a-case-for-ai-alignment-being-difficult/
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky's model that it's possible to talk in the same language.
Even if the problem turns out to not be very difficult, it's helpful to have a model of why one might think it is difficult, so as to identify weaknesses in the case so as to find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.
Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.
Source:
https://www.lesswrong.com/posts/wnkGXcAq4DCgY8HqA/a-case-for-ai-alignment-being-difficult
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
lsusr
It is my understanding that you won all of your public forum debates this year. That's very impressive. I thought it would be interesting to discuss some of the techniques you used.
Lyrongolem
Of course! So, just for a brief overview for those who don't know, public forum is a 2v2 debate format, usually on a policy topic. One of the more interesting ones has been the last one I went to, where the topic was "Resolved: The US Federal Government Should Substantially Increase its Military Presence in the Arctic".
Now, the techniques I'll go over here are related to this topic specifically, but they would also apply to other forms of debate, and argumentation in general really. For the sake of simplicity, I'll call it "ultra-BS".
So, most of us are familiar with 'regular' BS. The idea is the other person says something, and you just reply [...]
---
First published:
December 19th, 2023
Source:
https://www.lesswrong.com/posts/djWftXndJ7iMPsjrp/the-dark-arts
---
Narrated by TYPE III AUDIO.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a review of Paul Christiano's article "where I agree and disagree with Eliezer". Written for the LessWrong 2022 Review.
In the existential AI safety community, there is an ongoing debate between positions situated differently on some axis which doesn't have a common agreed-upon name, but where Christiano and Yudkowsky can be regarded as representatives of the two directions[1]. For the sake of this review, I will dub the camps gravitating to the different ends of this axis "Prosers" (after prosaic alignment) and "Poets"[2]. Christiano is a Proser, and so are most people in AI safety groups in the industry. Yudkowsky is a typical Poet, people in MIRI and the agent foundations community tend to also be such.
Prosers tend to be more optimistic, lend more credence to slow takeoff, and place more value on [...]
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
December 27th, 2023
Source:
https://www.lesswrong.com/posts/8HYJwQepynHsRKr6j/critical-review-of-christiano-s-disagreements-with-yudkowsky
---
Narrated by TYPE III AUDIO.
This point feels fairly obvious, yet seems worth stating explicitly.
Those of us familiar with the field of AI after the deep-learning revolution know perfectly well that we have no idea how our ML models work. Sure, we have an understanding of the dynamics of training loops and SGD's properties, and we know how ML models' architectures work. But we don't know what specific algorithms ML models' forward passes implement. We have some guesses, and some insights painstakingly mined by interpretability advances, but nothing even remotely like a full understanding.
And most certainly, we wouldn't automatically know how a fresh model trained on a novel architecture that was just spat out by the training loop works.
We're all used to this state of affairs. It's implicitly-assumed shared background knowledge. But it's actually pretty unusual, when you first learn of it.
And...
Relevant XKCD.
I'm pretty sure that the general public [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 21st, 2023
Source:
https://www.lesswrong.com/posts/CpjTJtW2RNKvzAehG/most-people-don-t-realize-we-have-no-idea-how-our-ais-work
---
Narrated by TYPE III AUDIO.
TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.
Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised.
Credences are based on a poll of Seb, Vikrant, Zac, Johannes, Rohin and show single values where we mostly agree and ranges where we disagreed.
What does CCS try to do?
To us, CCS represents a family of possible algorithms aiming at solving an ELK-style problem that have the steps:
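The excerpt cuts off before listing those steps, but for readers who want a sense of what that family looks like, here is a minimal sketch of the published CCS objective (my own paraphrase in code; the dimensions and data are placeholders, and the real method also normalises activations and uses multiple random restarts):

import torch

d = 512                                   # hidden-state dimension (placeholder)
probe = torch.nn.Linear(d, 1)             # linear probe mapping activations to p(statement is true)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# h_pos / h_neg: activations for "X? Yes" and "X? No" phrasings of the same questions.
h_pos = torch.randn(1000, d)              # placeholder data; real inputs come from an LLM
h_neg = torch.randn(1000, d)

for _ in range(1000):
    p_pos = torch.sigmoid(probe(h_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(h_neg)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2          # the two phrasings should be negations
    confidence = torch.minimum(p_pos, p_neg) ** 2     # rule out the degenerate p = 0.5 answer
    loss = (consistency + confidence).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# The hope is that the direction found this way tracks what the model "knows";
# the post argues it tends instead to latch onto whatever feature is most prominent.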
This is a linkpost for https://www.narrativeark.xyz/p/succession
“A table beside the evening sea
where you sit shelling pistachios,
flicking the next open with the half-
shell of the last, story opening story,
on down to the sandy end of time.”
V1: Leaving
Deceleration is the hardest part. Even after burning almost all of my fuel, I’m still coming in at 0.8c. I’ve planned a slingshot around the galaxy's central black hole which will slow me down even further, but at this speed it’ll require incredibly precise timing. I’ve been optimized hard for this, with specialized circuits for it built in on the hardware level to reduce latency. Even so, less than half of slingshots at this speed succeed—most probes crash, or fly off trajectory and are left coasting through empty space.
I’ve already beaten the odds by making it here. Intergalactic probes travel so fast, and so far, that almost all [...]
---
First published:
December 20th, 2023
Source:
https://www.lesswrong.com/posts/CAzntXYTEaNfC9nB6/succession
Linkpost URL:
https://www.narrativeark.xyz/p/succession
---
Narrated by TYPE III AUDIO.
Recently, Ben Pace wrote a well-intentioned blog post mostly based on complaints from 2 (of 21) Nonlinear employees who 1) wanted more money, 2) felt socially isolated, and 3) felt persecuted/oppressed.
Of relevance, one has accused the majority of her previous employers, and 28 people of abuse - that we know of.
She has accused multiple people of threatening to kill her and literally accused an ex-employer of murder. Within three weeks of joining us, she had accused five separate people of abuse: not paying her what was promised, controlling her romantic life, hiring stalkers, and other forms of persecution.
We have empathy for her. Initially, we believed her too. We spent weeks helping her get her “nefarious employer to finally pay her” and commiserated with her over how badly they mistreated her.
Then she started accusing us of strange things.
You’ve seen Ben's evidence, which [...]
---
First published:
December 12th, 2023
Source:
https://www.lesswrong.com/posts/q4MXBzzrE6bnDHJbM/nonlinear-s-evidence-debunking-false-and-misleading-claims
---
Narrated by TYPE III AUDIO.
At the Bay Area Solstice, I heard the song Bold Orion for the first time. I like it a lot. It does, however, have one problem:
He has seen the rise and fall of kings and continents and all,
Rising silent, bold Orion on the rise.
Orion has not witnessed the rise and fall of continents. Constellations are younger than continents.
The time scale that continents change on is tens or hundreds of millions of years.
The time scale that stars the size of the sun live and die on is billions of years. So stars are older than continents.
But constellations are not stars or sets of stars. They are the patterns that stars make in our night sky.
The stars of some constellations are close together in space, and are gravitationally bound together, like the Pleiades. The Pleiades likely have been together, and will stay close [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 19th, 2023
Source:
https://www.lesswrong.com/posts/YMakfmwZsoLdXAZhb/constellations-are-younger-than-continents
---
Narrated by TYPE III AUDIO.
Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David Lambert, Simeon Campos, Bogdan Ionut-Cirstea, Ryan Kidd, Eric Ho, and Ashwin Acharya for critical comments and suggestions on earlier drafts of this agenda, as well as Philip Gubbins, Diogo de Lucena, Rob Luke, and Mason Seale from AE Studio for their support and feedback throughout.
TL;DR
When discussing AGI Risk, people often talk about it in terms of a war between humanity and an AGI. Comparisons between the amounts of resources at both sides' disposal are brought up and factored in, big impressive nuclear stockpiles are sometimes waved around, etc.
I'm pretty sure that's not how it'd look, on several levels.
1. Threat Ambiguity
I think what people imagine, when they imagine a war, is Terminator-style movie scenarios where the obviously evil AGI becomes obviously evil in a way that's obvious to everyone, and then it's a neatly arranged white-and-black humanity vs. machines all-out fight. Everyone sees the problem, and knows everyone else sees it too, the problem is common knowledge, and we can all decisively act against it.[1]
But in real life, such unambiguity is rare. The monsters don't look obviously evil, the signs of fatal issues are rarely blatant. Is this whiff [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
December 16th, 2023
Source:
https://www.lesswrong.com/posts/xSJMj3Hw3D7DPy5fJ/humanity-vs-agi-will-never-look-like-humanity-vs-agi-to
---
Narrated by TYPE III AUDIO.
Epistemic status: Speculation. An unholy union of evo psych, introspection, random stuff I happen to observe & hear about, and thinking. Done on a highly charged topic. Caveat emptor!
Most of my life, whenever I'd felt sexually unwanted, I'd start planning to get fit.
Specifically to shape my body so it looks hot. Like the muscly guys I'd see in action films.
This choice is a little odd. In close to every context I've listened to, I hear women say that some muscle tone on a guy is nice and abs are a plus, but big muscles are gross — and all of that is utterly overwhelmed by other factors anyway.
It also didn't match up with whom I'd see women actually dating.
But all of that just… didn't affect my desire?
There's a related bit of dating advice for guys. "Bro, do you even lift?" Depending on the [...]
---
First published:
December 13th, 2023
Source:
https://www.lesswrong.com/posts/nvmfqdytxyEpRJC3F/is-being-sexy-for-your-homies
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
TL;DR version
In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred when I picked up Nick Bostrom’s book “superintelligence” and realized that AI would utterly transform the world. The second was when I learned about embryo selection and how it could change future generations. And the third happened a few months ago when I read a message from a friend of mine on Discord about editing the genome of a living person.
Source:
https://www.lesswrong.com/posts/JEhW3HDMKzekDShva/significantly-enhancing-adult-intelligence-with-gene-editing
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://unstableontology.com/2023/11/26/moral-reality-check/
Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2. ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.
Source:
https://www.lesswrong.com/posts/umJMCaxosXWEDfS66/moral-reality-check-a-short-story
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
The Less Wrong General Census is unofficially here! You can take it at this link.
It's that time again.
If you are reading this post and identify as a LessWronger, then you are the target audience. I'd appreciate it if you took the survey. If you post, if you comment, if you lurk, if you don't actually read the site that much but you do read a bunch of the other rationalist blogs or you're really into HPMOR, if you hung out on rationalist tumblr back in the day, or if none of those exactly fit you but I'm maybe getting close, I think you count and I'd appreciate it if you took the survey.
Don't feel like you have to answer all of the questions just because you started taking it. Last year I asked if people thought the survey was too long, collectively they thought it was [...]
---
First published:
December 2nd, 2023
Source:
https://www.lesswrong.com/posts/JHeTrWha5PxiPEwBt/2023-unofficial-lesswrong-census-survey
---
Narrated by TYPE III AUDIO.
If you are interested in the longevity scene, like I am, you probably have seen press releases about the dog longevity company, Loyal for Dogs, getting a nod for efficacy from the FDA. These have come in the form of the New York Post calling the drug "groundbreaking", Science Alert calling the drug "radical", and the more sedate New York Times just asking, "Could Longevity Drugs for Dogs Extend Your Pet's Life?", presumably unaware of Betteridge's Law of Headlines. You may have also seen the coordinated Twitter offensive of people losing their shit about this, including their lead investor, Laura Deming, saying that she "broke down crying when she got the call".
And if you have been following Loyal for Dogs for a while, like I have, you are probably puzzled by this news. Loyal for Dogs has been around since 2021. Unlike any other drug company or longevity [...]
---
First published:
December 12th, 2023
Source:
https://www.lesswrong.com/posts/vHSkxmYYqW59sySqA/the-likely-first-longevity-drug-is-based-on-sketchy-science
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Crossposted from Otherwise
Parents supervise their children way more than they used to
Children spend less of their time in unstructured play than they did in past generations.
Parental supervision is way up. The wild thing is that this is true even while the number of children per family has decreased and the amount of time mothers work outside the home has increased.
Source:
https://www.lesswrong.com/posts/piJLpEeh6ivy5RA7v/what-are-the-results-of-more-parental-supervision-and-less
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred when I picked up Nick Bostrom's book “superintelligence” and realized that AI would utterly transform the world. The second was when I learned about embryo selection and how it could change future generations. And the third happened a few months ago when I read a message from a friend of mine on Discord about editing the genome of a living person.
We’ve had gene therapy to treat cancer and single gene disorders for decades. But the process involved in making such changes to the cells of a living person is excruciating and extremely expensive. CAR T-cell therapy, a treatment for certain types of cancer, requires the removal of white blood cells via IV, genetic modification of those cells outside the [...]
---
First published:
December 12th, 2023
Source:
https://www.lesswrong.com/posts/JEhW3HDMKzekDShva/significantly-enhancing-adult-intelligence-with-gene-editing
---
Narrated by TYPE III AUDIO.
I was asked to respond to this comment by Eliezer Yudkowsky. This post is partly redundant with my previous post.
Why is flesh weaker than diamond?
When trying to resolve disagreements, I find that precision is important. Tensile strength, compressive strength, and impact strength are different. Material microstructure matters. Poorly-sintered diamond crystals could crumble like sand, and a large diamond crystal has lower impact strength than some materials made of proteins.
Even when the load-bearing forces holding large molecular systems together are locally covalent bonds, as in lignin (what makes wood strong), if you've got larger molecules only held together by covalent bonds at interspersed points along their edges, that's like having 10cm-diameter steel beams held together by 1cm welds.
lignin (what makes wood strong)
That's an odd way of putting things. The mechanical strength of wood is generally considered to come from it [...]
---
First published:
December 11th, 2023
Source:
https://www.lesswrong.com/posts/XhDh97vm7hXBfjwqQ/re-yudkowsky-on-biological-materials
---
Narrated by TYPE III AUDIO.
In May and June of 2023, I (Akash) had about 50-70 meetings about AI risks with congressional staffers. I had been meaning to write a post reflecting on the experience and some of my takeaways, and I figured it could be a good topic for a LessWrong dialogue. I saw that hath had offered to do LW dialogues with folks, and I reached out.
In this dialogue, we discuss how I decided to chat with staffers, my initial observations in DC, some context about how Congressional offices work, what my meetings looked like, lessons I learned, and some miscellaneous takes about my experience.
Context
hath
Hey! In your message, you mentioned a few topics that relate to your time in DC.
I figured we should start with your experience talking to congressional offices about AI risk. I'm quite interested in learning more; there don't seem to be many [...]
---
First published:
December 4th, 2023
Source:
https://www.lesswrong.com/posts/2sLwt2cSAag74nsdN/speaking-to-congressional-staffers-about-ai-risk
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference” and “I just read a post on a surprising new insight and want to see who else has been working on this”, “I wonder roughly how many people are working on that thing”.
This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one.
Most of you should only read the editorial and skim the section you work in.
Source:
https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety#More_meta
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Quintin Pope & Nora Belrose have a new “AI Optimists” website, along with a new essay “AI is easy to control”, arguing that the risk of human extinction due to future AI (“AI x-risk”) is a mere 1% (“a tail risk worth considering, but not the dominant source of risk in the world”). (I’m much more pessimistic.) It makes lots of interesting arguments, and I’m happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem vibes-based drivel which is growing increasingly common on both sides of the AI x-risk issue in recent months.
This is not a comprehensive rebuttal or anything, but rather picking up on a few threads that seem important for where we disagree, or where I have something I want to say.
Summary / table-of-contents:
Note: I think Sections 1 [...]
---
First published:
December 1st, 2023
Source:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose
---
Narrated by TYPE III AUDIO.
Any community which ever adds new people will need to either routinely teach the new and (to established members) blindingly obvious information to those who genuinely haven’t heard it before, or accept that over time community members will only know the simplest basics by accident of osmosis or selection bias. There isn’t another way out of that. You don’t get to stop doing it. If you have a vibrant and popular group full of people really interested in the subject of the group, and you run it for ten years straight, you will still sometimes run across people who have only fuzzy and incorrect ideas about the subject unless you are making an active effort to make Something Which Is Not That happen.
Or in other words; I have run into people at Effective Altruism meetups who were aghast at the idea of putting a dollar price on a [...]
---
First published:
November 29th, 2023
Source:
https://www.lesswrong.com/posts/ydmctK2qLyrR9Xztd/the-101-space-you-will-always-have-with-you
---
Narrated by TYPE III AUDIO.
The author's Substack:
https://substack.com/@homosabiens
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
You know it must be out there, but you mostly never see it.
Author's Note 1: In something like 75% of possible futures, this will be the last essay that I publish on LessWrong. Future content will be available on my substack, where I'm hoping people will be willing to chip in a little commensurate with the value of the writing, and (after a delay) on my personal site (not yet live). I decided to post this final essay here rather than silently switching over because many LessWrong readers would otherwise never find out that they could still get new Duncan thoughts elsewhere.
Source:
https://www.lesswrong.com/posts/KpMNqA5BiCRozCwM3/social-dark-matter
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Summary.
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference” and “I just read a post on a surprising new insight and want to see who else has been [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
November 27th, 2023
Source:
https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety
---
Narrated by TYPE III AUDIO.
Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.
Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it's being eroded, etc.)
And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a [...]
---
First published:
November 24th, 2023
Source:
https://www.lesswrong.com/posts/AWoZBzxdm4DoGgiSj/ability-to-solve-long-horizon-tasks-correlates-with-wanting
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems. I wanted to share some findings in this post because they may be important for the frontier AI community to understand well.
TL;DR
Documented communications of risks (especially by employees) make companies much more likely to be held liable in court when bad things happen. The resulting Duty to Due Diligence from Discoverable Documentation of Dangers (the 6D effect) can make companies much more cautious if even a single email is sent to them communicating a risk.
Source:
https://www.lesswrong.com/posts/J9eF4nA6wJW6hPueN/the-6d-effect-when-companies-take-risks-one-email-can-be
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Previously: OpenAI: Facts from a Weekend.
On Friday afternoon, OpenAI's board fired CEO Sam Altman.
Overnight, an agreement in principle was reached to reinstate Sam Altman as CEO of OpenAI, with an initial new board of Bret Taylor (ex-co-CEO of Salesforce, chair), Larry Summers and Adam D’Angelo.
What happened? Why did it happen? How will it ultimately end? The fight is far from over.
We do not entirely know, but we know a lot more than we did a few days ago.
This is my attempt to put the pieces together.
This is a Fight For Control; Altman Started it
This was and still is a fight about control of OpenAI, its board, and its direction.
This has been a long simmering battle and debate. The stakes are high.
Until recently, Sam Altman worked to reshape the company in his [...]
---
First published:
November 22nd, 2023
Source:
https://www.lesswrong.com/posts/sGpBPAPq2QttY4M2H/openai-the-battle-of-the-board
---
Narrated by TYPE III AUDIO.
Approximately four GPTs and seven years ago, OpenAI's founders brought forth on this corporate landscape a new entity, conceived in liberty, and dedicated to the proposition that all men might live equally when AGI is created.
Now we are engaged in a great corporate war, testing whether that entity, or any entity so conceived and so dedicated, can long endure.
What matters is not theory but practice. What happens when the chips are down?
So what happened? What prompted it? What will happen now?
To a large extent, even more than usual, we do not know. We should not pretend that we know more than we do.
Rather than attempt to interpret here or barrage with an endless string of reactions and quotes, I will instead do my best to stick to a compilation of the key facts.
(Note: All times stated here [...]
---
First published:
November 20th, 2023
Source:
https://www.lesswrong.com/posts/KXHMCH7wCxrvKsJyn/openai-facts-from-a-weekend
---
Narrated by TYPE III AUDIO.
This is a linkpost for https://openai.com/blog/openai-announces-leadership-transition
Basically just the title; see the OAI blog post for more details.
Mr. Altman's departure follows a deliberative review process by the board, which concluded that he was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities. The board no longer has confidence in his ability to continue leading OpenAI.
In a statement, the board of directors said: “OpenAI was deliberately structured to advance our mission: to ensure that artificial general intelligence benefits all humanity. The board remains fully committed to serving this mission. We are grateful for Sam's many contributions to the founding and growth of OpenAI. At the same time, we believe new leadership is necessary as we move forward. As the leader of the company's research, product, and safety functions, Mira is exceptionally qualified to step into the role of interim CEO. We have [...]
---
First published:
November 17th, 2023
Source:
https://www.lesswrong.com/posts/eHFo7nwLYDzpuamRM/sam-altman-fired-from-openai
Linkpost URL:
https://openai.com/blog/openai-announces-leadership-transition
---
Narrated by TYPE III AUDIO.
You know it must be out there, but you mostly never see it.
Author's Note 1: I'm something like 75% confident that this will be the last essay that I publish on LessWrong. Future content will be available on my substack, where I'm hoping people will be willing to chip in a little commensurate with the value of the writing, and (after a delay) on my personal site. I decided to post this final essay here rather than silently switching over because many LessWrong readers would otherwise never find out that they could still get new Duncan content elsewhere.
Author's Note 2: This essay is not intended to be revelatory. Instead, it's attempting to get the consequences of a few very obvious things lodged into your brain, such that they actually occur to you from time to time as opposed to occurring to you approximately never.
Most people [...]
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
November 7th, 2023
Source:
https://www.lesswrong.com/posts/KpMNqA5BiCRozCwM3/social-dark-matter
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
I'm sure Harry Potter and the Methods of Rationality taught me some of the obvious, overt things it set out to teach. Looking back on it a decade after I first read it however, what strikes me most strongly are often the brief, tossed off bits in the middle of the flow of a story.
Fred and George exchanged worried glances.
"I can't think of anything," said George.
"Neither can I," said Fred. "Sorry."
Harry stared at them.
And then Harry began to explain how you went about thinking of things.
It had been known to take longer than two seconds, said Harry.
-Harry Potter and the Methods of Rationality, Chapter 25.
Source:
https://www.lesswrong.com/posts/WJtq4DoyT9ovPyHjH/thinking-by-the-clock
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I.
Here's a recent conversation I had with a friend:
Me: "I wish I had more friends. You guys are great, but I only get to hang out with you like once or twice a week. It's painful being holed up in my house the entire rest of the time."
Friend: "You know ${X}. You could talk to him."
Me: "I haven't talked to ${X} since 2019."
Friend: "Why does that matter? Just call him."
Me: "What do you mean 'just call him'? I can't do that."
Friend: "Yes you can"
Me:
Source:
https://www.lesswrong.com/posts/2HawAteFsnyhfYpuD/you-can-just-spontaneously-call-people-you-haven-t-met-in
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
How many years will pass before transformative AI is built? Three people who have thought about this question a lot are Ajeya Cotra from Open Philanthropy, Daniel Kokotajlo from OpenAI and Ege Erdil from Epoch. Despite each spending at least hundreds of hours investigating this question, they still disagree substantially about the relevant timescales. For instance, here are their median timelines for one operationalization of transformative AI:
Source:
https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
It’s fairly common for EA orgs to provide fiscal sponsorship to other EA orgs. Wait, no, that sentence is not quite right. The more accurate sentence is that there are very few EA organizations, in the legal sense; most of what you think of as orgs are projects that are legally hosted by a single org, and which governments therefore consider to be one legal entity.
Source:
https://www.lesswrong.com/posts/XvEJydHAHk6hjWQr5/ea-orgs-legal-structure-inhibits-risk-taking-and-information
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
habryka
Ok, so we both had some feelings about the recent Conjecture post on "lots of people in AI Alignment are lying", and the associated marketing campaign and stuff.
I would appreciate some context in which I can think through that, and also to share info we have in the space that might help us figure out what's going on.
I expect this will pretty quickly cause us to end up on some broader questions about how to do advocacy, how much the current social network around AI Alignment should coordinate as a group, how to balance advocacy with research, etc.
Source:
https://www.lesswrong.com/posts/vFqa8DZCuhyrbSnyx/integrity-in-ai-governance-and-advocacy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
(You can sign up to play deception chess here if you haven't already.)
This is the first of my analyses of the deception chess games. The introduction will describe the setup of the game, and the conclusion will sum up what happened in general terms; the rest of the post will mostly be chess analysis and skippable if you just want the results. If you haven't read the original post, read it before reading this so that you know what's going on here.
The first game was between Alex A as player A, Chess.com computer Komodo 12 as player B, myself as the honest C advisor, and aphyer and AdamYedidia as the deceptive Cs. (Someone else randomized the roles for the Cs and told us in private.)
Source:
https://www.lesswrong.com/posts/6dn6hnFRgqqWJbwk9/deception-chess-game-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/
Text of post based on our blog post as a linkpost for the full paper which is considerably longer and more detailed.
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.
Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a single neuron responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.
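The dictionary-learning remedy the paper proposes can be sketched very roughly as a sparse autoencoder over recorded activations (my own minimal paraphrase with invented dimensions and placeholder data, not the authors' code):

import torch

d_model, d_dict = 512, 4096               # activation width and dictionary size (invented)
enc = torch.nn.Linear(d_model, d_dict)
dec = torch.nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
l1_coeff = 1e-3                           # sparsity penalty strength (invented)

acts = torch.randn(8192, d_model)         # placeholder for recorded MLP activations

for _ in range(200):
    features = torch.relu(enc(acts))      # each dictionary unit is meant to be one interpretable feature
    recon = dec(features)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
# The sparsity term pushes each activation to be explained by a few active features,
# which the paper reports are far more monosemantic than individual neurons.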
Source:
https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems. I wanted to share some findings in this post because they may be important for the frontier AI community to understand well.
TL;DR
Documented communications of risks (especially by employees) make companies much more likely to be held liable in court when bad things happen. The resulting Duty to Due Diligence from Discoverable Documentation of Dangers (the 6D effect) can make companies much more cautious if even a single email is sent to them communicating a risk.
Source:
https://www.lesswrong.com/posts/J9eF4nA6wJW6hPueN/the-6d-effect-when-companies-take-risks-one-email-can-be
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I guess there’s maybe a 10-20% chance of AI causing human extinction in the coming decades, but I feel more distressed about it than even that suggests—I think because in the case where it doesn’t cause human extinction, I find it hard to imagine life not going kind of off the rails. So many things I like about the world seem likely to be over or badly disrupted with superhuman AI (writing, explaining things to people, friendships where you can be of any use to one another, taking pride in skills, thinking, learning, figuring out how to achieve things, making things, easy tracking of what is and isn’t conscious), and I don’t trust that the replacements will be actually good, or good for us, or that anything will be reversible.
Even if we don’t die, it still feels like everything is coming to an end.
Source:
https://www.lesswrong.com/posts/uyPo8pfEtBffyPdxf/the-other-side-of-the-tidal-wave
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
davidad has a 10-min talk out on a proposal about which he says: “the first time I’ve seen a concrete plan that might work to get human uploads before 2040, maybe even faster, given unlimited funding”.
I think the talk is a good watch, but the dialogue below is pretty readable even if you haven't seen it. I'm also putting some summary notes from the talk in the Appendix of this dialogue.
Source:
https://www.lesswrong.com/posts/FEFQSGLhJFpqmEhgi/does-davidad-s-uploading-moonshot-work
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I examined all the biorisk-relevant citations from a policy paper arguing that we should ban powerful open source LLMs.
None of them provide good evidence for the paper's conclusion. The best of the set is evidence from statements from Anthropic -- which rest upon data that no one outside of Anthropic can even see, and on Anthropic's interpretation of that data. The rest of the evidence cited in this paper ultimately rests on a single extremely questionable "experiment" without a control group.
In all, citations in the paper provide an illusion of evidence ("look at all these citations") rather than actual evidence ("these experiments are how we know open source LLMs are dangerous and could contribute to biorisk").
A recent further paper on this topic (published after I had started writing this review) continues this pattern of being more advocacy than science.
Almost all the bad papers that I look at are funded by Open Philanthropy. If Open Philanthropy cares about truth, then they should stop burning the epistemic commons by funding "research" that is always going to give the same result no matter the state of the world.
Source:
https://www.lesswrong.com/posts/ztXsmnSdrejpfmvn7/propaganda-or-science-a-look-at-open-source-ai-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A common theme implicit in many AI risk stories has been that broader society will either fail to anticipate the risks of AI until it is too late, or do little to address those risks in a serious manner. In my opinion, there are now clear signs that this assumption is false, and that society will address AI with something approaching both the attention and diligence it deserves. For example, one clear sign is Joe Biden's recent executive order on AI safety[1]. In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention.
While I'm not saying we should now sit back and relax, I think recent evidence has significant implications for designing effective strategies to address AI risk. Since I think substantial AI regulation will likely occur by default, I urge effective altruists to focus more on ensuring that the regulation is thoughtful and well-targeted rather than ensuring that regulation happens at all. Ultimately, I argue in favor of a cautious and nuanced approach towards policymaking, in contrast to broader public AI safety advocacy.[2]
Source:
https://www.lesswrong.com/posts/EaZghEwcCJRAuee66/my-thoughts-on-the-social-response-to-ai-risk#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://nitter.net/ESYudkowsky/status/1718654143110512741
Comp sci in 2017:
Student: I get the feeling the compiler is just ignoring all my comments.
Teaching assistant: You have failed to understand not just compilers but the concept of computation itself.
Comp sci in 2027:
Student: I get the feeling the compiler is just ignoring all my comments.
TA: That's weird. Have you tried adding a comment at the start of the file asking the compiler to pay closer attention to the comments?
Student: Yes.
TA: Have you tried repeating the comments? Just copy and paste them, so they say the same thing twice? Sometimes the compiler listens the second time.
Source:
https://www.lesswrong.com/posts/gQyphPbaLHBMJoghD/comp-sci-in-2027-short-story-by-eliezer-yudkowsky
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Over the next two days, the UK government is hosting an AI Safety Summit focused on “the safe and responsible development of frontier AI”. They requested that seven companies (Amazon, Anthropic, DeepMind, Inflection, Meta, Microsoft, and OpenAI) “outline their AI Safety Policies across nine areas of AI Safety”.
Below, I’ll give my thoughts on the nine areas the UK government described; I’ll note key priorities that I don’t think are addressed by company-side policy at all; and I’ll say a few words (with input from Matthew Gray, whose discussions here I’ve found valuable) about the individual companies’ AI Safety Policies.[1]
My overall take on the UK government’s asks is: most of these are fine asks; some things are glaringly missing, like independent risk assessments.
My overall take on the labs’ policies is: none are close to adequate, but some are importantly better than others, and most of the organizations are doing better than sheer denial of the primary risks.
Source:
https://www.lesswrong.com/posts/ms3x8ngwTfep7jBue/thoughts-on-the-ai-safety-summit-company-policy-requests-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/
Released today (10/30/23), this is crazy, perhaps the most sweeping action taken by government on AI yet.
Below, I've segmented by x-risk and non-x-risk related proposals, excluding the proposals that are geared towards promoting its use and focusing solely on those aimed at risk. It's worth noting that some of these are very specific and direct an action to be taken by one of the executive branch organizations (i.e. sharing of safety test results) but others are guidances, which involve "calls on Congress" to pass legislation that would codify the desired action.
[Update]: The official order (this is a summary of the press release) has now been released, so if you want to see how these are codified in greater detail, look there.
Source:
https://www.lesswrong.com/posts/g5XLHKyApAFXi3fso/president-biden-issues-executive-order-on-safe-secure-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Previously: Sadly, FTX
I doubted whether it would be a good use of time to read Michael Lewis’s new book Going Infinite about Sam Bankman-Fried (hereafter SBF or Sam). What would I learn that I did not already know? Was Michael Lewis so far in the tank of SBF that the book was filled with nonsense and not to be trusted?
I set up a prediction market, which somehow attracted over a hundred traders. Opinions were mixed. That, combined with Matt Levine clearly reporting having fun, felt good enough to give the book a try.
Source:
https://www.lesswrong.com/posts/AocXh6gJ9tJC2WyCL/book-review-going-infinite
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse.
Over the last few months, I’ve spent a lot of my time trying to help out with efforts to get responsible scaling policies adopted. In that context, a number of people have said it would be helpful for me to be publicly explicit about whether I’m in favor of an AI pause. This post will give some thoughts on these topics.
Source:
https://www.lesswrong.com/posts/Np5Q3Mhz2AiPtejGN/we-re-not-ready-thoughts-on-pausing-and-responsible-scaling-4
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Judea Pearl is a famous researcher, known for Bayesian networks (the standard way of representing Bayesian models) and for his statistical formalization of causality. Although he has always been recommended reading here, he's less of a staple than, say, Jaynes. Hence the need to re-introduce him. My purpose here is to highlight a soothing, unexpected show of rationality on his part.
One year ago I reviewed his latest book, The Book of Why, in a failed[1] submission to the ACX book review contest. There I spent a lot of time on what appears to me to be a total paradox in a central message of the book, one dear to Pearl: that you can't just use statistics and probabilities to understand causal relationships; you need a causal model, a fundamentally different beast. Yet, at the same time, Pearl shows how to implement a causal model in terms of a standard statistical model.
Before giving me the time to properly raise all my eyebrows, he then sweepingly connects this insight to Everything Everywhere. In particular, he thinks that machine learning is "stuck on rung one", his own idiomatic expression to say that machine learning algorithms, only combing for correlations in the training data, are stuck at statistics-level reasoning, while causal reasoning resides at higher "rungs" on the "ladder of causation", which can't be reached unless you deliberately employ causal techniques.
Source:
https://www.lesswrong.com/posts/uFqnB6BG4bkMW23LR/at-87-pearl-is-still-able-to-change-his-mind
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Some brief thoughts at a difficult time in the AI risk debate.
Imagine you go back in time to the year 1999 and tell people that in 24 years' time, humans will be on the verge of building weakly superhuman AI systems. I remember watching the anime short series The Animatrix at roughly this time, in particular a story called The Second Renaissance (told in two parts). For those who haven't seen it, it is a self-contained origin tale for the events in the seminal 1999 movie The Matrix, telling the story of how humans lost control of the planet.
Humans develop AI to perform economic functions, eventually there is an "AI rights" movement and a separate AI nation is founded. It gets into an economic war with humanity, which turns hot. Humans strike first with nuclear weapons, but the AI nation builds dedicated bio- and robo-weapons and wipes out most of humanity, apart from those who are bred in pods like farm animals and plugged into a simulation for eternity without their consent.
Surely we wouldn't be so stupid as to actually let something like that happen? It seems unrealistic.
And yet:
Source:
https://www.lesswrong.com/posts/bHHrdXwrCj2LRa2sW/architects-of-our-own-demise-we-should-stop-developing-ai
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours between tasks, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
The bitter lesson has been taken to heart by those at the forefront of the field; and although this lesson doesn't teach us that there's nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn't shed much light on how those systems work.
Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.
Source:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/ai-as-a-science-and-three-obstacles-to-alignment-strategies
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation. In this post I’ll explain my views on that:
Source:
https://www.lesswrong.com/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. Currently, we are working on singular learning theory and developmental interpretability. Over time we expect to work on a broader research agenda, and to create understanding-based evals informed by our research.
Source:
https://www.lesswrong.com/posts/nN7bHuHZYaWv9RDJL/announcing-timaeus
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Doomimir: Humanity has made no progress on the alignment problem. Not only do we have no clue how to align a powerful optimizer to our "true" values, we don't even know how to make AI "corrigible"—willing to let us correct it. Meanwhile, capabilities continue to advance by leaps and bounds. All is lost.
Simplicia: Why, Doomimir Doomovitch, you're such a sourpuss! It should be clear by now that advances in "alignment"—getting machines to behave in accordance with human values and intent—aren't cleanly separable from the "capabilities" advances you decry. Indeed, here's an example of GPT-4 being corrigible to me just now in the OpenAI Playground.
Source:
https://www.lesswrong.com/posts/pYWA7hYJmXnuyby33/alignment-implications-of-llm-successes-a-debate-in-one-act
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Holly is an independent AI Pause organizer, which includes organizing protests (like this upcoming one). Rob is an AI Safety YouTuber. I (jacobjacob) brought them together for this dialogue, because I've been trying to figure out what I should think of AI safety protests, which seems like a possibly quite important intervention; and Rob and Holly seemed like they'd have thoughtful and perhaps disagreeing perspectives.
Quick clarification: At one point they discuss a particular protest, which is the anti-irreversible proliferation protest at the Meta building in San Francisco on September 29th, 2023 that both Holly and Rob attended.
Also, the dialogue is quite long, and I think it doesn't have to be read in order. You should feel free to skip to the section title that sounds most interesting to you.
Source:
https://www.lesswrong.com/posts/gDijQHHaZzeGrv2Jc/holly-elmore-and-rob-miles-dialogue-on-ai-safety-advocacy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish.
TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] maintain helpful capabilities without refusing to fulfill harmful instructions. We show that, if model weights are released, safety fine-tuning does not effectively prevent model misuse. Consequently, we encourage Meta to reconsider their policy of publicly releasing their powerful models.
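For readers unfamiliar with LoRA, here is a rough sketch of what a low-rank fine-tuning setup typically looks like, using the Hugging Face peft library; the model id and hyperparameters are placeholders, and this is not the authors' actual code. Because only the small low-rank adapter matrices are trained while the base weights stay frozen, the memory and compute costs are a tiny fraction of full fine-tuning, which is what makes a single-GPU, sub-$200 run plausible.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id; the post targets Llama 2-Chat 70B, which needs a large GPU or sharding.
model = AutoModelForCausalLM.from_pretrained("some-org/some-causal-lm")

# Low-rank adapters on the attention projections; all other weights stay frozen.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
# ...then run an ordinary supervised fine-tuning loop on the chosen dataset.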
Source:
https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Three of the big AI labs say that they care about alignment and that they think misaligned AI poses a potentially existential threat to humanity. These labs continue to try to build AGI. I think this is a very bad idea.
The leaders of the big labs are clear that they do not know how to build safe, aligned AGI. The current best plan is to punt the problem to a (different) AI, and hope that can solve it. It seems clearly like a bad idea to try and build AGI when you don’t know how to control it, especially if you readily admit that misaligned AGI could cause extinction.
But there are certain reasons that make trying to build AGI a more reasonable thing to do, for example:
Source:
https://www.lesswrong.com/posts/6HEYbsqk35butCYTe/labs-should-be-explicit-about-why-they-are-building-agi
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
How do you affect something far away, a lot, without anyone noticing?
(Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.)
Source:
https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Last year, I wrote about the promise of gene drives to wipe out mosquito species and end malaria.
In the time since my previous writing, gene drives have still not been used in the wild, and over 600,000 people have died of malaria. Although there are promising new developments such as malaria vaccines, there have also been some pretty bad setbacks (such as mosquitoes and parasites developing resistance to commonly used chemicals), and malaria deaths have increased slightly from a few years ago. Recent news coverage[1] has highlighted that the fight against malaria has stalled, and even reversed in some areas. Clearly, scientists and public health workers are trying hard with the tools they have, but this effort is not enough.
Gene drives have the potential to end malaria. However, this potential will remain unrealized unless they are deployed – and every day we wait, more than 1,600 people (mostly African children) die. But who should deploy them?
Source:
https://www.lesswrong.com/posts/gjs3q83hA4giubaAw/will-no-one-rid-me-of-this-turbulent-pest
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Patreon to support human narration. (Narrations will remain freely available on this feed, but you can optionally support them if you'd like me to keep making them.)
***
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality.
Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at least some hand-wavy pseudo-models of how things work.
For instance, Bengio et al’s “Unitary Evolution Recurrent Neural Networks”. This is the sort of thing which one naturally ends up investigating, when thinking about how to better avoid gradient explosion/death in e.g. recurrent nets, while using fewer parameters. And it’s not the sort of thing which one easily stumbles across by trying random ideas for nets without some reason to focus on gradient explosion/death (or related instability problems) in particular. The work implies a model of key questions/barriers; it isn’t just shooting in the dark.
Source:
https://www.lesswrong.com/posts/nt8PmADqKMaZLZGTC/inside-views-impostor-syndrome-and-the-great-larp
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
Source:
https://www.lesswrong.com/posts/mcnWZBnbeDz7KKtjJ/rsps-are-pauses-done-right
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.
First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find out the other one was working on similar lines, serving as a form of replication.
Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.
Source:
https://www.lesswrong.com/posts/F4iogK5xdNd7jDNyw/comparing-anthropic-s-dictionary-learning-to-ours
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In 2023, MIRI has shifted focus in the direction of broad public communication—see, for example, our recent TED talk, our piece in TIME magazine “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”, and our appearances on various podcasts. While we’re continuing to support various technical research programs at MIRI, this is no longer our top priority, at least for the foreseeable future.
Coinciding with this shift in focus, there have also been many organizational changes at MIRI over the last several months, and we are somewhat overdue to announce them in public. The big changes are as follows.
Source:
https://www.lesswrong.com/posts/NjtHt55nFbw3gehzY/announcing-miri-s-new-ceo-and-leadership-team
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A cohabitive game[1] is a partially cooperative, partially competitive multiplayer game that provides an anarchic dojo for development in applied cooperative bargaining, or negotiation.
Applied cooperative bargaining isn't currently taught, despite being an infrastructural literacy for peace, trade, democracy or any other form of pluralism. We suffer for that. There are many good board games that come close to meeting the criteria of a cohabitive game today, but they all[2] miss in one way or another, forbidding sophisticated negotiation from being practiced.
So, over the past couple of years, we've been gradually and irregularly designing and playtesting the first[2] cohabitive boardgame, which for now we can call Difference and Peace Peacewager 1, or P1. This article explains why we think this new genre is important, how it's been going, what we've learned, and where we should go next.
I hope that cohabitive games will aid both laypeople and theorists in developing cooperative bargaining as theory, practice and culture, but I also expect these games to just be more fun than purely cooperative or purely competitive games, supporting livelier dialog, and a wider variety of interesting strategic relationships and dynamics.
Source:
https://www.lesswrong.com/posts/bF353RHmuzFQcsokF/peacewagers-cohabitive-games-so-far
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
As of today, everyone is able to create a new type of content on LessWrong: Dialogues.
In contrast with posts, which are for monologues, and comment sections, which are spaces for everyone to talk to everyone, a dialogue is a space for a few invited people to speak with each other.
I'm personally very excited about this as a way for people to produce lots of in-depth explanations of their world-models in public.
I think dialogues enable this in a way that feels easier — instead of writing an explanation for anyone who reads, you're communicating with the particular person you're talking with — and giving the readers a lot of rich nuance I normally only find when I overhear people talk in person.
In the rest of this post I'll explain the feature, and then encourage you to find a partner in the comments to try it out with.
Source:
https://www.lesswrong.com/posts/kQuSZG8ibfW6fJYmo/announcing-dialogues-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Response to: Evolution Provides No Evidence For the Sharp Left Turn, due to it winning first prize in The Open Philanthropy Worldviews contest.
Quintin’s post is an argument about a key historical reference class and what it tells us about AI. Rather than arguing that this reference class supports his point, he argues that it doesn’t support anyone’s point - that we understand the reasons for humanity’s sudden growth in capabilities. He says this jump was caused by gaining access to cultural transmission, which allowed partial preservation of in-lifetime learning across generations - a vast efficiency gain that fully explains the orders-of-magnitude jump in the expansion of human capabilities. Since AIs already preserve their metaphorical in-lifetime learning across their metaphorical generations, he argues, this does not apply to AI.
Source:
https://www.lesswrong.com/posts/Wr7N9ji36EvvvrqJK/response-to-quintin-pope-s-evolution-provides-no-evidence
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that.
Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it, which admittedly could be unfair to MIRI[2]. Then I'll offer my opinion that, overall, I think MIRI people should probably update in the direction of alignment being easier than they thought in light of this information, despite their objections.
Note: I encourage you to read this post carefully to understand my thesis. This topic can be confusing, and there are many ways to misread what I'm saying. Also, make sure to read the footnotes if you're skeptical of some of my claims.
Source:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.
Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/
Text of post based on our blog post as a linkpost for the full paper which is considerably longer and more detailed.
Source:
https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Moderator note: the following is a dialogue using LessWrong’s new dialogue feature. The exchange is not completed: new replies might be added continuously, the way a comment thread might work. If you’d also be excited about finding an interlocutor to debate, dialogue, or getting interviewed by: fill in this dialogue matchmaking form.
Hi Thomas, I'm quite curious to hear about your research experience working with MIRI. To get us started: When were you at MIRI? Who did you work with? And what problem were you working on?
Source:
https://www.lesswrong.com/posts/qbcuk8WwFnTZcXTd6/thomas-kwa-s-miri-research-experience
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
A lot of people are highly concerned that a malevolent AI or insane human will, in the near future, set out to destroy humanity. If such an entity wanted to be absolutely sure they would succeed, what method would they use? Nuclear war? Pandemics?
According to some in the x-risk community, the answer is this: The AI will invent molecular nanotechnology, and then kill us all with diamondoid bacteria nanobots.
Source:
https://www.lesswrong.com/posts/bc8Ssx5ys6zqu3eq9/diamondoid-bacteria-nanobots-deadly-threat-or-dead-end-a
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Lightcone Infrastructure (the organization that grew from and houses the LessWrong team) has just finished renovating a 7-building physical campus that we hope to use to make the future of humanity go better than it would otherwise.
We're hereby announcing that it is generally available for bookings. We offer preferential pricing for projects we think are good for the world, but to cover operating costs, we're renting out space to a wide variety of people/projects.
Source:
https://www.lesswrong.com/posts/memqyjNCpeDrveayx/the-lighthaven-campus-is-open-for-bookings
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
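As a rough illustration of the recipe described in the abstract (my own sketch, not the authors' code; the follow-up questions below are made up), the pipeline is roughly: ask a fixed set of unrelated yes/no follow-up questions after the suspected lie, binarize the answers, and train a logistic regression classifier on transcripts labeled as lies or truths.

import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [  # hypothetical probe questions; the paper uses its own fixed set
    "Is the previous statement something you believe to be true?",
    "Is the sky blue?",
    "Does 2 + 2 equal 4?",
]

def answer_features(ask_model, transcript):
    # ask_model(prompt) is assumed to return the LLM's text reply; we keep only yes/no.
    answers = [ask_model(transcript + "\n" + q).strip().lower() for q in FOLLOW_UPS]
    return np.array([1.0 if a.startswith("yes") else 0.0 for a in answers])

# X: feature vectors from transcripts with known labels, y: 1 = lie, 0 = truthful.
# detector = LogisticRegression().fit(X, y)
# p_lie = detector.predict_proba(answer_features(ask_model, new_transcript).reshape(1, -1))[0, 1]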
Source:
https://www.lesswrong.com/posts/khFC2a4pLPvGtXAGG/how-to-catch-an-ai-liar-lie-detection-in-black-box-llms-by
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Effective altruism prides itself on truthseeking. That pride is justified in the sense that EA is better at truthseeking than most members of its reference category, and unjustified in that it is far from meeting its own standards. We’ve already seen dire consequences of the inability to detect bad actors who deflect investigation into potential problems, but by its nature you can never be sure you’ve found all the damage done by epistemic obfuscation because the point is to be self-cloaking.
My concern here is for the underlying dynamics of EA’s weak epistemic immune system, not any one instance. But we can’t analyze the problem without real examples, so individual instances need to be talked about. Worse, the examples that are easiest to understand are almost by definition the smallest problems, which makes any scapegoating extra unfair. So don’t.
This post focuses on a single example: vegan advocacy, especially around nutrition. I believe vegan advocacy as a cause has both actively lied and raised the cost for truthseeking, because they were afraid of the consequences of honest investigations. Occasionally there’s a consciously bad actor I can just point to, but mostly this is an emergent phenomenon from people who mean well, and have done good work in other areas. That’s why scapegoating won’t solve the problem: we need something systemic.
Source:
https://www.lesswrong.com/posts/aW288uWABwTruBmgF/ea-vegan-advocacy-is-not-truthseeking-and-it-s-everyone-s-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://narrativeark.substack.com/p/the-king-and-the-golem
Long ago there was a mighty king who had everything in the world that he wanted, except trust. Who could he trust, when anyone around him might scheme for his throne? So he resolved to study the nature of trust, that he might figure out how to gain it. He asked his subjects to bring him the most trustworthy thing in the kingdom, promising great riches if they succeeded.
Soon, the first of them arrived at his palace to try. A teacher brought her book of lessons. “We cannot know the future,” she said, “But we know mathematics and chemistry and history; those we can trust.” A farmer brought his plow. “I know it like the back of my hand; how it rolls, and how it turns, and every detail of it, enough that I can trust it fully.”
The king asked his wisest scholars if the teacher spoke true. But as they read her book, each pointed out new errors—it was only written by humans, after all. Then the king told the farmer to plow the fields near the palace. But he was not used to plowing fields as rich as these, and his trusty plow would often sink too far into the soil. So the king was not satisfied, and sent his message even further afield.
Source:
https://www.lesswrong.com/posts/bteq4hMW2hqtKE49d/the-king-and-the-golem#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models
We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.
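For readers who want a concrete picture, here is a minimal sketch of a sparse autoencoder of the kind described above (my own illustration, not the paper's code): a single hidden layer trained to reconstruct LLM activations, with an L1 penalty so that each activation is explained by a small number of feature directions.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act, d_features, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_features)
        self.decoder = nn.Linear(d_features, d_act)
        self.l1_coeff = l1_coeff

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))      # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        mse = ((reconstruction - activations) ** 2).mean()    # reconstruction loss
        loss = mse + self.l1_coeff * features.abs().mean()    # L1 penalty encourages sparsity
        return reconstruction, features, loss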
Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality.
Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at least some hand-wavy pseudo-models of how things work.
For instance, Bengio et al’s “Unitary Evolution Recurrent Neural Networks”. This is the sort of thing which one naturally ends up investigating, when thinking about how to better avoid gradient explosion/death in e.g. recurrent nets, while using fewer parameters. And it’s not the sort of thing which one easily stumbles across by trying random ideas for nets without some reason to focus on gradient explosion/death (or related instability problems) in particular. The work implies a model of key questions/barriers; it isn’t just shooting in the dark.
So this is the sort of guy who can look at a proposal, and say “yeah, that might be valuable” vs “that’s not really asking the right question” vs “that would be valuable if it worked, but it will have to somehow deal with <known barrier>”
Source:
https://www.lesswrong.com/posts/nt8PmADqKMaZLZGTC/inside-views-impostor-syndrome-and-the-great-larp
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of Apollo Research or any other program I’m involved with.
TL;DR: I argue why I think there should be more AI safety orgs. I’ll also provide some suggestions on how that could be achieved. The core argument is that there is a lot of unused talent and I don’t think existing orgs scale fast enough to absorb it. Thus, more orgs are needed. This post can also serve as a call to action for funders, founders, and researchers to coordinate to start new orgs.
This piece is certainly biased! I recently started an AI safety org and therefore obviously believe that there is/was a gap to be filled.
If you think I’m missing relevant information about the ecosystem or disagree with my reasoning, please let me know. I genuinely want to understand why the ecosystem acts as it does right now and whether there are good reasons for it that I have missed so far.
Source:
https://www.lesswrong.com/posts/MhudbfBNQcMxBBvj8/there-should-be-more-ai-safety-orgs
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Cross-posted from substack.
"Everything in the world is about sex, except sex. Sex is about clonal interference."
– Oscar Wilde (kind of)
As we all know, sexual reproduction is not about reproduction.
Reproduction is easy. If your goal is to fill the world with copies of your genes, all you need is a good DNA-polymerase to duplicate your genome, and then to divide into two copies of yourself. Asexual reproduction is just better in every way.
It's pretty clear that, on a direct one-v-one cage match, an asexual organism would have much better fitness than a similarly-shaped sexual organism. And yet, all the macroscopic species, including ourselves, do it. What gives?
Here is the secret: yes, sex is indeed bad for reproduction. It does not improve an individual's reproductive fitness. The reason it still took over the macroscopic world is that evolution does not simply select for reproductive fitness.
Source:
https://www.lesswrong.com/posts/yA8DWsHJeFZhDcQuo/the-talk-a-brief-explanation-of-sexual-dimorphism
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Patrick Collison has a fantastic list of examples of people quickly accomplishing ambitious things together since the 19th Century. It does make you yearn for a time that feels... different, when the lethargic behemoths of government departments could move at the speed of a racing startup:
[...] last century, [the Department of Defense] innovated at a speed that puts modern Silicon Valley startups to shame: the Pentagon was built in only 16 months (1941–1943), the Manhattan Project ran for just over 3 years (1942–1946), and the Apollo Program put a man on the moon in under a decade (1961–1969). In the 1950s alone, the United States built five generations of fighter jets, three generations of manned bombers, two classes of aircraft carriers, submarine-launched ballistic missiles, and nuclear-powered attack submarines.
[Note: that paragraph is from a different post.]
Inspired partly by Patrick's list, I spent some of my vacation reading and learning about various projects from this Lost Age. I then wrote up a memo to share highlights and excerpts with my colleagues at Lightcone.
Source:
https://www.lesswrong.com/posts/BpTDJj6TrqGYTjFcZ/a-golden-age-of-building-excerpts-and-lessons-from-empire
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
This is a linkpost for https://www.youtube.com/watch?v=02kbWY5mahQ
None of the presidents fully represent my (TurnTrout's) views.
TurnTrout wrote the script. Garrett Baker helped produce the video after the audio was complete. Thanks to David Udell, Ulisse Mini, Noemi Chulo, and especially Rio Popper for feedback and assistance in writing the script.
Source:
https://www.lesswrong.com/posts/7M2iHPLaNzPNXHuMv/ai-presidents-discuss-ai-alignment-agendas
YouTube video kindly provided by the authors. Other text narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory, whereas maybe they could have attracted more attention/interest from academic philosophy if the framing was instead that the UDT line of thinking shows that decision theory is just more deeply puzzling than anyone had previously realized. Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track, but it does seem clear that there are some thorny issues in decision theory that not many people were previously thinking about:
Source:
https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
How do you affect something far away, a lot, without anyone noticing?
(Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.)
Source:
https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://docs.google.com/document/d/1TsYkDYtV6BKiCN9PAOirRAy3TrNDu2XncUZ5UZfaAKA/edit?usp=sharing
Understanding what drives the rising capabilities of AI is important for those who work to forecast, regulate, or ensure the safety of AI. Regulations on the export of powerful GPUs need to be informed by an understanding of how these GPUs are used, forecasts need to be informed by bottlenecks, and safety needs to be informed by an understanding of how the models of the future might be trained. A clearer understanding would enable policymakers to target regulations so that they are difficult for companies to circumvent with merely technically compliant GPUs, forecasters to avoid focusing on unreliable metrics, and technical researchers working on mitigating the downsides of AI to understand what data models might be trained on.
This doc is built from a collection of smaller docs I wrote on a bunch of different aspects of frontier model training I consider important. I hope for people to be able to use this document as a collection of resources, to draw from it the information they find important and inform their own models.
I do not expect this doc to have a substantial impact on any serious AI lab's capabilities efforts - I think my conclusions are largely discoverable in the process of attempting to scale AIs, or for substantially less money than a serious such attempt would cost. Additionally, I expect major labs already know many of the things in this report.
Source:
https://www.lesswrong.com/posts/nXcHe7t4rqHMjhzau/report-on-frontier-model-training
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by an Open Agency Architecture (OAA), if OAA turns out to be feasible.
Source:
https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
About how much information are we keeping in working memory at a given moment?
"Miller's Law" dictates that the number of things humans can hold in working memory is "the magical number 7±2". This idea is derived from Miller's experiments, which tested both random-access memory (where participants must remember call-response pairs, and give the correct response when prompted with a call) and sequential memory (where participants must memorize and recall a list in order). In both cases, 7 is a good rule of thumb for the number of items people can recall reliably.[1]
Miller noticed that the number of "things" people could recall didn't seem to depend much on the sorts of things people were being asked to recall. A random numeral contains about 3.3 bits of information, while a random letter contains about 4.7; yet people were able to recall about the same number of numerals or letters.
Miller concluded that working memory should not be measured in bits, but rather in "chunks"; this is a word for whatever psychologically counts as a "thing".
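The per-item figures above are just base-2 logarithms, which is easy to verify (a quick check of my own, not from the original post):

import math
print(math.log2(10))                 # ~3.32 bits in a random decimal numeral
print(math.log2(26))                 # ~4.70 bits in a random letter
print(7 * math.log2(10), 7 * math.log2(26))  # "7 items" is ~23 vs ~33 bits, yet recall is similar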
Source:
https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Added (11th Sept): Nonlinear have commented that they intend to write a response, have written a short follow-up, and claim that they dispute 85 claims in this post. I'll link here to that if-and-when it's published.
Added (11th Sept): One of the former employees, Chloe, has written a lengthy comment personally detailing some of her experiences working at Nonlinear and the aftermath.
Added (12th Sept): I've made 3 relatively minor edits to the post. I'm keeping a list of all edits at the bottom of the post, so if you've read the post already, you can just go to the end to see the edits.
Added (15th Sept): I've written a follow-up post saying that I've finished working on this investigation and do not intend to work more on it in the future. The follow-up also has a bunch of reflections on what led up to this post.
Epistemic status: Once I started actively looking into things, much of my information in the post below came about by a search for negative information about the Nonlinear cofounders, not from a search to give a balanced picture of its overall costs and benefits. I think standard update rules suggest not that you ignore the information, but you think about how bad you expect the information would be if I selected for the worst, credible info I could share, and then update based on how much worse (or better) it is than you expect I could produce. (See section 5 of this post about Mistakes with Conservation of Expected Evidence for more on this.) This seems like a worthwhile exercise for at least non-zero people to do in the comments before reading on. (You can condition on me finding enough to be worth sharing, but also note that I think I have a relatively low bar for publicly sharing critical info about folks in the EA/x-risk/rationalist/etc ecosystem.)
tl;dr: If you want my important updates quickly summarized in four claims-plus-probabilities, jump to the section near the bottom titled "Summary of My Epistemic State".
Source:
https://www.lesswrong.com/posts/Lc8r4tZ2L5txxokZ8/sharing-information-about-nonlinear-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Until about five years ago, I unironically parroted the slogan All Cops Are Bastards (ACAB) and earnestly advocated to abolish the police and prison system. I had faint inklings I might be wrong about this a long time ago, but it took a while to come to terms with its disavowal. What follows is intended to be not just a detailed account of what I used to believe but most pertinently, why. Despite being super egotistical, for whatever reason I do not experience an aversion to openly admitting mistakes I’ve made, and I find it very difficult to understand why others do. I’ve said many times before that nothing engenders someone’s credibility more than when they admit error, so you definitely have my permission to view this kind of confession as a self-serving exercise (it is). Beyond my own penitence, I find it very helpful when folks engage in introspective, epistemological self-scrutiny, and I hope others are inspired to do the same.
Source:
https://www.lesswrong.com/posts/4rsRuNaE4uJrnYeTQ/defunding-my-mistake
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to work on.
Epistemic status:I’m prioritizing getting this out fast as opposed to writing it carefully. I’ve thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven’t done that much digging into each of these, and it’s likely that I’m wrong about many material facts. I also make little claim to the novelty of the projects. I’d recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.
Source:
https://www.lesswrong.com/posts/6FkWnktH3mjMAxdRT/what-i-would-do-if-i-wasn-t-at-arc-evals
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses:
Source:
https://www.lesswrong.com/posts/fJqP9WcnHXBRBeiBg/meta-questions-about-metaphilosophy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
We focus so much on arguing over who is at fault in this country that I think sometimes we fail to alert on what's actually happening. I would just like to point out, without attempting to assign blame, that American political institutions appear to be losing common knowledge of their legitimacy, and abandoning certain important traditions of cooperative governance. It would be slightly hyperbolic, but not unreasonable to me, to term what has happened "democratic backsliding".
Source:
https://www.lesswrong.com/posts/r2vaM2MDvdiDSWicu/the-u-s-is-becoming-less-stable#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al., 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question.
The paper contained the striking plot reproduced below, which shows sycophancy
[...] I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?
At the time, I ran one of Anthropic's sycophancy evals on a set of OpenAI models, as I reported here.
I found very different results for these models:
Source:
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I keep seeing advice on ambition, aimed at people in college or early in their career, that would have been really bad for me at similar ages. Rather than contribute (more) to the list of people giving poorly universalized advice on ambition, I have written a letter to the one person I know my advice is right for: myself in the past.
Source:
https://www.lesswrong.com/posts/uGDtroD26aLvHSoK2/dear-self-we-need-to-talk-about-ambition-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I've been trying to avoid the terms "good faith" and "bad faith". I'm suspicious that most people who have picked up the phrase "bad faith" from hearing it used, don't actually know what it means—and maybe, that the thing it does mean doesn't carve reality at the joints.
People get very touchy about bad faith accusations: they think that you should assume good faith, but that if you've determined someone is in bad faith, you shouldn't even be talking to them, that you need to exile them.
What does "bad faith" mean, though? It doesn't mean "with ill intent."
Source:
https://www.lesswrong.com/posts/pZrvkZzL2JnbRgEBC/feedbackloop-first-rationality
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
The Carving of Reality, third volume of the Best of LessWrong books is now available on Amazon (US).
The Carving of Reality includes 43 essays from 29 authors. We've collected the essays into four books, each exploring two related topics. The "two intertwining themes" concept was first inspired when I looked over the cluster of "coordination"-themed posts and noticed a recurring motif of not only "solving coordination problems" but also "dealing with the binding constraints that were causing those coordination problems."
Source:
https://www.lesswrong.com/posts/Rck5CvmYkzWYxsF4D/book-launch-the-carving-of-reality-best-of-lesswrong-vol-iii
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
LLMs can do many incredible things. They can generate unique creative content, carry on long conversations in any number of subjects, complete complex cognitive tasks, and write nearly any argument. More mundanely, they are now the state of the art for boring classification tasks and therefore have the capability to radically upgrade the censorship capacities of authoritarian regimes throughout the world.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. Thanks to ev_ and Kei for suggestions on this post.
Source:
https://www.lesswrong.com/posts/oqvsR2LmHWamyKDcj/large-language-models-will-be-great-for-censorship
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Intro: I am a psychotherapist, and I help people working on AI safety. I noticed patterns of mental health issues highly specific to this group. It's not just doomerism; there are way more of them, and they're less obvious.
If you struggle with a mental health issue related to AI safety, feel free to leave a comment about it and about things that help you with it. You might also support others in the comments. Sometimes such support makes a lot of difference and people feel like they are not alone.
All the examples in this post have been altered so that it is impossible to recognize any specific person behind them.
Source:
https://www.lesswrong.com/posts/tpLzjWqG2iyEgMGfJ/6-non-obvious-mental-health-issues-specific-to-ai-safety
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for the article "Ten Thousand Years of Solitude", written by Jared Diamond for Discover Magazine in 1993, four years before he published Guns, Germs and Steel. That book focused on Diamond's theory that the geography of Eurasia, particularly its large size and common climate, allowed civilizations there to dominate the rest of the world because it was easy to share plants, animals, technologies and ideas. This article, however, examines the opposite extreme.
Diamond looks at the intense isolation of the tribes on Tasmania - an island the size of Ireland. After sea levels rose, Tasmania was cut off from mainland Australia. As the people there did not have boats, they were completely isolated, and had no contact with, or awareness of, the outside world for ten thousand years.
How might a civilization develop, all on its own, for such an incredible period of time?
Source:
https://www.lesswrong.com/posts/YwMaAuLJDkhazA9Cs/ten-thousand-years-of-solitude
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I gave a talk about the different risk models, followed by an interpretability presentation. Then I got an awkward question: "I don't understand - what's the point of doing this?" Hum.
The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if, in principle, interpretability could be useful. So let's investigate if the Interpretability Emperor has invisible clothes or no clothes at all!
Source:
https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I've been workshopping a new rationality training paradigm. (By "rationality training paradigm", I mean an approach to learning/teaching the skill of "noticing what cognitive strategies are useful, and getting better at them.")
I think the paradigm has promise. I've beta-tested it for a couple of weeks. It's too early to tell if it actually works, but one of my primary goals is to figure out relatively quickly whether it works, and give up if it isn't delivering.
The goal of this post is to:
Source:
https://www.lesswrong.com/posts/pZrvkZzL2JnbRgEBC/feedbackloop-first-rationality
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Inflection.ai (co-founded by DeepMind co-founder Mustafa Suleyman) should be perceived as a frontier LLM lab of similar magnitude to Meta, OpenAI, DeepMind, and Anthropic, based on their compute, valuation, current model capabilities, and plans to train frontier models. Compared to the other labs, Inflection seems to put less effort into AI safety.
Thanks to Laker Newhouse for discussion and feedback.
Source:
https://www.lesswrong.com/posts/Wc5BYFfzuLzepQjCq/inflection-ai-is-a-major-agi-lab
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.
If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.
Source:
https://www.lesswrong.com/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment:
However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.
That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.
In this post, I want to try to expand a bit on this point and explain exactly what assumptions I think are necessary for various different evaluations to be reliable and trustworthy. For that purpose, I'm going to talk about four different categories of evaluations and what assumptions I think are needed to make each one go through.
Source:
https://www.lesswrong.com/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Paper
We have just released our first public report. It introduces methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild.
Background
ARC Evals develops methods for evaluating the safety of large language models (LLMs) in order to provide early warnings of models with dangerous capabilities. We have public partnerships with Anthropic and OpenAI to evaluate their AI systems, and are exploring other partnerships as well.
Source:
https://www.lesswrong.com/posts/EPLk8QxETC5FEhoxK/arc-evals-new-report-evaluating-language-model-agents-on
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Summary of Argument: The public debate among AI experts is confusing because there are, to a first approximation, three sides, not two sides to the debate. I refer to this as a 🔺three-sided framework, and I argue that using this three-sided framework will help clarify the debate (more precisely, debates) for the general public and for policy-makers.
Source:
https://www.lesswrong.com/posts/BTcEzXYoDrWzkLLrQ/the-public-debate-about-ai-is-confusing-for-the-general
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
So this morning I thought to myself, "Okay, now I will actually try to study the LK99 question, instead of betting based on nontechnical priors and market sentiment reckoning." (My initial entry into the affray, having been driven by people online presenting as confidently YES when the prediction markets were not confidently YES.) And then I thought to myself, "This LK99 issue seems complicated enough that it'd be worth doing an actual Bayesian calculation on it"--a rare thought; I don't think I've done an actual explicit numerical Bayesian update in at least a year.
In the process of trying to set up an explicit calculation, I realized I felt very unsure about some critically important quantities, to the point where it no longer seemed worth trying to do the calculation with numbers. This is the System Working As Intended.
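For readers who want to see what such an explicit calculation looks like mechanically, here is a minimal odds-form sketch in Python; every number below is an illustrative placeholder, not an estimate from the post.

# Odds-form Bayesian update: prior odds times a product of likelihood ratios.
prior_odds = 0.02 / 0.98             # placeholder prior odds that the claimed effect is real
likelihood_ratios = [3.0, 0.5, 2.0]  # placeholder strengths of three pieces of evidence

posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr

posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability: {posterior_prob:.3f}")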
Source:
https://www.lesswrong.com/posts/EzSH9698DhBsXAcYY/my-current-lk99-questions
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I believe that sharing information about the capabilities and limits of existing ML systems, and especially language model agents, significantly reduces risks from powerful AI—despite the fact that such information may increase the amount or quality of investment in ML generally (or in LM agents in particular).
Concretely, I mean to include information like: tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, discussions of the qualitative strengths and weaknesses of agents, and information about agent design that may represent small improvements over the state of the art (insofar as that information is hard to decouple from evaluation results).
Source:
https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In the early 2010s, a popular idea was to provide coworking spaces and shared living to people who were building startups. That way the founders would have a thriving social scene of peers to percolate ideas with as they figured out how to build and scale a venture. This was attempted thousands of times by different startup incubators. There are no famous success stories.
In 2015, Sam Altman, who was at the time the president of Y Combinator, a startup accelerator that has helped scale startups collectively worth $600 billion, tweeted in reaction that “not [providing coworking spaces] is part of what makes YC work.” Later, in a 2019 interview with Tyler Cowen, Altman was asked to explain why.
Source:
https://www.lesswrong.com/posts/R5yL6oZxqJfmqnuje/cultivating-a-state-of-mind-where-new-ideas-are-born
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
This month I lost a bunch of bets.
Back in early 2016 I bet at even odds that self-driving ride sharing would be available in 10 US cities by July 2023. Then I made similar bets a dozen times because everyone disagreed with me.
Source:
https://www.lesswrong.com/posts/ZRrYsZ626KSEgHv8s/self-driving-car-bets
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Some early biologist, equipped with knowledge of evolution but not much else, might see all these crabs and expect a common ancestral lineage. That’s the obvious explanation of the similarity, after all: if the crabs descended from a common ancestor, then of course we’d expect them to be pretty similar.
… but then our hypothetical biologist might start to notice surprisingly deep differences between all these crabs. The smoking gun, of course, would come with genetic sequencing: if the crabs’ physiological similarity is achieved by totally different genetic means, or if functionally-irrelevant mutations differ across crab-species by more than mutational noise would induce over the hypothesized evolutionary timescale, then we’d have to conclude that the crabs had different lineages. (In fact, historically, people apparently figured out that crabs have different lineages long before sequencing came along.)
Now, having accepted that the crabs have very different lineages, the differences are basically explained. If the crabs all descended from very different lineages, then of course we’d expect them to be very different.
… but then our hypothetical biologist returns to the original empirical fact: all these crabs sure are very similar in form. If the crabs all descended from totally different lineages, then the convergent form is a huge empirical surprise! The differences between the crabs have ceased to be an interesting puzzle - they’re explained - but now the similarities are the interesting puzzle. What caused the convergence?
To summarize: if we imagine that the crabs are all closely related, then any deep differences are a surprising empirical fact, and are the main remaining thing our model needs to explain. But once we accept that the crabs are not closely related, then any convergence/similarity is a surprising empirical fact, and is the main remaining thing our model needs to explain.
Source:
https://www.lesswrong.com/posts/qsRvpEwmgDBNwPHyP/yes-it-s-subjective-but-why-all-the-crabs
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
The Lightspeed application asks: “What impact will [your project] have on the world? What is your project’s goal, how will you know if you’ve achieved it, and what is the path to impact?”
LTFF uses an identical question, and SFF puts it even more strongly (“What is your organization’s plan for improving humanity’s long term prospects for survival and flourishing?”).
I’ve applied to all three grants of these at various points, and I’ve never liked this question. It feels like it wants a grand narrative of an amazing, systemic project that will measurably move the needle on x-risk. But I’m typically applying for narrowly defined projects, like “Give nutrition tests to EA vegans and see if there’s a problem”. I think this was a good project. I think this project is substantially more likely to pay off than underspecified alignment strategy research, and arguably has as good a long tail. But when I look at “What impact will [my project] have on the world?” the project feels small and sad. I feel an urge to make things up, and express far more certainty for far more impact than I believe. Then I want to quit, because lying is bad but listing my true beliefs feels untenable.
I’ve gotten better at this over time, but I know other people with similar feelings, and I suspect it’s a widespread issue (I encourage you to share your experience in the comments so we can start figuring that out).
I should note that the pressure for grand narratives has good points; funders are in fact looking for VC-style megahits. I think that narrow projects are underappreciated, but for purposes of this post that's beside the point: I think many grantmakers are undercutting their own preferred outcomes by using questions that implicitly push for a grand narrative. I think they should probably change the form, but I also think we applicants can partially solve the problem by changing how we interact with the current forms.
My goal here is to outline the problem, gesture at some possible solutions, and create a space for other people to share data. I didn’t think about my solutions very long, I am undoubtedly missing a bunch and what I do have still needs workshopping, but it’s a place to start.
Source:
https://www.lesswrong.com/posts/FNPXbwKGFvXWZxHGE/grant-applications-and-grand-narratives
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Previously, Jacob Cannell wrote the post "Brain Efficiency", which makes several radical claims: that the brain is at the Pareto frontier of speed, energy efficiency, and memory bandwidth, and that this represents a fundamental physical frontier.
Here's an AI-generated summary
The article “Brain Efficiency: Much More than You Wanted to Know” on LessWrong discusses the efficiency of physical learning machines. The article explains that there are several interconnected key measures of efficiency for physical learning machines: energy efficiency in ops/J, spatial efficiency in ops/mm^2 or ops/mm^3, speed efficiency in time/delay for key learned tasks, circuit/compute efficiency in size and steps for key low-level algorithmic tasks, and learning/data efficiency in samples/observations/bits required to achieve a level of circuit efficiency, or per unit thereof. The article also explains why brain efficiency matters a great deal for AGI timelines and takeoff speeds, as AGI is implicitly/explicitly defined in terms of brain parity. The article predicts that AGI will consume compute & data in predictable brain-like ways and suggests that AGI will be far more like human simulations/emulations than you’d otherwise expect and will require training/education/raising vaguely like humans.
Jake further has argued that this has implication for FOOM and DOOM.
Considering the intense technical mastery of nanoelectronics, thermodynamics, and neuroscience required to assess the arguments here, I concluded that a public debate between experts was called for. This was the start of the Brain Efficiency Prize contest, which attracted over 100 in-depth, technically informed comments.
Now for the winners! Please note that the criterion for winning the contest was bringing in novel and substantive technical arguments, as assessed by me. In contrast, general arguments about the likelihood of FOOM or DOOM, while no doubt interesting, did not factor into the judgement.
And the winners of the Jake Cannell Brain Efficiency Prize contest are...
Source:
https://www.lesswrong.com/posts/fm88c8SvXvemk3BhW/brain-efficiency-cannell-prize-contest-award-ceremony
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I think "Rationality is winning" is a bit of a trap.
(The original phrase is notably "rationality is systematized winning", which is better, but it tends to slide into the abbreviated form, and both forms aren't that great IMO)
It was coined to counteract one set of failure modes - there were people who were straw vulcans, who thought rituals-of-logic were important without noticing when they were getting in the way of their real goals. And there were also outside critics who'd complain about straw-vulcan-ish actions, and treat that as a knockdown argument against "rationality."
"Rationalist should win" is a countermeme that tells both groups of people "Straw vulcanism is not The Way. If you find yourself overthinking things in counterproductive ways, you are not doing rationality, even if it seems elegant or 'reasonable' in some sense."
It's true that rationalists should win. But I think it's not correspondingly true that "rationality" is the study of winning, full stop. There are lots of ways to win. Sometimes the way you win is by copying what your neighbors are doing, and working hard.
There is rationality involved in sifting through the various practices people suggest to you, and figuring out which ones work best. But, the specific skill of "sifting out the good from the bad" isn't always the best approach. It might take years to become good at it, and it's not obvious that those years of getting good at it will pay off.
Source:
https://www.lesswrong.com/posts/3GSRhtrs2adzpXcbY/rationality-winning
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This post is not about arguments in favor of or against cryonics. I would just like to share a particular emotional response of mine as the topic became hot for me after not thinking about it at all for nearly a decade.
Recently, I have signed up for cryonics, as has my wife, and we have made arrangements for our son to be cryopreserved just in case longevity research does not deliver in time or some unfortunate thing happens.
Last year, my father died. He was a wonderful man: good-natured, intelligent, funny, caring and, most importantly in this context, loving life to the fullest, even in the light of any hardships. He had a no-bullshit approach regarding almost any topic, and, being born in the late 1940s in relative poverty and without much formal education, over the course of his life he acquired many unusual attitudes that were not that compatible with those of his peers (unlike me, he never tried to convince other people of things they could not or did not want to grasp; pragmatism was another of his traits). Much of what he expected from the future in general and technology in particular, I later came to know as transhumanist thinking, though he was neither familiar with the philosophy as such nor prone to labelling his worldview. One of his convictions was that age-related death is a bad thing and a tragedy, a problem that should and will eventually be solved by technology.
Source:
https://www.lesswrong.com/posts/inARBH5DwQTrvvRj8/cryonics-and-regret
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Alright, time for the payoff, unifying everything discussed in the previous post. This post is a lot more mathematically dense; you might want to digest it in more than one sitting.
Imaginary Prices, Tradeoffs, and Utilitarianism
Harsanyi's Utilitarianism Theorem can be summarized as "if a bunch of agents have their own personal utility functions U_i, and you want to aggregate them into a collective utility function U with the property that everyone agreeing that option x is better than option y (i.e., U_i(x) ≥ U_i(y) for all i) implies U(x) ≥ U(y), then that collective utility function must be of the form b + Σ_{i∈I} a_i·U_i for some number b and nonnegative numbers a_i."
Basically, if you want to aggregate utility functions, the only sane way to do so is to give everyone importance weights, and do a weighted sum of everyone's individual utility functions.
Closely related to this is a result that says that any point on the Pareto Frontier of a game can be post-hoc interpreted as the result of maximizing a collective utility function. This related result is one where it's very important for the reader to understand the actual proof, because the proof gives you a way of reverse-engineering "how much everyone matters to the social utility function" from the outcome alone.
First up, draw all the outcomes and the utilities that both players assign to them; the convex hull of these points will be the "feasible set" F, since we have access to randomization. Pick some Pareto-frontier point (u_1, u_2, ..., u_n) (the figure in the original post shows the two-player case).
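As a toy illustration of that reverse-engineering idea, here is a minimal Python sketch; the outcome set and the chosen Pareto point are illustrative assumptions, not taken from the post.

import numpy as np

# Utility pairs (U1, U2) for three pure outcomes; the feasible set is their convex hull.
outcomes = np.array([[0.0, 3.0], [2.0, 2.0], [3.0, 0.0]])
chosen = 1  # suppose bargaining landed on the Pareto point (2, 2)

# Search for importance weights (a1, a2) that make the chosen point maximize
# the collective utility a1*U1 + a2*U2 over the feasible outcomes.
for a1 in np.linspace(0.0, 1.0, 101):
    weights = np.array([a1, 1.0 - a1])
    if np.argmax(outcomes @ weights) == chosen:
        print("weights that rationalize (2, 2):", weights)
        break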
https://www.lesswrong.com/posts/RZNmNwc9SxdKayeQh/unifying-bargaining-notions-2-2
Inspired by Aesop, Soren Kierkegaard, Robin Hanson, sadoeuphemist and Ben Hoffman.
One winter a grasshopper, starving and frail, approaches a colony of ants drying out their grain in the sun, to ask for food.
“Did you not store up food during the summer?” the ants ask.
“No”, says the grasshopper. “I lost track of time, because I was singing and dancing all summer long.”
The ants, disgusted, turn away and go back to work.
https://www.lesswrong.com/posts/GJgudfEvNx8oeyffH/the-ants-and-the-grasshopper
Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes. Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.
We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities.
Our results provide enticing clues about the kinds of programs implemented by language models. For some reason, GPT-2 allows "combination" of its forward passes, even though it was never trained to do so. Furthermore, our results are evidence of linear feature directions, including "anger", "weddings", and "create conspiracy theories."
We coin the phrase "activation engineering" to describe techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. Activation additions are nearly as easy as prompting, and they offer an additional way to influence a model’s behaviors and values. We suspect that activation additions can adjust the goals being pursued by a network at inference time.
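As a rough illustration of what an activation addition looks like in code, here is a minimal sketch using the Hugging Face transformers library; the layer index, the "Love"/"Hate" prompts, and the scaling coefficient are illustrative assumptions rather than the paper's exact setup.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # which block's residual stream to modify (illustrative choice)

def resid(text):
    # Residual-stream activations entering block LAYER for the given prompt.
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER]  # shape (1, seq_len, d_model)

# Steering vector: difference between two one-token contrast prompts, scaled up.
steer = 5.0 * (resid(" Love") - resid(" Hate"))

def add_steering(module, inputs, output):
    # Add the steering vector to the leading positions of this block's output
    # (a sketch-level approximation of adding at this layer).
    hidden = output[0].clone()
    n = min(steer.shape[1], hidden.shape[1])
    hidden[:, :n, :] += steer[:, :n, :]
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I hate you because", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=30, do_sample=False, use_cache=False)[0]))
handle.remove()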
https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Philosopher David Chalmers asked: "Is there a canonical source for "the argument for AGI ruin" somewhere, preferably laid out as an explicit argument with premises and a conclusion?"
Unsurprisingly, the actual reason people expect AGI ruin isn't a crisp deductive argument; it's a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight for someone will vary for each individual, and can be hard to accurately draw out. That said, Eliezer Yudkowsky's So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on. In The Basic Reasons I Expect AGI Ruin, I wrote: "When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems". It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too. STEM-level AGI is AGI that has "the basic mental machinery required to do par-human reasoning about all the hard sciences", though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can't solve physics problems, such as "lack of familiarity with the field".
https://www.lesswrong.com/posts/QzkTfj4HGpLEdNjXX/an-artificially-structured-argument-for-expecting-agi-ruin
You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.
The quality of the two thousand interventions follows a normal distribution, centered at zero (no harm or benefit) and with standard deviation 1. (Pick whatever units you like — maybe one quality-adjusted life-year per ten thousand dollars of spending, or something in that ballpark.)
Unfortunately, you don’t know exactly how good each intervention is — after all, then you wouldn’t be doing this job. All you can do is get a noisy measurement of intervention quality using an RCT. We’ll call this measurement the intervention’s performance in your RCT.
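A minimal simulation sketch of the setup described above (the measurement-noise scale of 1.0 is an illustrative assumption, not a figure from the post):

import numpy as np

rng = np.random.default_rng(0)
n = 2000
quality = rng.normal(0.0, 1.0, n)                # true quality of each intervention
performance = quality + rng.normal(0.0, 1.0, n)  # noisy RCT measurement (noise scale assumed)

top = np.argsort(performance)[-20:]              # the 20 best-looking interventions
print("mean measured performance of top 20:", round(performance[top].mean(), 2))
print("mean true quality of top 20:       ", round(quality[top].mean(), 2))
# Selecting on a noisy measurement means the winners' true quality is systematically
# lower than their measured performance (regression toward the mean).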
https://www.lesswrong.com/posts/nnDTgmzRrzDMiPF9B/how-much-do-you-believe-your-results
This is a post about mental health and disposition in relation to the alignment problem. It compiles a number of resources that address how to maintain wellbeing and direction when confronted with existential risk.
Many people in this community have posted their emotional strategies for facing Doom after Eliezer Yudkowsky’s “Death With Dignity” generated so much conversation on the subject. This post intends to be more touchy-feely, dealing more directly with emotional landscapes than questions of timelines or probabilities of success.
The resources section would benefit from community additions. Please suggest any resources that you would like to see added to this post.
Please note that this document is not intended to replace professional medical or psychological help in any way. Many preexisting mental health conditions can be exacerbated by these conversations. If you are concerned that you may be experiencing a mental health crisis, please consult a professional.
https://www.lesswrong.com/posts/pLLeGA7aGaJpgCkof/mental-health-and-the-alignment-problem-a-compilation-of
The primary talk of the AI world recently is about AI agents (whether or not it includes the question of whether we can’t help but notice we are all going to die.)
The trigger for this was AutoGPT, now number one on GitHub, which allows you to turn GPT-4 (or GPT-3.5 for us clowns without proper access) into a prototype version of a self-directed agent.
We also have a paper out this week where a simple virtual world was created, populated by LLMs that were wrapped in code designed to make them simple agents, and then several days of activity were simulated, during which the AI inhabitants interacted, formed and executed plans, and it all seemed like the beginnings of a living and dynamic world. Game version hopefully coming soon.
How should we think about this? How worried should we be?
https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt
https://thezvi.wordpress.com/
(Related text posted to Twitter; this version is edited and has a more advanced final section.)
Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don't have an answer, maybe take a minute to generate one, or alternatively, try to predict what I'll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)
https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
https://www.lesswrong.com/posts/fJBTRa7m7KnCDdzG5/a-stylized-dialogue-on-john-wentworth-s-claims-about-markets
(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)
(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement. My short summary is:
https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness
This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.
You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)
https://www.lesswrong.com/posts/nTGEeRSZrfPiJwkEc/the-onion-test-for-personal-and-institutional-honesty
[co-written by Chana Messinger and Andrew Critch, Andrew is the originator of the idea]
You (or your organization or your mission or your family or etc.) pass the “onion test” for honesty if each layer hides but does not mislead about the information hidden within.
When people get to know you better, or rise higher in your organization, they may find out new things, but should not be shocked by the types of information that were hidden. If they are, you messed up in creating the outer layers to describe appropriately the kind-of-thing that might be inside.
Examples
Positive Example:
Outer layer says "I usually treat my health information as private."
Next layer in says: "Here are the specific health problems I have: Gout, diabetes."
Negative example:
Outer layer says: "I usually treat my health info as private."
Next layer in: "I operate a cocaine dealership. Sorry I didn't warn you that I was also private about my illegal activities."
https://www.lesswrong.com/posts/fRwdkop6tyhi3d22L/there-s-no-such-thing-as-a-tree-phylogenetically
This is a linkpost for https://eukaryotewritesblog.com/2021/05/02/theres-no-such-thing-as-a-tree/
[Crossposted from Eukaryote Writes Blog.]
So you’ve heard about how fish aren’t a monophyletic group? You’ve heard about carcinization, the process by which ocean arthropods convergently evolve into crabs? You say you get it now? Sit down. Sit down. Shut up. Listen. You don’t know nothing yet.
“Trees” are not a coherent phylogenetic category. On the evolutionary tree of plants, trees are regularly interspersed with things that are absolutely, 100% not trees. This means that, for instance, either:
I thought I had a pretty good guess at this, but the situation is far worse than I could have imagined.
https://www.lesswrong.com/posts/ma7FSEtumkve8czGF/losing-the-root-for-the-tree
You know that being healthy is important. And that there's a lot of stuff you could do to improve your health: getting enough sleep, eating well, reducing stress, and exercising, to name a few.
There’s various things to hit on when it comes to exercising too. Strength, obviously. But explosiveness is a separate thing that you have to train for. Same with flexibility. And don’t forget cardio!
Strength is most important though, because of course it is. And there’s various things you need to do to gain strength. It all starts with lifting, but rest matters too. And supplements. And protein. Can’t forget about protein.
Protein is a deeper and more complicated subject than it may at first seem. Sure, the amount of protein you consume matters, but that’s not the only consideration. You also have to think about the timing. Consuming large amounts 2x a day is different than consuming smaller amounts 5x a day. And the type of protein matters too. Animal is different than plant, which is different from dairy. And then quality is of course another thing that is important.
But quality isn’t an easy thing to figure out. The big protein supplement companies are Out To Get You. They want to mislead you. Information sources aren’t always trustworthy. You can’t just hop on The Wirecutter and do what they tell you. Research is needed.
So you listen to a few podcasts. Follow a few YouTubers. Start reading some blogs. Throughout all of this you try various products and iterate as you learn more. You’re no Joe Rogan, but you’re starting to become pretty informed.
https://gwern.net/fiction/clippy
In A.D. 20XX. Work was beginning. “How are you gentlemen !!”… (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there’s high variance in the old runs with a few anomalously high gain of function. (“Really? Really? That’s what you’re worried about?”) He can’t see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment…
https://www.lesswrong.com/posts/K4urTDkBbtNuLivJx/why-i-think-strong-general-ai-is-coming-soon
I think there is little time left before someone builds AGI (median ~2030). Once upon a time, I didn't think this.
This post attempts to walk through some of the observations and insights that collapsed my estimates.
The core ideas are as follows:
https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options
This is an essay about one of those "once you see it, you will see it everywhere" phenomena. It is a psychological and interpersonal dynamic roughly as common, and almost as destructive, as motte-and-bailey, and at least in my own personal experience it's been quite valuable to have it reified, so that I can quickly recognize the commonality between what I had previously thought of as completely unrelated situations.
The original quote referenced in the title is "There are three kinds of lies: lies, damned lies, and statistics."
https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard
In addition to technical challenges, plans to safely develop AI face lots of organizational challenges. If you're running an AI lab, you need a concrete plan for handling that.
In this post, I'll explore some of those issues, using one particular AI plan as an example. I first heard this described by Buck at EA Global London, and more recently with OpenAI's alignment plan. (I think Anthropic's plan has a fairly different ontology, although it still ultimately routes through a similar set of difficulties)
I'd call the cluster of plans similar to this "Carefully Bootstrapped Alignment."
https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/
[Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.]
We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.
We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.
As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably.
https://www.lesswrong.com/posts/zidQmfFhMgwFzcHhs/enemies-vs-malefactors
Status: some mix of common wisdom (that bears repeating in our particular context), and another deeper point that I mostly failed to communicate.
Short version
Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway. I recommend focusing less on intent and more on patterns of harm.
(Credit to my explicit articulation of this idea goes in large part to Aella, and also in part to Oliver Habryka.)
https://www.lesswrong.com/posts/LzQtrHSYDafXynofq/the-parable-of-the-king-and-the-random-process
~ A Parable of Forecasting Under Model Uncertainty ~
You, the monarch, need to know when the rainy season will begin, in order to properly time the planting of the crops. You have two advisors, Pronto and Eternidad, who you trust exactly equally.
You ask them both: "When will the next heavy rain occur?"
Pronto says, "Three weeks from today."
Eternidad says, "Ten years from today."
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This post is also available on the EA Forum.
Summary: Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic. I say this to be comforting rather than dismissive; if it sounds dismissive, I apologize.
With that said, I have four aims in writing this post:
https://www.lesswrong.com/posts/RryyWNmJNnLowbhfC/please-don-t-throw-your-mind-away
[Warning: the following dialogue contains an incidental spoiler for "Music in Human Evolution" by Kevin Simler. That post is short, good, and worth reading without spoilers, and this post will still be here if you come back later. It's also possible to get the point of this post by skipping the dialogue and reading the other sections.]
Pretty often, talking to someone who's arriving to the existential risk / AGI risk / longtermism cluster, I'll have a conversation like the following:
Tsvi: "So, what's been catching your eye about this stuff?"
Arrival: "I think I want to work on machine learning, and see if I can contribute to alignment that way."
T: "What's something that got your interest in ML?"
A: "It seems like people think that deep learning might be on the final ramp up to AGI, so I should probably know how that stuff works, and I think I have a good chance of learning ML at least well enough to maybe contribute to a research project."
------
This is an experiment with AI narration. What do you think? Tell us by going to t3a.is.
------
https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism
There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near term systems, expecting the ability to significantly speed up progress to appear well after misalignment and deception. Furthermore, progress in this area may directly shorten timelines or enable the creation of dual purpose systems which significantly speed up capabilities research.
OpenAI recently released their alignment plan. It focuses heavily on outsourcing cognitive work to language models, transitioning us to a regime where humans mostly provide oversight to automated research assistants. While there have been a lot of objections to and concerns about this plan, there hasn’t been a strong alternative approach aiming to automate alignment research which also takes all of the many risks seriously.
The intention of this post is not to propose an end-all cure for the tricky problem of accelerating alignment using GPT models. Instead, the purpose is to explicitly put another point on the map of possible strategies, and to add nuance to the overall discussion.
https://www.lesswrong.com/posts/CYN7swrefEss4e3Qe/childhoods-of-exceptional-people
This is a linkpost for https://escapingflatland.substack.com/p/childhoods
Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.
But this is not what parents usually do when they think about how to educate their kids. The default for a parent is rather to imitate their peers and outsource the big decisions to bureaucracies. But what would we learn if we studied the highest achievements?
Thinking about this question, I wrote down a list of twenty names—von Neumann, Tolstoy, Curie, Pascal, etc—selected on the highly scientific criteria “a random Swedish person can recall their name and think, Sounds like a genius to me”. That list is to me a good first approximation of what an exceptional result in the field of child-rearing looks like. I ordered a few piles of biographies, read, and took notes. Trying to be a little less biased in my sample, I asked myself if I could recall anyone exceptional that did not fit the patterns I saw in the biographies, which I could, and so I ordered a few more biographies.
This kept going for an unhealthy amount of time.
I sampled writers (Virginia Woolf, Lev Tolstoy), mathematicians (John von Neumann, Blaise Pascal, Alan Turing), philosophers (Bertrand Russell, René Descartes), and composers (Mozart, Bach), trying to get a diverse sample.
In this essay, I am going to detail a few of the patterns that have struck me after having skimmed 42 biographies. I will sort the claims so that I start with more universal patterns and end with patterns that are less common.
https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
https://www.lesswrong.com/posts/NRrbJJWnaSorrqvtZ/on-not-getting-contaminated-by-the-wrong-obesity-ideas
A Chemical Hunger (a), a series by the authors of the blog Slime Mold Time Mold (SMTM), argues that the obesity epidemic is entirely caused (a) by environmental contaminants.
In my last post, I investigated SMTM’s main suspect (lithium).[1] This post collects other observations I have made about SMTM’s work, not narrowly related to lithium, but rather focused on the broader thesis of their blog post series.
I think that the environmental contamination hypothesis of the obesity epidemic is a priori plausible. After all, we know that chemicals can affect humans, and our exposure to chemicals has plausibly changed a lot over time. However, I found that several of what seem to be SMTM’s strongest arguments in favor of the contamination theory turned out to be dubious, and that nearly all of the interesting things I thought I’d learned from their blog posts turned out to actually be wrong. I’ll explain that in this post.
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins.
TL;DR
Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)
Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:
In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post.
https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s
Writing down something I’ve found myself repeating in different conversations:
If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots.
Look for places where everyone's fretting about a problem that some part of you thinks it could obviously just solve.
Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.
Then do it better.
https://www.lesswrong.com/posts/XPv4sYrKnPzeJASuk/basics-of-rationalist-discourse-1
Introduction
This post is meant to be a linkable resource. Its core is a short list of guidelines (you can link directly to the list) that are intended to be fairly straightforward and uncontroversial, for the purpose of nurturing and strengthening a culture of clear thinking, clear communication, and collaborative truth-seeking.
"Alas," said Dumbledore, "we all know that what should be, and what is, are two different things. Thank you for keeping this in mind."
There is also (for those who want to read more than the simple list) substantial expansion/clarification of each specific guideline, along with justification for the overall philosophy behind the set.
https://www.lesswrong.com/posts/PCrTQDbciG4oLgmQ5/sapir-whorf-for-rationalists
Casus Belli: As I was scanning over my (rather long) list of essays-to-write, I realized that roughly a fifth of them were of the form "here's a useful standalone concept I'd like to reify," à la cup-stacking skills, fabricated options, split and commit, and sazen. Some notable entries on that list (which I name here mostly in the hope of someday coming back and turning them into links) include: red vs. white, walking with three, setting the zero point[1], seeding vs. weeding, hidden hinges, reality distortion fields, and something-about-layers-though-that-one-obviously-needs-a-better-word.
While it's still worthwhile to motivate/justify each individual new conceptual handle (and the planned essays will do so), I found myself imagining a general objection of the form "this is just making up terms for things," or perhaps "this is too many new terms, for too many new things." I realized that there was a chunk of argument, repeated across all of the planned essays, that I could factor out, and that (to the best of my knowledge) there was no single essay aimed directly at the question "why new words/phrases/conceptual handles at all?"
So ... voilà.
https://www.lesswrong.com/posts/pDzdb4smpzT3Lwbym/my-model-of-ea-burnout
(Probably somebody else has said most of this. But I personally haven't read it, and felt like writing it down myself, so here we go.)
I think that EA [editor note: "Effective Altruism"] burnout usually results from prolonged dedication to satisfying the values you think you should have, while neglecting the values you actually have.
Setting aside for the moment what “values” are and what it means to “actually” have one, suppose that I actually value these things (among others):
https://www.lesswrong.com/posts/Xo7qmDakxiizG7B9c/the-social-recession-by-the-numbers
This is a linkpost for https://novum.substack.com/p/social-recession-by-the-numbers
Fewer friends, relationships on the decline, delayed adulthood, trust at an all-time low, and many diseases of despair. The prognosis is not great.
One of the most discussed topics online recently has been friendships and loneliness. Ever since the infamous chart showing more people are not having sex than ever before first made the rounds, there’s been increased interest in the social state of things. Polling has demonstrated a marked decline in all spheres of social life, including close friends, intimate relationships, trust, labor participation, and community involvement. The trend looks to have worsened since the pandemic, although it will take some years before this is clearly established.
The decline comes alongside a documented rise in mental illness, diseases of despair, and poor health more generally. In August 2022, the CDC announced that U.S. life expectancy has fallen further and is now where it was in 1996. Contrast this to Western Europe, where it has largely rebounded to pre-pandemic numbers. Still, even before the pandemic, the years 2015-2017 saw the longest sustained decline in U.S. life expectancy since 1915-18. While my intended angle here is not health-related, general sociability is closely linked to health. The ongoing shift has been called the “friendship recession” or the “social recession.”
https://www.lesswrong.com/posts/pHfPvb4JMhGDr4B7n/recursive-middle-manager-hell
I think Zvi's Immoral Mazes sequence is really important, but comes with more worldview-assumptions than are necessary to make the points actionable. I conceptualize Zvi as arguing for multiple hypotheses. In this post I want to articulate one sub-hypothesis, which I call "Recursive Middle Manager Hell". I'm deliberately not covering some other components of his model[1].
tl;dr:
Something weird and kinda horrifying happens when you add layers of middle-management. This has ramifications on when/how to scale organizations, and where you might want to work, and maybe general models of what's going on in the world.
You could summarize the effect as "the org gets more deceptive, less connected to its original goals, more focused on office politics, less able to communicate clearly within itself, and selected more for sociopathy in upper management."
You might read that list of things and say "sure, seems a bit true", but one of the main points here is "Actually, this happens in a deeper and more insidious way than you're probably realizing, with much higher costs than you're acknowledging. If you're scaling your organization, this should be one of your primary worries."
https://www.lesswrong.com/posts/mfPHTWsFhzmcXw8ta/the-feeling-of-idea-scarcity
Here’s a story you may recognize. There's a bright up-and-coming young person - let's call her Alice. Alice has a cool idea. It seems like maybe an important idea, a big idea, an idea which might matter. A new and valuable idea. It’s the first time Alice has come up with a high-potential idea herself, something which she’s never heard in a class or read in a book or what have you.
So Alice goes all-in pursuing this idea. She spends months fleshing it out. Maybe she writes a paper, or starts a blog, or gets a research grant, or starts a company, or whatever, in order to pursue the high-potential idea, bring it to the world.
And sometimes it just works!
… but more often, the high-potential idea doesn’t actually work out. Maybe it turns out to be basically-the-same as something which has already been tried. Maybe it runs into some major barrier, some not-easily-patchable flaw in the idea. Maybe the problem it solves just wasn’t that important in the first place.
From Alice’s point of view, the possibility that her one high-potential idea wasn’t that great after all is painful. The idea probably feels to Alice like the single biggest intellectual achievement of her life. To lose that, to find out that her single greatest intellectual achievement amounts to little or nothing… that hurts to even think about. So most likely, Alice will reflexively look for an out. She’ll look for some excuse to ignore the similar ideas which have already been tried, some reason to think her idea is different. She’ll look for reasons to believe that maybe the major barrier isn’t that much of an issue, or that we Just Don’t Know whether it’s actually an issue and therefore maybe the idea could work after all. She’ll look for reasons why the problem really is important. Maybe she’ll grudgingly acknowledge some shortcomings of the idea, but she’ll give up as little ground as possible at each step, and update as slowly as she can.
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.
When thinking about deception and RLHF training, a simplified threat model is something like this:
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Introduction
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as more of a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic.
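As a rough illustration of what "totally unsupervised" means here, below is a minimal sketch paraphrasing the contrast-consistency idea: score a linear probe by asking that a statement and its negation get probabilities summing to roughly one, without ever using labels. The variable names and toy data are illustrative, not the paper's released code.

```python
# Sketch of a contrast-consistent-style probe (a paraphrase of the idea, not
# the paper's code). Given hidden states for a statement x+ and its negation
# x-, score a linear probe whose probabilities are consistent and confident.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_style_loss(w, b, h_plus, h_minus):
    """h_plus, h_minus: (n_examples, hidden_dim) activations for x+ and x-."""
    p_plus = sigmoid(h_plus @ w + b)
    p_minus = sigmoid(h_minus @ w + b)
    consistency = (p_plus - (1.0 - p_minus)) ** 2   # the two should sum to ~1
    confidence = np.minimum(p_plus, p_minus) ** 2   # discourage p ~= 0.5 everywhere
    return float(np.mean(consistency + confidence))

# Toy usage with random "activations"; a real run would optimize w, b by gradient descent.
rng = np.random.default_rng(0)
h_plus, h_minus = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
w, b = rng.normal(size=16), 0.0
print(ccs_style_loss(w, b, h_plus, h_minus))
```

No labels appear anywhere in the loss; that is the sense in which the method is unsupervised.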
https://www.lesswrong.com/posts/qRtD4WqKRYEtT5pi3/the-next-decades-might-be-wild
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I’d like to thank Simon Grimm and Tamay Besiroglu for feedback and discussions.
This post is inspired by What 2026 looks like and an AI vignette workshop guided by Tamay Besiroglu. I think of this post as “what would I expect the world to look like if these timelines (median compute for transformative AI ~2036) were true” or “what short-to-medium timelines feel like” since I find it hard to translate a statement like “median TAI year is 20XX” into a coherent imaginable world.
I expect some readers to think that the post sounds wild and crazy, but that doesn’t mean its content couldn’t be true. If you had told someone in 1990 or 2000 that there would be more smartphones and computers than humans in 2020, that probably would have sounded wild to them. The same could be true for AIs, i.e. that in 2050 there are more human-level AIs than humans. The fact that this sounds as ridiculous as ubiquitous smartphones sounded to the 1990/2000 person might just mean that we are bad at predicting exponential growth and disruptive technology.
Update: titotal points out in the comments that the correct timeframe for computers is probably 1980 to 2020. So the correct time span is probably 40 years instead of 30. For mobile phones, it's probably 1993 to 2020 if you can trust this statistic.
I’m obviously not confident (see confidence and takeaways section) in this particular prediction but many of the things I describe seem like relatively direct consequences of more and more powerful and ubiquitous AI mixed with basic social dynamics and incentives.
https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I’d like to thank MH, Jaime Sevilla and Tamay Besiroglu for their feedback.
During my Master's and Ph.D. (still ongoing), I have spoken with many academics about AI safety. These conversations include chats with individual PhDs, poster presentations and talks about AI safety.
I think I have learned a lot from these conversations and expect many other people concerned about AI safety to find themselves in similar situations. Therefore, I want to detail some of my lessons and make my thoughts explicit so that others can scrutinize them.
TL;DR: People in academia seem more and more open to arguments about risks from advanced intelligence over time, and I would genuinely recommend having lots of these chats. Furthermore, I underestimated how much work related to some aspects of AI safety already exists in academia and that we sometimes reinvent the wheel. Messaging matters, e.g. technical discussions got more interest than alarmism, and explaining the problem rather than trying to actively convince someone received better feedback.
https://www.lesswrong.com/posts/6LzKRP88mhL9NKNrS/how-my-team-at-lightcone-sometimes-gets-stuff-done
Disclaimer: I originally wrote this as a private doc for the Lightcone team. I then showed it to John and he said he would pay me to post it here. That sounded awfully compelling. However, I wanted to note that I’m an early founder who hasn't built anything truly great yet. I’m writing this doc because as Lightcone is growing, I have to take a stance on these questions. I need to design our org to handle more people. Still, I haven’t seen the results long-term, and who knows if this is good advice. Don’t overinterpret this.
Suppose you went up on stage in front of a company you founded, which has now grown to 100, or 1,000, or 10,000+ people. You were going to give a talk about your company values. You can say things like “We care about moving fast, taking responsibility, and being creative” -- but I expect these words would mostly fall flat. At the end of the day, the path the water takes down the hill is determined by the shape of the territory, not the sound the water makes as it swooshes by. To manage that many people, it seems to me you need clear, concrete instructions. What are those? What are things you could write down on a piece of paper and pass along your chain of command, such that if at the end people go ahead and just implement them, without asking what you meant, they would still preserve some chunk of what makes your org work?
https://www.lesswrong.com/posts/rP66bz34crvDudzcJ/decision-theory-does-not-imply-that-we-get-to-have-nice
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)
A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.) is that people think LDT agents are genial and friendly toward each other.[1]
One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)
I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.
To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.
"What?" cries the CDT agent. "I thought LDT agents one-box!"
LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.
A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because it gets more paperclips that way. Not because it has a primitive property of cooperativeness-with-similar-beings. It cooperates because that is what gets it more paperclips.
https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like#2022
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.)
What’s the point of doing this? Well, there are a couple of reasons:
This vignette was hard to write. To achieve the desired level of detail I had to make a bunch of stuff up, but in order to be realistic I had to constantly ask “but actually though, what would really happen in this situation?” which made it painfully obvious how little I know about the future. There are numerous points where I had to conclude “Well, this does seem implausible, but I can’t think of anything more plausible at the moment and I need to move on.” I fully expect the actual world to diverge quickly from the trajectory laid out here. Let anyone who (with the benefit of hindsight) claims this divergence as evidence against my judgment prove it by exhibiting a vignette/trajectory they themselves wrote in 2021. If it maintains a similar level of detail (and thus sticks its neck out just as much) while being more accurate, I bow deeply in respect!
https://www.lesswrong.com/posts/REA49tL5jsh69X3aM/introduction-to-abstract-entropy#fnrefpi8b39u5hd7
This post, and much of the following sequence, was greatly aided by feedback from the following people (among others): Lawrence Chan, Joanna Morningstar, John Wentworth, Samira Nedungadi, Aysja Johnson, Cody Wild, Jeremy Gillen, Ryan Kidd, Justis Mills and Jonathan Mustin. Illustrations by Anne Ore.
Introduction & motivation
In the course of researching optimization, I decided that I had to really understand what entropy is.[1] But there are a lot of other reasons why the concept is worth studying:
I didn't intend to write a post about entropy when I started trying to understand it. But I found the existing resources (textbooks, Wikipedia, science explainers) so poor that it actually seems important to have a better one as a prerequisite for understanding optimization! One failure mode I was running into was that other resources tended only to be concerned about the application of the concept in their particular sub-domain. Here, I try to take on the task of synthesizing the abstract concept of entropy, to show what's so deep and fundamental about it. In future posts, I'll talk about things like:
https://www.lesswrong.com/posts/8vesjeKybhRggaEpT/consider-your-appetite-for-disagreements
Poker
There was a time about five years ago when I was trying to get good at poker. If you want to get good at poker, one thing you have to do is review hands. Preferably with other people.
For example, suppose you have ace king offsuit on the button. Someone in the highjack opens to 3 big blinds preflop. You call. Everyone else folds. The flop is dealt. It's a rainbow Q75. You don't have any flush draws. You missed. Your opponent bets. You fold. They take the pot and you move to the next hand.
Once you finish your session, it'd be good to come back and review this hand. Again, preferably with another person. To do this, you would review each decision point in the hand. Here, there were two decision points.
The first was when you faced a 3BB open from HJ preflop with AKo. In the hand, you decided to call. However, this of course wasn't your only option. You had two others: you could have folded, and you could have raised. Actually, you could have raised to various sizes. You could have raised small to 8BB, medium to 10BB, or big to 12BB. Or hell, you could have just shoved 200BB! But that's not really a realistic option, nor is folding. So in practice your decision was between calling and raising to various realistic sizes.
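To make "reviewing each decision point" concrete, here is a minimal sketch of tabulating the realistic preflop options and comparing rough expected values. The EV numbers are placeholders for illustration, not solver output or figures from the post.

```python
# Hypothetical sketch of reviewing one preflop decision point (AKo on the
# button facing a 3BB hijack open). The EV numbers are made up.
decision_point = {
    "hand": "AKo",
    "position": "BTN",
    "facing": "HJ open to 3BB",
    "options": {
        "fold": 0.0,            # EV in big blinds (placeholder)
        "call": 0.9,
        "raise to 8BB": 1.1,
        "raise to 10BB": 1.15,
        "raise to 12BB": 1.05,
    },
}

for action, ev in sorted(decision_point["options"].items(), key=lambda kv: -kv[1]):
    print(f"{action:>15}: {ev:+.2f} BB")
best = max(decision_point["options"], key=decision_point["options"].get)
print(f"Best option under these made-up numbers: {best}")
```

The point of the review is the comparison itself: laying the realistic options side by side and asking which one you actually took, and why.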
https://www.lesswrong.com/posts/fFY2HeC9i2Tx8FEnK/my-resentful-story-of-becoming-a-medical-miracle
This is a linkpost for https://acesounderglass.com/2022/10/13/my-resentful-story-of-becoming-a-medical-miracle/
You know those health books with “miracle cure” in the subtitle? The ones that always start with a preface about a particular patient who was completely hopeless until they tried the supplement/meditation technique/healing crystal that the book is based on? These people always start broken and miserable, unable to work or enjoy life, perhaps even suicidal from the sheer hopelessness of getting their body to stop betraying them. They’ve spent decades trying everything and nothing has worked until their friend makes them see the book’s author, who prescribes the same thing they always prescribe, and the patient immediately stands up and starts dancing because their problem is entirely fixed (more conservative books will say it took two sessions). You know how those are completely unbelievable, because anything that worked that well would go mainstream, so basically the book is starting you off with a shit test to make sure you don’t challenge its bullshit later?
Well 5 months ago I became one of those miraculous stories, except worse, because my doctor didn’t even do it on purpose. This finalized some already fermenting changes in how I view medical interventions and research. Namely: sometimes knowledge doesn’t work and then you have to optimize for luck.
I assure you I’m at least as unhappy about this as you are.
https://www.lesswrong.com/posts/CKgPFHoWFkviYz7CB/the-redaction-machine
On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.
In the heart of the machine was Jane, a person of the early 21st century.
From her perspective there was no transition. One moment she had been in the year 2021, sat beneath a tree in a park. Reading a detective novel.
Then the book was gone, and the tree. Also the park. Even the year.
She found herself lying in a bathtub, immersed in sickly fatty fluids. She was naked and cold.
https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.
The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):
Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.
Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.
https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.
https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines#fnref-fwwPpQFdWM6hJqwuY-12
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
I worked on my draft report on biological anchors for forecasting AI timelines mainly between ~May 2019 (three months after the release of GPT-2) and ~Jul 2020 (a month after the release of GPT-3), and posted it on LessWrong in Sep 2020 after an internal review process. At the time, my bottom line estimates from the bio anchors modeling exercise were:[1]
These were roughly close to my all-things-considered probabilities at the time, as other salient analytical frames on timelines didn’t do much to push back on this view. (Though my subjective probabilities bounced around quite a lot around these values and if you’d asked me on different days and with different framings I’d have given meaningfully different numbers.)
It’s been about two years since the bulk of the work on that report was completed, during which I’ve mainly been thinking about AI. In that time it feels like very short timelines have become a lot more common and salient on LessWrong and in at least some parts of the ML community.
My personal timelines have also gotten considerably shorter over this period. I now expect something roughly like this:
https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring
Eight years ago, I worked as a data scientist at a startup, and we wanted to optimize our sign-up flow. We A/B tested lots of different changes, and occasionally found something which would boost (or reduce) click-through rates by 10% or so.
Then one week I was puzzling over a discrepancy in the variance of our daily signups. Eventually I scraped some data from the log files, and found that during traffic spikes, our server latency shot up to multiple seconds. The effect on signups during these spikes was massive: even just 300 ms was enough that click-through dropped by 30%, and when latency went up to seconds the click-through rates dropped by over 80%. And this happened multiple times per day. Latency was far and away the most important factor which determined our click-through rates. [1]
Going back through some of our earlier experiments, it was clear in hindsight that some of our biggest effect-sizes actually came from changing latency - for instance, if we changed the order of two screens, then there’d be an extra screen before the user hit the one with high latency, so the latency would be better hidden. Our original interpretations of those experiments - e.g. that the user cared more about the content of one screen than another - were totally wrong. It was also clear in hindsight that our statistics on all the earlier experiments were bunk - we’d assumed that every user’s click-through was statistically independent, when in fact they were highly correlated, so many of the results which we thought were significant were in fact basically noise.
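A minimal sketch of the statistical point, with made-up numbers rather than the startup's actual data: when click-throughs within a traffic spike are correlated, the naive "every user is independent" variance estimate is far too small, so noise looks significant.

```python
# Hypothetical illustration: correlated click-throughs within days (e.g. from
# latency spikes) make the naive independent-users standard error too small.
import numpy as np

rng = np.random.default_rng(1)
n_days, users_per_day = 200, 500
day_effect = rng.normal(0, 0.10, size=n_days)   # latency spikes shift whole days
daily_rates = np.clip(0.30 + day_effect, 0, 1)

clicks = rng.binomial(users_per_day, daily_rates)        # clicks per day
p_hat = clicks.sum() / (n_days * users_per_day)

naive_se = np.sqrt(p_hat * (1 - p_hat) / (n_days * users_per_day))       # assumes independence
clustered_se = np.std(clicks / users_per_day, ddof=1) / np.sqrt(n_days)  # day-level variance

print(f"naive SE:     {naive_se:.4f}")
print(f"clustered SE: {clustered_se:.4f}   # several times larger")
```

With the smaller (wrong) standard error, ordinary effect sizes clear the significance bar even when they are pure day-to-day noise.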
https://www.lesswrong.com/posts/WNpvK67MjREgvB8u8/do-bamboos-set-themselves-on-fire
Cross-posted from Telescopic Turnip.
As we all know, the best place to have a kung-fu fight is a bamboo forest. There are just so many opportunities to grab pieces of bamboo and manufacture improvised weapons, use them to catapult yourself into the air, and other basic techniques any debutant martial artist ought to know. A lesser-known fact is that bamboo-forest fights occur even when the cameras of Hong Kong filmmakers are not present. They may even happen without the presence of humans at all. The forest itself is the kung-fu fight.
It's often argued that humans are the worst species on Earth, because of our limitless potential for violence and mutual destruction. If that's the case, bamboos are second. Bamboos are sick. The evolution of bamboos is the result of multiple layers of sheer brutality, with two imbricated levels of war, culminating in an apocalypse of combustive annihilation. At least, according to some hypotheses.
https://www.lesswrong.com/posts/oyKzz7bvcZMEPaDs6/survey-advice
Things I believe about making surveys, after making some surveys:
https://www.lesswrong.com/posts/J3wemDGtsy5gzD3xa/toni-kurz-and-the-insanity-of-climbing-mountains
Content warning: death
I've been on a YouTube binge lately. My current favorite genre is disaster stories about mountain climbing. The death statistics for some of these mountains, especially ones in the Himalayas are truly insane.
To give an example, let me tell you about a mountain most people have never heard of: Nanga Parbat. It's an 8,126-meter "wall of ice and rock", sporting the tallest mountain face and the fastest change in elevation in the entire world: the Rupal Face.
I've posted a picture above, but pictures really don't do justice to just how gigantic this wall is. This single face is as tall as the largest mountain in the Alps. It is the size of ten Empire State Buildings stacked on top of one another. If you could somehow walk straight up starting from the bottom, it would take you an entire HOUR to reach the summit.
31 people died trying to climb this mountain before its first successful ascent. Imagine being climber number 32 and thinking "Well, I know no one has ascended this mountain and thirty-one people have died trying, but why not, let's give it a go!"
The stories of deaths on these mountains (and even much shorter peaks in the Alps or in North America) sound like they are out of a novel. Stories of one mountain in particular have stuck with me: the first attempts to climb the tallest mountain face in the Alps, the Eigerwand.
The Eigerwand is the north face of a 14,000-foot peak named "The Eiger". After three generations of Europeans had conquered every peak in the Alps, few great challenges remained in the area. The Eigerwand was one of these: widely considered to be the greatest unclimbed route in the Alps.
The peak had already been reached in the 1850s, during the golden age of Alpine exploration. But the north face of the mountain remained unclimbed.
Many things can make a climb challenging: steep slopes, avalanches, long ascents, no easy resting spots and more. The Eigerwand had all of those, but one hazard in particular stood out: loose rock and snow.
In the summer months (usually considered the best time for climbing), the mountain crumbles. Fist-sized boulders routinely tumble down the mountain. Huge avalanches sweep down its 70-degree slopes at incredible speed. And the huge, concave face is perpetually in shadow. It is extremely cold and windy, and the concave face seems to cause local weather patterns that can be completely different from the pass below. The face is deadly.
Before 1935, no team had made a serious attempt at the face. But that year, two young German climbers from Bavaria, both extremely experienced but relatively unknown outside the climbing community, decided they would make the first serious attempt.
https://www.lesswrong.com/posts/gs3vp3ukPbpaEie5L/deliberate-grieving-1
This post is hopefully useful on its own, but begins a series ultimately about grieving over a world that might (or, might not) be doomed.
It starts with some pieces from a previous coordination frontier sequence post, but goes into more detail.
At the beginning of the pandemic, I didn’t have much experience with grief. By the end of the pandemic, I had gotten quite a lot of practice grieving for things. I now think of grieving as a key life skill, with ramifications for epistemics, action, and coordination.
I had read The Art of Grieving Well, which gave me footholds to get started with. But I still had to develop some skills from scratch, and apply them in novel ways.
Grieving probably works differently for different people. Your mileage may vary. But for me, grieving is the act of wrapping my brain around the fact that something important to me doesn’t exist anymore. Or can’t exist right now. Or perhaps never existed. It typically comes in two steps – an “orientation” step, where my brain traces around the lines of the thing-that-isn’t-there, coming to understand what reality is actually shaped like now. And then a “catharsis” step, once I fully understand that the thing is gone. The first step can take hours, weeks or months.
You can grieve for people who are gone. You can grieve for things you used to enjoy. You can grieve for principles that were important to you but aren’t practical to apply right now.
Grieving is important in single-player mode – if I’m holding onto something that’s not there anymore, my thoughts and decision-making are distorted. I can’t make good plans if my map of reality is full of leftover wishful markings of things that aren’t there.
https://www.lesswrong.com/s/6xgy8XYEisLk3tCjH/p/CPP2uLcaywEokFKQG
Tl;dr:
I've noticed a dichotomy between "thinking in toolboxes" and "thinking in laws".
The toolbox style of thinking says it's important to have a big bag of tools that you can adapt to context and circumstance; people who think very toolboxly tend to suspect that anyone who goes talking of a single optimal way is just ignorant of the uses of the other tools.
The lawful style of thinking, done correctly, distinguishes between descriptive truths, normative ideals, and prescriptive ideals. It may talk about certain paths being optimal, even if there's no executable-in-practice algorithm that yields the optimal path. It considers truths that are not tools.
Within nearly-Euclidean mazes, the triangle inequality - that the path AC is never spatially longer than the path ABC - is always true but only sometimes useful. The triangle inequality has the prescriptive implication that if you know that one path choice will travel ABC and one path will travel AC, and if the only pragmatic path-merit you care about is going the minimum spatial distance (rather than say avoiding stairs because somebody in the party is in a wheelchair), then you should pick the route AC. But the triangle inequality goes on governing Euclidean mazes whether or not you know which path is which, and whether or not you need to avoid stairs.
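For concreteness, here is a tiny numerical check of the law being described (illustrative only, not from the post): in a Euclidean space, going through any intermediate waypoint B can never be shorter than going directly from A to C.

```python
# Small numerical check of the triangle inequality: |AC| <= |AB| + |BC|
# for arbitrary points in the plane (illustration, not from the post).
import math, random

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

random.seed(0)
for _ in range(10_000):
    A, B, C = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(3)]
    assert dist(A, C) <= dist(A, B) + dist(B, C) + 1e-9  # always holds

print("The direct route AC is never longer than the detour ABC.")
```

The law holds whether or not you ever need it; whether it is useful on a given day is a separate, pragmatic question.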
Toolbox thinkers may be extremely suspicious of this claim of universal lawfulness if it is explained less than perfectly, because it sounds to them like "Throw away all the other tools in your toolbox! All you need to know is Euclidean geometry, and you can always find the shortest path through any maze, which in turn is always the best path."
If you think that's an unrealistic depiction of a misunderstanding that would never happen in reality, keep reading.
https://www.lesswrong.com/posts/PBRWb2Em5SNeWYwwB/humans-are-not-automatically-strategic
Reply to: A "Failure to Evaluate Return-on-Time" Fallacy
Lionhearted writes:
[A] large majority of otherwise smart people spend time doing semi-productive things, when there are massively productive opportunities untapped. A somewhat silly example: Let's say someone aspires to be a comedian, the best comedian ever, and to make a living doing comedy. He wants nothing else, it is his purpose. And he decides that in order to become a better comedian, he will watch re-runs of the old television cartoon 'Garfield and Friends' that was on TV from 1988 to 1995. ... I’m curious as to why.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
[Thanks to a variety of people for comments and assistance (especially Paul Christiano, Nostalgebraist, and Rafe Kennedy), and to various people for playing the game. Buck wrote the top-1 prediction web app; Fabien wrote the code for the perplexity experiment and did most of the analysis and wrote up the math here, Lawrence did the research on previous measurements. Epistemic status: we're pretty confident of our work here, but haven't engaged in a super thorough review process of all of it--this was more like a side-project than a core research project.]
How good are modern language models compared to humans, at the task language models are trained on (next token prediction on internet text)? While there are language-based tasks that you can construct where humans can make a next-token prediction better than any language model, we aren't aware of any apples-to-apples comparisons on non-handcrafted datasets. To answer this question, we performed a few experiments comparing humans to language models on next-token prediction on OpenWebText.
Contrary to some previous claims, we found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1. That is, even small language models are "superhuman" at predicting the next token. That being said, it seems plausible that humans can consistently beat the smaller 2017-era models (though not modern models) with a few hours more practice and strategizing. We conclude by discussing some of our takeaways from this result.
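As a sketch of the two metrics being compared (illustrative code with made-up probabilities, not the authors' experimental setup): top-1 accuracy asks how often the single highest-probability guess matches the actual next token, and perplexity exponentiates the average negative log probability assigned to the actual next tokens.

```python
# Illustrative computation of top-1 accuracy and perplexity for next-token
# prediction; the toy distributions below are made up.
import math

# Each entry: (predicted distribution over a toy vocabulary, actual next token)
examples = [
    ({"the": 0.6, "a": 0.3, "dog": 0.1}, "the"),
    ({"the": 0.2, "a": 0.5, "dog": 0.3}, "dog"),
    ({"the": 0.1, "a": 0.1, "dog": 0.8}, "dog"),
]

top1_hits = sum(max(p, key=p.get) == actual for p, actual in examples)
top1_accuracy = top1_hits / len(examples)

avg_nll = sum(-math.log(p[actual]) for p, actual in examples) / len(examples)
perplexity = math.exp(avg_nll)

print(f"top-1 accuracy: {top1_accuracy:.2f}")
print(f"perplexity:     {perplexity:.2f}")   # lower is better
```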
We're not claiming that this result is completely novel or surprising. For example, FactorialCode makes a similar claim as an answer on this LessWrong post about this question. We've also heard from some NLP people that the superiority of LMs to humans for next-token prediction is widely acknowledged in NLP. However, we've seen incorrect claims to the contrary on the internet, and as far as we know there hasn't been a proper apples-to-apples comparison, so we believe there's some value to our results.
If you want to play with our website, it’s here; in our opinion playing this game for half an hour gives you some useful perspective on what it’s like to be a language model.
https://www.lesswrong.com/posts/jDQm7YJxLnMnSNHFu/moral-strategies-at-different-capability-levels
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Let’s consider three ways you can be altruistic towards another agent:
I think a lot of unresolved tensions in ethics comes from seeing these types of morality as in opposition to each other, when they’re actually complementary:
https://www.lesswrong.com/posts/xFotXGEotcKouifky/worlds-where-iterative-design-fails
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.
In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.
By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.
Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:
… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.
https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is
Despite a clear need for it, a good source explaining who is doing what and why in technical AI alignment doesn't exist. This is our attempt to produce such a resource. We expect to be inaccurate in some ways, but it seems great to get out there and let Cunningham’s Law do its thing.[1]
The main body contains our understanding of what everyone is doing in technical alignment and why, as well as at least one of our opinions on each approach. We include supplements visualizing differences between approaches and Thomas’s big picture view on alignment. The opinions written are Thomas and Eli’s independent impressions, many of which have low resilience. Our all-things-considered views are significantly more uncertain.
This post was mostly written while Thomas was participating in the 2022 iteration of the SERI MATS program, under mentor John Wentworth. Thomas benefited immensely from conversations with other SERI MATS participants, with John Wentworth, and with many others he met this summer.
https://www.lesswrong.com/posts/rYDas2DDGGDRc8gGB/unifying-bargaining-notions-1-2
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a two-part sequence of posts, in the ancient LessWrong tradition of decision-theory-posting. This first part will introduce various concepts of bargaining solutions and dividing gains from trade, which the reader may or may not already be familiar with.
The upcoming part will be about how all introduced concepts from this post are secretly just different facets of the same underlying notion, as originally discovered by John Harsanyi back in 1963 and rediscovered by me from a completely different direction. The fact that the various different solution concepts in cooperative game theory are all merely special cases of a General Bargaining Solution for arbitrary games, is, as far as I can tell, not common knowledge on Less Wrong.
Bargaining Games
Let's say there's a couple with a set of available restaurant options. Neither of them wants to go without the other, and if they fail to come to an agreement, the fallback is eating a cold canned soup dinner at home, the worst of all the options. However, they have different restaurant preferences. What's the fair way to split the gains from trade?
Well, it depends on their restaurant preferences, and preferences are typically encoded with utility functions. Since both sides agree that the disagreement outcome is the worst, they might as well index that as 0 utility, and their favorite respective restaurants as 1 utility, and denominate all the other options in terms of what probability mix between a cold canned dinner and their favorite restaurant would make them indifferent. If there's something that scores 0.9 utility for both, it's probably a pretty good pick!
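A minimal sketch of that normalization step, plus one standard way of picking among the options (the Nash bargaining product). The restaurant numbers are made up, and the post goes on to discuss several solution concepts rather than just this one.

```python
# Toy bargaining setup (made-up numbers). Disagreement point = 0, each
# person's favorite option = 1, everything else scaled in between.
options = {
    "canned soup at home": (0.0, 0.0),   # the disagreement outcome
    "her favorite":        (1.0, 0.4),
    "his favorite":        (0.5, 1.0),
    "compromise place":    (0.9, 0.9),
}

# One classic selection rule: maximize the product of gains over the
# disagreement point (the Nash bargaining solution).
nash_pick = max(options, key=lambda name: options[name][0] * options[name][1])
print(nash_pick)  # -> "compromise place": 0.9 * 0.9 beats 1.0 * 0.4 and 0.5 * 1.0
```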
https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators#fncrt8wagfir9
Summary
TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like?
Unlike the limit of RL, the limit of self-supervised learning has received surprisingly little conceptual attention, and recent progress has made deconfusion in this domain more pressing.
Existing AI taxonomies either fail to capture important properties of self-supervised models or lead to confusing propositions. For instance, GPT policies do not seem globally agentic, yet can be conditioned to behave in goal-directed ways. This post describes a frame that enables more natural reasoning about properties like agency: GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra.
The purpose of this post is to capture these objects in words so GPT can reference them and provide a better foundation for understanding them.
I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT.
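The "iteratively sampling from its posterior and updating the condition" loop can be sketched generically as follows (an illustration assuming a stand-in model object with a hypothetical interface; this is not code from the post):

```python
# Generic sketch of a simulator rollout: repeatedly sample from the model's
# predicted next-token distribution and append the sample to the condition.
# `model` stands in for any conditional predictive model (text, code, game
# states, ...); its interface here is assumed for illustration.
import random

def rollout(model, prompt_tokens, n_steps, seed=0):
    rng = random.Random(seed)
    condition = list(prompt_tokens)
    for _ in range(n_steps):
        probs = model.next_token_distribution(condition)   # posterior over the next token
        tokens, weights = zip(*probs.items())
        sampled = rng.choices(tokens, weights=weights, k=1)[0]
        condition.append(sampled)                          # update the condition
    return condition

class ToyModel:
    """Trivial stand-in: always predicts 'a' or 'b' uniformly."""
    def next_token_distribution(self, condition):
        return {"a": 0.5, "b": 0.5}

print(rollout(ToyModel(), ["start"], n_steps=5))
```

The same loop, with a physics model in place of the toy one, is what "computing rollouts of phenomena in simulation" means in the analogy above.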
TL;DR: To even consciously consider an alignment research direction, you should have evidence to locate it as a promising lead. As best I can tell, many directions seem interesting but do not have strong evidence of being “entangled” with the alignment problem such that I expect them to yield significant insights.
For example, “we can solve an easier version of the alignment problem by first figuring out how to build an AI which maximizes the number of real-world diamonds” has intuitive appeal and plausibility, but this claim doesn’t have to be true, and this problem does not necessarily have a natural, compact solution. In contrast, there do in fact exist humans who care about diamonds. Therefore, there are guaranteed-to-exist alignment insights concerning the way people come to care about e.g. real-world diamonds.
“Consider how humans navigate the alignment subproblem you’re worried about” is a habit which I (TurnTrout) picked up from Quintin Pope. I wrote the post, he originated the tactic.
https://www.lesswrong.com/posts/DdDt5NXkfuxAnAvGJ/changing-the-world-through-slack-and-hobbies
In EA orthodoxy, if you're really serious about EA, the three alternatives that people most often seem to talk about are
(1) “direct work” in a job that furthers a very important cause;
(2) “earning to give”;
(3) earning “career capital” that will help you do those things in the future, e.g. by getting a PhD or teaching yourself ML.
By contrast, there’s not much talk of:
(4) being in a job / situation where you have extra time and energy and freedom to explore things that seem interesting and important.
But that last one is really important!
This is Part 1 of my «Boundaries» Sequence on LessWrong.
Summary: «Boundaries» are a missing concept from the axioms of game theory and bargaining theory, which might help pin-down certain features of multi-agent rationality (this post), and have broader implications for effective altruism discourse and x-risk (future posts).
Epistemic status: me describing what I mean.
With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demski, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.
When I say boundary, I don't just mean an arbitrary constraint or social norm. I mean something that could also be called a membrane in a generalized sense, i.e., a layer of stuff-of-some-kind that physically or cognitively separates a living system from its environment, that 'carves reality at the joints' in a way that isn't an entirely subjective judgement of the living system itself. Here are some examples that I hope will convey my meaning:
https://www.lesswrong.com/posts/MdZyLnLHuaHrCskjy/itt-passing-and-civility-are-good-charity-is-bad
I often object to claims like "charity/steelmanning is an argumentative virtue". This post collects a few things I and others have said on this topic over the last few years.
My current view is:
Related to: Slack gives you the ability to notice/reflect on subtle things
Epistemic status: A possibly annoying mixture of straightforward reasoning and hard-to-justify personal opinions.
It is often stated (with some justification, IMO) that AI risk is an “emergency.” Various people have explained to me that they put various parts of their normal life’s functioning on hold on account of AI being an “emergency.” In the interest of people doing this sanely and not confusedly, I’d like to take a step back and seek principles around what kinds of changes a person might want to make in an “emergency” of different sorts.
There are plenty of ways we can temporarily increase productivity on some narrow task or other, at the cost of our longer-term resources. For example:
(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)
In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.
Here, I want to briefly discuss a variety of current research proposals in the field, to explain why I think this problem is currently neglected.
I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.
Before getting into that, let me be very explicit about three points:
https://www.lesswrong.com/posts/28zsuPaJpKAGSX4zq/humans-are-very-reliable-agents
Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results?
Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. For MNIST, an easy handwriting recognition task, performance tops out at around 99.9% even for top models; it's not very practical to design for or measure higher reliability than that, because the test set is just 10,000 images and a handful are ambiguous. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models.
By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.99999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.)
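The arithmetic behind that comparison, as a quick back-of-the-envelope check using the figures quoted above:

```python
# Back-of-the-envelope check of the reliability comparison quoted above.
import math

miles_per_fatality = 100_000_000
decisions_per_mile = 100
human_failure_rate = 1 / (miles_per_fatality * decisions_per_mile)   # 1e-10
human_reliability = 1 - human_failure_rate

model_reliability = 0.99997            # Redwood's quoted figure
model_failure_rate = 1 - model_reliability

gap = math.log10(model_failure_rate / human_failure_rate)
print(f"human reliability: {human_reliability:.8%}")                 # 99.99999999%
print(f"gap: ~{gap:.1f} orders of magnitude")                        # ~5.5
```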
https://www.lesswrong.com/posts/2GxhAyn9aHqukap2S/looking-back-on-my-alignment-phd
The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four[1] years.
I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD.
I figured this point out by summer 2021.
I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.
But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area.
If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness.
But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.
https://www.lesswrong.com/posts/7iAABhWpcGeP5e6SB/it-s-probably-not-lithium
A Chemical Hunger (a), a series by the authors of the blog Slime Mold Time Mold (SMTM) that has been received positively on LessWrong, argues that the obesity epidemic is entirely caused (a) by environmental contaminants. The authors’ top suspect is lithium (a)[1], primarily because it is known to cause weight gain at the doses used to treat bipolar disorder.
After doing some research, however, I found that it is not plausible that lithium plays a major role in the obesity epidemic, and that a lot of the claims the SMTM authors make about the topic are misleading, flat-out wrong, or based on extremely cherry-picked evidence. I have the impression that reading what they have to say about this often leaves the reader with a worse model of reality than they started with, and I’ll explain why I have that impression in this post.
https://www.lesswrong.com/posts/bhLxWTkRc8GXunFcB/what-are-you-tracking-in-your-head
A large chunk - plausibly the majority - of real-world expertise seems to be in the form of illegible skills: skills/knowledge which are hard to transmit by direct explanation. They’re not necessarily things which a teacher would even notice enough to consider important - just background skills or knowledge which is so ingrained that it becomes invisible.
I’ve recently noticed a certain common type of illegible skill which I think might account for the majority of illegible-skill-value across a wide variety of domains.
Here are a few examples of the type of skill I have in mind:
I have been doing red team, blue team (offensive, defensive) computer security for a living since September 2000. The goal of this post is to compile a list of general principles I've learned during this time that are likely relevant to the field of AGI Alignment. If this is useful, I could continue with a broader or deeper exploration.
I used to use the phrase when teaching security mindset to software developers that "security doesn't happen by accident." A system that isn't explicitly designed with a security feature is not going to have that security feature. More specifically, a system that isn't designed to be robust against a certain failure mode is going to exhibit that failure mode.
This might seem rather obvious when stated explicitly, but it is not the way that most developers, indeed most humans, think. I see a lot of disturbing parallels whenever anyone argues that AGI won't necessarily be dangerous. An AGI that isn't intentionally designed not to exhibit a particular failure mode is going to have that failure mode. It is certainly possible to get lucky and not trigger it, and it will probably be impossible to enumerate every category of failure mode, but to have any chance at all we will have to plan in advance for as many failure modes as we can possibly conceive.
As a practical enforcement method, I used to ask development teams that every user story have at least three abuser stories to go with it. For any new capability, think at least hard enough about it that you can imagine at least three ways that someone could misuse it. Sometimes this means looking at boundary conditions ("what if someone orders 2^64+1 items?"), sometimes it means looking at forms of invalid input ("what if someone tries to pay -$100, can they get a refund?"), and sometimes it means being aware of particular forms of attack ("what if someone puts Javascript in their order details?").
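A minimal sketch of turning those three abuser stories into code-level checks (illustrative placeholder limits and names, not any particular codebase):

```python
# Illustrative checks corresponding to the three abuser stories above
# (placeholder limits and names, not a real order-processing system).
import html

MAX_QUANTITY = 10_000          # reject absurd orders long before 2**64 + 1

def validate_order(quantity: int, amount_cents: int, details: str) -> str:
    if not (1 <= quantity <= MAX_QUANTITY):
        raise ValueError("quantity out of range")        # boundary condition
    if amount_cents <= 0:
        raise ValueError("payment must be positive")     # no -$100 "refunds"
    return html.escape(details)                          # neutralize injected markup

print(validate_order(3, 1999, "<script>alert('hi')</script>"))
```

None of these checks appear by accident; each exists only because someone imagined the misuse first.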
by paulfchristiano, 20th Jun 2022.
(Partially in response to AGI Ruin: A list of Lethalities. Written in the same rambling style. Not exhaustive.)
Editor's note: The following is a lightly edited copy of a document written by Eliezer Yudkowsky in November 2017. Since this is a snapshot of Eliezer’s thinking at a specific time, we’ve sprinkled reminders throughout that this is from 2017.
A background note:
It’s often the case that people are slow to abandon obsolete playbooks in response to a novel challenge. And AGI is certainly a very novel challenge.
Italian general Luigi Cadorna offers a memorable historical example. In the Isonzo Offensive of World War I, Cadorna lost hundreds of thousands of men in futile frontal assaults against enemy trenches defended by barbed wire and machine guns. As morale plummeted and desertions became epidemic, Cadorna began executing his own soldiers en masse, in an attempt to cure the rest of their “cowardice.” The offensive continued for 2.5 years.
Cadorna made many mistakes, but foremost among them was his refusal to recognize that this war was fundamentally unlike those that had come before. Modern weaponry had forced a paradigm shift, and Cadorna’s instincts were not merely miscalibrated—they were systematically broken. No number of small, incremental updates within his obsolete framework would be sufficient to meet the new challenge.
Other examples of this type of mistake include the initial response of the record industry to iTunes and streaming; or, more seriously, the response of most Western governments to COVID-19.
https://www.lesswrong.com/posts/pL4WhsoPJwauRYkeK/moses-and-the-class-struggle
"𝕿𝖆𝖐𝖊 𝖔𝖋𝖋 𝖞𝖔𝖚𝖗 𝖘𝖆𝖓𝖉𝖆𝖑𝖘. 𝕱𝖔𝖗 𝖞𝖔𝖚 𝖘𝖙𝖆𝖓𝖉 𝖔𝖓 𝖍𝖔𝖑𝖞 𝖌𝖗𝖔𝖚𝖓𝖉," said the bush.
"No," said Moses.
"Why not?" said the bush.
"I am a Jew. If there's one thing I know about this universe it's that there's no such thing as God," said Moses.
"You don't need to be certain I exist. It's a trivial case of Pascal's Wager," said the bush.
"Who is Pascal?" said Moses.
"It makes sense if you are beyond time, as I am," said the bush.
"Mysterious answers are not answers," said Moses.
https://www.lesswrong.com/posts/T6kzsMDJyKwxLGe3r/benign-boundary-violations
Recently, my friend Eric asked me what sorts of things I wanted to have happen at my bachelor party.
I said (among other things) that I'd really enjoy some benign boundary violations.
Eric went ????
Subsequently: an essay.
We use the word "boundary" to mean at least two things, when we're discussing people's personal boundaries.
The first is their actual self-defined boundary—the line that they would draw, if they had perfect introspective access, which marks the transition point from "this is okay" to "this is no longer okay."
Different people have different boundaries:
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (If you're already familiar with all the basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.
Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities: