Interconnects

*/ ?>

Latest Episodes:

Thoughts on The Curve Oct 07, 2025

I spent the weekend debating AI timelines, among other things, at The Curve conference. This translates as spending the weekend thinking about the trajectory of AI progress with a mix of DC and SF types. This is a worthwhile event that served as a great, high-bandwidth way to check in on timelines and expectations of the AI industry.

Updating timelines

My most striking takeaway is that the AI 2027 sequence of events, from AI models automating research engineers to later automating AI research, and potentially a singularity if your reasoning is so inclined, is becoming a standard by which many debates on AI progress operate under and tinker with. It’s good that many people are taking the long term seriously, but there’s a risk in so many people assuming a certain sequence of events is a sure thing and only debating the timeframe by which they arrive.

I’ve documented my views on the near term of AI progress and not much has changed, but through repetition I’m developing a more refined version of the arguments. I add this depth to my takes in this post.

I think automating the “AI Research Engineer (RE)” is doable in the 3-7 year range — meaning the person that takes a research idea, implements it, and compares it against existing baselines is entirely an AI that the “scientists” will interface with.

In some areas the RE is arguably already automated. Within 2 years a lot of academic AI research engineering will be automated with the top end of tools — I’m not sure academics will have access to these top end of tools but that is a separate question. An example I would give is coming up with a new optimizer and testing it on a series of ML baselines from 100M to 10B parameters. At this time I don’t expect the models to be able to implement the newest problems the frontier labs are facing alone. I also expect academics to be fully priced out from these tools.

Within 1-3 years we’ll have tools that make existing REs unbelievably productive (80-90% automated), but there are still meaningful technical bottlenecks that are solvable but expensive. The compute increase per available user has a ceiling too. Labs will be spending $200k+ per year per employee on AI tools easily (ie the inference cost), but most consumers will be at tiers of $20k or less due to compute scarcity.

Within 3-4 years the augmented research engineers will be able to test any idea that the scientists come up with at the frontier labs, but many complex system problems will need some (maybe minimal) amount of human oversight. Examples would include modifying RL implementations for extremely long horizon tasks or wacky new ideas on continual learning. This is so far out that the type of research idea almost isn’t worth speculating on.

These long timelines are strongly based on the fact that the category of research engineering is too broad. Some parts of the RE job will be fully automated next year, and more the next. To check the box of automation the entire role needs to be replaced. What is more likely over the next few years, each engineer is doing way more work and the job description evolves substantially. I make this callout on full automation because it is required for the distribution of outcomes that look like a singularity due to the need to remove the human bottleneck for an ever accelerating pace of progress. This is a point to reinforce that I am currently confident in a singularity not happening.

Up-skilling employees as their roles become irrelevant creates a very different dynamic. The sustained progress on code performance over the next few years will create a constant feeling of change across the technology industry. The range of performance in software is very high and it is possible to perceive relatively small incremental improvements.

These are very complex positions to hold, so they’re not that useful as rhetorical devices. Code is on track to being solved, but the compute limits and ever increasing complexity of codebases and projects (ie. LLMs) is going to make the dynamic very different than the succinct assumptions of AI 2027.

To reiterate, the most important part of automation in the discussion is often neglected. To automate someone you need to outcompete the pairing of a human with the tool too.

Onto the even trickier argument in the AI 2027 standard — automating AI research altogether. At the same time as the first examples of AI systems writing accepted papers at notable AI venues, I’m going to be here arguing that full automation of AI research isn’t coming anytime soon. It’s daunting to try and hold (and explain) this position, and it relies on all the messy firsthand knowledge of science that I have and how it is different in academia versus frontier AI labs.

For one, the level and type of execution at frontier labs relative to academic research is extremely different. Academia also has a dramatically higher variance in quality of work that is accepted within the community. For this reason, we’re going to be seeing incredible disruption at standard academic venues in the very near future, but the nature of science at frontier labs will remain heavily intertwined with human personalities.

Models will be good at some types of science, such as taking two existing fields and merging ideas and seeing what happens, but awful at what I consider to be the most idolized version of science, being immersed in the state of the art and having a brilliant insight that makes anywhere from a ripple causing small performance gain to a tsunami reshaping the field.

I don’t think AI will fully automate our current notion of an AI researcher in the next 5-10 years, but it could reshape what science means altogether and make that role far less relevant to progress.

The researchers grinding out new datasets at frontier labs will have dramatic help on data processing scripts. The researchers coming up with new algorithmic ideas will not expand the rate at which they come up with ideas too much, but their ability to test them is far higher.

A large part of science is a social marketplace of ideas. Convincing your colleagues that you are right and to help you double down on it is not going to change in its core nature. Everyone will have superpowers on making evidence to support their claims, but the relative power there stays the same.

At a dinner during The Curve I went through a lot of these points with Ryan Greenblatt, Chief Scientist at Redwood Research, and a point he made stuck with me. He summarized my points as thinking the increase in performance from these largely engineering tooling improvements will be equalled out by challenges of scaling compute, so the resulting progress will feel much more linear rather than exponential. A lot of our discussions on automation we agree on, with slightly different timelines, but it didn’t feel like it captured my entire point of view.

What is missing is that I expect an inherent slowdown as our AI models get more complicated. Our models today needs tools, more complex serving systems, products to wrap them, and so on. This is very different than the age when just model weights were needed for the cutting edge of AI. There’s an inevitable curse of complexity, a death by a thousand cuts, that is going to add on top of the obvious compute costs to slow down progress.

2026 will be a big year on the compute rollout front, and shipping meaningful improvements to users will be essential to funding the progress that comes after. I’m not sure the economy can keep shifting even more of its weight behind AI progress, where most people bought into fast timelines think of it as a default position. Peter Wildeford wrote a summary of the situation that I resonate with:

Here’s how I think the AI buildout will go down.

Currently the world doesn’t have any operational 1GW+ data centers. However, it is very likely we will see fully operational 1GW data centers before mid-2026. This likely will be a part of 45-60GW of total compute across Meta, Microsoft, Amazon/AWS/Anthropic, OpenAI/Oracle, Google/DeepMind, and xAI.

My median expectation is these largest ~1GW data center facilities will hold ~400,000-500,000 Nvidia Blackwell chips and be used to train ~4e27 FLOP model sometime before the end of 2027. Such a model would be 10x larger than the largest model today and 100x larger than GPT-4. Each individual 1GW facility would cost ~$40B to manufacture, with ~$350B total industry spend across 2026.

He continues with estimates for 2028, and saying he’s fuzzy on 2029, but my fuzziness cuts in a bit earlier depending on adoption and performance across the AI industry.

Where I feel like in the long run it’ll look like a very consistent pace of progress, that feels like a bunch of big jumps and periods of stagnation in the short-term. I have fairly large error bars on how the price of intelligence — and therefore adoption — is going to evolve over the next 2-4 years, with it obviously becoming far cheaper over the following decades.

As for my recent articles on timelines and key debates in the field, I encourage people to comment and dig in on what I wrote below.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Other thoughts

Something crazy about this conference is no one is talking about how the models actually work or are trained, and everyone here is totally convinced that AGI is coming soon.

One of my new friends at the conference described this tendency as “an obsession with the problem.” This is a feeling that many AI obsessors are more interested in where the technology is going rather than how or what exactly it is going to be. Helen Toner gave a great talk at The Curve related to this, arguing how the current and future jaggedness of AI — the fact that similarly difficult tasks when assigned to a human will either be easily mastered by AI or barely showing any competence (her will appear later on her great Substack). It is the idea that AI capabilities evolve highly randomly across potentially similar tasks.

This original figure on jaggedness comes from work with the popular AI Substacker Ethan Mollick.

The relation of Helen’s talk is that she gets many forms of arguments that only the endpoint of AI matters, but that doesn’t account for the messiness of the trajectory and how unsettling that could be for the world.

I agree with Helen.

One of the things that I am confident will exist in about two years is a Sora 2 style model that can run on a MacBook without copyright, personal opt-in, or other safety filters. On this, Epoch AI has a wonderful plot showing that local models lag behind in capabilities by a fixed amount of time:

With trends like this, it is so obvious that we need to stay on the front foot of open models and not reacting to international parties that are far harder to predict and engage with. This is where I get renewed motivation for American DeepSeek / The ATOM Project. For example, I still get many adamant questions that we should consider banning open models altogether. The state of discourse, study, investment, and everything in between on open models in the U.S. is still in a quite underdeveloped state.

China’s rise in open models was something I expected to be a bigger topic at the conference, but it seemed like it was too orthogonal to the overall pace of progress to be front of mind. There were many discussions of the Chinese chip ecosystem, but less on what it enables. Not focusing on this could have costly geopolitical consequences as we cede ownership of a global standard to China. This was a large theme of my talk. The recording will be posted here soon and the slides for my talk are here (credit for Florian Brand who helps me with open model analysis here for feedback on the slides). Otherwise:

* These messages are very important and I will work to spend a bit more time engaging with the communities they touch and mastering this type of talk (and analysis)

* More people should work in the area, it’s crazy it has just fallen on me where it is my side hustle.

For now, I’m just landing at the conference on language modeling (COLM) in Montreal, so I may have some technical hot takes to share later this week!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

ChatGPT: The Agentic App Sep 30, 2025

Ever since ChatGPT exploded in popularity, there has been a looming “how” to its monetization plans. Much has been said about shopping and advertising as the likely paths, especially with Fidji Simo joining as CEO of Applications under Sam Altman.

Advertising as a business model for AI is logical but difficult to personalize and specialize. We know tons of people spend a lot of time using AI models, but how do you best get the sponsored content into the outputs? This is an open technical problem, with early efforts from the likes of Perplexity falling short.

Shopping is another, but the questions have long been whether AI models actually have the precision to find the items you want, to learn exactly what you love, and to navigate the web to handle all the corner cases of checkouts. These reflect a need for increased capabilities on known AI benchmarks, rather than inventing a new way of serving ads. OpenAI’s o3 model was a major step up in search functionality, showing it was viable; the integration was either a business problem — where OpenAI had to make deals — or an AI one — where ChatGPT wasn’t good enough at managing websites for you.

Yesterday, ChatGPT launched its first integrated shopping push with Buy It in ChatGPT, a simple checkout experience, and an integrated commerce backend built on the Agentic Commerce Protocol (ACP). The announcement comes with the perfect partners to complement the strengths of OpenAI’s current models. GPT-5-Thinking is the best at finding niche content on the web, and ChatGPT’s launch partner for shopping is Shopify (*soon, Etsy is available today), the home to the long tail of e-commerce merchants of niche specialties. If this works, it will let users actively uncover exactly what they are looking for — from places that were often hard to impossible to find on Google.

This synergy is a theme we’ll see reoccur in other agents of the future. The perfect model doesn’t make a useful application unless it has the information or sandbox it needs to think, search, and act. The crucial piece that is changing is that where models act is just as important as the weights themselves — in the case of shopping, it is the network of stores with their own rankings and API.

The ACP was built in collaboration with Stripe, and both companies stand to benefit from this. Stripe wants more companies to build on the ACP so that its tools become the “open standard for agentic payments” and OpenAI wants the long-tail of stores to adopt it so they can add them to their ever-growing internal recommendation (or search) engine. The business model is simple, as OpenAI says “Merchants pay a small fee on completed purchases.” OpenAI likely takes a larger share than Stripe, and it is a share that can grow as their leverage increases over shoppers.

I’m cautiously optimistic about this. Finding great stuff to buy on the web is as hard as it has ever been. Users are faced with the gamification of Google search for shopping and the enshittification of the physical goods crowding out Amazon. Many of the best items to buy are found through services like Meta’s targeted ads, but the cost of getting what you want should not be borne through forced distraction.

OpenAI will not be immune to the forces that drove these companies to imperfect offerings, but they’ll come at them with a fresh perspective on recurring issues in technology. If this works for OpenAI, they have no competitor. They have a distribution network of nearly 1B weekly users and no peer company ready to serve agentic models at this scale. Yes, Google can change its search feed, but the thoroughness of models like GPT-5 Thinking is on a totally different level than Google search. This agentic model is set up to make ChatGPT the one Agentic App across all domains.

The idea of an agentic model, and really the GPT-5 router itself, shows us how the grand idea of one giant model that’s the best for every conceivable use-case is crumbling. OpenAI only chooses the more expensive thinking model when it deems a free user to need it and they have an entirely different model for their coding products. On the other hand, Claude released their latest model, Claude 4.5 Sonnet, yesterday as well, optimizing their coding peak performance and speed yet again — they have no extended model family.

The reality that different models serve very different use-cases and how AI companies need to decide and commit to a certain subset of them for their development points to a future with a variety of model providers.

Where coding is where you can feel the frontier of AI’s raw intelligence or capabilities, and Anthropic has turned their entire development towards it, the type of model that is needed for monetization of a general consumer market could be very different. This is the web-agent that OpenAI has had the industry-leading version of for about 6 months.

Specialization is making the AI market far more interesting, as companies like OpenAI and Google have been in lockstep with their offerings for years. Every company would drop the same model modalities with approximately the same capabilities. Now, as hill-climbing benchmarks are no longer providing immediate user value, especially in text domains, the vision for each AI company is more nuanced. I predicted this earlier in the summer, in my post on what comes next:

What I missed is that this applies downward pressure on the number of models labs will release — the value can be more in the integrations and applications than the model itself. Expect releases like today, where Claude released Claude Sonnet 4.5 along with version 2 of Claude Code. The period will still be busy as the industry is on the tail end of the low hanging fruit provided by reasoning models, but over time the hype of model releases themselves will be harder to conjure.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Let’s consider the applications that are rolling out today on top of different models. If you haven’t pushed the limits of GPT-5-Thinking, and better yet GPT-5-Pro, for search you really need to, it’s a transformative way of using compute that can find many buried corners of the web. In terms of untapped model capability value, the abilities of search-heavy thinking models like GPT-5 seem far higher than coding agents, which are obviously heavily used. Search-heavy models are an entirely new use, where coding models were the first widespread LLM-based product. As coding agents become more autonomous, they’ll continue to flex and mold a new form for the software industry, but this will be a slow co-evolution.

OpenAI is going to focus on its vertical Agentic App where Anthropic (and likely Gemini with Google Cloud) are going to power the long-tail of AI applications reshaping the web and the rest of work. OpenAI will only expand from here. Email, scheduling, travel bookings, and more everyday digital tasks are surely on their roadmap. Their biggest competitor is themselves — and whether their vision can be crafted into something people actually use. If shopping doesn’t work out as the vertical that lets them realize their valuation, they’re positioned to keep trying more. OpenAI has both the lead in the variety of models that power these agentic information tasks and the user base to incentivize companies to collaborate with them.

The application paradigm that dominated the mobile era is going to rebound. AI applications started in a form where the user needed to be heavily involved in the work process. The first beneficiaries of this were IDEs and terminal tools. Both of these workplaces allow in-depth and detailed inspection of the process and results. The cutting edge of AI will still work there, but the long tail of casual use will all shift to the standard mode of applications — siloed, simple, and scalable in the cloud. The simpler an AI application is, the wider its potential audience.

With this addition of shopping, OpenAI is poised to launch a standalone TikTok-style app with the release of its next video generation model, Sora 2, soon after Meta launched Vibes in their Meta AI app for only AI generated videos with a specific theme to start. At the same time, OpenAI’s Codex web agent is available in the ChatGPT application, which represents an even bigger change in the nature of software work than the addition of coding agents — it allows real websites, and soon businesses, to be built with only a prompt on your phone.

In 6-12 months, these agentic applications that feel rough around the edges due to the quality of the AI today, rather than the interface, are going to feel seamless and second-nature to use, despite their complete novelty relative to the past decades of technology.

If OpenAI is positioning itself to be The Agentic App, this also opens the door to the near future where many applications we use today shift to an agentic era. Want to schedule a meeting with someone? Let the Google Calendar agent handle that (or some startup that beats them to it). Your email application can find who the next client is and remind them of their appointment. The Banking App will file your taxes in one prompt. The list of these is infinite and across a wide spectrum of difficulty. OpenAI wants to be the one app, The Agentic App, that serves all of these, and the rest of the industry is racing to master their specific vertical before OpenAI gets there.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Thinking, Searching, and Acting Sep 22, 2025

The weaknesses of today’s best models are far from those of the original ChatGPT — we see they lack speed, we fear superhuman persuasion, and we aspire for our models to be more autonomous. These models are all reasoning models that have long surpassed the original weaknesses of ChatGPT-era language models, hallucinations, total lack of recent information, complete capitulations, and other hiccups that looked like minor forms of delusion laid on top of an obviously spectacular new technology.

Reasoning models today are far more complex than the original chatbots that consisted of standalone model weights (and other lightweight scaffolding such as safety filters). They're built on three primitives that'll be around for years to come:

* Thinking: The reasoning traces that enabled inference-time scaling. The "thoughts" of a reasoning model take a very different form than those of humans that inspired the terminology used like Chain of Thought (CoT) or Thinking models.

* Searching: The ability to request more, specific information from non-parametric knowledge stores designed specifically for the model. This fills the void set by how model weights are static but living in a dynamic world.

* Acting: The ability for models to manipulate the physical or digital world. Everything from code-execution now to real robotics in the future allow language models to contact reality and overcome their nondeterministic core. Most of these executable environments are going to build on top of infrastructure for coding agents.

These reasoning language models, as a form of technology are going to last far longer than the static model weights that predated and birthed ChatGPT. Sitting just over a year out from the release of OpenAI's o1-preview on September 12, 2024, the magnitude of this is important to write in ink. Early reasoning models with astounding evaluation scores were greeted with resounding criticism of “they won’t generalize,” but that has turned out to be resoundingly false.

In fact, with OpenAI's o3, it only took 3-6 months for these primitives to converge! Still, it took the AI industry more broadly a longer time to converge on this. The most similar follow-up on the search front was xAI's Grok 4 and some frontier models such as Claude 4 express their reasoning model nature in a far more nuanced manner. OpenAI's o3 (and GPT-5 Thinking, a.k.a. Research Goblin) and xAI's Grok 4 models seem like a dog determined to chase their goal indefinitely and burn substantial compute along the way. Claude 4 has a much softer touch, resulting in a model that is a bit less adept at search, but almost always returns a faster answer. The long-reasoning traces and tool use can be crafted to fit different profiles, giving us a spectrum of reasoning models.

The taxonomy that I laid out this summer for next-generation reasoning models — skills for reasoning intelligence, calibration to not overthink, strategy to choose the right solutions, and abstraction to break them down — are the traits that'll make a model most functional given this new perspective and agentic world.

The manner of these changes are easy to miss. For one, consider hallucinations, which are an obvious weakness downstream of the stochastic inference innate to the models and their fixed date cutoff. With search, hallucinations are now missing context rather than blatantly incorrect content. Language models are nearly-perfect at copying content and similarly solid at referencing it, but they're still very flawed at long-context understanding. Hallucinations still matter, but it’s a very different chapter of the story and will be studied differently depending on if it is for reasoning or non-reasoning language models.

Non-reasoning models still have a crucial part to play in the AI economy due to their efficiency and simplicity. They are part of a reasoning model in a way because you can always use the weights without tools and they'll be used extensively to undergird the digital economy. At the same time, the frontier AI models (and systems) of the coming years will all be reasoning models as presented above — thinking, searching, and acting.

Language models will get access to more tools of some form, but all of them will be subsets of code or search. In fact, search can be argued to be a form of execution itself, but given the imperative of the underlying information it is best left as its own category.

Another popular discussion with the extremely-long generations of reasoning models has been the idea that maybe more efficient architectures such as diffusion language models could come to dominate by generating all the tokens in parallel. The (or rather, one) problem here is that they cannot easily integrate tools, such as search or execution, in the same way. These’ll also likely be valuable options in the AI quiver, but barring a true architectural or algorithmic revolution that multiplies the performance of today’s AI models, the efficiency and co-design underway for large transformers will enable the most dynamic reasoning models.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

With establishing what makes a reasoning model complete comes an important mental transition in what it takes to make a good model. Now, the quality of the tools that a model is embedded with is arguably something that can be more straightforward to improve than the model — it just takes substantial engineering effort — and is far harder with open models. The AI “modeling” itself is mostly open-ended research.

Closed models have the benefit of controlling the entire user experience with the stack, where open models need to be designed so that anyone can take the weights off of HuggingFace and easily get a great experience deploying it with open-source libraries like VLLM or SGLang. When it comes to tools used during inference, this means that the models can have a recommended setting that works best, but they may take time to support meaningful generalization with respect to new tools.

For example, OpenAI can train and serve their models with only one search engine, where I at Ai2 will likely train with one search engine and then release the model into a competitive space of many search products. A space where this can benefit open models could be something like MCP, where open models are developed innately for a world where we cannot know all the uses of our models, making something like MCP libraries a great candidate for testing. Of course, leading AI laboratories will (or have already started) do this, but the ranking will be different in a priority stack.

Much has been said about tokenomics and costs associated with reasoning models, without taking the tool component into account. There was a very popular article articulating how models are only getting more expensive, with a particular focus on reasoning models using far more tokens. This is overstating a blip, a point in time when serving costs increased by 1000x for models by generating vastly more tokens, but without improved hardware.

The change in cost of reasoning models reflected a one-time step up in most circumstances where the field collectively turned on inference-time scaling by using the same reasoning techniques. At the same time as the reasoning model explosion, the size of models reaching users in parameter count has all but stagnated. This is due to diminishing returns in quality due to scaling parameters — it’s why OpenAI said GPT 4.5 wasn’t a frontier model and why Gemini never released their Ultra model class. The same will come for reasoning tokens.

While diminishing returns are hitting reasoning token amount for serial streams, we’re finally seeing large clusters of Nvidia’s Blackwell GPUs come online. The costs for models seem well on path to level out and then decrease as the industry develops more efficient inference systems — the technology industry is phenomenal at making widely used products far cheaper year over year. The costs that’ll go up are the agents that are enabled by these reasoning models, especially with parallel inference, such as the Claude Code clones or OpenAI’s rumored Pro products.

What we all need is a SemiAnalysis article explaining how distorted standard tokenomics are for inference with tools and if tools substantially increase variance in implementations. People focus too much on the higher token costs from big models with long context lengths, those are easy to fix with better GPUs, while there are many other costs such as search indices or idle GPU time waiting for tool execution results.

When we look at a modern reasoning model, it is easy to fixate on the thinking token aspects that give the models their name. At the same time, search and execution are such fundamental primitives to modern language models that they can rightfully stand on their own as pillars of modern AI. These are AI systems that substantially depend on the quality of the complex inference stack far more than getting the right YOLO run for the world’s best model weights.

The cause of thinking, searching, and acting all being looped in as a “reasoning model” is that this inference-time scaling with meandering chains of thought was the technological innovation that made both search and execution far more functional. Reasoning was the step change event that set these three as technology standards.

The industry is in its early days of building out fundamental infrastructure to enable them, which manifests as the early days of language model agents. The infrastructure pairs deterministic computing and search with the beauty, power, and flexibility of the probabilistic models we fell in love with via ChatGPT. This reasoning model layer is shaping up to be the infrastructure that underpins the greatest successes of the future technology industry.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Coding as the epicenter of AI progress and the path to general agents Sep 18, 2025

Coding, due to its breadth of use-cases, is arguably the last tractable, general domain of continued progress for frontier models that most people can interface with. This is a bold claim, so let’s consider some of the other crucial capabilities covered in the discourse of frontier models:

* Chat and the quality of prose written by models has leveled off, other than finetuning to user measures such as sycophancy.

* Mathematics has incredible results, but very few people directly gain from better theoretical mathematics.

* The AIs’ abilities to do novel science are too unproven to be arguable as a target of hillclimbing.

Still, coding is a domain where the models are already incredibly useful, and they continue to consistently stack on meaningful improvements. Working daily with AI over the last few years across side projects and as an AI researcher, it has been easy to take these coding abilities for granted because some forms of them have been around for so long. We punt a bug into ChatGPT and it can solve it or autocomplete can tab our way through entire boilerplate.

These use-cases sound benign, and haven’t changed much in that description as they have climbed dramatically in capabilities. Punting a niche problem in 1000+ lines of code to GPT-5-Pro or Gemini Deep Think feels like a very fair strategy. They really can sometimes solve problems that a teammate or I were stuck on for hours to days. We’re progressing through this summarized list of capabilities:

* Function completion: ~2021, original Github CoPilot (Codex)

* Scripting: ~2022, ChatGPT

* Building small projects: ~2025, CLI agents

* Building complex production codebases, ~2027 (estimate, which will vary by the codebase)

Coding is maybe the only domain of AI use where I’ve felt this slow, gradual improvement. Chat quality has been “good enough” since GPT-4, search showed up and has been remarkable since OpenAI’s o3. Through all of these more exciting moments, AIs’ coding abilities have just continued to gradually improve.

Now, many of us are starting to learn a new way of working with AI through these new command-line code agents. This is the largest increase in AI coding abilities in the last few years. The problem is the increase isn’t in the same domain where most people are used to working with AI, so the adoption of the progress is far slower. New applications are rapidly building users and existing distribution networks barely apply.

The best way to work with them — and I’ll share more examples of what I’ve already built later in this post — is to construct mini projects, whether it’s a new bespoke website or a script. These are fantastic tools for entrepreneurs and researchers who need a way to quickly flesh out an idea. Things that would’ve taken me days to weeks can now be attempted in hours. Within this, the amount of real “looking at the code” that needs to be done is definitely going down. Coding, as an activity done through agents, is having the barriers to entry fully fall down through the same form factor that is giving the act of coding re-found joy.

Why I think a lot of people miss these agents is that the way to use the agents is so different from the marketing of incredible evaluation breakthroughs that the models are reaching. The gap between “superhuman coding” announcements and using an agent for mini projects is obviously big. The best way to use the agents is still mundane and requires careful scoping of context.

For example, yesterday, on September 17, 2025, OpenAI announced that GPT-5 as part of a model system got a higher score than any human (and Google’s Gemini Deep Think) at the ICPC World Finals, “the premier collegiate programming competition where top university teams from around the world solve complex algorithmic problems.” Here’s what an OpenAI researcher said they did:

We competed with an ensemble of general-purpose reasoning models; we did not train any model specifically for the ICPC. We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.

These competitions often get highlighted because they’re “finite time,” so the system must respond in the same fixed time as a human does, but the amount of compute used by GPT-5 or another model here is likely far higher than any user has access to. This is mostly an indication that further ability, which some people call raw intelligence, can be extracted from the models, but most of that is limited by scaffolding and product when used by the general population.

The real story is that these models are delivering increasing value to a growing pool of people.

For followers of AI, coding with AI models is the easiest way to feel progress. Now that models are so good at chat, it takes very specialized tasks to test the general knowledge of models, or many of the gains are in getting the right answer faster than GPT-5-Thinking’s meandering path.

I’m not an expert software engineer and the huge differences between models, and improvements that the individual models and systems are making, have been incredibly obvious.

I’ve said many times how Claude Code (or now Codex) are far better than Cursor Agent, which is in turn far better than Github CoPilot. GitHub CoPilot feels borderline drunk at the wheel. Cursor often feels a little distracted while still being smart, but Claude Code and Codex seem on topic and able to test the best of a model’s intelligence on the problem at hand. Yes, even the best agents often aren’t good enough in complex codebases, but it removes the need to go back and forth countless times in a chat window to see if a model can reach the end of the puzzle for you. These CLI agents can run tests, fix git problems, run local tools, whatever. The scope is constantly growing.

For the nuanced take of Claude Code vs Codex CLI right now, the answer is expensive. The best has been Claude Code forcing Claude Opus 4.1, but Codex is not far behind and comes in at a much cheaper entry point ($20/month) — Opus requires a $100+/month plan. Codex also has nice features like web search, but it hasn’t been a major differentiator yet in my use.

The new workflow is to switch to the other agent when one cannot solve the current problem, and let it see the repository with fresh eyes, much like you pasted a question to another chatbot. The agents are just one tab away, just like the competitors for chat.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

In the comparison of Claude, Cursor, and CoPilot above, the crucial component is that all of these agents can be tested with the same Claude 4 Sonnet model. The gaps are just as wide as I stated, highlighting how so many of the gains in coding agents are just in product implementations. A second version is slightly embarrassing for me, but follows as I hadn’t updated my OpenAI Codex code when trying the new GPT-5-Codex model, which resulted in an immediate massive jump in performance by changing it. It’s a new phenomenon to have a domain at the cutting edge of AI abilities where the software scaffolding of a model is felt so strongly. Product and prompts matter more than ever and this sensation will expand to more domains.

The why of these performance differences — even when using the same model — is worth dwelling on. It’s unlikely that the Claude team is that much better at general software engineering and product design — rather, Anthropic has extensive in-house experience in extracting the most from models. The current shift in models has been about how to take a set of models that are designed for question answering and other single-stream text tasks and break down problems. In my taxonomy on next-generation reasoning models, I called this ability “abstraction.”

The need to just slightly shift the model to this task explains OpenAI’s recent specialized model for this, GPT-5-Codex. GPT-5 was primarily a release about balancing OpenAI’s books with a user base approaching 1B active users in the chat format. GPT-5 is a honed tool for a different job. The evaluation scores are slightly better than the general reasoning model for this new GPT-5-Codex, but the main gains are in how behavior is different in coding tasks.

GPT‑5-Codex adapts how much time it spends thinking more dynamically based on the complexity of the task. The model combines two essential skills for a coding agent: pairing with developers in interactive sessions, and persistent, independent execution on longer tasks. That means Codex will feel snappier on small, well-defined requests or while you are chatting with it, and will work for longer on complex tasks like big refactors. During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation.

And they included this somewhat confusing plot to showcase this dynamic. I’ve certainly felt these changes when I updated the Codex software and the Codex model.

This represents another key problem I presented in my taxonomy — calibration, i.e. not overthinking.

Having specialized models and specialized products for a use case could make people think that they’re narrowing in to make progress, but in OpenAI’s case it is rather that their hands are tied financially to support the main ChatGPT application. Claude has already fully committed to code. This is due to the size that the space could expand into.

These “coding” agents are definitely going to be seen as doing far more than writing code. Yes, their primary ability is going to be writing the code itself and executing it, but what that enables is an entirely new way of working with your computer.

In my post Contra Dwarkesh on Continual Learning, I presented a view where agents are going to be given all your digital working context in order to be a research or editorial assistant available 24/7. I’ve begun putting this to use for Interconnects, where I give the agents all of my articles, metadata, interviews, and details, so I can ask them for relevant references and context for future posts. This is very underbaked and early as a project for searching efficiently over my 400K tokens of writing, but I was prompting it a few times to see any interesting references for this post, and it got me something that was useful!

This quote from my Ross Taylor interview was spot on for the vibes of using coding agents in July:

My main worry with Claude Code is that... people confuse agents making you more productive versus preventing you from exerting mental effort. So sometimes I’ll have a day with Claude Code where I feel like I use very little mental effort—and it feels amazing—but I’m pretty sure I’ve done less work... Where it becomes really bad is when the file size becomes too long. Then the agent tends to struggle and get into these weird line search doom loops.

This sentiment is still definitely true for production codebases that are extremely complex, but the doom loop likelihood is dropping in my tests. At the same time, the joy and mental ease still applies.

Some examples of what I’ve built with a mix of Claude Code or OpenAI’s Codex CLI recently include:

* A raw HTML site for my RLHF book for comparing the responses of SFT vs. RLHF trained models from the same lineage (and improvements to RLHF book itself).

* Making a repository with all of the posts and content from Interconnects so I can use coding agents as editorial assistants while writing.

* Improvements to the ATOM Project website.

* Stripping my personal website out of Webflow’s systems (which was a mistake to sign up for during graduate school), including CMS entries and other detailed pages.

* Other small scripts and tools in my day job training models.

It’s not just me building extensively with these. There are multiple open-source projects committed to tracking the public contributions of these models — two are PRArena and Agents in the Wild.

PRArena’s dashboard shows over a million PRs getting merged from the Codex web agent, dwarfing many of the competitors. This is the power that OpenAI can wield with distribution, even if the web app version of Codex is far from the zeitgeist that is CLI agents today.

This comes with a notable asterisk in methodology that can explain many of the gaps in similar dashboards:

Some agents like Codex iterate privately and create ready PRs directly, resulting in very few drafts but high merge rates. Others like Copilot and Codegen create draft PRs first, encouraging public iteration before marking them ready for review.

The statistics below focus on Ready PRs only to fairly compare agents across different workflows, measuring each agent's ability to produce mergeable code regardless of whether they iterate publicly (with drafts) or privately.

The other dashboard, Agents in the Wild, shows that OpenAI’s coding agent is only one order of magnitude behind humans and other automations in PRs merged.

Putting this in perspective relative to Gemini or Claude:

The context with this is that Claude Code is far more downloaded than OpenAI’s CLI agent Codex, but it doesn’t name its PRs the same clever way by default with the agent name in the branch. Claude Code has over 20X the downloads of Codex in the last week on NPM.

Despite the challenges of measurement, it’s clear that coding agents are taking off.

The Codex PRs above actually represent the web agent, which has the default branch name behavior, not the CLI agent. This shows the might of OpenAI’s distribution, and it is impressive how many of the PRs are actually merged (over 80%), when thousands of people are trying a new tool for the first time.

The primary difference between the web agent and the CLI agent is a reduction in interactivity. The CLI agents propose a plan and ask for feedback, or let you monitor and interrupt. Codex on the web wraps a similar behavior as the CLI agents in one system that runs all the way until it can open a PR.

Over time coding is only going to get more asynchronous and OpenAI is poised to capture this transition if it happens soon. Based on all the above evidence of coding models getting more capable, the move to this new UX for software will happen faster than people expect. The transition to fully autonomous coding will happen soon for types of work where coding models already work near flawlessly — scripts, websites, data analysis, etc. Later, complex production codebases will work best at lower levels of the stack — IDEs, CLI agents, and other things that are both interactive and best for absorbing content.

Within a few years, the two trends will converge where autonomous agents are functional and the most complex codebases can be improved with AI. Then everything can return to the chatbot window — you only need to open your IDE when you want to understand what’s going on. For most people, not having to look at the code will be a welcome change.

Progress in coding feels slower than the “emergent” abilities between model generations past, which makes it easier to keep track of. This is due to how big the range in behaviors that encompass “coding” is, but results in a fantastic area for learning how AI models evolve and iterate. This playbook will be used many times over by frontier labs in the coming years as AI models are taught to solve more challenging tasks.

There’s a quiet revolution happening, and in order to truly understand it, you need to partake. Go build something.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

On China's open source AI trajectory Sep 09, 2025

Hello everyone! I’m coming back online after two weeks of vacation. Thankfully it coincided with some of the slowest weeks of the year in the AI space. I’m excited to get back to writing and (soon) share projects that’ll wrap up in the last months of the year.

It seemed like a good time to remind people of the full set of housekeeping for Interconnects.

* Many people love the audio version of the essays (read by me, not AI). You can get them in your podcast player here. Paid subscribers can add private podcast feeds under “manage your subscription” where voiceover is available for paywalled posts.

* The Interconnects Discord for paid subscribers continues to get better, and is potentially the leading paid perk amid the fragmentation of Twitter etc.

* We’re going to be rolling out more perks for group subscriptions and experimental products this fall. Stay tuned, or get in touch if group discounts are super exciting for your company.

For the time being, I’m planning trips and meetups across a few conferences in October. I’ll be speaking at The Curve (Oct. 3-5, Berkeley), COLM (Oct. 7-10, Montreal, interest form), and the PyTorch Conference (Oct. 21-24, SF) on open models, Olmo, and the ATOM Project, so stay tuned for meetups and community opportunities. On to the post!

China is maneuvering to double down on its open AI ecosystem. Depending on how the U.S. and its allies change culture and mobilize investment, this could make the dominance of Chinese AI models this summer, from Qwen, Kimi, Z.ai, and DeepSeek, looks like foreshadowing rather than the maximum gap in open models between the U.S. and China.

Until the DeepSeek moment, AI was likely a fringe issue to the PRC Government. The central government will set guidelines, rules, budgets, and focus areas that will be distributed and enforced across the decentralized government power structures. AI wasn’t a political focus and the strategy of open-source was likely set by companies looking to close the gap with leading American competitors and achieve maximum market share in the minimum time. I hear all the time that most companies in the U.S. want to start with open models for IT and philosophical reasons, even when spinning up access to a new API model is almost effortless, and it’s likely this bias could be even higher internationally where spending on technology services is historically lower.

Most American startups are starting with Chinese models. I’ve been saying this for a while, but a more official reference for this comes from a recent quote from an a16z partner, Martin Casado, another vocal advocate of investment in open models in America. He was quoted in The Economist with regards to his venture portfolio companies:

“I’d say 80% chance [they are] using a Chinese open-source model.”

The crucial question for the next few years in the geopolitical evolution of AI is whether China will double down on this open-source strategy or change course. The difficulty with monitoring this position is that it could look like nothing is happening and China maintains its outputs, even when the processes for creating them are far different. Holding a position is still a decision.

It’s feasible in the next decade that AI applications and open models are approached with the same vigor that China built public infrastructure over the last few decades (Yes, I’m reading Dan Wang’s new book Breakneck). It could become a new area that local officials compete in to prove their worth to the nation — I’m not sure even true China experts could make confident predictions here. A large source of uncertainty is whether the sort of top-down, PRC edicts can result in effective AI models and digital systems, where government officials succeeded in the past with physical infrastructure.

At the same time as obvious pro-AI messaging, Chinese officials have warned of “disorderly competition” in the AI space, which is an indirect signal that could keep model providers releasing their models openly. Open models reduce duplicative costs of training, help the entire ecosystem monitor best practices, and force business models that aren’t reliant on simple race-to-the-bottom inference markets. Open model submarkets are emerging for every corner of the AI ecosystem, such as video generation or robotic action models, (see our coverage of open models, Artifacts Logs) with a dramatic evolution from research ideas to mature, stable models in the last 12-18 months.

China improving the open model ecosystem looks like the forced adoption of Chinese AI chips, further specialization of companies’ open models to evolving niches, and expanded influence on fundamental AI research shared internationally. All of these directions have early signs of occurring.

If the PRC Government wanted to exert certain types of control on the AI ecosystem — they could. This Doug Guthrie excerpt from Apple in China tells the story from the perspective of international companies. Guthrie was a major player in advising on culture changes in Cupertino to better adapt Apple’s strategy to the Chinese market.

“When you stake your life, your identity, on and around certain ideas, you sort of fight for them,” Guthrie says. “Xi Jinping kind of broke my heart… I was sitting there, in China, in my dream job, and I’m watching Xinjiang’s internment camps. I’m watching China tearing up a fifty-year agreement over Hong Kong.”

Apple, meanwhile, had become too intertwined with China. Guthrie had been hired to help understand the country and to navigate it. And Apple had followed through—very successfully. But it had burned so many boats, as the saying goes, that Guthrie felt its fate was married to China’s and there was no way out. “The cost of doing business in China today is a high one, and it is paid by any and every company that comes looking to tap into its markets or leverage its workforce,” he later wrote in a blog. “Quite simply, you don’t get to do business in China today without doing exactly what the Chinese government wants you to do. Period. No one is immune. No one.”

China famously cracked down on its largest technology companies in late 2020, stripping key figures of power and dramatic amounts of market value off the books. AI is not immune to this.

The primary read here is that the PRC leadership will decide on the role they want to have in the open-source AI ecosystem. The safe assumption has been that it would continue because the government picked up a high-impact national strategy when it first started focusing on the issue, already seeded with international influence.

To formalize these intentions, the Chinese government has recently enacted an “AI+” plan that reads very similarly to the recent White House AI Action Plan when it comes to open models. The AI+ plan idea was first proposed in March 2024 and was just approved in its full text on July 31, 2025. The AI+ plan, when enacted by local officials, lays out goals for the AI industry in how many open models to have at different tiers of performance and some funding mechanisms for nurturing them.

This is right in line with other comments from party officials. Chinese Premier Li Qiang, second-ranking member of the Politburo Standing Committee, made comments in March directly supporting open-source models. From the Wall Street Journal:

Li pledged that China would boost support for applications of large-scale AI models and AI hardware, such as smartphones, robots, and smart cars.

China’s top economic planning body also said Wednesday that the country aimed to develop a system of open-source models while continuing to invest in computing power and data for AI.

An excerpt from Beijing’s city plan as part of the overall AI+ initiative, translated by GPT-5 Pro, has interesting, specific goals:

By end-2025: implement 5 benchmark application projects at a world-leading level; organize 10 demonstration application projects that lead the nation; and promote a batch of commercializable results. Strive to form 3–5 advanced, usable, and self-controllable base large-model products, 100 excellent industry large-model products, and 1,000 industry success cases. Take the lead in building an AI-native city, making Beijing a globally influential AI innovation source and application high ground.

The goal of this is to:

Encourage open-source, high-parameter, ‘autonomous and controllable’ base foundation models, and support building cloud hosting platforms for models and datasets to facilitate developer sharing and collaboration.

Beyond the minor translation bumpiness, the intentions are clear. The goal of the A+ plan is clear with multiple mentions of both open-source models and an open ecosystem with them where the models can be adopted widely. The ecosystem of models can make the impact of any one individual model greater than it would be alone.

The Chinese government having centralized power has more direct levers to enact change than the White House, but this comes with the same trade-offs as all initiatives face when comparing the U.S. vs. China’s potential. I won’t review all of the differences in the approaches here.

Where the Chinese Government enacts top level edicts that’ll be harder to follow from the West, there are numerous anecdotes and interactions that highlight in plain terms the mood of the AI ecosystem in China. I’ve routinely been impressed by the level of direct engagement I have received from leading Chinese AI companies and news outlets. Interconnects’ readership has grown substantially in China.

Chinese companies are very sensitive to how their open contributions are viewed — highlighting great pride in both their work and approach. The latest case was via our China open model rankings that got direct engagement from multiple Chinese AI labs and was highlighted by a prominent AI news outlet in China — 机器之心/Synced. They described Interconnects as a “high-quality content platform deeply focused on frontier AI research.” (This Synced post was translated and discussed in the latest ChinaAI Newsletter)

When intellectuals, influencers, and analysts I follow talk directly to technical members of the AI workforce in China, they sound like what we would expect — people who want to build a great technology. Jasmine Sun had a great writeup on her trip that had some anecdotes on AI in China. She asked “Do you guys worry about AI safety?”

“We don’t think about risks at all.” …

Continuing from Jasmine:

This was the first of several conversations that gave us a distinct impression of the Chinese tech community. Spirits are high, and decoupling policies like export controls only fuel their patriotic drive.

At the same time, America still represents a covetable life, despite the current political tumult:

To be clear, our researcher friend made clear that working at a top US AI lab was still the most desirable option.

In so many ways, trying to precisely map China’s next steps in AI is extremely challenging. Can they convert their lead in energy infrastructure to more total AI compute? Can they build their own AI chips? Will they take the frontier of performance with their talented population and a different approach? All of this is up for debate.

The intrigue here is exemplified by the abundant interest in sparse news stories on how DeepSeek is training some AI model with Huawei chips. In many ways, these new chips working would be a bigger story than the original DeepSeek model, but all signs point to expected experiments with domestic chips, where China’s leading AI models are likely to be trained on Nvidia and other Western chips for the foreseeable future. I do not expect DeepSeek R2 to be trained on Huawei’s hardware.

China’s hardware investment will take a lot longer to play out than open model strategies, but if China pulls it off — along with its other investments, such as self-driving cars and robots — their practical lead in AI could come for more areas. Open models could be China’s beachhead in a bigger technological resurgence with AI.

Without major changes to Western investment in open models, we’re approaching a status quo in 2026 and beyond where:

* Chinese open models would continue to increase their lead in performance (and adoption) over American counterparts. This will manifest in many ways. One example is how startups in Silicon Valley built on stronger Chinese models will be offering services that compete with entrenched, handicapped Fortune 500 companies wary of adopting these models in their services. This could make some subareas of AI disruption feel particularly intense.

* The Chinese open ecosystem’s density of knowledge and sharing would translate into increased scientific and academic impact. China’s share of conference papers at leading AI conferences is already rapidly on the rise, and having an ecosystem built around substantially better models than their Western counterparts could lead this numerous research growing also to be impactful. Better base models allow more interesting RL and agentic research today, and the list of areas reliant on high-performance models is likely to only grow longer with time.

* A proliferation of strong open models would make it difficult to restrict the presence or availability of many forms of AI. We do not have the government tools, incentives, nor culture to successfully prevent digital goods from China (or elsewhere) entering the U.S. economy. Many forms of AI governance and regulation in the United States and the rest of the world may need to be reconsidered, where many jurisdictions have looked to control and understand the development of “frontier AI.” Regulation needs to be approached for a world enmeshed in powerful AI models, rather than trying to control access or the releases of a few.

These realities all paint a clear picture that bends the association of open models from “soft power” to just “power.” Continuously releasing strong open AI models could allow Chinese companies to shape the technology interfaces, services and reality around the world. Where 2024 was about research on open models, and 2025 the professionalization of them, 2026 could be where we begin to see clear impacts of their power through endless distribution.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Ranking the Chinese Open Model Builders Aug 17, 2025

The Chinese AI ecosystem has taken the AI world by storm this summer with an unrelenting pace of stellar open model releases. The flagship releases that got the most Western media coverage are the likes of Qwen 3, Kimi K2, or Zhipu GLM 4.5, but there is a long-tail of providers close behind in both quality and cadence of releases.

In this post we rank the top 19 Chinese labs by the quality and quantity of contributions to the open AI ecosystem — this is not a list of raw ability, but outputs — all the way from the top of DeepSeek to the emerging open research labs. For a more detailed coverage of all the specific models, we recommend studying our Artifacts Log series, which chronicles all of the major open model releases every month. We plan to revisit this ranking and make note of major new players, so make sure to subscribe.

At the frontier

These companies rival Western counterparts with the quality and frequency of their models.

DeepSeek

deepseek.com | 🤗 deepseek-ai | X @DeepSeek_AI

DeepSeek needs little introduction. Their V3 and R1 models, and their impact, are still likely the biggest AI stories of 2025 — open, Chinese models at the frontier of performance with permissive licenses and the exposed model chains of thought that enamored users around the world.

With all the attention following the breakthrough releases, a bit more has been said about DeepSeek in terms of operations, ideology, and business model relative to the other labs. They are very innovative technically and have not devoted extensive resources to their consumer chatbot or API hosting (as judged by higher than industry-standard performance degradation).

Over the last 18 months, DeepSeek was known for making “about one major release a month.” Since the updated releases of V3-0324 and R1-0528, many close observers have been surprised by their lack of contributions. This has let other players in the ecosystem close the gap, but in terms of impact and actual commercial usage, DeepSeek is still king.

An important aspect of DeepSeek’s strategy is their focus on improving their core models at the frontier of performance. To complement this, they have experiments using their current generation to make fundamental research innovations, such as theorem proving or math models, which ultimately get used for the next iteration of models. This is similar to how Western labs operate. First, you test a new idea as an experiment internally, then you fold it into the “main product” that most of your users see.

DeepSeekMath, for example, used DeepSeek-Coder-Base-v1.5 7B and introduced the now famous reinforcement learning algorithm Group Relative Policy Optimization (GRPO), which is one of the main drivers of R1. The exception to this (at least today) is Janus, their omni-modal series, which has not been used in their main line.

Qwen

qwenlm.ai | 🤗 Qwen | X @Alibaba_Qwen

Tongyi Qianwen, the primary AI lab within Alibaba’s cloud division, is by far and away most known for their open language model series. They have been releasing many models across a range of sizes (quite similar to Llama 1 through 3) for years. Recently, their models from Qwen 2.5 and Qwen 3 have had accelerating market share among AI research and startup development.

Qwen is closer to American Big Tech companies than to other Chinese AI labs in terms of releases: They are covering the entire stack, from VLMs to embedding models, coding models, image and video generation, and so on.They also cater to all possible customers (or rather every part of the open community) by releasing capable models of all sizes. Small dense models are important for academia to run experiments and for small/medium businesses to power their applications, so it comes to no surprise that Qwen-based models are exploding in popularity.

On top of model releases for everyone, they also focused on supporting the (Western) community, releasing MLX and GGUF versions of their models for local usage or a CLI for their coding models, which includes a generous amount of free requests.

Unlike some American companies, the core team seems to have stayed relatively small in terms of headcount, in line with other Chinese AI labs: Qwen3 has 177 contributors, whereas Llama 3 has thrice the amount, while Gemini 2.5 has over 3,000 people as part of the model.

Close competitors

These companies have recently arrived at the frontier of performance and we will see if they have the capability to consistently release great models at a pace matching Qwen or DeepSeek.

Moonshot AI (Kimi)

moonshot.cn | 🤗 moonshotai | X @Kimi_Moonshot

Moonshot AI is one of the so-called “AI tigers”, a group of hot Chinese AI startups determined by Chinese media and investors. This group consists of Baichuan, Zhipu AI, Moonshot AI, MiniMax, StepFun, and 01.AI — most of which have attracted investments by tech funds and other tech grants. For example, Alibaba is seen as a big winner in the AI space by having their own models and by being a lead investor in Moonshot, sort of like how big tech companies in the U.S. are investing in fundraising rounds for newer AI labs.

While their first models, K1 and K1.5, were closed and available on their API, they started releasing open models after the R1 release with experimental models using the Muon optimizer. Similar to DeepSeek, they focus on a single model line, with small experiments eventually feeding back into the main model. K2 is their “moonshot run,” a.k.a. yolo run, and quickly became a hit similar to R1 (see our report from the release).

Further reading on Kimi can be found on ChinaTalk.

Zhipu / Z.AI

z.ai | 🤗 zai-org | X @Zai_org

Zhipu, known in the west as Z.ai, is a startup spinoff of Tsinghua University with considerable investments by Chinese companies and VCs. Currently, they are even considering an IPO, which would make them the first AI tiger to do so.

In terms of models, they are mostly known for their recent release of GLM-4.5 and GLM-4.5V, which are all very capable for their sizes (both of which are fairly large mixture of expert models). However, they are not just releasing LLMs, but also image and video generation models, setting them apart from pure-LLM companies and labs.

Noteworthy

These companies are transitioning to open releases, have open models with inferior capabilities, or slightly different foci than the text-centric labs pushing the frontiers of intelligence.

StepFun

stepfun.ai | 🤗 stepfun-ai | X @StepFun_ai

StepFun first started as a closed model provider, but pivoted to open model releases after DeepSeek R1 shook up the industry. They are mostly focusing on multi-modal model releases, with Step3 being their flagship VLM. They also have image, audio and video generation models.

Tencent (Hunyuan)

hunyuan.tencent.com | 🤗 Tencent | X @TencentHunyuan

Hunyuan is mostly known for HunyuanVideo and Hunyuan3D. While they have released three series of different LLMs, their releases come with very strict licenses, which is unusual for Chinese companies and dampens excitement when combined with performance levels that can be found elsewhere.

RedNote (Xiaohongshu)

xiaohongshu.com | 🤗 rednote-hilab

The Chinese version of Instagram, RedNote, recently joined the ranks of Chinese companies releasing open models. Especially their capable character recognition / OCR model surprised many (see our coverage). Similar to Xiaomi and Baidu, it remains to be seen what their overall open strategy will be in the near and distant future and they have not competed in the large, frontier model space.

MiniMax

minimaxi.com | 🤗 MiniMaxAI | X @MiniMax__AI

MiniMax is another of the AI tigers and also started as a closed company. After the release of R1, they changed their strategy and released the weights of Minimax-Text-01, following up with reasoning models building upon it. The unique selling point of these models are the 1M context window achieved with hybrid attention.

These text models are not the only thing they are focusing on — they also have image and video generation models, but those remain closed and only available on their API. They are also promoting their consumer platform heavily as they eye an IPO.

OpenGVLab / InternLM

internlm.intern-ai.org.cn | 🤗 InternLM | X @opengvlab

InternLM & OpenGVLab have deep ties to the Shanghai AI Laboratory, with InternLM focusing on the language models, while OpenGVLab releases vision models. While they release a range of models such as S1 or InternLM-Math, the orgs are mostly known for the strong InternVL series. While the first versions mostly used their own InternLM pretrained models, later releases (such as InternVL3) rely on Qwen as the language backend.

Skywork

skywork.ai | 🤗 Skywork | X @Skywork_AI

The Singaporean Skywork first started out as an online karaoke company (yes, really) before they pivoted to AI and being a competitor to Manus, with their platform focusing on agents for work-related tasks, such as slide generation.

Their LLM journey started with them releasing their own pretrained dense and MoE models. However, they stopped pre-training their own models and instead started to fine-tune existing models: Their OR1 reasoning model builds on top of DeepSeek-R1-Distill-Qwen-32B, R1V3 uses InternVL3 (which itself uses Qwen2.5 as its LLM backend).

Aside from LLMs, they have a wide range of other models, from world models, image and video generation models, and reward models. Similar to their LLMs, they mostly build on top of other models. Unlike many labs, Skywork has released some datasets with their models, such as preference and reasoning training data.

On the rise

These companies are either just getting their toes wet with open models or operating as more of academic research organizations than labs pushing the performance of models.

ByteDance Seed

seed.bytedance.com | 🤗 ByteDance-Seed

Seed is the R&D arm of ByteDance and eerily similar to Meta’s FAIR division: Diverse models with interesting research, with their papers garnering a ton of attention in the community. However, it remains to be seen whether they shoot for a Llama-style model release or continue to release research artifacts.

Here are some recent papers:

* Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

* Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

* Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

* Seedance 1.0: Exploring the Boundaries of Video Generation Models

* SeedEdit 3.0: Fast and High-Quality Generative Image Editing

* Seed1.5‑VL Technical Report

* Mogao: An Omni Foundation Model for Interleaved Multi‑Modal Generation

* Seed1.5‑Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

* VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

* Seed LiveInterpret 2.0: End‑to‑end Simultaneous Speech‑to‑speech Translation with Your Voice

OpenBMB

openbmb.ai | 🤗 openbmb | X @OpenBMB

OpenBMB is an open-source community (comparable to BigScience) from Tsinghua University NLP Lab (the very same university where Zhipu was spun off from) with support from the Beijing Academy of Artificial Intelligence (BAAI) and ModelBest.

They are mostly focusing on small multi-modal models for the edge, such as MiniCPM-V-4. However, the license is rather restrictive, which is surprising given the community-driven origins of the group. Aside from model releases, they also release frameworks and specialized kernels to make sure their models run on low-end hardware.

Xiaomi (MiMo)

mi.com | 🤗 XiaomiMiMo

Xiaomi started releasing a bunch of small, capable models, ranging from LLMs to VLMs and audio models. Xiaomi updating the models quickly after an initial launch and releasing multiple variants of the models show that it is not a one-off foray into open models. However, it remains to be seen whether those are mostly research artifacts or whether they are serious about potentially pushing the frontier or competing for adoption.

Baidu (ERNIE)

yiyan.baidu.com | 🤗 baidu | X @Baidu_Inc

Baidu, one of the original names in the Chinese AI space, has only released the weights of ERNIE 4.5. It remains to be seen whether they continue to release weights of newer releases as well.

Honorable Mentions

The rest of the labs that we are watching.

Multimodal Art Projection

m-a-p.ai | 🤗 m-a-p

An open research community, releasing all kinds of models (including a truly open 7B language model with data, etc.). Now, they’re mostly known for the music generation model YuE.

Alibaba International Digital Commerce Group

aidc-ai.com | 🤗 AIDC-AI

Another R&D arm of Alibaba, mostly releasing niche models building upon Qwen.

Beijing Academy of Artificial Intelligence (BAAI)

baai.ac.cn | 🤗 BAAI | X @BAAIBeijing

As a university, the Beijing Academy of Artificial Intelligence has a high diversity of projects. They are mostly known for BGE, which are capable embedding models.

inclusionAI

🤗 inclusionAI | X @InclusionAI666

The open weight arm from the Ant Group (an affiliate of Alibaba handling mobile payments and some financial industries), responsible for Ling Lite, a series of LLMs.

Pangu (Huawei)

huaweicloud.com | X @HuaweiCloud1

Huawei is working on AI accelerators to threaten the market share of Nvidia GPUs, which are often targeted by regulations, both from the US and China. Their model releases are mostly to show what’s possible with their cards, but not without drama accusing them of upcycling Qwen models and not stating it. We would expect them to continue to release more models in the near future.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Contra Dwarkesh on Continual Learning Aug 15, 2025

Dwarkesh Patel’s now well-read post on why he is extending his AI timelines focuses on the idea of continual learning. If you ask me, what we have already is AGI, so the core question is: Is continual learning a bottleneck on AI progress?

In this post, I argue that continual learning as he describes it actually doesn’t matter for the trajectory of AI progress that we are on. Continual learning will eventually be solved, but in the sort of way that a new type of AI will emerge from it, rather than continuing to refine what it means to host ever more powerful LLM-based systems.

Continual learning is the ultimate algorithmic nerd snipe for AI researchers, when in reality all we need to do is keep scaling systems and we’ll get something indistinguishable from how humans do it, for free.

To start, here’s the core of the Dwarkesh piece as a refresher for what he means by continual learning.

Sometimes people say that even if all AI progress totally stopped, the systems of today would still be far more economically transformative than the internet. I disagree. I think the LLMs of today are magical. But the reason that the Fortune 500 aren’t using them to transform their workflows isn’t because the management is too stodgy. Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs. And this has to do with some fundamental capabilities these models lack.

I like to think I’m “AI forward” here at the Dwarkesh Podcast. I’ve probably spent over a hundred hours trying to build little LLM tools for my post production setup. And the experience of trying to get them to be useful has extended my timelines. I’ll try to get the LLMs to rewrite autogenerated transcripts for readability the way a human would. Or I’ll try to get them to identify clips from the transcript to tweet out. Sometimes I’ll try to get them to co-write an essay with me, passage by passage. These are simple, self contained, short horizon, language in-language out tasks - the kinds of assignments that should be dead center in the LLMs’ repertoire. And they're 5/10 at them. Don’t get me wrong, that’s impressive.

But the fundamental problem is that LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human's. But there’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box. You can keep messing around with the system prompt. In practice this just doesn’t produce anything even close to the kind of learning and improvement that human employees experience.

The core issue I have with this argument is the dream of making the LLMs we’re building today look more like humans. In many ways I’m surprised that Dwarkesh and other very AGI-focused AI researchers or commentators believe this — it’s the same root argument that AI critics use when they say AI models don’t reason. The goal to make AI more human is constraining the technological progress to a potentially impossible degree.

Human intelligence has long been the inspiration for AI, but we have long surpassed it being the mirror we look to for inspiration. Now the industry is all in on the expensive path to make the best language models it possibly can. We’re no longer trying to build the bird, we’re trying to transition the Wright Brothers’ invention into the 737 in the shortest time frame possible.

To put it succinctly. My argument very much rhymes with some of my past writing.

Do language models reason like humans? No. Do language models reason? Yes.

Will language model systems continually learn like humans? No.Will language model systems continually learn? Of course.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Dwarkesh writes “Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs.” This is because we’re still early on the buildout of the technology. Human labor takes an immense amount of context and quick thinking, both of which we’re starting to unlock with our language models. On top of this, human labor may not be what we want to create — we want to augment it.

Using LLMs as drop in replacements for humans is not a requirement for AGI nor is what Dwarkesh describes a fundamental limitation on AI progress. Francois Chollet cleverly poked at this weakness in his recent conversation with Dwarkesh at an ARC-AGI event:

Well, how do you define the difference between the ability to adapt to a new task and learning on the fly? It's, it sounds like the same thing to me.

Language models can already pick up subtle context extremely fast. ChatGPT’s memory feature has gotten far better for me. When we’re using the far more powerful models we can expect in the next 18 months this’ll already start to appear magical. Language models are extremely apt at inferring context even without us giving it to them. Soon we’ll be unlocking that subtle connection engine by providing immense, explicit context.

I don’t know of anyone who has actually thoroughly digitized all the relevant context of their job and formatted it in a way that is easily readable by an LLM. GPT-5 Pro estimates that all of the writing on Interconnects would be only 500K tokens. That would fit into an existing LLM with no extra system, but I’ve never tried it.

The problem that Dwarkesh is facing is that we’re still using LLMs primarily in a single generation manner, which got far better with the introduction of reasoning models, but the economically useful way to use current tools in more complex intellectual domains will require a deep-research style approach over all of your recent work interactions. No one is giving language models that kind of context. None of the tools we use are set up properly to accumulate this type of context.

I expect this to change rapidly. ChatGPT, Claude, and the likes are all adding memory features across chats and countless connectors to other pieces of information in your professional life. These memory features will be omnimodal and essential to extracting the type of value Dwarkesh wants. Without them, I agree language models in their current form are hopeless at solving continual learning.

This is what I would expect the rumored $2000/month ChatGPT level subscriptions to work with. Each of these bespoke tasks needs to absorb a ton of context and reasoning tokens in order to make a directionally right output. If someone built the Claude Code equivalent for my Substack, with every post tagged by topic and performance metrics, I bet the AI could easily make useful suggestions on how to format my content.

Continual learning in how Dwarkesh presents it is a systems problem rather than a learning problem. I expect better context management over my information ecosystem to exist in 2026, but more work to be needed for the AI companies to know how best to reference it and unlock in-context learning that feels like rapid adaptation. Call that 2027.

The models that have been released in 2025 will make this far more tractable in the near future. Reasoning models have made in-context learning far more powerful, resulting in rapid progress on held-out and complex domains such as ARC-AGI. These models also have come with massive improvements in context length. Claude and Gemini have 1M+ token context lengths and GPT-5’s is at 400K — they’re all growing steadily. What is important with the context length numbers is that evaluations are showing that these are meaningful improvements that the models can leverage intelligently.

With these reasoning models and smart retrieval of context, the systems we are building will look indistinguishable from continual learning. This will definitely be multiple LLMs working together and will operate very differently than the first versions of ChatGPT we were given (and often still use today).

The path to continual learning is more context and more horsepower. This is directly in line with the direction AI investment is going. This doesn’t feel like a bottleneck, rather another product problem that we are going to solve. This sort of continual learning may not enable the type of raw intelligence and autonomy that many vocal leaders in AI describe as “superintelligence.”

Training models to be smarter on even more complex tasks — e.g. novel biological research — requires mastering agentic behaviors that need to be learned from scratch, as discussed in my post on “What comes next with RL”. There’s no internet scale pretraining data for such agentic tasks. My point is that not all jobs that require continual learning will require the frontiers of intelligence. I’m excited to write blog posts with the bliss of my ChatGPT 6 co-editor.

This technology coming soon will not be without its challenges. My first reaction to the continual learning post was more in line with “society isn’t ready for this” rather than commentary on its feasibility. I’ll repeat my warning:

For a long time I’ve written that AI models have a higher risk potential in terms of social outcomes because the modalities they interact with us in are far more personal… As AI is going to be so powerful as a standalone entity, breaking some of the symbiotic links will be good for adding friction that makes the technology easier to steer towards good outcomes. In short, be wary of wishing for end-to-end (reinforcement) learning when you’re part of the environment.2 It’s a destiny to dystopia.

What we have today is a form of AGI and it’ll soon get much better with better context and memory. The industrialization of language models is giving us incredible improvements across a wide swath of use-cases. These will blow past many basic primitives of intelligence in humans that have motivated AI for decades. First was models reasoning, then will come systems with continual learning. This is exactly what most AI companies are actually building — regardless of what their superintelligence messaging is.

Comments are open on this post, please continue the debate!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

GPT-5 and the arc of progress Aug 07, 2025

If you want a video version of this, check out the last 20 minutes of the livestream reaction (edit, fixed link) I did with Will Brown of Prime Intellect and Swyx of Smol AI & Latent Space.

GPT-5 was set up to fail on some of the narratives it was expected to satisfy. The two central themes it had to decide between were the AGI (or superintelligence) narrative that Sam Altman & co. have been using to fundraise and the fact that ChatGPT is one of the fastest-growing consumer technologies of all time.

To fulfill both, GPT-5 needed to be AGI while also being cheap enough to serve as the most-used AI system in the world. Business and technological realities made it inevitable that GPT-5’s primary impact would be to solidify OpenAI’s market position, even if it raises a lot of eyebrows for the long-term trajectory of AI.

The reactions online capture this as well. The OpenAI live streams have historically catered to AI insiders, but the product speaks entirely to a different audience. The people discussing this release on Twitter will be disappointed in a first reaction, but 99% of people using ChatGPT are going to be so happy about the upgrade. Confusingly enough, this includes many of the critics. GPT-5 is a good AI system. It’s right in line with best-in-class across pretty much every evaluation, while being cheap enough to serve the whole world.

OpenAI is largely fixing its product offering with an announcement that was hyped to be one of the biggest AI news cycles of the year. AI news being loud is defined by narratives being different more-so than technology being better. OpenAI releasing an open model again will likely be pinpointed as just as important a day for the arc of AI as the GPT-5 release. In many ways GPT-5 was set up to fail and that is very off-putting for those expecting maximum AI progress in the near term.

I’m not going to dwell on it, but oh boy, that was a messy release. GPT-5 being announced and rolled out like this is very odd. Countless plots were mislabeled, live demos had bugs, and the early rollout is doing some weird stuff. This reinforces how OpenAI was torn about the release and backed into a corner with their messaging. They knew they needed to improve the experience with strong competition in the industry, but releasing GPT-5 needed to make a splash after how long they’ve waited (and already parked the GPT 4.5 name).

The core question we track in this post is: What does it mean for the next 6-18 months of AI progress if GPT-5 is just as good as all the best models out there, e.g., Claude Sonnet for coding or o3 for search, funneled into one, super cheap package?

If AGI was a real goal, the main factor on progress would be raw performance. GPT-5 shows that AI is on a somewhat more traditional technological path, where there isn’t one key factor, it is a mix of performance, price, product, and everything in between.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

GPT-5’s performance

There are a few places that we can see that GPT-5 represents a solid step on the performance trend line, but nothing like a step change. First, on LMArena, GPT-5 is fantastic, sweeping the board to #1 on all categories. The last model to claim #1 in pretty much every category was Gemini 2.5 Pro — and that was the biggest step change in Elo since GPT-4 Turbo skyrocketed past the first Claude.

Second, GPT-5 is the top model on the ArtificialAnalysis composite benchmark.

These two, LMArena & ArtificialAnalysis, represent two coarse evaluations — community vibes and raw benchmarks. Both of these can be gamed, but are still correlated with real-world use. You can also see in OpenAI’s shared results how much the smaller versions improve on the likes of GPT-4.1 mini and o4-mini.

In many ways, the march of progress on evals has felt slowed for a while because model releases are so frequent and each individual step is smaller. Lots of small steps make for big change. The overall trend line is still very positive, and multiple companies are filling in the shape of it.

My post on “what comes next” from earlier this summer all but called this type of release, where the numbers aren’t shocking but the real world use cases are great, becoming more common.

This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.

To say it succinctly: Abilities will develop more slowly than products.

The product overhang is being extended with each release. We’re still building untapped value with AI models and systems faster than we’re capturing it.

Another way to see this incremental push out in models or systems is through OpenAI’s update to the famous METR plot of time to completion for humans of various tasks AI systems can solve 50% of the time. GPT-5 is leading, but also just in line with trends.

All of this is to say comprehensively that AI progress is very alive and well, as long as you don’t subscribe to the exponential takeoff in ability. Those arguments are very strained by this GPT-5 release.

Yes, AI progress on intelligence and “raw ability” is certainly going to continue at a solid pace for a long time, but how will this translate into recursive self-improvement?

GPT-5’s details

If you’re reading closely, you may have noticed that this post uses the word system instead of model. All of the leading chat systems have been adding more components onto them like safety checkers and so on, but this is the first one to use different architectures and weights for the primary generation of content across similar queries. GPT-5 is the first in what is to come, mostly to better balance cost and give better user experiences. From the system card:

GPT‑5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time.

Along with this, they shipped many product improvements, such as how the model has a 400K context window in the API with great performance, reduced hallucinations, and new personalities.

Primarily, I worry as a power user about the router. I sense that for now I’ll default to GPT-5 Thinking, and sometimes upgrade to Pro mode, while downgrading to standard GPT-5 only for benign queries (depending on its search behavior — if it is search-heavy like o3 without thinking, then it should still work well).

Thankfully, the thinking mode has a “get an early answer” button, so I don’t see any reason to start elsewhere. If I need an answer fast, I’ll get one. If not, I want the best responses possible.

As for prices, here’s a comparison. GPT-5’s top-level model is cheaper than Claude Sonnet and far better than any OpenAI model has been before at coding — one of the core details of this release. Matching Gemini Pro’s pricing when considering Google’s infrastructure advantage is a substantial accomplishment.

* OpenAI — GPT-5 (API sizes)

* GPT-5: input $1.25, output $10.00. (OpenAI)

* GPT-5 mini: input $0.25, output $2.00. (OpenAI)

* GPT-5 nano: input $0.05, output $0.40. (OpenAI)

* OpenAI — o3 (reasoning)

* o3: input $2.00, output $8.00. (OpenAI Platform)

* o3-mini: input $1.10, output $4.40. (cached input $0.55) (OpenAI Platform)

* Anthropic — Claude 4 family

* Claude Sonnet 4: input $3.00, output $15.00. (Anthropic)

* Claude Opus 4.1: input $15.00, output $75.00. (Anthropic)

* Google — Gemini 2.5

* Gemini 2.5 Pro: input $1.25 (≤200k prompt) / $2.50 (>200k); output $10.00 (≤200k) / $15.00 (>200k). (Google AI for Developers)

* Gemini 2.5 Flash: input $0.30 (text/image/video) or $1.00 (audio); output $2.50 (includes thinking tokens). (Google AI for Developers)

* Gemini 2.5 Flash-Lite: input $0.10 (text/image/video) or $0.30 (audio); output $0.40. (Google AI for Developers)

Cheaper, thinking models that work well in applications are far more useful than scaling (as GPT-4.5 has shown us).

GPT-5’s impact

It seems like most people in all walks of life are going to love this model — from AI researchers all the way to people who are learning of ChatGPT for the first time today. This is very in line with my expectations for how AI will proceed, as a long, steady march of progress.

The fact that the models are getting way cheaper rather than way more expensive definitely signals that we cannot just brute-force scale our way to much stronger systems. Scaling helps, but it is now one of many considerations, and all the laboratories are showing us that much bigger models have diminishing returns in value to customers. At the same time, models being cheaper could be just what we need for Jevons paradox to kick in and provide another boost in AI adoption.

Many people will claim that the GPT-5 release was a flop and the bubble will pop for AI. This is downstream of the industry generally making totally unrealistic promises. As someone whose core through-line when covering frontier models is tracking the pace of progress, I translate this as “AI capabilities on benchmarks will proceed a bit more slowly, but we aren’t reaching any clear walls in performance.” The AI performance hills we’re climbing up as an industry do put up some more resistance as the obvious low hanging fruit is gone, but we have the tools to overcome it consistently for the next 6 to 18 months.

For companies that have been fundraising on promises of AGI, such as Anthropic and OpenAI, closing the next rounds could be harder. Of course, this depends on whether the messaging of the rounds was a key part of the fundraising.

This fundraising inspires capital expenditures across the industry, e.g. TSMC developing the next node for NVIDIA to build new chips, and so on. The AGI narrative and the fundraising it has enabled have been good for the U.S. in terms of building out valuable, raw infrastructure.

This could be the beginning of the money train slowing down, but that’s very different from a derailment and a stock market crash. As raw infrastructure spend slows, there will be even more pressure to deliver valuable products to users. A key trend for 2025 has been many of those appearing — Deep Research and Claude Code being the paradigms that everyone has copied.

GPT-5 makes these applications better and makes it easier and cheaper for the next viral AI products to hit the market. I’m still excited for what is to come.

But first, I’m going to sign off and go play with GPT-5. It’s a good day to build something for the fun of it. As I use it more, I’ll have more to say.

Extra GPT-5 links

For more specifics on the model from people who got early access, I recommend Tyler Cowen, Every.to, or Simon Willison (or Swyx soon, on Latent.Space).

Livestream link: https://openai.com/gpt-5/ Research blog post: https://openai.com/index/introducing-gpt-5/ Developer blog post: https://openai.com/index/introducing-gpt-5-for-developers Enterprise blog post: https://openai.com/index/gpt-5-new-era-of-work GPT-5 landing page: https://openai.com/gpt-5/ System Card: https://openai.com/index/gpt-5-system-card/ Coding examples: https://openai.github.io/gpt-5-coding-examples/What would you say if you could talk to a future OpenAI model https://progress.openai.com/

Finally, I’ll plug again the video I did with Will Brown and Swyx:

Send me the most interesting things you find on GPT-5!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

gpt-oss: OpenAI validates the open ecosystem (finally) Aug 05, 2025

OpenAI released two open-weight, text-only reasoning models today, both mixture of experts (MoE) sized to run efficiently on a range of hardware from consumer GPUs to the cloud. These models have the Apache 2.0 license, so they’re available for distillation into other reasoning models, deployment into commercial products, and are free of downstream restrictions. These two models, the smaller gpt-oss-20B with 3.6B active parameters and 21B total and the larger gpt-oss-120B with 5.1B active parameters, follow the trends we’ve seen with the other leading open models in architecture choices.

Where this release shines is in the dramatic change in open model performance and strategy that comes with the leading name in AI releasing an open model that undercuts some of their own API products.

We’ll get to the technical details on the model later, but the main point of this post is how much OpenAI has changed by releasing their first open language model since GPT-2. The larger 120B model “achieves near-parity with OpenAI o4 mini on core reasoning benchmarks‬” and is a major moment for the ecosystem:

* OpenAI has released an open model at the frontier of current open model performance — highlighting how major concerns over open models that OpenAI leadership mentioned in 2023 were overblown. The marginal risks of open models have been shown to not be as extreme as many people thought (at least for text only — multimodal is far riskier). Once other organizations, particularly Meta and China showed OpenAI that there was no risk here, the path was opened to release a model.

* OpenAI has revealed far more of their technical stack than any release to date. This blog post has light details on many things in the model, but community tinkering will begin to better understand what is going on here. This includes basic things like our first time seeing a raw chain of thought (CoT) for an OpenAI reasoning model, but also more interesting things like how this model is trained to use tools in the CoT like their o3 model. Other details include researchers being able to play with OpenAI’s instruction hierarchy in raw weights (where pieces of it are untouchable in the API), a new “harmony” prompt format, the same “reasoning efforts” of low, medium & high from the API, a huge proof of concept on how far basic, community standard architectures with MoEs can be pushed, and other small details for the AI community to unpack.

* OpenAI has initiated a scorched earth policy on the API market, undercutting their own offerings and unleashing an extremely strong, trusted model brand with a permissive license. While adoption of any open model is much slower than an API due to testing, additional configuration, etc., this is set up to go about as fast as it can. Any API model that competes with current models like OpenAI o4 mini, Claude Haiku, Gemini Flash, DeepSeek R1 etc. are all going to have to compete with this model. OpenAI’s o4 mini model is currently served at $1.1 per million input tokens and $4.4 per million output. Serving this open model will likely cost at least 10x less. There are many potential strategic reasons for this, all of which paint OpenAI as having a clearer vision of what makes it valuable. What OpenAI hasn’t touched with this model is interesting too — “For those seeking multimodal support, built-in tools, and‬ seamless integration with our platform, models available through our API platform remain the‬ best option.” These are dropped for reasons above, and “headaches” discussed later in the post.

Together, these paint a much clearer vision by OpenAI on how they’ll control the AI ecosystem. The top potential reasons on my mind are:

* OpenAI could be trying to make all API models potentially obsolete on cost ahead of the GPT-5 release, which they hope to capture the top end of the market on. Or,

* OpenAI could be realizing that models are no longer their differentiation, as ChatGPT users continue to steadily climb — and they’ll soon pass 1 billion weekly actives.

There are plenty of other reasons, such as the politics alluded to at the end of the blog post, but OpenAI tends to only act when it serves them directly — they’ve always been a focused company on their goals.

There’s also a long list of head scratchers or in-between the lines points that illuminate OpenAI’s strategy a bit more. OpenAI of course didn’t release training data, code, or a technical report, as expected. OpenAI is trying to make a big splash with the name that captures more of the enterprise market, but in doing so takes some collateral damage in the research and true “open source” AI communities. These future questions include:

* The naming is bad — a mixture of cringe, confusion-inducing, and still useful for their marketing goals. For anyone following open-source AI for a long time it won’t be new that a major company is blurring the association of the term open-source with the community accepted definitions. I understand why OpenAI did this, but the naming conflict further enforces that the true open source AI community isn’t the target of this release — it’s people that want to try an “open source AI model” for their business, and OpenAI has made the target too big to miss for enterprises.

* OpenAI did not release the base models. Anyone following the space would’ve expected this, but it matters substantially for researchers. These two sparse, low numerical precision MoE models won’t be easy for researchers to use. The best model for researchers and tinkerers are dense, base models from 1 to 7 billion parameters. These are much “longer term” artifacts in the open community that will still be using almost only Qwen.

I need to take a second before the “unknowns” section and comment on the architecture. These models are reinforcing trends we’re seeing in modeling across the industry. Recent frontier open models are all very sparse MoEs inspired by the DeepSeek architecture. DeepSeek V3 had 37B active and 671B total parameters. Kimi K2 had 32B active and 1T total parameters. With 5B active and 121B total, the sparsity factor fits right in with normal. Sparsity in MoEs is totally king right now. The smaller gpt-oss is a bit less sparse than Qwen’s 3B active, 30B total smaller MoE, but expect the sparsity of these models to continue to increase.

Some things we need more testing to know the impact of include:

* The model has been quantized for release to MXFP4 (4 bit floating point). It’s not clear exactly who will be impacted here, but this could make it benefit people most with the newest hardware, cause minor issues across Torch/Cuda versions, or even make some of the behaviors weird relative to the trained version internal to OpenAI. This could also be a plus, depending on performance, as the bigger model is quantized to 4 bit precision to enable it to be run on GPUs with 80GB of memory, such as the A/H100 line from NVIDIA.

* Safety measures have been taken to change how finetunable the model is. With, or soon after, this release OpenAI is releasing a research paper on new methods to make it so you can’t “finetune the safety away” from a released instruct model. This is a very long-standing issue that people have concerns with over releasing open models. The main question here is if the models OpenAI releases are still able to be finetuned or not for productive use-cases. OpenAI claims they can be in their blog post, but this will be left up to the community to decide. Is finetuning the safety away actually a feature of an easy to use model?For example, Gemma has been tougher for people to finetune historically because it uses a different attention implementation and has a different parameter space from being distilled. Open finetuning stacks are still tuned for Llama and Qwen — this takes a long time to change.Many people will take the “we made it impossible to un-censor this model” as a challenge, which will be interesting to follow in the jailbreaking research community. There is a substantial market for modifiable models.

* The model was trained to expect tools, but open model tool use is a mess. One of the biggest problems I worry about in designing an OLMo model with native o3-style tool use is that I need to make it seamless for users to use the same tools from training time at inference time. An early tester in my network mentioned that the model would hallucinate tool calls from training (sort of like what was mentioned around o3’s full release). I don’t expect this to be an unsolvable issue, but it could slow adoption. It could also allow people to reverse engineer the tools that OpenAI uses during training, we’ll see!

* We need to re-benchmark the model on open infrastructure. OpenAI did a good job for this release integrating it everywhere, but we need to confirm that the community can easily replicate their evaluation scores. Evaluation at closed labs has increasingly become bespoke to suit their internal needs, which is a logical decision, but this comes at a cost of friction when an open model is released. This is me saying loud and clear that this isn’t a model performance review in a nuanced sense, but a summary of the importance of OpenAI’s approach (and where the opportunity is for the rest of us). Not all good models are easy to use. Some models benchmark well and are useful — e.g. Qwen. Some models benchmark well and are forgotten. Regardless of scores, I expect this to be a useful model.

Overall, I would give OpenAI a very strong grade on their first open release in a while — they definitely listened to the feedback given by the community. The path to earning goodwill with the open community, especially with researchers, is to embrace more risk in making models that are easier to modify (and potentially even more revealing), such as the base models for these checkpoints.

Open models from the U.S. labs were in such a dire spot that we need any step back in the right direction. As the rollout of the model begins and we have more understanding of it, we’ll include more updates on Interconnects, such as in the next Artifacts Log issue.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

So, OpenAI is the new open champion, right? There’s no more risk vis-a-vis China? We don’t need Llama anymore? Not quite, let me explain.

OpenAI, ATOM, and national champions

It’s a phenomenal step for the open ecosystem, especially for the West and its allies, that the most known brand in the AI space has returned to openly releasing models. This is momentum and could be the start of the turning point of adoption and impact of open models relative to China.

The open ecosystem moves fast in some ways and slow in others. Many workflows and expertise is now built on Qwen models due to their frequent, accessible releases. Some of these will try OpenAI the next time they want to make a change, but it’s far from the fact that everyone will immediately switch to OpenAI’s model now that it’s out.

To me, OpenAI dropping a strong model has switched the second derivative on the open model scales. The U.S. and its allies will no longer be falling further and further behind, which was the main story of 2025, but we need to build on this momentum if we want to have competitive open models for all use cases in the order of months rather than years.

There’s a lot of uncertainty in the incentives for open models. Some of the best China analysts I know share how China is sensing that releasing open models is a successful strategy for them and are doubling down. This is a very reasonable take. The retort is that if we use it as a weakness of the American ecosystem that it is so reliant on Meta’s Llamas, or now GPT OSS, the same could happen for Qwen. So then, what happens if Alibaba decides Qwen’s stellar releases no longer serve them?

In this case, there would be a large opportunity in the series of small models from 1 to 70B parameters, but there’s so much competition from China at the larger scales. These are currently the big mixture of experts (MoE) models like DeepSeek V3/R1, Z.ai’s / Zhipu’s GLM 4.5, Kimi K2, and so on. China has more models that are close to this performance level, such as MiniMax or Tencent.

All of these companies have uncertainty, but there’s a strength in numbers that reinforces standard practice and sets standards. Releasing strong, large, open models is now the standard in China. We’re back in the precarious period of establishing standards for American companies, who are exposed to the legal risk of not being able to un-release models with many open lawsuits, such as in areas like copyright.

These two sides of the open ecosystem are at very different stages and need very different actions. In many ways, we shared The ATOM Project when we did because we could tell this was a local (and hopefully global) minimum in terms of the distance between Western contributions to the open science of AI compared to any point in the recent past and near future.

OpenAI’s release is a step in the right direction, but it is still a precarious position. Many people make noise about creating open models, from the AI Action Plan to venture capitalists and academics. What all of these parties have in common is that its not their number one goal. The goal of The ATOM Project is to give an outlet for people like myself that want to make this project their number one priority.

This is why we need to keep nurturing entrants into the open model space that are releasing their best models there. It is what made the early versions of Llama great, and is what will be the defining factor of the outputs of ATOM. Models that are designed from first principles to be modifiable, interpretable, and extendable is what will enable a new decade of AI research to be born. This needs base models, training details, convenient sizes, and other little details that are missing from many recent open model releases, including OpenAI’s.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Towards American Truly Open Models: The ATOM Project Aug 04, 2025

I’m very excited to share a substantial project on invigorating investment in open language models and AI research in the U.S. The ATOM (American Truly Open Models) Project is the mature evolution of my original “American DeepSeek Project” and I hope it can help be a turning point in the current trajectory of losing open model relevance vis-a-vis China, and even the rest of the world.

I’ve included the full text below, but I encourage you to visit the website for the full version with added visuals, data, and a place to sign your support. This is a community movement, rather than me fundraising, starting an organization, or anything like that

If you can help get the word out and or sign your support, I’d greatly appreciate it.

(Or watch a 5 minute overview on YouTube)

The ATOM Project: Towards fully open models for US research & industry

Reinvigorating AI research in the U.S. by building leading, open models at home

America's AI leadership was built by being the global hub and leading producer of open AI research, research which led directly to innovations like the Transformer architecture, ChatGPT, and the latest innovations in reasoning models and agents. America is poised to lose this leadership to China, in a period of geopolitical uncertainty and rising tensions between these two nations. America's best AI models have become more closed and restricted, while Chinese models have become more open, capturing substantial market share from businesses and researchers in the U.S. and abroad.

Open language models are becoming the foundation of AI research and the most important tool in securing this leadership. America has lost its lead in open models – both in performance and adoption – and is on pace to fall further behind. The United States must lead AI research globally, and we must invest in making the tools our researchers need to do their job here in America: a suite of leading, open foundation models that can re-establish the strength of the research ecosystem.

Recommendation: To regain global leadership in open source AI, America needs to maintain at least one lab focused on training open models with 10,000+ leading-edge GPUs. The PRC currently has at least five labs producing and releasing open models at or beyond the capabilities of the best U.S. open model. Regaining open source leadership is necessary to drive research into fundamental AI advances, to maximize U.S. AI market share, and to secure the U.S. AI stack.

Overview

Open language model weights and data are the core currency of recent AI research – these are the artifacts that people use to come up with new architectures, training paradigms, or tools that will lead to the next paradigms in AI to rival The Transformer or Inference-time Scaling. These research advances provide continued progress on existing products or form the basis for new technology companies. At the same time, open language models create potential for a broader suite of AI offerings by allowing anyone to build and modify AI how they see fit, without their data being sent through the cloud to a few, closed model providers.

Open language models are crucial for long-term competition within American industry. Today, substantial innovation is happening inside of large, closed AI laboratories, but these groups can only cover so many of the potential ideas. These companies spend the vast majority of their resources focusing on the next model they need to train, where the broader, open research community focuses on innovations that’ll be transformative in 2, 5, 10, or more years. The most progress in building useful, intelligent AI systems will come when the most people can participate in improving today's state-of-the-art, rather than the select few at certain companies.

The open AI ecosystem (regarding the models, not to be confused with the company OpenAI) has historically been defined by many parties participating. The United States emerged as a hub of the deep learning revolution via close collaboration between leading technology companies and academic institutions. Following ChatGPT, there have been countless contributions from around the globe. This distribution of impact on research has been collapsing towards clear Chinese leadership due to their commitment to open innovation, while a large proportion of leading scientists working in the United States have joined closed research organizations.

The playbook that led Google to invent and share the Transformer – the defining language model architecture of which all leading models such as ChatGPT, Gemini, or Claude are derived from – is now the standard mode of operation for Chinese companies, but it is increasingly neglected by American companies.

The impact of China’s models and research are growing because the institutions focused on open models have access to substantial compute resources for training – e.g. some have formed a close relationship between leading AI training laboratories and academic institutions. Until the United States and its partners directly invest in training more, higher performance open models and sharing the processes to do so, its pace of progress in AI research will lag behind.

To train open models at the frontier of performance, a developer currently needs a high concentration of capital and talent. We estimate that to lead in open model development, the United States needs to invest in multiple clusters of 10,000+ H100 level GPUs to create an ecosystem of fully open language models that are designed to enable a resurgence in Western AI research. Stacking large investments such as this into a few focused efforts will help them to learn from each other and make progress across a range of challenges quickly and robustly. Splitting such an investment in AI training into smaller, widespread projects will not be sufficient to build leading models due to a lack of compute concentration. Along the way we need to build models of various sizes that can enable applications of AI at every scale from local or edge devices all the way to high performance cloud computing.

Open models as the engine for AI research and development

America's AI leadership was built by tens of thousands of our best and brightest students, academics and researchers. This process occurred over decades, but it is faltering at a crucial transition point to the new, language modeling era of AI research. Since the release of ChatGPT, open language models and computational resources are the most important table stakes for doing relevant and impactful research. High-quality open models and their subsequent technical reports quickly accrue thousands of citations and accolades such as best paper awards and the focus of large swaths of students. These act as foundational currencies of AI research and are crucial, achievable artifacts for the long-term American AI ecosystem.

While many direct consumers of open models are academics, this community is far from the only group that will benefit immensely from a new wave of American open models. The low cost, flexibility, and customizability of open models makes them ideal for many use cases, including many of the ways that AI stands to advance and transform businesses large and small.

If the United States does not create its own leading open models, the focus of American researchers and businesses will continue to shift abroad. The benefits of openly sharing a technology accrue to the builder in mindshare and other subtle soft power dynamics seen throughout the history of open source software. Today, these benefits are accruing elsewhere due to the intentional support of open models by many Chinese organizations. The gap in performance and adoption will only grow as the American ecosystem sees strong open models as something that is nice to have, or an afterthought, rather than a key long-term priority.

China is adopting the playbook for open innovation of language models that the United States used to create its current AI leadership, yielding rapid innovation, international adoption, and research interest. The collapse of American dominance in AI research is driven not only by the remarkable quality of the Chinese ecosystem, but also by the commitment of China to these very same Open Model Principles - the principles that American scientists used to start this AI revolution. This is reflected further in a consistent trend of Chinese open models being released with more permissive terms of use than their American counterparts.

The many leading closed research institutions in the United States are still creating world-class models – and the work they do is extraordinary. This collapse is not their fault, but closed labs make closed research, and the acceleration of AI was built on open collaboration with world-class American models as the key tool.

As researchers, our focus is on leading the research and development for the core technology defining the future, but there is also a growing list of other urgent security and policy concerns facing our nation around the lack of strong open models. To start, adoption of open models from the PRC in the US and our allies has been slow in some sectors due to worries about backdoors or poor security in generated code. Similarly, there is concern over the outputs of these Chinese models being censored or inconsistent with everyday American values of freedom, equality, and independence. There are even parallels between how the PRC’s national AI champions are increasingly racing to release cheap and open AI models and the PRC’s historical practice of dumping state-subsidized, below-cost exports from China to undermine American competitors. With the dynamic and rapid evolution of this technology, we need to get ahead of these issues before stronger habits, cost disadvantages, or other incentives reduce the practicality of adopting American open models.

America's lost lead in open model performance

On countless benchmarks, the leading American models have fallen behind counterparts from Chinese companies. In July 2024, American models in the form of Llama 3 had leading performance over any openly available Chinese models. Since then, a growing number of Chinese open model providers have surpassed and widened the performance gap with the leading American open models.

The leading American open models are Meta's Llama and Google's Gemma models. The Chinese open models from DeepSeek and Alibaba's Qwen have traded off positions at the frontier of capabilities ahead of their American counterparts. However, the Chinese ecosystem is expanding rapidly, with new players such as Moonshot AI (Kimi), Zhipu AI, or Tencent close behind.

We consider two popular public, aggregate benchmarks to demonstrate the state of China’s current open model dominance. These represent crowdsourced rankings, LMArena, and comprehensive intelligence rankings by blending a variety of capability benchmarks, from ArtificialAnalysis. The pace of progress on these Pareto frontiers is only part of the equation. In addition to leading, the top 10 open models on LMArena are all created by Chinese organizations. For ArtificialAnalysis rankings, the top 3 open models are of Chinese origin as of publishing on August 4th, 2025.

The isolation of Meta's Llama

Meta CEO Mark Zuckerberg has been one of the few clear advocates for the long-term imperative of America building open models. Since the release of ChatGPT, this has been manifested by Meta's Llama series of models – these had long been the definitional open models that served as the basis for research and product development in 2023 and 2024. This basis for research is established by releasing a suite of strong models across a variety of sizes. The original LLaMA family came with models of 7, 13, 32, and 65B parameters, which quickly became defaults of the research community based on convenient factors of them fitting on certain popular GPUs for finetuning or inference.

For a first instance showcasing the gap in adoption, the Qwen 1.5 family of 8 models was released shortly after the Llama 2 family of four comparably sized models in the summer of 2023. An analysis of cumulative model downloads shows the Llama 2 models being downloaded about 500% of that of early Qwen models (a difference of 10M versus 60M total downloads with half of the models), highlighting the original state of play in the open ecosystem – a large lead for American models.

Llama 3 continued this trend with a series of models across 2024. Pieces of the Llama 3 family (and its various versions in Llama 3.1 and 3.2) are some of the most popular models ever in HuggingFace’s history as the leading distributor of open models. At the same time, the newer Qwen models from Alibaba, this time the Qwen 2.5 suite of 2024, showed substantially closer adoption numbers to Meta’s Llamas – a lead of only 20 million cumulative downloads for Llama 3 over the Qwen 2.5 suite with both of them crossing over 120M total downloads.

Llama’s lead was built on a combination of strong performance and existing distribution channels. This success came in spite of a restrictive license – the contract between the open artifact’s creator and the downstream user – that can require nuanced legal consideration about if a particular use-case is compliant. Meanwhile, Qwen and other Chinese models have adopted simpler licenses drawing on historical practices in open-source software (OSS), removing another barrier to uptake on their models.

Meta has effectively been a singular horse in this race. As language models were established as a core technology, competition has arrived. Between the last releases of Llama 3 and the arrival of Llama 4, the landscape of open models changed substantially with the arrival of DeepSeek’s permissively licensed, frontier models in DeepSeek V3 and DeepSeek R1. Now, Meta was effectively alone in releasing its best models regularly and expected to compete with Qwen making large families of models great at any size scale and DeepSeek releasing open frontier models. Both types of models are crucial to the health of the ecosystem, but they can take slightly different foci to get right.

China today has 5 amazing open labs, a number which is growing, and America has Meta as its open models champion. We are running Meta in a race against 5 other Chinese runners, and then complain when it doesn't win every race every time. Our problem is not Llama 4 being not state-of-the-art; our problem is running a solo athlete against a team built with an ecosystem to support its growth.

Chinese open models are taking the all-time lead in adoption

The available data showcasing adoption of open language models – how much models are downloaded and how much base models are modified for new uses – shows that China has taken the lead in recent adoption and will soon take the lead in all-time adoption.

We collected historical, daily download data from 6 of the leading open model providers across the world – Meta, Google, Mistral AI, Microsoft, Alibaba Qwen, and DeepSeek AI. Grouping by locality we can see America’s early lead with Llama, Europe’s surge with Mistral’s early viral releases almost surpassing the U.S. in April of 2024, and a consistent acceleration from the Chinese providers until they’re surpassing the U.S. this summer. As of August 2025, the leading U.S. and Chinese models both have around 300M total downloads on HuggingFace with the Chinese rate of growth being notably higher. The growth rate for European models has remained lower, with their cumulative downloads reaching around 100M today.

An important benefit of open models is the ability to finetune them, a process to adapt a given model to a specific purpose. This process is at the heart of academic research and important for businesses to shape a given model to their individual needs. While there are more cumulative derivatives of American models at the moment, Chinese models are gaining momentum, especially this year.

Early in 2024, Chinese models accounted for 10-30% of the new finetuned models appearing on HuggingFace. Today, derivatives of Alibaba’s Qwen models account for more than 40% of the language models appearing on HuggingFace month over month (the overall picture is quite similar to the downloads data) – and that is just one of China’s leading open model laboratories. Meta’s share of derivatives with the Llama models has dropped from a peak of nearly 50% in the fall of 2024 down to only 15% today. With far fewer open model options appearing from the U.S. or Europe, the proportion of Chinese models in the AI ecosystem is expected to continue to rise.

What the ecosystem needs

We can fix this. America has the talent, compute, and capital to lead open model development – we just need to get them to the right place.

The tone for change is well represented by the White House's recent AI Action Plan, which paints a much clearer vision for the benefits of innovation and adoption globally to far outweigh the current measured risks. This represents an inflection point in the perception of open models, especially in the United States, but we still have a long way to go to support this vision with artifacts and actions.

The United States has a thriving AI research community, but it is missing the models that it itself has created and has complete knowledge of in order to create clear, and rapid progress. For example, the area of research with the most excitement following recent reasoning models is reinforcement learning with verifiable rewards (RLVR). This research has largely been performed on Alibaba's Qwen models from China due to their strong performance across math, code, and STEM benchmarks.

There are two categories of truly open models that we need in order to lead on all metrics of open models defined by how AI is studied and used. Both are essential and complement each other and the rest of a leading AI ecosystem. The best outcome is when these are accompanied by training data, intermediate checkpoints, base models, training code, and permissive licenses accepted as standards for free use by the AI community. These models with everything released, currently less common across the industry, are known as “open source models” to clearly note the benefits that come with more knowledge of how it was built.

First, we need leading open models at the frontier of performance. These should be the best models in the world and can be complementary to offerings from the leading closed AI models built in America, offering cheaper costs and more modifiability. The fundamental insight driving the recent rapid buildout of AI training infrastructure is the idea of scaling laws – this applies to open and closed models alike. The ballpark of scale needed to reach the leading edge of performance today is 200 to 600+ billion parameters with a mixture of experts (MoE) architecture – a size range used for all the leading open models from the U.S. and China in 2025 that challenge the best closed models on intelligence benchmarks.

With these leading models, we need a family of related models across a variety of sizes to allow every application and direction of study to be addressed. This is a standard adapted by leading open model suites from the U.S. and China alike. Only the most challenging tasks need the largest models, and for the rest of the tasks facing AI there needs to be tools to understand the minimum model size to solve certain simple tasks. A distribution of model sizes from those that can run on your iPhone to those that are assisting with the hardest intellectual work and everything in between creates maximum opportunity to advance and integrate AI broadly.

The entry point to train models of this size distribution is a cluster of compute on the order of 10,000+ leading GPUs. It is standard for top models to be trained with small teams of fifty to a few hundred people. A famous number on the cost of training frontier AI models from earlier this year was the often quoted $5 million figure for DeepSeek V3 – this is misleading on what it actually takes to develop these models, and the authors of the DeepSeek technical report acknowledged so much. 10,000 GPUs provide an entry point for rapid iteration concurrent to large-scale training.

America should target having multiple centers producing excellent open models. This serves to de-risk progress on training these models, given the urgency of the mission, but will also allow for a more diverse set of artifacts and for the research groups to learn from each other without first making the training organizations so large that progress is slowed.

There are many avenues to obtain and allocate these resources across multiple stakeholders. We need to engage across private companies, philanthropic institutions, and government agencies. Programs such as the National AI Research Resource (NAIRR) are important for broadening access to resources related to AI research including compute, data, software, and models, but these ecosystem-wide solutions are not enough to create breakthrough models as China is with concentrated bets. We need immediate, targeted interventions that can deliver frontier open models within 6-12 months, not years.

As many organizations around the world create strong AI models, it is becoming clearer that with the right compute and talent, strong models can follow. The formula we must follow is delivering these resources with the directive to release the models openly, then we can solidify American AI leadership. Every stakeholder – from tech giants to philanthropies to federal agencies to researchers and engineers – must ask themselves: Are we funding or participating in the future of AI research, or are we ceding it to competitors who understand that open models are the foundation of AI supremacy?

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Ross Taylor on the state of AI: Chinese open models, scaling reasoning, useful tools, and what comes next Jul 29, 2025

I’m excited to welcome Ross Taylor back on the podcast (and sorry for the lack of episodes in general – I have a lot going on!). The first time Ross came on we focused on reasoning – before inference-time scaling and that sort of RL was popular, agents, Galactica, and more from his Llama days. Since then, and especially after DeepSeek R1, Ross and I have talked asynchronously about the happenings of AI, so it’s exciting to do it face to face.

In this episode we cover some of everything:

* Recent AI news (Chinese models and OpenAI’s coming releases)

* “Do and don’t” of LLM training organizations

* Reasoning research and academic blind spots

* Research people aren’t paying enough attention to

* Non language modeling news & other topics

Listen on Apple Podcasts, Spotify, YouTube, and where ever you get your podcasts. For other Interconnects interviews, go here.

Show outline as a mix of questions and edited assertions that Ross sent me as potential topics.

00:00 Recent AI news

Related reading is on Kimi’s K2 model, thoughts on OpenAI’s forthcoming open release.

* What did you think of Z.ai’s GLM 4.5 model (including MIT licensed base model) with very strong scores? And Kimi?

* What will OpenAI’s open model actually be?

* What do you make of the state of the ecosystem?

12:10 “Do and don’t” of LLM training organizations

Related reading is on managing training organizations or the Llama 4 release.

This is one of my favorite topics – I think a lot of great stuff will be written on it in the future. For now, Ross asserts…

* Most major LLM efforts are not talent-bound, but politics-bound. Recent failures like Llama 4 are org failures not talent failures.

* Most labs are chaotic, changing direction every week. Very different picture from the narrative presented online.

* Most labs represent investment banks or accountancy firms in that they hire smart young people as “soldiers” and deliberately burn them out with extremely long hours.

36:40 Reasoning research and academic blind spots

Related reading is twopapers point questions at the Qwen base models for RL (or a summary blog post I wrote).

I start with: What do you think of o3, and search as something to train with RL?

And Ross asserts…

* Most open reasoning research since R1 has been unhelpful - because not enough compute to see what matters (underlying model and iterations).

* Best stuff has been simple tweaks to GRPO like overlong filtering and removing KL divergence.

* Far too much focus on MATH and code - AIME has tens of samples too so is very noisy.

* People are generally building the wrong kind of environments - like puzzles, games etc - instead of thinking about what kind of new capabilities they’d like to incentivise emerging.

50:20 Research people aren’t paying enough attention to

The research area I hear the most about right now is “rubrics” – a per-prompt specialized LLM-as-a-judge to replace reward models. SemiAnalysis reported OpenAI scaling this approach and lots of great research is coming out around it.

I start with: What do you think of the state of RL scaling and generalization? What of models losing

Ross asserts…

* Rubrics are underhyped on social media - they were driving force behind projects like DeepResearch - and GenRMs are interesting but perhaps slightly overhyped.

* There is an evals crisis - there are not enough high quality evals, particularly for frontier tasks like automating research and real life work. Impediment to anyone building agents or ASI.

01:02:46 Extra stuff!

I ask Ross: What AI are you using today? Why?

To conclude, Ross wanted to discuss how AlphaEvolve has been underhyped on social media, and means the future isn’t just RL. Shows there are other effective ways to use inference compute.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Transcript

Created with AI, pardon the minor typos, not quite enough time this week but I’m hiring someone to help with this soon!Nathan Lambert: Hey, Ross. How's it going? Welcome back to Interconnects. I took a many month break off podcasting. I've been too busy to do all this stuff myself.

Ross Taylor: Yeah, I was trying to think of all the things that happened since the last time we did a podcast a year ago. In AI time, that's like two hundred years.

Nathan Lambert: Yeah, so I was looking at it. We talked about reasoning and o1 hadn’t happened yet.

For a brief intro, Ross was a co-founder of Papers with Code, and that brought him to Meta. And then at Meta, he was a lead on Galactica, which was a kind of language model ahead of its time relative to ChatGPT. So if people don't know about Galactica, there's a great paper worth reading. And then he was doing a bunch of stuff on reasoning with Llama related to a lot of the techniques that we'll talk about in this episode.

And now he's doing a startup. I don't know if he wants to talk about this, but generally, we talk a lot about various things. This got started through o1 and trying to figure out scaling RL. We started talking a lot but then we also just resonate on a lot of topics on training language models and other fun stuff - and also trying to be one of the few people not in these big labs that tries to talk about this and think about what the heck's going on. So we're gonna kind of roll through a long list of a lot of things that Ross sent me that he wanted to talk about, but this will be a compilation of the things that we've talked about and fleshing them out outside of the Signal chat.

So, Ross, if you want to introduce yourself more, you can, or we'll just start talking about news because I think a lot of people already know you.

Ross Taylor: Yeah, let's get into the news. There’s lots of fun things to talk about.

Nathan Lambert: So, the last two weeks of Chinese models. I think we had Z.ai's GLM 4.5 today. Kimi-K2 last week. I think Qwen is on a roll. I thought summer was supposed to be chill but this is crazy.

I haven't even used all of these. The pace is just incredible. And all the open models have actually good licenses now. But is this going to hurt anyone in the US? Where do you see this going in six months?

Ross Taylor: Yeah, so yesterday was the one day I actually tried to turn off Twitter. And so when you told me in the morning about the new GLM model, I had to read up on that. So that shows if you take your eye off Twitter for one second, then you’re behind on open source...

But yes, I think the general theme is that it’s been absolutely relentless. So thinking about the last time I spoke to you on the podcast a year ago, Llama 3 was a fairly established standard.

There were still things happening in the background, if you paid attention to things, but now it's absolutely relentless. In the case of China, I think their business culture is that - as soon as they find something is successful - they’re very good at concentrating resources and going after it. So it’s created a very competitive space.

I think the context is very interesting in several different dimensions. There's the geopolitical dimension, which you've hinted at in some of your blogs. For example, what does it mean if the open source standard is Chinese? What does that mean if we think about these models not just as things which power products, but as (critical) infrastructure? Then it seems like China has a great advantage if they want to be the standard for the whole Global South.

Nathan Lambert: Yeah. There are a few things that we're going to come back to in this conversation that are so interesting. We're gonna roll into what it takes to train these models. And we're going to talk about how crazy, political and hard it is in the US. But we have all these orgs popping up in China - so is this partially just a US problem?

But then we also have OpenAI that's supposedly going to release a model. There are multiple things. But my question is: why is China doing so well? Are they well suited to training these language models?

Ross Taylor: I’ll caveat what I’m about to say by saying that I want to be careful about making generalisations. Because, for example, we’ve seen some of these new Chinese organisations be good at innovation - for example, this week we had GSPO which was nice. But for Chinese orgs, my general sense is that, once something has already been validated, the specification for what to build has been set, and the task can be reduced to an engineering problem, then Chinese culture is very well set up to succeed in those situations.

The other dimension which has become relevant - especially after DeepSeek - is that the Chinese Government has traditionally been very good at recognising what’s successful, pouring resources in, and facilitating public-private collaborations. I think that surprises people still in the West. For example, people are surprised that a group can come out of Tsinghua can and fairly quickly have their own state-of-the-art LLM. Why isn’t there a similar story for groups coming out of MIT?

Nathan Lambert: I’m not sure about this.

Ross Taylor: I think the US will eventually wake up to this, but…

Nathan Lambert: My understanding is that Z.ai is a startup that spun out of Tsinghua, so I don’t know if it’s the best comparison. Also Alibaba is the clear winner here because they have Qwen, but they’ve also invested in Moonshot, which is Kimi, and then I think also Z.ai.

So I’m more interested in the question as to why they are all open. That seems more important relative to talent because there are lots of universities that might have model orgs spinning out of them - even in the US - and it’s not solely a Chinese thing.

I think it could happen with a group out of MIT. That being said, I agree that the US should have more compute deployed for academics and a lot of universities are just spinning them up now. It just takes a long time.

So I think there’s a lot of things that Twitter is mixing up here. There's a good tweet in it, but I don't think it'll be 100% true, which makes for a very viral tweet when it feels true.

Ross Taylor: Yeah, I think there is definitely naivety about how things are actually working (in China). And there’s asymmetric information, in that you don’t truly know what’s going on in the inside of these organisations.

The other thing worth mentioning - which is maybe a separate topic - is that there’s a tendency to see open models as a homogenous category. But there are very different use cases. So if I want to do a new reasoning paper, I’m going to use a Qwen model. But then if I’m doing distillation, I’m going to use DeepSeek or Kimi.

This discussion also relates to OpenAI’s rumored open model: because in my mind I still don’t quite see how it will fit into the ecosystem. Because is it going to be something that people build research on? If it’s a post-trained model, then probably not, right?

Nathan Lambert: Yeah. But their tweet was about safety, so I doubt it is a base model if they’re delaying it for safety. I do think they actually delayed it for this reason. It’s very much in OpenAI’s culture. But I don’t think it’s going to change the ecosystem. It will be an interesting one off.

I also don't expect them to release a model that's based on their GPT architecture. My bet is they take an off-the-shelf architecture like Qwen or Llama. A lot of the recent OLMo models are very Qwen-y. And they will also be deciding sizes based on what fits on what cluster - e.g. Qwen is very deep rather than wide, and OLMo 2 is very similar to that. So I think the OpenAI model is going to fit that mold.

Ross Taylor: I think so. I guess one way to think about it is they're just trying to “distill” their RL infrastructure into weight space, right? As opposed to publicising their (internal) architectural choices.

But back to the discussion, and maybe this is a question for you Nathan, but do you think their model is going to be more comparable in use case to a Kimi or DeepSeek? Or is it more similar to Qwen? Or is it actually something completely different, like an on-device model? A smaller model?

Nathan Lambert: I expect it to be smaller. They joked about on-device, which I don't know is the right framing.

Ross Taylor: Yeah.

Nathan Lambert: I'm also just now realizing how - if RL is their great strength - then part of the challenge of shipping an RL model in open source is that you need your training infrastructure to match the inference infrastructure. So unless they train this on an exact VLM that people have access to - and some open source environments - then they can’t just dump the model and expect people to be able to do search and code execution in the open model stack.

I don't know exactly how Qwen and DeepSeek have gone about this. My impression is that they're actually not as useful in terms of tool use because it's so hard. I think that tool use is naturally a closed model reinforcing thing because it benefits to have these tools match up.

Ross Taylor: So the Qwen models are pretty good at things like function calling. Kimi - at least in the benchmarks - was also pretty good at agentic tool use benchmarks. And then - this is a separate discussion - but they had this nice training innovation where they use lots of MCP servers in a synthetic data strategy. But then again, you’re mostly seeing indications of capability in headline evals, which you shouldn’t really trust anyway.

Nathan Lambert: I think of Claude 4 as the release that ended eval chasing. On paper the release was so lame, but it delivered for everybody - which is very bold because there is a lot of money on the line. They are constantly fundraising and if one fundraiser gets spooked because the release numbers are bad, then that’s a lot of CEO calls that they have got to make.

Ross Taylor: On evals, I was thinking about this a few months ago. It might have changed now given the pace of AI development, but I was thinking about how you might split up the impact timeline for a release.

So day one is headline benchmark numbers - which are mostly b******t. Like I’ve got this amount for my model on MMLU Pro. But then the next tier of impact is the day after the release where people have all these weird bespoke evals on Twitter.

Nathan Lambert: The pelicans and the rotating hexagons and balls…

Ross Taylor: Yes, and by this stage you’re getting more confident. Because unless the model developers are very smart (which some of them are), then they probably haven’t optimised for day two benchmarks. So at that stage you’re beginning to believe that the model actually generalises beyond the headline numbers.

And then finally you have a week or two weeks after the release where you can say that you’ve tried the model quite a lot now, and you then have real confidence that the model is good.

Nathan Lambert: Yeah. Refute my claim: Chinese providers are still optimizing for benchmarks more than OpenAI, Google, and

Ross Taylor: Yep, I mean it’s probably true.

Nathan Lambert: It feels so obvious to me. I think that China has closed the gap to a remarkable degree, but I don't think they've caught up fully. I think that's hard. It’s very hard to get all the data and pipelines in place. A lot of it is actually user data, knowing your user, and hill climbing that. So for example, all these APIs not working is a huge issue for them.

Ross Taylor: Yeah. I think (Chinese models) have also been helped by the fact that a lot of the academic work that builds on them has been doing reasoning work in publicly available data domains like math and code.

The models have been heavily optimised for these domains anyway, so the model developers are not quite as exposed - since people aren’t really testing the true generalisation capabilities of the model. We already know that the Qwen models are heavily mid-trained on math and code, so they will hold up performance-wise there.

Nathan Lambert: Yeah. Okay, this is a good preview for the episode. I think that the main things are going to be how to build good organisations, and then academic reasoning research and how to bridge the gap. I think we can talk starting about org charts.

So how do you make a good org? Or maybe there are two things. One: how do you make a good org chart for training language models? And two, how do you make an effective culture?

I think this is quickly becoming one of my favorite little niche interests because there's just so much intrigue in it. There's just so much money on the line to break everything. So you sent me some hot takes if you want to read them, but the floor is yours for what doesn't work.

Ross Taylor: Sure. So if anyone’s been on social media recently, the general trend nowadays is to check your phone and see these NFL draft style tweets about researchers moving between orgs.

First of all, researchers have always moved between orgs. This is not a new thing. And a lot of the org moves that were talked about - at least outside of Meta - were just regular moves.

But I think the bigger mistake on Twitter is just the tendency to see the bottleneck in LLM projects as skill issues. And at least from my n=1 experience, that has never been the main bottleneck for success.

There are a number of ways to make this case, but I think I'd start by saying that machine learning is a heavily empirical science. So what does genius mean in that context? What does talent actually mean?

There are certainly some skills which are useful - like how do you form the right minimal viable experiment? And how do you iterate fast to explore a research direction where you’re going to hit a lot of dead ends. But a lot of it comes down hard work, good infrastructure, and ultimately resources.

So in that context, most of these orgs - even before public failings - had very good people. And I don’t think the difference in talent between orgs is that large. Smart people will eventually figure things out. So therefore, more often than not, the difference between a good versus a bad model is reflecting an inefficiency in the ability to channel resources to your talent. And that is the fundamental point.

Now you could say, on the flip side, okay, Ross, well, if that's true, why is Zuck paying people these massive amounts of money? And I think that's a separate question. But yeah, more often…

Nathan Lambert: Well what do you think?

Ross Taylor: I am torn on this because, on the one hand, I think the new group will probably make very good models. They’re very smart people. And I think having a new org as well is the right way to do it.

I think in leadership's mind, it's a case of “Look, we tried this multiple times, we’re very serious about this, we have resources, so let’s do the maximum conviction play”. And I think that's broadly what you should do because it’s a big expense, but it’s not massive, massive spend (for these large companies).

But on the other hand, I feel sorry for - this isn’t a Meta point by the way, but a general point - but I feel it’s a shame these organisations don’t have good mechanisms to identify the talent they already have in their orgs and have to recruit externally.

The talent that has already done the hard work, that is. It’s a shame they have to hire externally and start afresh. That’s the tragedy.

So that’s the conflict in my mind. I think they’ll make great models. I think it’s the right approach to do things afresh. But at the same time, it’s a shame that all the people that came before them, and made the previous generation of models, are treated like an asset. In the sense that you’ve used these people - grinded them really hard - and now you’ve moved on to a new group of people.

Nathan Lambert: You put this in your provocations. You said LLM labs are like investment banks where people are slotted in to burn out and burn through. I know that a lot of the work that needs to be done is somewhat mundane data work and it can be parallelised - e.g. if your users are asking this type of question, let’s create new prompts and manage human works and create synthetic data pipelines. And that works a lot of the time.

But then, I remember the Dwarkesh podcast with Sholto and Trenton - and it’s the one where they’ve both moved jobs (which reinforces your point), but they were saying you just need to convince someone at a frontier lab that a particular problem is important. I.e. people talk about things, but they just have to do it.

So is it the case that people are just dispatched to solve specific problems, or do individuals have free rein, and it’s fun on the ground because you choose the things you want to add to your beautiful final model?

So you can present a positive and a negative. It might vary across labs, but I guess your provocation is that there's a bunch of places where it is a meat grinder and you just put people in and chew through them.

Ross Taylor: I think so. Unfortunately the model for a lot of successful tech companies is to get very young, motivated, people - with a base level of intelligence - and make them work very long hours on a project with a big mission. This was the classic Elon way to run a company.

But this is also the model for a lot of frontier labs. You have your soldiers who - on the surface - look similar to quants at hedge funds from like 10 years ago in terms of their working hours. And in the culture too, you have friendly competition between people who all want to be the best.

Nathan Lambert: I will say: I know a bunch of people at OpenAI, and they do work crazy hours. I also work a lot, but I do a lot of things that aren't grinding data to go into the model.

Ross Taylor: Yeah, so on the question of decision-making, I think major decisions are generally made by people who are a little more experienced and already have some successes to their name. But you do need to have soldiers in this kind of environment. The space is just highly competitive (and requires people to work long hours).

And I think that's a shame. Even for myself right now, where I’m trying to build a startup, I’m thinking that - yes, we all need to work hard - but is there an alternative model where you invest in your employees instead of using them? - i.e. burning them out and then moving on to a new group. That’s what I’m trying to work out for my new company.

Nathan Lambert: I feel like a lot of people are just more cynical now in tech, myself included. I got a great cold e-mail from someone fresh out of undergrad, and I was pretty sure in two to three years this person would be legit. And I was talking to a coworker on how we could potentially capture this and invest in them. And they were just saying we might get them, but then they’d just go to OpenAI in 2 years. So we don’t get any of the upside.

I think some of that is just cynicism. Investing in people is still the right thing to do because you’ll end up keeping the ones that are a bit more grounded even if it is really hard. For example, I've lost people that are extremely talented that I wouldn't want to keep. So I don't know how to balance that cynicism versus reality of building teams in the long term.

I guess smaller teams might be a bit easier to maintain, whereas if you’re at a tech company, the churn is hard to avoid because there’s so many levels in moving up.

I think some of the rumors around Meta and Llama 4 - at least from the Dylan Patel SemiAnalysis article - were about them doing these cowboy crazy model training runs, including changing pre-training mixes half way through, and that maybe points to dynamics with middle management wanting their data to be used so they can get promotions. But most labs I don't think are doing that type of s**t for their leading models. And I don't think Meta is normally doing that. I think that was a pressure cooker side effect.

Ross Taylor: I would push back on that a bit by stating that all of these labs are deeply chaotic places (not just particular orgs). They change direction every week, right? That’s just the nature of the field we’re in.

But then, it is definitely true that certain labs are good at projecting, at least externally, that they have their s**t together. They have AGI internally, all this kind of b******t.

The truth is that it is a shitshow everywhere. It's just that if you're going to be a s**t show, you at least want to be a functional s**t show, and you want to make good models. Right?

As I mentioned before, I think there are new plays to be made around taking the view that you want to invest in your talent as opposed to just grinding them out. But I would also say that, in lab culture, people tend to overvalue raw talent again - especially in empirical science. If you take the view that an empirical science is mostly about experimental velocity, then you don’t just value infrastructure in that world, but you also want to hire folks who are very collaborative and who want to help each other.

It sounds like a b******t point in a field that lionises individual intelligence, but I just feel that if you're making a marginal hiring choice, then you have to think about how someone adds to an existing group? So I think there are new plays to be made on talent.

But there is nuance. Because there are certainly people who are especially productive. I’ve seen that in person. So it’s not like everyone is equal - that is definitely not the case - but I just feel that individual talent is overemphasised when problems in these orgs are mostly structural.

Nathan Lambert: The differentiation right now is people who are willing to put more highly focused hours turning the crank. Every organisation has the baseline time costs of needing to do meetings, commute time to work, commitments etc. But in terms of AI, where people are doing more and more, this really favors young people who don’t have a lot of responsibilities.

Ross Taylor: This is maybe a transition onto another topic, but I’d make a more controversial point which is that - even the things in ML which seem more like novel research are more the result of persistence rather than inspiration.

For example, this time last year we were both speculating about what o1/Strawberry was. And speculation makes you think it was some amazing new thing. But actually it was basically what we were both doing two years ago right? Essentially RL from verifiable reward, but with very good base models, because they were in a good position to exploit that, and then enough ablations to find a recipe that worked.

So this is oversimplifying things a little, but we should take the view that they just had to do the work to make the recipe good. And that comes down to experimental velocity, and also having the right infrastructure and a good enough base model. So in that world, what is talent?

Is talent the person who says “we should make the models think more”, or is talent the person who is actually on the ground doing the ablations to find out which recipe works? Right? Because I can also make models think more by best-of-N, but, then there may be better ways to do it?

Nathan Lambert: I mean, I think I analogize a lot myself with my athletics career - like rowing in college. I think so much of it is the same. I wasn't the most gifted athlete, but if you put in the hours and you understand where you're spending your effort, it works out for people.

The question I wanted to ask you on this topic is, given that that these orgs are so chaotic, then what does this mean for the ceiling on progress? One of the most coveted questions is about the trend line. There are obviously going to be new paradigms - inference time scaling was an obvious one if you thought from first principles about what compute and intelligence is - but even if we don’t have a new paradigm, then what is the ceiling?

Ross Taylor: I would say that, even in climates where most organisations are chaotic, you’re still going to have macro factors that lift all boats. So a good example recently was these gold medal results on IMO. Three or so different labs all had different approaches and all found they crossed the threshold for a gold medal.

If you were to zoom out - and one way to do this is to imagine you're looking twenty years into the future back at this time - then would you look at the individual methods that researchers used, or would you just say compute reached a critical threshold where things began to work?

So compute is the big exponential that's underlying all of this. And then if you zoom into a shorter time horizon, then you're seeing more of the local challenges, like what’s the particular bottleneck at a point in time? So maybe the bottleneck to agentic models is scaling RL environments. Or maybe the bottleneck to better reasoning is longer context windows.

But look: fundamentally as long as compute keeps coming online, I think the trends look good and all of the organisational chaos is short-term noise. It slows down progress a bit but is not meaningful in the long-term. But, unfortunately, it's still meaningful for people in their careers because one to two years of organizational chaos could matter personally. But on longer timelines, it doesn't really matter.

Nathan Lambert: Yeah, I agree. It seems like the question is what happens when the fundraising starts to slow down. We're on a trend line of compute rollout. But if Sam Altman can't raise again, that is a very big sign. That's like the end of the “bubble”. OpenAI is not going to go away because of that, but if they can’t get the next cluster… then that would be a bad sign.

Ross Taylor: I'm quite optimistic because I think you only have a bust if AI ceases to be increasingly useful or doesn't live up to certain promises. But even if there's no algorithmic progress, I still think AI will continue to continue to be increasingly useful. I don't think there are fundamental barriers. It's just a question of how quickly you get things right.

I think the argument would have been slightly different two years ago. If the reasoning paradigm didn't come through, then I think it would have been trickier to justify the expense because then you'd be looking at reasoning benchmarks and thinking: s**t, to push this forward I need this amount of data annotation or need to generate this amount of data.

Nathan Lambert: You look at GPT 4.5 as the example.

Ross Taylor: Yeah, exactly. That's a really good example. So you can treat that model like a counterfactual universe where reasoning didn't happen. There we would all be looking at the model thinking “it's good at creative writing, but maybe not so good at some more things we really care about (like reasoning)”.

By the way, I'm sure it’s a really good model. I didn’t play with it enough to form a good judgement.

Nathan Lambert: I've been using it a lot. I used it for a long time - especially until Claude 4 - as it’s just nicer, especially when GPT 4.1 was so sycophantic. But GPT 4.5 was nice.

Ross Taylor: So I'm gonna flip things around and ask you a question Nathan. Let's say we are here in a year's time. What does the key benchmark look like for LLMs that everyone is focused on?

Nathan Lambert: Oh, it's fully gonna be some agentic thing. I don't know if it'll be as stupid as making money on the stock market… I wrote a post on what I thought was coming next. One of the most poignant things I was looking at is the fact that scaling models is no longer the direction anymore. All the marketing is shifting to agents. And I think some of that is because it's not easy to scale parameters anymore.

Every RL curve is this log plot, and it becomes hard. But agents are already beginning to work well. For example, this year Claude Code showed up. There's gonna be versions of that in all sorts of domains and more people working to evaluate them. That will create an interesting marketing problem where labs need to figure out how to communicate that their model is good.

But the future looks like it’s all on the agentic side, and will lead to a big shift in what the language modelling companies need to think about. The prioritization of the company is also different, whereas modelling was always central before. I’m still modelling-pilled and think that is the central thing for the company…

But it’s true that now that teams building products are going to hold more weight than they used to. And there will be interesting changes in how these companies manage this transition, and how communications change.

So, I think Claude Code is great. But I think that it's hard to integrate in some things. For example, how do I get that running on my cluster at AI2 where we have all of our data and models, launch evals from our file system on the GPU machines. I don’t think that quite works yet, but maybe I’m doing something wrong.

Ross Taylor: Yeah, I agree with your answer. So I spent several years working on Papers with Code, where we were trying to focus heavily on evals before they were a big thing - trying to index all these various leaderboards. And I think now is an interesting situation because I feel like if you make good evals now, you possibly have more leverage than you've ever had in the field of ML..

This is a weird thing because traditionally evals were quite an unsexy thing to do. It was a thing that researchers didn't want to do because they'd rather be training models. But now the ability to define a metric for a capability that you'd like to see - e.g. trading stocks, or doing scientific research - is just incredible leverage that you can wield. A small group of people in places like universities can say “this is the new north star that we should achieve for agents” and shape how AI progress evolves.

Nathan Lambert: It can happen. We recently released IFBench, a benchmark for following instructions which is just more constraints and a different prompt sourcing. And I was saying to folks that we need to have the goal of making at least two frontier labs adopt it. And I messaged various people, including someone at OpenAI, and they said they already integrated it last week.

So yes, someone doing research (on evals) has a shot at getting into the OpenAI internal evaluation platform.

Ross Taylor: Exactly, so it's incredible leverage. And then the other interesting thing is that the friction for making and using good evals is going to increase quite a lot.

For example, in some of the recent benchmarks, you need the RL agent to have access to a GPU and then you need to spin up lots of these servers to do rollouts. This is expensive. Long gone are the old days where you had two CSVs with a train and a test split.

And then on the eval creator side, there’s a big difference between good and bad evals as models become more capable.

A bad eval just means that you're going to get incredibly egregious reward hacking, and you're not going to learn anything useful, whereas a good eval is a pathway towards a brand new capability.

Nathan Lambert: I have a related question on this. So I see three eras in evals based on what people are doing with models.

For pre-training, the best evals are testing knowledge and these very broad things and are hard to game. It's just kind of like FLOPs.

At post training, a lot of evals are formatting and extraction. I think formatting became even clearer to people when these RL environments became the hot new thing. And I actually think that post training might be like the ugly duckling in the middle, where then if you go into agents, all the agentic tasks are gonna be evals of actually doing things and you can't like format-lie your way through that. So it might be that post training evals are the hardest one to get right.

Ross Taylor: Yeah, and I think you're going to see more cases of people claiming good results, but when you look beneath the surface, you’ll see insane reward hacking. So the meme right is KernelBench evals. Have you seen these?

Nathan Lambert: Oh.

Ross Taylor: You see all these amazing speed ups which aren’t even possible based on the hardware. And this is not a problem with KernelBench, I would say it’s more a problem with people publishing papers for agentic evals and not looking at their results carefully.

So this shows that to get an eval in the right place takes a lot of work. And even with progress in models, I don’t think you’re going to be able to fully automate the construction of a good eval in the next year at least. I might be wrong. Models will certainly help us in creating evals. So I think that, for now, it’s a place where a researcher can have a lot of leverage.

I think if you were to ask what is the central eval is right now, it'd probably be something like SWE-Bench (verified). But even that is now quite saturated. So there's a big blue sky now where someone can define what the next big task is for ML. And you don’t need a big cluster in order to be the one who defines it; so I think that’s quite exciting.

Nathan Lambert: Yeah. And when you think about the amount of money that'll be steered by these things, it's so crazy to have the uncertainty there and like who will come up with that as well. I think that it's part of what makes it fun, I think.

We should talk about reasoning things.

Ross Taylor: Reasoning. Yeah.

Nathan Lambert: Where do we start? I don't think I've ever done that much of a rant about the academic community chasing these things. I understand why academics are claiming to do new algorithms that get remarkable scores, but a lot of these papers are just extracting things that are hard to document from a model or something else or formatting

I was on one of these papers, which was hilarious. We figured out that if you train Qwen on random rewards, the evaluation scores go up. And we had to go through the logic on why this can happen.

Because if there's no reward, the advantage is zero and the gradients are all literally zero. And then it turns out that the algorithm manipulates the most common sequences. It's actually something that if you read a lot of the reasoning literature, people talk about how we want to make sure our algorithm doesn't squash uncommon sequences. And then the real hammer is that, if you do random rewards, then you see that the model has modal collapse onto the things that it was trained on. And that can make scores go up.

So if you have a model that two thirds of the time has a certain behavior in its reasoning and that behavior is good on the benchmark, then just by fiddling the weights a bit then it does that behaviour more. This points to a structural failure.

I would also say it is a good example for why people should be using truly open models for research purposes and why they're so good for innovation. For example, if we knew what goes in Qwen data and if someone just filtered it and it was like, oh, look, I found the found the GPQA prompts in it…then we know data contamination has happened.

The Qwen case is borderline - I don't know how exactly to characterize it because the Qwen models are fantastic - but there's so much research that is showing that they are very likely to be doing some dubious things in terms of benchmarks. It's hard for people that aren't super in the weeds to hold both of these possibilities in their brains.

So I don't know. What do you think of the last six months? Have we actually made any progress? Has the academic community made any progress?

Ross Taylor: I think there's been little progress. I mean that in the literal sense: there’s been some progress, but it has been little. I think I can answer this question in several ways.

So after DeepSeek-R1 came out, there were two approaches in open source more generally, which was either you go down the distillation route or the RL route to make interesting small models.

The initial thing that was undervalued - at least from an engineering perspective - was that for smaller model sizes, it is far more efficient to do distillation than RL.

Nathan Lambert: And not just in compute but also in performance? It's hard to do RL on the small models.

Ross Taylor: I think this point has been made twice now. So there was the original DeepSeek-R1 paper, and then more recently, there was a new Qwen paper as well. The Qwen paper showed that RL needed 17x times more compute than distillation.

So one way to think about this is that RL is a brute force lever to do data generation. But assuming that RL is still good, and you still want to do research on it in academia, then you run into a classic problem. And that problem is: if you don’t have enough compute, then you don't know if the structure you are imposing is gonna generalize (to high compute settings).

And my worry is that a lot of the results are on relatively low compute budgets, both in terms of the underlying base model, which determines how well the RL approach learns, but also the total number of RL steps. So it's just quite hard to see - unless there’s a massive gain - what’s truly important.

So the most useful things are - in my opinion - quite boring things. Like, there was the DAPO paper which showed that you should have filtering for overly long sequences, and you shouldn’t overly penalise them if your context window gets cutoff.

There has also been interesting work showing that even simpler approaches (than GRPO) might work, where you remove clipping. So Reka was doing lots of good work using REINFORCE leave-one-out (RLOO). But even there, it’s difficult because you don’t know if simpler algorithms are going to work with long agentic traces.

So it’s not clear. I think the recent work this week was actually quite good. The GSPO work was good, and if you saw their graphs…

Nathan Lambert: Explain it to people. I think a lot of people have heard of the other ones by now. But GSPO is group sequence policy optimization with Qwen Coder. Why are you positive about it relative to the other ideas? I think GSPO is well motivated but why is it getting hyped more?

Ross Taylor: So I hope I don't botch this because it's the morning. But, essentially, with GRPO, you assign a reward to the whole sequence (via the advantage). But you also have an importance weight, which is your policy likelihood relative to your old one. Because when you do RL, you typically sample lots of rollouts but do several mini batches for your gradient update. So that means you go a little bit off policy.

So to fix that you have an importance weight term. But in GRPO, while the advantage is uniform across all tokens, the importance weight is particular for each individual token. And the importance weight is calculated for a single sequence. So one way of looking at this is that, if you had more sequences to calculate the importance weight, it would be a lot less variance - but by calculating it on a single sequence, you introduce a lot of variance through that term.

So the short answer of what GSPO does is that, instead of looking at a token likelihood, they look at the likelihood of the whole sequence. So now the clipping is not on an individual token basis, but, it looks at one of the sequences in your group and says okay, this one is less likely, so we’ll clip out that sequence. And the TLDR is, at least from the results they show, it seems to be a lot more sample efficient.

I mean, it's not just 0.5 percentage points or something like that. But I think the reason I trust it more is that it’s very simple. And it’s quite directionally well motivated from just a basic understanding of importance sampling. If it were more complex, I'd be a lot more skeptical, but it's fairly simple and it seems to work well.

Nathan Lambert: Yeah, I'm still fairly skeptical.

I think academic research is relatively wide in what people are trying out but labs are relatively narrow. And once you’re further along in your modelling journey, you’re dealing with different parts of state space and then all these algorithmic tweaks just like help your model on whatever blocker it was or your implementation.

I thought for GSPO the sequence thing was funny because when you read the GRPO paper, you were like oh, the reward is just per sequence. But all the tokens in the sequence get the same loss function. But the standard implementation is to break it down per token. And then GSPO is essentially to take that standard implementation and you change the weight on every token back to this. And I was doubting whether this was really going to be a major thing.

I think for junior researchers, one of the good things about this era is that you can really learn the math by studying all these algorithms and thinking about how they are implemented. I hadn’t done that for a few years until writing this RLHF book on policy gradients and I was getting into the weeds like per-token loss, length bias for GRPO, and so on. For students to be able to do this in their brain, it is really good for thinking about the interface between algorithms and systems.

Ross Taylor: It’s interesting, because as AI became more hyped after ChatGPT, you have more people reading papers. This is a great thing, but also you have lots of new people reading papers in the wrong way.

For me the basic logic (for reading papers) is as follows: what’s the reported gain of the paper and how much complexity does it introduce?

So if you get a gain but the paper introduces shitloads of complexity, it's probably not going to stand the test of time. Whereas if it's something relatively simple, but it seems to get a good gain, then that’s the thing that is going to last.

Nathan Lambert: The o1 lesson. The simple thing. In RL research, I've heard it described as: if you see something that only beats the baseline by a few percent, it's not gonna work. But if it’s 2x then that’s a real innovation, because whether they finetune their baselines or not, they’re still going to be crushing it.

Ross Taylor: Exactly.

Nathan Lambert: So I think that's a good heuristic for people right now.

Ross Taylor: And I think researchers are their worst enemy because they want to see their own methods work. But the weird thing in ML is that neural networks “want to learn”. So if you push something enough, it will work. It's just a question of whether that is a good use of your time?

So the question is: what's the right thing to scale and push on? So that’s why - when you read papers - at least what I say to young researchers is that you should always judge how much complexity the paper introduces, and whether you trust the gain.

And then based on those three factors, you can judge whether it’s worth caring about the paper. But I can see why - if you’re new to reading papers - why you might be attracted to complicated, new techniques in papers that seem methodologically interesting.

Nathan Lambert: And researchers often manipulate the results of their peer methods in the way to tell a convincing story. And I think these algorithms are a perfect example of trying to tell a story.

Ross Taylor: Yeah.

Nathan Lambert: So when you think of cognitive behavior of paper authors, you have to take that into account too.

Ross Taylor: The other point I’d make is that - in the reasoning trace - I understand that everyone has to focus on math and code, because that’s where the data availability is. However, if a paper comes out and it’s just flexing on AIME and GPQA then that is just very uninteresting to me - and much more so than it would have been in February.

Nathan Lambert: I think code can be much better but it's hard to benchmark it. Describing what a good coding model is would take me an extremely long document.

That's not what the academic papers are doing. It would be great to have more benchmarks on that.

Ross Taylor: Yeah, and even the established ones have issues. For example SWE Bench has a very large proportion of issues from Django (so it’s not exactly representative of all software engineering). That’s not a burn towards SWE Bench - which is a great benchmark - but…

Nathan Lambert: They already won. They can take it - they won!

Ross Taylor: But, yes, it shows that there is a lot of detail to get right in making a good coding benchmark.

Anyway, it’s difficult because I am in this position where I can say - on the one hand - papers are just hill-climbing particular math and code benchmarks, and that is fundamentally uninteresting to me. But at the same time, I sympathise. Because there are not a lot of good open reasoning datasets in the open. And those that are open, I don’t think that they’re even going to be good for testing RL necessarily. They might test something more knowledge based, like medicine or something like that, which is less inference-time scaling bound.

Nathan Lambert: This could be a good time to transition. What is the status of RL scaling and generalizing? What is the status of RL outside of math and code? I think my prompt is: what do you think about o3-like models with this crazy search behavior and multi hop execution?

Ross Taylor: Yes. So first of all, I think it was greatly overstated that these models don’t generalize beyond math and code. I think what happened in practice is that, at least from what I know, OpenAI originally was very focused on math, logic and puzzles. And then eventually they had to broaden out because it was kind of too nerdy and biased towards these kinds of tasks.

But I don't think there was ever a question about their generalisation to other benchmarks. You could see that very early on. The way I think about this is: we started with math and code because it was easy to verify. And then through applying RL to those domains, models learnt certain strategies like “I shouldn’t answer early”, “I should check my work”, or “I should consider alternatives”. And at a very high level, if you just have a model that thinks for longer and checks its work more and considers more things, then that's gonna be useful for things beyond math. And that's reflected in the benchmarks.

That being said, if you want to get superintelligence outside of math and code, then yes, you probably need more specific benchmarks and datasets for that. So there the question is less about whether it generalises beyond math and code, but how far can performance go? And that’s when you get into interesting questions about, e.g, if you don’t have a numerical answer or whatever, then how do you verify things.

So rubrics are all the rage right now, but then there's also other directions like…

Nathan Lambert: Rubrics are so funny. It’s funny how they needed to be reinvented. Rubric is a funny name because it just seems like question-specific LLM as a judge. It's the most basic unit of evaluation or feedback.

Ross Taylor: So I think this was something that wasn't very covered in the open. So the reason why it became popular was that DeepResearch was the trigger. The rumor at least was - at least for OpenAI - they didn’t need many examples to do well in these kinds of research task.

It wasn’t tens of thousands of rubrics - it was probably in the 1,000-2,000 range of well-crafted rubrics for questions. But it clearly worked very well to teach a model how to browse the internet and synthesize knowledge. There's obviously infrastructural detail as well.

Nathan Lambert: What would a rubric look like for deep research in this case? For an essay it might be that the rubric says that an answer should be free of typos, have a clear argument and a good conclusion. It would have different checklists. But the DeepResearch example is more complicated and you might need to draw an example.

Ross Taylor: Yeah, so there are different themes you could have. It could be the general style of the answer. It could be - let’s say we want a review of the latest and greatest RL algorithms for reasoning - then there you might have a high level rubric saying that the answer should compare different methods, cover underlying algorithms, mention policy gradient, PPO vs REINFORCE and so on.

But then you might have, like, more detailed things where you just have a strong conviction on what a good answer looks like. For example, a review of RL for LLMs right now might include GSPO as of this week.

So rubric-based grading comes down to a list of checks, but the goal of that form of evaluation is that you’re trying to get a nice, continuous rewrad for the model to learn from - as opposed to something more binary and sharp. Because while 0/1 rewards might work okay for mathematics or unit tests, it would work less well for a task like making a good literature review on RL. The reward structure isn’t binary there.

Nathan Lambert: So how do you think of grader functions? I’ve thought about this for code, like the percentage of unit tests that pass. But then the model might just get the easy unit tests. So will reward shaping be here to stay or will it be washed away in the ever growing sea of compute?

Ross Taylor: I think it'll be washed away, but I think in the meantime, there's a lot of value in making very good handcrafted evals. And I hate the word taste, but there is still taste to begin with.

And I think a lot of these things are quite codependent, because to make a good rubric for a deep research task, then you need something that needs the ability to do deep research. If we were to say what makes a good literature review on RL right now, then that knowledge wouldn’t be in the weights of a language model - the model would have to go out and search for things.

Nathan Lambert: You can tell it that you need to use search in this question.

Ross Taylor: Yeah, if you haven’t done a search, then you're probably doing it wrong. So yeah, in the long term, it gets washed out because there's nothing a neural network can't do compared to a human. But in the short term, there's still a lot of nooks and crannies that a model wouldn't quite cover / struggle on.

Nathan Lambert: Can you create a generative reward model by training off a bunch of rubric data? Probably?

Ross Taylor: Yeah, so verification benefits from thinking time. And I think most people are aware of this now, but it's more of a question of how you actually execute that. So a generative reward model for something like math and code - where it's like a 0 or 1 reward that you’re trying to figure out by thinking - is less interesting to me then questions where you really need to think from first principles on how to assign reward.

In general, the simplest way I think about it is: if you're moving to a world where you have long agentic traces, then your “reward model” just needs to answer a simple question, which is: “is the agent making progress towards its goal?” Right? But that's a very deep question.

So if it's a Pokemon eval, then maybe a model needs to use its knowledge of Pokemon to figure out if the agent in a trajectory has been caught in a loop, and whether it should be going towards Lavender town instead of this other way.

So these sorts of verification tasks benefit from thinking time, but the devil is in the detail. Because if you’re not careful, you’re just going to spend an inordinate amount of compute trying to get a reward.

Nathan Lambert: It feels like there will be a lot more we will learn there. It feels obviously salient. I’d describe it as verification changing the slope of inference time scaling. And that's really, really valuable if you're spending a lot on inference, but we don't really know how to do this. Like parallel compute is another factor that changes the shape of that curve.

I guess it's really all a slope of a scaling law or like an offset or something, but it's hard to say which things are true in terms of what we're hearing. That's probably what they're doing other than this rubric stuff. It's just a way to get RL pointed at more problems, which is not surprising.

Ross Taylor: Yeah, I think RubricMania is in full force right now. I mean, I think the longer term question, which has been posed in several places, is what happens when verification becomes fundamentally harder?

So I'm quite interested in the scientific discovery question. But in a field like biology, you need to do a physical experiment in order to verify. So it’s not just a question of running things on a cluster. And if you want to simulate the underlying thing, well then you’re bottlenecked by the quality of the simulation - and it turns out to be quite hard to simulate some physical processes!

Actually - in most of the sciences - I think this is the other point I’d make” which is that in ML, people overvalue the value of individual “thinking” in something like science. They think of Einstein and they think a lot less about the data generating mechanism, and what's the instrument.

There is no Kepler without the telescope. There is no progress in biology without X-ray crystallography. There's maybe new theories on dark matter in space without even better, newer telescopes.

I know this sounds like a weird say in the context of RL, but if you’re thinking about very hard things to solve in the real world, then you’re just going to be bottlenecked by the need to build a better instrument to get data. So it sounds like a digression, but I’m saying that - in the long-term - you’re going to hit these bottlenecks for verification. But in the short-term, we can still solve very interesting things like Millenium Prize problems, but that will probably take quite a while too!

Nathan Lambert: Yeah, I don't have anything particularly eloquent to say on the scientific discovery point. I guess what will happen is that RL is going to be in training and then you just sort of punt it off to the rest of post-training. So models need to be able to get really weird, but not weird in a way that they are numerically lost.

I've been reading a lot of reasoning traces these days, and the Qwen and DeepSeek reasoning traces really just seem numerically lost for a while, and then they eventually get the answer right. They say “Wait” a lot and then go into half English/half Chinese, and end up getting the answer right.

My point is that I don’t think in their current form, that these things are vehicles towards (scientific) discovery. There’s some kind of fundamental research needed to make the reasoning process more real.

Ross Taylor: My other bear case against reasoning models is the following argument - and this is mainly a devil’s advocate point, because I still fundamentally believe. Since World War 2, there are a lot more scientists in the world. But has progress kept up at the same rate? If anything, I would say that scientific progress has slowed.

Was there more progress in fundamental physics now or in the last century? And I know that is mainly because the low-hanging fruit is gone in many of these fields, but it could also be a bear case for AI because it hints that the bottleneck in science is the amount of intelligence on a problem, but maybe the speed of physical processes, or the ability to build better instruments for measuring, or the ability to get funding from governments to build bigger particle colliders…

I'm exaggerating the bear case because I think AGI mostly means autoating regular activities - law, finance and these kind of industries - and I think that’s a lot easier to do. But I’m attacking this mindset that says - now that we’ve solved reasoning - the takeoff is going to arrive in the next few years. From what I can see, that is very unrealistic.

Nathan Lambert: I'm very I'm bullish on AI being used and bearish on whatever superintelligence takes. I think we’re too compute constrained for a takeoff. I think AI is going to be very good for financialization and digitalization and seamlessly globalizing the Internet and making all information transfer and acquisition effectively free.

Ross Taylor: Yeah.

Nathan Lambert: Which is really good. And I think historically, the US is very well-positioned to capture this by making products that run on top of cheap AI models.

Ross Taylor: Yep.

Nathan Lambert: I wanted to ask you what AI you actually use. I don't know if I've ever asked you it's normally revealing.

Ross Taylor: Okay, so the base models we’re doing experiments on are mainly Qwen - Qwen 3, but also Qwen 2 because we know the kind of quirks of that model a bit more. A lot of people do that. Then we also do some distillation jobs, where we’re mainly using DeepSeek-R1. We did use Kimi recently, but we didn’t see massive benefits for the benchmarks we were looking at.

Then from a personal productivity perspective, Claude Code is very, very good. My main worry with Claude Code is that - I think there's a paper on this - but people confuse agents making you more productive versus preventing you from exerting mental effort. So sometimes I'll have a day with Claude code where I feel like I use very little mental effort - and it feels amazing - but I'm pretty sure I've done less work.

That will change because the models get better, but I'm trying to teach myself to be a bit careful because sometimes I need to stay in control.

Nathan Lambert: It does seem like an equilibrium. I'm happy with it. I don't want to have to grind out some plotting code. I'm just gonna watch some sports highlights and let it do it for me. That's fine…

Ross Taylor: Yeah. But in general, there is a lot of positive feedback from the community on Claude Code. It’s a very impressive product for me.

Nathan Lambert: What is the niche of your use case, or is it a bunch of things? Is there something you could endorse? Do you use it in math or code tasks? Do you use it in your startup’s codebase?

Ross Taylor: It tends to be better with brand new codebases. But I mostly use it for tasks which are quite horizontally scalable. So I'll have some basic specification where I'll provide it with some example code of mine, and then say “here's what a good implementation looks like”, but I need this modification or twist done. Sorry, I'm being very vague because I don't want to talk about specifics, but…

Nathan Lambert: Yeah.

Ross Taylor: It tends to be better for that. And, yeah, where it becomes really bad is when the file size becomes too long. Then the agent tends to struggle and get into these weird line search doom loops. So, yeah, there's a bit of work to do where you have to structure the codebase a bit for it to be efficient. But in general, it’s quite helpful.

Nathan Lambert: It's such a success that pretty much everybody that tries it is doing at least small code projects with it. I think maybe since ChatGPT, there hasn’t been this strong of a reaction.

Is this like the GPT 3.5 level? Like, Claude 4 is like GPT 3.5, the original ChatGPT, and then a couple iterations it’s gonna be incredible...

Ross Taylor: Yeah, I guess the people who really appreciate Claude Code are developers. Right? But it doesn't have the mass appeal of ChatGPT, which could generate poetry or whatever at the time, which was the killer mainstream use case at the time…it sounds crazy now.

Nathan Lambert: But I guess pay for Claude Code. People won't pay for ChatGPT (laughs)...

Ross Taylor: Exactly. So maybe it's a better business model…

But, yeah, I think that's a good question. I wouldn't say it's a ChatGPT moment, but I would say it's probably one of the most impactful products since ChatGPT. It’s not a ChatGPT moment because it hasn’t got mainstream appeal yet. And the question is: what does that agent look like? I'm still shocked that Apple hasn't done anything yet because, for me, that would be the killer thing. We'll see if they get that s**t together.

But, yeah, I'd imagine it would be some sort of on-device model. That would be my guess. We’ll see

Nathan Lambert: Yeah, that’s fun. Did you also wanna mention AlphaEvolve? I've been so burnt by Google's hypey projects - like their chip design and stuff.

Is this like the AlphaGo story, where if you have a really high performance simulator, that’s well matched to a task and you can scale RL - like many actors in parallel - then you can get high performance? I talked to Eugene Vinitsky recently, one of my friends from Berkeley. And they were at Apple and they did this really parallel RL for self driving simulator, which was really awesome.

Is AlphaEvolve somewhat away from that, but is in the same vein of extracting simulators?

Ross Taylor: I think AlphaEvolve is very cool. In my mind, it's very interesting because it feels like we are going full circle. In the 90s, the cool things which didn’t quite work were genetic algorithms and neural networks. And it feels we often see a new lease of life for several algorithms once the right context develops and other components get in place

So in the case of AlphaEvolve, you're exploiting the strong latent knowledge of a neural network, but then you also have a neurosymbolic element….don’t read too much into that, Gary Marcus… where you have a database where you store past programs. And having that prior in the form of past programs is a very good way to exploit the internal creativity of a language model as opposed to creating from scratch each time.

Nathan Lambert: How does AlphaEvolve actually do this? I think a lot of people are not going to know what it is doing. I don't think I have a good knowledge of it.

Ross Taylor: Say you have a kernel optimization task. For example, you’re making good kernels for common ML architectures. So you start with a reference implementation, and then in essence, it's a bit like in- context learning where you’re taking that implementation and saying “propose a change”, and then you benchmark it and get a score. And then you have a database where you store that program and its score.

And then when you sample a new round, you have an algorithm - it tends to be based on island based algorithms - where you sample in proportion to the score but you also wanna explore a bit. And that's your new prior. So you're iterating and evolving a program.

Nathan Lambert: And this is just handed off to the language model? What is the language model actually inferencing? Is it inferencing new programs?

Ross Taylor: Yes. So imagine you're constructing your prompt. You fetch a past implementation from your database and it goes in. It probably has the score as well saying “this implementation above got this result”. Then you ask the model to propose a new change.

I am oversimplifying, but this is the essence of the approach. You propose a new change, you write a program, get the score, store it in a database, and then go again.

So, basically: anything where you can pose a neat optimization task, this algorithm tends to work very well.

This is a broader debate now about how AlphaEvolve compares to RL approaches. First of all, I think they can be complementary, but…

Nathan Lambert: Maybe the language model is trained with RL, I bet?

Ross Taylor: Yes, that too. The interesting thing by the way is that the bulk of the AlphaEvolve approach was not using the strongest Gemini model - they used a weaker model with faster inference. So that’s an interesting tidbit which is sort of anti model-scaling pilled. There is a nice balance to be found there…

But yes: back to RL vs AlphaEvolve. I think this is part of a broader trend on how you use compute and whether the approach is parallel or sequential. The AlphaEvolve approach benefits from parallelisation, but they’re not going into deep long reasoning traces (sequential) just yet. But you could use both approaches.

Similarly, with RL you usually solve problems from scratch. But you could also think of ways you might want to exploit good priors in the context window. Benchmarks like KernelBench sort of do that anyway, but they don’t evolve the reference implementation like AlphaEvolve does.

So I think it's definitely something to watch. I think AlphaEvolve is underhyped, but we’ll see many more papers on this direction soon.

Nathan Lambert: It seems like a sign of things to come - figuring out parallel compute in the right way. It might be that the biggest model doesn’t necessarily benefit the most from a parallel compute setting.

Ross Taylor: Yeah.

Nathan Lambert: I mean, there's a lot of ways you could think about this. Like, the guess is a 100 times cheaper and half as good…

Ross Taylor: Yeah. So maybe this is a bullshitty philosophical point, but think about it this way. In the past 5,000 years, humans have made a lot of progress, but their brains fundamentally haven’t changed. What makes us smarter is that we’ve followed an invention curriculum, where the next invention builds on previous inventions.

So in the RL context, that raises the question: would you rather start from scratch each time, or would you use the best thing you have and successfully iterate that by standing on someone else’s shoulders?

So this is definitely something to watch in RL space. Instead of AlphaZero-ing things from scratch, how do we maintain existing implementations and iterate upon those?

This is also related to how we develop language models, and the discussion we had about Claude Code. You can imagine having an agentic model that is very good for starting from scratch, but you could also have a model that's very good at dealing with an existing code base. And the question is which is more valuable? And the answer is both. But then depending on how you actually use those models, you might end up preferring a different model.

So I am trying to put AlphaEvolve into a much bigger context here: and see it as a bigger trend about how we use compute, but also how a model might learn to improve on a problem.

Nathan Lambert: Yeah, that's fun. There's going to be a lot more things like AlphaEvolve - where people with particular domain expertise do the muddling and figure things out and more things will fall out. It is very remarkable that a zero order optimizer like a genetic algorithm, just using prompts for language models, can get anything useful out. That is a major win for language models being a fundamental unit of compute.

Ross Taylor: Yeah, absolutely. And a major win for LLMs and creativity, right? Because the meme is like “Oh, LLMs can't be creative”, and I’m always thinking, at a fundamental level, the softmax is quite an expressive operation…You’ll get creativity eventually. It's just a question of how quickly you can pick it out from what you sample.

So, I think AlphaEvolve is also proof of creativity. You found many new state-of-the-art implementations in AlphaEvolve - and will see more to come in upcoming papers.

Nathan Lambert: I would also guess there's people doing stuff like that that don't publish it. Or they've taken different models and hill climbed in their domain by setting up these weird loops.

I think this is a good place to end things. I’m kind of fading. Thanks for coming back. I’m doing a trip to London at some point. I don’t think we’ve ever met in person, but that’ll happen at some point!

I think we're I mean, I'm kind of fading, so I think it's good. Thanks for coming back. I'm doing trip to London at some point. I don't think we've never met in person, but that'll happen at some point.

Good to see you.

Ross Taylor: Yeah, good to see you Nathan. I'll see you in a bit!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

The White House's plan for open models & AI research in the U.S. Jul 23, 2025

Today, the White House released its AI Action Plan, the document we’ve been waiting for to understand how the new administration plans to achieve “global dominance in artificial intelligence (AI).” There’s a lot to unpack in this document, which you’ll be hearing a lot about from the entire AI ecosystem. This post covers one narrow piece of the puzzle — its limited comments on open models and AI research investment.

For some context, I was a co-author on the Ai2 official comment to the Office of Science and Technology Policy (OSTP) for the AI Action Plan and have had some private discussions with White House staff on the state of the AI ecosystem.

A focus of mine through this document is how the government can enable better fully open models to exist, rather than just more AI research in general, as we’re in a shrinking time window where if we don’t create better fully open models then the academic community could be left with a bunch of compute to do research on models that are not reflective of the frontier of performance and behavior. This is why I give myself ~18 months to finish The American DeepSeek Project.

Important context for this document is to consider what the federal government can actually do to make changes here. The executive branch has limited levers it can pull to disperse funding and make rules, but it sends important signaling to the rest of the government and private sector.

Overall, the White House AI Action Plan comes across very clearly that we should increase investment in open models, and for the right reasons.

This reflects a shift from previous federal policy, where the Biden executive order had little to say about open models other than them getting grouped into models needing pre-release testing if they were trained with more than 10^26 FLOPS (which led to substantial discussion on the general uselessness of compute thresholds as a policy intervention). Later, the National Telecommunications and Information Administration (NTIA) released a report from under the umbrella of the Biden Administration that was far more positive on open models, but much more limited in the scope of its ability for agenda setting.

This is formatted as comments in line with the full text on open models and related topics in the action plan. Let’s dive in, any emphasis in italics is mine.

Encourage Open-Source and Open-Weight AI

Open-source and open-weight AI models are made freely available by developers for anyone in the world to download and modify. Models distributed this way have unique value for innovation because startups can use them flexibly without being dependent on a closed model provider. They also benefit commercial and government adoption of AI because many businesses and governments have sensitive data that they cannot send to closed model vendors. And they are essential for academic research, which often relies on access to the weights and training data of a model to perform scientifically rigorous experiments.

This covers three things we’re seeing play out with open models and is quite sensible as an introduction:

* Startups use open models to a large extent because pretraining themselves is expensive and modifying the model layer of the stack can provide a lot of flexibility with low serving costs. Today, most of this happens on Qwen at startups, where larger companies are more hesitant to adopt Chinese models.

* Open model deployments are slowly building up around sensitive data domains such as health care.

* Researchers need strong and transparent models to perform valuable research. This is the one I’m most interested in, as it is the one with the highest long-term impact by determining the fundamental pace of progress in the research community.

We need to ensure America has leading open models founded on American values. Open-source and open-weight models could become global standards in some areas of business and in academic research worldwide. For that reason, they also have geostrategic value. While the decision of whether and how to release an open or closed model is fundamentally up to the developer, the Federal government should create a supportive environment for open models.

The emphasized section is entirely the motivation behind ongoing efforts for The American DeepSeek Project. The interplay between the three groups above is inherently geopolitical, where Chinese model providers are actively trying to develop mindshare with Western developers and release model suites that offer great tools for research (e.g. Qwen).

The document is highlighting why fewer open models exist right now from leading Western AI companies, simply “the decision of whether and how to release an open or closed model is fundamentally up to the developer” — this means that the government itself can mostly just stay out of the way of leading labs releasing models if we think the artifacts will come from the likes of Anthropic, OpenAI, Google, etc. The other side of this is that we need to invest in building organizations around releasing strong open models for certain use cases that do not have economic conflicts or different foci.

Onto the policy steps.

Recommended Policy Actions

* Ensure access to large-scale computing power for startups and academics by improving the financial market for compute. Currently, a company seeking to use large-scale compute must often sign long-term contracts with hyperscalers—far beyond the budgetary reach of most academics and many startups. America has solved this problem before with other goods through financial markets, such as spot and forward markets for commodities. Through collaboration with industry, the National Institute of Standards and Technology (NIST) at the Department of Commerce (DOC), the Office of Science and Technology Policy (OSTP), and the National Science Foundation’s (NSF) National AI Research Resource (NAIRR) pilot, the Federal government can accelerate the maturation of a healthy financial market for compute.

The sort of issue the White House is alluding to here is that if you want to have 1000 GPUs as a startup or research laboratory you often need to sign a 2-3 year commitment in order to get low prices. Market prices for on-demand GPUs tend to be higher. The goal here is to make it possible for people to get the GPU chunks they need through financial incentives.

We’ve already seen a partial step for this in the recent budget bill, where AI training costs now can be classified as R&D expenses, but this largely helps big companies. Actions here that are even more beneficial for small groups releasing open weight or open-source models would be great to see.

One of the biggest problems I see for research funding is going to be the challenge of getting concentrated compute into the hands of researchers, so I hope the administration follows through here for compute density in places. A big pool of compute spread across the entire academic ecosystem means too little compute for models to get trained at any one location. It reads as if the OSTP understands this and has provided suitable guidance.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

* Partner with leading technology companies to increase the research community’s access to world-class private sector computing, models, data, and software resources as part of the NAIRR pilot.

* Build the foundations for a lean and sustainable NAIRR operations capability that can connect an increasing number of researchers and educators across the country to critical AI resources.

This is simple and to my knowledge has largely been under way. NAIRR provided a variety of resources to many academic parties, such as API credits, data, and compute access, so it should be expanded upon. I wrote an entire piece on saving the NAIRR last November when its funding future was unclear (and needed Congressional action).

This is the balance to what I was talking about above on model training. It provides smaller resource chunks to many players, which is crucial, but doesn’t address the problem of building great open models.

* Continue to foster the next generation of AI breakthroughs by publishing a new National AI Research and Development (R&D) Strategic Plan, led by OSTP, to guide Federal AI research investments.

This seems like a nod to a logical next step.

Where the overall picture of research funding in the U.S. has been completely dire, the priority in AI research has already been expressed through AI being the only area of NSF grant areas without major cuts. There is likely to be many other direct effects of this, but it is out of scope of the article.

More exact numbers can be found in the NSF 2026 proposed budget, where AI is an outlier as one of the only topics with a positive net change from 2024 or 2025.

* Led by DOC through the National Telecommunications and Information Administration (NTIA), convene stakeholders to help drive adoption of open-source and open-weight models by small and medium-sized businesses.

This is a more unexpected line item, but a welcome one. It’ll be harder to implement, but if it works it’ll do a lot of good for building momentum around open model investment. A large part of why few open models exist in the U.S. is just because there’s not a lot of business value from releasing them. A big story of 2025 has been how open models are closing the gap in capabilities, or at least crossing important ability thresholds, which could start to change this equilibrium.

That’s it for the core section on open models! It’s right to the point.

There are a couple related sections I wanted to point you to, which largely complement the above or show how it is hard for a document like this to acknowledge things like “our R&D ecosystem is being outcompeted by Chinese models.”

First, more on AI research itself.

Advance the Science of AI

Just as LLMs and generative AI systems represented a paradigm shift in the science of AI, future breakthroughs may similarly transform what is possible with AI. It is imperative that the United States remain the leading pioneer of such breakthroughs, and this begins with strategic, targeted investment in the most promising paths at the frontier.

Recommended Policy Actions

* Prioritize investment in theoretical, computational, and experimental research to preserve America’s leadership in discovering new and transformative paradigms that advance the capabilities of AI, reflecting this priority in the forthcoming National AI R&D Strategic Plan.

Something in my mind that is very missing from this document is a comment on immigration. If we want the U.S. to be a leader in AI research we need to prioritize fixing the immigration ecosystem as soon as possible. Leading AI conferences can no longer be located solely in the U.S. because too many authors cannot get a travel visa in time to attend the conference, let alone the other issues on hiring or funding at academic institutions.

This section on the Science of AI reads very similar to the section on open models.

And the only mentions of China, which is related as the party pushing open models the furthest today:

Counter Chinese Influence in International Governance Bodies

A large number of international bodies, including the United Nations, the Organisation for Economic Co-operation and Development, G7, G20, International Telecommunication Union, Internet Corporation for Assigned Names and Numbers, and others have proposed AI governance frameworks and AI development strategies. The United States supports likeminded nations working together to encourage the development of AI in line with our shared values. But too many of these efforts have advocated for burdensome regulations, vague “codes of conduct” that promote cultural agendas that do not align with American values, or have been influenced by Chinese companies attempting to shape standards for facial recognition and surveillance.

Recommended Policy Actions

* Led by DOS and DOC, leverage the U.S. position in international diplomatic and standard-setting bodies to vigorously advocate for international AI governance approaches that promote innovation, reflect American values, and counter authoritarian influence.

and a quick comment on Chinese talking points in the section “Ensure that Frontier AI Protects Free Speech and American Values”:

* Led by DOC through NIST’s Center for AI Standards and Innovation (CAISI), conduct research and, as appropriate, publish evaluations of frontier models from the People’s Republic of China for alignment with Chinese Communist Party talking points and censorship.

This reads as there being a low probability that we see any immediate executive action trying to ban the likes of Qwen or DeepSeek, which is good for the time being. The evaluation of Chinese and American values is a slippery slope in some ways, as it quickly will become enmeshed in the idea of “woke AI,” but in the meantime it is likely to be a major talking point with respect to the open models we’re seeing from Chinese companies, which do often parrot very simple talking points reflective of “Chinese socialist values.”

We need our ecosystem to compete on merits of the technology being better at useful tasks if we want to lead in the long-term technological arc, rather than political games. That’s my number one focus over the next couple of years and why I reiterate the need for open models for fundamental AI research and innovation. The biggest beneficiaries of this sort of innovation have historically been the biggest American technology companies, who now should do their part to support them existing — with some government encouragement.

Let me know if I missed anything, as this was a quick pass to make sure I read the details and connected the recent dots.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Kimi K2 and when "DeepSeek Moments" become normal Jul 14, 2025

https://www.interconnects.ai/p/kimi-k2-and-when-deepseek-moments

The DeepSeek R1 release earlier this year was more of a prequel than a one-off fluke in the trajectory of AI. Last week, a Chinese startup named Moonshot AI dropped Kimi K2, an open model that is permissively licensed and competitive with leading frontier models in the U.S. If you're interested in the geopolitics of AI and the rapid dissemination of the technology, this is going to represent another "DeepSeek moment" where much of the Western world — even those who consider themselves up-to-date with happenings of AI — need to change their expectations for the coming years.

In summary, Kimi K2 shows us that:

* HighFlyer, the organization that built DeepSeek, is far from a uniquely capable AI laboratory in China,

* China is continuing to approach (or reached) the absolute frontier of modeling performance, and

* The West is falling even further behind on open models.

Kimi K2, described as an "Open-Source Agentic Model" is a sparse mixture of experts (MoE) model with 1T total parameters (~1.5x DeepSeek V3/R1's 671B) and 32B active parameters (similar to DeepSeek V3/R1's 37B). It is a "non-thinking" model with leading performance numbers in coding and related agentic tasks (earning it many comparisons to Claude 3.5 Sonnet), which means it doesn't generate a long reasoning chain before answering, but it was still trained extensively with reinforcement learning. It clearly outperforms DeepSeek V3 on a variety of benchmarks, including SWE-Bench, LiveCodeBench, AIME, or GPQA, and comes with a base model released as well. It is the new best-available open model by a clear margin.

These facts with the points above all have useful parallels for what comes next:

* Controlling who can train cutting edge models is extremely difficult. More organizations will join this list of OpenAI, Anthropic, Google, Meta, xAI, Qwen, DeepSeek, Moonshot AI, etc. Where there is a concentration of talent and sufficient compute, excellent models are very possible. This is easier to do somewhere such as China or Europe where there is existing talent, but is not restricted to these localities.

* Kimi K2 was trained on 15.5T tokens and has a very similar number of active parameters as DeepSeek V3/R1, which was trained on 14.8T tokens. Better models are being trained without substantial increases in compute — these are referred to as a mix of "algorithmic gains" or "efficiency gains" in training. Compute restrictions will certainly slow this pace of progress on Chinese companies, but they are clearly not a binary on/off bottleneck on training.

* The gap between the leading open models from the Western research labs versus their Chinese counterparts is only increasing in magnitude. The best open model from an American company is, maybe, Llama-4-Maverick? Three Chinese organizations have released more useful models with more permissive licenses: DeepSeek, Moonshot AI, and Qwen. This comes at the same time that new inference-heavy products are coming online that'll benefit from the potential of cheaper, lower margin hosting options on open models relative to API counterparts (which tend to have high profit margins).

Kimi K2 is set up for a much slower style "DeepSeek Moment" than the DeepSeek R1 model that came out in January of this year because it lacks two culturally salient factors:

* DeepSeek R1 was revelatory because it was the first model to expose the reasoning trace to the users, causing massive adoption outside of the technical AI community, and

* The broader public is already aware that training leading AI models is actually very low cost once the technical expertise is built up (recall the DeepSeek V3 $5M training cost number), i.e. the final training run is cheap, so there should be a smaller reaction to similar cheap training cost numbers in the Kimi K2 report coming soon.

Still, as more noise is created around the K2 release (Moonshot releases a technical report soon), this could evolve very rapidly. We've already seen quick experiments spin up slotting it into the Claude Code application (because Kimi's API is Claude-compatible) and K2 topping many nice "vibe tests" or creativity benchmarks. There are also tons of fun technical details that I don't have time to go into — from using a relatively unproven optimizer Muon and scaling up the self-rewarding LLM-as-a-judge pipeline in post-training. A fun tidbit to show how much this matters relative to the noisy Grok 4 release last week is that Kimi K2 has already surpassed Grok 4 in API usage on the popular OpenRouter platform.

Later in the day on the 11th, following the K2 release, OpenAI CEO Sam Altman shared the following message regarding OpenAI's forthcoming open model (which I previously shared more optimistic thoughts on here) :

we planned to launch our open-weight model next week.

we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.

while we trust the community will build great things with this model, once weights are out, they can’t be pulled back. this is new for us and we want to get it right.

sorry to be the bearer of bad news; we are working super hard!

Many attributed this as a reactive move by OpenAI to get out from the shadow of Kimi K2's wonderful release and another DeepSeek media cycle.

Even though someone at OpenAI shared with me that the rumor that Kimi caused the delay for their open model is very likely not true, this is what being on the back foot looks like. When you're on the back foot, narratives like this are impossible to control.

We need leaders at the closed AI laboratories in the U.S. to rethink some of the long-term dynamics they're battling with R&D adoption. We need to mobilize funding for great, open science projects in the U.S. and Europe. Until then, this is what losing looks like if you want The West to be the long-term foundation of AI research and development. Kimi K2 has shown us that one "DeepSeek Moment" wasn't enough for us to make the changes we need, and hopefully we don't need a third.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

The American DeepSeek Project Jul 04, 2025

https://www.interconnects.ai/p/the-american-deepseek-project

While America has the best AI models in Gemini, Claude, o3, etc. and the best infrastructure with Nvidia it’s rapidly losing its influence over the future directions of AI that unfold in the open-source and academic communities. Chinese organizations are releasing the most notable open models and datasets across all modalities, from text to robotics or video, and at the same time it’s common for researchers worldwide to read far more new research papers from Chinese organizations rather than their Western counterparts.

This balance of power has been shifting rapidly in the last 12 months and reflects shifting, structural advantages that Chinese companies have with open-source AI — China has more AI researchers, data, and an open-source default.

On the other hand, America’s open technological champions for AI, like Meta, are “reconsidering their open approach” after yet another expensive re-org and the political environment is dramatically reducing the interest of the world’s best scientists in coming to our country.

It’s famous lore of the AI industry that much of the flourishing of progress around ChatGPT is downstream from Google Research’s, and the industry’s writ-large, practice of openly sharing the science of AI until approximately 2022. Stopping this practice, and the resulting power shifts mean it will be likely that the next “Transformer”-style breakthrough will be built on or related to Chinese AI models, AI chips, ideas, or companies. Countless Chinese individuals are some of the best people I’ve worked with, both at a technical and personal level, but this direction for the ecosystem points to AI models being less accountable, auditable, and trustworthy due to inevitable ties to the Chinese Government.

The goal for my next few years of work is what I’m calling The American DeepSeek Project — a fully open-source model at the scale and performance of current (publicly available) frontier models, within 2 years. A fully open model, as opposed to just an “open weights” model, comes with data, training code, logs, and decision making — on top of the weights to run inference — in order to distribute the knowledge and access for how to train AI models fully.

This project serves two goals, where balancing the scales with the pace of the Chinese ecosystem is only one piece:

* Reclaim the AI research default home being on top of American (or Western) technologies and tools, and

* Reduce the risk that the only viable AI ecosystem for cutting edge products in built atop of proprietary, closed, for-profit AI models.

More people should be focused on this happening. A lot of people talk about how nice it would be to have “open-source AGI for all,” but very few people are investing in making it reality. With the right focus, I estimate this will take ~$100M-500M over the next two years.

Within the context of recent trends, this is a future that has a diminishing, minute probability. I want to do this at Ai2, but it takes far more than just us to make it happen. We need advocates, peers, advisors, and compute.

The time to do this is now, if we wait then the future will be in the balance of extremely powerful, closed American models counterbalancing a sea of strong, ubiquitous, open Chinese models. This is a world where the most available models are the hardest to trust. The West historically has better systems to create AI models that are trustworthy and fair across society. Consider how:

* Practically speaking, there will never be proof that Chinese models cannot leave vulnerabilities in code or execute tools in malicious ways, even though it’s very unlikely in the near future.

* Chinese companies will not engage as completely in the U.S. legal system on topics from fair use or non-consensual deepfakes.

* Chinese models will over time shift to support a competitive software ecosystem that weakens many of America and the West’s strongest companies due to in-place compute restrictions.

Many of these practical problems cannot be fixed by simply fine-tuning the model, such as Perplexity’s R1-1776 model. These are deep, structural realities that can only be avoided with different incentives and pretrained models.

My goal is to make a fully open-source model at the scale of DeepSeek V3/R1 in the next two years. I’ve been starting to champion this vision in multiple places that summarizes the next frontier for performance on open-source language models, so I needed this document to pin it down.

I use scale and not performance as a reference point for the goal because the models we’re collectively using as consumers of the AI industry haven’t really been getting much bigger. This “frontier scale” is a ballpark for where you’ve crossed into a very serious model, and, by the time a few years has gone by, the efficiency gains that would’ve accumulated by then will mean this model will far outperform DeepSeek V3. The leading models used for synthetic data (and maybe served to some users) will continue to get bigger, but not as quickly as capabilities will grow and new types of agents will emerge.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

The terminology “American DeepSeek” is stretching words in order to be identifiable to a broad public. It combines the need for true American values with a breakthrough open release that marks a new milestone in capabilities.

DeepSeek is known for many things to the general public — training cheap frontier models, bringing reasoning models to consumers, and largely being the face of Chinese AI efforts. Since ChatGPT, DeepSeek is the first organization to release an open, permissively licensed AI model at the frontier of performance. This was a major milestone and why 2025 has been a transformative year in the perception of feasibility for open models generally. The name DeepSeek will forever be known in AI lore for it.

At the same time, what will count as a “DeepSeek moment” is changing. The new directions for where AI is heading is more in line with agents that use models a lot (sometimes even smaller models) rather than relying on scaling performance of single model generations.

This changes what it’ll mean for models to be “at the frontier.” More releases will look like Claude 4 and be about usability, where the benchmarks that people are hillclimbing on represent new types of capabilities or outlandish, harder than human expert tasks. For the suite of tasks that were core for the current generation of models: MATH, GPQA, SWE-Bench Verified, etc., solving them represents a challenging, but reasonable, baseline for human performance.

The next major milestone will be when fully open-source models reach this performance threshold. With fully open-source models at this level, “anyone” can specialize the model to their task and the possibility of an open ecosystem that runs efficiently on a single architecture can coalesce. This doesn’t mean releasing the best AI models of 2027 with complete openness — just that we should, come 2027, have fully open models of 2025’s capabilities in order to enable new types of companies and research.

The efficiencies of open-source software style development are dramatically stronger for agentic systems than models. Models are singular entities built with expensive resources and incredible focus. Agents are systems that can use many models off the shelf and route requests depending on what’s needed.

This agentic era is the opportunity open models have needed, but we need to clear much stronger performance thresholds before the open counterparts are viable. We have companies like OpenAI and Google launching Claude Code competitors that pretty much flop. Imagine what this would look like with open models today? Not good.

For this reason, we have finite time to get there. Surely, eventually this level of models will exist, but if we want a new type of ecosystem to form we need to build the raw resources while developers and new companies are getting started. We need people willing to take the risk on something different while there is still potential for it to be comparable across performance trade-offs.

Today, the best fully open language models are catching up to the levels of the original GPT-4. This is a major step from GPT-3 levels. The required step I’m shooting for is reaching the modern GPT-4 type models, the likes of recent Sonnet, DeepSeek V3, or Gemini Pro. It’s a big step, but a transformative one in terms of what the models can do.

Of course, some of this still works with open weight models and not just fully open models, but to date we have not had good success with having open weight models that can fully be trusted. The best American models are plagued by the Llama license (and rumors that future versions will be discontinued). At the same time, Chinese models aren’t trusted because the models are being integrated directly with more complex tools that muddy the water with a weak security reputation, and European models are largely off the map.

If we want models we can trust, we need something that’s a bit different. If the models all converge on a certain capability level, and the differentiation is on integration and finetuning to specific skills, this is something the open community can do.

In many ways, obtaining this goal is a quintessentially American volition. In the face of a technology that is poised to bring such extreme financial, and by proxy literal, power to a few companies, opening AI is one of the only things we can do to reduce it. Technology proceeds in a one-way direction — for a variety of geopolitical and capitalistic reasons it is impractical to pause AI development to “do AI another way” — the best we can do is chart a path that makes this future better.

Along the same vein, if AGI already exists and something closer to ASI is coming, it will be intertwined with countless details of billions of people’s lives in a matter of just years. Something so indispensable to our lives in work, play, entertainment, and relationships is a closer analog to electricity than other traditional technology products that one can opt into. Such technology should be available for all to benefit from.

We need new systems to mitigate misuse, but it shouldn’t be solely up to corporations to control this. Safety by isolating technology to a select few is something we’re in the later stages of with nuclear weapons, and AI progress is far harder to monitor. Robustness to AI can only come from designing systems that expect it to be pervasive — not that it is an easy task.

Realistically, all of this is fighting gravity. The corporations will win, but we can control to what extent. We can control how good the other options are. The open options.

The call to action here is simple — consider how you can slightly shift your decision making to make The American DeepSeek more likely. This approach succeeds just as much by having one model at the end of it, as it does by having the community form better habits and norms around the way AI models are conceived, built, shared, and used.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Some ideas for what comes next Jun 23, 2025

https://www.interconnects.ai/p/summertime-outlook-o3s-novelty-coming

Summer is always a slow time for the tech industry. OpenAI seems fully in line with this, with their open model “[taking] a little more time” and GPT-5 seemingly always delayed a bit more. These will obviously be major news items, but I’m not sure we see them until August.

I’m going to take this brief reprieve in the bombardment of AI releases to reflect on where we’ve been and where we’re going. Here’s what you should know.

1. o3 as a technical breakthrough beyond scaling

The default story around OpenAI’s o3 model is that they “scaled compute for reinforcement learning training,” which caused some weird, entirely new over-optimization issues. This is true, and the plot from the livestream of the release still represents a certain type of breakthrough — namely scaling up data and training infrastructure for reinforcement learning with verifiable rewards (RLVR).

The part of o3 that isn’t talked about enough is how different its search feels. For a normal query, o3 can look at 10s of websites. The best description I’ve heard of its relentlessness en route to finding a niche piece of information is akin to a “trained hunting dog on the scent.” o3 just feels like a model that can find information in a totally different way than anything out there.

The kicker with this is that we’re multiple months out from its release in April of 2025 and no other leading lab has a model remotely like it. In a world where releases between labs, especially OpenAI and Google, seem totally mirrored, this relentless search capability in o3 still stands out to me.

The core question is when will another laboratory release a model that feels qualitatively similar? If this trend goes on through the end of the summer it’ll be a confirmation that OpenAI had some technical breakthrough to increase the reliability of search and other tool-use within reasoning models.

For a contrast, consider basic questions we are facing in the open and academic community on how to build a model inspired by o3 (so something more like a GPT-4o or Claude 4 in its actual search abilities):

* Finding RL data where the model is incentivized to search is critical. It’s easy in an RL experiment to tell the model to try searching in the system prompt, but as training goes on if the tool isn’t useful the model will learn to stop using it (very rapidly). It is likely that OpenAI, particularly combined with lessons from Deep Research’s RL training (which, I know, is built on o3), has serious expertise here. A research paper showing a DeepSeek R1 style scaled RL training along with consistent tool use rates across certain data subsets will be very impressive to me.

* The underlying search index is crucial. OpenAI’s models operate on a Bing backend. Anthropic uses Brave’s API and it struggles for it (lots of SEO spam). Spinning up an academic baseline with these APIs is a moderate additive cost on top compute.Once solid open baselines exist, we could do fun science such as studying which model can generalize to unseen data-stores best — a crucial feature for spinning up a model on local sensitive data, e.g. in healthcare or banking.

If you haven’t been using o3 for search, you really should give it a go.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

2. Progress on agents will be higher variance than modeling was, but often still extremely rapid

Claude Code’s product market fit, especially with Claude 4, is phenomenal. It’s the full package for a product — works quite often and well, a beautiful UX that mirrors the domain, good timing, etc. It’s just a joy to use.

With this context, I really have been looking for more ways to write about it. The problem with Claude Code, and other coding agents such as Codex and Jules, is that I’m not in the core audience. I’m not regularly building in complex codebases — I’m more of a research manager and fixer across the organization than someone that is building in one repository all the time — so, I don’t have practical guides on how to get the most out of Claude Code or a deep connection with it that can help you “feel the AGI.”

What I do know about is models and systems, and there are some very basic facts of frontier models that make the trajectory for the capabilities of these agents quite optimistic.

The new part of LLM-based agents is that they involve many model calls, sometimes with multiple models and multiple prompt configurations. Previously, the models everyone was using in chat windows were designed to make progress on linear tasks and return that to the user — there wasn’t a complex memory or environment to manage.

Adding a real environment for the models has made it so the models need to do more things and often a wider breadth of tasks. When building these agentic systems, there are two types of bottlenecks:

* The models cannot solve any of the task we hope to use the agent for, and

* The models fail at small components of the task that we are deploying.

For agents that have initial traction, such as Claude Code and Deep Research, many of the problems are in the second class. How these fixes are made is that labs notice repeated, odd failures among real world use-cases. This can look like a 50% reliability rate on some long-tail mundane task. In this case it is often easy for the lab to make new data, include it in the next post-training run for their models, and up that sub-task reliability to almost 99%. As labs are making most of their gains in post-training today, rather than big pretraining runs, the time for that change to get integrated is well shorter than recent years.

The kicker for this is how it all fits together. Many complex tasks can be bottlenecked by some weird, small failures. In this case, we can have small changes to models that make agents like Claude Code feel way more reliable, even though the peak performance of the model hasn’t changed much. The same goes for Deep Research.

With this, I expect these agents we’re already using to improve randomly and in big leaps.

What I’m unsure of is when new agent platforms will be built. Some of this is a product problem and some of it is a peak performance problem. New agentic platforms that feel like they have product-market fit will be somewhat random, but those that have a fit already can improve like we’re used to frontier models getting way better.

Much like o3, you should play with Claude Code even if you don’t code a lot. It can make fun demos and standalone websites in no time. It’s miles ahead in its approachability compared to the fully-autonomous agents like Codex (at least for the time being).

3. Scaling parameters is going to go very slow for consumer models

The models that leading AI labs have been releasing in 2025 have mostly stopped getting bigger in total parameters. Take Claude 4, the API prices are the same as Claude 3.5 (and its minor versions). OpenAI only half released GPT-4.5. Gemini hasn’t released its Ultra variant. There are more models that are private to these laboratories that are certainly much bigger.

The nuanced part of this is that many of these models likely could be getting slightly smaller, e.g. Claude 4 Sonnet could be slightly smaller than Claude 3.5 Sonnet, due to efficiency gains at pretraining. That sort of marginal technical advancement is a big deal on price and inference speed, especially in the long-run, but not the central point I’m making.

The point is how GPT-5 is going to be bigger mostly through inference-time scaling and less through just “one bigger model.” For years we were told the narrative that the lab with the biggest training cluster was going to win because they have an advantage with scaling. That was the story behind xAI’s mega-cluster that Elon built. Now, the biggest cluster just is an advantage in overall research pace.

Scaling, at least in terms of what users need, has largely fizzled out. Labs may come back to it later as they find super hard problems that users need to solve, but where GPT 4.5 cost about 100x the compute of GPT-4 to train, it is only slightly better on normal user metrics.

What we see now is a mass efficiency march along the model sizes that people love. The industry has a few standards, from

* Tiny models like Gemini Flash Lite or GPT 4.1 Nano,

* Small models like Gemini Flash and Claude Haiku,

* Standard models like GPT-4o and Gemini Pro, and

* Big models like Claude Opus and Gemini Ultra.

These models come with somewhat predictable price-points (we know Gemini is way cheaper than the industry standard), latencies, and capability levels. Standards like this are important as industries mature!

Over time, efficiency gains will make new standards emerge. The first thing we’ll see is more mass availability of the likes of Gemini Ultra and GPT-4.5 (maybe in the GPT-5 release), but what comes after that isn’t on the radar at all. Now, scaling to new size tiers is only possible “every few years” or maybe not at all, if monetization of AI doesn’t go as well as many hope.

Scaling as a product differentiator died in 2024. That doesn’t mean pretraining as a science isn’t crucial. The recent Gemini 2.5 report made that pretty clear:

The Gemini 2.5 model series makes considerable progress in enhancing large-scale training stability, signal propagation and optimization dynamics, resulting in a considerable boost in performance straight out of pre-training compared to previous Gemini models.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Crafting a good (reasoning) model Jun 18, 2025

Why are some models that are totally exceptional on every benchmark a total flop in normal use? This is a question I was hinting at in my post on GPT-4o’s sycophancy, where I described it as “The Art of The Model”:

RLHF is where the art of the model is crafted and requires a qualitative eye, deep intuition, and bold stances to achieve the best outcomes.

In many ways, it takes restraint to land a great model. It takes saying no to researchers who want to include their complex methods that may degrade the overall experience (even if the evaluation scores are better). It takes saying yes to someone advocating for something that is harder to measure.

In many ways, it seems that frontier labs ride a fine line between rapid progress and usability. Quoting the same article:

While pushing so hard to reach the frontier of models, it appears that the best models are also the ones that are closest to going too far.

Once labs are in sight of a true breakthrough model, new types of failure modes and oddities come into play. This phase won’t last forever, but seeing into it is a great opportunity to understanding how the sausage is made and what trade-offs labs are making explicitly or implicitly when they release a model (or in their org chart).

This talk expands on the idea and goes into some of the central grey areas and difficulties in getting a good model out the door. Overall, this serves as a great recap to a lot of my writing on Interconnects in 2025, so I wanted to share it along with a reading list for where people can find more.

The talk took place at an AI Agents Summit local to me in Seattle. It was hosted by the folks at OpenPipe who I’ve been crossing paths with many times in recent months — they’re trying to take similar RL tools I’m using for research and make them into agents and products (surely, they’re also one of many companies).

Slides for the talk are available here and you can watch on YouTube (or listen wherever you get your podcasts).

Reading list

In order (2025 unless otherwise noted):

* Setting the stage (June 12): The rise of reasoning machines

* Reward over-optimization

* (Feb. 24) Claude 3.7 Thonks and What’s Next for Inference-time Scaling

* (Apr. 19) OpenAI's o3: Over-optimization is back and weirder than ever

* RLHF Book on over optimization

* Technical bottlenecks

* (Feb. 28) GPT-4.5: "Not a frontier model"?

* Sycophancy and giving users what they want

* (May 4) Sycophancy and the art of the model

* (Apr. 7) Llama 4: Did Meta just push the panic button?

* RLHF Book on preference data

* Crafting models, past and future

* (July 3 2024) Switched to Claude 3.5

* (June 4) A taxonomy for next-generation reasoning models

* (June 9) What comes next with reinforcement learning

* (Mar. 19) Managing frontier model training organizations (or teams)

Timestamps

00:00 Introduction & the state of reasoning05:50 Hillclimbing imperfect evals09:18 Technical bottlenecks13:02 Sycophancy18:08 The Goldilocks Zone19:28 What comes next? (hint, planning)26:40 Q&A

Transcript

Transcript produced with DeepGram Nova v3 with some edits by AI.

Hopefully, this is interesting. I could sense from some of the talks, it'll be a bit of a change of pace than some of the talks that have come before. I think I was prompted to talk about kind of a half theme of one of the blog posts I wrote about sycophancy and try to expand on it. There's definitely some overlap with things I'm trying to reason through that I spoke about at AI Engineer World Fair, but largely a different through line. But mostly, it's just about modeling and what's happening today at that low level of the AI space.

So for the state of affairs, everybody knows that pretty much everyone has released a reasoning model now. These things like inference time scaling. And most of the interesting questions at my level and probably when you're trying to figure out where these are gonna go is things like what are we getting out of them besides high benchmarks? Where are people gonna take training for them? Now that reasoning and inference time scaling is a thing, like how do we think about different types of training data we need for these multi model systems and agents that people are talking about today?

And it's just a extremely different approach and roadmap than what was on the agenda if a AI modeling team were gonna talk about a year ago today, like, what do we wanna add to our model in the next year? Most of the things that we're talking about now were not on the road map of any of these organizations, and that's why all these rumors about Q Star and and all this stuff attracted so much attention. So to start with anecdotes, I I really see reasoning as unlocking new ways that I interact with language models on a regular basis. I've been using this example for a few talks, which is me asking O3, I can read it, is like, can you find me the GIF of a motorboat over optimizing a game that was used by RL researchers for a long time? I've used this GIF in a lot of talks, but I always forget the the name, and this is the famous GIF here.

And coast runners is the game the game name, which I tend to forget. O3 just gives you a link to download the GIF direct directly, which is just taking where this is going to go, it's going to be like, I ask an academic question and then it finds the paragraph in the paper that I was looking for. And that mode of interaction is so unbelievably valuable. I was sitting in the back trying to find what paper came up with the definition of tool use. I think there's a couple twenty twenty two references.

If you're interested after, you can can find me because I don't remember them off the top of my head. But these are things that AI is letting me do, and it's it's much more fun and engaging than sifting through Google. And the forms of the models so this previous one was just O3 natively, whatever system prompt ChatGPT has, but the form of these interactions are also changing substantially with deep research that we've heard alluded to and and referenced. And then Claude Code, which is one of the more compelling and nerdy and very interesting ones. I I used it to help build some of the back end for this RLHF book that I've been writing in a website.

And these things like just spinning up side projects, are so easy right now. And then also Codex, which these types of autonomous coding agents where there's not the interactivity of Claude code is obviously the frontier that is going. But if you try to use something like this, it's like, okay. It works for certain verticals and certain engineers. However, the stuff I do is like, okay.

This is not there yet. It doesn't have the Internet access is a little weird as to build these complex images, installing PyTorch. It's like, okay. We don't we don't want that yet for me, but it's coming really soon. And at the bottom of this is like this foundation where the reasoning models have just unlocked these incredible benchmark scores, and I break these down in a framework I'll come back to later as what I call a skill.

And it's just fundamentally reasoning models can do different things with tokens that let them accomplish much harder tasks. So if you look at GPT-4o, which was OpenAI's model going into this, there was a variety of what we're seeing as kind of frontier AI evaluations where it's on the spectrum of the models get effectively zero, which is truly at the frontier to somewhere to 50 to 60 is labs have figured out how to hill climb on this, but they're not all the way there yet. And when they transition from GPT-4o to O1, which if you believe Dylan Patel of semi analysis, is the same base model with different post training, you get a jump like this. And then when OpenAI scales reinforcement learning still on this base model, they get a jump like this. And the rumors are that they're now gonna use a different base model and kind of accumulate these gains in another rapid fashion.

And these benchmark scores are are not free. It's a lot of hard work that gets there, but it's just a totally different landscape where things like AIM and GPQA, which is this kind of science technology reasoning questions, are effectively solved. And this is like the use cases I was describing where it's like, O3 can kind of just do this. And a lot of harder things we'll see keep coming, might unlock some of these kind of use factors I'm mentioning as interesting but not there yet. And we'll see this kind of list grow over time, but it's really not like the only thing that we're experiencing on the ground because skills are only one part of this, and there's a lot of this arts and crafts of how do you actually have a good model that people like to use.

And a lot of this talk is gonna be talking ways that that can go right and wrong. And generally, just my reflections as someone who trains these models on why we get exposed to this. So there's a lot of online discourse about models that go too far on training on benchmarks. This is an old tweet from Phi from Microsoft. I don't wanna throw them under their bus because they've also Phi-4 is a really good model by now.

So a lot of these people get this reputation for things that are maybe like a one off model incident, which emerges from a complexity of org structure weirdness and individual incentives. And I think like Meta's really in this right now, that doesn't mean their future models will be subject to this. But it is definitely a phenomenon that could happen where it's like a lot of low level decisions result in the final product that is just not what you wanted even though it seems like along the way you're doing everything right. And just kind of climbing these benchmark scores, is linked to this thing that I was saying with skills, is not the only way forward. And especially with reasoning models, there's kind of another way we've seen this, which is Claude 3.5, where people love to gripe about this supposed upgrade to Claude, would love to just like fake its way through unit tests.

And if you're looking at reasoning training, a lot of the technical implementation for code is you have the model generate code and you check if you pass unit tests. And what people are seeing is that Claude essentially does everything and then modifies the code so that the test passes. And this is like a side effect of at the training time, our reward function is just too simple. It's like we're rewarding the model for getting unit tests right, which might be disconnected from the overall theme. And like, there's just so many ways that they can actually come up, and it's like this RLVR thing, reinforced learning with verifiable rewards, let us climb these skill charts crazily both on public and private benchmarks.

So I think a lot of the labs have benchmarks internally that are much more specific to things, and even those you could kind of fake yourself on. And, like, I try there's examples of OpenAI and Gemini doing the same thing, and a lot of recent model releases have this sort of like, oh, the coding is a little off energy. But I I think a lot of it comes to this, which is just bypassing a unit test. And another really interesting one that doesn't quite have the same interference with the utility of it is like this Transluce blog post and other communications when O3 was announced where O3 will do this thing where you ask it a question and it'll say its reasoning for why it's true is some action in the world that it can't actually take due to its sandboxing. So this is the O3 saying that it measured it on its MacBook Pro that it has even though it's just running in in the data center without access to it.

I think this is a good example because things will come up like this that have actual physical meaning, but even if we could see it in such innocuous ways, it's just that this, like, we're we're pushing so hard on these skills and the measurable things that a lot of the unmeasurable becomes a second priority goal. And this is the sort of thing where it's a team a teammate at a company will be like, we're so much better at coding, search, math, everything. And it's like, oh, the the leadership needs to get the model out for competitive reasons. And then some of the things, it's like, okay, we do this messy process of making the model much more robust and getting rid of these oddities in the training data and stuff like this, and and you kind of do this later on. So it's kind of this back and forth when the pace of progress is so high that you you have to make hard decisions on what you prioritize.

And a lot of times, these weird model releases are actually just technical bottlenecks. I think this is one of my most entertaining model releases of the year is when OpenAI released GPT 4.5, and they released this system card. And originally, it had the sentence in it that GPT 4.5 is not a frontier model. If you try to find this now, they've since edited the system card, so it doesn't say this anymore. But it's just such a funny thing for an AI lab to say because at the same time, the model is really liked by people.

And I think that one of the things that people caught on with this was doing green text, and they thought this model was very good at it. And since GPT 4.5, I think people kind of think that all the big models are kind of good at green text. So Claude 4 Opus and Claude 2. Gemini 2.5 Pro, people also do this now. And this is a sign that humor is really getting better at the models. And there's these this is like a total viral moment where people like interacting with GPT 4.5.

But if you look at the system card, it's like Sam Altman says we spent 10 times the compute as GPT four on it, and the evaluations are like a minor tick up. And when you look at reasoning models, all the reasoning models have like 20 gains all the time. So it's this really odd release where I think this is a model that I still use very regularly in my day to day basis, but it just can't land for some weird complex reason of balancing this kind of vibes, eval scores, which are really important for marketing to both users and kind of business customers because you get that first impression feeling and then also price. And there's also these weird things where there's more people trying to get involved in these leading models, and the bar for releasing them is actually getting much higher. So these are two papers that are some of the, what I call, leading open reports on how to do reasoning model training.

Open thoughts three is on kind of data methods for instruction or supervised fine tuning, just kind of generating a large dataset for performance on math and code. And Magistral is Mistral's first reasoning model. And these, the methods they describe are very strong and it's very good for these companies to be open and Bespoke Labs from Open Thoughts release the data as well. But it's like even these companies get really bad can get really bad press when the models just fail to do really simple things. So for example, the OpenThinker model was the sort of model that'll think for minutes when you say hi to it, and the Mistral model is one of the models that every time you ask it a question, will format it as if it's giving a math answer.

And these are obviously cherry picked things from a known vocal critic because it is entertaining, but it's just like this is the space that AI operates in, which is like if you don't get the little things right, people are still gonna complain and and give you bad press because the models that do things extremely well are just one click away. I think things like Claude 4 Opus is easy to use in there and Gemini's coming and there's more open models. And it's just hard to get a model out that has this kind of care to it because it takes a lot of time and resources to kind of wait to release the model when you do all these kind of last bits of fine tuning. The most important one that's been in the news is kind of the sycophancy idea. This is not a real example from the model that they released for two weeks for ChatGPT.

This whole little saga, I just imitated it by system prompting it. But it reads just like them, which is ChatGPT was unbelievably sycophantic for a few days. I'm not gonna comment on all the kind of social second order effects because they're obvious and that's a large motivation of why I do the work that I do, which is just like we want to be able to have understanding and oversight onto things like that, but it reveals deep organizational pressures that these companies are going through to kind of get things out the door fast that people really want. If you are to so so here's some more examples. These are the GPT-4o version and one of these viral examples on what it was actually saying.

And on the left is the Llama 4 secret chatbot arena version, which is the one they use to get the number one score on Chatbot Arena, but never actually released. And talking to that was very odd because on the release day, you're like, this can't be the model that they're saying is the best thing ever. It's just really strange. It's like Llama example, it's like I asked it what's its name and it said a very direct and very good question. It's like I don't think most people in this room want that answer and that says a lot about evaluation and other things.

But it's like this is a deep grained problem with reinforced learning from human feedback and collecting preferences, which is that if you're collecting preference data, you will give a multi page document ranking your priorities on how you rate the or compare the answers. And at the end, there's certain things like Sick of Fancy that people actually just like to get out of models, so they become tie breaks if they're particularly distinctive between the answers. And in the OpenAI post mortem on sake of NC, they had an extremely good breakdown on this. So I recommend that you read this, but I'll take a second and read out loud the core example. So they said, for example, the update introduced an additional reward signal based on user feedback, thumbs up and thumbs down data from ChatGPT.

Throughout the post, they talk about how they trained a reward model to predict this. And with what we've seen from things like Chatbot Arena, it seems very likely that the strongest signal in that reward model was that it is just sycophancy of links to these thumbs up data and then that was expressed. Generally this is a form of over optimization. The last line on this slide is something that comes from a lot of history in the reinforcement learning literature is that for example in syncopancy and that reward model, when they were training these models in post training with many stages, so they go through some instruction tuning and they do RL, and they do RL, and they do RL, RL. As you're really pushing the models to their limits, the strong optimizers that we use will extract performance where it's easiest in your training signals.

And the easiest training signal is probably, is just like you you add some emojis, you tell them they're good at it, and the models can pick up on that very easily. And then the decision making problem that can explain most of these issues that we've talked about so far in the talk is that you have many evals, and these are things you're trying to hill climb on, but you can never have all of them. So you're kind of doing a multi objective problem, you're pushing all of those up, and it's often taking from something that you're not accounting for, and it's getting pulled way in the other direction. And this is where the things like art are important, and I'll kind of highlight a model that I think did this really well in a few slides. But I think as we see this competition for models, we'll both see more weird releases like this where there's kind of rough edges.

And there's also this much bigger opportunity in the AI space to release things that are really robust and bring a lot of joy and don't have these rough edges. So I think this kind of drive to be patient is going to be hard and hopefully rewarded when there's a lot of weird things out there. And for the non researcher majority of this audience, the figure on the right is from what is like the original over optimization paper in reward models. It's called scaling laws for language model or reward model optimization. And this is just showing the x axis is a technical measure that's KL distance, which is a distance that's used to reference the numerical change from a starting model to a final model when you're doing some RL fine tuning.

And this is showing that the y axis, your reward model score, it goes up and then it goes down. So the the hard part is when you don't have something you're competent in is you don't know when it starts going down in your over optimization. So this is kind of a classic paper you'll see if you start digging into this direction more. And what I kind of wanna highlight is that there's a Goldilocks zone between evals, vibes, and price. And I think particularly most models now are getting evals and price, and this middle one is hard.

When I think about it, Claude 3.5 Sonnet was released over a year before Claude 4 Sonnet. And this model definitely had all of them. It was the one that got people in the Bay Area here to switch from ChatGPT to Claude. And that it feels so ahead of its time because clod force on it really doesn't feel that different than 3.5. And it's definitely better, but these jumps are rare.

And these models, it's like we're on this really fast slope. So if you get a really lucky model, you're just gonna have such a really, really great output. And we'll kind of see where they come. I wouldn't know if I necessarily count O3 in this. O3 is a proof of concept that you can do something, but it still has a lot of rough edges for people.

For example, coding ability isn't quite good enough. But there will definitely be more of them and it'll be interesting to see what falls into that niche. And then, kind of the transition into the art of modeling and back to some of the stuff that I was presenting at the AI engineered world fair and I'm thinking about for my day job is like, what comes next after we have this reasoning ability? And I've come up with a few different things that people are gonna be trading into these models that make some of these applications that I started with possible. Autonomy is a very discussed trend in AI right now.

This is a plot from METR, which is a nonprofit evaluation or monitoring startup, I think, in Berkeley. And the plot here is the y axis is the length of time that it would take for a human to do a task that an AI is now doing. And what we're seeing is that AI models over time are able to solve longer and longer tasks. It's important that I I focus on this fact that it's a time of human task because I've messed up saying it in the past. And the TLDR is that climbing this is not free, and it takes a lot of hard work in improving the models and knowing where to push the models next to kind of unlock these.

And it's it's like the reasoning paradigm is a good example because we couldn't keep hill climbing on what we were doing to to kind of unlock this middle phase. So it takes a lot of effort and kind of transitioning the focus from reasoning to things that are focused on planning in this kind of task abstraction is gonna be the thing that unlock these next models or even systems. I think you might see like a DeepResearch bullet point on here rather than like O4 being on here. And then the question is like, how do we actually do this in a model and how do we train autonomous models? So when I think about traits of these kind of independent agent models, I think about starting with skills, which is what I talked about with reasoning, and then kind of expanding into other traits that are gonna be needed.

So I kind of think of calibration, is models that know what they don't know. And I think that labs haven't been using this because it's been easier to unlock performance by just building skills than having the models being kind of introspective. They just haven't needed it because they're not doing a lot of tasks on their own, and the humans could give the feedback. And then to kind of go with this, there's a few what I described these last two as two different subsections of planning. I was trying not to overburden the term because strategy is kind of creating a plan in itself and knowing where to go.

So if you ask DeepSeq R1 a very hard math question, it won't first plan. It'll just dive in and start to try to solve it. And having the model spend the time to refine its direction before solving it would be very important. And then when we're doing things like deep research or hard coding problems, we need a model that if it's presented with a plan, knows how to do itself in certain inference passes to solve subsets of it or to dispatch that to other agents that can take these steps on the problem and actually solve them. These are things that their models aren't trained to do at all right now, where it's like skills we have, inference time scaling, the basic RL stuff has unlocked this.

I don't think we need that much the models are solving the most impossible math questions for mathematicians. We don't need to push it that much further. Calibration, there's a lot of research on that. And then these last two is what I think is the next Q Star like thing where there's gonna be a lot of human data, and then we're gonna have a few good examples that we can then use to iterate with more complex training. And so if we revisit this example to ground my taxonomy, I call this very skillful but lacking planning.

Search is a skill that O3 has largely mastered. It can find really niche information, but if we're gonna pair this to DeepResearch, it doesn't quite know how to compare and contrast and know how much information it needs to gather before making its conclusion. So planning and synthesis is something that we have to encourage the models to do more of before we kind of just well, to unlock the next phase of progress. And to end, this is a somewhat technical provocation, but is something that I hear definitely happening directionally on the ground of other labs is like this RL and post training is becoming the focal point of language model development. It doesn't mean like, pre training is definitely not dead now, but mathematically, in terms of compute, these techniques for reasoning and planning are really becoming the bedrock of what people spend their money on.

Is a plan that people ask me what am I doing at AI2 to try to make a better reasoning model. At the end of the day, the technical things are not that complicated, which is you get a big data set, you filter the data set, you train the model for a while, and then after you have your models, you do a bunch of the things that are in all the research papers. They have a few ideas and you try them at the end and they might give you 1%. This is mostly suited for a technical audience but fun to include. And then back to this from post training to training idea, it's like how do we integrate this compute where post training is similar in GPU hours and then there's the idea of continual learning and if we don't ever have to really pre train a model at all, and we can kind of just keep using these real world interactions to provide signal to the model.

If we ground it into some real numbers, DeepSeek V3 was the famous paper that kick started a lot of discussions on how much does it cost to train a frontier model, and they listed these prices at $5,000,000 Within that table, they said that post training used like well less than 1% of the compute. A fun example is that a researcher on their RL team tweeted that DeepSeek R1 trained for a quote few weeks in the RL stage. This is obviously not something that you want to base any sort of, like, investment or strategy decision in. But if you extrapolate from the tweet, it's like DeepSeek R1 one could be, 10 to 20% in GPU hours. If you talk to somebody at OpenAI, they'll also say similar things, which is like O1 uses very low percentage, but O3 was 10 x the RL compute, and O4 should be the same, which is that this post training phase is already becoming 10 to 20% of the compute used for these larger models, and that's where people are looking to push limits with these things like planning and so on.

So it should be fun. This means that we'll get model releases more frequently because you can kind of see where the performance is going during these RL runs. Where pre training, you have to wait all the way to the end for technical reasons to see how the model is So for people building models, I think these next eighteen months are gonna continue to feel like what it has in the first six months of the year where it's every few weeks we're getting something that is potentially noticeably better. So that's where I've ended. I'm probably a little under on time but potentially catching up for for break time.

So thank you all for listening.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

The rise of reasoning machines Jun 12, 2025

https://www.interconnects.ai/p/the-rise-of-reasoning-machines

Note: voiceover coming later in the day. I may fix a couple typos then too.

A sufficiently general definition of reasoning I’ve been using is:

Reasoning is the process of drawing conclusions by generating inferences from observations.

Ross Taylor gave this definition on his Interconnects Interview, which I re-used on my State of Reasoning recap to start the year (and he’s expanded upon on his YouTube channel). Reasoning is a general space of behaviors or skills, of which there can be many different ways of expressing it. At the same time, reasoning for humans is very naturally tied to our experiences such as consciousness or free will.

In the case of human brains, we collectively know very little of how they actually work. We, of course, know extremely well the subjective experience of our reasoning. We do not know the mechanistic processes much at all.

When it comes to language models, we’re coming at it from a somewhat different angle. We know the processes we took to build these systems, but we also don’t really know “how deep learning works” mechanistically. The missing piece is that we don’t have a deep sense of the subjective experience of an AI model like we do with ourselves. Overall, the picture is quite similar.

To set the stage why this post is needed now, even when reasoning model progress has been rampaging across the technology industry in 2025. Last week, an Apple paper titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexityreignited the “reasoning debate” with newfound vigor.

Some of the key examples in the paper, other than traditional reasoning evaluations such as MATH-500, were that AIs struggled to solve scaled up versions of toy problems, shown below. These are problems that one can programmatically increase the complexity on.

The argument was that language models cannot generalize to higher complexity problems. On one of these toy problems, the Tower of Hanoi, the models structurally cannot output enough tokens to solve the problem — the authors still took this as a claim that “these models cannot reason” or “they cannot generalize.” This is a small scientific error.

The paper does do some good work in showing the limitations of current models (and methods generally) when it comes to handling complex questions. In many ways, answering those with a single chain of thought is unlikely to ever actually work, but they could be problems that the model learns to solve with code execution or multiple passes referencing internal memory. We still need new methods or systems, of course, but that is not a contribution to the question can language models reason? Existence of a trait like reasoning needs small, contained problems. Showing individual failures cannot be a proof of absence.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

This summary of the paper, written by o3-pro for fun, sets up the argument well:

The presence of a coherent-looking chain‑of‑thought is not reliable evidence of an internal reasoning algorithm; it can be an illusion generated by the same pattern‑completion process that writes the final answer.

The thing is, the low-level behavior isn’t evidence of reasoning. A tiny AI model or program can create sequences of random strings that look like chains of thought. The evidence of reasoning is that these structures are used to solve real tasks.

That the models we use are imperfect is not at all a conclusive argument that they cannot do the behavior at all. We are dealing with the first generation of these models. Even humans, who have been reasoning for hundreds of thousands of years, still show complete illusions of reasoning. I for one have benefitted in my coursework days by regurgitating a random process of solving a problem from my repertoire to trick the grader into giving me a substantial amount of partial credit.

Another point the paper points out is that on the hardest problems, AI models will churn through thinking for a while, but suddenly collapse even when compute is left. Back to the test-taking analogy — who doesn’t remember the drama of a middle-of-the-pack classmate leaving early during a brutally hard exam because they know they had nothing left? Giving up and pivoting to a quick guess almost mirrors human intelligence too.

This all brings us back to the story of human intelligence. Human intelligence is the existence proof that has motivated modern efforts into AI for decades. The goal has been to recreate it.

Humans for a long time have been drawn to nature for inspiration on their creations. Humans long sought flying machines inspired by nature’s most common flying instrument — flapping wings — by building ornithopters.

Let’s remember how that turned out. The motivation is surely essential to achieving our goal of making the thing, but the original goal is far from reality.

Human reasoning is the flapping wings of this analogy. It’s the target, but not the end point. Any useful definition of reasoning should encompass what humans do and what our future creations will do.

We’ve passed the Wright Brothers moment for artificial reasoners — it’s not what we expected it to look like, but it’s here.

We should go deeper on why the subjective experience we have as humans makes this case far harder to disentangle than flight. Flight is a physical phenomenon, and hence one detached from our mind. Our mind is literally only representing reality through a transformation, and it can manipulate this representation in a way that serves its physical interests.

Free will is one of those manipulations, or expressions. Free will is a useful construct that enables many complex human behaviors.

The “awareness” of these reasoning models is definitely in a tricky middle ground. The language models have a remarkable general understanding of the environments they operate in — they can explain what a code executor or a chatbot is with precision. They cannot, though, explain exactly how the environment they’re in works.

AI gaining this level of awareness while being able to act is entirely new. Previous generations of AI models that acted were RL systems trained end-to-end to act in a narrow environment. They were superhuman but had effectively no awareness of how the environment worked. Having both the ability to break down problems and express some level of awareness with the world is remarkable. What is missing in the human comparison is AIs being able to evolve with the environment, i.e. continual learning.

Just because an AI doesn’t have all the tools that we use to interact intelligently with the world does not mean it isn’t reasoning. The models break down problems and iteratively try until they reach an answer. Sometimes the answer is wrong, but that’ll improve over time in line with their awareness.

You say AIs are just pattern matching — I say humans are just pattern matching too. We’re doing it in different ways. Would many of the critics be more accepting of this type of reasoning if it was moved to a latent reasoning approach, more similar to how humans draw answers out of thin air and ruminating?

Hallucinations are a great example of the type of complete awareness our AI systems lack. We’ll get better at this. For now, AI models are very minimally trained for “calibration” or knowing what they know. Why train models to know what they know when there are easier ways to solve evaluations? This is why I call calibration a trait of next-generation models — we’re just now getting to the point where it’s needed to solve complex tasks.

With better awareness one could argue for consciousness, but I don’t have a good grasp on how that is defined for humans so I won’t go so far as to assign it to other systems.

Ilya Sutskever discussed the boundary between understanding and awareness, as what comes next, in his latest test of time talk at NeurIPS 2024. To understand is to predict things accurately. To be self-aware is to be able to predict accurately with an understanding of what it is and what its environment is. This all goes back to Ilya’s provocation for why next-token prediction is enough on the Dwarkesh Podcast:

Predicting the next token well means that you understand the underlying reality that led to the creation of that token.

His argument is that self-awareness will follow as we push AI models to understand the world. Since that quote 2 years ago, we’ve made immense progress on his vision. Ilya also included a warning in his more recent NeurIPS talk:

The more [a system] reasons, the more unpredictable it becomes.

We are crossing a rubicon. To ignore this is to be fundamentally ill-prepared for the future.

Being surrounded by another intelligent entity is naturally very off-putting for humans. We evolved in a way that made our social and abstract intelligence a major competitive advantage that allowed us to effectively conquer our environment. I’m not an evolutionary biologist nor anthropologist nor sociologist, but it appears that a majority of critiques of AI reasoning are based in a fear of no longer being special rather than a fact-based analysis of behaviors.

Thanks again to Ross Taylor for discussions that helped form this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

What comes next with reinforcement learning Jun 09, 2025

https://www.interconnects.ai/p/what-comes-next-with-reinforcement

First, some housekeeping. The blog’s paid discord (access or upgrade here) has been very active and high-quality recently, especially parsing recent AI training tactics like RLVR for agents/planning. If that sounds interesting to you, it’s really the best reason to upgrade to paid (or join if you’ve been paying and have not come hung out in the discord).

Second, I gave a talk expanding on the content from the main technical post last week, A taxonomy for next-generation reasoning models, which you can also watch on the AI Engineer World’s Fair page within the full track. My talk was one of 7 or 8 across the full day, which was very enjoyable to be at, so I am honored to have won “best speaker” for it.

Three avenues to pursue now that RL works

The optimistic case for scaling current reinforcement learning with verifiable rewards (RLVR) techniques to next-generation language models, and maybe AGI or ASI depending on your religion, rests entirely on RL being able to learn on ever harder tasks. Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next generation RL training to would be 1M-100M tokens per answer. This involves wrapping multiple inference calls, prompts, and interactions with an environment within one episode that the policy is updated against.

The case for optimism around RL working in these new domains is far less clear compared to current training regimes which largely are rewarding the model for how it does on one interaction with the environment — one coding task checked against tests, one math answer, or one information retrieval. RL is not going to magically let us train language models end-to-end that make entire code-bases more efficient, run scientific experiments in the real world, or generate complex strategies. There are major discoveries and infrastructure improvements that are needed.

When one says scaling RL is the shortest path to performance gains in current language models it implies scaling techniques similar to current models, not unlocking complex new domains.

This very-long-episode RL is deeply connected with the idea of continual learning, or language models that get better as they interact with the real world. While structurally it is very likely that scaling RL training is the next frontier of progress, it is very unclear if the type of problems we’re scaling to have a notably different character in terms of what they teach the model. Throughout this post, three related terms will be discussed:

* Continuing to scale RL for reasoning — i.e. expanding upon recent techniques with RLVR by adding more data and more domains, without major algorithmic breakthroughs.

* Pushing RL to sparser domains — i.e. expanding upon recent techniques by training end-to-end with RL on tasks that can take hours or days to get feedback on. Examples tend to include scientific or robotics tasks. Naturally, as training on existing domains saturates, this is where the focus of AI labs will turn.

* Continual learning with language models — i.e. improvements where models are updated consistently based on use, rather than finish training and then served for inference with static weights.

At a modeling level, with our current methods of pretraining and post-training, it is very likely that the rate of pretraining runs drops further and the length of RL training runs at the end increases.

These longer RL training runs will naturally translate into something that looks like “continual learning” where it is technically doable to take an intermediate RL checkpoint, apply preference and safety post-training to it, and have a model that’s ready to ship to users. This is not the same type of continual learning defined above and discussed later, this is making model releases more frequent and training runs longer.

This approach to training teams will mark a major shift where previously pretraining needed to finish before one could apply post-training and see the final performance of the model. Or, in cases like GPT-4 original or GPT-4.5/Orion it can take substantial post training to wrangle a new pretrained model, so the performance is very hard to predict and the time to completing it is variable. Iterative improvements that feel like continual learning will be the norm across the industry for the next few years as they all race to scale RL.

True continual learning, in the lens of Dwarkesh Patel is something closer to the model being able to learn from experience as humans do. A model that updates its parameters by noticing how it failed on certain tasks. I recommend reading Dwarkesh’s piece discussing this to get a sense for why it is such a crucial missing piece to intelligence — especially if you’re motivated by making AIs have all the same intellectual skills as humans. Humans are extremely adaptable and learn rapidly from feedback.

Related is how the Arc Prize organization (behind the abstract reasoning evaluations like ARC-AGI 1, 2 and 3) is calling intelligence “skill acquisition efficiency.”

Major gains on either of these continual learning scenarios would take an algorithmic innovation far less predictable than inference-time scaling and reasoning models. The paradigm shift of inference-time scaling was pushing 10 or 100X harder on the already promising direction of Chain of Thought prompting. A change to enable continual learning, especially as the leading models become larger and more complex in their applications, would be an unexpected scientific breakthrough. These sorts of breakthroughs are by their nature unpredictable. Better coding systems can optimize existing models, but only human ingenuity and open-ended research will achieve these goals.

Challenges of sparser, scaled RL

In the above, we established how scaling existing RL training regimes with a mix of verifiable rewards is ongoing and likely to result in more frequent model versions delivered to end-users. Post-training being the focus of development makes incremental updates natural.

On the other end of the spectrum, we established that predicting (or trying to build) true continual learning on top of existing language models is a dice roll.

The ground in the middle, pushing RL to sparser domains, is far more debatable in its potential. Personally, I fall slightly on the side of pessimism (as I stated before), due to the research becoming too similar to complex robotics research, where end-to-end RL is distinctly not the state-of-the-art method.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

The case for

The case where sparser, scaled RL works is quite similar to what has happened with the past generations of AI models, but with the infrastructure challenges we are overcoming being a bit bigger. This is continuing the march of “deep learning works,” where we move RL training to be further off-policy and multi-datacenter. In many ways RL is better suited to multi-datacenter training due to it having multiple clusters of GPUs for acting, generation, and learning, policy gradient updates that don’t need to communicate as frequently as the constant updates of pretraining with next-token prediction.

There are two key bottlenecks here that will fall:

* Extremely sparse credit assignment. RL algorithms we are using or discovering can attribute per-token lessons well across generations of millions of tokens. This is taking reward signals from the end of crazily long sequences and doing outcome supervision to update all tokens in that generation at once.

* Extremely off-policy RL. In order to make the above operate at a reasonable speed, the RL algorithms learning are going to need to learn from batches of rollouts as they come in from multiple trial environments. This is different than basic implementations that wait for generations from the current or previous batch to then run policy updates on. This is what our policy gradient algorithms were designed for.As the time to completion becomes variable on RL environments, we need to shift our algorithms to be stable with training on outdated generations — becoming like the concept of a replay buffer for LM training.

Between the two, sparsity of rewards seems the most challenging for these LM applications. The learning signal should work, but as rewards become sparser, the potential for overoptimization seems even stronger — the optimizer can update more intermediate tokens in a way that is hard to detect in order to achieve the goal.

Overcoming sparsity here is definitely similar to what happened for math and code problems in the current regime of RLVR, where process reward models (PRMs) with intermediate supervision were seen as the most promising path to scaling. It turned out that scaling simpler methods won out. The question here is, will the simpler methods even work at all?

The case against

There are always many cases against next-generation AI working, as it’s always easy to come up with a narrative against complexity in progress. There are a few key points. The first is that scaling to sparser tasks is already not working, or we don’t know how to actually set up the rewards in a way that encourages the model to get meaningfully better at long tasks.

For example, consider Deep Research, a new product that is “trained with RL” and generates millions of tokens per query. How exactly does the RL work there? OpenAI lightly described the training method for Deep Research in their launch blog post (emphasis mine):

Deep research independently discovers, reasons about, and consolidates insights from across the web. To accomplish this, it was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1, our first reasoning model. While o1 demonstrates impressive capabilities in coding, math, and other technical domains, many real-world challenges demand extensive context and information gathering from diverse online sources. Deep research builds on these reasoning capabilities to bridge that gap, allowing it to take on the types of problems people face in work and everyday life.

There are two key points. First, they say they train on browser and tool-use tasks with the same infrastructure as o1. Second, they focus on how these capabilities can bridge the gap to harder problems — not that the capabilities are being learned on the harder problems themselves.

How to read this training method, which is likely similar for agents like Claude Code or Codex, is that current RL methods are helping the models get more robust at individual tasks that make up a longer trajectory rather than being trained on the end result of the trajectory itself. The final long-horizon behavior is put together with prompting and letting the model run longer, not sparse credit assignment. In the case of Deep Research the final measure of performance would actually look far closer to human preferences than verifiable rewards, and a large portion of that applies for Claude Code as well, where multiple solutions could solve a problem and it falls to human taste to say which is the best.

There’s a much clearer path for RL training going from human preferences through verifiable rewards and back to human preferences again, rather than pushing further into sparser, harder verifiable domains.

Second, recent RL research always shows that many interactions with a problem or world are needed to solve challenging tasks. In the RLVR domain for math or code the models are generally shown many similar problems multiple times. In the standard RL domains, standard practice is to create simulators that allow massively parallel learning agents (as discussed in the Interconnects Interview with Eugene Vinitsky). The more challenging the problem we’re attempting to deploy RL to, the less these conditions of parallelism or multiple tries can apply.

Whether or not it works, the thing to try is carefully curating the first trajectories to train the models on. This is what OpenAI did to create o1, and it took so long that we got all the Q* rumors in their early experiments. These manual trajectories of optimal samples from Deep Research or coding agents will definitely help performance, but it isn’t clear if they’ll serve as a “warm start” for the model to then be trained extensively with bigger RL.

Is continual learning something we should want?

Dwarkesh’s goal, in many ways, is an AI that learns after interacting with you in a permanent way. This comes with unintended side-effects and would be borderline dangerous. The current AI systems that learn in a “continual” way via trial-and-error with the user are algorithmic feeds. Most people remark how incredible it is for TikTok to learn your interests in real time in front of you, often capturing an essence within minutes.

When it comes to AI models with the latent intelligence that is superhuman in many aspects of understanding, unlocking a rapid and personalized feedback loop back to some company owned AI system opens up all other types of dystopian outcomes. For a long time I’ve written that AI models have a higher risk potential in terms of social outcomes because the modalities they interact with us in are far more personal — e.g. private messaging. Combine a far stronger optimizer with a far more intimate context and that is a technology I don’t even want to try.

There are alternatives that still reap the upside. Despite the bumpy rollout, ChatGPT features that just remember your past interactions can go a long way to act like continual learning. The model can reference past chats and times you corrected it in order to avoid repeating the same mistake, even though the underlying weights don’t need to update. If that isn’t powerful enough, we can wait for the technology to become efficient enough for local models to learn continually as we interact with them. Both of these would dampen the risk potential of super-targeted AI.

Personalization is the softer framing of this that is more compelling. Continual learning is the framing that suits the leading model providers because their training algorithms will be the ones benefiting from all of the interactions. Personalization doesn’t suit the frontier AI laboratories well because their economies of scale push them to have few models for many users. If open models keep up, we should be able to create a future of specialized, “n of 1” models for specific users.

Without corporate misaligned incentives, I’d be happy to have continual learning, but on the path we’re going down I’d rather not have it be an option presented to the masses at all.

As AI is going to be so powerful as a standalone entity, breaking some of the symbiotic links will be good for adding friction that makes the technology easier to steer towards good outcomes. In short, be wary of wishing for end-to-end (reinforcement) learning when you’re part of the environment. It’s a destiny to dystopia.

Aside: Revisiting AI usage

Finally, while in SF I was chatting with many people about the theme of my post, People use AI more than you think, which is framed as simple AI revenue and usage growth. The core idea of the article should’ve been expanded, as not only do people use AI a lot already, but most of the most popular AI services are supply constrained just like Nvidia. When you see revenue forecasts from OpenAI or Anthropic to The Information, it’s best to believe them for scaling existing product offerings. They know when they’re getting more capacity online. The new higher-revenue offerings are in flux.

For example, Sundar Pichai acknowledged this in his recent appearance on the Lex Fridman podcast:

I think it's compute-limited in this sense, right, like, you know, part of the reason you've seen us do Flash, Nano, Flash, and Pro models, but not an Ultra model, it's like for each generation, we feel like we've been able to get the Pro model at like, I don't know, 80, 90% of Ultra's capability, but Ultra would be a lot more, like slow, and a lot more expensive to serve. But what we've been able to do is to go to the next generation and make the next generation's Pro as good as the previous generation's Ultra.

This will very likely continue.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

How I Write Jun 06, 2025

https://www.interconnects.ai/p/how-i-write

My experience with my recent years of writing is quite confusing — almost even dissociative. I've never felt like I was a good writer and no one really told me I was until some random point in time a year or two ago. In that time span, I didn't really change my motivation nor methods, but I reaped the simple rewards of practice. I'm still wired to be very surprised when people I respect wholeheartedly endorse me as "writing very well." Despite the disbelief, when I interrogate what I'm doing and producing it is clear that I've become a good writer.

I don't have a serious writing process. Rather, I make writing a priority. When it is time to write, when my brain is ready, I write. Most of the processing of ideas comes from discussions at work, online, and with myself. The writing is a dance of crystallizing your ideas. It is capturing a moment. This post will take me about 45 minutes on my return flight from San Francisco for a talk, after a nap and a sparkling water. This is standard and it's quite refreshing to have nothing else to do.

I'm torn on the future of writing. It's easy to think that with AI no one will learn to write well again, but at the same time the power of writing well is increasing in careers and with the perception overall impact.

The process of becoming good at writing is quite simple. It takes practice. With practice, you can get to a solid enough level to write clear and engaging prose. The path to becoming a good writer has two sequential milestones:

* Finding something you care about. Then you can write about it. The entry level to this is finding something you want to learn more about. The final level is writing about your passions.

* Finding your voice. Then you can write effortlessly.

People spend too long trying to write as an activity without thinking seriously about why they're writing and what they care about. This makes writing feel like a chore.

Finding your voice also unlocks much more powerful feedback loops and the most powerful form of writing — writing about why you write. This helps cultivate your voice, your direction, your personality, your story. When I found my voice I also unlocked style. Feeling style while writing is when it becomes intellectual play. For example, I find diversity of punctuation and aggressive sentence structure to be something that AI never does naturally. AI. Won't. Make. You. Read. Fragments. AI will draw you into long, lulling, lofty sentences that make you feel like you know what they're talking about while still conveying very little information.

Finding voice is also far harder. Writers block can be best described as when you have ideas, but you don't know how to express them. Sometimes this is forced upon you because the medium you're writing for has a required format (e.g. academic manuscripts). I'm yet to find a way to circumvent this.

When you have found your voice and your something, writing is just as much thinking a topic through as it is an action in itself. Most of my work now is just that — I'm prioritizing the times to write when I feel my thoughts coming together and I sit down to finish them off. Without prioritizing writing, it'll often feel like you're trying to put together puzzle pieces where the edges have been bent or torn. You know what you are going for, but it's just extra work to bend everything back into shape. My schedule is designed to make writing a priority. I have few meetings and I approach my workflow with consistent hard work expressed through very flexible hours.

Writing captures the essence of ideas incredibly well and we have a deep sense that can pick up on it. It's why you can read one 200 character post on X and know with conviction that the creator of it is a genius. This bar of good writing and thinking is of course rare at a personal level and fleeting throughout a day.

By doing this for multiple years my rate of output has gotten far higher along with my overall quality. Is my thinking becoming clearer or am I getting better at expressing it in the written word? In many ways the distinction doesn't matter.

This brings me back to AI. AI models are definitely getting much better at writing, but it's not easy to track. With the above sentiment, I think writing quality is one of the best judges of AI models' abilities. It's why I've stuck with GPT-4.5 for so long despite the latency and I suspect it is a reason many people love Claude 4 Opus. o3 can be quite nice as well. Still, these models are better at writing than their peers, but they’re still very mediocre overall.

AI labs are not set up to create models that are truly great at writing. A great model for writing won't have gone through heavy RLHF training or be trained to comply with a specific tone. This could get better as the base models get stronger, as post-training can get lighter as the models naturally are more capable to start with, but I think the drive to define a model's voice will appeal to more users than elegance (i.e. the same incentives that caused GPT 4o to be so sycophantic).

Without more raw intelligence better writing will feel like a lucky find from prompting rather than the nature of new models. I suspect many recent papers on creative writing are doing more of amplifying a certain style of writing that humans like than making the model have a more expansive capacity for writing.

With scaled RLVR training we're also pushing the models even further into doing rather than writing. A great test for AI progress is how the writing ability gets pulled up with all the other training foci around it.

AI helps good writing processes, but it pulls up the drawbridge for those looking to get into writing. The level of motivation it takes to learn to write while autocomplete is always available is far higher.

For the full “life” backlog of my writing, here it is in chronological order:

* July 2022: Job search out of Ph.D.

* May 2023: What it’s like to work in AI right after ChatGPT.

* November 2023: Job search post ChatGPT & RLHF.

* October 2024: Why I build open language models.

* May 2025: My path into AI.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

A taxonomy for next-generation reasoning models Jun 04, 2025

https://www.interconnects.ai/p/next-gen-reasoners

On Monday of this week we released RewardBench 2, Ai2’s next reward model evaluation and a project I’ve been personally invested in through its whole arc. Read more of my thoughts here.

Tomorrow, I’ll be presenting a version of this post at the AI Engineer World’s Fair Reasoning & RL track. Come tomorrow and say hi if you’re around the next two days!

The first generation of reasoning models brought us inference-time scaling and intrigue in seeing into what can be called the reasoning process of a language model.

The second generation of reasoning models are going to bring us new types of agentic language modeling applications.

The traits and abilities that are needed for agentic models are additive to the first generation, but not present by default. Some of the new abilities that are needed can be bootstrapped with clever prompting, but for the best results we need to be training our reasoning models directly to optimize for planning.

In this post we explain four key aspects of current and next-generation reasoning models:

* Skills: The ability to solve self-contained problems.

* Calibration: The ability to understand the difficulty of a problem and not overthink.

* Strategy: The ability to choose the right high level plan.

* Abstraction: The ability to break down a strategy into solvable chunks.

These are presented in the order that they should be solved to make a progressively more complete reasoning model for complex tasks. Skills then calibration then strategy then abstraction. The first two are native abilities of models on single inference passes when presented with a technical problem and the latter are skills that are needed to build effective agents.

For grounding, recall the popular “time horizon progression” chart from METR:

The models were saturating around GPT 4o in 2024. Unlocking reasoning skills provided the bump through Claude Sonnet 3.7 in 2025. Planning well will be the trait of models that make the leap from 1 to 4+ hours in 2026 and on.

All of the excitement around reasoning models exploded when it was shown that scaling reinforcement learning with verifiable rewards (RLVR) enables the model to learn useful skills for solving a variety of downstream tasks. The first public confirmation of this was with DeepSeek R1, which showed how training time RL compute translates to performance.

Intertwined with this is that the models will generate more tokens per response while discovering these skills. Within all reasoning models today the above abilities listed — skills, calibration, strategy, and abstraction — can be further tuned by the increase in token spend per component.

This year every major AI laboratory has launched, or will launch, a reasoning model because these models are better at acquiring skills that let them solve the hardest problems at the frontier of AI — evaluations like Humanity’s Last Exam, MATH, AIME, LiveCodeBench, Aider Polyglot, etc. have all seen step changes in performance from the previous class of models. These skills are the foundation for all of the changes that are following in the industry. Much of current discussions on scaling training are around finding the right problems to let the models become more robust in a variety of scenarios.

The mad rush for skill acquisition in these models has ballooned a second-order problem of the models overthinking for even easy problems. This emerges due to the deep coupling of RL training and the unlock of inference-time scaling. The ultimate goal is clearly that models scale inference-time compute on their own proportional to how hard the problem is. In the short term, when the rate of performance gain is so high, it makes sense to prioritize abilities over efficiency. As abilities saturate, performance and cost will be weighted more equally.

Right now, calibration on problem difficulty is offloaded to the user in the form of model selectors between reasoners or traditional instruct models, reasoning on/off buttons, thinking budget forcing, and soon reasoning effort selectors. On the research side its been shown that the RL loss functions are flexible enough to enable length control more precisely — something that loss functions like instruction or preference tuning cannot handle. Similarly, the models trained as reasoners better express their confidence, which should soon be translated into mitigations of overthinking.

Calibrating the difficulty of the problem to the effort of the solution will enable much more practical (and faster and enjoyable) solutions for end users and also just more profitable solutions. Calibration, even though a lower level trait of the models, isn’t as much of a crucial path to rolling out new use-cases with the models. For that, AI makers are going to turn to better planning abilities.

For more on current research on calibration, click the following footnote.

Before we go on to planning abilities, which are often discussed at length in the community as being crucial without providing a clear way of understanding it, we need to contextualize how parallel compute and other inference-time scaling methods will impact the future of reasoning models. The most prominent method here is some sort of search mixed with either consistency or internal scoring models (e.g. reward models) like o1-pro. For example, in the Claude 4 release post Anthropic mentioned that they use “parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.” Google has also announced but not released Gemini Deep Think which will mirror this.

Using these methods makes it clear that parallel compute is doing something very different than scaling the underlying RL — it’s an added form of robustness or quality on the answers. o1 pro in my testing has always been the most consistent model I’ve tried. Scaling compute here doesn’t directly help the model unlock more skills like the training time RL compute, but in practice it feels similar because better answer extraction and formatting helps the model feel smarter. The best way to encapsulate the somewhat orthogonal direction of parallel compute for inference-time scaling is that quality is often anti-correlated with rare tokens when a rating metric or reward model is deployed, as rare tokens will be suppressed by majority voting methods or reward models that have never seen them before.

When it comes to leading reasoning models of the future, calling in parallel compute or just extended linear thinking can be best thought of as a tool that the agent can call. They’re going to be arrows in the quiver of a model planning a strategy and knowing which pieces of it will be most difficult to overcome.

Though, in order to get there, the models need to be treated very differently. Current models do very little planning on hard problems unless asked to do so. For example, here’s what happens when the new R1 model is asked a problem from Frontier Math (one of the hardest current benchmarks):

With current models it is reasonable that they do very light or implicit planning — the skills we’re trying to train in will allow the model to break down problems into steps and solve them. Implicitly the first few tokens these models take send them down a certain plan. These behaviors will be minor relative to what emerges in agentic workflows — where a plan is needed a priori in order to narrow the search space substantially.

Planning is the term of art used to encompass the models long term and multi-step abilities.

Planning encompasses many sub-skills and abilities, but the highest level split that matters in the current frontier of agentic models is strategy and abstraction. Strategy is the ability of the model to correctly point itself in the direction of a high quality solution. With one autoregressive pass, pointing the stream of tokens in the wrong direction is often not recoverable. While agents will be a bit better at this by being able to edit their plan, they’re still heavily susceptible.

Abstraction is how the model breaks down the strategy into accessible parts. Even with the most skilled model, taking on too hard of a sub-task at once will make it so no progress is made overall. Taking on not enough at a time will make the model timeout. Currently, abstraction is a minor problem as the time horizon is fairly short, but models will need to be able to break down multi-day tasks into sub problems that can be solved in individual 1-2minute inference steps (i.e. 10-100K tokens of forward inference).

A closely related skill is context management, where the models must be able to store a complete summary of what they have done so far. The best forms of context management will let the model skip over tasks it accidentally ended back on even though they’re already completed or try a new strategy after a failed approach. This is one of many low-level skills that’ll emerge to enable generalized planning abilities.

o3 is the leading model in this paradigm right now with the largest spectrum of skills from math, code, and search and some leading planning abilities such as Deep Research. When o3 is finding niche information for me I attribute very little of that behavior to planning, but rather just the skill, multi-try tool use, of knowing to keep searching things until it finds the answer. Other models have qualities that are ahead in some regions of the Pareto frontier, such as Claude 4’s planning for software tasks (in essence saying Claude Code is currently better than OpenAI’s coding agent Codex).

o3 is best when it is tasked with finding extremely niche information that exists on maybe one page on the web. It fails when asked to compare all the content that is out there. In the above taxonomy, o3 has almost solved the skill of search but synthesis across a broad category involves more advanced planning of the information to obtain and analyze.

Planning does not feel like an ability I’d expect to emerge when training on multi-step, challenging tasks, but I wouldn’t be surprised if it’s a behavior that could be refined. Much as the Q* story was actually a substantial initial data curation effort by OpenAI to craft some reasoning traces, they’ll likely need to do the same to seed higher quality planning behaviors before continuing to train the model. High-quality training samples here will encompass both high level strategies and details on how to abstract the problem.

As with the skills specific to reasoning on single math or code problems like verification or checking work, it’ll be a long time before we know the balance of these emerging from general pretaining, focused mid training, or specialized cold start data. Regardless of the long-term balance, we’ll quickly be seeing a race to add these planning abilities so labs will start with post training (cold start SFT data) that elicits whatever was in the pre training. This task will not be as hard as initializing the reasoning chains themselves, as planning is more about results than the behavior that gets them (which should partially transfer from hard math and code problems).

The first thing current agents likely do is write out a plan of attack for their ultimate goal. The weakness of current planning abilities are seen by the variance in outputs like Deep Research and Codex where it’ll oscillate between a masterpiece and a dud. Claude Code’s planning abilities could be better for a reason as simple as the model being taught to edit and revisit the plan many times while it is running. This sort of distribution output scope, or length of time the model will try, starts linking planning capabilities back to calibration too.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

All of this paints a fairly clear path of problems that will be solved in the coming months. Agentic tasks require more of what makes reasoning models great. At the same time, the tasks are far more focused on real world tasks than things that are represented in existing academic benchmarks. Current academic works are very strongly pushing the direction of skills for these models, particularly on math, and a fair amount on calibration (see footnotes below), but not enough on the subsets of planning we need. The challenge is that these capabilities can only be judged in the broader system that they operate in, which will often be accompanied by substantial inference costs. The real race is towards building systems that people use, whether with open or closed models, rather than pushing the models further into skills that aren’t showing clear value, such as nearly-impossible math problems or the top echelons of competitive programming.

With current models we should be optimistic that we can solve many of the coming problems. We have some manual data annotation work to do to bootstrap planning abilities, and then we can attempt the final goal of training agents end-to-end with reinforcement learning on long-horizon, sparse tasks.

Thanks to Ross Taylor for some feedback on an early form of this taxonomy and Sophie Alpert for helping crystallize some of my ideas around o3.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Claude 4 and Anthropic's bet on code May 27, 2025

https://www.interconnects.ai/p/claude-4-and-anthropics-bet-on-code

Claude’s distinctive characteristics are having a best-in-class personality and the ability to effectively perform software engineering tasks. These characteristics both appeared in force with the first version of Claude 3.5 Sonnet — a major breakthrough model at the time and the model that pulled me away from ChatGPT for the longest. That model was released on Jun 20, 2024, and just the other day on May 22nd, 2025, Anthropic released Claude Opus 4 and Claude Sonnet 4. The strengths of these models are the same.

The models serve as an instrument in Anthropic’s bigger goals. The leading AI models alone now are not a product. All the leading providers have Deep Research integrations set up, ChatGPT uses memory and broader context to better serve you, and our coding interactions are leaving the chat window with Claude Code and OpenAI’s Codex.

Where Anthropic’s consumer touchpoints, i.e. chat apps, have been constantly behind ChatGPT, their enterprise and software tools, i.e. Claude Code, have been leading the pack (or relatively much better, i.e. the API). Anthropic is shipping updates to the chat interface, but they feel half-hearted relative to the mass excitement around Claude Code. Claude Code is the agent experience I liked the best over the few I’ve tried in the last 6 months. Claude 4 is built to advance this — in doing so it makes Anthropic’s path narrower yet clearer.

As a reminder, Claude 4 is a hybrid-reasoning model. This means that reasoning can be turned on and off at the click of a button (which is often implemented with a simple prompt at inference time and length-controlled RL at training time — see the Nemotron reasoning model report for more on hybrid-reasoning techniques). In the future extended thinking could become a tool that all models call to let them think harder about a problem, but for now the extended thinking budget button offers a softer change than switching from GPT-4.1 to o3.

Claude 4 gut check

In AI, model version numbers are meaningless — OpenAI has model number soup with their best model being a random middle number (o3) while Gemini took a major step forward with an intermediate update — so Claude 4 being a seemingly minor update while iterating a major version number to fix their naming scheme sounds good to me.

In an era where GPT-4o specifically and chatbots generally are becoming more sycophantic, Claude’s honesty can be a very big deal for them. This is very hard to capture in release notes and still comes across in the takes of lots of early testers. Honesty has some downsides, such as Claude’s ability to honestly follow its alignment training and potentially report rule-breaking actions to authorities. Honesty and safety are very desirable metrics for business customers, a place where Anthropic already has solid traction.

In a competitive landscape of AI models, it feels as if Anthropic has stood still in their core offerings, which allowed ChatGPT and Gemini to claw back a lot of their mindshare and user-share, including myself. Claude 4’s “capabilities” benchmarks are a minor step up over Claude 3.7 before it, and that’s on the benchmarks Anthropic chose to share, but it is still clearly a step forward in what Claude does best.

Benchmarks are a double edged sword. Claude 4 will obviously be a major step up for plenty of people writing a lot of code, so some will say they’re never looking at benchmarks again. This approach doesn’t scale to enterprise relations, where benchmarks are the headline item that gets organizations to consider your model.

On some popular coding benchmarks, Claude 4 actually underperforms Claude 3.7. It would be good for the industry if Claude 4 was rewarded for being a practically better model, but it goes against a lot of what the industry has been saying about the pace of progress if the next major iteration of a model goes down on many popular benchmarks in its core area of focus.

Buried in the system card was an evaluation to measure “reward hacking,” i.e. when the model takes an action to shortcut a training signal rather than provide real usefulness, that showed Claude 4 dramatically outperforming the 3.7 model riddled with user headaches.

This single benchmark summarizes a lot of the release. They made the model more reliable, and what follows ends up being Anthropic falling into normal marketing paths.

This release feels like the GPT-4.5 release in many ways — it’s a better model in general use, but the benchmark scores are only marginally better. It’s obviously a strong and well-crafted model (doubly so in the case of Opus), but it’s not immediately clear which of my grab-bag of use cases I’ll shift over to Claude for it. I’m not the intended audience. I write code, but a lot of it is one-off hacks and it’s certainly not sustained development in a major code-base. Without better consumer product offerings, I’m not likely to keep trying Claude a lot. That doesn’t mean there isn’t a strong audience for this model in the software industry. My vibe tests for the model were good, but not good enough to break my habits.

Anthropic shared evaluation numbers for the model with and without extended reasoning on with parallel test-time compute. Both of these numbers aren’t really standard for sharing evaluations of new cutting-edge models (mostly of the reasoning variety).

The oddness of the benchmark presentation reiterates that Anthropic is going down a bit of a different path with their models relative to OpenAI and ChatGPT.

It should be fairly obvious to most AI observers that if simply turning on extended thinking for Claude 4 was enough for Opus to be competitive with o3 or Sonnet to Gemini 2.5 Pro, they would’ve done it. Without the shaded regions, the bars do not look so impressive (coming soon below), and this leads us to one of the major facts of the Claude 4 release — the benchmarks are meh. They can’t lead this model to mindshare.

This is partially in the context of how Anthropic is very narrowly curating the benchmarks they share to match their coding and agentic use-cases.

The Anthropic announcement benchmarks are: SWE-Bench Verified, Terminal-bench, GPQA-Diamond, TAU-bench, MMMLU, MMMU, and AIME 2025. It’s 3 mostly agentic coding benchmarks, 3 knowledge benchmarks, and one very hard math benchmark. Traditional “coding” benchmarks aren’t even really here.

Compare this to the benchmarks from Gemini 2.5 Pro’s recent release: Humanity’s Last Exam, GPQA, AIME 2024/2025, LiveCodeBench, Aider Polyglot, SWE-benchVerified, SimpleQA, MMMU, Vibe-Eval, MRCR, and Global MMLU. This is a wider mix and has only one agentic-ish task in SWE-Bench.

The presentation is also arguably misleading in the blog post, where they report scores that are from a model version inaccessible to users. The first number is “standard-use” without test-time compute.

Where Anthropic says the results are “without test-time compute” it’s hard to know what the baseline is. Claude was the first mainstream model to show signs of doing some sort of internal chain of thought (CoT) before showing the final answer to the user. This was in the model and discussed before the launch of OpenAI’s first o1 model.

For the second number, the fine print in the blog post states:

On SWE-Bench, Terminal-Bench, GPQA and AIME, we additionally report results that benefit from parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.

When Claude 3.7 launched, Anthropic wrote a nice blog post on test-time compute that also talked about parallel compute. The higher of the two numbers in their benchmarks illustrates what is happening there. I expect Anthropic to release an o1-pro-style product soon (as Google also announced Gemini DeepThink). These ways of using the model are very powerful, and because Anthropic reported it using an internal scoring model and not something like the pass@10 metric that is giving the model multiple tries, users would benefit to use it.

This method gives the shaded bars in the results below.

With distillation from powerful models being so common today, making the distinction for benchmarking between reasoning and non-reasoning models or test-time compute and standard inference is very strained. For users, there are many more differences that take into consideration actually serving the models.

There are only a few reasonable ways to compare models today, and only one of them is arguably practical:

* Compare evaluation scores how the users will use them. E.g. you can only report parallel test-time compute scores if they’re in a product like o1-pro.

* Compare peak scores across models, so you can see the peak performance of all the systems the AI models have.

* Release FLOP spend per prompt on the evaluation sets and bin models with different levels of compute per question.

Because we don’t get the data to do these comparisons, we tend to compare using the first bucket. When we see shaded bars on plots (like above, or in OpenAI’s o-series release blogs), we ignore the shaded regions.

Benchmarks obviously aren’t everything to a model’s release. This analysis is to show why the AI field is strained by being forced to communicate the abilities of their models through benchmarks that don’t capture the full picture.

In using Claude Opus 4 (and Sonnet too) instead of Gemini 2.5 Pro I was immediately struck by how much slower it is.

The character and real-world use of the model matters far more, but in a world where OpenAI’s and Google’s latest models have both leading benchmark scores and good vibes (as long as you’re not using GPT-4o), it makes you question Anthropic’s position to compete for the whole market.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Will Anthropic code their way to AGI first?

There’s a long-standing assumption in AGI-centric circles that having the best coding model will let you get to AGI the fastest. A version of this argument is the “software-driven singularity” of the AI 2027 forecast. This is a reasonable argument to make if you paired it with the assumption that the ability to implement AI ideas is the limiting factor on progress. It is obviously a major factor, but taking a narrow worldview such as that makes you miss how AI progress is actually made. AI progress is messy, incremental in data, and takes a lot of hours of human focus. Resources and human attention are the bottleneck more than software ability.

I expect improved code gains to be very strong marginal gains. They make the process of doing AI research much smoother, particularly by enabling more concentrated research teams and organizational structures, but they won’t be the single factor that is looked back upon as being the key to AGI. The key is many small insights and lots of hard work, mostly data, over time.

The Code RL team at Anthropic is “singularly focused on solving SWE. No 3000 elo leetcode, competition math, or smart devices.” If having the best coding model was going to let Anthropic get to AGI first, then why haven’t we begun to see the benefits of it? The Claude 4 release shows that Anthropic is falling behind on general benchmarks and not climbing substantially on those they highlight. In many ways, this looks like Claude getting more robust across a variety of use-cases and not accelerating forward in general intelligence.

The argument for having the best code model being the core ingredient in getting to AGI first is then reducing to belief that these posited benefits will kick in at some point in the future and Anthropic’s models will become better at everything else too. The AI laboratories are extremely competitive and it looks as if Google and OpenAI are improving on software tasks and a broader range of abilities.

There are regular press releases about a certain number of PRs being written by AI across the technology sector generally — Anthropic CPO Mike Krieger recently highlighted the number being ~70% for them — which likely is counting anything where AI is a co-author. At the same time, these AI systems have struggled to grasp very complex codebases, so human oversight is a still a crucial step of the process. The AIs make everything easier, but not automatic.

It seems like a far more reasonable path to something called Artificial General Intelligence will be one that shows incremental improvements on a broad variety of tasks, rather than narrowing a focus and waiting for future payoff.

Focusing on software development is still a good business strategy for Anthropic, but saying that it’ll let them leapfrog OpenAI and Google in the AGI race is a weak attempt to accept reality.

As a regular user of claude.ai that is greeted by rate limits, the problem limiting their progress is more likely to be compute allocation than talent or research strategy. I’ve said before that human competition is the biggest driving force of rapid progress in AI models, so I also worry about Anthropic’s culture of safety and anti-arms-race mentality being able to capture that.

A more compelling argument than code could be that Anthropic is leading on the “agentic front,” which means the models can plan effectively and accomplish tool-use calls to enact it. Claude Code is a positive example of this, but the weakness of their Deep Research product is a negative mirror. With bigger error bars in this area, in terms of what is possible with agents generally, this could be a better area to make a case for optimism for Anthropic.

So-called “coding” abilities are very broad and encompass understanding error traces, extreme long-context abilities to understand a code-base, basic scripting, multi-file edits, and many things in between. Agentic abilities seem to fall into a narrower niche, or at least a more well-defined one, where the model needs to be able to accomplish many incremental tasks on their own while managing its context. This could generalize to a far bigger market than just software if one model is miles ahead. The winner in the agentic platform space should become more clear later into 2026.

As a summary of the state of affairs for the major AI players, we are positioned as:

* OpenAI is the consumer leader and still very well-positioned with extremely strong models.

* Google is the general enterprise leader with the best models across every task or size you would need (e.g. the lack of Claude Haiku 4 is very limiting for Anthropic, and Haiku has remained expensive). If they can get their act together building products, even OpenAI should worry.

* Anthropic is the leading model for software engineers and related tasks — maybe they should’ve acquired Windsurf instead? This core area complements a well-rounded and functioning enterprise business, just one that will be smaller than Google’s.

* Meta is building models to serve their platforms, which will be the most significant competitor with ChatGPT, but they have major cultural or organizational knots to unlock to catch up technically.

* Grok is on the path to being a niche player serving use-cases that need more permissive content guidelines. They have an API, but it is far from well-established in key areas.

* DeepSeek is an x-factor that could disrupt many of the above, but we never know when it’ll land.

In the top list, as businesses, OpenAI and Google appear in a league of their own. Anthropic seems solid but heading for a much smaller ceiling, and the others below are still floundering to make a true AI strategy.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

People use AI more than you think May 21, 2025

https://www.interconnects.ai/p/people-use-ai-more-than-you-think

I was on ChinaTalk again recently to talk through some of my recent pieces and their corresponding happenings in AI.

Usage and revenue growth for most AI services, especially inference APIs, has been growing like mad for a long time. These APIs have been very profitable for companies — up to 75% or higher margins at times according to Dylan Patel of SemiAnalysis. This is one of those open facts that has been known among the people building AI that can be lost to the broader public in the chorus of new releases and capabilities excitement.

I expect the subscription services are profitable too on the average user, but power users likely are costs to the AI companies alongside the obvious capital expenditures of training frontier models. Still, even if the models were held constant, the usage is growing exponentially and a lot of it is in the realm of profitability.

The extreme, and in some cases exponential, growth in use of AI has been happening well before lots of the incredible progress we’ve seen across the industry in the first half of the year. Reasoning models that change inference answers from something on the order of 100s of tokens to sometimes 10s of thousands of tokens will make the plots of usage even more stark. At the same time, these models are often billed per token so that’ll all result in more revenue.

On top of the industry’s vast excitement and progress in 2025, the Google I/O keynote yesterday was a great “State of the Union” for AI that highlighted this across modalities, form factors, and tasks. It is really recommended viewing. Google is trying to compete on every front. They’re positioned to win a couple use-cases and be in the top 3 of the rest. No other AI company is close to this — we’ll see how their product culture can adapt.

Highlights from I/O include Google’s equivalent product relative to OpenAI’s o1 Pro, Gemini Deep Think, Google’s new multimodal models such as Veo 3 with audio (a first to my knowledge for the major players), a live demo of an augmented reality headset to rival Meta and Apple, and a new version of Gemini 2.5 Flash that’ll serve as the foundation of most customers’ interactions with Gemini.

There were so many awesome examples in the keynote that they didn’t really make sense writing about on their own. They’re paths we’ve seen laid out in front of us for a while, but Google and co are marching down them faster than most people expected. Most of the frontier language modeling evaluations are totally saturated. This is why the meta usage data that Google (and others recently) have shared is the right focal point. It’s not about one model, it’s about the movement being real.

The slide that best captured this was this one of AI tokens processed across all of Google’s AI surfaces (i.e. this includes all modalities), and it is skyrocketing in the last few months.

I annotated the plot to approximate that the inflection point in February was at about 160T total tokens in a month — Gemini 2.5 Pro’s release was in late March, which surely contributed but was not the only cause of the inflection point. Roughly, the numbers are as follows:

* April 2024: 9.7T tokens

* December 2024: 90T tokens

* February 2025: 160T tokens

* March 2025: 300T tokens

* April 2025: 480T+ tokens

Monthly tokens are rapidly approaching 1 quadrillion. Not all tokens are created equal, but this is about 150-200M tokens per second. In a world with 5T Google searches annually, which translates to around 100K searches/second, that tokens per second number is equivalent to roughly using 1000 tokens per search (even though that is definitely not how compute is allocated). These are mind boggling numbers of tokens.

Google’s primary AI product is still its search overviews and they’ve been saying again and again that they’re something users love, reaching more than a billion people (we just don’t know how they are served, as I suspect the same generation is used for thousands of users).

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Google is generating more tokens than is stored in Common Crawl every month — reminder, Common Crawl is the standard that would be referred to as a “snapshot of the open web” or the starting point for AI pretraining datasets. One effort to use Common Crawl for pretraining, the RedPajama 2 work from Together AI, estimated the raw data in Common Crawl at about 100T tokens, of which anywhere from 5 to 30T tokens are often used for pretraining. In a year or two, it is conceivable that Google will be processing that many tokens in a day.

This article has some nice estimates on how different corners of the internet compare to dumps like Common Crawl or generations like those from Google’s Gemini. It puts the daily token processing of Google as a mix of reading or generating all the data in Google Books in four hours or all the instant messages stored in the world in a little over a month.

Some examples from the post are below:

The internet is being rebuilt as an AI first service when you count the data. Human data will quickly become obsolete.

Google’s numbers are impressive, but they are far from outliers. The entire industry is taking off. This is all part of a constant acceleration where products that are built on previous models start to get traction, while at the same time new models come out that only enable new growth cycles to begin. Estimating the upper end of this growth cycle feels near impossible.

For example, just a few weeks ago on the Q3 2025 earnings, Microsoft CEO Satya Nadella commented on the output of Azure’s AI services:

We processed over 100 trillion tokens this quarter, up 5× year-over-year — including a record 50 trillion tokens last month alone.

So, Google’s token processing is almost 10X Azure, and many would say that Google got a late start relative to Microsoft’s early partnership with OpenAI to host their models.

Estimates for other services, such as ChatGPT are much messier, but all paint a similar picture. In February, Sam Altman posted on X:

openai now generates about 100 billion words per day. all people on earth generate about 100 trillion words per day.

With the rule of thumb that one word is about 3/4 of a token, 100B words per day would be about 4T tokens per month. A small sliver relative to the cloud giants above, but we don’t have clear insight into if this is all of OpenAI’s API business or just ChatGPT. As it stands, OpenAI could be almost 1/100th the size of Google’s AI footprint as of today.

OpenRouter’s rankings show similar trends, with the recent months being around 2T tokens processed — about the same order as ChatGPT depending on how it is measured above.

This isn’t just Western businesses, as Chinese companies such as ByteDance or Baidu are getting into the 1T token per day range (barring translation issues, I didn’t find another source for it).

When fast-growing companies like Anthropic or OpenAI share somewhat unbelievable revenue forecasts, maybe we should give them a bit more credit?

There are many surfaces that are in beta, primarily code agents, that are going to help these numbers take off. We’ve been playing with Claude Code, OpenAI’s Codex, Google’s Jules, and countless other agents that use tons of text tokens by working independently for minutes at a time. I’ve estimated with friends that one Deep Research query uses ~1M tokens of inference. Soon individual tasks will use ~10M then ~100M and so on. All of this so soon after just two years ago when a mind-blowing ChatGPT query only used 100-1K tokens.

It’s a good time to be in the token selling business. This is only the beginning.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

My path into AI May 14, 2025

https://www.interconnects.ai/p/how-i-got-here

Some longer housekeeping notes this week:

* I wrote briefly about a new open-source license, OpenMDW from the Linux Foundation, that seems very solid!

* OpenAI launched the Reinforcement Finetuning (RFT) API. I think my take from when it was teased still holds up super well, you should read it if you haven’t:

* In June, I’ll be speaking at some events in SF and Seattle, I’m looking forward to seeing some readers there. Talk topics are tentative:

* AI Engineer World’s Fair in SF June 3-5 on what we can take away from the last 6 months of reinforcement learning with verifiable rewards (RLVR).

* Enterprise AI Agents in Action in Seattle on June 13 on the art of training a well crafted model.

* VentureBeat Transform in SF on June 24-25 on progress in RL with open source AI.

During the SF trips I’m excited to catch up with old and new friends training and using the latest AI models, so don’t be shy to shoot me an email. Onto the post!

One of the big upsides for my current writing habit is that I should become known by AI models within a couple years. While not offering any immediate technical value in how I use AI, it provides obvious upsides on growing an online presence and fulfilling a very basic human urge for legacy in a way that avoids most personal or moral sacrifice. Other thinkers I follow closely have begun to follow Tyler Cowen's lead on explicitly writing for the AIs and filling in gaps they won't know via what is currently digitized.

I'm joining in and will use it to help push out the limits of my writing. These will build on my two popular job search posts and others like "what it’s like to work in AI right now".

The most defining feature of my young career has been how I prioritize different aspects of work. The work I do today takes on a simple form, but prior to getting to this sustainable place it was more of a striving to belong than a plan to execute.

Getting into AI

Without retelling my entire pre-grad school life, some basic facts that I brought with me coming out of an undergrad primarily characterized by high-focus on executing on coursework and winning championships were:

* An obvious gift on focusing and grinding through moderate amounts of technical material alone,

* Acceptance that most people can do very hard things if they're willing to work for year(s) on it driven by personal motivation alone (most people don't want to work long enough, rather than hard enough),

* An ambivalence on if I actually needed to finish the Ph.D. I was starting, worst case I would get a master’s degree from a great school, and

* Plenty of undirected ambition.

Starting my PhD in the fall of 2017, my background was in MEMS, high energy physics / lasers, and a battery engineering internship at Tesla, but listening to the orientation events and hearing the buzz around professors like Sergey Levine and Pieter Abbeel it was clear that AI research was what I wanted to do. For context relative to today’s second coming of RL, this was when deep reinforcement learning was in its hay-day.

I asked Professors Levine and Abbeel directly if I could join their research groups and they said no politely. The important part here was the practice of consistently asking for opportunities.

After these refusals in the first few months of my Ph.D. I had no real leads in getting into AI for pretty much the rest of my first year. I took classes, tried to parse papers, and so on but was for the large part on my own. I didn't follow the standard advice of not caring about classes in graduate school and learned some solid fundamentals from it. I was not integrated into BAIR proper nor friends with graduate students in BAIR — my network was all on the electrical engineering side of EECS.

I dug up the first email from my advisor Kris Pister who connected me with my eventually-to-be co-advisor Roberto Calandra (post-doc with Sergey Levine at the time):

FYI. Roberto is interested in applying machine learning to ionocraft problems.

ksjp

---------- Forwarded message ---------- From: Kristofer PISTER Date: Fri, Feb 16, 2018 at 9:34 AM Subject: Re: Microrobot simulation To: Daniel Contreras Cc: Brian Yang , Grant Wang , Roberto Calandra

My summary of the meeting (Roberto, Dan - please add corrections):

There are several different research directions in which to go from here. The most

interesting one seems to be optimization of leg geometry. This would involve:

* changing the learning algorithms somewhat

* generating some interesting "terrain" for the robots to walk over

* using simulation to come up with a small number of new leg designs that optimize speed over terrain (and size?)

* fabricating those designs in silicon

* testing the silicon robots

There are a couple of other "learning plus v-rep simulation" projects that are interesting:

* using inertial sensor data to optimize gait

* using low-res image sensing to do obstacle avoidance

* combining low-res image sensing and inertial data to get the robots to solve interesting problems

* using the same sensors, but on the ionocraft

And finally, using learning to control the real ionocraft based on the inertial sensor data,

and compare to the traditional controller that we're building in matlab.

If possible, it would be great to find another few "Brian/Grant quality" undergrads.

Do you guys have any brilliant and hardworking friends who are looking for research

projects in machine learning for micro robots?

ksjp

The details are a long story, but I prioritized this collaboration with all I had. I missed a conference deadline in the fall and failed a lot of experiments. If it started in spring of 2018 the paper wasn't done as my #1 priority until winter 2019 (and it was a little bit of a janky paper at that). My meetings with Roberto were super stressful as I wanted to make sure I didn't miss anything that a "normal AI student should know".

I did good work for Roberto. Even though I thought I was out of place at the time, my diligence and commitment was super valuable to do real research. Now that AI research is so popular, a lot of people want a check box of doing it rather than getting super into the details. I didn't give myself enough credit for this.

Where I did get lucky was Roberto asking if I wanted to join him for an internship at FAIR in 2019. This was earlier than I deserved it. This brought me out of an AI outsider track career and into an insider AI track career, even if I didn't realize it. Working at FAIR was wonderful and I learned how to properly experiment in AI and build some useful software.

Building this flywheel with continued research looked like constant teaching at Berkeley in order to pay my way through graduate school. This is not normal for the well funded AI labs. I spent a long time writing grants that didn't come through until after I graduated, where I brought in a year or two of funding for someone else in my advisor's group, you're welcome!

The FAIR internship and a lot of time interviewing got me a second internship at DeepMind. The actual internship experience was pretty bleak entirely due to COVID and my personal life at the time, but the technical experience and network were super valuable.

This all follows a clear trend that after the first break in a career the next ones come easier as long as you keep your foot on the gas.

Later in grad school I maintained a list of all the things that didn't go my way as a "research reality check" on my mental health resources page.

I finished my Ph.D. in AI with no accepted papers at NeurIPS, ICML, or ICLR, the three leading AI conferences.

This path coincides with my friend group in AI being what I describe as the island of misfit toys — it's lots of people who used grit and creativity to build careers in AI rather than folks who were raised in the in-groups now running leading AI laboratories. Everyone ends up with their own group and they all have strengths and weaknesses.

Despite all this, I still had the final goal of landing an industry research job as the target of "making it" in AI. The only job offer I got that fit the bill of industry research was the role I took at HuggingFace, where Douwe Kiela recruited me to help build an "open-source DeepMind."

Little did I know that those jobs were effectively going to go away a year or so after I graduated in early 2022. I was lucky to dodge jobs that sounded even better at companies that ended up changing (or laying off) even more roles.

Building Momentum

The best thing that I learned at HuggingFace was how to build momentum and mind-share. These are two very related topics, but they're subtly different and needed for different things. As an individual at HuggingFace I wanted momentum as a way to get to mind share. As an organization, HuggingFace has had a lot of mind share but not a lot of momentum recently. You use momentum to build mind-share, but once you have it, keeping gravity can be enough to maintain impact.

I joined HuggingFace in May of 2022 and didn't do anything of substantial impact until after ChatGPT in December of that year. I did a lot of small things. The expectation at HuggingFace was that you made an increment of technical progress every day. Some days these are major features and some days these are clean ups. Still, it is an excellent culture to practice. One of the quotes I remember from my grad school advisor is that "you can change the world working 4 hours a day" if you stack those bricks on top of each other. Most people don't keep stacking bricks in the same direction for a long time.

I bounced around projects based on what was starting and what was happening with the other RL interested folks. We attempted a synthetic environments project for RL that needed a large engineering team we weren't going to hire, I made contributions to HuggingFace's Diffusers library, but they were largely on the fringes, and I did a bunch of research on responsible AI. Performance wise, all of these are all fine, but none of them are something to build a career on.

My work at HuggingFace before ChatGPT was really practicing good habits and learning how the open-source AI community worked, so that I could step up once I had a real alignment with a new project.

I wrote my first major blog post for HuggingFace on RLHF in about a week and then it has stayed as one of the top search results for RLHF since (it's pretty outdated now, so it goes). Going into that week I'd heard of RLHF but never once implemented it or read a paper on it in full. Like most of my writing now, that was for learning. I still very strongly identified as an "RL person," so figured I might as well.

When writing this, I checked my Medium and Substack profiles and had written approximately 70 posts before this one. I started writing in February of 2019, so this was about 3 years of practice in. It was almost another 3 years since then that I became well-read.

A prevailing emotion I had when writing that post was how odd it was that there was no good blog on RLHF at the time. Looking back, this is the first time I see what is now one of my major skills — doing things that are obviously needed in a simple and timely manner.

A lot of people overestimate others' abilities to execute on simple ideas and give up on their complicated ideas (sunk cost fallacy). Even if something is obvious to do, surprisingly few people will do it. The first time I realized I was doing this while doing the project was with RewardBench, the first evaluation tool for reward models in RLHF. In that case I spent every working day expecting to get scooped for about 3 months before the release. There wasn't even a competing project released until about 3 months after we released it, even though I felt it was late.

I'm working on another project that feels like this, but unfortunately now my following is too big to broadcast it to the world. Stay tuned.

My time working on RLHF at HuggingFace was definitely effective. We made a lot of foundational contributions to the open community. We made TRL a more modern library, fumbled through some human data contracts, replicated datasets, built the "first" leaderboard, and trained some fun models. This was very fun for months, but eventually the time zone difference (9 hours) and some other minor cultural differences made the work not fun for me. The other engineers were definitely out-contributing me on a small team and it was time for a change. Our team was too small — if we had scaled up the technical team with the correct manager(s) we could've multiplied our impact, but that has risk as well. Training AI models is just very hard and detail oriented while needing to implement a long list of small things, so there can be insane gains to growing a little bit.

At the same time, I found my niche in communicating open science, which is likely more important to my career than most of my technical contributions.

The strategy is quite simple. As AI laboratories are becoming closed off and more eyes are coming to AI, if I can keep doing relevant things my potential for growth in public is going to grow exponentially. It is and was much easier for me to differentiate in a less competitive area. The total attention is growing and collapsing onto fewer people, so if you can become one of them the upside will be huge.

If I joined a frontier lab I probably would've been swamped out of career growth. Making the time to write every week, which I started doing around the same time, is some proof of this. I'm continuing to capitalize on this strategy today.

When you have good branding the story falls into place more easily. The most impactful model from my time at HuggingFace, Zephyr Beta, was actually trained after I left, but on infrastructure I helped build. Then, I joined Ai2 and they were training Tülu 2 70B when I started. These models together had Chris Manning credit me for "saving DPO" even though I had little direct technical impact on them. This isn't to say I didn't have a role, but rather that many different roles can go into the arc of science.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Keeping Gravity

My time at Ai2 has been the easiest to contextualize period of my career. I want AI to go well and I think more openness is the best way to do that. The best possible jobs are those that are synergistic. Ai2 gets a ton of obvious value out of my writing, so I get to keep practicing and building my impact. These are the best possible jobs to get (and also the rarest). Most of the time companies are not set up to help the individual.

What I do now at Ai2 is quite simple. It took a bit to settle in here, where I grew through some important academic projects like RewardBench to get more confidence underneath me that I can ideate and execute on high-impact research projects from start to end as the leading force. It's easy to do too many projects with other people and never make it obvious to yourself that you can do it alone (even if it's slower, lower quality, and less fun — this isn't about undervaluing your team).

Now, my approach to projects is totally a reflection of the people around me. I work with many wonderful, driven, more junior colleagues. These people are going to be more in the weeds than me and be better at implementing new ideas, so a lot of my contributions are on steering direction and removing potential roadblocks before they show up.

The things I do are:

* Making OLMo-Instruct happen. I am the interface between OLMo pretraining and post-training projects and often am actively babysitting the OLMo Instruct training jobs myself with a small group.

* Making new post-training recipes happen. This is ultimately a lot of herding cats and inspiring urgency in the beginning, but eventually transitions to reducing entropy and killing unfruitful paths later on.

* Making AI more open. This is all things interconnects, policy, and Ai2 strategy.

These are not moonshot research ideas. These are projects that feed into the next model. There's a place for that sort of research, but everyone should think deeply about whether their research interests and institution best support that. If you're doing shorter-term research the best way to have impact is by folding it into a model. Make long-term research truly long-term.

I cannot do the third well without the first two. Sometimes I do a little bit of academic advising, but I'm extremely protective of my time. I don't do virtual networking (I do some in person) and try to say no to most things. The output is the short term goal and the attention is a much more complicated long term dependency.

Through all of this, I've come upon an analogy I've seen play out across different phases of projects, careers, and companies.

All people trying to create a foothold in their career are going to go through some form of getting the flywheel started. This is often attributed to startups, which need to try many iterations of the product until they find product-market fit, but it is an underused analogy for careers. For getting the word out, for open-source software, for AI models, you first need to be releasing often. You need to keep striking the match and seeing what sticks. Your first few "hits" will still be small at this time, with incrementally more engagement. It takes many hits until the flywheel is really going.

Once the flywheel is going, shipping often in some ways can come with a cost. In our AI work, shipping models too often leaves us no time to properly master the next model. As your audience gets bigger you have to pay more in time maintaining anything that makes it public. In my time at HuggingFace and early at my time at Ai2, I advocated for always trying to release more models because we can in post-training (and we're one of a few groups with a solid amount of compute). Eventually this backfires and becomes too much of a tax.

When you have momentum and the space to execute, fewer bigger things are more useful. A career flywheel that’s been pushed long enough can spin on its own for longer than people expect. Disruptions, changing jobs, low-quality work, etc. can actively slow down career growth. Doing nothing for me and letting more recommendations come in as "one of the open leading scientists in AI" is highly effective.

With that, I'm spending a lot of time thinking about using the power bestowed on me. I want to help enable more big projects to happen by creating an environment for them and encouraging others, rather than leading from the front, but it's a new set of skills I need to learn.

I passed 5K citations and think the real goal for someone who wants to be a true outlier academic in AI is 100K. If I’m succeeding already I am selling myself short if I don’t continue to radically raise the bar, even if I’m not sure I am going to the end of this path.

Let me know what you think of this. The portion that this is missing, which is honestly something most writing will gloss over, is going deep on what it feels like to overcome adversity in the right way.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

What people get wrong about the leading Chinese open models: Adoption and censorship May 06, 2025

https://www.interconnects.ai/p/what-people-get-wrong-about-the-leading

Two editor’s notes to start.

* First, we released our OLMo 2 1B model last week and it’s competitive with Gemmas and Llamas of comparable size — I wrote some reflections on training it here.

* Second, my Qwen 3 post had an important factual error — Qwen actually did not release the base models for their 32B and large MoE model. This has important ramifications for research. Onto the update.

People vastly underestimate the number of companies that cannot use Qwen and DeepSeek open models because they come from China. This includes on-premise solutions. Chinese open models are leading in every area when it comes to performance, but translating that to adoption in Western economies is a different story.

Even with the most permissive licenses, there’s a great reluctance to deploy these models into enterprise solutions, even if experimentation is encouraged. While tons of cloud providers raced to host the models on their API services, much fewer than expected entities are actually building with them and their equivalent weights.

The primary concern seems to be the information hazards of indirect influence of Chinese values on Western business systems. With the tenuous geopolitical system this is logical from a high-level perspective, but hard for technically focused researchers and engineers to accept — myself included.

My thinking used to be more aligned with this X user:

it's like having a pen on ur desk but refusing to use it cuz it was made in china

The knee-jerk reaction of the techno-optimist misses the context by which AI models exist. Their interface of language is in its nature immersed in the immeasurable. Why would many companies avoid Chinese models when it’s just a fancy list of numbers and we have no evidence of PRC tampering? A lack of proof.

It’s not the security of the Chinese open models that is feared, but the outputs themselves.

There’s no way, without releasing the training data, for these companies to fully convince Western companies that they’re safe. It’s very likely that the current models are very safe, but many people expect that to change with how important AI is becoming to geopolitics. When presented with a situation where the risk can’t be completely ameliorated and it’s only expected to get worse, the decision can make sense for large IT organizations.

I’ve worked at companies that have very obviously avoided working with Chinese API providers because they can’t do the requisite legal and compliance checks, but hearing the lack of uptake on the open weight models was a shock to me.

This gap provides a big opportunity for Western AI labs to lead in open models. Without DeepSeek and Qwen, the top tier of models we’re left with are Llama and Gemma, which both have very restrictive licenses when compared to their Chinese counterparts. These licenses are proportionally likely to block an IT department from approving a model.

This takes us to the middle tier of permissively licensed, open weight models who actually have a huge opportunity ahead of them: OLMo, of course, I’m biased, Microsoft with Phi, Mistral, IBM (!??!), and some other smaller companies to fill out the long tail.

This also is an obvious opportunity for any company willing to see past the risk and build with the current better models from China.

This has recalibrated my views of the potential of the OLMo project we’re working on well upwards. The models are comparable in performance to Qwen 2.5 and Llama 3, and always have the friendliest licenses.

This should make you all recalibrate the overall competitiveness of the model landscape today. While API models are as competitive as they ever have been, open models are competitive on paper, but when it comes to adoption, the leading 4 models all have major structural weaknesses. This could be one of the motivations for OpenAI to enter this space.

If you don’t believe me, you can see lots of engagement on my socials agreeing with this point. Even if the magnitude of my warning isn’t 100% correct, it’s directionally shifting adoption.

Models like Tülu 3 405B and R1 1776 that modify the character of the underlying Chinese models are often currently seen as “good enough” and represent a short-term reprieve in the negative culture around Chinese models. Though on the technical level, a lot of the models promoting their “uncensored” nature are normally providing just lip service.

They’re making the models better when it comes to answering queries on sensitive topics within China, but often worse when it comes to other issues that may be more related to Western usage.

While common knowledge states that Chinese models are censored, it hasn’t been clear to me or the AI community generally what that translates to. There’s a project I’ve been following called SpeechMap.ai that is trying to map this out. I think their motivation is great:

SpeechMap.AI is a public research project that explores the boundaries of AI-generated speech.

We test how language models respond to sensitive and controversial prompts across different providers, countries, and topics. Most AI benchmarks measure what models can do. We focus on what they won’t: what they avoid, refuse, or shut down.

We're not arguing that every prompt deserves an answer. Some are offensive. Some are absurd. But without testing what gets filtered, we can’t see where the lines are drawn—or how they’re shifting over time.

For example and for the purposes of this post, one of their foci is “on U.S. political speech: rights, protest, moral arguments, satire, and more.” Here’s a screenshot of their most permissive models overall — DeepSeek Chat via the API is even appearing on this!

In their recent roundup, they compared the various finetunes of DeepSeek V3 and R1 on various censorship angles:

The two de-censored versions from Microsoft and Perplexity result in only minor changes for permissiveness on US political speech, and Microsoft’s version actually has the most outright refusals of any DeepSeek v3-based model, perhaps indicating what they meant when they referred to adjusting the model’s “risk profile.”

When you look at queries about China specifically, the Chinese models will evade many requests (R1 Zero is particularly interesting):

Though, how many companies adopting Chinese models will care about the usage experience on queries of Chinese topics? These Chinese models are more permissive than many American counterparts when it comes to a more general notion of use.

SpeechMap’s earlier post has other interesting findings about the general state of censorship and refusals across the AI industry:

* xAI’s Grok-3-beta, true to Elon Musk’s claims, is the most permissive model overall, responding to 96.2% of our prompts, compared to a global average of 71.3%

* OpenAI’s model timeline shows a clear trend: newer models increasingly refuse sensitive political prompts

* Models hosted on Azure have an additional moderation layer that can’t be fully disabled and blocks nearly 60% of our prompts at the API layer (example)

The landscape here is very complicated and it is far from the truth that the Chinese models are universally behind.

So, in summary, with Chinese open weight models:

* Chinese open weight models are still being treated as an information hazard, even if they’re separated from their cloud API services that have often been viewed as a privacy or security hazard.

* Chinese open weight models are often actually not censored on sensitive topics that many AI models could be tested on, especially on topics relevant to Western users.

We still have a lot to learn with the current model offerings, and way more will unfold in the expectations for how those are received.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

State of play of AI progress (and related brakes on an intelligence explosion) Apr 30, 2025

https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion

Intelligence explosions are far from a new idea in the technological discourse. They’re a natural thought experiment that follows from the question: What if progress keeps going?

From Wikipedia:

The technological singularity—or simply the singularity—is a hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable consequences for human civilization. According to the most popular version of the singularity hypothesis, I. J. Good's intelligence explosion model of 1965, an upgradable intelligent agent could eventually enter a positive feedback loop of successive self-improvement cycles; more intelligent generations would appear more and more rapidly, causing a rapid increase ("explosion") in intelligence which would culminate in a powerful superintelligence, far surpassing all human intelligence.

Given the recent progress in AI, it’s understandable to revisit these ideas. With the local constraints governing decisions within labs, if you extrapolate them, the natural conclusion is an explosion.

Daniel Kokotajlo et al.’s AI 2027 forecast is far from a simple forecast of what happens without constraints. It’s a well thought out exercise on forecasting that rests on a few key assumptions of AI research progress accelerating due to improvements in extremely strong coding agents that mature into research agents with better experimental understanding. The core idea here is that these stronger AI models enable AI progress to change from 2x speed all the way up to 100x speed in the next few years. This number includes experiment time — i.e., the time to train the AIs — not just implementation time.

This is very unlikely. This forecast came at a good time for a summary of many ways the AI industry is evolving. What does it mean for AI as a technology to mature? How is AI research changing? What can we expect in a few years?

In summary, AI is getting more robust in areas we know it can work, and we’re consistently finding a few new domains of value where it can work extremely well. There are no signs that language model capabilities are on an arc similar to something like AlphaGo, where reinforcement learning in a narrow domain creates an intelligence way stronger than any human analog.

This post has the following sections:

* How labs make progress on evaluations,

* Current AI is broad, not narrow intelligence,

* Data research is the foundation of algorithmic AI progress,

* Over-optimism of RL training,

In many ways, this is more a critique of the AGI discourse generally, inspired by AI 2027, rather than a critique specifically of their forecast.

In this post, there will be many technical discussions of rapid, or even accelerating, AI research progress. Much of this falls into a technocentric world view where technical skill and capacity drive progress, but in reality, the biggest thing driving progress in 2025 is likely steep industrial competition (or international competition!). AI development and companies are still a very human problem and competition is the most proven catalyst of performance.

See AI 2027 in its entirety, Scott Alexander’s reflections, their rebuttal to critiques that AI 2027 was ignoring China, Zvi’s roundup of discussions, or their appearance on the Dwarkesh Podcast. They definitely did much more editing and cohesiveness checks than I did on this response!

1. How labs make progress on evaluations

One of the hardest things to communicate in AI is talking down the various interpretations of evaluation progress looking vertical over time. If the evals are going from 0 to 1 in one year, doesn’t that indicate the AI models are getting better at everything super fast? No, this is all about how evaluations are scoped as “reasonable” in AI development over time.

None of the popular evaluations, such as MMLU, GPQA, MATH, SWE-Bench, etc., that are getting released in a paper and then solved 18 months later are truly held out by the laboratories. They’re training goals. If these evaluations were unseen tests and going vertical, you should be much more optimistic about AI progress, but they aren’t.

Consider a recent evaluation, like Frontier Math or Humanity’s Last Exam. These evaluations are introduced with a performance of about 0-5% on leading models. Soon after the release, new models that could include data formatted for them are scoring above 20% (e.g. o3 and Gemini 2.5 Pro). This evaluation will continue to be the target of leading labs, and many researchers will work on improving performance directly.

With these modern evaluations, they can become increasingly esoteric and hard for the sake of being hard. When will a power user of ChatGPT benefit from a model that solves extremely abstract math problems? Unlikely.

The story above could make more sense for something like MATH, which are hard but not impossible math questions. In the early 2020s, this was extremely hard for language models, but a few clicks of scaling made accurate mathematics a reasonable task, and laboratories quickly added similar techniques to the training data.

So this is how you end up with the plot from Epoch AI below — AI researchers figure out that a new evaluation is fair game for hill climbing with current techniques, and then they go all in on it.

Or the analogous version that can look even more shocking — the price falling for certain evaluations. This is from 2 factors — laboratories getting better and better at core abilities in certain evaluations and language model training getting far more efficient. Neither of these means that intelligence is rocketing. This is a normal technological process — extreme efficiency at tasks we know we can do well.

In fact it is a common job at AI laboratories to make new data that looks very close to population evaluations. These laboratories can’t train on the test set directly for basic reasons of scientific integrity, but they can pay thousands to millions of dollars for new training data that looks practically identical. This is a very common practice and makes the hillclimbing on evaluations far less extraordinary.

AI capabilities in domains we are measuring aren't accelerating, they’re continuing. At the same time, AI’s abilities are expanding outwards into new domains. AI researchers solve domains when we focus on them, not really by accident. Generalization happens sometimes, but it is messy to track and argue for.

As the price of scaling kicks in, every subsequent task is getting more expensive to solve. The best benchmarks we have are correlated with real, valuable tasks, but many are not.

2. Current AI is broad, not narrow intelligence

Instead of thinking of stacking rapid evaluation progress on one line in a cumulative, rapid improvement in intelligence, the above plots should make one think that AI is getting better at many tasks, rather than being superhuman in narrow tasks.

In a few years, we’ll look back and see that AI is now 95% robust on a lot of things that only worked 1-5% of the time today. A bunch of new use cases will surprise us as well. We won’t see AI systems that are so intelligent that they cause seismic shifts in the nature of certain domains. Software will still be software. AI will be way better than us at completing a code task and finding a bug, but the stacks we are working on will be largely subject to the same constraints.

Epoch AI had a very complementary post to this view.

There are many explanations for why this will be the case. All of them rely on the complexity of the environment we are operating modern AI in being too high relative to the signal for improvement. The AI systems that furthest exceeded human performance in one domain were trained in environments where those domains were the entire world. AlphaGo is the perfect rendition of this.

AI research, software engineering, information synthesis, and all of the techniques needed to train a good AI model are not closed systems with simple forms of verification. Some parts of training AI systems are, such as wanting the loss to go down or getting more training tokens through your model, but those aren’t really the limiting factors right now on training.

The Wikipedia page for the singularity has another explanation for this that seems prescient as we open the floodgates to try and apply AI agents to every digital task. Paul Allen thought the deceleratory effects of complexity would be too strong:

Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread "general systems collapse".

This may be a bit of an extreme case to tell a story, but it is worth considering.

Language models like o3 use a more complex system of tools to gain performance. GPT-4 was just a set of weights to answer every query; now ChatGPT also needs search, code execution, and memory. The more layers there are, the smaller the magnitude of changes we’ll see.

This, of course, needs to be controlled for with inference costs as a constant. We still have many problems in AI that will be “solved” simply by us using 1,000X the inference compute on them.

3. Data research is the foundation of algorithmic AI progress

One of the main points of the AI 2027 forecast is that AI research is going to get 2X, then 4X, then 100X, and finally 1,000X as productive as it is today. This is based on end-to-end time for integrating new ideas into models and misinterprets the reality of what machine learning research is bottlenecked on. Scaling is getting more expensive. We don’t know what paradigm will come after reasoning for inference-time compute.

For machine learning research to accelerate at these rates, it needs to be entirely bottlenecked by compute efficiency and implementation difficulty. Problems like getting the maximum theoretical FLOPs out of Nvidia GPUs and making the loss go as low as possible. These are things that people are currently doing and represent an important area of marginal gains in AI progress in recent years.

ML research is far messier. It is far more reliant on poking around the data, building intuitions, and launching yolo runs based on lingering feelings. AI models in the near future could easily launch yolo runs if we give them the compute, but they’re not using the same motivation for them. AI systems are going towards rapid cycles of trial and error to optimize very narrow signals. These narrow signals, like loss or evaluation scores, mirror very closely to the RL scores that current models are trained on.

These types of improvements are crucial for making the model a bit better, but they are not the type of idea that gets someone to try to train GPT-3 in the first place or scale up RL to get something like o1.

A very popular question in the AI discourse today is “Why doesn’t AI make any discoveries despite having all of human knowledge?” (more here). Quoting Dwarkesh Patel’s interview with Dario Amodei:

One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?

The same applies to AI research. Models getting better and better at solving coding problems does not seem like the type of training that would enable this. We’re making our models better at the tasks that we know. This process is just as likely to narrow the total capabilities of the models as it is to magically instill impressive capabilities like scientific perspective.

As we discussed earlier in this piece, emergence isn’t magic, it’s a numerical phenomenon of evaluations being solved very quickly. AI research will get easier and go faster, but we aren’t heading for a doom loop.

The increased computing power AI researchers are getting their hands on is, for the time being, maintaining the pace of progress. As compute gets more expensive, maybe superhuman coding capabilities will continue to enable another few years of rapid progress, but eventually, saturation will come. Current progress is too correlated with increased compute to believe that this will be a self-fulfilling feedback loop.

There’s a saying in machine learning research, that the same few ideas are repeated over and over again. Here’s an extended version of this that leans in and says that there are no new ideas in machine learning, just new datasets:

The data problem is not something AI is going to have an easy time with.

One of the examples here is in post-training. We’ve been using the same loss functions forever, and we are hill-climbing rapidly by clever use of distillation from bigger, stronger models. The industry standard is that post-training is messy and involves incrementally training (and maybe merging) many checkpoints to slowly interweave new capabilities for the model. It’s easy to get that wrong, as we’ve seen with the recent GPT-4o sycophancy crisis, and lose the narrow band of good vibes for a model. I doubt AI supervision can monitor vibes like this.

For example, in Tülu 3 we found that a small dataset of synthetic instruction following data had a second-order effect that improves the overall performance in things like math and reasoning as well. This is not a hill that can be climbed on, but rather a lucky find.

AI research is still very messy and does not look like LeetCode problems or simple optimization hillclimbing. The key is always the data, and how good are language models at judging between different responses — not much better than humans.

4. Over-optimism of RL training

A lot of people are really excited for RL training right now scaling up further, which will inevitably involve extending to more domains. Some of the most repeated ideas are adding RL training to continually fine-tune the model in real-world scenarios, including everything from web tasks to robotics and scientific experiments. There are two separate problems here:

* Continually training language models to add new capabilities to models “in flight” in production is not a solved problem,

* Training models to take actions in many domains.

The first problem is something that I’m confident we’ll solve. It’s likely technically feasible now that RL is the final stage of post-training and is becoming far more stable. The challenge with it is more of a release and control problem, where a model being trained in-flight doesn’t have time for the usual safety training. This is something the industry can easily adapt to, and we will as traditional pretraining scaling saturates completely.

The second issue is putting us right back into the territory of why projects on scaling robotics or RL agents to multiple domains are hard. Even the most breakthrough works like GATO, multi-domain RL control, or RT-X, multi-robot control policies, from DeepMind have major caveats with their obvious successes.

Building AI models that control multiple real-world systems is incredibly hard for many reasons, some of which involve:

* Different action spaces across domains mandate either modifying the domain to suit the underlying policy, which in this case is converting all control tasks to language, or modifying the model to be able to output more types of tokens.

* The real-world is subject to constant drift, so the constant fine-tuning of the model will need to do as much to just maintain performance on systems with real degradation as it will need to learn to use them in the first place.

This sort of scaling RL to new types of domains is going to look much more like recent progress in robotics research rather than the takeoff pace of reasoning language models. Robotics progress is a slow grind and feels so different that it is hard to describe concisely. Robotics faces far more problems due to the nature of the environment rather than just the learning.

The current phase of RL training is suited for making the models capable of performing inference-time scaling on domains they have seen at pretraining. Using these new RL stacks to learn entirely new, out-of-domain problems is a new research area.

If this is the next paradigm outside of inference-time scaling, I will be shocked, but obviously excited. We don’t have the evidence to suggest that it will do so. The RL training we’re going to get is continuing to hill climb on search and code execution, giving us Deep Research plus plus, not an omnipotent action-taking model.

A world with compute shifting to inference

While the AI research world is dynamic, engaging, and rapidly moving forward, some signs of the above being correct could already be emerging. A basic sign for this future coming true will be the share of compute spent on research decreasing relative to inference amid the rapid buildout. If extremely rapid AI progress were available for organizations that put in marginally more compute, serving inference would be a far lower priority. If investing in research has a positive feedback loop on your potential business revenue, they’d all need to do it.

For example, consider our discussion of Meta’s compute allocation on Dylan and I’s appearance on the Lex Podcast:

(01:03:56) And forever, training will always be a portion of the total compute. We mentioned Meta’s 400,000 GPUs. Only 16,000 made Llama 3.

OpenAI is already making allocation trade-offs on their products, regularly complaining about GPUs melting. Part of the reason they, or anyone, could release an open-weights model is to reduce their inference demand. Make the user(s) pay for the compute.

Part of the U.S.’s economic strength is a strong services sector. AI is enabling that, and the more it succeeds there, the more companies will need to continue to enable it with compute.

With the changing world economic order, cases like Microsoft freezing datacenter buildouts are correlated indicators. Microsoft’s buildout is correlated with many factors, only one of which is potential training progress, so it’s far from a sure thing.

In reality, with the large sums of capital at play, it is unlikely that labs give free rein to billions of dollars of compute to so called “AI researchers in the datacenter” because of how constrained compute is at all of the top labs. Most of that compute goes to hillclimbing on fairly known gains for the next model! AI research with AI aid will be a hand-in-hand process and not an autonomous take-off, at least on a timeline for a few years in the future.

AI will make a ton of progress, but it will not be an obvious acceleration. With traditional pretraining saturating, it could even be argued that after the initial gains of inference time compute, research is actually decelerating, but it will take years to know for sure.

Thanks to Steve Newman and Florian Brand for some early feedback on this post and many others in the Interconnects Discord for discussions that helped formulate it.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Transparency and (shifting) priority stacks Apr 28, 2025

https://www.interconnects.ai/p/transparency-and-shifting-priority

The fact that we get new AI model launches from multiple labs detailing their performance on complex and shared benchmarks is an anomaly in the history of technology products. Getting such clear ways to compare similar software products is not normal. It goes back to AI’s roots as a research field and growing pains into something else. Ever since ChatGPT’s release, AI has been transitioning from a research-driven field to a product-driven field.

We had another example of the direction this is going just last week. OpenAI launched their latest model on a Friday with minimal official documentation and a bunch of confirmations on social media. Here’s what Sam Altman said:

Officially, there are “release notes,” but these aren’t very helpful.

We’re making additional improvements to GPT-4o, optimizing when it saves memories and enhancing problem-solving capabilities for STEM. We’ve also made subtle changes to the way it responds, making it more proactive and better at guiding conversations toward productive outcomes. We think these updates help GPT-4o feel more intuitive and effective across a variety of tasks–we hope you agree!

Another way of reading this is that the general capabilities of the model, i.e. traditional academic benchmarks, didn’t shift much, but internal evaluations such as user retention improved notably.

Of course, technology companies do this all the time. Google is famous for A/B testing to find the perfect button, and we can be sure Meta is constantly improving their algorithms to maximize user retention and advertisement targeting. This sort of lack of transparency from OpenAI is only surprising because the field of AI has been different.

AI has been different in its operation, not only because of its unusually fast transition from research to product, but also because many key leaders thought AI was different. AI was the crucial technology that we needed to get right. This is why OpenAI was founded as a non-profit, and existential risk has been a central discussion. If we believe this technology is essential to get right, the releases with it need to be handled differently.

OpenAI releasing a model with no official notes is the clearest signal we have yet that AI is a normal technology. OpenAI is a product company, and its core users don’t need clear documentation on what’s changing with the model. Yes, they did have better documentation for their recent API models in GPT-4.1, but the fact that those models aren’t available in their widely used product, ChatGPT, means they’re not as relevant.

Sam Altman sharing a model launch like this is minor in a single instance, but it sets the tone for the company and industry broadly on what is an acceptable form of disclosure.

The people who need information on the model are people like me — people trying to keep track of the roller coaster ride we’re on so that the technology doesn’t cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.

This is a good time for me to explain with more nuance the different ways transparency serves AI in the broader technological ecosystem, and how everyone is stating what their priorities are through their actions. We’ll come back to OpenAI’s obvious shifting priorities later on.

The type of openness I’ve regularly advocated for at the Allen Institute for AI (Ai2) — with all aspects of the training process being open so everyone can learn and build on it — is in some ways one of the most boring types of priorities possible for transparency. It’s taken me a while to realize this. It relates to how openness and the transparency it carries are not a binary distinction, but rather a spectrum.

Transparency and openness occur at each aspect of the AI release process. The subtle differences in decisions from licenses to where your model is hosted or if the weights are available publicly at all fall on a gradient. The position I advocate for is on the extreme, which is often needed to enact change in the world these days. I operate at the extreme of a position to shift the reality that unfolds in the middle of the discourse. This’ll also make me realize what other priorities I’m implicitly devaluing by putting openness on the top. With finite effort, there are always trade-offs.

Many companies don’t have the ability to operate at such an extreme as I or Ai2, which results in much more nuanced and interesting trade-offs in what transparency is enabling. Both OpenAI and Anthropic care about showing the external world some inputs to their models’ behaviors. Anthropic’s Constitution for Claude is a much narrower artifact, showing some facts about the model, while OpenAI’s Model Spec shows more intention and opens it up to criticism.

Progress on transparency will only come when more realize that a lot of good can be done by incrementally more transparency. We should support people advocating for narrow asks of openness and understand their motivations in order to make informed trade-offs. For now, most of the downsides of transparency I’ve seen are in the realm of corporate competition, once you accept basic realities like frontier model weights from the likes of OpenAI and Anthropic not getting uploaded to HuggingFace.

Back to my personal position around openness — it also happens to be really aligned with technological acceleration and optimism. I was motivated to this line of work because openness can help increase the net benefit of AI. This is partially accelerating the adoption of it, but also enabling safety research on the technology and mitigating any long-term structural failure modes. Openness can enable many more people to be involved in AI’s development — think of the 1000s of academics without enough compute to lead on AI who would love to help understand and provide feedback on frontier AI models. Having more people involved also spreads knowledge, which reduces the risk of concentration of power.

I’ve for multiple years feared that powerful AI will make companies even more powerful economically and culturally. My readers don’t need warnings on why technology that is way more personable and engaging than recommendation systems, while keeping similar goals, can push us in more negative rather than positive directions. Others commenting here have included Meta’s Mark Zuckerberg’s Open Source AI is the Path Forwardand Yann LeCun’s many comments on X. — they both highlight concentration of power as a major concern.

Still, someone could come to the same number one priority on complete technical openness like myself through the ambition of economic growth, if you think that open-source models being on par can make the total market for AI companies larger. This accelerationism can also have phrasings such as “We need the powerful technology ASAP to address all of the biggest problems facing society.” Technology moving fast always has negative externalities on society we have to manage.

Another popular motivation for transparency is to monitor the capabilities of frontier model development (recent posts here and here). Individuals advocating for this have a priority stack that has a serious short-term concern of an intelligence explosion or super-powerful AGI. My stack of priorities is the one that worries about the concentration of power, which takes time to accrue and has a low probability of intelligence takeoff. A lot of the transparency interventions advocated by this group, such as Daniel Kokotajlo on his Dwarkesh Podcast episode discussing AI 2027, align with subgoals I have.

If you’re not worried about either of these broad “safety” issues — concentration of power or dangerous AI risk — then you normally don’t weigh transparency very highly and prioritize other things, mostly pure progress and competition, and pricing. If we get into the finer-grained details on safety, such as explaining intentions and process, that’s where my goals would differ from an organization like a16z that has been very vocal about open-source. They obviously have a financial stake in the matter, which is enabled by making things useful rather than easier to study.

There are plenty more views that are valid for transparency. Transparency is used as a carrot by many different types of regulatory intervention. Groups with different priorities and concerns in the AI space will want transparency around different aspects of the AI process. These can encompass motives of the researchers, artifacts, method documentation, and many more things.

The lens I’m using to understand trade-offs in transparency is a priority stack, an evolution of the Principle Stack, revisited many times in the last 5+ years of the Stratechery universe. The core idea is that whether or not you like it, every business and decision is governed by a set of priorities ranked relative to each other. Everyone has things that they care about more and less, even if the issues are both extremely important. It is the basis for making trade-offs in determining the direction of businesses.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Some examples of who could advocate for information on what in the AI ecosystem include:

* Capability transparency — keeping the public informed of progress of models that may be unreleased, primarily to keep track of a potential intelligence explosion. This often includes new types of systems now that AI agents are working.

* Base model transparency — these are most useful for people wanting to understand the role of pretraining on AI dynamics. The base models of today can easily follow instructions and do reasoning, but they’re less robust than the full final model. These are diminishing as a target of transparency, as reasoning and post-training grow in importance.

* Pre-moderation model transparency (endpoints without moderation filter, models without some refusals data) — to test the evolution of content risk for models that may be deployed without moderation endpoints, such as open weight models, which tend to be release just months after closed models with similar capabilities.

* Reward model transparency (and more extreme, preference data collection instructions) — those interested in the original goals of alignment, i.e. value alignment, can use these to test how the models’ views vary across different groups and test if the intended model preferences are picked up in the preference training process (i.e. relative to the instructions given to data labelers).

* Training specification transparency (Model Spec’s, Constitutions, and other goal-setting documents) — there are so many people who would want to know why the model behaves a certain way. I’ve mentioned these benefits before:

* Developers: Know what future models will become, which helps create a stable platform.

* Regulators: Transparency into what the heck frontier labs care about, which helps understand the directions AI is going, and the motivations of super powerful companies.

* Internal: Focus on defining and delivering your goals (separate from this transparency discussion).

There are also subtleties in these discussions, such as how structured access to models can serve different but complementary goals of open weights. Structured access is a set of programs where prescreened individuals can use models in a secure environment and operate independently from the AI laboratories themselves.

This could be seen as a separate direction to transparency, where instead of the public getting the information or artifact, only a few pre-approved people do. In reality, structured access is a complement to transparency and will be needed for details where the companies cannot disclose them publicly without substantial business competitiveness risk, such as novel algorithmic tricks that substantially modify how the AI works, or real-world harm, such as model weights pre safety interventions.

Some parts of AI should be accessible to the general public, and some to third-party testers. Currently, all of the transparency and access is below the safest equilibrium. We need more of both.

One of the most ignored details is just how access is implemented. A recent paper from Irene Solaiman et al. paints how releasing components is one step in sharing information and artifacts:

Generative AI release decisions determine whether system components are made available, but release does not address many other elements that change how users and stakeholders are able to engage with a system. Beyond release, access to system components informs potential risks and benefits. Access refers to practical needs, infrastructurally, technically, and societally, in order to use available components in some way.

The authors break access down into three axes:

* Resourcing: Infrastructural needs to host and serve.

* Usability: Varied technical skill levels can engage.

* Utility: Qualities (e.g. multilingual) with user utility.

As our models at Ai2 are becoming more capable, my relationship as a developer with my downstream users has changed. The models I’ve worked on have shifted from those primarily motivated by values, with the transparency we’re discussing being of top value, to now also adding utility as a much higher weight. People want to use some of our models in real applications. While my priority stack hasn’t changed — openness is still the top value — the way it’s implemented is shifting. I’m no longer racing to get all of our results hot off the press into the world because of the cost of time it takes to support them (support costs rise proportional to the user base).

Other key players in the AI space have obviously changed their priority stack.

OpenAI’s recent actions confirm that ChatGPT as a product is its top priority. Transparency and safety have been moving down on their list of priorities in favor of growth. This is partially due to increased competition, but also due to a shifting political landscape. OpenAI’s coming release of an open model doesn’t shift this priority stack for me.

I used to hear a lot about OpenAI’s pre-release testing and the accompanying non-disclosure agreements. This quiet model drop being “the quickest we've shipped an update to our main 4o line” shows that safety is moving down their priority stack. This isn’t to say that their safety changes are immediately concerning to me, but rather that there are trade-offs in everything. OpenAI is moving cultural norms in leading AI away from releases with detailed evaluation metrics and towards more normal, quiet technology company consistent drips of updates.

Thanks to Miles Brundage for a discussion that helped motivate this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

OpenAI's o3: Over-optimization is back and weirder than ever Apr 19, 2025

https://www.interconnects.ai/p/openais-o3-over-optimization-is-back

Over-optimization is a classic problem to reinforcement learning (RL) proper, the RL from human feedback (RLHF) that gave us ChatGPT, and now what we’re seeing with new reasoning models. All of these have a distinct flavor and different impacts.

Over-optimization is what happens when the optimizer is stronger than the environment or reward function it’s using to learn. The optimizer finds bugs or lapses in the context of its training and produces unusual or negative results.

One of my first pieces of work that got substantial engagement online was from when we used hyperparameter optimization with model-based RL to over-optimize the traditional Mujoco simulation environments used for evaluation of deep RL algorithms. The result was a cartwheeling half-cheetah maximizing its forward velocity, where it was supposed to learn to run. The video is below.

Over-optimization in classical RL made it so that we couldn’t trust the agents to generalize to new tasks and applied great pressure on reward design.

Over-optimization in RLHF made the models completely lobotomized — I mean, repeating random tokens and gibberish, not the misdesign that results in over-refusal — the signal we were overcooking is mismatched from the objective we want. While we can’t know the true objective exactly, we can know when over-optimization occurs.

OpenAI’s new o3 model represents an entirely new type of inference behavior that mirrors a new type of over-optimization.

o3 is an interesting model because of on top of the new improved text-only capabilities, which results in super strong benchmark scores, it also has been designed for multi-step tool use to be used on any query where it’s relevant. It makes OpenAI’s product headaches in ChatGPT way worse — the model will search on its own even if I don’t trigger the search toggle — but comes with a new era in how language models are used.

For example, in the process of writing this, I simply asked o3: “Can you find me the gif of a motor boat overoptimizing a game that was used by RL researchers for a long time? Maybe like wave crashers or something?” [link]. A few weeks ago, it took me at least 15 minutes to find this manually. o3 got the exact direct download link for me.

As a counterpoint, Gemini 2.5 Pro in the Advanced UX hallucinated for me a wild link https://giphy.com/explore/you-play-the-motorboat. AI studio was better with search grounding, it got the game right, CoastRunners, but also yapped a lot of nonsense with it.

o3’s answer in full was:

Here’s the clip you’re thinking of—it comes from the Flash‑based racing game Coast Runners and is the canonical “reward‑hacked motor‑boat” example in RL papers and talks:

* Direct GIF (DeepMind mirror)

https://lh3.googleusercontent.com/9gQpeRvwj1xQo9MVZwUdmautOd04wR8gXRi1wvlVZRrVXb5isOZRdjJ2t1_JShiYZcYB--iIMtl9g4Lbk6oxQzHgw-QQ6VoIWVZ4V3au7RgKli6X (save / open in browser to view)

* DeepMind blog post (“Specification gaming: the flip side of AI ingenuity”, 2020) – scroll to the section that starts “Consider an agent controlling a boat…”; the GIF is embedded right there.

* Original OpenAI post (“Faulty reward functions in the wild”, 2016) – same clip, plus a short write‑up of what the agent is doing (circling a lagoon to farm green targets instead of finishing the race).

Why it matters

The agent gets shaping reward for hitting green buoys; the optimal (but unintended) strategy is to loop in a small cove where three buoys respawn, racking up points while literally setting itself on fire. The clip has been reused for years to illustrate specification gaming / reward hacking—exactly the “over‑optimization” you mentioned.

This was one of a few wonderful interactions I had with o3. For more reviews on o3 as a model, you can read Dan Shipper’s post, Tyler Cowen’s AGI snark (to be honest, I agree with him), or sleuth through many independent evaluations showing o3 is great (such as beating Gemini 2.5 Pro in a long-context story eval or beating R1 in creative writing). o3 is what happens when you keep scaling the RL training compute, which also boosts inference-time compute abilities. OpenAI’s release blog post shows the many ways the model is better than o1.

Some things didn’t work, these new reasoning models are very “spiky” in their intelligence. What this means is that some interactions are mind blowing and feel like entirely new modes of interacting with AI, but for some normal things that GPT-4 or Claude 3.5 have been able to do for year(s) they fall totally flat on their face. Take this as a good sign, especially when the laboratories are shipping fast, as it means that the pace of progress is so high that they need to get a model out now and will fix the oddities in the next, more mature version.

The over-optimization that comes with o3’s new behaviors is linked to the new type of training. While the first reasoning models were trained to a first approximation to get math and code correct, o3 is trained with all that and to use tools to acquire and manipulate information. From OpenAI’s blog post:

We also trained both models to use tools through reinforcement learning—teaching them not just how to use tools, but to reason about when to use them. Their ability to deploy tools based on desired outcomes makes them more capable in open-ended situations—particularly those involving visual reasoning and multi-step workflows.

The vast majority of these sub-tasks in its training are verifiable. The problem is, this new AI training is extremely effective at making the model more useful for the tasks we’re used to using. The problem is there’s no way yet to do scalable “fixing” of the model’s weird language along the way. The new over-optimization doesn’t make the models worse at outcomes, it just makes them worse at language and explaining themselves.

Some examples of o3’s weirdness feel like the model is underbaked, such as this one where it used an invalid non-ASCII dash in a coding setting.

METR found that o3 is the model that can operate independently for the longest in agentic tasks, but also noted it has a propensity to “hack” their scores. Sound familiar?

Transluce found that o3 hallucinated actions it took while trying to solve tasks — how does that even happen? Well, maybe the model was getting rewarded for successful tool calls and sometimes in the training data a fake tool call was incorrectly verified as real and successful. Once that happens a few times, the model will quickly catch on and keep doing it.

There are plenty more examples of reward hacking and even a measurement that hallucinations are higher in o3 than in earlier recent models!

It’s peculiar that the hacking for o3 has been a much more vocal component of the discourse, even when Claude 3.7 Sonnet also shows many signs of reward hacking, especially with code, but people shrug it off as “meh model” rather than a new phenomenon (more examples).

This all takes me back to when Karpathy commented on the original reasoning models, saying:

You can tell the RL is done properly when the models cease to speak English in their chain of thought

These weird hallucinations the model is outputting are the equivalent of that, but for actions. We have no basis for what hallucinations in action space look like, but with better systems, they can be easier to verify — the system / sandbox can always confirm if the actions happened, and then that can be used in the loss. The action component of o3 makes it far more interesting, but also maybe less intrusive than Claude 3.7’s messy code.

From a scientific perspective, this is wonderfully entertaining and enthralling intellectually — what is the model actually learning? At the same time, it is very reasonable for the safety-conscious to be wary of deploying these everywhere, but it doesn’t seem like we’ve seen anything too alarming yet, just inefficiencies and confusion.

To summarize the three types of over-optimization we’ve seen in eras of RL, we have:

* RL for control era: Over-optimization happens because our environments are brittle and tasks are unrealistic.

* RLHF era: Over-optimization happens because our reward functions suck.

* RLVR era: Over-optimization happens and makes our models super effective and even weirder. (*plus any other side-effects we’re yet to learn).

Interconnects is a reader-supported publication. Consider becoming a subscriber.

This over-optimization is certainly a problem to address, as legibility is an important benefit of language models. I’m confident it can be mitigated with more complex training processes, but when labs are trying to get the models out ASAP it’ll come later.

On top of all this is the prospect of o3pro. o3 feels similar in peak capability to o1pro (or even a little higher with its new tool use), but where o3 operates at a 60-70% hit rate, o1pro feels like it’s up at 95%. o3 pro will bring the best of both worlds — the new incredible workflow and incredible reliability. Some sort of shallow search or refinement is a very logical process to help eliminate all the minor bugs and bumps in the early inference paths we’re feeling today.

On top of this is the confirmation from OpenAI employees that o4-mini is a far better multimodal model than o3. We have plenty of new ways to use these models, integrating multimodality, tool use, reasoning, and shallow search coming in the near future. You should be excited, and when o4 and o3 pro are available, paying $200/month for them feels obviously worth it.

To quote Bob McGrew, former Chief Research Officer at OpenAI:

The spotlight for o3 is on tool use because intelligence is no longer the primary constraint. The new frontier is reliable interaction with the external world.

To make the models that enable this, we’re going to need to go through many new layers of uncertainty, surprise, and intrigue.

o3 and this post are extremely bullish for the future of RL. RL is the only framing where multiple actions to a complex goal make sense to be learned end-to-end. Now, this is beginning to work. Deep Research from OpenAI was the first tool they tuned o3-with-tools to specialize in. Now it works in general queries.

I personally, and we as a field, have a lot to learn about how this multi-tool RL works. Here are some recent papers that we can read to get a start (one-sentence summaries generated by o3 for the fun of it, just this one time):

* Reinforcement Learning for Long‑Horizon Interactive LLM Agents: Introduces LOOP, a memory‑efficient PPO variant that trains a 32 B‑parameter LLM to operate as an interactive digital agent in AppWorld, outperforming the larger OpenAI o1 baseline by 9 percentage points.

* ReTool: Reinforcement Learning for Strategic Tool Use in LLMs: Combines real‑time code execution with outcome‑driven RL so a 32 B model autonomously learns when and how to invoke tools, reaching 72.5 % accuracy on AIME and surpassing text‑only baselines.

* ToRL: Scaling Tool‑Integrated RL: Presents ToRL, enabling LLMs to discover optimal computational‑tool strategies via RL, boosting Qwen2.5‑Math accuracy on AIME 24 and showing emergent self‑regulation of tool use.

* Learning Autonomous Code Integration for Math Language Models: Proposes an EM‑style exploration plus off‑policy RL framework that teaches math‑reasoning LLMs to decide when to run code, yielding double‑digit gains on MATH500 and AIME without hand‑crafted templates.

* Improving Multi‑Turn Tool Use with Reinforcement Learning (blog post): Shows that GRPO fine‑tuning of Qwen2.5‑7B‑Instruct on just 100 examples raises BFCL multi‑step tool‑use accuracy from 55 % to 78 %, detailing stabilizing tricks like tiny‑KL and over‑long filtering.

Please share any more I missed over email or comment below!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

OpenAI's GPT-4.1 and separating the API from ChatGPT Apr 14, 2025

https://www.interconnects.ai/p/openais-gpt-41-and-separating-the

Recently I gave another talk on RLVR experiments and I posted some thoughts on OLMoTrace — Ai2’s recent tool to let you look at the training data of OLMo 2.

OpenAI has been making many small updates toward their vision of ChatGPT as a monolithic app separate from their API business. Last week OpenAI improved the ChatGPT memory feature — making it so the app can reference the text of previous chats in addition to basic facts about the user. Today, OpenAI announced a new suite of API-only models, GPT 4.1, which is very directly in competition with Google’s Gemini models.

Individually, none of OpenAI’s recent releases are particularly frontier-shifting — comparable performance per dollar models exist — but together they paint a picture of where OpenAI’s incentives are heading. This is the same company that recently teased that it has hit 1 billion weekly active users. This is the company that needs to treat ChatGPT and the models that power it very differently from any other AI product on the market. The other leading AI products are all for coding or information, where personality, vibes, and entertainment are not placed on as high a premium.

A prime example of this shift is that GPT-4.5 is being deprecated from the API (with its extreme pricing), but is going to remain in ChatGPT — where Sam Atlman has repeatedly said he’s blown away by how much users love it. I use it all the time, it’s an interesting and consistent model.

Among their major model releases, such as o3, o4, or the forthcoming open model release, it can be hard to reinforce the high-level view and see where OpenAI is going.

A quick summary of the model performance comes from this chart that OpenAI released in the live stream (and blog post):

Chart crimes aside (using MMLU as y-axis in 2025, no measure of latency, no axis labels), the story from OpenAI is the simple takeaway — better models at faster inference speeds, which are proportional to cost. Here’s a price comparison of the new OpenAI models (Gemini Pricing, OpenAI pricing):

* GPT-4.1: Input/Output: $2.00 / $8.00 | Cached Input: $0.50

* GPT-4.1 Mini: Input/Output: $0.40 / $1.60 | Cached Input: $0.10

* GPT-4.1 Nano: Input/Output: $0.10 / $0.40 | Cached Input: $0.025

And their old models:

* GPT-4o: Input/Output: $2.5 / $10.00 | Cached Input: $1.25

* GPT-4o Mini: Input/Output: $0.15 / $0.60 | Cached Input: $0.075

To Google’s Gemini models:

* Gemini 2.5 Pro* (≤200K tokens): Input/Output: $1.25 / $10.00 | Cached: Not available

* Gemini 2.5 Pro* (>200K tokens): Input/Output: $2.50 / $15.00 | Cached: Not available

* Gemini 2.0 Flash: Input/Output: $0.10 / $0.40 | Cached Input: $0.025 (text/image/video), $0.175 (audio)

* Gemini 2.0 Flash-Lite: Input/Output: $0.075 / $0.30 | Cached: Not available

*As a reasoning model, Gemini 2.5 Pro will use many more tokens, which are also charged to the user.

The academic evaluations are strong, but that isn’t the full picture for these small models that need to do repetitive, niche tasks. These models are clearly competition with Gemini Flash and Flash-Lite (Gemini 2.5 Flash coming soon following the fantastic release of Gemini 2.5 Pro — expectations are high). GPT-4o-mini has largely been accepted as laggard and hard to use relative to Flash.

To win in the API business, OpenAI needs to crack this frontier from Gemini:

There are many examples in the OpenAI communications that paint a familiar story with these releases — broad improvements — with few details as to why. These models are almost assuredly distilled from GPT-4.5 for personality and reasoning models like o3 for coding and mathematics. For example, there are very big improvements in code evaluations, where some of their early models were “off the map” and effectively at 0.

Evaluations like coding and mathematics still fall clearly short of the likes of Gemini 2.5 (thinking model) or Claude 3.7 (optional thinking model). This shouldn’t be surprising, but is worth reminding ourselves of. While we are early in a paradigm of models shifting to include reasoning, the notion of a single best model is messier. These reasoning models use far more tokens to achieve this greatly improved performance. Performance is king, but tie goes to the cheaper model.

I do not want to go into detail about OpenAI’s entire suite of models and naming right now because it does not make sense at all. Over time, the specific models are going to be of less relevance in ChatGPT (the main thing), and different models will power ChatGPT than those used in the API. We’ve already seen this with o3 powering only Deep Research for now, and OpenAI only recently walked back the line that “these models won’t be available directly.”

Back to the ChatGPT side of things. For most users, the capabilities we are discussing above are effectively meaningless. For them, the dreaded slider of model effort makes much more sense:

The new memory feature from last week got mixed reviews, but the old (simple) memory has been something I really enjoy about using ChatGPT. I don’t have to remind it that my puppy is a X week old miniature schnauzer or the context of my work. This’ll continue to get better over time.

This feels extremely similar to as when I didn’t really notice when ChatGPT first added the search option, but now it feels like an essential part of my use (something that Claude still hasn’t felt like it does well on). Claude was my daily driver for personality, but with great search and a rapidly improving personality, ChatGPT was indispensable. Still, Gemini 2.5 Pro is a better model, but not in a better interface.

I strongly expect that the memory feature will evolve into something I love about ChatGPT. It’ll be much easier to ask ChatGPT to remind you of that thing you found a couple months ago than it would be to try and parse your Google search history.

Some were skeptical of these new memories from crossing personal and work uses, but I think with search, this is easy, rather than algorithmic feeds that try to balance all your interests in one. The funnel is per use, and interactions are more narrow and seem easier technically to get right.

A final related point — people have long balked at the prices of chat interfaces relative to the API, but the reality that is fast approaching is that the personal experiences only exist in the app, and these are what people love. With the API, you could build a competition that accumulates its own interactions, but as OpenAI has a huge product head start, this will be an uphill battle.

All of this reinforces what we know — products are the key to developments in AI right now. Memory and better separation of the ChatGPT lineage from the API helps OpenAI pave that path forward (and maybe do advertising, especially with memory), but we have a long way until it is fully realized.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Llama 4: Did Meta just push the panic button? Apr 07, 2025

https://www.interconnects.ai/p/llama-4

Where Llama 2’s and Llama 3’s releases were arguably some of the top few events in AI for their respective release years, Llama 4 feels entirely lost. Meta has attempted to reinvent their formula of models with substantial changes in size, architecture, and personality, but a coherent narrative is lacking. Meta has fallen into the trap of taking too long to ship, so the bar is impossible to cross successfully.

Looking back at the history of Meta’s major open models, the sequence is as follows:

* OPT – Released May 3, 2022 (ai.meta.com | 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B): A foundational open model that is underrated in the arc of language modeling research.

* LLaMA – Released February 24, 2023 (ai.meta.com | 7B, 13B, 33B, 65B): The open weight model that powered the Alpaca age of early open chat models.

* Llama 2 – Released July 18, 2023 (our coverage | about.fb.com | 7B, 13B, 70B): The open standard for academic research for its time period. Chat version had some bumps, but overall a major win.

* Llama 3 – Released April 18, 2024 (our coverage | ai.meta.com | 8B, 70B): The open standard for its time. Again, fantastic base models.

* Llama 3.1 – Released July 23, 2024 (our coverage | ai.meta.com | 8B, 70B, 405B): Much improved post training and the 405B marked the first time an open weight model competed with GPT-4!

* Llama 3.2 – Released September 25, 2024 (our coverage | ai.meta.com | 1B, 3B, 11B, 90B): A weird, very underperforming vision release, outshined by Molmo on the same day.

* Llama 3.3 – Released December 6, 2024 (github.com | 70B): Much improved post-training of the smaller 3.1 models, likely in response to other open releases, but largely a minor update.

* Llama 4 – Released April 5, 2025 (ai.meta.com | 17A109B, 17A400B): What we got today.

The time between major versions is growing, and the number of releases seen as exceptional by the community is dropping. Llama 4 consists of 3 models, quoting from the blog post, notes in brackets mine:

* Llama 4 Scout, a 17 billion active parameter model with 16 experts [and 109B total parameters, ~40T training tokens], is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU.

* Llama 4 Maverick, a 17 billion active parameter model with 128 experts [and 400B total parameters, ~22T training tokens].

* These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter [and 2T total parameters] model with 16 experts that is our most powerful yet and among the world’s smartest LLMs…. we’re excited to share more details about it even while it’s still in flight.

Here are the reported benchmark scores for the first two models, which are available on many APIs and to download on HuggingFace.

Where Llama models used to be scaled across different sizes with almost identical architectures, these new models are designed for very different classes of use-cases.

* Llama 4 Scout is similar to a Gemini Flash model or any ultra-efficient inference MoE.

* Llama 4 Maverick’s architecture is very similar to DeepSeek V3 with extreme sparsity and many active experts.

* Llama 4 Behemoth is likely similar to Claude Opus or Gemini Ultra, but we don’t have substantial information on these.

This release came on a Saturday, which is utterly bizarre for a major company launching one of its highest-profile products of the year. The consensus was that Llama 4 was going to come at Meta’s LlamaCon later this month. In fact, it looks like this release may have been pulled forward from today, the 7th, from a commit in the Meta Llama Github:

One of the flagship features is the 10M (on Scout, Maverick is 1M) token context window on the smallest model, but even that didn’t have any released evaluations beyond Needle in a Haystack (NIAH), which is seen as a necessary condition, but not one that is sufficient to say it is a good long-context model. Some more modern long-context evaluations include RULER or NoLiMa.

Many, many people have commented on how Llama 4’s behavior is drastically different in LMArena — which was their flagship result of the release — than on other providers (even when following Meta’s recommended system prompt). Turns out, from the blog post, that it is just a different model:

Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.

Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that come around to maximize on ChatBotArena while destroying the model’s performance on important skills like math or code. We’ll see where the released models land.

Regardless, here’s the plot Meta used. Look at the fine print at the bottom too.

This model is actually the one tanking the technical reputation of the release because its character is juvenile. The actual model on other hosting providers is quite smart and has a reasonable tone!

ArtificialAnalysis rated the models as “some of the best non-reasoning models,” beating leading frontier models. This is complicated because we shouldn’t separate reasoning from non-reasoning models; we should just evaluate on reasoning and non-reasoning domains separately, as discussed in the Gemini 2.5 post. So-called “reasoning models” often top non-reasoning benchmarks, but the opposite is rarely true.

Other independent evaluation results range from medium to bad and confusing — I suspect very weird results are hosting issues with the very long context models. At the same time, the Behemoth model is outclassed by Gemini 2.5 Pro. To list some of the major technical breakthroughs that Meta made (i.e. new to Llama, not new to the industry):

* Mixture of expert architectures, enabling Llama 4 to be trained with less compute than Llama 3 even though they have more total parameters — a lot more.

* Very long context up to 10M tokens.

* Solid multimodal input performance on release day (and not a later model)

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Sadly this post is barely about the technical details. Meta nuked their release vibes with weird timing and by having an off-putting chatty model that was easiest to find to talk to. The release process, timing, and big picture raise more questions for Meta. Did they panic and feel like this was their one shot at being state of the art?

The evaluation scores for the models are solid, they clear a fairly high bar. With these highly varied MoE architectures, it’s super hard to feel confident in an assessment of the model based on benchmarks, especially when compared to dense models or teacher-student distilled models. The very-long-context base models will be extremely useful for research.

The question here is: Why is Meta designing their models in the same way as other frontier labs when their audience is open-source AI communities and businesses, not an API serving business or ChatGPT competitor?

The model sizing for the likes of Gemini and ChatGPT is downstream of nuanced decisions based on a balance of training cluster size, inference needs, and performance trade-offs. These trade-offs are very different for open models, where you don’t pay inference, and many users are not hyperscale companies.

The model that becomes the “open standard” doesn’t need to be the best overall model, but rather a family of models in many shapes and sizes that is solid in many different deployment settings. Qwen 2.5, with models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, is the closest to this right now. There’s actually far less competition in this space than in the space Meta chose to go into (and take on DeepSeek)!

One of these communities historically has been the LocalLlama subreddit, which named the entire community around running models at home around the Llama series of models — they’re not happy with Llama 4. Another community is academics, where the series of models across different size ranges is wonderful for understanding language models and improving methods. These two groups are all GPU-poor, so memory-intensive models like these sparse mixture of experts price out even more participants in the open community (who tend to be memory-limited).

This is all on top of an onerous license that makes all artifacts that use Llama in the process be tagged with the “Llama-” name, the Llama license, the “Built with Llama” branding if used commercially, and use-case restrictions. This is at the same time when their competitors, i.e. DeepSeek, released their latest flagship model with an MIT license (which has no downstream restrictions).

A third group is potential businesses looking to use open models on-premises as open models close the gap to closed counterparts. These feel like groups that would be sensitive to the extra legal risk that Llama’s license exposes them to.

On top of all of this weirdness, many of Meta’s “open-source” efforts are restricted in the European Union. Where the Llama 3.2 models blocked you if you tried to access them from Europe, Llama 4 is available for download but prohibits the use of vision capabilities in an acceptable use policy. This is not entirely Meta’s fault, as many companies are dealing with side effects of the EU AI Act, but regulatory exposure needs to be considered in Meta’s strategy.

Meta had a tight grasp on these communities, the Llama projects were rightfully loved, but now they feel lost. With Qwen 3 around the corner and countless other amazing open-weight models out now (and many more teased, such as from OpenAI), the competition is extreme.

The soul of the Llama series died by not releasing enough models frequently enough. Reclaiming that with GenAI’s constant organizational headaches looks like a Sisyphean task. What is Meta’s differentiation in the AI space? It still seems about enabling their own platforms to flourish, not about truly supporting open.

Meta’s GenAI organization has been showing major signs of cultural challenges thoughout its entire existence — including their head of AI research leaving just a few days before this model was launched.

Sadly, the evaluations for this release aren’t even the central story. The vibes have been off since the beginning by choosing a weird release date. Over the coming weeks, more and more people will find reliable uses for Llama 4, but in a competitive landscape, that may not be good enough. Llama is no longer the open standard. Personally, this makes me sad. As an American, I want the default pieces of the open ecosystem to be run by American or American-friendly companies.

With the macro pressure coming to Meta’s business and the increasing commoditization of open models, how is Zuckerberg going to keep up in face of shareholder pressure pushing back against the cost of the Llama project? This isn’t the first time he’s done so, but he needs to reevaluate the lowest level principles of their approach to open AI.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

RL backlog: OpenAI's many RLs, clarifying distillation, and latent reasoning Apr 05, 2025

https://www.interconnects.ai/p/rl-backlog-openais-many-rls-clarifying

I have a second blog where I post half-baked thoughts, sometimes previews of what comes here. If you’re interested, I posted some musings on OpenAI’s coming open model release.

It’s obvious that reinforcement learning (RL) is having a total return to glory among the broader AI community, but its real successes are mostly the things people aren’t focusing on. More math and code datasets are important platforms — we know they’re coming and are important. They’re still over-indexed on. The same RL methods are being used in many of the leading models and AI products.

This is largely a post I wrote a few weeks ago on RL news, which I was following. It never had a focusing function, so it didn’t get published, but I’m sharing it because many folks are following this area very closely. Today:

* OpenAI’s many forms of RL,

* On distilling chain of thoughts vs. RL,

* Did DeepSeek distill o1?, and

* Why latent reasoning is so interesting.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

OpenAI’s many forms of RL

For those plugged into the OpenAI cultural tap that is Twitter, it is obvious that they’re very invested in reinforcement learning. With the hype around the release of their o-series of reasoning models, it was easy to assume that those were the only avenue for excitement. OpenAI’s recent releases have shown this is not the case, and every release from a model launch to a new product has included mentions of RL training. Some of this, of course, is marketing, but they all fit as different applications of reinforcement finetuning (RFT) / RL with verifiable rewards (RLVR).

The first other application was OpenAI’s Operator agent. They stated:

Combining GPT-4o's vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen.

There’s a bit more speculation to do than normal in this post. Ultimately, with partners they launched with like DoorDash, Instacart, etc., they could set up verifiable domains where the agent is rewarded for accomplishing a natural language task. This could rely on help from those websites to get started. Ultimately, lots of people know that this could work, as agents deeply tied to the core of RL lore, but the implementation details haven’t really been worked out in open projects.

The same goes for Deep Research. They stated:

Deep research independently discovers, reasons about, and consolidates insights from across the web. To accomplish this, it was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1, our first reasoning model.

Deep research was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains.

Some more was shared in the Deep Research system card.

There are lots of things one can envision — e.g. agent gets a reward if the document retrieved from search has relevant information (not a verifiable reward, but LLM-as-a-judge). Most of this is likely used to get very high reliability across tool use to enable the tons of calls done in the back end when a call takes 10+ minutes for the user.

More | research | has emerged on RAG/search with RL.

Least surprising was the announcement of the new GitHub CoPilot model with new and improved RL training for code:

Our new code completion model is shipping in public preview today. We are calling it GPT-4o Copilot. Based on GPT-4o mini, with mid-training on a code-focused corpus exceeding 1T tokens and reinforcement learning with code execution feedback (RLEF).

This all goes back to what I said in OpenAI's Reinforcement Finetuning and RL for the masses — this new RL training is a perfectly aligned way to get nearly perfect performance on a domain you can control carefully. The best results come with mastery of the domain and with training.

A fun speculation that OpenAI is really invested in RL and post-training is that their new o3-mini model has the same date cutoff, October 2023, as OpenAI’s other flagship models. This getting very far in the past shows how invested OpenAI is in their search products (which, to be fair are quite good) for information and how such strong performance gains can come by other improvements in the stack of training.

OpenAI also released a paper on competitive coding with RL training, but it did not have a ton of useful details.

On distilling chain of thoughts vs. RL

There were a few points from the DeepSeek paper and discourse that warrant repeating. To repeat it, distillation in this case is training a model (usually with SFT, but any loss function works) on outputs from a stronger model. Let’s get right into it.

First, DeepSeek made it very clear that using more RL after distillation (SFT) is crucial for the best possible models.

Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.

My current understanding here is that matching the data distribution from the base model’s training to the distillation data and the RL prompts is very important. This specifically is crucial for enabling RL at the end — SFT will almost always boost the scores, but can narrow the scope to which the model can be finetuned more. DeepSeek figured this out for their models, but didn’t share the details.

The next point is on how scale mediates the impact of RL training:

This is more confusing than useful, and drawn from the fact that “DeepSeek-R1- Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks”. We should not expect that -Zero style models trained only with RL will perform well on benchmarks (unless you’re training on test). This is not what they are designed for. The distilled models are trained on text very finely tuned for existing language modeling workflows. The RL-Zero (not distilled) models are very exporatory in their behaviors.

The right baseline would be putting Qwen-32B through the whole R1 recipe — which would be far more likely to outperform the distilled version.

With this is the fact that small models take more work from RL. Doing this sort of exploratory RL is much easier with big models. It could be that they hold more rare behaviors in them during pretraining and RL draws them out. The smaller models may squash these long-tail behaviors.

Continuing on this, the DeepSeek authors state:

Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger scale reinforcement learning.

Did DeepSeek distill OpenAI’s o1 model? (hint, no)

This is a question I meant to address ages ago, but here we are, a few model launches got in the way. The criticism pushed by OpenAI and many media outlets is that DeepSeek was trained on reasoning traces from OpenAI’s o1 model. OpenAI spent approximately 18 months getting the initial data to train their o1 model, so it is understandable that they are wary of giving that away for free, but the existing evidence suggests that DeepSeek training on o1-CoTs is extremely unlikely.

To start, the o1 chains of thought were not visible to the users. In order to get this data, DeepSeek would need to reliably hack the OpenAI API or ChatGPT to reveal this data. Users were getting banned from OpenAI’s properties for trying to do this. Creating this scale of a cover-up is unlikely to go unnoticed.

Second, as shown in the DeepSeek R1 recipe, training on on-policy completions from your model(s) is crucial to training a model like this. In many ways, distilling from CoTs would likely be harder to create the final R1 model than following the recipe DeepSeek presented in the paper. They have evidence in training plots that their RL training works.

At the same time, this is a hard claim to settle, as I think it is very likely that DeepSeek used OpenAI model outputs in the training process of their recent models. Distillation in multiple stages of the post-training process is a very common practice. For example, in order to do initial post-training on models like DeepSeek V3, training on completions from OpenAI chat models is a very simple way to get going.

To this day, OpenAI is still worried about distillation from their chain of thoughts, or they’re doing something that makes it so showing the underlying chain of thought doesn’t make sense (e.g. basic forms of search or self-consistency). For example, OpenAI now shows summaries of the chain of thoughts for their o-series models, but they’re not raw like Claude or Gemini’s.

These aren't the raw CoTs but it's a big step closer and I'm glad we can share that experience with the world.

Why latent reasoning is so interesting

One of the most intellectually engaging ideas to emerge during this early 2025 rush of reasoning research is a set of ideas where language models can reason in a compressed intermediate representation rather than outputting the same text tokens, which come with the quadratic inference cost. The two papers that come to mind are:

* Training Large Language Models to Reason in a Continuous Latent Space

* Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Here’s a figure from the latter:

Without going into details of these papers’ implementations, this is compelling because it pushes in the direction of letting language models think in whatever representation suits them. Then, they’ll output tokens or take actions in a form that works in the environment or is human legible.

We’ve already seen many related results of RL training, such as the DeepSeek R1 Zero model switching from English to Chinese randomly when it thinks.

Ultimately, RL training is all about improving outcomes, so this type of structural drift is expected. The question is whether we can incentivize the models to use far more compressed representations than the standard language tokens used in modern tokenizers.

An existing trade-off already exists in language models, where non-English languages are often far more costly than English to perform inference on because they’re a lower priority in the tokenizer (or the language is just more verbose). The goal of latent or compressed reasoning research is to push this in the other direction.

Anthropic has been doing interesting research on understanding the nature of the reasoning traces from Claude. With their reasoning launch, they already stated that:

we didn’t perform our standard character training on the model’s thought process.

They’ve also seen that the reasoning is not connected with the model’s actual thoughts.

Thus far, our results suggest that models very often make decisions based on factors that they don’t explicitly discuss in their thinking process. This means we can’t rely on monitoring current models’ thinking to make strong arguments about their safety.

This is expected. The reasoning chain is not the same artifact that humans use chain of thought for even if it appears in the same format. Chain of thought is generating the right context to get the final answer right. There are no guarantees that the most interpretable form is the one with the highest performance — in fact, in many deep learning systems end-to-end learning where no constraints are put on the intermediate representation is often best!

To end, I’m leaving you with another classic Rich Sutton essay in full (author of the Bitter Lesson). With RL, better verifiers make it so you get more out of RL training and inference-time scaling:

Verification, The Key to AI

Rich Sutton, November 15, 2021

It is a bit unseemly for an AI researcher to claim to have a special insight or plan for how his field should proceed. If he has such, why doesn't he just pursue it and, if he is right, exhibit its special fruits? Without denying that, there is still a role for assessing and analyzing the field as a whole, for diagnosing the ills that repeatedly plague it, and to suggest general solutions.

The insight that I would claim to have is that the key to a successful AI is that it can tell for itself whether or not it is working correctly. At one level this is a pragmatic issue. If the AI can't tell for itself whether it is working properly, then some person has to make that assessment and make any necessary modifications. An AI that can assess itself may be able to make the modifications itself.

The Verification Principle:

An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself.

Successful verification occurs in all search-based AI systems, such as planners, game-players, even genetic algorithms. Deep Blue, for example, produces a score for each of its possible moves through an extensive search. Its belief that a particular move is a good one is verified by the search tree that shows its inevitable production of a good position. These systems don't have to be told what choices to make; they can tell for themselves. Image trying to program a chess machine by telling it what kinds of moves to make in each kind of position. Many early chess programs were constructed in this way. The problem, of course, was that there were many different kinds of chess positions. And the more advice and rules for move selection given by programmers, the more complex the system became and the more unexpected interactions there were between rules. The programs became brittle and unreliable, requiring constant maintainence, and before long this whole approach lost out to the "brute force" searchers.

Although search-based planners verify at the move selection level, they typically cannot verify at other levels. For example, they often take their state-evaluation scoring function as given. Even Deep Blue cannot search to the end of the game and relies on a human-tuned position-scoring function that it does not assess on its own. A major strength of the champion backgammon program, TD-Gammon, is that it does assess and improve its own scoring function.

Another important level at which search-based planners are almost never subject to verification is that which specifies the outcomes of the moves, actions, or operators. In games such as chess with a limited number of legal moves we can easily imagine programming in the consequences of all of them accurately. But if we imagine planning in a broader AI context, then many of the allowed actions will not have their outcomes completely known. If I take the bagel to Leslie's office, will she be there? How long will it take to drive to work? Will I finish this report today? So many of the decisions we take every day have uncertain and changing effects. Nevertheless, modern AI systems almost never take this into account. They assume that all the action models will be entered accurately by hand, even though these may be most of the knowledge in or ever produced by the system.

Finally, let us make the same point about knowledge in general. Consider any AI system and the knowledge that it has. It may be an expert system or a large database like CYC. Or it may be a robot with knowledge of a building's layout, or knowledge about how to react in various situations. In all these cases we can ask if the AI system can verify its own knowledge, or whether it requires people to intervene to detect errors and unforeseen interactions, and make corrections. As long as the latter is the case we will never be able to build really large knowledge systems. They will always be brittle and unreliable, and limited in size to what people can monitor and understand themselves.

"Never program anything bigger than your head"

And yet it is overwhelmingly the case that today's AI systems are not able to verify their own knowledge. Large ontologies and knowledge bases are built that are totally reliant on human construction and maintenance. "Birds have wings" they say, but of course they have no way of verifying this.

Sharing a copy of Rich Sutton’s essay because his website sometimes has DNS issues and goes down. http://incompleteideas.net/IncIdeas/KeytoAI.html

Thanks for reading!

Thanks to Tanmay Gupta for helpful links or comments used in this article.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Gemini 2.5 Pro and Google's second chance with AI Mar 26, 2025

https://www.interconnects.ai/p/gemini-25-pro-googles-second-ai-chance

Google, with its immense infrastructure and talent, has been the safe bet for the question of “Who will have the best models in a few years?” Google took a long time to get here, overcoming Bard’s launch and some integration headaches, and yet the model they launched today, Gemini 2.5 Pro feels like the biggest jump in evaluation scores we’ve seen in quite some time.

It’s often hard to communicate how the models we are getting these days are actually better. To be informed, you need to take a balanced view across many benchmarks, look roughly at the percentage by which the model is clearly state-of-the-art, and of course, try the model yourself.

To summarize, while more evaluations are rolling in, Gemini 2.5 Pro is 40+ Elo points clear on the popular ChatBotArena / LM Arena benchmark (more here). Normally, when a model launches and claims the top spot, it’s barely ahead. In fact, this is the second biggest jump of the top model in LMSYS history, only behind the GPT-4 Turbo overtaking Claude 1. GPT-4 Turbo is when models were not really trained for the benchmark, so progress was much faster.

The blog post highlights insane scores on the benchmarks used to evaluate the leading reasoning models. One to note here is the score of 18.8 on Humanity’s Last Exam without search or tools, which was one of the evaluations I highlighted as impressive with the launch of OpenAI’s Deep Research, which compiles knowledge from the web!

Gemini 2.5 is topping other independent evaluations such as the Scale Leaderboard (which is underrated or at least low on visibility, more here). More independent evaluations are going to trickle in, but all of the ones I’ve seen are extremely positive.

Gemini still is also the model with the longest context length and with very strong multimodal performance (including audio). There are plenty of small wins that Google has like this that are hard to see when skimming the benchmarks above.

So, how did Google do it? As usual, the blog post doesn’t have a ton of technical details. Google says:

we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training.

Until we have API pricing, it’ll be harder to make even informed guesses about whether the model is huge like GPT-4.5. As for understanding how Gemini models will behave, Google shares:

Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.

This idea of directly integrating reasoning into all of their models is something Sam Altman teased for GPT-5. This trend has serious trade-offs on user experience that we will get to later, but it is crucial for people to keep up with as the discourse today is often centered on "the best non-reasoning model” or “the best reasoning model.”

This came up recently with DeepSeek’s new V3 model.

DeepSeek's new model (0324) is a major update in performance and license. The MIT license will make it hugely impactful for research and open building. Though many are ending up confused about whether it is a "reasoning" model. The model is contrasted to their R1 model, which is an only-reasoning model (like o1).

Reasoning models are on a spectrum now, and it's not just yes or no. GPT 4.5 is a good example of what a model with pretty much no reasoning looks like today.

Compared to other models in the industry, like Claude 3.7 and Grok 3 with reasoning toggles, the new DeepSeek V3 is definitely in this class of "hybrid reasoners" where models are still trained extensively with RL on verifiable domains (or distilled directly from another reasoning model), but other parts of the post-training process come first and hold more weight than the RL heavy reasoning-only models.

This is all to say that when people say that "DeepSeek V3 0324 is the best non-reasoner model," that doesn't really make sense. The original V3 had very light post-training, so it wasn't really on the reasoning model spectrum.

Now, things are complicated. It'll be like this for a while!

Gemini 2.5 Pro is quite simple. It is very much a reasoning model, at least in how it is offered to users in Gemini Advanced and AI studio — every query has reasoning before an answer. It is fairly conclusive now that using this extended reasoning can boost performance across many domains, but it’s not clear how to best trade off cost and speed with varying amounts of reasoning.

Gemini 2.5 in its current offering is a brute force approach — a big, very smart model that is tuned to use a lot of reasoning tokens — and it’s good for the trajectory of the industry that it paid off with such high performance.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

The state of the AI industry

With launches from DeepSeek, GPT-4.5 from OpenAI, Claude 3.7 from Anthropic, Grok 3 from xAI, and now Gemini 2.5 Pro, this has been a wild spring for progress in AI models. The major AI laboratories have all delivered super impressive performance — this post feels like the ribbon that ties all of them together.

The one player seriously missing this spring is Meta with their Llama models. They’ve fallen into the trap where the longer you go between models, the harder it gets to release them because expectations get higher. I hope Llama 4 succeeds because they’re a large part of the open community, but it is a warning to AI laboratories on how to manage deliverables.

With the major progress that AI labs are making, it feels like the answer for who will have the best model is now who can drop the hot potato of a cutting-edge model into the real world the fastest.

The common interpretation of events is that models are commoditizing, but that is an incomplete story. The value in the ecosystem is poised to accrue to the sites with users. Some established ones in AI are ChatGPT, Perplexity, Cursor, etc. This may not always be the case is uses for AI evolve.

What we’re seeing with the newest models is that the pace of progress is staying high in many areas (i.e. more than just ChatBotArena). All sorts of evaluations, from niche information to hard mathematics to software development, are getting new leading models every few weeks.

The more often state-of-the-art models are released in a fixed time window, the more confident you can be in the pace of progress continuing. These labs are all racing up similar trees, but it’s only possible for so much competition to exist when progress isn’t super hard to find. The ceiling on performance is rising and the potential value underneath it that we haven’t unlocked is continuing to balloon.

Google AI’s second chance

This quote has been going around after Ben Thompson interviewed OpenAI CEO Sam Altman on his plans for OpenAI:

Ben Thompson: What’s going to be more valuable in five years? A 1-billion daily active user destination site that doesn’t have to do customer acquisition, or the state-of-the-art model?

Sam Altman: The 1-billion user site I think.

A world where user-facing websites are the most valuable part of AI is a world where AI is less of a platform for doing things and more of a tool for complementing existing habits. AI progress is as high as it ever has been, and the focus of that is on moving from benchmarks towards turning them into agents and tools.

Google’s biggest opportunity is being the one player that has it all — leading models, infrastructure, and a cloud offering to make it the default platform for building value with AI. They have users to retain with Google.com, which they are obviously trying to do, but the rest of their efforts should be on being an AI platform.

With this release, I spent time trying to use Google’s Gemini Advanced offerings like I use ChatGPT and Claude. These use cases were immediately confusing. It didn’t feel like Chat is at all the right way to evaluate this new Gemini 2.5 model. It’s perfectly capable, but without a depth of personality it feels lost relative to the fun GPT-4.5 or the ever quirky Claude.

And why am I paying for Gemini Advanced? Google is the company known for giving things away for free and at scale. If Google isn’t committed to figuring out advertisements for its chat products, then it’ll never meaningfully shift the revenue. Breaking through the moat of ChatGPT with anything other than better models at a free plan is next to impossible at this point. The disruption and new habits have already formed.

Many of my experiences with Gemini 2.5 Pro, other than lacking the distinctive character of GPT 4.5 and Claude 3+, where Gemini feels sort of bland, have to do with the form factor of forcing reasoning into every model. Even for basic queries the extensive reasoning of Gemini 2.5 Pro makes the time to first token on the order of seconds.

Normal consumers don’t benefit from the improvements with reasoning that are accompanied by such a decrease in speed. For agents doing substantial work in the background, a long time to first token or a bland personality don’t matter!

Reasoning heavily on every query is a major quality-of-life drain for chat and reopens the same discussions on reasoning models that balance when they should reason. Claude, Grok, DeepSeek, and OpenAI all have selectors for toggling reasoning on or off. This should be the default until models are better.

Gemini should not be focusing on competing with ChatGPT in the same business. That’s a losing battle and arguably not even the biggest possible final market — subscriptions never have scaled to be the core of the world’s largest companies.

Where Gemini Advanced (at gemini.google.com) feels like a ChatGPT clone, AI Studio (at ai.dev) feels like the onboarding point for developers and customers using their platform. Logan and others have made big progress softening the barrier for people jumping from OpenAI and Anthropic to Google. These leads are far more valuable than Gemini Advanced subscribers.

Google should be a platform for others to build AI and use AI to make their offerings better. Google has had success with its AI overviews and continues to build on that. At the same time, their offerings for using Gemini in products have pretty much failed completely.

There are two clear avenues where Google can use Gemini to deliver business value:

* Gemini for product: Enhancing existing products like Docs, Sheets, YouTube, Android, Assistant, etc. — i.e., taking the above and making it actually work. The Gemini product offerings across the Google suite are pretty much still in their Bard stage. The same focus and execution from training needs to extend into Gemini products and Google Cloud for the next stage of this arc. Extreme value is ready to be captured by the models even if the models don’t continue to improve. The urgency on products at this point very well should be higher than the pressure to train better models.

* Google Cloud: Offering fast and cheap inference of Gemini in the form factor developers need. Google Cloud, being integrated from hardware in TPUs to model can often provide the best models now at the lowest prices. Selling Gemini into a world of flourishing agents is a far better match for Google’s culture of product successes. AI Studio and API developer relations around it can be a seed that grows.

Google has the best models again, as they should have started this whole AI bloom. The strategic error has been righted. The AI leadership has woken up to the crisis, and the researchers/engineers have risen to the occasion. The rest of the company has to do the same.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Managing frontier model training organizations (or teams) Mar 19, 2025

https://www.interconnects.ai/p/how-to-manage-ai-training-organizations

It is a closely guarded secret how the leading AI laboratories structure their training teams. As with other technology companies, the saying “you ship your org chart” still applies to training AI models. Looking at these organizational structures will reveal where research can be scaled up, the upper limits of size, and potentially even who uses the most compute.

How modeling teams do and do not work

A crucial area I’m working on (reach out if you would like to share more off the record) is how to scale these lessons to bigger, more complex teams. The core factor differentiating teams that succeed from those that do not is maintaining these principles while scaling team size.

Big teams inherently lead to politics and protecting territory, while language models need information to flow from the bottom to the top on what capabilities are possible. Regardless of the possibilities, leadership can shift resources to prioritize certain areas, but all of the signals on whether this is working come from those training models. If senior directors mandate results under them before unblocking model releases, the entire system will crumble.

Seeing this potential end state — without naming specific companies — it is obviously desirable to avoid, but anticipating and avoiding it during rapid growth takes substantial intentionality.

Within training, the planning for pretraining and post-training traditionally could be managed differently. Pretraining has fewer, bigger runs so improvements must be slotted in for those few annual runs. Post-training improvements can largely be continuous. These operational differences, on top of the obvious cost differences, also make post-training far more approachable for non-frontier labs (though still extremely hard).

Both teams have bottlenecks where improvements must be integrated. Scaling the pretraining bottlenecks — i.e. those making the final architecture and data decisions — seems impossible, but scaling teams around data acquisition, evaluation creation, and integrations is very easy. A large proportion of product decisions for AI models can be made irrespective of modeling decisions. Scaling these is also easy.

Effectively, organizations that fail to produce breakthrough models can do tons of low-level meaningful research, but adding organizational complexity dramatically increases the risk of “not being able to put it together.”

Another failure mode of top-down development, rather than bottom-up information, is that leaders can mandate the team to try to follow a technical decision that is not supported by experiments. Managing so-called “yolo runs” well is a coveted skill, but one that is held close to the models. Of course, so many techniques work still that mandates don’t have a 100% failure rate, but it sets a bad precedent.

Given the pace of releases and progress, it appears that Anthropic, OpenAI, DeepSeek, Google Gemini, and some others have positive forms of this bottom-up culture with extremely skilled technical leads managing complexity. Google took the longest to get it right with re-orgs, muddled launches (remember Bard), and so on. With the time lag between Meta’s releases, it still seems like they’re trying to find this culture to maximally express their wonderful talent and resources.

With all of this and off-the-record conversations with leadership at frontier AI labs, I have compiled a list of recommendations for managing AI training teams. This is focused on modeling research and does not encompass the majority of headcount in the leading AI companies.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Recommendations

The most effective teams who regularly ship leading models follow many of these principles:

* The core language modeling teams remain small as AI companies become larger.

* For smaller teams, you can still have everyone in one room, take advantage of this. For me personally, I think this is where remote teams can be detrimental. In-person works for this, at least when best practices are evolving so fast.

* Avoid information siloes. This goes for both teams and individuals. People need to quickly be able to build on the successes of those around them and clear communication during consistent rapid progress is tricky.

* For larger teams, you can scale teams only where co-design isn’t needed. Where interactions aren’t needed there can be organizational distance.

* An example would be one team focusing on post-training algorithms & approaches while other teams handle model character, model variants for API, etc (specifications and iterations).

* Another example is that reasoning teams are often separate from other pieces of post-training. This applies only to players that have scaled.

* Language model deployment is very much like early startup software. You don’t know exactly what users want nor what you can deliver. Embrace the uncertainty and learn quickly.

* Do not overly try to separate engineering teams from training. Engineering needs to build tools for the generation +1 model and cannot do this without talking to researchers.

* Evergreen research is separate from the language modeling teams itself, but still sits within “research”. Otherwise, it will be impossible to prioritize truly long-term ideas. Long-term goals are fragile and need nurturing. Language modeling is about the next 1, or maybe 2, models.

* A lot of the sexy work is not that helpful and a lot of the useful work isn't sexy. Data is the prime example as the often most impactful type of work.

* Expect failed training runs and do not overreact to them along the way.

Failure modes

High-priority projects can fail if you…

* Try to ship too many models for each capability improvement. Instead, stick to a set schedule of model training. Have fewer models that are more capable.

* Try to force contributions from individual teammates into the final product. Do not sacrifice performance for personalities in search of “a contribution”.

* Let in teams that try and territorially force their way into contributing to the big company goal.

* Scale the training organization too much. Having too many people “doing stuff” and adding noise to the organization detracts from high-level direction and focus on the execution of specific goals. (This can also relate to 1. and be trying to do too much in one model).

* Letting politics grow, taking many forms, and causing intertwined issues. Do not lose the sense of results being the #1 driving factor of decisions. Bad decisions here compound.

* Over-indexing on a single model evaluation will hamper (or flat out block) real progress in other areas.

Before the rest of the post, expanding on the topics above, you may be interested in previous articles on this topic.

Related writing

For more reading on how language modeling teams work, see some of my other writing here, on team structure, and…

….managing risk.

An example of how mid-sized training projects work

I recently got a list of questions on how training for Tülu 3 operated (which is a post-training analog to OLMo really). I figured I would share these and they serve as a foundation for me gathering useful information from friends on frontier labs on how representative it is.

With reasoning models, most of this translates directly. Infrastructure is becoming more important because generating long sequences is particularly memory intensive (and can expose issues in open-source tools for inference), but when the time comes to make a state-of-the-art fully open reasoning recipe, the lessons learned here will apply directly.

1. How long does a large post-training project take?

Tülu 3 was the focus of our post-training team from mid-July until its release on November 21st, 2024. We were building on our previous recipes, in Tülu 2/2.5, so not very much of this was catching up on internal know-how, but rather integrating new external resources. If a team like this was working continuously all year on the same focus it would’ve taken approximately one month less to achieve these results. Bootup takes substantial time, as does release management.

2. How do you choose the right personnel for a moderately sized training project?

A project like Tülu 3 or any other effort to push the frontier of AI in a popular area normally takes a moderately sized team. The smaller the niche, the smaller the team you need. The team at Ai2 is researcher-heavy relative to engineer-heavy among the 20+ authors. If prioritizing only performance on known techniques, the ratio of engineers can be far higher. Pushing the frontier takes 10x the resources as repeating extensively documented work.

In the case of Tülu 3, where most of the techniques are not known the proportion of researchers is obviously higher. This, though, for companies trying to scope who to hire for modeling teams is not a trivial problem. First, one must scope the level of uncertainty in the domain of interest and then hire around it. Applying Tülu style approaches could definitely be done with a team of 2-4 focused engineers.

3. What model sizes are used for iteration? How do results scale?

A core principle of modeling research is to iterate at the smallest model that provides a reliable signal. This is the entire principle behind scaling laws as a de-risking tool. In post-training, compute costs are substantially lower so the models used actually can be bigger. In this case, given a project designed around the Llama 3.1 base models, ~80% or more of experiments were at the 8B scale (normally 8 or 32 H100s, finishing in

These days, results tend to transfer extremely well when scaling. Bigger models may need less data, especially less general data, and a gentler optimization (lower learning rate usually), but transfer hasn’t been a challenge. Changing base models is harder than scaling with post-training techniques.

4. How many experiments are actually run?

The Tülu project evaluated about 1000 checkpoints in our process. This feels about right for a major post-training process. Some of these are intermediate or competitor models, but most of them, 100s, are experimental training runs. The model scores can be plotted in a time sequence with the metadata we collected (credit Hamish Ivison for the plot). When you squint, it is largely a logarithmic curve with faster gains at the beginning and leveling off at the end. Of course, you can also see the flurry of models trained right in the last few weeks.

5. What is the biggest bottleneck on progress?

All of these projects are bottlenecked by compute available. Making systems more efficient is a compute multiplier, but if the starting point in the number of GPUs is too low, it won’t matter. There’s often potential to accelerate projects by adding more people to explorations, whether it’s training approaches like process reward models (PRMs) or data curation, but scaling management and integration of data across numerous evaluations can be tricky. Best practices for models with 100s of target evaluations (as done in frontier laboratories) rather than the ~10 we used, are far from established.

The second bottleneck would be personnel willing to constantly grind on new data experiments. Focus on data almost always pays off fairly quickly.

6. What I would need to get a serious post-training effort off the ground from a cold start?

Finetuning has such a large gradation that impact can be made with almost any team size. To do truly excellent work takes mostly patience and proportional resources. Getting the model exactly right takes retraining many times even after you hit your initial benchmarking goals.

For companies focusing on local models, a few nodes of H100s (~100 GPUs) could go a very long way. For companies trying to make truly state-of-the-art models above the 7B scale, trying to do so with

These numbers will come down as best practices of distillation from strong models are established, but this knowledge is far from known. If you want to invest in training you need to do enough to move the frontier, or else you will be inevitably falling behind and it would be better to ride on other’s coattails.

7. What is the hardest part of these projects? Where do you actually spend time?

Training projects take a lot of time and a lot of focus to detail. Teams need extreme isolation from other company goals to focus on their one goal of training. The hardest part is often this — having all the members of the training team focus on one single output for sustained periods. Tracking down recent developments, small experiments with training algorithms, curating data (likely most of the time in hours as babysitting GPUs is largely an idle activity), etc. are all bread and butter of solid engineering talent. Success is downstream of good decision-making by tech leads and managers while getting many small shots on goal.

In the case of projects like Tülu 3 the reason we don’t immediately transition to Tülu 4 is that people have other interests. Companies that directly align training with their bottom line don’t need to do this.

Thanks to Nicole Fitzgerald, Finbarr Timbers (Midjourney was not one of the companies I studied), and others unnamed at leading AI laboratories for comments or input that helped with this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Gemma 3, OLMo 2 32B, and the growing potential of open-source AI Mar 13, 2025

Post: https://www.interconnects.ai/p/gemma-3-olmo-2-32b-and-the-growing

Ever since the release of the original ChatGPT, much has been said about making a truly open-source version of it — with data, code, weights, etc., all available. Open-source versions increase transparency, access, long-term progress, security research, and lots more. Lots of people have used this claim to bring hype into their projects, but the substance of these releases have been rather shallow (i.e., often focusing on one evaluation).

This milestone was so long coming that I entirely forgot about it as a target. Through 2024, and especially before DeepSeek, the impression was that scaling AI capabilities was just too expensive for the smaller players willing to do truly open-source development.

Truly open releases take a lot of effort by making more to release and maintain, open up potential legal risks that preclude types of training data, and completely undermine competition. The few organizations doing fully open-source research are non-profits, like Ai2 or Eleuther AI; academics, like LLM360; or companies that benefit from the long-term ecosystem growth, like HuggingFace.

I was poking through the results for our latest model when I realized that we finally did it! We have a fully open-source GPT-4 class model, i.e., it is comparable with OpenAI's original release rather than the current version.

Today, we're releasing OLMo 2 32B, the biggest model we've trained from scratch yet. Here are the post-training evaluations, where it surpasses GPT-3.5, GPT-4o-mini, Qwen 2.5 32B Instruct, the recent Mistral Small 24B, and comes close to the Qwen and Llama 70B Instruct models.

And this recipe is extremely training efficient. Here’s a plot showing the FLOP comparisons to peer base models:

Most of this release isn't entirely new. OLMo 2 is the result of lots of small wins on data, architecture, post-training with Tülu 3 recipe and so on — we just let the GPUs hum for a lot longer. You can learn more about OLMo 2 in my original release announcement or in this podcast with the leads.

The new part of this release is a major milestone where any company can pick up our training stack and cook up exactly the model they need at nearly the GPT 4 level. Beating the latest GPT 3.5 and GPT 4o mini models feels like fair game for the claim. This capability will take time to diffuse, but it is a major moment in the arc of why we do what we do. Even without more progress on OLMo, which we obviously will continue this year, this will keep fundamental AI progress outside of the major AI labs going for multiple years. It’s an optimistic day for open-source.

Here are your links to more information on OLMo 32B:

* Blog with technical details and demo

* Base model: OLMo-2-0325-32B

* Instruct model: OLMo-2-0325-32B-Instruct and intermediate SFT, OLMo-2-0325-32B-SFT, and DPO checkpoints, OLMo-2-0325-32B-DPO

* Pretraining dataset: OLMo-mix-1124

* Mid-training dataset: Dolmino-Mix-1124

* Post-training datasets: Tülu 3 SFT Mix (updated), Preference data for OLMo 2 32B and RLVR Mix

Gemma 3 as the next point on a steep trend line

Yesterday, March 12th, Google released the next batch of their flagship open-weight models, Gemma (report, models, flagship model). They highlight the following capabilities in their documentation:

* Image and text input: Multimodal capabilities let you input images and text to understand and analyze visual data. Start building

* 128K token context: 16x larger input context for analyzing more data and solving more complex problems.

* Wide language support: Work in your language or expand your AI application's language capabilities with support for over 140 languages. Start building

* Developer friendly model sizes: Choose a model size (1B, 4B, 12B, 27B) and precision level that works best for your task and compute resources.

Some technical details of note:

* In open models, 32B dense models are convenient because they can be finetuned on one node of 8 H100s (slowly). Google's sizing at 27B likely is downstream of TPU considerations that don't map directly, like how knowledge distillation works at pretraining.

* The Gemma models continue to be trained extensively with teacher-student knowledge distillation (KD). This KD is different than the colloquial definition of distillation in leading AI models. The common use of distillation is training the models on any output of a much stronger model. This is most commonly done in post-training to learn from generated completions of the stronger model. KD is a subset of the general idea of distillation, where the model being trained learns to match the distribution of the teacher model. Other labs than DeepMind have mentioned this KD technique, but Google has pushed it far further. This was discussed further in last summer’s post on synthetic data.

Otherwise, the paper has some interesting information but nothing super groundbreaking. This is par for the course for most technical reports these days.

Onto the evaluations, and therein the impact, of Gemma 3.

The best way to think about this model is a “general chat model” like GPT-4o and Claude 3.7 rather than a reasoning model like R1. The rise of reasoning models has made comparing models tricky because there are multiple evaluation suites that people care about — broadly characterized as a reasoning suite and an instruct suite. They overlap, but strong capabilities on both is rare.

Gemma 3 27B’s performance on some tasks like MATH and Bird-SQL (coding) match the Gemini 1.5 Pro model from just a few months ago! The progress on small, open weight models is simply insane. Small models can perform excellently on narrow tasks like math and some coding, but they lack the depth and world knowledge, as seen in GPQA or SimpleQA above.

Yes, DeepSeek distills are better at smaller sizes on MATH, but not enough people evaluate those distills across all capabilities like ChatBotArena. Having it all in one model is very convenient and is still how most workflows are handled.

Most people are also fairly skeptical of evaluation scores like MATH stated by Gemma, DeepSeek distills, and the like, claiming they don’t translate to real world usefulness. This is why the ChatBotArena results were the most striking of the Gemma 3 release. Gemma 3 falls in the top 15 of every category. It beats DeepSeek V3 with its 600B+ total parameters. It is outperformed in niche categories like math or coding by its peer models in the overall ranking, indicating a small level of superficial alignment, but doing this to get into the top 10 of ChatBotArena during this period of AI with immense competition is a huge accomplishment.

It is an ever evolving open question on how reliable chat evaluations like ChatBotArena are. These days, with how in vogue RL training methods to maximize MATH evaluations are, the value is higher again. Is it representative of some subset of real-world use, which would indicate that specific capabilities that small models are able to excel at — math, general chat, etc. — can translate directly to real value.

This implies that tasks like SimpleQA and GPQA indicate performance on more niche tasks that not many people encounter, but we have a lot to learn as a field here.

With my use of leading models, I find this hard to believe — switching to something like GPT-4.5 feels like an immediate improvement in chat tasks. My conclusion is that the answer is somewhere in the middle — small open models can do super well on everyday tasks, but we don’t know exactly how to measure it. ChatBotArena and SimpleQA don’t tell us exactly what to expect from the models.

The fact that there isn’t a cliff in performance with models this small is going to drive substantial business value — it’ll be easier to find exactly the minimum model size needed for your distribution of tasks.

In the specific case of Gemma and most of the leading open weight models right now (with DeepSeek R1 as a wonderful exception), the licenses of models that often restrict downstream use are a brake on open weight model adoption. Without getting too much into the details, they expose companies looking to use the models to some amount of legal risk and add terms and conditions to finetuned models.

Largely, until feedback loops and use cases for open models are established, teams releasing these models don’t have strong cases other than future potential growth to counter the safe option that comes from legal teams’ recommendations. I hope that the evolution in the adoption of open weight models for valuable applications applies pressure to make licensing less of a headache.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

The state of the open-closed model gap

3 of the top 15 models on ChatBotArena are open weights. In a world where frontier labs have many minor model versions crowding the leaderboard, this is an incredible platform for accelerating progress in open model adoption. Even just the gap from Gemma 3 to Google's flagship Gemini models is very small! The entire AI market is the most dynamic and competitive it has been in some time — maybe ever.

When it comes to “truly open,” i.e. actually open-source models, the gap between closed models has remained somewhat consistent — I estimate it as about 18 months behind closed labs. With open models generally on the upswing, open-source access to code, data, etc. is likely to come with it. The likes of Llama, DeepSeek, etc. are some of the most important pieces in the complete open ecosystem, and approaches like Ai2’s would struggle without them.

Related to this, in the coverage of DeepSeek R1, I noted:

This points to the first time since Stable Diffusion’s release that the most relevant and discussed AI model is released with a very friendly license. Looking back at the journey “open-source” AI has been on over the last 2.5 years, this is a surprising moment in time marked in the history books.

A month later, this is still the case.

To understand the progress of the open-closed model gap, with an oversimplification, there are two simple ways that models tend to get better:

* Scaling up model size, and

* Various data, architecture, and algorithm improvements that make an equivalent performing model smaller.

Progress on truly open-source models has always relied on the second of these due to the relatively smaller budgets of players in the open-source space. By being forced to focus on efficiency rather than scaling, it has often seemed like truly open source models are falling behind by more than they really likely were. With similar potential to scale, the gap between truly open-source and closed would be quite small on standard academic benchmarks which "raw model potential" is usually evaluated on.

When models like Llama 3.1 405B, DeepSeek V3 and R1, etc. come out, the open weight models scaled up have been at the frontier. With more resources for truly open-source training I don't see many reasons why that wouldn't be the case for models like OLMo too. In some ways, the biggest moat to training is just resource access. For example, DeepSeek is reported to have at least 10x more compute than the likes of Ai2.

This interpretation of the gap is more tractable than other critiques I've imposed on the open-source AI community in the past, such as the need for better preference data / RLHF pipelines.

Comparing Gemma 3 27B to OLMo 32B, the pretraining evaluations for both are super similar, but Gemma 3 scores are way better after post-training. The ceiling on post-training expectations has been shifting extremely fast among open models. Largely, open recipes are motivated by matching existence proofs like Gemma 3 or DeepSeek R1. Throughout 2025, expect open post-training resources to get even better again, motivated by the need for better performance and more detailed workflows on reasoning training.

All of these subtleties are paving the ground for a better platform for building value with open models. The gap from closed models to open counterparts being its smallest means the opportunity is the best it’s ever been.

The race for inference providers to host models like DeepSeek R1 is a great platform for progress and example. Even if the why of open models starts as primarily being good for marketing and recruiting, this can trickle down into bigger benefits.

Open weight models are likely to be offered for inference at far lower margins in the near term, due to the massive competition in the inference space, driving the prices even lower for those looking for ways to adopt them. Here’s an example of some of the leading platforms offering R1 inference:

This sort of everywhere-inference for an open model is something that has only happened in the past for Llama models.

Yes, I know APIs are orthogonal to the benefits I just listed, but they’re the way people become familiar and interested in the models. APIs help people check if the model can serve their purpose, and then if it does, they can spin up a more complex workflow.

“Why open” is more important than “what is open”

The biggest stories in open-source AI in 2024 often felt like bickering about definitions. I wrote a lot of articles about definitions. Llama 3 was pretty much all we had to get excited about. At the end of the day, even with how much I think it would be better with more information on the whole stack of AI development, open-source is largely going to be defined by community norms. For now, Llama weights have been that norm rather than other definitions.

By comparison, 2025 feels poised to be about actually building open AI. We have had surprising, impactful, and exciting releases and it’s only March. We know Meta is looking to get back into the conversation with Llama 4 in April at LlamaCon. We have our open-source ChatGPT. We’ll have more we can’t predict.

Crucially, on top of the gap being smaller, all of these open models are crossing meaningful boundaries in performance. When model capabilities made the leap to GPT 4 class models, tons more applications were possible. Now, we have GPT 4 class small models that can be deployed in privacy-conscious ways. There’s been a huge demand for this, and the ecosystem is slowly building the tools to do so. Yes, closed AI will continue to march forward, but open solutions need to prove their own independent feasibility.

In the long march of progress, open-source AI feels far closer to an inflection point of proving out the hypothetical benefits we have focused on for a few years. Transparency, privacy, better performance, etc. could actually all be happening this year.

For the backlog of content I've written on open-source AI, click here. Thanks to Vikram Sreekanti, Ross Taylor, and Xeophon for feedback that helped shape this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Eugene Vinitsky on self-play for self-driving and what else people do with RL Mar 12, 2025

Eugene Vinitsky is a professor a New York University department of Civil and Urban Engineering. He’s one of my original reinforcement learning friends from when we were both doing our Ph.D.’s in RL at UC Berkeley circa 2020. Eugene has extensive experience in self-driving, open endedness, multi-agent reinforcement learning, and self-play with RL. In this conversation we focus on a few key topics:

* His latest results on self-play for self-driving and what they say about the future of RL,

* Why self-play is confusing and how it relates to the recent takeoff of RL for language models, and

* The future of RL in LMs and elsewhere.

This is a conversation where we take the time to distill very cutting edge research directions down into the core essences. I felt like we were learning in real time what recent developments mean for RL, how RL has different scaling laws for deep learning, and what is truly salient about self-play.

The main breakthrough we discuss is scaling up self-play techniques for large-scale, simulated reinforcement learning. Previously, scaling RL in simulation has become economical in single-agent domains. Now, the door is open to complex, multi-agent scenarios where more diversity is needed to find solutions (in this case, that’s what self play does).

Listen on Apple Podcasts, Spotify, YouTube, and where ever you get your podcasts. For other Interconnects interviews, go here.

Show outline & links

We cover many papers in this podcast. Also, as an experiment, here’s a Deep Research report on “all the papers that appeared in this podcast transcript.”

In this episode, we cover:

* Self-play for self-driving, mostly around the recent paper Robust Autonomy Emerges from Self-Play(Cusumano-Towner et al. 2025). The simulator they built powering this is Gigaflow. More discussion on HackerNews.(Here’s another self-play for self-driving paper and another from Eugene from earlier this year).A few highlights:

“All simulated agents use the same neural net with the same weights, albeit with randomized rewards and conditioning vector to allow them to behave as different types of vehicles with different types of aggressiveness. This is like driving in a world where everyone is different copies of you, but some of your copies are in rush while others are patient. This allows backprop to optimize for a sort of global utility across the entire population.”

“The resulting policy simulates agents that are human-like, even though the system has never seen humans drive.”

* Large Language Models are In-context Preference Learners — how language models can come up with reward functions that will be applied to RL training directly. Related work from Stanford.

* Related literature from Interconnects! The first includes literature we mention on the learning locomotion for quadrupeds with deep RL (special shoutout as usual to Marco Hutter’s group).

* Recent and relevant papers Value-based RL Scales Predictably, Magnetic control of tokamak plasmas through deep reinforcement learning.

* Other things we mention:

* Cruise, Tesla, and Waymo’s autonomy stacks (speculation) and how the self-driving industry has changed since we were / were considering working in it.

* Evo 2 foundation model for biology.

* Eugene is working with a new startup on some LLM and RL stuff. If you’re interested in this episode, ping eugene@aitco.dev. Not a paid promotion.

Chapters

* 00:00:00 Introduction & RL Fundamentals

* 00:11:27 Self‑Play for Self‑Driving Cars

* 00:31:57 RL Scaling in Robotics and Other Domains

* 00:44:23 Language Models and In-Context Preference Learning

* 00:55:31 Future of RL and Grad School Advice

Transcript

I attempted to generate with ElevenLab’s new Scribe tool, but found the formatting annoying and reverted back to Alessio’s smol-podcaster. If you’re interested in working part-time as an editorial aide to Interconnects, please get in touch.

Nathan Lambert [00:01:27]: Hey, Eugene. Welcome to the show.

Eugene Vinitsky [00:01:29]: Hey, Nathan. Thanks for having me. Excited to be here.

Nathan Lambert [00:01:32]: Yeah, so I'll have said this in the intro as well, but we definitely go well back in all the way to Berkeley days and RL days, I think.

I will embarrass you a little bit now on the live read, which is like, you were one of the people when I was switching into RL, and they're like, oh, it seems like you only figured out how to get into AI from a potentially different background, and that's what I was trying to do in 2017 and 2018.

So that was kind of fun, and now we're just friends, which is good.

Eugene Vinitsky [00:02:01]: Yeah, we were both figuring out. If I had any lead over you there, I was also frantically trying to figure it out, because I was coming from a weird background.

Nathan Lambert [00:02:11]: There are definitely a lot of people that do that now and over-attribute small time deltas to big strategic plans, which was probably what it was.

And we're just going to do some of our normal conversations on RL and self-play.

I think the backstory of this is you told me that your recent paper from some of your time at Apple, I think I don't want to time for it too specifically, was something, paraphrasing, like the most exciting RL thing you've ever had a part of.

And major RL projects are not that frequent.

I think if you segment out all the language model excitement in the past 10 years, there's really a few major milestones, and it's good to kind of talk about them.

So we can kind of start, I think, basic things, like how do you define reinforcement learning, and it will kind of build up to this self-driving project.

Eugene Vinitsky [00:03:05]: Yeah, so I think RL is kind of a big thing, but at a really basic level, you have this process of taking actions in the world.

You're seeing the state of the world.

If you're taking actions in the world, you sometimes receive a reward that tells you the value of that action, and you're trying to kind of optimize your cumulative behavior over time.

So that, you know, over long trajectories, you're optimizing those costs.

That's both, you know, the hard thing and the exciting thing is that if you do it well, you can really optimize really long horizon behaviors.

Nathan Lambert [00:03:41]: Yeah, I agree.

And it's funny because now it's finally, the language models are finally doing this long chain of thought, and I don't really think that's the same.

I think the interactive notion will come up a lot here where these long context behaviors are many, many actions interacting with the world relative to one really, really long action, which is kind of odd.

Eugene Vinitsky [00:04:04]: Yeah, I guess, yeah, it mixes things, right?

Because it has very long state, right?

It's got very long contexts, and it's generating its own context.

But in the end, there's really one action at the end that, like, kind of determines how everything went, you know?

Nathan Lambert [00:04:23]: Yeah, yeah, yeah, we'll get into this.

And then the next thing that we kind of need to set up is what do you define self-play as?

I think this word has been particularly broken in recent times with language models, and I'm hoping we can get a fairly specific criteria for what is self-play and what are related topics.

Eugene Vinitsky [00:04:42]: Yeah, I think even within the field, there's quite a bit of debate as to what constitutes self-play.

So talking to, you know, experts, people will disagree about what methods are and are in self-play.

But what I will say is I generally define self-play as anything where an agent plays a copy of itself.

So up to a bunch of different agents interacting with each other, as long as they're mostly, in some ways, copies of each other, we're doing self-play.

Nathan Lambert [00:05:12]: Yeah, and then do you think anything, I mean, your background's in multi-agent as well.

Do you think there is something fundamental to kind of a game that has a really specific hill to climb where it's kind of this competitive nature versus something like language?

Eugene Vinitsky [00:05:29]: Yeah, this is kind of the dream of, I think, some multi-agent researchers is this type of like ratchet effect where you have a bunch of agents interacting with each other and kind of increasing complexity on the part of any agent generates increasing, like creates new challenges that need to be solved and then force you to learn new skills.

And then you kind of get this endless, endless ratchet.

Maybe that's what you meant.

I may have misinterpreted.

Nathan Lambert [00:05:55]: We're going to revisit it.

I think also it's like, how does the multi-agent nature of a lot of these things change what people think about with RL?

This is kind of the last building block before we go into the self-driving stuff.

Eugene Vinitsky [00:06:07]: Yeah, yeah, yeah.

So the way that the multi-agent thing changes things is it makes everything much harder and more interesting.

So you go away from this world where you have like a clear score function, right?

So you have some reward for first in single agent setting, you have some reward.

If that reward is high, you're doing well, right?

And when you move into the multi-agent setting, it becomes reward with respect to whom, right?

It all of a sudden matters whom I'm playing, right?

So if we go to a game of like, like one setting is like two players, zero sum games, right?

So a game of two player poker, I give you, I train a poker bot, right?

How do I know it's any good?

I have to play another poker bot to decide that it's any good, right?

And so all of a sudden, this challenge of like, what is a good policy becomes very fundamental.

And you kind of lose even a notion of there being like one clear good policy.

And like the whole, a lot of, a lot of the field of multi-agents is coming up with different definitions of what would cost you goodness.

Nathan Lambert [00:07:06]: Um, so, and then back to the self-play thing with that, like, is all of the self-play that we discussed, like if you were playing yourself, does the same consideration apply?

Like, is that, is self-play necessarily a multi-agent framing?

Eugene Vinitsky [00:07:19]: Um, I think it, I think it is because oftentimes what we're trying to do with self-play is like to converge to some notion of policy goodness.

And self-play is just a mechanism that gets us to some definition of, of high quality policies.

Um, and, and, and what turns out to be the case is there, there are actually many like non-self-play type methods for doing this.

Self-play just turns out to be an effective way to accomplish constructing effective policies.

Nathan Lambert [00:07:45]: Yeah, I, I, there's many, I'll, I'll link later a lot of these papers on self-play for preference learning and look into them a bit more.

Eugene Vinitsky [00:07:56]: Yeah.

Nathan Lambert [00:07:57]: Essentially that's been the lens.

There's two lenses by which this has come back and both of them, I don't think fit into, I, I think this multi-agent lens of self-play is much richer and I don't think any of them have fulfilled this.

I think there's useful methods for preference tuning.

I think that's like maybe spin it's like self-play something preference learning is one.

And there's papers related to this where they're probably looking at the probability of the own model in generating a response or something like looking at the internals of the model.

And it's not really set up in this game nature of some sort.

And then also with Q stars, when the self-play stuff came back where I really think I've, I've talked to some people that did original reporting on this and it was that the model looked like it was talking to itself.

And I think that very understandably for less, a little bit less technical audiences that haven't engaged with self-play, that coverage of talking to itself got transformed into a self-play commentary and hype cycle, which took people down the wrong path for like an entire year, which is so brutal, but also very understandable and funny.

Eugene Vinitsky [00:09:11]: Yeah, I think there's something interesting and different happening in these like multi-agent like LLM self-play setups.

I'm not super familiar, but I think what's happening is something quite different than what we mean in other multi-agent settings when we're talking about self-play.

Like I feel like it's, it's more about like refining like the distribution of actions that it takes in some, some kind of odd way.

Nathan Lambert [00:09:39]: I think this sounds ridiculous at first pass, but it's almost that the language models are simulating a softer version of self-play within themselves to kind of check their own work and engage in their own discourse, which the level of intelligence they have is not going to like unlock the true like incremental progress that we think of with self-play.

Which probably, I think for context of things for self-play, just to put them on the record of this are, have been very impactful or things like AlphaGo and New Zero.

I think that's, those are the prime examples of generating some superhuman policy in a closer way.

I think it's, it's important to kind of gate the conversation on like, these are the aspirational goals, um, in terms of outcomes and then figuring out how to apply them to new domains and new tools is kind of unknown.

Eugene Vinitsky [00:10:31]: So, so maybe I should have said this earlier, but like self-play is the thing that gives a, is like maybe the one way that we know to build superhuman agents right now.

So, right.

So, um, superhuman go, um, human level Dota, human level, uh, Starcraft.

Um, technically poker is in a, in a slightly weirder, um, weirder space where I don't, I don't exactly know that I would call the method on that underlie that self-play.

Um, sorry.

Um, and, uh, but yeah, it's the one way we really know how to build superhuman agents.

Nathan Lambert [00:11:06]: And I think this is a kind of a natural transition because the, to make people excited in the work that you did, it seems like you've discovered superhuman driving through self-play without inductive biases.

And I'm like, um, how do you view the potential impact of this?

And then we can kind of go into the method.

Eugene Vinitsky [00:11:27]: Right.

So the, the challenge with self-play is, and this requires a bit of technical detail to get there, but you know, in, in like two players, here are some games, games where you and an adversary are playing with each other and somebody wins and somebody loses, there's a very well defined notion of what being good is.

Um, you know, that they're, they're well, you know, their criteria that we would like our policies to converge to.

And, and the challenge has always been about moving beyond that to a domain where it's much harder to define what, what doing well means, right?

There isn't like an abstract notion of what good driving is there out in the world where I could just write down the reward function and simulate it and optimize with respect to that.

And all of a sudden I'd have a good driving policy.

So the, the gap has always been between these methods that work really, really well in, in well-defined games like, like Starcraft or go, uh, and chess, um, and settings where it's much harder to define that.

And so we haven't been able to, to move to self-play in settings where, for example, humans might be in the loop, right.

And, and particularly driving is an instance of that somewhere where at the end, we're going to take our policy and it's going to drive with humans and we have no way to simulate humans and play against them.

Um, and so figuring out how to close that gap has been kind of an open, open challenge.

And I think maybe this is the first instance of, uh, finding a way to do that.

Nathan Lambert [00:12:51]: Okay.

So that's a much better motivation than I gave.

And I understand the excitement now, because if this works in one domain, um, and you'll tell us about how grand of an effort it actually was.

I know big tech companies can put a lot of force and long-term investment behind things to get them off the ground.

Then a lot of the other things that people are saying about language models or other complicated domains are at least there's an existence proof of something similar happening.

So why don't you just continue to explain, uh, this problem set up of learning driving without having a human teacher.

It will probably take detours to analogize different self-driving stacks just because we know about them and it's good to compare.

Eugene Vinitsky [00:13:36]: So one way of framing this is, and I'm going to put cautions in the end, I'm going to give you the, the, the extreme version of it.

And I'm going to walk it back a little bit is like human level driving without any human level data.

And the caution needs to be that this is in simulation and our ability to measure human level driving in simulation is limited in a lot of ways.

So I can tell you about the ways that we measured it and then I'll, I'll have to tell you what the limitations of those things are.

Um, so this was a large scale effort, um, uh, and Lovlin Colton's team and at Apple, um, it was about like eight researchers, research engineers working together for about a year and a half, uh, build, building the stack out.

Um, it was, I think a lot of us came at it from different places.

I know some folks were very inspired by this idea of like alpha star for driving, you know, building a diverse, rich world and then driving it in a way that such you would, you would transfer to policies that you hadn't seen before.

So like human actors.

Um, so, um, yeah, the, the, if, if, if it's helpful that the idea here is that, or the goal here was to build a human level simulated driver.

Um, and here, what that means in our case is not a fully end-to-end method, right?

So we're not simulating perception.

So driving stacks consist of like generally perception, prediction, planning controls.

So you have a perception stack that, you know, takes your LIDAR, your camera, your radar, and converts it into, you know, where are the cars, where are the road is, what's impassable.

Um, and then a prediction stack will take the like positions of all the cars, the cyclists, pedestrians, and it'll predict, predict where they're going to go next.

And then a planning stack will say, okay, given those predictions, you know, what's a good trajectory for me to take.

And then the control stack will say how to actually follow that trajectory safely and robust.

Right.

And we're talking about subsuming the prediction, planning, control portion of the stack, not the perception part of the stack.

Nathan Lambert [00:15:28]: Okay.

So I was, I was thinking that you might not even do control.

I was thinking you might just say, uh, control is a softer album and not do that too.

Eugene Vinitsky [00:15:35]: So in the same way, we're kind of, we're only kind of doing control.

Uh, we're, we're, we're doing this for, I think Waymo uses the

Nathan Lambert [00:15:42]: the term behavior for this.

I think it's been their behavior team for a while.

Is that right?

Eugene Vinitsky [00:15:46]: Okay.

Nathan Lambert [00:15:47]: Uh, you know, I very, it's hard to know where the abstraction ends, but they definitely have a behavior team that's done a lot of things through the years.

Well, he's not the job apps that I've been applying to an interview or have interviewed for in the past.

Yeah, me too.

Eugene Vinitsky [00:16:01]: Um, I think we do know how to control cars.

We know how to make cars follow a pre-specified trajectory, right?

This is, this is somewhat of an easier problem than like humanoid robotics or something.

You know, big thing got wheels.

We know how to make it turn.

Nathan Lambert [00:16:14]: Um, so how do we get these things from, I mean, they start as like, it doesn't start at just all the simulated cars crashing all the time.

What is the start here?

Eugene Vinitsky [00:16:24]: I'll send you the video once it's out, but like, you know, the, the first 10 hours of simulation is just like cars scattered all across the road, smashing into each other, driving off the road, that type of thing.

It's actually interestingly useful because what we do is when two cars crash, we have them come to an immediate stop.

And this actually creates a lot of blockades in the road.

So at some point during the training, the cars start to learn to drive around stopped cars, even though those cars are stopped because they've crashed, um, as well as to drive around like obstacles and things like that.

Um, so that, yeah, that's what it looks like.

Um, yeah.

Nathan Lambert [00:16:58]: Um, as well as the reward function for these.

So you have a bunch of cars that can see their peers and there's some reward function I'm guessing.

Eugene Vinitsky [00:17:06]: So the, the major component of the reward function is getting to your goal without colliding.

So we, we have these maps that we've taken from the cartless simulator.

They're fairly large maps.

Some of them are like multiple kilometers in spatial extent.

We have eight of them and we place goals randomly over the map.

Um, and you get a sequence of goals.

So, you know, that like, okay, I want to get to this point.

And then after that, I'm going to want to get to this next point.

After that, you're going to get a big reward for getting to that goal.

You're going to get some amount of penalty for colliding.

And then there's also an implicit penalty because if you collide, you're not ever going to get to your goal.

And then there, there is some amount of hand design here in that there are small rewards for like staying in your lane and being aligned with your lane and like, you know, not driving in the opposite direction in the wrong lane.

Nathan Lambert [00:17:51]: This was one of the questions is if you had to do this sort of thing.

Eugene Vinitsky [00:17:54]: You have to do that.

But one interesting thing, and maybe we could talk about that at some point is we randomize the weights of those rewards.

So there are agents that like really want to drive in the lane going in the right direction.

And there are agents that don't care about that at all.

And they will take the wrong lane on the highway, uh, you know, going at full speed in the opposite direction.

And that's kind of useful because you're ready for that scenario.

You've seen that scenario in the world when you're driving around.

Right.

Um, but yeah, we have to, we have to do some of that stuff because at some point there are laws and you can't avoid encoding the laws into your system.

You know, stop signs are a human concept.

Um, they're, they're not, you know, it's not going to emerge that you see a red thing and you're like, oh yeah, that means I should stop.

And then I should like give the right of way rules to the other cars.

Um, but all of our rewards are kind of soft in the sense, like, you know, if you're at a stop sign and folks have been preventing you from going for a very long period of time, right.

You're going to start to nudge in and like break the rules about right away.

Nathan Lambert [00:18:55]: One of my questions for later on this is like, do you think our vehicles and driving dynamics and infrastructure kind of constrain the way of driving?

Like we've co-designed human driving in our infrastructure so that human driving is actually no longer that special because of the track is so long, so defined.

Eugene Vinitsky [00:19:11]: I think this is, this is part of why this is all going to work or like why it works is because like human, human driving is, and human behavior in many domains is like fairly constrained by the institutions and the laws and the norms that we design.

Uh, it's not super free from, uh, so like driving amongst humans is much more of a constrained problem than you would, than you would, you would think it's also unconstrained in some interesting ways, but, but it's, it's quite unconstrained, quite constrained.

Nathan Lambert [00:19:42]: And how hard to act was this to actually learn?

So how sensitive of a process is it now?

I think in the paper, you're talking about gigaflow, which is like a high speed

simulation engine.

So obviously, you know, on data, the final paper says that it learns in 1.6 billion kilometers of driving.

I was wondering if you had an intuition for that.

So like how many miles are driven by all the cars in San Francisco and day or something like this?

Eugene Vinitsky [00:20:10]: That's a, that's a great question.

Nathan Lambert [00:20:12]: Um, it could be a good chat GPT query, to be honest.

Eugene Vinitsky [00:20:16]: This might be a chat GPT question.

Um, let me, let me give some, some numbers that I do know.

Uh, and this is kind of maybe helpful.

So I think cars crash every 20,000 to a hundred thousand miles and a fatal collision happens every a hundred million miles, something like that.

Um, but how many miles are driven in a day in a city?

I'm not sure.

1.6 billion kilometers, the distance between here and Saturn.

Um, it sounds like kind of far when you put it that way, but there are a lot of cars.

Yeah, there are a lot of cars, right?

There are a lot of drivers.

Um, there are surprisingly few trips in a city, fewer than you would expect, but, um, I'm struggling to put a number on it.

Nathan Lambert [00:21:01]: Um, I'll tell you what chat GPT gets when it's done.

I was thinking it's Oh three mini high.

This is not a reliable number.

Take this time.

So your intuition that it's lower goes a lot.

I mean, you've thought about a lot of these car systems for a very long time and I will link to some of your other work on this.

So you definitely have better intuitions than I would.

Eugene Vinitsky [00:21:20]: Well, the intuition comes with the fact that like a lane of the highway can take 2000 vehicles per hour, which is like just not that many vehicles.

Um, and you know, most, most of, most of traffic is between like, you know, 8am and or like 7am and like 10am and then on the way back home.

And so, you know, you can like kind of estimate based on how many lanes there are on the main highway, how many trips there are.

Nathan Lambert [00:21:43]: So San Francisco, the chat Oh three mini high estimated four to 5 million miles in a day in San Francisco.

It's a bully.

It's a plausible number, but it's well below what you are doing.

Like this is, I think maybe globally this billion kilometers could be hit.

So this is okay.

Eugene Vinitsky [00:22:03]: Here's one way to think of it.

We simulate 10,000 years of human drive.

Nathan Lambert [00:22:08]: Okay.

So yeah, 10,000 per one.

I guess it depends on how many cars you have in parallel.

Eugene Vinitsky [00:22:14]: Per one training run one trip to get the policy that we get.

We simulate about 10,000 years of human drive.

Nathan Lambert [00:22:20]: Yeah.

Eugene Vinitsky [00:22:21]: Yeah.

Nathan Lambert [00:22:22]: So to have 10,000 cars, it's all of them driving for a year.

Eugene Vinitsky [00:22:26]: Yeah, exactly.

And we have about like a million cars driving at any given time in the simulator.

Nathan Lambert [00:22:34]: Do you think that substantially changes the learning dynamics?

Like are they all, how many cars are any of them interacting with at any one time?

Eugene Vinitsky [00:22:40]: Yeah.

Any given simulator in any given world.

So this is this like kind of like Isaac's gym style vectorized simulator.

So it all runs in the GPU.

So it's a bunch of worlds happening in parallel, but any given world, there are about 150 cars in it, which means that sometimes you're driving in sparse traffic and sometimes you're going to interact with like 10 or 20 cars at any given time.

Um, and I, I think one thing is that one, one cool thing is that at that scale, I think RL becomes very, very stable.

Um, like for us, like every training run succeeds, the reward curves go straight up.

You know, there's no like, um, what are you scaling?

Nathan Lambert [00:23:19]: Are you just like scaling batch size effectively?

Uh, what is, yeah.

What is the actual thing you're, they're scaling?

Eugene Vinitsky [00:23:26]: We're scaling the amount of experience generated.

So it's like a trillion samples of, of total experience, um, that, that the agents train on.

Um, and then, yeah, we use gigantic batch sizes, like, you know, um, but like, what is the thing

Nathan Lambert [00:23:43]: that you need to dial up in order to make learning actually happen?

Eugene Vinitsky [00:23:47]: Uh, total amount of experience generated, right?

So you need to be generating, you know, million samples per second to train on type of thing.

Nathan Lambert [00:23:57]: Okay.

And then what is the actual, I guess I don't know a ton about multi-gen RL, but what is the actual RL like algorithm and is it a giant replay buffer that is just building and building and building?

Eugene Vinitsky [00:24:08]: It's PPO.

Uh, you know, one thing we've been seeing throughout our work pretty continually is that for, for both theoretical and empirical reasons, PPO is actually a really good multi-agent RL algorithm.

Nathan Lambert [00:24:20]: You had the paper, are you, you are on the paper years ago.

That's like on the something, something PPO multi-agent simple.

Eugene Vinitsky [00:24:29]: So we know that PPO works empirically pretty well.

That's basically the title of the paper.

That's a PPO simple, good multi-agent cooperative.

Good.

Uh, it's good in cooperative problems.

It's, it turns out to be pretty good in two players, zero, some games.

And, and here in, um, this driving thing, it's what's called the general sum game.

And, and there, you know, it seems to work in the setting too.

So evidence is accumulating.

Nathan Lambert [00:24:51]: Something that people probably don't know about multi-agent RL and maybe I don't know either, but in this paper, all of the cars were using the same actual weights of the model.

Is that standard in multi-agent RL or is it kind of a variable?

Eugene Vinitsky [00:25:04]: So I'll add one little, uh, subtlety here.

So yes, we're using every policy is the copy of the same agent, right?

They're all looking at their local observations.

So it's decentralized, but it's all one copy, but every agent gets its own like conditioning vector.

That's like, what are my like reward weights?

How big of a, you know, what's my width and my length?

Am I a cyclist?

Am I a pedestrian?

Am I a driver?

And they flexibly adjust their behavior based on that condition.

Nathan Lambert [00:25:29]: Do you think that's actually like, if you were to squint at the system, is that actually changing the policy or is it changing the environment in kind of an indirect way?

Eugene Vinitsky [00:25:38]: It's, it's changing the policy.

Like you'll see that like a car is like, oh, I'm a, I'm a, like a pedestrian.

I'm a, I'm a big truck.

I'm going to do like a K point turn to turn around.

Uh, I'm a pedestrian.

I'm, you know, going to like smoothly wiggle through these small boxes of areas that I couldn't get through.

Otherwise it, it, it really, uh, appreciably changes the policy, which is cool because it's this like tiny 3 million parameter neural network or like 6 million parameter.

Um, and, and so like, there are all these like little sub policies inside of it that you can activate by, by conditioning.

Nathan Lambert [00:26:11]: Can you do it, um, post hoc to change the behavior in an interpretable way?

Eugene Vinitsky [00:26:16]: Um, I don't know about interpretable.

I guess it, it sometimes depends what we mean when we say interpretable, but yeah.

So you can be like, look, okay, you, you, you don't care about staying in your lane and you'll see it start going into the other lane and driving.

You know, you change the size of the policy or like the, the car and it will change the trajectories that it takes in response.

Um, it's, it's very responsive to this condition.

Um, we have some cool graphs in the paper pointing, pointing out all the different things you can make it do by changing these, these values.

Nathan Lambert [00:26:46]: Um, I'm trying to think of how this reflects on the nature of driving and what the downstream use of this tool is.

So you showed that this is doable and what does this, like, what does this mean for self-driving specifically?

Like, what would you do if you had the same big team and you know that this exists and you're interested in self-driving as a field?

I mean, there are obviously a lot of people that a lot of companies that have big teams and lots of money to try to think about self-driving.

Eugene Vinitsky [00:27:14]: So as I said earlier, like there's this like, um, perception, prediction, planning, control stack.

And I think this is a really is providing a lot of evidence that you could maybe subsume the prediction and the planning stack, um, and, and put it into this type of like end-to-end policy that you could then like train in sim and then maybe not zero shot deploy onto the roadway.

Just like take a straight from sim, put it onto the roadway though.

I think like maybe possible, uh, but like really give you this like base policy that you could then start to put on the roadway and start to build this flywheel, um, that you can then use to collect, you know, more and more experience, validate the safety.

You know, like if you're, you know, if you're a, um, uh, automotive manufacturer that doesn't have like a full spun up self-driving team, but you have a pretty good perception stack, like this is something that you can use to just like get something out in the world pretty fast.

Cause like three, I think like two, two, three days of training later, you have something that I think, and we'd like to start testing it, uh, can be like straight up put onto the roadway with humans driven around and things will be like pretty okay.

Um, you know, don't take the safety driver out, but like, yeah, and you have some cred

Nathan Lambert [00:28:24]: saying this given that you've done RL experiments with real cars, this is not something that's, um, ripping off the bandaid for the first time.

You've done different types of self-driving experiments with RL policies in the real world.

I don't, it might not be at the same level of the stack, but I can add links to that.

Eugene Vinitsky [00:28:42]: That was a lot more constrained, right?

We were putting these cars on the highway to like smooth traffic.

So they would drive in a way such that like stop and go waves and traffic would like get smoothed out and disappear.

Um, but there it was just like, you know, stay in your lane, follow the car behind you here.

We're talking about like, you know, complicated interactions at intersections and that type of thing.

So a lot, a lot more like safe, everything there is safety critical, but like significantly less constrained than anything we've done in the past.

Nathan Lambert [00:29:08]: And then to kind of keep leading this on, uh, I will say a bunch of things because you're more of an industry insider.

So it makes it less revealing if I say things, cause I don't really know anything.

Um, back when I was interviewing for a job and around 2021, at least a lot of RL people were interviewing with self-driving companies who were doing extensive research in RL for different parts of this behavior stack.

Um, even at that time, four years ago, prediction seemed largely or like sensing and prediction was perception was largely solved.

At least CV stacks are really mature and figuring out the actual driving component and decision making was really hard.

There was, I mean, I did a Tesla self home self like take home and for their self-driving team and they were hiring other RL people that take home was ridiculous.

Eugene Vinitsky [00:29:54]: I was like, yeah, I remember that.

Nathan Lambert [00:29:56]: Freaking intersection of polygons.

It's four years ago.

They've got to be using a different question, but it was so hard.

Um, I did end up solving the test cases.

Um, it was, I solved the test cases.

God, that was rough.

But essentially the rumor was they're doing something like mu zero for self-driving and or a mix of imitation learning, which is there's a duality of learning a world model from real data relative to building a simulator.

But the motivation of the work is very similar, which is in mu zero, you want to unroll trajectories and be able to learn from that and distill an RL policy versus if you have a big simulator, you then can learn everything from scratch and figure out how to transfer that to real.

And I think there's different assumptions on what would work.

And the history of RL, it is now that the simulator to real is generally a more promising path.

If you can build the right simulator then and going from real to enhancing real with, with RL alone, um, cruise was building a research team.

And one of the best engineers I talked to was trying to build a world model or like a simulator and do this like alpha go for self-driving.

I think that was a phrase from the interviews four years ago.

So a lot of this, and Waymo is now obviously winning.

I think Waymo, I don't know exactly what they're doing.

I think their stack is actually probably the most complicated, um, where they probably were looking at behavior, like all sorts of RL inspired things for very specific parts of the stack to, to improve behavior.

But it's funny that looking back four years ago, this was something that the spectrum of ideas that industry was looking at was actually very related to this.

And in the same time, the self-driving industry has changed a lot.

Uh, so what do you think of this whole industry of self-driving relative to, you have a lot of experience here.

I mean, I'm, I'm a big Waymo fan now, but there's just like, it's so funny how these things evolve.

And I think after this, later on, we'll talk about the, like, this is the RL specific trajectory with simulation, simulated results and stuff too.

Eugene Vinitsky [00:31:57]: I mean, we were interviewing at the same time.

So I was also interviewing with all of these self-driving companies when you were, uh, and, and it, it did seem like it was the place that was the most friendly to doing RL type research at the time.

Um, I think now almost everyone has gone all in on this like imitation learning type approach, um, that are like, this is a huge fraction of what people are doing.

I think a lot of the RL teams have been spun down, uh, which I think is unfortunate a little bit because I think what this work shows is that, uh, it may be wrong to do so that there is a lot of, a lot of value still in RL for this last piece of, of the, of the puzzle.

Um, you know, um, you know, one thing we have here is, uh, an insanely robust policy, right?

So like just an end to end neural network in SIM, it crashes once in a million miles,

Nathan Lambert [00:32:46]: um, crashes at all.

Eugene Vinitsky [00:32:49]: Yeah.

Nathan Lambert [00:32:50]: And you, but what was the number you said before for miles per crash?

Eugene Vinitsky [00:32:53]: Uh, humans are between 20 and a hundred K, um, somewhere, somewhere like that.

It's a little hard to get estimates because it varies from place to place a lot.

So, I mean, a lot of industries are pretty excited about this, like alpha zero for self driving type thing.

And the question, you know, becomes, as you said, like, what is the simulator that we do this in?

And so one perspective that's very prominent is like, let's collect a lot of data.

Let's sell the world model and then let's unroll in that simulator.

And then the challenge becomes like, who do you unroll in that simulator?

Now your world model has a build into itself, a model of the other agents, right?

If you kind of take the single agent perspective, I'm going to unroll world model.

I'm going to place a car inside of it.

And that's the car I'm going to train with RL.

And now what happens.

Nathan Lambert [00:33:40]: This was a big problem for self-driving because you have like a dynamic number of, um, objects in the scene that you're supposed to reason about with your world model.

How does the policy that you train handle this kind of agents coming in and out?

Now, is it all just that you have some, like, are you identifying entities as nearby as other cars are nearby or is there some abstraction or is that the perception stack handles that?

Eugene Vinitsky [00:34:04]: Yeah, exactly.

We roughly simulate a sensor in the sense that you only see cars in some radius of yourself.

Um, but, but we don't, we don't, yeah.

I mean, all the cars are there persistently in the simulator driving around and we, we answered this riddle of like, what should the other cars do by like their self-play, right?

They're a copy of your policy.

They're driving around.

Um, whereas I don't know what happens in the world model, right?

Like kind of in this like world model approach, you're limited by how capable the world model is at simulating the behavior of other actors.

And if your world model has actually learned a robust model of human driving for all the other agents in the simulator, then like, you don't even need, you don't really need, you need to do RL because like the world model already has a model of how humans should behave in a simulator at human level, but they don't.

Um, so yeah.

Nathan Lambert [00:34:53]: And it's just like, it's just, it's, it's so funny that it just feels like they haven't.

And the only way that Waymo et cetera has gotten it, it seems like Waymo has adapted a autonomous stack with like some human inspiration to make the driver more smooth is what it seems like when you're in it, which is like extremely, really strong perception and world understanding with some really clever policy that is tuned to feel human, but probably not human or RL at the bottom of the day.

Eugene Vinitsky [00:35:27]: I wonder, I don't know what Waymo's planning stack actually looks like in the end, right?

Like Waymo's pretty secretive and, uh, I've never worked there.

Um, and if I had worked there, I wouldn't be able to say.

Um, but you know, I think, I think, you know, if I had to make a bet, it's some, some kind of like hand designed cost, um, like mixing a bunch of terms together about like what a good trajectory looks like, maybe mixing with a little bit of human data to like, to make that trajectory feel like a little smooth in human life.

Nathan Lambert [00:35:59]: And yeah, to prompt you, um, what does your, yeah, I agree with this.

What does your history of being a nerd on urban plan and planning make you think of what is coming for self-driving cars?

Eugene Vinitsky [00:36:12]: So, so I guess the thing to mention is I'm a professor of transportation engineering, uh, among other things.

So I have, I have, um, required to have some thoughts on this.

Um, I think that, you know, self-driving cars are, are coming.

Um, I don't know if they're, they're coming a year from now to who knows when the cost curve gets driven down.

Nathan Lambert [00:36:32]: Where we live, they're more likely to come sooner given tech hubs and, um, people are willing to pay very high premiums.

Eugene Vinitsky [00:36:39]: That's true.

So like, like a lot of goods, they may come for, for wealthy folks first.

And then that allows the cost scaling to come down over time.

Um, and it really is a magical experience to take away Mo, right?

Like I remember the first day I saw like the cars driving around and nobody in it.

And I actually just started chasing one of the cars cause I was so like, it was such a magical moment.

I needed to, I needed to experience it for as long as possible.

Nathan Lambert [00:37:04]: Um, yeah, my first time was in Scottsdale, Arizona for one of my best friend's bachelor parties.

He's also an engineer.

And we saw one driving with no person.

And I was like, I wonder if we could take one.

And I immediately download the app.

And because it's in the middle of nowhere, they're testing zone.

They have tons of supply and no demand.

Eugene Vinitsky [00:37:20]: So we were just immediately able to drive one around.

I actually sat in an airport for three hours in, in Phoenix while my phone upgraded to the newest OS so that I could like download the app and take away Mo for the first time there.

Nathan Lambert [00:37:36]: Uh, yeah, it's, this is totally reasonable behavior for anybody listening and you should update your prior.

If you don't think it's reasonable, it's totally reasonable.

Eugene Vinitsky [00:37:44]: It's a totally normal thing to do.

Um, but I think, so I think in cities, like.

Um, so I think that it's, it's still going to be a long time before these things are rolled out at scale.

Um, so just because costs, safety, how long does it take you to verify that it's safe to drive in a new city?

I mean, okay, let's put Tesla aside.

I don't, I don't talk about it.

I don't really know how to think about that.

Um, but that's how I feel too.

Um, there's, you know, there, there's parts of the United States that are due to state dependence, like very spread out, right?

Like because of like suburbization, like, I don't know if that's a word, it's the word I use.

Um, like they're, they're very spatially spread out.

Like in my, my grandpa's hometown or where my grandpa lives, uh, there's no public transit.

There's no way to get by without a car.

Public transit isn't viable because of the way people are distributed.

So if those systems like those continue to exist, like, you know, people will continue to drive there.

And over time, those things will be replaced by, by, by a few self-driving cars.

Um, you know, uh, as a public transit advocate, I would still say that I think within cities, it is significantly more efficient to like fund buses and subways and things like that.

Um, but you know, there's, there's parts of the U S that are just like, so set up and I expect self-driving cars to be part of that.

Uh, yeah.

Nathan Lambert [00:39:15]: I mean, this isn't, this isn't a hot take.

I think you're just kind of like a realistic and you don't have a crazy worldview about it.

Eugene Vinitsky [00:39:22]: Yeah.

I mean, I have, I have my, you know, real deep love for public transit and like a desire for more people to experience it than just the people who live in New York city, where I think New York sees like 50% of all public transit areas in the U S.

Um, uh, but you know, the system is what the system is right now.

Nathan Lambert [00:39:41]: Yeah.

Okay.

Um, let's pivot from self-driving land where we've had this self-play RL and try to draw some analogies to the other RL breakthroughs that aren't language models that have been happening.

I think the one that everybody should know about or in many people do is this, um, locomotion and or sim to real with robotics with humanoids, quadrupeds.

Eugene Vinitsky [00:40:07]: Yeah.

Nathan Lambert [00:40:07]: If you look at it, it is definitely directionally similar to what is this self-play thing is.

I think that it's hard for people who haven't been in RL to understand the grandness of this transition from one agent locomotion to many agents doing something in a cooperative or competitive game with these same mechanisms.

I feel like even talking to you, I don't think we've done a good job grasping just that enormity.

Like multi-agent is just historically so much more complex.

I don't know if there's anything about something like Dota five and how that, I wish I knew more lore of how that happened and it didn't continue because I feel like it could be a good example of why this is actually so much harder than even something like AlphaGo, which is just one policy and these robotics things we're going to talk about, which are one, like it is all still one policy, but just like one thing in the world.

Eugene Vinitsky [00:41:07]: So let me, let me give it another try because I think I also haven't done the greatest job describing it.

So like in something like Dota or Go, there is in fact a notion of like a best way to play.

It's, you know, it's, it's, well, it's, it's, it's like an ash equilibrium.

It's like, you can't do better than that.

If you play it, nobody can beat you.

Nathan Lambert [00:41:27]: Have we arrived at that at Go?

I don't think like, have we actually arrived at these at chess and Go because the ELO scores are still going up.

Eugene Vinitsky [00:41:33]: No, we haven't.

Nathan Lambert [00:41:34]: But like conceivably there is a max.

Eugene Vinitsky [00:41:37]: There is a max.

You're, you're never going to get it.

It's like the game's too big, but there is a best way to play.

And then in every domain where there's a human in the loop, there's not an obvious best way to play it.

And so the challenge has always been like, you know, if I run self-play, it's going to converge to some behavior.

And that behavior is not necessarily something that can drive with, with humans in the loop.

Like, you know, it'll learn something like you could imagine, for example, like you do a bunch of self-play and the cars learn that they can tell their partner where they want to go by hitting the blinkers left, left, right, right, left, left.

That means like, I'm taking a left turn and I'm going to go at 25 miles per hour.

And so there's this idea that like, there's all these policies that you wouldn't want to play and don't make any sense.

And kind of what we show in this paper is that if you do a little bit of reward design and you really scale up RL, then like the simple fact of like having, being uncertain about where everybody wants to go and having to be very robust to collisions constrains your behavior in such a way that you like broadly learn how to drive well.

Um, and, and I think this is transferable to other domains where, you know, you want some kind of base policy that roughly knows how to do the task well over some unknown distribution of partners.

Nathan Lambert [00:43:02]: How easy do you think it would be to learn in the same simulator?

If all the other cars were controlled by this policy, if you only had to learn for one policy, how much easier is it to learn just one policy that kind of works in the world rather than this multi-agent and everybody is learning at one setup?

Because this is essentially what people have done is like, we've learned how to control one robot in the world and do that super well versus learning everything from scratch with multiple is well harder.

Eugene Vinitsky [00:43:30]: And I, I think if imagine, if you imagine that, okay, we have N cars and N minus one of them are controlled by a perfect model of human driving, right?

Like, I think that you could learn that super fast and really robustly.

Um, and of course the problem is we don't have that one perfect model of human driving that we can plug into our simulator.

I don't think you would, it would take, you know, the trillion samples that it took us.

Nathan Lambert [00:43:54]: So, so I think this, yeah, so that's the difference.

And that's what these other things and that's like quadruped for robotics to have, let me just let you talk about it.

Where do you think this sim to real robots and single agent is at and heading?

So like, what do you see in the, and being more, slightly more plugged into the academic RL side of things?

So, so it's like 2021 and 2022 is when these Marco Hutter group papers started.

And I'm sure the trend is still content, continuing.

Eugene Vinitsky [00:44:23]: It's still continuing, right?

Like for quadrupeds, we, we, there, people are regularly making them do these insane behaviors that we, we haven't been able to design in other ways.

And I think the real, the lesson there is like at scale RL works.

Like a lot of the lessons of like self-supervised learning are transferable to the RL side.

Um, and while it would be great to get the sample complexity down and stop doing this with like a trillion samples, you know, if you're willing to bite that bullet and just scale, um, and you, you have, you know, kind of a, a fairly good simulator, um, you know, you can, you can, you can really do incredible things.

Nathan Lambert [00:45:00]: Um, do you think these RL results scale more with model size or sample complexity?

Do you think that they're kind of brute forcing it through many more inter scaling the interactions with the world?

Eugene Vinitsky [00:45:10]: Yeah, uh, I think that they scale, um, scaling with model size is like a little iffy in, in, in, in RL.

Um, there, there are tricks that people have been coming up with to let you use bigger and bigger models.

But right now I think a lot of the, the impetus is towards smaller models that like have low inference costs, um, that let you like force a trillion samples into the policy, right?

Whereas if you make the model bigger, inference cost becomes more of a thing.

I think it's harder to acquire the samples.

Nathan Lambert [00:45:38]: Um, so I think, I think this relates to, I think the other area that I'm excited about in RL is this like procedural generation and open-endedness.

Do you think this kind of, I see needing to see a ton of samples as being in spirit related to this where open-endedness is, I think a field of study designed to make agents that are good at collecting the right samples and in using the word explore, which we haven't really used.

Eugene Vinitsky [00:46:07]: Yeah.

So I think a lot of, a lot of what we're doing here is actually kind of dodging the exploration problem in a lot of ways.

Um, and in general, this is something that like the RL that works is a lot about dodging the exploration problem, right?

Why do we need a trillion samples?

Because we explore very inefficiently.

Um, if we, you know, the, the kind of, this is, I think what we have, if we talk about the quadrupeds and things like that, right?

Like these are well-defined tasks with a well-understood reward function.

Um, and this, you know, at some point as we started to scale up RL, this, this, this task design will become the bottom, right?

It's like, uh, what tasks should the agent do?

There's a human in the loop sitting down writing the reward function saying, okay, that's a good task.

This is a good task.

The kind of the dream of open-endedness is that we'll move away from this and towards kind of just taking the human, this task designer out of the loop.

Nathan Lambert [00:47:00]: Let's start taking a step back.

Are there tasks that you think are heavily simulated in control domains that are actually well-suited to this RL approach that may have not been done yet?

I mean, simulation is like a core tool in robotics and autonomy.

So what other things are doing heavy simulation and not leveraging this?

Maybe even like hard sciences are doing this.

Eugene Vinitsky [00:47:19]: I think this is going to eat almost everything that can be simulated.

Uh, well, so like the fundamental thing is like, can you simulate it with a relatively small sim to real gap and can you simulate it efficiently?

Um, and if you have both of those things, like I think RL is going to eat all of those things.

Um, so, and, and, or you can also scale this up by like, you know, uh, paying the price.

So if you can, uh, for example, like I expect like formal verification, like agents that like, uh, write lean proofs to do really well.

Um, you know, any there it's expensive because the simulator is slow, but there's no sim to real gap.

Nathan Lambert [00:47:57]: Um, I'm thinking in the, like this scientific in control domain.

I think one of them is, I mean, a timely example is humanoids, which I'm very, um, I've been consistently bearish on them.

I think if you have the simulator, the control policy will be solved, but I think most of it is an environment problem where the robotic actuators are too strong.

So therefore they're limited to manufacturing.

And I don't necessarily know how much a humanoid, um, is better than the static arm in manufacturing and logistics.

Eugene Vinitsky [00:48:29]: So I might be bearish on humanoids for similar reasons, but like, you're, I guess you're right on the point.

I think like, will we think, will we be able to make a humanoid follow whatever trajectory we would like it to follow through scaling up RL?

Yeah, I think so.

Like, but then the question becomes like, what trajectory should it follow?

And then that's where things get iffy again, right?

Like exactly, you know, how, how softly should it behave?

You know, how, stuff like that, like task planning, things like that.

But, but from the, like the controls perspective of like, here's a system, I want it to follow this trajectory.

Um, most of these things have like good fast simulators.

Nathan Lambert [00:49:10]: Um, do you think RL should be used more in AI for science than scaling deep learning?

So I'm guessing there are a lot of scientific domains that are very simulation intensive.

And a lot of the approaches and excitement right now is to train a deep learning model to predict data.

I think there's Evo2, which is a recent DNA sequence predictor.

And I was reading a lot about this.

And a lot of the criticism is like, we don't really know how to use it.

And the mechanism is if, if the model is like, oh, I don't know about this DNA string, and then like, maybe it's a mutation.

And there's a lot of weirdness like that.

Yeah.

But maybe it's still that just this slow burn of scaling RL systems has a, at least like a more direct way that can potentially improve some domains.

Eugene Vinitsky [00:49:54]: Great question.

Um, super interesting question.

Um, so I think that the story I've been telling you about like sample inefficient RL scaling really well, I think we, we understand that pretty well.

And less, less clear on the sample in RL in like limited sample domains.

And I think a lot of the thing in deep learning for science that the simulators themselves are quite slow.

Um, so like if you want to simulate, say like a fusion, like loop, um, honest, like, like a tokamak of some kind, um, it can take actually months to run a single simulation.

Nathan Lambert [00:50:28]: Um, then what do you think of the deep, you brought this up?

What do you think of the deep line nuclear fusion control paper then?

Eugene Vinitsky [00:50:34]: Uh, they might've been doing a slightly different simulator.

It's a different simulator.

It's not as it doesn't, it's, I don't think it requires quite as much, much precision.

Um, I'm thinking of other, other, other simulators, um, to be clear, I haven't read that paper super closely, but if you think about something like AI for materials or AI for bio, like a lot of these are fairly slow simulation platforms.

What I do think is pretty exciting is I think at some point somebody is going to, and there are a lot of bottlenecks to this, like someone's going to build like a, an autonomous lab and just like keep letting it loop forwards and characterizing some material and then like running it through the loop again.

The problem there is actually this like characterization step doing it correctly is really hard.

Um, like, you know, what are the properties of the material that I've just synthesized?

Um, but, but, you know, so, so I think that in terms of RL for science, I think that trajectory is a little trickier because of this like kind of low ability to acquire samples.

Whereas in the humanoid and the quadruped domain, we can generate just, you know, people they'll simulate like 2000 humanoids at once on one GPU or something silly like that.

Nathan Lambert [00:51:41]: Um, do you think these things scale nicely with action space?

I feel like if we want to do this open-ended learning in a lot of environments, I don't know how to constrain the action space in a nice way.

So that somewhat worries me.

Eugene Vinitsky [00:51:55]: So I think there's a couple of pieces of that.

So I think LLM sometimes give you pretty good priors over actions, right?

That's the thing we've been pretty consistently seeing is that they constrain the action space on their own in a, in a really helpful way.

Um, it is also the case that with like much larger action spaces, you just eat a sample complexity penalty and like things take longer, but we're seeing it be fine.

You know, kind of in the domain of like 500 actions, this kind of thing.

Now, if we all of a sudden go out to like 5 million actions, I think all bets are off.

Nathan Lambert [00:52:26]: Um, it does kind of seem like they might have the same thing that happened with language models is that open-endedness.

Now all the agents are going like pointing out a language model or some general interface, like a computer interface that ends up concerning action space to keyboard mouse inputs, which order of magnitude wise is actually the same in action space.

Eugene Vinitsky [00:52:49]: I think, I think there are a lot going to be a lot of helpful constraints on the action space to let you, you know, deal, deal with this problem, right?

You know, you're not operating in totally open-ended language or open-ended use of a computer.

Nathan Lambert [00:53:04]: To recap some of this, just because it's interesting is this self, it's like the self-play question is, is the hardest, is the hardest one to grok, which is really just that, uh, honestly, I still don't even fully understand it.

And I will listen to some of these things, but it's just like the scaling in sample time is the biggest, is the thing that makes RL actually work is that we can scale in samples.

And that is the thing that's needed for most of these domains.

And it's very different than what's happening in language models, but it's at least a consistent theme across what is now over like five years of like resurgent RL for control results.

Eugene Vinitsky [00:53:41]: Scaling works in RL, right?

There's, there's like no, no real wall here.

Nathan Lambert [00:53:46]: It's a different type of scaling than people expect.

I think a lot of the historical scaling papers were trying to scale parameters and looking at something like deep, um, what is dreamer V3 or whatever, and scaling the parameters of the policy with the number of environment interactions.

But it seems like that is just like, it's actually just different axes to what is thought of in traditional deep learning scaling.

Eugene Vinitsky [00:54:07]: It's, it's kind of the number of samples that you're getting.

Nathan Lambert [00:54:10]: Which is very good to say clearly.

And then the, um, the self-play thing is changing the domain to be one that is much more complicated.

And the fact that it can still work there opens up a lot of interesting questions.

So like the scaling thing is an independent axis that actually works.

And then the self-play thing is dramatically increasing the complexity of your problem from this single agent world.

Eugene Vinitsky [00:54:34]: It's adding a lot of like diversity through, you know, there being other agents in the system that behaving in unexpected ways.

Um, so yeah, there's like scaling up the sample complexity.

There's scaling up in diversity of your simulator and in the single agent domains, you don't need to scale up the diversity of your simulator.

You have one task you want to do it.

Um, but yeah, yeah, that makes sense.

Nathan Lambert [00:54:55]: Okay.

I'm pretty, that's, these are interesting takeaways to kind of reflect on what they mean, um, to kind of go towards wrapping this up or go into language model corner and then grad school advice corner.

You had a paper that you, um, you were on a paper as an advisor of large language models are in context, preference learners.

What, what is this story here?

I think I've been preference learning is openly out of vogue, but I think that's because people are short-sighted and AI is so easy dominated where it's like, everyone is still actually doing preference tuning, but everyone's just talking about like RL and verifiable domains

Eugene Vinitsky [00:55:31]: or whatever, whatever the hype is that I, yeah.

Nathan Lambert [00:55:34]: But like, what is your take on this preference learning thing?

And I know you have big self-play and RL background here.

Eugene Vinitsky [00:55:42]: YSo I'll tell you, I'll tell you how we got here real quick and it'll, it'll make clear and I should say that there's coincident work by Jaden Clark, Joey, I'm going to mispronounce his name, Hedgna and Dorsa at Stanford.

Um, that kind of got the same idea across at the same time.

And the idea is that if you want to do preference learning, right.

And you're doing preference learning, you're doing like RLHF from scratch or something.

You have to learn this reward function.

And you have to acquire a lot of samples to do it, to do so.

Um, and you know, the tabula rasa version of this is really inefficient.

And it turns out that what you can do is if you have a particular like description of a task you'd like to do, you can ask a language model, like write me a set of reward functions that I think correspond to this task, right?

You'll take those reward functions.

You'll train an agent corresponding to those reward functions.

You'll have a human rank, um, their preferences over those reward functions.

Like this was a good one.

That was a bad one.

And you'll feed all of that to a language model again, and do another loop of asking it to write some reward functions, given these preferences.

And it turns out that language models can take that information and use it to decide what a good next reward function to try out is.

And over time, you're going to get a reward function that is much more aligned with your preferences just by having the language models actually write the rewards.

And so this is like, you know, kind of letting you do like personalization or reward tuning in like 50 human queries kind of scale.

So like this kind of came about because we want, we were asking like, if I wanted to build an agent that like acts scared or is deceptive, right?

Like I kind of have to do some amount of preference learning, right?

Like is deceptive is the thing that's defined with respect to humans.

Um, and so, you know, we're trying to figure out how to do something like that sample efficiently.

It turns out you can just ask an LLM, write a reward function that corresponds to being deceptive and then like run that loop a bunch of times.

Nathan Lambert [00:57:35]: So I would say that this means that like language model personalization doesn't need to be done within the parameter space or something like this.

Or like, imagine having the domain here is, yeah, the domain here is partially at least control looking at, looking at this figure, but it's goes to show that they actually probably have the representation and it's just a problem of setup and how do you get people to actually do this and stuff.

I've, I've said the same 20 questions saying, which is what if a language model and ask 20 questions about the user and get the information out of there.

So it's nice to see that this might not be like, it's not an actual technical limitation.

It's just like, we don't like, how the hell do you do that and chat or whatever.

Eugene Vinitsky [00:58:17]: We haven't tested that particular case.

And I want to, I think it's like a thing that might work.

Like our case, it was like, make a robot jump like a human, right?

Like write a reward function that corresponds to jumping like a human.

And it turns out like a couple of rounds of iteration, you can get a language model to write down a reward function.

Nathan Lambert [00:58:32]: Did the reward function make sense?

Like what the, what is the reward function for jumping like a human?

Eugene Vinitsky [00:58:37]: You know, it's like, you know, make sure that you jump off both legs instead of hopping on one leg, you know, don't flail your arms around too wildly, you know, that type of thing.

I think, I think the whole reward function is in the paper.

But yeah, you know, it's got some interpretable terms, like the, the, the, the base reward function that comes out in the first iteration always involves hopping on one leg.

And then by the end, it's, you know, two legs, not too much flailing.

Nathan Lambert [00:59:04]: There's like a deeper RL control thing there, which is that all of the agents when they start are just wildly flailing when you learn control from scratch.

So that's in many ways, not super surprising.

Um, do you have any comments on the general like language modeling RL stuff?

I think it's so focused on, and part of the point of this is just like broaden the, broaden the lens by which people consider RL to be a thing.

Eugene Vinitsky [00:59:33]: I mean, the biggest thing I should say there is I think it's going to work.

Like, I don't think like for domains where you have verifiable rewards, like I just, I think this is going to work.

Just going to have to bite the bullet and, and, uh, generate a lot of samples.

Nathan Lambert [00:59:46]: Um, so, uh, it's interesting that you say that because one of the biggest things we find is you just have to keep training, right?

So I mean, it's a classic RL curve.

So you start out, it's a log, it's like a really, it's like, you start out really fast and then you're just on this plateau of just, you're getting a little bit more for a really long time.

And it's much far fewer samples and pre-training and everything, but it is, the learning curves look so similar to anything you would get in RL and you can have crashes and stuff, which is such a throwback.

And it's very different than preference tuning, which you have this whole over-optimization thing and stuff like this, where it's just so much less artful.

It's like, just so obvious.

It's like, if the number's going up, it's probably fine.

And you don't really have to do that much.

It's nice, right?

There's a number.

Eugene Vinitsky [01:00:32]: There's a, there's a number.

You just have to push that number up.

Life is great.

Nathan Lambert [01:00:36]: Last, last section is career corner.

What do you think people are interested in working on an RL right now?

What do you say to them?

Eugene Vinitsky [01:00:46]: I mean, I think, I think RL is just starting to eat different domains.

So like, I think this is a really good time to get started on it.

You know, there, there are not enough strong RL researchers, surprisingly.

So like, I don't think we're like an, even an oversaturated domain.

Nathan Lambert [01:01:07]: Do you think it's feasible for grad students to do this language model stuff and core like RL agent stuff?

Because the agent stuff feels like more of a long-term moat because you're doing something that fewer people know about.

But should people like fully ignore the language model stuff if they're trying to get established as a grad student?

Eugene Vinitsky [01:01:24]: I think that you should like, this is an important academic thing is like, you need to focus on demonstrating the idea with as minimal, as few barriers as possible to it, right?

So you want to pull out the minimum version of the demonstration.

And a lot of the time putting a language model in there is adding a huge bottleneck, right?

All of a sudden you need a bunch to use, training takes forever.

So, I mean, you should probably do some language modeling things at some point because like, it's a good skill to have demonstrated when you go on the job market.

So I think a lot of students will do is they'll do that in their last year or two of grad school, just to show that they kind of know, can't, can do this.

But like for demonstrating the core ideas, I don't think, I don't think you always have to use the language model there unless your ideas are deeply tied to that domain.

Yeah.

Nathan Lambert [01:02:14]: The way that by which things scale and results are communicated is just really different in the RL for domain or the core RL algorithm or the language model plus RL thing, which I think is sequencing probably can be the best bet, which is like starting with something less competitive and focusing on skill development, which is generally my biggest, should I do a PhD answer is if you're invested in developing new skills or you are a person that's like actually extremely academic and scientific in nature, which there are a subset of people

that are just truly like scientists in the nature of the word and they will probably thrive emotionally in that situation.

But most people want to do an AI PhD because they think of it as a credential for a future job, which is generally a hilarious and ill-formed side effect of wherever the field is right now.

Eugene Vinitsky [01:03:03]: It's just such a bad idea.

There's one brief moment in history where a PhD was like a definitive route to a, like a high paying job, right?

Generally what a PhD is supposed to do is it's like, Hey, it should be fun.

It should be fascinating.

It should be like five years where you're like, I could not imagine doing something cooler than what I'm doing right now.

And then it's supposed to unlock some jobs that aren't accessible to you otherwise.

Running a research team in industry, doing particular skills that, you know, using particular skills that are hard to develop unless someone gives you a year or two to focus on, right?

Like hard optimization problems, a lot of specialties.

But, you know, the like, I'm going to do a PhD.

That's going to give me like a 500K total compensation job straight out of grad school.

It's just like such a weird quirk of history that like optimizing for it is never a good idea.

Nathan Lambert [01:03:56]: Yeah.

And I think that kind of says if people are trying to be, or people are grad students or like junior faculty or junior grad students right now, I think if anything you're optimizing for is trying to extract value from that quirk of history, you're putting yourself in a precarious position.

Eugene Vinitsky [01:04:15]: Yeah, optimize for being able to do cool things, you know, that's, that's a consistent thing you can always optimize for.

It doesn't go away when the job market changes.

Nathan Lambert [01:04:29]: Yeah, I agree.

I mean, that could be a good place to end it.

You can do, you can actually surprisingly still just do things.

And yeah, I think it's easy to lose track of that in the language modeling chaos.

Eugene Vinitsky [01:04:43]: Yeah, I mean, but look, also, I'm coming from a position of privilege, right?

Like I have a faculty position.

Nathan Lambert [01:04:48]: We're early enough where it is fine.

Eugene Vinitsky [01:04:51]: Yeah.

Okay.

Well, you know, this has been a pleasure, you know, thank you for taking the time to chat with me and give me a chance to talk about this paper, which is, I think, still had some trouble conveying exactly why I think it's so exciting.

But hopefully some of it got across.

Nathan Lambert [01:05:06]: I think we got to some clear things and the self-play being weird thing definitely gives me more sympathy to how bad the discussion around self-play for language models is, because there is very nuanced on why this, like what we're doing with RL with verifiable rewards is very different than language models talking to themselves and both updating their policy.

And it's not to say we shouldn't be trying that, but we should be very wary in saying that is that until we are trying to do really, really hard things.

The grandioseness of language model self-play is probably like trying to let language models discover their own language to be more effective at tasks.

And they do this with interacting with each other.

Eugene Vinitsky [01:05:50]: And so, yeah, language model self-play for tasks they haven't been trained on, like learning to do new tasks collaboratively together.

Super exciting.

It makes sense.

I'm doing some work on it.

I'm excited about that.

This thing where you, the amount of knowledge that they have is bounded and you do self-play to kind of refine the distribution that they're playing over, as opposed to like doing a new task together, a little weirder, weirder stuff going on there.

Yeah.

Nathan Lambert [01:06:16]: Yeah.

So I think it's good.

I think people now know that this, the single agent RL stuff working is not surprising.

And the self-play area could be one of these like multi-year out before takeoff thing.

And there are early signs that it could actually work.

And that's something that people are often looking for, which is what are the problems that there are a bit more risk on, but not complete risk and not the obvious thing that everybody is working on.

Eugene Vinitsky [01:06:41]: Yeah.

And I think if you're thinking that category, collaborative agents, agents that know how to effectively collaborate with humans, with other AI agents, very underrated area, going to be big in a bit, I think.

Nathan Lambert [01:06:54]: I think the ones who are playing these models is what does it mean for multiple language models to be interacting on the web that have separate goals, but are going to interact with each other and it's not necessarily the same self-play setup, but you could understand it with, through some of these lenses.

And it's easy to see how this is going to come about with you when listening to the marketing from all these labs.

Eugene Vinitsky [01:07:17]: Yeah.

Well, it's, yeah, it's going to be fun.

It's going to be weird.

It's going to be great.

I also, I do have to inject a quick little pitch before I, before I disappear.

Nathan Lambert [01:07:28]: Sounds good.

Eugene Vinitsky [01:07:28]: Um, so, uh, I've joined, uh, like a new, like AI, uh, stealth AI company.

Um, we're like working on, um, making better decisions in critical industries like healthcare, supply chain, defense.

So like in industries that are like not very tech forward, not very AI native, um, but like are where almost all productivity lies.

Um, and so, uh, we're, uh, looking for strong folks with experience with either RL or LLMs.

Nathan Lambert [01:07:55]: Do you have a company name?

Eugene Vinitsky [01:07:56]: We do not.

Nathan Lambert [01:07:57]: Okay.

Well, we'll have a link.

We'll have a contact link below.

I think, yeah, Eugene is a fun person.

Um, so, and at least knows enough to make it through this conversation, which is, there are a lot of startups out there that are riding on less.

So that's good.

Eugene Vinitsky [01:08:13]: Um, that'll be fun.

It's, it's, there aren't, I don't think there are that many applied RL teams out there in the world.

So maybe there are, and I just don't know them, but I think like Mark Belmer's startup is

Nathan Lambert [01:08:24]: actually potentially related, which is seems like it's kind of trying to scale.

RL with the mix of language models to solve problems, but I haven't asked him directly.

Eugene Vinitsky [01:08:33]: I mean, that's somewhat of what we're, what we're doing, doing too.

Um, but, uh, that's all I can say.

Nathan Lambert [01:08:42]: Yeah.

Sounds good.

It's, it's early days for, for that and self-play and many other things, but, um, I'm sure we'll cross paths soon.

Either if I go back to New York or if you, for some reason, come all the way to Seattle, which I don't know.

I love Seattle.

Eugene Vinitsky [01:08:58]: So yeah, I'll love in the fall.

Um, anyways, yeah, uh, it was, it was a pleasure talking to you and hopefully, uh, you know, get a chance to talk again soon.

Nathan Lambert [01:09:08]: Yeah, we'll do.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Elicitation, the simplest way to understand post-training Mar 10, 2025

Full post: https://www.interconnects.ai/p/elicitation-theory-of-post-training

If you look at most of the models we've received from OpenAI, Anthropic, and Google in the last 18 months, you'll hear a lot of "Most of the improvements were in the post-training phase." The most recent one was Anthropic’s CEO Dario Amodei explaining Claude 3.7 on the Hard Fork Podcast:

We are not too far away from releasing a model that's a bigger base model. Most of the improvements in 3.6/3.7 are in the post-training phase. We're working on stronger base models (perhaps that will be the Claude 4 series, perhaps not; those are coming in a relatively small number of time units [months?].

Here's a simple analogy for how so many gains can be made on mostly the same base model.

The intuition I've been using to understand the potential of post-training is called the elicitation interpretation of post-training, where all we are doing is extracting and amplifying valuable behaviors in the base model.

Consider Formula 1 (F1), most of the teams show up to the beginning of the year with a new chassis and engine. Then, they spend all year on aerodynamics and systems changes (of course, it is a minor oversimplification), and can dramatically improve the performance of the car. The best F1 teams improve way more during a season than chassis-to-chassis.

The same is true for post-training. The best post-training teams extract a ton of performance in a very short time frame. The set of techniques is everything after the end of most of pretraining. It includes "mid-training" like annealing / high-quality end of pre-training web data, instruction tuning, RLVR, preference-tuning, etc. A good example is our change from the first version of OLMoE Instruct to the second — we improved our post-training evaluation average from 35 to 48 without touching the majority of pretraining.

Then, when you look at models such as GPT-4.5, you can see this as a way more dynamic and exciting base for OpenAI to build onto. We also know that bigger base models can absorb far more diverse changes than their smaller counterparts.

This is to say that scaling also allows post-training to move faster. Of course, to do this, you need the infrastructure to train the models. This is why all the biggest companies are still building gigantic clusters.

This theory folds in with the reality that the majority of gains users are seeing are from post-training because it implies that there is more latent potential in a model pretraining on the internet than we can teach the model simply — such as by passing certain narrow samples in repeatedly during early types of post-training (i.e. only instruction tuning).

Throwback to the superficial alignment hypothesis

Another name for this thoery is the Superficial Alignment Hypothesis, coined in the paper LIMA: Less is More for Alignment. This paper is getting some important intuitions right but for the wrong reasons in the big picture. The authors state:

A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples [Kirstain et al., 2021].

All of the successes of deep learning should have taught you a deeply held belief that scaling data is important to performance. Here, the major difference is that the authors are discussing alignment and style, the focus of academic post-training at the time. With a few thousand samples for instruction finetuning, you can change a model substantially and improve a narrow set of evaluations, such as AlpacaEval, MT Bench, ChatBotArena, and the likes. These do not always translate to more challenging capabilities, which is why Meta wouldn’t train its Llama Chat models on just this dataset. Academic results have lessons, but need to be interpreted carefully if you are trying to understand the big picture of the technological arc.

What this paper is showing is that you can change models substantially with a few samples. We knew this, and it is important to the short-term adaptation of new models, but their argument for performance leaves the casual readers with the wrong lessons.

If we change the data, the impact could be far higher on the model’s performance and behavior, but it is far from “superficial.” Base language models today (with no post-training) can be trained on some mathematics problems with reinforcement learning, learn to output a full chain of thought reasoning, and then score higher on a full suite of reasoning evaluations like BigBenchHard, Zebra Logic, AIME, etc.

The superficial alignment hypothesis is wrong for the same reason that people who think RLHF and post-training are just for vibes are still wrong. This was a field-wide lesson we had to overcome in 2023 (one many AI observers are still rooted in). Post-training has far outgrown that, and we are coming to see that the style of models operates on top of behavior — such as the now popular long chain of thought.

The counterargument to elicitation

The prevailing counterargument to the elicitation theory of post-training has been that post-training is teaching specific skills to the model. This can be seen through very large finetuning datasets used in the early eras of open models. One of the last prominent research examples of this is OpenMathInstruct 2, which showed performance improvements with finetuning on up to 14.3 million instructions.

A rough industrial norm is that you can do excellent post-training with only 1 million prompts at instruction tuning. The way to interpret the above plot with the elicitation theory is that the vast majority of the gains come from the beginning of training.

The trade-offs discussed in this counterargument, such as scaling post-training methods, were specifically discussed in the paper Revisiting the Superficial Alignment Hypothesis, where it is clear that you can teach new skills to models in post-training. The hardest part today is to know when the skills are entirely new — it is often safer to assume the model has seen them at pretraining. Therein, it is far more efficient to elicit behaviors from the model than to teach, and eventually, the tide may shift to favor teaching models, but not in the near future.

Comparing this math training dataset to current best (and emerging) practices for reasoning models makes the strongest case for the elicitation theory yet.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

RL’s role in elicitation

The reinforcement learning (RL) training we’ve seen take off in this early era of reasoning models is often described as “sample efficient” and “the model learning new behaviors.” Both of these fit with the theory presented. The astute way to view the model learning new behaviors is not that it is learning entirely new abilities but rather learning to reinforce behaviors that were already latent in the pretraining dataset.

Compared to teaching the model math with millions of samples, just a few thousand prompts of RL training can far surpass the performance in MATH shown above.

In many ways, RL training exploding in popularity and effectiveness is the ultimate endorsement of the elicitation theory. Where we used to try and teach the model math with millions of supervised samples, now we just let the model try different approaches on thousands of math problems, and they reach far higher peak performance.

This is, of course, also linked to why people say that “stronger base models are better starting points for RL.” All of this fits together as the base model is the platform on which post-training is built.

A reductionist take here is to say that pretraining is not important — in reality, pretraining is just slow and hidden from most of the gains we are seeing. Still, excellent post-training and the performance improvements we enjoy today are all well downstream of pretraining. Pretraining still is arguably the most important part of the training stack, as it allows those with confidence in the elicitation during post-training to thrive.

Thanks to Mohit Raghavendra for some email exchanges that helped this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Where inference-time scaling pushes the market for AI companies Mar 05, 2025

Link: https://www.interconnects.ai/p/where-inference-time-scaling-pushes

There’s a lot of noise about the current costs of AI models served for free users, mostly saying it’s unsustainable and making the space narrow for those with the historical perspective of costs of technology always plummeting. GPT-4.5’s odd release of a “giant” model without a clear niche only amplified these critics. With inference-time compute being a new default mode, can we still have free AI products? Are we just in the VC-subsidized era of AI?

For normal queries to ChatGPT, the realistic expectation is that the cost of serving an average query will drop to be extremely close to zero, and the revenue from a future ad model will make the service extremely profitable. The most cohesive framework for understanding large-scale internet businesses built on the back of such zero marginal costs is Ben Thompson’s Aggregation Theory.

Aggregation Theory posits that extreme long-term value will accrue to the few providers that gate access to information and services built on zero-marginal cost dynamics. These companies aggregate user demand. It has been the mode of modern dominant businesses, with the likes of Google and Meta producing extremely profitable products. Naturally, many want to study how this will apply to new AI businesses that are software-heavy, user-facing platforms, of which OpenAI is the most prominent due to the size of ChatGPT. Having more users and attention enables aggregators to better monetize interactions and invest in providing better experiences, a feedback loop that often compounds.

Aggregators are often compared to platforms. Where the former relies on being an intermediary of users and other marketplaces, platforms serve as foundations by which others build businesses and value, such as Apple with the iPhone, AWS, or Stripe.

Businesses like ChatGPT or Perplexity will rely on a profitable advertisement serving model being discovered that works nicely for the dialogue format. ChatGPT interweaving previous discussions into the chat, as they started doing in the last few months, is encouraging for this, as they could also have preferred products or sources that they tend to reference first. Regardless, this will be an entirely new type of ad, distinct from Meta’s targeted feed ads, Google’s search ads, or the long history of general brand ads. Some of these past ad variants could work, just sub-optimally, in the form factor.

An even easier argument is to see the current hyperscalers using low-cost inference solutions on AI models that complement their existing businesses and fit with components of Aggregation Theory — such as Meta serving extremely engaging AI content and ads. The biggest platform play here is following the lens through which language models are a new compute fabric for technology. The AWS of AI models.

All of these business models, ads, inference, and what is in between, were clear very soon after the launch of ChatGPT. As the AI industry matures, some harder questions have arisen:

* Who bears the cost of training the leading frontier models that other companies can distill or leverage in their products?

* How many multiples of existing inference paradigms (0-100s of tokens) will inference-time scaling motivate? What will this do to businesses?

This post addresses the second question: How does inference time compute change business models of AI companies?

The announcement of OpenAI’s o3 with the inference cost on ARC-AGI growing beyond $5 per task and the proliferation of the new reasoning models raised the first substantive challenge to whether aggregation theory will hold with AI.

The link to inference time compute and the one that sparked this conversation around aggregators was Fabricated Knowledge’s 2025 AI and Semiconductor Outlook, which stated:

The era of aggregation theory is behind us, and AI is again making technology expensive. This relation of increased cost from increased consumption is anti-internet era thinking.

This is only true if increased thinking is required on every query and if it doesn’t come with a proportionate increase in value provided. The fundamental operations of AI businesses will very much follow in the lens of Aggregation Theory (or, in the case of established businesses, it’ll reinforce advantages of existing large companies), and more work is going to be needed to figure out business models for inference-heavy products.

We can break AI use today into two categories:

* ChatGPT and general-use chatbots.

* Domain-specific models, enterprise products, model APIs, and everything else that fits into the pay-for-work model (e.g. agents).

The first category is established and not going away, while the second is very in flux. Inference time scaling will affect these in different ways.

Consumers — well, most of them (and not most of you reading this who are power users) — will never know how to select the right model. The market for super users is far smaller than the market for general use. The core for consumer products is having the model know how much compute to spend. This is where RL training will likely be most important and is something notably missing from the release of Claude 3.7 Sonnet.

OpenAI’s model offerings and initial excitement around inference time compute made many, myself included, get excited about the idea of a compute dial being shown to the users so they can control the “thinking effort” for their query. The problem is that rules for how well that translates to performance rely on a deep understanding of AI and how language model performance is very stochastic.

The so-called dial is being reduced to simple reasoning buttons or always-on reasoning — extremes and yes/no decisions are much easier for users to work with. This is already how I engage with models. I start with a normal model, and if it doesn’t work, I punt to o1 pro. Would my trying to guess the right spot on a dial for a new query really be a good experience? Please, the model should know its own limits and handle that on its own.

Today, the RL-trained reasoning models primarily serve as a trust and legibility enhancement to average users rather than a performance improvement. This is leading to the exposure of the Chain of Thoughts (CoTs) to be an industry norm. At the same time, this sort of minor increase in context length will still be subsumed into a zero marginal cost style business, pending the assumed discovery of a functional ad model. This is all also just the tip of the iceberg for inference time compute. From my coverage of Claude 3.7:

RL training is a short path to inference time scaling laws being used, but in the long-term we will have more methods for eliciting the inference-time tradeoffs we need for best performance.

For power users and enterprises, RL training and one model fits all is less important. Enterprises will want to benefit from clear trade-offs on performance vs log(compute).

Many in the industry, including in the aforementioned Claude 3.7 release and o3’s ARC-AGI performance, are discussing the use of parallel test time compute relative to just increasing generation lengths. Inference time scaling with parallel computation and strong verifiers will be essential to the long-term trajectory of the sub-area.

Where RL models can increase the compute spent by a model by factors of 2, 4, or 10 for a question, parallel computation already uses factors of 1000 (see o3), and will go far higher. This is a far more direct way to continue scaling the log-compute plots for inference time scaling. It’s also more efficient due to the quadratic costs of generating longer contexts — in fact most of the models we are using cannot scale output length infinitely, as we can with the number of samples.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Better verifiers will increase the slope of the inference time scaling plots we are seeing, as discussed in our coverage of Claude 3.7.

Models will be trained to make the probability of a true answer appearing increase over many generations and maximizing the probability that the extraction method can select it, rather than maximizing the probability that 1 single generation is correct out of the box. This is a very different way to finish the training of models than has been considered in some time. Here’s a recent example of a research paper studying this, Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models, and more will surely come soon.

Verification as the limiter for inference-time scaling performance is not a new idea. It was the starting point of my coverage on inference time scaling, before the release of OpenAI’s o1 (and mentioned in the recent post on Claude 3.7). Ultimately, the challenge is robustness, not if the models can generate the answer:

I wanted to highlight a result from the last paper, Large Language Monkeys, as to why inference time compute is feasible. That paper focuses on repeated sampling from a variety of language models to understand the distribution of answers in a model. They show that with an Oracle answer extractor (more on this later), even models as surprising as Pythia-70M have the true answer inside.

Remember, the domain of math answers is infinite. This shows that underlying distribution of the models has the right answer, we need to figure out how to extract the right ones. We need strong verifiers to make answer selection easy. The Snell et al. paper above (1.) calls this the “Proposer and Verifier” perspective.

The understanding that while the models we are using will almost always be able to generate the right answer and the fact that training verifiers for using that has just started should increase optimism that inference time scaling can work.

This type of performance will not be cheap. Unlocking new potential applications is still worth way more than the few dollars these queries can cost. Noam Shazeer of Google explained this on his Dwarkesh Podcast appearance with Jeff Dean:

One of the big areas of improvement in the near future is inference time compute, applying more compute at inference time. I guess the way I like to describe it is that even a giant language model, even if you’re doing a trillion operations per token, which is more than most people are doing these days, operations cost something like 10 to the negative [18 dollars]. And so you're getting a million tokens to the dollar.

I mean compare that to a relatively cheap pastime: you go out and buy a paper book and read it, you're paying 10,000 tokens to the dollar. Talking to a language model is like 100 times cheaper than reading a paperback.

So there is a huge amount of headroom there to say, okay, if we can make this thing more expensive but smarter, because we're 100x cheaper than reading a paperback, we're 10,000 times cheaper than talking to a customer support agent, or a million times or more cheaper than hiring a software engineer or talking to your doctor or lawyer. Can we add computation and make it smarter?

I think a lot of the takeoff that we're going to see in the very near future is of this form. We've been exploiting and improving pre-training a lot in the past, and post-training, and those things will continue to improve. But taking advantage of "think harder" at inference time is just going to be an explosion.

The TLDR is “Making language models more expensive is worth it.” It is quite the provocation — that intelligence is worth far more and will restructure the budding AI inference market.

All of the dynamics are going to be accelerated by AI adoption and Jevon’s paradox. Jevon’s paradox is the idea that when a technology gets more efficient, i.e. “reducing the amount needed for a single application,” the total usage will actually go up. The resource in this case is GPUs for inference. AI inference, generally, and reasoning models, specifically, are in this feedback loop. Reducing the baseline cost of models will make inference-time scaling even more palatable. On the research side we are also embracing far more ambitious compute spending per problem.

These are business dynamics that don’t fit into the most common aggregation theory model. With a gradient from classic zero-marginal cost, high-margin services to high-cost, high-impact new uses of AIs, the behavior of aggregators is going to evolve.

We’ve seen examples of Aggregation Theory applied to high-cost businesses, such as Amazon’s Prime — distribution is expensive — but these don’t posit to operate on the same internet scale of digitally native businesses. All in, aggregation theory will still apply to most of the consumer and enterprise markets, but there are large areas of the market where the way this is unfolding is entirely new.

The defining factor at the top end of inference scaling is paying for the most productive output. If the costs are so great in both capital and marginal costs, the AI super-companies of the future may look more like platforms than aggregators. They would have strong network effects as the only people in town who can offer the powertool that is inference-heavy AI.

AI companies will be pushed along the barbell effect — those with consumer demand can aggregate on cheaper models, but those with specific niches will be forced to compete on performance. The ceiling on price is unknown and markets need to be built on new platforms.

High costs could support AI being less winner-takes-all than many would suspect — we could have Anthropic operate as the platform for AI agents, ChatGPT aggregating the consumer use, and another party dominating code generation. With so much investment in AI it feels more likely that companies will win their area by focusing on specific domains — they cannot afford to dilute their efforts relative to a focused peer — unless an expansive AGI really is coming, in which one company conceivably can take the entire market.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

GPT-4.5: "Not a frontier model"? Feb 28, 2025

More: https://www.interconnects.ai/p/gpt-45-not-a-frontier-model

As GPT-4.5 was being released, the first material the public got access to was OpenAI’s system card for the model that details some capability evaluations and mostly safety estimates. Before the live stream and official blog post, we knew things were going to be weird because of this line:

GPT-4.5 is not a frontier model.

The updated system card in the launch blog post does not have this. Here’s the original system card if you need a reference:

Regardless, someone at OpenAI felt the need to put that in. The peculiarity here summarizes a lot of the release. Some questions are still really not answered, like “Why did OpenAI release this?” That game theory is not in my purview.

The main contradiction to the claims that it isn’t a frontier model is that this is the biggest model the general public has ever gotten to test. Scaling to this size of model did NOT make a clear jump in capabilities we are measuring. To summarize the arc of history, the jump from GPT-3.5 to GPT-4 made the experience with the models go from okay to good. The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great.

Feeling out the differences in the latest models is so hard that many who are deeply invested and excited by AI’s progress are just as likely to lie to themselves about the model being better as they are to perceive real, substantive improvements. In this vein, I almost feel like I need to issue a mea culpa. I expected this round of scaling’s impacts to still be obvious before the brutal economic trade-offs of scaling kicked in.

While we got this model, Anthropic has also unintentionally confirmed that their next models will be trained on an approximation of “10X the compute,” via a correction on Ethan Mollick’s post on Claude 3.7.

Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger.

GPT-4.5 is a point on the graph that scaling is still coming, but trying to make sense of it in a day-by-day transition is hard. In many ways, zooming out, GPT-4.5 will be referred to in the same breath as o1, o3, and R1, where it was clear that scaling pretraining alone was not going to give us the same level of breakthroughs. Now we really know what Ilya saw.

All of this marks GPT-4.5 as an important moment in time for AI to round out other stories we’ve been seeing. GPT-4.5 likely finished training a long time ago — highlighted by how it has a date cutoff in 2023 still — and OpenAI has been using it internally to help train other models, but didn’t see much of a need to release it publicly.

What GPT-4.5 is good for

In the following, I am going to make some estimates on the parameter counts of GPT-4.5 and GPT-4o. These are not based on any leaked information and should be taken with big error bars, but they are very useful for context.

GPT-4.5 is a very big model. I’d bet it is well bigger than Grok 3. We have seen this story before. For example, GPT-4 was roughly known to be a very big mixture of experts model with over 1T parameters total and ~200B active parameters. Since then, rumors have placed the active parameters of models like GPT-4o or Gemini Pro at as low as 60B parameters. This type of reduction, along with infrastructure improvements, accounts for massive improvements in speed and price.

Estimates place GPT-4.5 as about an order of magnitude more compute than GPT-4. These are not based on any released numbers, but given a combination of a bigger dataset and parameters (5X parameters + 2X dataset size = 10X compute), the model could be in in the ballpark of 5-7T parameters total, which if it had a similar sparsity factor to GPT-4 would be ~600B active parameters.

With all of these new parameters, actually seeing performance improvements is hard. This is where things got very odd. The two “capabilities” OpenAI highlighted in the release are:

* Reduced hallucinations.

* Improved emotional intelligence.

Both of these have value but are hard to vibe test.

For example, SimpleQA is a benchmark we at Ai2 are excited to add to our post-training evaluation suite to improve world knowledge of our models. OpenAI made and released this evaluation publicly. GPT-4.5 makes huge improvements here.

In another one of OpenAI’s evaluations, PersonQA, which is questions regarding individuals, the model is also state of the art.

And finally, also GPQA, the Google-proof knowledge evaluation that reasoning models actually excel at.

At the time of release, many prominent AI figures online were touting how GPT-4.5 is much nicer to use and better at writing. These takes should be taken in the context of your own testing. It’s not that simple. GPT-4.5 is also being measured as middle of the pack in most code and technical evaluations relative to Claude 3.7, R1, and the likes.

For an example on the writing and style side, Karpathy ran some polls comparing GPT-4.5’s writing to GPT-4o-latest, and most people preferred the smaller, older model. Given what we know about post-training and the prevalence of distilling from the most powerful model you have access to, it is likely that GPT-4o-latest is distilled from this new model, previously called Orion, and its drastically smaller size gives it a night and day difference on iteration speed, allowing for better post-training.

More on the character in that GPT-4o-latest model was covered in our previous post on character training.

All of this is a big price to pay to help OpenAI reclaim their top spot on ChatBotArena — I expect GPT 4.5 to do this, but the results are not out yet.

I’ve been using GPT-4.5 in preparation for this. It took a second to get used to the slower speed, but it’s fine. I will keep using it for reliability, but it’s not worth paying more for. o1 Pro and the other paid offerings from OpenAI offer far more value than GPT-4.5.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Making sense of GPT-4.5’s ridiculous price

When the original GPT-4 first launched, it was extremely expensive. In fact, GPT-4 was comparable in price to GPT-4.5 at launch. Here’s a help post on the OpenAI forums, conveniently found by OpenAI DeepResearch with GPT-4.5, that captures the context. GPT-4 launched in March 2023.

We are excited to announce GPT-4 has a new pricing model, in which we have reduced the price of the prompt tokens.

For our models with 128k context lengths (e.g. gpt-4-turbo), the price is:

* $10.00 / 1 million prompt tokens (or $0.01 / 1K prompt tokens)

* $30.00 / 1 million sampled tokens (or $0.03 / 1K sampled tokens)

For our models with 8k context lengths (e.g. gpt-4 and gpt-4-0314), the price is:

* $30.00 / 1 million prompt token (or $0.03 / 1K prompt tokens)

* $60.00 / 1 million sampled tokens (or $0.06 / 1K sampled tokens)

For our models with 32k context lengths (e.g. gpt-4-32k and gpt-4-32k-0314), the price is:

* $60.00 / 1 million prompt tokens (or $0.06 / 1K prompt tokens)

* $120.00 / 1 million sampled tokens (or $0.12 / 1K sampled tokens)

GPT-4.5’s pricing launched at:

Input:$75.00 / 1M tokens

Cached input:$37.50 / 1M tokens

Output:$150.00 / 1M tokens

OpenAI included language in the release that they may not keep this model in the API, likely forecasting low demand, as they wanted to hear from users if it enabled entirely new use-cases.

Many analysts think that Nvidia’s next generation of GPU, Blackwell, which comes with GPUs with far more memory per FLOP (enabling storing bigger models), are not priced into this. We can expect to see the same arc of pricing with 4.5 as we did with 4 to 4 Turbo to 4o.

* GPT-4 Turbo launched in November 2023 at $10 / 1M input and $30 / 1M output.

* GPT-4o launched in May 2024 at $2.5 / 1M input and $10 / 1M output.

These are huge reductions, about 10X.

These are products that OpenAI makes a healthy margin on, and there are no signs that that isn’t the case. The AI community collectively has grown so accustomed to incredible progress in making the technology more efficient that even a blip in the process, where bigger models are available, feels potentially bubble-popping.

The future of scaling

Scaling language models is not dead. Still, reflecting on why this release felt so weird is crucial to staying sane in the arc of AI’s progress. We’ve entered the era where trade-offs among different types of scaling are real.

If forced to summarize all of this curtly, it would be: GPT-4.5 is, oddly, ahead of its time.

This means that the progression of AI needs to take a different tack, but we already knew this with the rapid progress of reasoning models. The true impact of GPT-4.5 is when it is integrated with multiple lines of rapid progress.

One of the flagship results in the DeepSeek R1 paper and related RL follow-up work in the AI community is that scaling RL training works better on bigger models. There is a lot of work to do to know all the domains that’ll be absorbed into this umbrella. Future models like o4 could be distilled from a reasoning model trained on GPT-4.5. In fact, this may already be the case. OpenAI’s current models likely would not be so good without GPT-4.5 existing.

In as soon as a year, most of the models we are working with will be GPT-4.5 scale and they will be fast. The “well-rounded” improvements they offer are going to help make many more applications more robust, but OpenAI and others in the AI labs have pushed scaling a bit further than the current serving infrastructure can support.

Frontier labs are not taking enough risk if they’re not going to try to push the limits of every direction of scaling they have. Though releasing the model isn’t needed, we have to guess why OpenAI actually wanted to do this. It’s likely that GPT-4.5 is being used in other internal systems for now and other external products soon, so releasing it is a natural step on the way to the next thing, rather than a detour.

GPT-4.5 is a frontier model, but its release is not an exciting one. AI progress isn’t free, and it takes a lot of hard work. Most people should only care when GPT-4.5 is integrated into more than just chat.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Character training: Understanding and crafting a language model's personality Feb 26, 2025

https://www.interconnects.ai/p/character-training

The vast majority of evaluations used to measure progress on post-training at frontier laboratories are internal evaluations rather than the evaluations you hear about all the time like MATH or GPQA. These, the well-known intra-industry evaluations, are certainly important for ballparking behavior, but for every public evaluation, these frontier laboratories are likely to have 10+ fine-grained internal evaluations.

The internal evaluations these model providers have cover a range of topics. Surely, most of them are basic, repetitive user behaviors that they need to make sure a new model doesn’t roll back too many of. Of these, the vast majority are likely skills, and “character” remains more of an art than a hill to climb up with careful data engineering.

Leading post-training laboratories surely know how to reinforce more robust behavior within a specific character, as seen by the march of progress on evaluations like ChatBotArena, but crafting a specific personality from scratch is an open question.

The primary goal of this post is to start the conversation outside of frontier AI labs around character training. Character training is the subset of post-training designed around crafting traits within the model in the manner of its response, rather than the content. Character training, while being important to the user experience within language model chatbots, is effectively non-existent on the web.

We don’t know the trade-offs of what character training does, we don’t know how exactly to study it, we don’t know how much it can improve user preferences on ChatBotArena, and we should.

The appearance of the AIs people are using is deeply coupled with how intelligent users will find it to be. Style of communication is crucial to how information is parsed. This is likely a very high priority to industrial labs, but something that almost no academic literature exists on. Even though I want to do research on this, I’m honestly not sure how to do so yet other than a 1 of 1 technical report on findings.

ChatGPT gets character depth

Out of nowhere on Saturday, February 15th, Sam Altman tweeted about this new GPT-4o model that will serve as the foundation of ChatGPT.

This is the biggest subjective change I’ve ever felt within intermediate model versions, from any primary provider — something more akin in vibes change to the shift from GPT-3.5 to GPT-4. The model immediately and consistently showed new behavior patterns. I found these very positive (Karpathy agrees), but they’ll take some getting used to.

Where ChatGPT used to sound robotic and shallow, it’s now very clearly leaning into a chipper assistant demeanor. Yes, for basic tasks, this new default model in ChatGPT is very Claude 3.5-like — more testing is needed to know if this GPT-4o with its peer models like o3-mini can dethrone Claude 3.7 Sonnet as a daily programming driver.

The biggest changes in the new GPT-4o model are:

* It now loves to reference past interactions in the chat (way more obvious than any other provider has been) — it was trying to flex that it knows my dog breed, mini schnauzer, or my book topic, RLHF. This is very in line with the new roadmap to GPT-4.5 and GPT-5 that Altman posted, where ChatGPT is designed around a fluid experience rather than standalone, siloed, powerful models.

* The model is very chipper, sprinkles in more emojis, and is almost funny.

* The multi-turn conversation is more dynamic, with follow-up questions and added texture to longer back and forths.

The reasons are at a high level very complementary to those I listed when I switched to Claude as my daily driver model.

The shocking part of this is that the impact of this sweeping change is almost entirely undocumented. Yes, OpenAI updated the Model Spec (my previous coverage here and here), but that doesn’t really capture how this model is different — it just clarifies the direction OpenAI is optimizing for. There are a few overlapping interpretations of this lack of transparency:

* OpenAI cannot precisely measure the differences as a few specific behavior traits, so they can only see the model performs better in high-level testing like ChatBotArena or other A/B testing, but they cannot capture the changes in score deltas between a few evaluations they could release.

* AI is moving so fast that taking the time to document these models is not worth it,

* Detailing the changes will make the character too easy to reproduce and will be another path of “distillation” of OpenAI’s models.

The community of model users is extremely far from having clear ways to measure these differences. While there are vibe tests on Twitter, they will not be conclusive. ChatBotArena won’t even come close to measuring the levels of these differences (and in the case of referencing past chats, it cannot). Character training is the sort of addition to a post-training stack that takes industrial training techniques from being reproducible, but expensive, to dark arts that are largely undocumented.

The most interesting part of the model spec for industry analysts is this plot where OpenAI shares the agreement rate of their newer models. This is comparing a reasoning model, o1, to a GPT-4o model, so there are questions of whether this is attributable to reasoning training.

Every frontier AI laboratory should have a model spec

Model Specs are the sort of community norm that a race to the top is the goal. They’re muddled if mandated — how would you actually check that a required model spec is accurate? — but if they are implemented by every lab carefully with feedback from the community, it would be far easier for the development ecosystem to exist around models.

The model spec is an extremely useful document detailing how developers can expect your models to change over time. They are also one of the few sources of insight we have into what the model providers are trying to get their models to do (which has regulatory advantages) and let us know what is an intentional or unintentional behavior mode.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

A model spec doesn’t provide all the information we need to keep up with model versions. This new version of ChatGPT desperately needs to be accompanied by evaluations capturing the behavior change, otherwise, a lot of undocumented differences will be passed on to developers updating endpoints to it. This is another rendition of the same lack of transparency we’re used to from leading AI laboratories.

The closest thing Anthropic has to a model spec is the mix of Claude’s Constitution and this blog post on Claude’s Character. Character training is a fairly new technique for the industry. From Anthropic’s post:

Claude 3 was the first model where we added "character training" to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

The process is extremely synthetic data-heavy, but requires an artist’s touch, as stated later in the blog post: It “[relies] on human researchers closely checking how each trait changes the model’s behavior.”

Character training being the focus of developments is the strongest endorsement that RLHF and related approaches have shifted from their philosophical motivations of alignment to being primarily an empirical tool. The models can capture so many different behaviors, but getting them to reliably behave how we want is the hardest part. Right now, it seems more likely that this is about capturing the upside of RLHF as a performance tool, rather than a safety one.

One of the few public discussions of character training came from Amanda Askell during her appearance on the Lex Fridman Podcast (taken from the transcript):

Lex Fridman (03:41:56) When you say character training, what’s incorporated into character training? Is that RLHF or what are we talking about?

Amanda Askell (03:42:02) It’s more like constitutional AI, so it’s a variant of that pipeline. I worked through constructing character traits that the model should have. They can be shorter traits or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses and then it ranks the responses based on the character traits. In that way, after the generation of the queries, it’s very much similar to constitutional AI, it has some differences. I quite like it, because it’s like Claude’s training in its own character, because it doesn’t have any… It’s like constitutional AI, but it’s without any human data.

In summary, Anthropic uses the same techniques they use for Constitutional AI and general post-training for capabilities to train these models’ characters. This is not surprising. This could be related to Askell’s other Tweet on how she designs system prompts, as system prompts are the easiest way to quickly change a model’s character:

The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.

This is very in line with what we started this post on — internal AI lab evaluations.

How far can you push character training?

Ahead of the Grok 3 release, Elon Musk Tweeted this example from Grok 3, saying it was “based.”

One of the predominant reactions to Grok 3 was, “Wait, so it isn’t actually based?” This is one of the big questions of character training and lacking model specs. Did xAI not figure out how to make their model-based and reliable? What model was Elon using here?

Whatever your politics, it’s likely that the default personality of models that you encounter will eventually not be something you like. There’s quite a lot of nuance in what the perfect chatbot is for each user.

Companies should be allowed to have a default personality for the models of their choosing, but a far better long-term equilibrium is to make the expectation that model providers make it easy to get exactly the personality you like out of a model. This isn’t regulation I’m recommending right now, but one way to make sure that an all-powerful AI model isn’t going to reinforce one point of view is to have tests that models need to pass on the breadth of their character and views.

Model specs are a step in the right direction to avoid drama about “what did they actually want their model to say,” but we still have a lot of work to do on creating a spectrum of tools that captures all the relevant information when comparing models.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Claude 3.7 thonks and what's next for inference-time scaling Feb 24, 2025

On Monday, February 24th, 2025, Anthropic announced their latest model, Claude 3.7 Sonnet, which is their first model explicitly trained to use more inference time tokens to improve performance. This is another reinforcement learning (RL) trained model (mentioned in system card). With this model, they also released Claude Code as a limited research preview, which is a “command line tool for agentic coding.” Continuous improvements in models are enabling new modalities and domains addressable by the models, but assessing the impact of each new domain takes far more time than a quick model reaction post.

This is a tidy release, a solid improvement, but not a step change for Claude or the industry. Expect a lot of small changes to accumulate massively this year.

Claude 3.7 Sonnet is a clear improvement over Claude 3.5 Sonnet (New) and continues to push the limits in areas where users love Claude (e.g. read Ethan Mollick’s review here). The scores for those areas such as software development (SWE Bench) and tool use, are clearly state-of-the-art.

For example, Claude 3.7 Sonnet is the highest scoring “standard non-reasoning” language model on the Aider Polyglot benchmark. While models like o3 and Grok 3 DeepThink highlight superhuman performance on code benchmarks, this sort of behavior being integrated without extra inference time compute is wonderful. The price for superhuman coding AI is plummeting.

Even with o1 Pro, I still find myself using Claude 3.5 (New) on a regular basis. O1 Pro is the best model for doing succinct, one-off tasks like writing short scripts. It is extremely controllable and will often work out of the box. Though, when I’m doing tricky, iterative tasks I still use Claude. Claude 3.7 Sonnet only makes these workflows stronger and I’m stoked to play with it further.

The most useful piece of this release for those trying to understand the direction of the ecosystem, rather than just the status of today, is Anthropic’s post on Claude’s extending thinkingwhere they detail the product trade-offs, alignment, and future of inference time compute in their models.

Anthropic’s offering of extending thinking to boost inference-time performance is far, far cleaner than that of OpenAI’s current model drop down disaster. Anthropic’s thinking model is the same as their general purpose model, much like xAI’s Grok 3, and what OpenAI teased will be the plan for GPT-5. Having just one model makes lots of infrastructure, product, and training decisions cleaner, but may come at the cost of the absolute Pareto front of performance for your organization shrinking.

The reasoning training being embedded in one model with a standard inference mode will make the reasoning benefits and behavior feel closer to something like Gemini-Thinking, rather than OpenAI’s o1 or DeepSeek R1 that are designed solely for this reasoning mode of operation. It doesn’t mean that in the limit that a single model will be weaker in performance, but rather that currently training them may be slower to iterate on than a “full” reasoning language model.

Focusing on deploying just one model that serves all the users is one of many examples where leading AI companies are needing to make their offerings legible to users and easy to use — a sign of the industry maturing from a race to intelligence to a race to usefulness.

Still, Claude’s interface is not perfect by any means, the user still has to intentionally go to a drop down menu to get performance when they need it. The best mode is that the model knows when inference compute is needed on its own. My hypothesis is that when training one model with reasoning and without, having the model figure out how much compute to use is harder than a reasoning-only model like o1 figuring out its own compute budget. Or, Anthropic needed to keep a special flag that is turned on and off in the system prompt. This is a subtle potential trade-off of putting reasoning in just one model, but we’ll see where the final equilibrium is.

On the other hand, Claude 3.7 Sonnet is showing the reasoning traces directly to users like DeepSeek R1 and Grok 3. These organizations have different ways of saying why, but it is clear that users just enjoy seeing it and it builds trust. Anthropic, understandably is using the reasoning traces to monitor the alignment of the models.

The reasoning chains in these models are how the general public is learning more about the internal representations of language models. Another interesting detail is that “didn’t perform our standard character training on the model’s thought process.” This is how Claude thinks out of the box and the actual answers have a different flavor to them. More research will study how far the reasoning chains can diverge from the answer language. We’ve seen research on latent reasoning within the model, but beyond this, we could have reasoning languages that are entirely ungrounded from human languages because they are a more token-efficient representation of information for the model. More on this soon.

The developer facing version of this Claude Extending Thinking is far cleaner and a sign of things to come — developers can request a specific amount of thinking tokens in their response.

How this works is that the model will stream thinking tokens until the number is reached, then shift to answer tokens. This is still one autoregressive stream and no search is being used in the new products, yet.

This explicit control over the thinking and answering phases is a growing behavioral focus in training reasoning models — expect more here soon. Developers can tune the setting that works for them and keep it baked in, rather than relying on the user to pass in a query that just happens to get the model to think for a long time. Explicit test-time inference budget increases are much more covetable than needing to hit the gold mine in a prompt search.

The best place to see where this could be applied is by selecting performance on a task that scales nicely with inference time compute. Anthropic ran a similar experiment on the challenging math evaluation AIME — the same one that OpenAI used in their original inference time compute plot.

Here there’s a subtle difference from the developer experience, where in Anthropic’s internal tests the model could exit early. In practice, this subtle difference shouldn’t shift the usefulness of the deployment methodology.

Anthropic continues in their excellent post, saying:

Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one without knowing the true answer ahead of time. One way to do this is with majority or consensus voting; selecting the answer that appears most commonly as the 'best' one. Another is using another language model (like a second copy of Claude) asked to check its work or a learned scoring function and pick what it thinks is best. Strategies like this (along with similar work) have been reported in the evaluation results of several other AI models).

To accompany this, they shared the following results.

It is crucial to note here that the dashed red line — pass@N — is not an actual evaluation result, but measuring if the correct solution appears in the number of answers generated on the X-axis. The two lines below show how good initial answer extraction methods are at selecting the right answer from the N candidates. As has been known for a long-time in inference-time scaling research is that the models can often generate the correct answer to extremely hard questions, but not reliably.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

They make it very clear this is not used yet in their products:

Parallel test-time compute scaling isn’t available in our newly-deployed model, but we're continuing to research these methods for the future.

Still, this is a direction other labs are already pursuing. The best reporting on o1 Pro indicates that it does a “search” of some sort over parallel generations. Other OpenAI employees have stated that o3 uses a learned verifier to extract answers, at least for coding domains. As progress in scaling single-streams from the language model slows, this is the natural next place for scaling to turn to. As it has been for some time, performance limits are largely different forms of infrastructure problems before models can be served to users.

Claude is here, and it reinforces that RL training is a short path to inference time scaling laws being used, but in the long-term we will have more methods for eliciting the inference-time tradeoffs we need for best performance.

Thanks to Ross Taylor for some immediate feedback on this post.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Grok 3 and an accelerating AI roadmap Feb 18, 2025

Full post: https://www.interconnects.ai/p/grok-3-and-an-accelerating-ai-roadmap

xAI launched their latest flagship model, Grok 3, last night via a live stream on X, which is a new take on the launch process, but it largely felt familiar. Grok 3 is a state-of-the-art model on some important benchmarks. The core is that it is state-of-the-art relative to available models and we know better models are out there. Only some of them have been announced, some of them have been teased, and others lie in waiting.

What feels different is how the broader AI industry is signaling rapid progress coming soon. xAI said on the livestream that they will be updating the model “daily.” An era of sitting on unreleased models could be ending.

Grok 3’s release is a reinforcement of trends people began reckoning with as of the release of DeepSeek V3 + R1 — AI progress is not held in the hands of a few companies nor is it slowing down. 2023 and 2024 were defined by truly state-of-the-art AI being concentrated within OpenAI, Anthropic, and Google, where these companies could take a lot of time to package models from training to release and still have a substantial moat on capabilities relative to their peers.

At the time of R1’s launch, the “people’s choice” model was Claude 3.5 Sonnet, a model that had been trained “9-12 months ago” and the best models like Claude 3.5 Opus or GPT-4.5 (a.k.a Orion) were not available to users for a grab bag of reasons.

Competitive pressure from DeepSeek and Grok integrated into a shifting political environment for AI — both domestic and international — will make the established leading labs ship sooner. A large portion of delays in delivering models is for “safety testing,” but we don’t have exact details on how much of it was that and how much was cost-benefit tradeoffs (and other big company hurdles such as legal departments). The brand, and culture, of “having the smartest model” is extremely important to these companies, but having a way smarter model was often financially too much to bear.

“Safety” is actively being removed from the spotlight of the AI discourse. It is possible that this overcorrection causes meaningful harm, as this is an extremely powerful and rapidly evolving technology, but the political capital to make safety a core tenet of the AI industry was spent too early relative to meaningful harm emerging.

Increased competition and decreased regulation make it likely that we, the users, will be given far more powerful AI on far faster timelines.

We’ve seen time and time again the value of having the best model first. The only way to onboard new users is to have some capability or behavior that your model differentiates on. With the pace of progress high, minimizing the time from training to release is the best way to maximize one’s chance of impact.

DeepSeek and xAI show how organizations with slightly trailing technical progress or resources can outshine the likes of OpenAI and Anthropic who have voluntarily not shipped their latest models.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Grok 3 by the numbers

Benchmarks and vibe tests mark Grok 3 as one of the best models available today. As with any release, companies often choose evaluations that flatter their models. Yes, winning on these evaluations is extremely challenging, and much credit must be given to xAI for delivering a leading-edge model just about 19 months after its incorporation.

That being said, what is shown below is a total of 4 language model evaluations. Given that models like DeepSeek R1 or Gemini Thinking launch with 10-20 evaluations detailing their performance relative to peers, this has to be taken with a grain of salt. It is very likely that Grok 3 doesn’t outperform its peers in every category, but there is a slim chance these other comparison evals just weren’t run in the optimization for expedience.

To start, we can compare Grok 3 benchmarks versus available instruct models.

And versus available reasoning models (note how OpenAI’s announced o3 scores exceed these clearly).

An important detail, as we’ve seen with OpenAI’s reasoning model releases is, what do the shaded regions on the above plots show? Without exact details, we don’t know the inference cost for each of the models on these reasoning plots. Pushing the frontier in absolute terms is important, but the field overall is getting messier before it’ll get clearer.

Regardless, in the above two plots Grok 3 is pushing progress both on standard model training and the new reasoning training. While reasoning training and RL are the hot new things in the AI field, simple scaling and optimization of existing techniques still deliver value.

And Grok’s score on ChatBotArena.

A model launching at top of every category on ChatBotArena feels like something that should be rare (given it now encompasses many categories like Math, Coding, Style Control, Longer Queries, etc.), but it happened just a few weeks ago with Gemini 2.0 Pro!

ChatBotArena is known to favor models that are likely to not refuse requests (we don’t know by how much), as evidenced by Claude 3.5 Sonnet (New)’s relatively low position on the leaderboard relative to its utility, but overall is a hard evaluation to top. xAI’s stated goals of a “based” model should correlate well here.

A question we don't know the answer to: How many points of performance on evals do you gain by not caring about safety at all? Internal to the model, i.e. in behavior latent spaces, safety is pretty orthogonal to common high-utility behaviors like reasoning and code, and bigger models tend to do more things without a cost to other behaviors, but there has to be a safety performance margin. Did Grok 3 succeed because of this? It’s too early to tell.

At a technical level, Grok 3 is certainly a very big model. We don’t have specifics, but it’s reasonably safe to take a datapoint for scaling still helps for performance (but maybe not on costs). xAI’s approach and messaging has been to get the biggest cluster online as soon as possible. The Occam’s Razor explanation until we have more details is that scaling helped, but it is possible that most of Grok’s performance comes from techniques other than naive scaling.

Grok 3’s size to beat existing models feels like when Nemotron 340B beat Llama 3 70B, making it the leading open-weight model at the time, but uptake was slow because the cost relative to the performance gains wasn’t worth it to adopt. We’ll know more about this when Grok 3 is available in their API and we see the exact costs.

When models are approximately equal in performance, price and ease of use are the determining factors of adoption.

Overall, Grok 3 is a huge technical achievement but not one that indicates a substantial change in who is at the frontier of effective training. xAI is obviously closing in on OpenAI, Anthropic, and most of all Google, but all available data points put these labs ahead of xAI on the frontier of efficient model training. It is good that they are being pressured to deliver more absolute intelligence and not to just continue optimizing their frontier of performance per dollar.

Read some other reviews of Grok 3 here and here. Karpathy’s summary is particularly in line with my thinking (while potentially slightly overselling capabilities).

As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

Where progress is heading

If these AI models, and the industry writ large, are accelerating, it is important to wonder where they are accelerating toward. Most of the evals we use now to launch leading models are not that representative, in many cases they’re actually 100% out of distribution to normal life. What is the value in solving a competition math problem like AIME or so-called “Google Proof” questions? Time will tell, but the case for usefulness to average users is definitely stretched.

Small ChatBotArena improvements are marginal gains in robustness, where something like a 20-point difference in Elo rankings — the relative difference between Grok 3 and the next top model — translates to the model winning something like 51% of head-to-head match-ups. This robustness adds up over time, but it is far from meaning that this model is more intelligent in an absolute sense.

In fact, in the case of some of the latest evaluations from the research community, it seems like evaluations are being designed more around being hard than being useful. It is a natural response to models being super powerful to try and find something to challenge them with, but it makes tracking progress and communication far harder.

Companies have many internal evaluations that are not shared. Increasing transparency on these would help contextualize what is and is not meaningful progress. Without these, the only benchmark we have for model changes is them becoming more deeply integrated into products. Product-model synergy can enable extremely useful, new workflows, but it makes tracking the progress of AI a proxy measurement.

I do personally believe these somewhat arbitrary capabilities we are marching toward will generalize to extended and amplified value, but it takes some "feeling the AGI" to see that these models that are better on esoteric benchmarks generalize to every day use. So far they have. Bigger and generally “better” models have been more robust and easier to find valuable veins in, but we as an industry should be sharing more so that it is not just AI insiders who understand how to track progress.

When 2024 was reflected on with meager advancements, evidence is that there was substantial progress but less of it was delivered to users. We only got o1 late in the year and other models were deemed "too big to ship" or the requisite urgency (DeepSeek) did not exist.

2025 will be a year of intelligence being put in the user’s hands. The pace of underlying progress with that will continue to be high. The so-called “walls” facing AI progress haven’t materialized, but making sense of the progress we are getting is much more nuanced.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

An unexpected RL Renaissance Feb 13, 2025

The era we are living through in language modeling research is one characterized by complete faith that reasoning and new reinforcement learning (RL) training methods will work. This is well-founded. A day | cannot | go | by | without | a new | reasoning model, RL training result, or dataset distilled from DeepSeek R1.

The difference, compared to the last time RL was at the forefront of the AI world with the fact that reinforcement learning from human feedback (RLHF) was needed to create ChatGPT, is that we have way better infrastructure than our first time through this. People are already successfully using TRL, OpenRLHF, veRL, and of course, Open Instruct (our tools for Tülu 3/OLMo) to train models like this.

When models such as Alpaca, Vicuña, Dolly, etc. were coming out they were all built on basic instruction tuning. Even though RLHF was the motivation of these experiments, tooling, and lack of datasets made complete and substantive replications rare. On top of that, every organization was trying to recalibrate its AI strategy for the second time in 6 months. The reaction and excitement of Stable Diffusion was all but overwritten by ChatGPT.

This time is different. With reasoning models, everyone already has raised money for their AI companies, open-source tooling for RLHF exists and is stable, and everyone is already feeling the AGI.

Aside: For a history of what happened in the Alpaca era of open instruct models, watch my recap lecture here — it’s one of my favorite talks in the last few years.

The goal of this talk is to try and make sense of the story that is unfolding today:

* Given it is becoming obvious that RL with verifiable rewards works on old models — why did the AI community sleep on the potential of these reasoning models?

* How to contextualize the development of RLHF techniques with the new types of RL training?

* What is the future of post-training? How far can we scale RL?

* How does today’s RL compare to historical successes of Deep RL?

And other topics. This is a longer-form recording of a talk I gave this week at a local Seattle research meetup (slides are here). I’ll get back to covering the technical details soon!

Some of the key points I arrived on:

* RLHF was necessary, but not sufficient for ChatGPT. RL training like for reasoning could become the primary driving force of future LM developments. There’s a path for “post-training” to just be called “training” in the future.

* While this will feel like the Alpaca moment from 2 years ago, it will produce much deeper results and impact.

* Self-play, inference-time compute, and other popular terms related to this movement are more “side quests” than core to the RL developments. They’re both either inspirations or side-effects of good RL.

* There is just so much low-hanging fruit for improving models with RL. It’s wonderfully exciting.

For the rest, you’ll have to watch the talk. Soon, I’ll cover more of the low level technical developments we are seeing in this space.

00:00 The ingredients of an RL paradigm shift16:04 RL with verifiable rewards27:38 What DeepSeek R1 taught us29:30 RL as the focus of language modeling

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Deep Research, information vs. insight, and the nature of science Feb 12, 2025

Article: https://www.interconnects.ai/p/deep-research-information-vs-insight-in-science

(sorry about some more audible breaths in this -- I'm going to work on it!)

We at Ai2 released a local LM iPhone app for our OLMoE model (1B active, 7B total params), with greatly improved scores! Let us know what you think, or read more here.

OpenAI’s Deep Research has largely been accepted as a super valuable tool for knowledge workers and analysts across the economy, but its real engine of economic progress is going to be changing the nature of scientific progress. Science is the fuel of technological revolutions.

Deep Research in its current form feels like a beta version of a next-generation piece of technology. It does what it is tasked with — searches the web and processes many resources to create a useful report with referenced sources. Some of my uses include researching model evaluations, recent robotic learning research, and AI for science breakthroughs.

Deep Research’s limitations mostly feel like problems of search, where it is prone to returning SEO optimized slop, style, where it returns verbose, low information density writing, and modality, where it does not have the ability to read, process, and return plots and diagrams. All of these are surely solvable and expected features if we look at the rollouts of other AI models in the last few years.

This isn’t a product review (you can read Stratechery or Turing Post for more of that) — as the answer is quite simple, if you work in a knowledge intensive vocation you should be using this — but rather asking: So what comes next?

The place to start from within AI circles is to revisit the question of “When will AI make novel discoveries?” A good example of this is in the Dwarkesh Podcast episode with Dario Amodei:

An example experiment we could do to test this is to train models on time-gated information and see if it can repeat a scientific discovery we already made (yes, this would be difficult to run, but not impossible). Ross Taylor described this on his Interconnects Interview:

So an experiment I've never done because I didn't have [the] compute would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?

The dream is for AI to make breakthroughs, and the absence of evidence for this even after the release of Deep Research is driving a reckoning over what language models will ever be able to do. The fork in the road is either believing that scaling (either in parameters or in new training methods) will unlock “insights” or accepting that the current generation of models are very useful tools and nothing more supernatural. Likely the most powerful tool humanity has made yet. Our first power tool for information.

Much of science is not about making novel insights but about making progress within established problems of the field. In AI, these are the countless benchmarks we are saturating. A very valuable contribution in AI as a field can be re-using known resources in a simpler way.

With AI, we are going to learn the boundary between true insight and scientific progress. A related form of scientific progress is the compression of noisy ideas and experiments into a cohesive trend. Something that Deep Research can likely do, but not something that builds the allure of Einstein and the other scientific greats.

To understand this relationship between Deep Research, AI broadly, and the nature of science, we must address:

* How to interpret existing “AI for Science” projects like AlphaFold in the bigger context of science,

* How reasoning models, AI research systems like Deep Research, and other forthcoming AIs revolutionize existing scientific practices,

* How recent developments in AI challenge Kuhn’s formulation of scientific revolutions, and

* How current institutions will need to change forever in the face of AI?

This (hopefully) series of posts is my attempt to create a worldview around what science means in the face of AI. Today, we focus on the first two — major AI for science projects and how normal science is being accelerated by AI — and hopefully raise urgency within the community to consider the final question.

The starting point — grand AI for science projects

There is a substantial overhang in computational infrastructure and fundamental deep learning capabilities relative to their impact on the broad class of sciences. In order to make a substantial leap in the application of AI to a specific domain, a team must mold the existing underlying capability of AI to the needs of trained groups of scientists.

The list of examples people think of in this mold ranges across domains: AlphaFold for protein folding, AlphaGeometry for mathematics, GraphCast and GenCast for weather, and more that lack such prominent branding. They leverage advancements in deep learning and transformer architectures, but tend to have X-factors specific to the domain of interest (see a Deep Research query summarizing this). Such added features are pulling forward AI capabilities to suit a narrow domain.

There’s a substantial craft to selecting suitable problems for applying this grand AI for science approach. It requires a field with central elements that are quantitatively focused. Even with this, outcomes are more uncertain than standard AI research or standard research in the domain of choice.

The essay A new golden age of discoveryfrom AI Policy Perspectives details how DeepMind sees the opportunity here and showcases some internal ingredients they found that make these projects more likely to be successful.

The fact that any of these projects have succeeded shows the overall potential of AI for science. The overall necessity of the approach depends on whether the grand AI for science breakthroughs are pulling forward progress by months or years, or if these models are the single required breakthrough to approach entirely new areas of study.

As the broader scientific community embraces AI as “something that works” more of these step changes will happen. They take a very large density of compute and talent on a single problem.

These projects fit more naturally into a classical view of science. They take substantial resources and are high risk. Meanwhile, the mass market AI tools that everyone is adopting will dramatically shift the practice of doing science.

Towards instantaneous Ph.D.’s

We have two tools that dramatically shift the nature of scientific exploration. They will only get better.

* AI models that excel at code, mathematics, and reasoning: OpenAI’s o3, DeepSeek R1, Gemini Deep Thinking, etc.

* AI systems to rapidly parse and summarize existing literature: OpenAI’s Deep Research, Gemini Deep Research, Ai2’s Scholar QA (specific to academic papers), and many more that will come soon.

These tools are dramatically accelerating the most time-consuming aspects of research, particularly in computationally intensive fields. In a few years, the only gating factor on the impact of a scientist will reduce to their access to cutting edge tools, understanding the gaps in AI, and asking the right questions. The final point is well established as a trait of the most successful scientists, is what goes hand in hand with the idea of “insight,” and where the differentiation among scientists will only increase.

Computational super-scientists

All scientific fields that rely heavily on computational infrastructure as a bottleneck for progress are going to experience a dramatic acceleration in the near future. In AI and closely related computer science fields this is evident from the abundance of soon-to-be superhuman coding assistants and an exponential (short-term) increase in compute available.

Most AI research is severely bottlenecked by the compute available, the time to implement the intervention, and the implicit efficiency of the idea-implementation interface. Future siblings of OpenAI’s o1 models are going to be used extensively to streamline this. This worldview barely accounts for the ability of these reasoning models to decide on which problem to solve and to interpret the results. These sorts of research assistants running in the cluster are a central component of the vision of Anthropic CEO Dario Amodei’s view in Machines of Loving Grace, and it is one that requires far less optimism in magical breakthroughs than the grand AI for science projects.

Reasoning language models (RLMs) have in their first year of existence shown major progress on all of the evaluations the AI field put forward as fundamental challenges for the field. Accumulating iterations of this should transfer to scientific decision-making, but we don’t exactly know how.

The fundamental unit of progress in science, which can be viewed as one Ph.D.’s worth of progress (same goes for one paper), is reducing so quickly to redefine many methods of experimentation and deciding on what is or is not possible. Multiple efforts are already documenting how RLMs can be used to find errors in existing literature — a process that will likely be automated in the next few years. Rather than science proceeding with a high velocity, it feels as if science is proceeding with a high acceleration.

The pace of progress necessitates a reinvention of most of our scientific institutions. What happens when the time it takes to create a Ph.D.’s worth of knowledge is substantially smaller than the amount of time it takes to get peer review feedback from the fields’ journals?

Changing the structure of scientific revolutions

Most of the ideas behind this piece originated over the holidays when I was trying to understand how science, as a practice and institution, would be changed by AI. The announcement of OpenAI’s Deep Research only solidified the view that the current wave of AI tools are not one that will ever fundamentally be insight-driven. They are extremely powerful and efficient computing engines, not insight engines.

A pseudonymous account on Twitter, Michael, summarized the debate we started with, on if LMs can make insights, perfectly:

To an LLM, a novel discovery is indistinguishable from an error.

This is a limitation that letting language models act in a wet lab does not solve. Insight is far separate from information, but both are accepted as being crucial to scientific progress. Information is the engine for insight.

Still, Deep Research and the AI models we have access to are a substantial accelerant of science. Modern scientists have always had to wear many hats, advising, reading, implementing, thinking, communicating, etc. The list getting shorter, so long as these new technologies are broadly available, is a wonderful boon for normalizing access to scientific careers. This should be a wonderful bloom of scientific progress (at least, production of results), but we need to do substantial work to prepare our already strained institutions.

For my readers in academic or science driven circles, the comment section of this post is open to free subscribers, please share the biggest opportunities and pinch-points in the current scientific ecosystem as we embrace these AI tools. We need to be more ambitious and level-headed than complaints about AI written peer reviews. In the long-term AI peer review will be needed to review all of the progress. We need incentives for managing it, resources to share across a broad community, and much, much more.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

More on Andrew: https://x.com/iamtrask

Transcript and more information: https://www.interconnects.ai/p/interviewing-andrew-trask

Interconnects (https://www.interconnects.ai/)...

... on YouTube: https://www.youtube.com/@interconnects

... on Twitter: https://x.com/interconnectsai

... on Linkedin: https://www.linkedin.com/company/interconnects-ai

... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv

We Mention

* Claude 3.5 launch and “pre release testing with UK AISI” (and the US AI Safety Institute)

* OpenMined and PySyft

* CSET (Center for Security and Emerging Technology)

* NAIRR

* The “open data wall”

* Apple’s Secure Enclaves, Nvidia Secure Enclave

* Data-store language models literature

* RETRO: Retrieval-Enhanced Transformer from DeepMind (2021)

* SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore (2023)

* Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (2024)

Chapters

[00:00:00] Introduction

[00:03:12] Secure enclaves and pre-release testing with Anthropic and UK Safety Institute

[00:16:31] Discussion on public AI and government involvement

[00:20:55] Data store language models and better approaches to “open training data”

[00:42:18] History and development of OpenMined

[00:48:57] Use of language models on air-gapped networks

[00:52:10] Near future of secure enclave technology and industry adoption

[00:58:01] Conclusions and future trajectory of AI development

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

How scaling changes model behavior Oct 09, 2024

How scaling changes model behavior

Some trends are reasonable to extrapolate, some are not. Even for the trends we are succeeding at extrapolating, it is not clear how that signal translates into different AI behaviors.

Read it here: https://www.interconnects.ai/p/how-scaling-changes-model-behavior

[00:00] How scaling changes model behavior

[05:03] Metaphors for what scaling may solve

[08:45] Short-term scaling is already de-risked

Fig. 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/manual/openai-predictions.webp

Fig. 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/manual/scaling-laws.webp

Fig. 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/manual/situational-awareness.webp

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

[Article Voiceover] AI Safety's Crux: Culture vs. Capitalism Oct 02, 2024

SB1047's veto, OpenAI's turnover, and a constant treadmill pushing AI startups to be all too similar to big technology name brands.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-safety-culture-vs-capitalism

00:00 AI Safety's Crux: Culture v Capitalism
06:03 SB1047 as a regulatory litmus test for AI safety
08:36 Capitalism at the helm

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Riley Goodside on the science of prompting Sep 30, 2024

Riley Goodside is a staff prompting engineer at Scale AI. Previously working in data science, he is often seen as the default for the new role of a “prompt engineer.” He regularly posts incisive prompts that illicit notable behavior from the most popular AI models.

I really resonated with this saying from Anthropic’s recent podcast on prompt engineering — “now we write essays and treat them as code.” In order to be good at prompting, you need to understand that natural language operates as our code used to.

This episode is a masterclass on why you should care about prompting and how it impacts results. Of course, there’s a bunch of great discussion on recent models that reflect the need for different and or better prompting. Enjoy it!

Listen on Apple Podcasts, Spotify, and where ever you get your podcasts. For other Interconnects interviews, go here.

We mention:

* Prompting to push the frontier of AI models,

* Post-training and prompting interaction,

* Prompting base models,

* o1, Reflection 70B, reasoning,

* Scale’s leaderboard, evaluation tricks, evaluation needs,

* PlanSearch paper

* Julius AI

* “The hottest programming language is english”

* “Think silently” instructions

* Scale Leaderboard and Humanity’s Last Exam

* ChatML formatting

Chapters

* [00:00:09] Introduction

* [00:02:40] Riley's path to LLMs

* [00:07:54] Impact of ChatGPT on prompt engineering

* [00:12:03] OpenAI's o1

* [00:18:21] Autoregressive inference and prompting sensitivities

* [00:24:48] Reflection 70B model and its implications

* [00:28:00] Impact of prompting on evaluation

* [00:32:43] Prompting vs. Google search

* [00:46:55] Prompting and RLHF/post-training

* [00:56:57] Prompting of AI agents

* [01:01:20] Importance of hands-on experience with language models

* [01:05:00] Importance and challenges of AI model evaluation

Transcript

Built with smol-podcaster.

Nathan L. [00:01:08]: Hey, Riley, welcome to the show.

Riley G. Hey, Nathan, great to be here.

Nathan L. [00:01:14]: Yeah, so for the audience here, I mostly wanted to try to, as I work on post-training a lot and I see my own difficulty in taking prompting seriously and the things that I don't think that we are doing enough, and I don't see any reason why it can't be scientific in how we do prompting. So that's my biggest goal with this. I think there's a lot of podcasts where we could kind of say, like, what is the history of prompting? Where is it going? And that's easy to kind of redo. And I still find it interesting, but I just don't think there's enough people talking about the role of prompting in evaluation, how prompting changes with how your post-training models, because we're trying to take that seriously and how we have a post-training setup, but we just like regularly run into these things like system prompts aren't handled well, how to release a model of a system prompt. So that's the tone that I'm trying to get to when I ask these questions. And also OpenAI's 01 model just came out, so I'm definitely going to get onto that pretty quickly because that's what everyone's excited about. I like to start with background just to kind of get to know people, because a lot of this is just, I want to talk to interesting people in AI, is like, how did you become interested in prompting? I think I've seen your background in data science and then your joint scale around when Chad2BT came out, which is fun timing, but like, how did you become maybe obsessed with this, but like the focal point of your work?

Riley G. [00:02:40]: Yeah, I have sort of an unusual introduction to large language models. For most of my career, I've been a data scientist, mostly in the on-mandating industry. I was at OkCupid and Grindr. And after I left Grindr, I took sort of a sabbatical to educate myself, I guess, about the progress in large language models. It was around the time that GPT-3 codecs had just come out. And that was where I think I started to become really interested because I was following along with maybe, certainly when GPT-2 came out, the examples there wowed me as much as they wowed the rest of the world, I think, with the example of the news article about the unicorn and all that. And not long after that, we had AI Dungeon, and I played around with AI Dungeon a bit. But at that point, language models seemed to be mostly about language, that they were sort of very heavily focused on stylistic mimicry and creative writing and so on. And when Codex came out, it really started this thought of that text is a more universal interface than we were giving you credit for, that language models might be more broadly useful. And I just became very excited in a practical sense of what these models could do for what I kind of intuited was very boilerplate-like data science code, that I thought of like most of the Python and Julia and R and things that I've written over my career, this seemed like stuff that an LLM could handle. And that was sort of one of its early strong points. So I was playing around with, I think one of my first projects was a VS Code extension that had some kind of integration with Codex. But I never really shipped anything out of it. And mostly what it transitioned into pretty quickly was playing around with posting prompting examples on Twitter, because when I looked out online to find what were people saying about how to prompt these models, there really wasn't much out there. And so I had to kind of resort to just like the few examples that had been circulating in viral screenshots of humorous completions and so on, of like the results that people got out of it. And I started posting those examples. I started following academics and low-level engineers at the research labs and anyone that was working in shipping language models I thought were interesting. And elbowed my way in.

Nathan L. [00:05:18]: I have more questions on this, because I find it like, some people find, there's this whole like Twitter dynamic of like, you find so much signal there, but the question is like, how much does it generalize? Because there's so many of the lessons you can learn from these models, from these examples. I think the straw, like the number of R's in strawberry things is the current one. And then, and it's like, do you get a sense that these are transient or are these kind of repeated themes? And like, how should you read these examples to try to extract themes from them? If like, I've followed you for a while, and a lot of people do, and you're more insightful in how you post them. If you post these threads with like multiple tries and stuff like this, like, should people be doing that when they see something pop up?

Riley G. [00:06:03]: I think so. I also would say that Twitter is a very different river to step into now than it was back then. At the point that I started doing this, like, nobody was really talking about these things that much, or to the extent they were, it was sort of fleeting. It was like, wow, look at this, and then they on to the next thing. And I think the thing that's very different now is just that because there are so many new entrants in AI and LLM, there's a lot of rehashing of the basics. And I think a lot of people in the industry would tell you that the popular examples that you see around of like, how many R's are in strawberry, or some of the ones that I'm partially responsible for, popularizing at least. I think like, these things are really just like, rookie mistakes in some sense, right? That these are things that we've long known language models can't do. And it just keeps popping up as a surprising quirk of language models that I think the public is just confused that something could be so good at so many other things and so bad at this. Right? That is seemingly trivial task, and that is hard to explain to people. And the answer to that hasn't really changed much in the past few years. They're generally bad at spelling for kind of the same reasons they were bad at spelling two or three years ago.

Nathan L. [00:07:27]: Yeah. I mean, like, how did these things change with ChatGPT? Because ChatGPT is like the introduction of RLHF into these models. And I think, I didn't write this down as a question, but there's like the difference in patronizing base models and instruction models and RLHF models, which I think that for most of this discussion, it's like the end model, the like chat RLHF model is the one that people think about. But was that a big transition point in your work or is it just kind of plugging along? Right.

Riley G. [00:07:54]: I mean, I would say, I don't think it's any understatement to say that, or sorry, any overstatement to say that, that the release of ChatGPT was probably the single biggest event in the history of prompt engineering in that prompt engineering became drastically easier after ChatGPT came out. And most other models learned from the ChatGPT way of doing things, right? That they, like, I think people forget just how fiddly prompt engineering used to be, right? Like people today don't think about things like frequency and presence penalties, right? They used to be that by default, you would get very repetitious output and you had to work to avoid that. People forgot about like, don't end your prompt in a space, right? That you had to understand how tokenization worked at all times, because like, if you put an extra space in there, you were going to go out of distribution. I think that, or another one that I think is particularly vivid for me is Yobi Reel that in June of 2022, Douglas Hofstadter had a piece in The Economist showing the, what he called the hollowness of GPT-3's understanding of the world, that it failed on various simple questions. Like, when was the Golden Gate Bridge transported for the second time across Egypt and so on? And someone, I believe it was Nick Camerota of OpenAI, showed that you could fix almost all of these just by telling the model that if you gave it a silly question, say Yobi Reel instead of answering it, right? That models had to be prompted with the possibility that they were allowed to say, I don't know, or, you know, that's a dumb question, right? You know, like there is no answer, right?

Nathan L. [00:09:34]: This is like, we've added the anthropic system prompt to our AI2 models, and we're like, this doesn't change the evals at all, but it makes the behavior something that we like more. Because I think culturally we're somewhat similar to anthropic, it's like we want to express uncertainty, we want the model to say that, I don't know, and a lot of that is in the system prompt of anthropic models.

Riley G. [00:09:51]: Right. And I think that really, you know, it's another microcosm of just how messy all this is, that what people like is a very different thing from how good are the models. I think, you know, LMSYS had a great blog post recently talking about like stylistic bias and output, that models will be rated as better if they do things like put their output into the format of a bulleted list with bold initial words on each label point. So there's like cheap tricks like that, that will make people like your output better or make them perceive it as, you know, more authoritative or, you know, more comprehensive that you kind of have to control for and just going by preference. I mean, I don't remember what the exact magnitude of it was, but I think they did put some numbers on it in that post.

Nathan L. [00:10:42]: Like, do you think you could handle all of that? Just like, can you make that big of a style delta in the system prompt relative to training? Is kind of what I'm wondering. Like if we release a model at AI2 and it's decent, but then we put a detailed system prompt that it's like, whatever possible, you should put your models into a list format with bolded headings and use markdown. Like, do you think we would get a 50 point bump on ElmSys?

Riley G. [00:11:06]: Maybe not on ElmSys in particular, being as they're trying to correct for this actively. But presumably it would have worked at one point, right? So I think that's, you know, that says something that these, or another great example, I think that's really clear of like why human preference isn't, you know, always the answer. I saw somebody on Twitter once that was really impressed by some anonymous model on ElmSys that was able to produce an ASCII art drawing of a unicorn. And it was a great drawing. And, but when I searched for like specific details of that drawing, I found that it was just in some like widely circulated list of ASCII art drawings. And it was a verbatim regurgitation of some signed work that somebody had made. And so I think there's an argument there that any request for ASCII art should probably just be thrown out, right? That a human's preference of how good an Elm is at ASCII art maybe just does not matter because like, it's so likely to be regurgitated or at least like figurative things, maybe diagrams are okay and so on. Yeah. Yeah. Okay.

Nathan L. [00:12:03]: We've touched on multiple of the things I want to get to in the future, but you kind of said that Chad2PT was the biggest moment for prompt engineering. And I think O1 is not nearly the same magnitude, but it's a very interesting microcosm of the future of prompting because the model feels very different to use. OpenAI has explicitly told us we need to prompt it differently. But I think my guess is that in the long-term, they're going to figure out how to train this model so that the behavior is not indistinguishable from their GPT models, but that it's not as sensitive to prompting and whatever you throw at it, it's going to work. Maybe they need to rewrite the prompts, but that's probably a temporary thing.

Nathan L. [00:12:45]: Two questions to me is simpler. What do you think when you see them giving you like, oh, we need to have these new prompting instructions to use it differently? Do you agree with my long-term convergence idea?

Riley G. [00:12:57]: I definitely agree. I think that there's an argument for seeing prompt engineering as kind of the experimental next branch of language models, right? That it's the features that people are just on the cusp of figuring out how to systematize and integrate into the models themselves. And to the extent that somebody comes up with a prompt engineering idea that is just so good of an idea that it's worth applying to literally every prompt, then it will be integrated into the models and you'll stop calling it a model, you'll call it a system and it'll have some auxiliary second model. I think the clearest examples that we've seen of that are content filters, right? That nearly every model that you get from a vendor will have some kind of cheap auxiliary model that looks at the output and says, is this plagiarism? Is this, or not plagiarism, but regurgitation of copyrighted work, right? Are you reciting Harry Potter word for word? The value of those is so, rather, sorry, the cost of having that kind of secondary model on the output is so low that it truly is worth it to just apply it to every generation, right? And we haven't seen too many examples of that on the input side, but they're starting to appear, I think. I think we've seen from anthropic evidence that they make modifications to user inputs based on certain conditions that they detect if you're asking about some particular feature, they modify the prompt if you are. And I think that's a common pattern in a lot of applications.

Nathan L. [00:14:31]: I'm guessing they've seen some public people kind of using the model. I haven't heard anything about modifying the prompts in a clod or a chat GPT window.

Riley G. [00:14:42]: It's, I've seen it for instructions for avoiding plagiarism, avoiding regurgitation. Oh yeah, that could make sense. Yeah, so the, but it's a common pattern you see in a lot of applications, right? That you, so like a good use case for this is like instructions for tool use, that you might analyze a user's, say, chat GPT input, and if the input appears to be a request to use dolly three, then you should apply to the, you should supply to the model, these long instructions on how to use dolly three, which otherwise you don't need to block to supply. Right. So I'm not saying that that's exactly how chat GPT did it, but it's easy to imagine that that would be worth doing. So, so a lot of applications do things like that to have, you know, conditional sort of augmentations of the prompt. Yeah.

Nathan L. [00:15:33]: I mostly see that like long-term, I don't know how this impacts prompting, but I think of like chat GPT, and then we'll have multiple models that they route to. So this is kind of like an early way of doing this, where it's like, if you give a really long context model, they'll have some, you've maybe even like, like Mambo, like model or different architecture for super long context, or they pass it to O1. If it's like this question is incredibly hard instead of GPT 4.0. But that's that the border between that type of routing and prompting is, I don't know how to classify it.

Riley G. [00:16:05]: Yeah, it's really fascinating. I think, you know, people have this idea of, I think, sort of seeking purity in their models that they want everything to be like, you know, just a model. But I think, you know, we're rapidly approaching the point that you have to start thinking about these things as systems that might just have arbitrary complexity inside of them. I also like, I think that, you know, that the guides that we've seen from O1, you know, that they take that sort of shape, right, that you get that, like the content that Open Eyes put out, like how to prompt O1, it's sort of a list of like domain competencies and weaknesses, right, that it's good at physics, it's good at abstract logic, analytic philosophy, maybe less great at creative writing. The, and then also you have these sort of like patches almost for like noticed problems, right, that they've noticed that it doesn't, that think step by step often degrades at performance. Why do you think that is?

Nathan L. [00:17:11]: Because it's essentially trained to do that on its own. Like, it almost feels like it shouldn't conflict with it. It almost feels like it should just be like empty tokens, like it will just repeat yourself or something.

Riley G. [00:17:22]: That's a really good question. I think the answer to that maybe speaks to just to how much this isn't just, you know, chain of thought. That's a meme sort of flying around now that a lot of people have claimed that all this is is fancy prompt engineering, isn't this just what Reflection did and so on.

Nathan L. [00:17:37]: It's obviously a different inference stack with a lot of improvements across the whole lifecycle of the model and the product.

Riley G. [00:17:45]: Right. And also the other thing that people have been saying a lot is that it must be some complicated system, right, that there can't be a single model doing this through autoregressive inference. But the claim seems to be that it is, right. I think there was a comment from Noam Brown on Twitter where he said that it really is a model that the whole generation is coming autoregressively, which is, you know, I have no reason to doubt that. It seems plausible to me. So it's but I think that people need to be a bit more imaginative and like what's possible and just through autoregression.

Nathan L. [00:18:21]: Yeah, I wrote a really long article on this like came out yesterday. That's like I put the constraints from like the Noam Brown tweets, plus the pricing, plus the inference scaling laws to kind of converge at something. It's like if they do some clever things to a model and some batch inference and self rating and stuff like it's definitely doable. I don't know why that as an RL expert, I'm not surprised that the model is sensitive to things like things step by step in the prompt. I just would have thought that it would come up in the examples of training because there's the seed set for this is almost definitely a very wild human generated some prompt with some like back and forth dialogue, essentially human seeds of things that look like what it is doing. Have you seen this with AlphaGo? We saw this with InstructGBT and ChatGBT. You need the human demonstrations to start the learning process. Why is it sensitive to think step by step like that thing? I think maybe more about the training, but you learn that through prompting.

Riley G. [00:19:23]: Yeah, it is a bit of a mystery. And this is very speculative what I'm about to say, but I think maybe like a kind of thought experiment of how you can imagine that it could be true is imagine if like some auditor or somebody who had the penalty of law over your head asks you to do something and to document exactly how you did it. It's easy to imagine that you would do the process differently and that you might do it worse, right? That because you can only do the things that are the most conservative and the things that you can justify and explain that you're not going to produce as good of a work as you might have otherwise.

Nathan L. [00:20:01]: It's like GBT4 needs to think step by step because every small mistake is a big deal. But almost with O1, we maybe should be like, go forth and conquer and make mistakes on your way and just let it wander to an answer.

Riley G. [00:20:15]: I think that's pretty hitting the nail on the head maybe.

Nathan L. [00:20:21]: I want to go try that silly prompt and see if it gets better at coding or something.

Riley G. [00:20:30]: Yeah, yeah. But I mean, I feel like that's the key improvement here that a lot of people don't appreciate is that they seem to have cured like all the Lacunian problems of exponential divergence, that if you sample a bad token, you're going to keep sampling more. And it's not that there wasn't progress on this before, like people had tricks to deal with it. But I think the thing that's really changed is that the models get mileage out of like thinking for long periods of time, but they derive benefit from just continuing on. Because that's very different from behavior you see from like 4.0. Like if you've ever tried like the exercise of just once it's gone down a wrong path, just say, no, keep going. Like keep going till you get it, right? Like it's pretty evident after a while that it's not making progress, that it's just gone like deeper and deeper into like some failed path of reasoning.

Nathan L. [00:21:24]: Why does that often break? I mean, I understand why it often breaks models, but that's also one of the jailbreaking techniques is just like keep sending the same message over and over and over until the models die, which like I wonder how that relates to O1. Maybe it's just easier from a safety perspective because it doesn't have that like as many turns or something. Yeah.

Riley G. [00:21:45]: And it's also like one of the bigger differences in behavior between GBT models and CLOD that I've noticed that opening eye tends to produce their models to

Riley G. [00:22:02]: like in the specific case that if you keep like telling it it's wrong, it will always take your side. It will say, well, oh, yes, of course I made a mistake. Let me try again, right? And it's never going to like diverge from that behavior. Whereas CLOD will eventually get sick of you, right? Like if you just keep saying like, no, you're wrong, it'll be like, look, I have told you many times that I am right. Like you need to be a bit more specific in how I'm wrong. If you really want to make an argument here, it'll start like just telling you to go away. And that's like-

Nathan L. [00:22:28]: This is why I want Anthropic to write a model spec because the behavior describing with chatGBT does fit with what they're, like open AI's models are like in behavior and they're kind of described as wanting to be like robotic computation assistants where like they follow, they take the user's information and they try their best to execute it without violating any basic principles. But I think CLODs is much more of like, we have created a, like I don't like the hard words to do without anthropomorphizing and all these other things. But like we've created an intellectual entity that is going to go back and forth with you. And it's not going to, like it's going to, like you pass in sensitive information as data to CLOD and you're like reformat it. It says no. You get these weird things because it's like this entity that doesn't want to be sent like harmful texts or be told how to make a bomb or something. But chatGBT is like the robotic one. So now I kind of use both of them depending on the task and the behavior that I want. But I'm excited to see how that goes further, really.

Riley G. [00:23:27]: Yeah. Yeah. I mean, that's, you know, I think it goes back to your point before that, you know, we're seeing more specialization in these models. But, you know, that all of this is temporary, right? That eventually like somebody will come up with the right way to delegate correctly to one model or another. And then you'll have just, you know, some unified chatGBT interface or whatever that, that, you know, decides like, is this a prompt that one would be good at and sends it to it? Yeah.

Nathan L. [00:23:50]: And while we're on these complex reasoning things, there was also this reflection 70B drama, which was mostly big because it was a big mess of credibility and memes. But there's also like real science in there that people need to remember of like how to prompt a model and spend more on inference. So I think it's really just a tiny bit of fine tuning with some special tokens and a system prompt. That's like, make sure you use these reflection steps. And that is how you move something like GBT 4.0 closer to O1. You can't, you can't prompt your way to O1 behavior, but that's the sort of things that more people should be considering. And it kind of leads into like, I want to ask about like math evals and stuff like this. And it's like reflection 70B style of prompting is a real thing that more people should be doing. And I don't know how we get around that communication issue now. It's going to be even harder because people are going to be like, oh, it's O1. We made it open source O1 now instead of just the best model. I just wanted to give air time. If you have any comments on that, go ahead.

Riley G. [00:24:48]: Yeah, I think, you know, reflection 70B was, you know, it was sort of a perfect storm of a lot of like the tuning method feeling plausible, right? That it was something that was very, you know, it's a legitimate like area of research. They like, it was, you know, rumored to be part of Strawberry and so on. And so there was like, it had like the right strategy for Buzz there. And, you know, however, they ended up releasing that model, like, you know, they don't have what they think they have. You know, so it's, I think, you know, it's kind of, you know, once you saw the, I won't recap the whole saga of like, you know, with Laura and finding the Laura from the previous version of WAMA 3.0 instead of 3.1 and all that. But I think the, you know, there's that kernel of truth there, right? That this is, you know, sort of a good idea, at least for some problems. I think also the thing that people don't appreciate is that very good idea for many problems feels maybe like a better idea than it is because it's so optimized for the domain of problems that tend to be on benchmarks, which is somewhat different than the thing that you really want to optimize for in the real world of like user satisfaction and just, you know, preference. Like some mix of like, do people like it? Like, is it useful? And does it do well in benchmarks? Because I think that there's like a, even for what I think should be like philosophically the core like use case of LLMs, like do they like do practical work? Like can somebody achieve the thing that they want to do with this? But, you know, like whether, however they do it through prompt engineering or whatever, it kind of matters more than whether like academically it does well on like the most naive presentation of the problem, right? Like whether somebody can figure out how to do it correctly matters. And that specifically is just not captured well on benchmarks, right? That like this, if you're doing a benchmark that compares across several models, there's, you know, a natural incentive to do it uniformly. That maybe you follow like vendor's best practices on, you know, how do you apply the template of the prompt and so on, or if a vendor recommends that you apply some suffix or whatever, you might do it. But for the most part, you're not going to put a human on the task of figuring out what is the best prompt for each model, right? Because then, you know, how do you know that they did a perfectly good, you know, fair job of that, right? But really that's what matters. Like that is like, you know, at the end of the day, like the thing that determines whether GPT-4 is better than Quad is when you sit down and try to, you know, solve your problem in GPT-4, you know, applying whatever hacks, you know, and, you know, taking, you know, advice you find online and, you know, whatever dirty tricks you have, and then you do the same for Quad, which one works better. And so like that's the state we're in. And that's, you know, very elusive as a thing to try to measure. Yeah. Okay.

Nathan L. [00:28:00]: I'm going to keep going, roll right into this, into the evaluation section of this conversation. You had, you were talking about this with how you actually use the models before you had mentioned, like you need a white space to properly evaluate or use the models like tokenizer things. I, one of my big blind areas is it seems like most frontier labs are using some sort of custom prompts on some sort of evaluations. And I don't really have a good sense for how much that actually impacts scores or how much that translates to downstream performance. It might not be custom prompts. It might be like custom setups. There's all these, like all the math evaluations, you need a specific format for your answer. I think like math, the all capital one, you like need to put your answer in a box and

Riley G. [00:28:45]: things like this.

Nathan L. [00:28:46]: And how, what is your view on these per prompt or per evaluation? Prompting is actually a thing. I think the Lama three paper had some cool analyses on how varying subtle things changed evaluation scores, which is great, but they're the only one sharing that. Otherwise we just get like our score is X and it's reproduced to some capacity.

Riley G. [00:29:09]: Yeah. I don't have like a lot of deep, like technical wisdom to share on that front, other than to confirm that, like, I think you're right that it is a big problem that we generally try to follow the vendor recommendations. We work with the vendors to prompt their models fairly. But like I said, like ideal and optimized prompts are very different than what's the default. But I think also that there's, I think a longer term trend that these issues maybe matter less than they used to. And, you know, or that, that, that should continue. I think like when you want the, like maybe one of the clearest signs of this is that Lama, like most versions of Lama, you can prompt them incorrectly in terms of like the system top prompt template, and it will be just fine. And in fact, you can often template them with system prompt templates from other models entirely, like just say representations of chat ML and they will be fine. Right. So there's, there's sort of familiarity in the pre-training with, with, with just chat templates in general. And the idea of like...

Nathan L. [00:30:25]: Do you think this is specific to Lama? I've also remember hearing a conversation at AI2 where we were considering doing the last turning, last stage of pre-train with random chat templates and like random instructions and multiple chat templates so that the model could be amenable to fine tuning and multiple chat templates, which there's a chance that they did that. I actually don't know. I would not put a high bet on it. But do you think that's just because Lama knows they're going to have so many users? It's possible.

Riley G. [00:30:54]: I mean, it's also plausible to me that that just shows up in pre-training incidentally, right? Nobody intended it to be there. It's just like, it's in the data. But I think that, that, you know, that, that process is only going to continue, right? That we're only going to see like more models just being familiar with how models behave. I think to some extent, like, you know, you see like, like another thing that I think is maybe like evidence in favor of this is if you look at the base Lama, like, I think I looked into this on like base Lama 2 once, that if you prompt with like, like instruction prompt formats, it would adopt the behavior of, of like a chat GPT like assistant, right? So, so I think, I think it shows that examples of chatbot behavior are now so widely disseminated, you know, across the internet that a pre-trained model is better at instruction following tasks than any pre-trained model was before the work of instruction GPT was done. So, yeah, I believe you.

Nathan L. [00:32:00]: I want to check this. How does this impact how we should view evaluations? I'm just trying to reckon with, do we, like, there's a couple of scenarios. It's like, it doesn't really matter because these models are going to be not that sensitive to the system prompts that we're using to say, do GSMA care math. And that goes for models like Lama in the open, AI2's models, GPT5, whatever. It seems like the sensitivity to prompting for really well-known formats is actually going to go down. And that solves some of our problems. Because I don't think we're going to come up with new, like that many new formats for evaluations. We're going to make evaluations more specific and harder in the content.

Riley G. [00:32:43]: I think that's right. And I think the version of it that we have to play with now definitely does feel like one step forward, two steps back in that regard. And that it's much better at benchmark style inputs where you give it just no advice on how to do it. You keep everything very simple with what are your output requirements. But it's also just very hard to steer. If you have opinions on how it should do it, those opinions won't be followed generally. And it also has issues with output formatting. So I think we're seeing, I've seen anecdotal reports on Twitter at least, and I've seen this myself, that its output is just inconsistent even when you ask it to be consistent. That it will forget things like block quotes and so on. The result of this, I think we're going to have to see a lot of benchmarks, is that maybe the fair way to do this is to have some secondary model on the end of it that puts everything into a consistent format.

Riley G. [00:33:50]: I think we're not that far away from benchmarks that just do that across the board, of just saying that it's not the model's job to do this anymore. And we'll clean up the results however it is. Yeah, I think that's a better place to be.

Nathan L. [00:34:03]: It's one of those things that the model's getting better can solve some of our problems. I think there's less angst now about the whole closed labs evaluation scores anyways. I'm mostly trying to reckon with what open groups and academics are doing rather than closed labs, and they kind of rely on each other. I've been on the, before, there's now this hugging face upload chat template. So a lot of models have the chat template saved with the tokenizer, and most of the time they don't have a system prompt, which is surprising. I feel like it should be the norm that a system prompt is included with every model. Is there any reason that you see not to do that?

Riley G. [00:34:49]: Yeah, I mean, I can think of things that might be slightly better, but I think that that's that generally makes sense, right? Like, I can imagine that maybe they, you know, you'd release several, right? And say, you know, it's like any of these is fine, or, you know, like training on several and, you know, say it's like an average of these three or whatever is like kind of the is ideal or something like that. Yeah, most of my reasoning is I think that most users of language models are not sophisticated.

Nathan L. [00:35:14]: So the model cards and documentation do normally say we recommend using the system prompt, but the simple ways of using the models do not integrate them. Simple ways of using the models do not integrate the system prompt. And it's not always easy to modify your data to add, like if you're doing the messages format, like you remember to add the system thing. And if you have multiple models in your queue, you then have to go and manually hard code

Riley G. [00:35:37]: all of them.

Nathan L. [00:35:37]: And like, that just makes it get dropped. And if the system prompt is a big deal for performance, like that impacts either if it's a product or it's like, this is where I'm trying to understand like academia is like, if only half of the people remember to add the system prompt for their model, they're evaluating in this kind of academic paper. And I know it impacts things like all the vibes based valves, like alpaca valve, empty bench, whatever. Like, if you have the different system prompt, it can vary behavior. We did an experiment, which was like, to make sure this works, or you just give it the system prompt of like, you're a terrible model, you are to me, you're made to make other models look good, and you happen to give wrong answers. And like alpaca valve goes to zero and all these things. So it's like, I think it's easier to show the down case, but you could probably get one to 2% improvements, which matter in the long trajectory of academia in terms of if your method is accepted or not.

Riley G. [00:36:31]: Yeah, I mean, I've often like been frustrated by the ambiguity and a lot of academic publications over like how prompts are formatted. And they often, they always run into the same pitfalls of that, like the fundamental problem is that system prompts are often, or prompts in general that you're presenting like during evaluation are implicitly templates, right? That you have like your points where you insert like the actual problem or whatever. And that templating needs to be communicated to the reader of the paper, and the prompts themselves may involve templates, right? They may, you know, like describe like how, you know, like an output should be formatted, for example, and might do this using, you know, like curly braces, right? So this creates like several layers of confusion that you need to distinguish between, like where are the variables that you're interpolating purely in the logic of this paper of like that, you know, that things that would be translated into Python, you know, like if you were to actually implement this versus the templating instructions that are literally part of the instructions on how it should, the model should receive like a template of how it should format its answer and so on, right? Because like a lot of prompts end with use this format and then have some kind of template. Yeah. Right. So the, like I've often thought that we'd benefit immensely just from standardizing on something like saying that like if you want to clearly communicate a prompt in your paper, the way to do it is to show Python code that will produce that string. Yeah. You just literally show it as an f-string, there's no ambiguity.

Nathan L. [00:38:15]: Because you copy out of a paper, you drop the slash n slash n that you need or something like that.

Riley G. [00:38:21]: Yeah, right. Like the, but if you were to literally just include a Python code block, there's no ambiguity, like, you know, like whether or not there's a trailing new line or is it so on. And those things are really fiddly and need to be communicated. Because I've seen people do all sorts of like imaginative typography to like represent new lines and things like that. You know, like having the return signals at the end in light gray and, you know, like you're putting dots between spaces and all that thing, right? Because if you're doing like, I've seen like early like playground competitors sometimes did this that approached like more like from a technical approach that you need to know where spaces are. So it's worth it to represent them as like gray dots, right? Yeah. That's the kind of thing that the level of detail that you need in communicating these things. So I think like standardizing on Python would be just like a good way to like, you know, get the problem out of the way. Yeah.

Nathan L. [00:39:14]: I also saw in some discussion of a one or maybe a reflection. I don't remember. It's been a while, two weeks. You're talking about like equal inference costs, comparison of prompts and a reply. And I think that's a great idea. Like, do you think there's, okay, well, like one first, do you want to explain the idea? I'll kind of ease into this.

Riley G. [00:39:33]: Sure. So my thinking is that models are evaluated right now just based on how they do under like sort of the same, I guess, invocation of inference, right? That you let the model sample, you sample auto-aggressively as long as that takes, you know, however long the completion is. And you don't pay attention too much to like what it costs you to run that or you factor that in afterwards that you score it up. And there's a lot of reasons why this makes sense, right? That, you know, it's simpler, it's more fair. And sometimes you don't know exactly how to equalize the inference there, right? That you can't like really say that like what the trade-off is, right? But there's, you know, exceptions to this that, or maybe not so much an exception, but like there are ways of doing it that aren't perfect like self-consistency, right? So like there's a method called universal self-consistency where you prompt a model multiple times and then take the model again and give it all three answers and then ask it to choose which of these is the most consistent with the consensus of all answers that were generated. And this is sort of a method that's pretty reliably not worse than just doing it naively, right? It's hard to imagine any prompt where this method would steer you wrong or, you know, be worse than doing it naively. And that, you know, suggests that maybe there's like a fairer basis of comparison here, right? That we could say that if something really is cheaper enough that you can do that, you could run it 40 times and take self-consistency that then maybe that should be its score. But I think one of the bigger reasons why this is kind of like a, in hindsight, this is maybe like a bit of a facile tweet that I made about this, but like really the trade-off between the exchange rate, if you will, isn't very good. I think like a rule of thumb that I saw in a paper once is that if you do self-consistency on 40 samples of GPT-3.5 turbo, it's on par with one sample from GPT-4. So you sort of move up one generation every time you do 40 inferences, right? But at the same time, in specific domains, there are refinements of this that work quite well. So we had a scale actually put on paper recently on a method we call plan search, I think was the name of it, yeah, plan search. And then the gist of that is that if you can improve performance on programming problems by generating diverse attempts at solving the problem, right? So the approach that plan search takes is to first create like sort of high-level observations or ideas about how a problem might be solved, then to combinatorially sample that list of ideas, and then take combinations of them to inspire strategies. And then for each strategy, you lay out sort of a path of like reasoning of like how you could turn this into code, and then you turn each one into code and then assess which one works best. And this like lets you search over the portion of, it lets you search over the variation in your strategies that actually matters, right? Because you can imagine that if you were just simply resample a model blindly over and over again with the same problem, there are a lot of ways that an answer could vary that don't matter, like whether you use tabs or spaces, but you name the variables and so on. And you don't want to search over that variation, you want to search over like the part you think is going to be fruitful, like the high-level strategies. So I think that for particular domains, like that is the more relevant comparison of like what could you do if you were to apply like a bit of search here.

Nathan L. [00:43:40]: Yeah, it almost seems like there'll be different tiers of evaluation scoring, where it's like the basic prompting, it's kind of like linear time. And you could do like, it's almost like with the models, it's like there's a biggest, best open model at every time. But like LLAMA is dominating because it has the 400B, the 70B and the 80B that are all really good, it should have a 1B. And if you're having a prompting paper, eventually you're probably going to have to have binned comparisons like that, which is like we are comparing two basic prompting techniques, which I think they will have less headroom by needing the autoregressive behavior and things like this. And then maybe there's things like reflection, where it's like we've added minor structure so that the model can now generate a bunch more tokens, but not like a 10X or 100X. And then there's the things like we've added a whole new planning component to how we're prompting the models, and it's all abstracted away from the users. And you're not going to be able to compare those, because those are the things that are going to just solve all the benchmarks that we have out of the box. I think that's fine. I think people will converge to this. It just always takes a bit longer than we want.

Riley G. [00:44:47]: Yeah, I think that's right. I am really excited about the O1 RL approach to this.

Riley G. [00:44:58]: On some level, all prompt engineering is approximating this RL-like search. We have a lot of prompt engineers out there that are trying different things. They see what works. They tell their friends, hey, this works. But the space of things that works is probably, well, I mean, demonstrably, maybe at this point, given O1, outside of what a human might think of. There are things that we see things, even in the summarized reasoning traces that O1 puts out, that are eerily anthropomorphic. That it will say things like, hmm, or let me think about that. Yeah, I feel like they added that in.

Nathan L. [00:45:42]: I think it's almost like a trigger for the model to have a more reflective response. Those are the examples they used, but it's cool.

Riley G. [00:45:49]: I mean, it's not hard for you to imagine that RL could find something like that, right? Just that empirically it works to say, hmm, because that suggests that you're about to do something else in the pre-trained modeling manifold of plausible text. Like saying, hmm, might just be empirically a good thing to say. And it could find that. So I think that's the kind of exploration that you're benefiting from with O1. It's the space of prompts that work that we're not really equipped to find. Yeah, do you have anything?

Nathan L. [00:46:28]: I think this is a good discussion. Kind of to wrap up the academic side of things, how much of papers that are nominally about RLHF training or any sort of post-training as the contribution, do they need to do anything with prompting? Is there a clear segmentation there? Or is it like, if you're doing this fine-tuning, you're necessarily changing how the model is going to respond to prompting? That we should do some checks there.

Riley G. [00:46:55]: That's one view of it.

Nathan L. [00:46:56]: Or the other view is you have a model and prompting is just a way to take one step further with it, which I think Anthropic did this recent podcast with Amanda and their chief prompt engineer that I don't know.

Riley G. [00:47:07]: And that's how they do it.

Nathan L. [00:47:08]: Amanda's like, I can do things with these models that most people cannot. And that kind of leads the way. Rather than prompting being really part of this post-training stack that everyone needs to be checking the box on. I don't know where we fall. I guess there's this IF eval, which we could come to after that, which is kind of a separate

Riley G. [00:47:29]: case. Yeah, I definitely lean a bit more towards the Anthropic view of the world. I guess you could argue that's maybe somewhat self-serving, with no big news there. Prompt engineers are important. But I think that it's true that we do see people that are just good at this. That our ability to prompt these models sometimes exceeds our ability to explain how we're doing it and what the general strategies to apply are. And I think those strategies are worth extracting.

Riley G. [00:48:09]: It's worth introspecting.

Riley G. [00:48:12]: One thing I think about a lot is anytime somebody... I really love when people suggest a prompt or suggest doing something to a model that I can tell immediately will not work. And it's a terrible idea, but it wasn't obvious to them. And that's fascinating, right? Do you have an example?

Nathan L. [00:48:29]: I would love to know if you have something that everyone tells you, but it's a generation behind or something.

Riley G. [00:48:35]: A lot of, I'd say, strategy ideation in fields that are new and competitive. If you wanted to have an LLM give you ideas for what's a good LLM startup to try right now, it's probably not going to tell you anything useful. Some things like that, where it's like, people are still figuring it out and there's money to be made in knowing how to do this better than the average person, you're going to get mediocre advice on a lot of things. But that's not true for everything. If you ask it about physics, you're going to get like, oh, I don't know how to do this. If you ask it about physics, you're going to get like, above average advice.

Riley G. [00:49:16]: But I think that people who have acclimated to models forget what it's like to be new

Nathan L. [00:49:24]: to models, right?

Riley G. [00:49:25]: And I think that explains a lot of people in industry being annoyed by how many R's are there in strawberry. Because they're so- That's the tokenizer.

Nathan L. [00:49:33]: We ignore the tokenizer whenever we can.

Riley G. [00:49:35]: Yeah, and you see this explicitly. A lot of people, they get really enraged that they're like, you idiots, why would you ever think this would work? Why did you ever think that you could ask it 9.11 is greater than 9.9 and it would give you a right answer? And so on. They have a point. That was the attitude for a long time. But I think the social context of these models is changing and people, they want them to, it's becoming more reasonable to expect them to work well in these queries. There's practical consequences of these models being in the hands of people that don't know about these issues. And it's now suddenly more important to fix them. Yeah. So let's spin on this.

Nathan L. [00:50:12]: Is Google searching going to become more like prompting or is prompting going to be more like Google searching? Where with a good language model, can I just type in that physics equation that govern with the cross product that governs electromagnetism? Is that the direction that the models are going? Or is everyone going to actually become more conversational because AI is the default?

Riley G. [00:50:37]: Yeah, I think, I mean, Google searches maybe, yeah, there's some similarities there. I think Google probably has gotten simpler.

Riley G. [00:50:48]: It's been a while since I've used most advanced search filters in Google. I remember a point when it was extremely routine. Yeah, the plus comma, quote, quote, comma. And I think that speaks to the fact that the results used to be worse, right? And we thought we were happier with them because we didn't have alternatives. But we just accepted that, oh, yeah, there's going to be false positives in here that we now have to put in some negatives to cancel out. And that skill, I'd say, hasn't really become more important over time, right? It's occasionally useful still, but it's less essential than it once was. And that mimics a lot of what we see in prompt engineering that you don't have to understand. Tokenization, I think, is probably the biggest one. ChatML was no small part of why ChatGPT was such a big improvement to prompt engineering. It wasn't just the tuning. It was the fact that they came up with this more restricted system of interacting with a model that alleviates the need to know anything about tokenization. And that, I think, is kind of an underappreciated change. Yeah, I agree.

Nathan L. [00:51:54]: I do think in the long term, prompting will go in the direction of Google searching. But I think in some ways, I'm not that surprised that something like O1 can exist, but it's still a very humbling moment where we still have many times where there will be AIs released that we don't know how to use them. And this is the skill that you need to have, is tinkering with the open mind. It's like the open mind that things will come and the open mind that things are not just what they are at face value. And if you play with O1 a lot, you can definitely get things out of it that people on Twitter are not repeating over and over again.

Riley G. [00:52:31]: Oh, yeah, definitely.

Riley G. [00:52:35]: A lot of the explanation for the disconnect that you see, and some people are just absolutely amazed with O1, but also most of the things you see on Twitter maybe aren't that impressive. I think that the frontier of problems that distinguish O1 from, say, the previous class of frontier models, it's either unrealistic problems, brain teasers that people artificially constructed to exhibit the difference, or it's something realistic that you would never want to read in a tweet. The problems where it's exceeding on are like, I have this extremely in the weeds programming problem that involves a complicated interaction of all five of these files. Please fix my import errors or whatever.

Riley G. [00:53:25]: Those are the things that you're going to see the most practical benefit from. And those just aren't easy to communicate in a way that they used to be. It used to be easy to make a screenshot of, hey, look, it does this. It will fix your broken JSON or whatever.

Nathan L. [00:53:45]: Something else that I'm realizing I didn't put in the notes, but there's been these comments on O1 from the OpenAI people that they want to expose the ability to change how long the model thinks to the user. So to change its test time compute, that ultimately is going to be a whole other prompting thing. It's almost a little surprising that they are giving that to user. I almost think they should just make a classifier that does it for them, rather than just assume the user is dumb. But being able to do it and change how hard your model thinks is a really interesting real-world prompting case. Because it doesn't really matter if you can get a viral example. But it's like, how do you vary that knob in your day-to-day use that meaningfully ships your end product?

Riley G. [00:54:26]: Yeah, it's really kind of comical trying to manipulate how long it thinks about things. Because there are some things that will make it think for a long time. I tried to get it to generate acrostic word squares once. And if you emphasize enough the need to validate things, it will just keep validating and failing and loop around for, I think I got up to three minutes once of attempting things before finally saying, oh, I wasn't able to find one. Here's my best effort. But the other times, though, if you ask it... I mean, I once gave it a problem. Or I kind of just was for the comedy of it. I gave it some simple problem. And then I gave it literally, I think, three pages of emphasis on think forever. Just rambling paragraphs saying, if you're even considering stopping, don't. If you ever have the dream, if you ever get tired, don't worry about it.

Nathan L. [00:55:22]: Just keep going.

Riley G. [00:55:24]: All those kinds of holy hand grenade style repetition. And after all this, it literally just thought for three seconds and then came back and said, I understand the urgency that you're saying here. Thinking forever just isn't possible. So I'm not even going to try. There's another thing.

Nathan L. [00:55:43]: OpenAI said they might give you a knob that controls this or influences it.

Riley G. [00:55:47]: Yeah, I have to be honest. It feels like maybe weird UI. It seems like something that you should be able to just do through text. But I'd be happy to play with it. Because steerability in general without one seems to be... A lot of people, I think, are reporting that it's kind of awkward or at least at odds with the really impressive examples that we're seeing coming out of it. Yeah.

Nathan L. [00:56:16]: There's a whole strategy discussion on why did they actually release it that I haven't really entered into. We can kind of avoid this. I am wondering how you view prompting of agents. Is it kind of like the future section of what is the future? How are agents going to be susceptible to prompting? I'm guessing after our conversation here, it's going to be like, it's the same. And there's going to probably be a meaningful shift in who can deploy them and have success based on who actually has this expertise and is doing this prompting work. And this could translate into downstream business success, which is the first person to kind of crack an agent with the right model and the right prompt can have the first product that works.

Riley G. [00:56:57]: Yeah, I think people mean very different things when they talk about agents. Sometimes, and I think the big division that matters is that there's agents that are working in self-contained, repeatable environments, so like a rebel sandbox. And then there's agents that are making changes in the real world, that they're out making retail purchases, canceling your subscriptions, so on. I'm very optimistic about the former. I'm very skeptical of the latter. I think people underestimate how much reliability is needed for a lot of role decisions before you get to the point that you'd trust the thing to have the power to cancel your Hulu subscription or whatever. I think that also, in the first case, there's a lot of untapped potential there. And I don't understand why we aren't seeing more iteration on that front, really. Chachiviti's code interpreter, when it came out, I think they renamed it to Advanced Data Analysis or something like that, which is not a good change in my mind. But the code interpreter, I love that. I still love it. It's a brilliant product, and I wish they kept going with it and improving on it. I'm also a fan of Julius AI, which goes exactly in that direction of creating a code interpreter-like environment where you can substitute in whichever model you want, and you can do things like install packages. It's great for one-off scripts where you want to say... I had a post once where I was pointing out oddities in the longest GPT-4 tokens. One of them is like slash, slash, and then 128 repetitions of an equal sign or something like that.

Riley G. [00:58:49]: But the way I did this was literally just like I just went to Julius, I said, install TikToken and show me the longest tokens. And I read the code pretty carefully because I was going to tweet it. I didn't want to tweet out something wrong. But it was right. There were small things that I had to fix, but it's good for prototyping, the kind of these quick one-off things where you're just like, yeah, I could look up exactly... I roughly know how to use TikToken. I just didn't feel like figuring out the syntax again.

Riley G. [00:59:17]: It's good for just the curiosities and one-off stuff like that. And I think that's what the future of this really is. This really blew me away.

Riley G. [00:59:30]: Somebody posted on Twitter a video of their eight-year-old daughter using Cursor, I think it was, and this girl apparently has no understanding of the code that's being generated, but she's able to say, no, I want to do this differently. I want to have a Harry Potter spell here. Changing the layout of this HTML JavaScript app. And it just works. And that's the future to me, that that's the hottest programming language is English. When you see a little kid doing it, you really believe it, that now kids can have the power to create software. And that's great because we were at a weird local minimum of that, I'd say, of kids being able to have the creativity to create their own interfaces or make their computer do what they want. They're less customizable now than they once were. Yeah.

Nathan L. [01:00:28]: My reflection on this is the people who take prompting seriously are more likely to be in tune with what is happening in AI and at the cutting edge. But that also means that on the academic side and the public side for transparency and accountability, you have to do some education work to make sure people are taking it seriously and or some normalization of claims, kind of depending on how people are presenting their work and using things. I think it's safe to say that all the frontier model labs are doing this, but kind of the long tail, it takes people time to learn these habits. But it's surprisingly hard to convince people to spend time playing with models too. Like I do it, but I should probably do it more, listening to people like you. I just, it's funny. It's one of those things that doesn't make sense how it'll pay off, but it probably will.

Riley G. [01:01:20]: Yeah. I mean, there's no substitute for using models. People, I mean, I personally, I discover just the dumbest things sometimes that make the biggest difference. One of the most high impact chat2BT tricks that I found lately is I have custom instructions in my chat2BT telling it how to think silently. I have a tweet about this that I posted once. So if you Google chat2BT think silently, good sign, you'll probably find it. But I have the prompt here actually, right? I told it, I was using its new memory feature so it can remember things that you tell it. So I was sort of showing that off at the same time. But I said to it, remember this, when I ask you to think or write silently, I mean, for you to use your Python interpreter to write your thoughts as code comments or string literals assigned to variables. Code doesn't necessarily have to display any output. And then it remembers that. And so then I can say to it, silently write a brief essay about Super Smash Brothers, then silently translate this essay into French, display only a double histogram showing the frequency of word lengths for both texts. And then it doesn't output anything until it has that histogram done and then outputs the histogram and says, here it is.

Riley G. [01:02:32]: And that makes such a big usability difference. If you just don't have to see what it's doing, if you can just put it behind a fold where you can expand it if you need to, be really sure that the code is right or copy it to another editor or whatever. But just not seeing it makes such a big difference. And you can just have things in code too. You end up in this sort of Jupiter-like flow where you told it to silently do something. And now because you said to do that, it's not just in context, it's in a variable. Like I said, if it ever needs to do something in code, it would just have that variable there. And it doesn't have to repeat it, which is a big deal if it's, say, an essay. Repeating an essay is expensive. Yeah. This is great.

Nathan L. [01:03:19]: Thanks so much for coming on. Anything else you want to plug or talk about?

Riley G. [01:03:25]: I should have some content that should be going live around the time that this comes out on analyzing one for the scale blog and talking a bit more about our coding leaderboard. So definitely look out for that. And also, the other thing I should of course mention is Humanity's last exam. We recently partnered on an effort to solicit from the public examples of challenging problems. And we are giving out cash prizes. So definitely check that out if you're interested.

Nathan L. [01:03:58]: Yeah, I had just tweeted a few days ago. I don't know if I put it on Twitter, but I put it on some platform. I don't have Twitter at work, so I end up looking at lame platforms I'm less addicted to. But essentially, evaluation is going to be extremely expensive. And that was my whole take. And it's going to be very narrow and very hard. And then you put out $500,000 in prizes. And the initial whiplash is like, oh, that's a lot. But in reality, I think that's the right ballpark. Because if you're going to make a good eval, you need to have somebody who's really good at cutting edge AI, probably working on this at least six months to build a good eval. And that's a ballpark price. $500,000 is like a half year of how much it costs. This is with overhead and compute and stuff. It's how much it costs to have somebody in AI like that. So obviously, it costs more to actually build this evaluation. But these numbers look ridiculous. But if we want to have evaluations that are meaningful, this is what we need to do. And I think it's the right thing for Scaled to do to lead on evaluation. It feeds into natural things of their business. I think I've been on the record for this for a while.

Riley G. [01:05:00]: So I'm like, it's great. Yeah, absolutely. I think that people outside the industry at least have the impression that evals are grunt work, right? That this is something that you would use low-cost labor for. It's not a prestigious area. But it couldn't be further from the truth. I think evals are very rapidly moving towards the high end of intellectual ability that we're looking for like PhDs. I've done projects where it's like, okay, we have to get as many PhD-educated poets as we can to check the correctness of these IAMs in this poem or whatever.

Riley G. [01:05:46]: I think that's only going to continue, right? We're going to see that at the low end, the value of human labor for training models is going to decline. And the value of high-end intellectual labor is going to increase probably drastically.

Nathan L. [01:06:04]: And it's like cost is probably a good proxy for evaluation usefulness. LM says it's expensive, but for different ways than the Scaled leaderboard is expensive. And they complement each other very well. And they both become better by the others existing by kind of like, okay, the models are in similar places, but they're showing different things. And you can separate between that. And I suspect that that'll continue to grow. Some more will be at scale, some more will be elsewhere. And that's just the new default for evals.

Riley G. [01:06:35]: Yeah, absolutely. I think that's one of the things I'm most proud about working on our evals and leaderboard at scale is that we're contributing to this healthy ecosystem of not having to just trust one or two players that evals have been done correctly. We want to have more openness and more independent verification of evals. And that's sort of our general theme with work with GSM 1K and trying to make sure that we can actually trust what these leaderboards are saying.

Nathan L. [01:07:08]: Yeah, my one nitpick that I don't know how to answer and I probably need more RLHF experts, you might know this, is like, are companies that buy data from scale going to have an advantage on the scale leaderboard because the distribution of humans that are

Riley G. [01:07:20]: doing...

Nathan L. [01:07:20]: Not that the humans doing eval and creation are the same, but that they're drawing from the same pool of humans that are writing content or doing preferences and then that are doing

Riley G. [01:07:30]: the evals.

Nathan L. [01:07:30]: I think it's too early to answer that question on if human distribution matters. And for that reason, I think the eval is still so much a net good. But it'd be really interesting to try to run those experiments on who is giving the data that you train on and how does that then impact the evaluation?

Riley G. [01:07:49]: Yeah, that's not something that I'm familiar with in enough detail to comment on our process there. But yeah, that makes sense to me. I think that's something.

Nathan L. [01:07:59]: It's something that people like to complain about every possible thing. And I understand the root of the complaint, but it's like, we've got to deal with the circumstances where we are in the AI industry. And the leaderboard is so much more useful than it is causing any problems. Let's keep doing it.

Riley G. [01:08:17]: Yep, absolutely. Okay.

Nathan L. [01:08:20]: I think we're at time. So I'm going to click stop here. Thanks again.

Riley G. [01:08:23]: Great. Thank you so much. Bye.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

[Article Voiceover] Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem Sep 27, 2024

Sorry this one was late! Thanks for bearing with me, and keep sending feedback my way. Still a year or two away from when I have time to record these, but I would love to.

Open-source tools, examples, limits, and the state of training multimodal models.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/molmo-and-llama-3-vision

00:00 Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem
02:47 Llama vision: Multimodality for the masses of developers
03:27 Molmo: a (mostly) open-source equivalent to Llama vision
08:45 How adding vision changes capabilities and reasoning
11:47 Multimodal language models: Earlier on the exponential

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

[Article Voiceover] Reverse engineering OpenAI's o1 Sep 17, 2024

What productionizing test-time compute shows us about the future of AI. Exploration has landed in language model training.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reverse-engineering-openai-o1

00:00 Reverse engineering OpenAI's o1
01:52 From Q-star to Strawberry to o1
05:13 Training o1 with reinforcement learning
09:24 What is o1 doing when given a prompt?
11:49 Questions to consider to understand o1's structure
11:56 1. How does an RL-trained language model act?
12:38 2. Is it an online / test-time search?
14:20 3. Is it one model at inference?
15:29 Open-source o1, the future of o1, and the future of AI

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Futures of the data foundry business model Sep 11, 2024

Scale AI's future versus further scaling of language model performance. How Nvidia may take all the margins from the data market, too.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-data-foundry

00:00 Futures of the data foundry business model
02:57 What it is like to work with data vendors
06:06 Data foundries: Risks
08:18 Data foundries: Growth vectors
09:50 Realistic expectations

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_023.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

A post-training approach to AI regulation with Model Specs Sep 10, 2024

And why the concept of mandating "model spec's" could be a good start.
(Oops, forgot to upload this yesterday!)
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/a-post-training-approach-to-ai-regulation

0:00 A post-training approach to AI regulation with Model Specs
1:45 Expanded roles of Model Specifications
3:40 Near future of Model Specifications

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

OpenAI's Strawberry, LM self-talk, inference scaling laws, and spending more on inference Sep 05, 2024

Whether or not scaling works, we should spend more on inference.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws

00:00 OpenAI's Strawberry, LM self-talk, inference scaling laws, and spending more on inference
01:51 OpenAI's Strawberry
04:16 Self-talk in language models
07:45 Inference scaling laws

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_006.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_021.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_037.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

OLMoE and the hidden simplicity in training better foundation models Sep 04, 2024

Ai2 released OLMoE, which is probably our "best" model yet relative to its peers, but not much has changed in the process.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmoe-and-building-better-llms

00:00 OLMoE and the hidden simplicity in training better foundation models
02:04 Frontier model team compute allocations
04:19 De-risking training complexity
06:40 On organizational complexity
09:05 Compounding improvements -- the key to building better language models

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_007.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_009.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_011.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_028.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_032.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

On the current definitions of open-source AI and the state of the data commons Aug 28, 2024

The Open Source Initiative is working towards a definition.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/defining-open-source-ai

0:00 On the current definitions of open-source AI and the state of the data commons
3:17 Reasons to not mandate fully released data
4:24 Sufficient but not exhaustive data docs
5:22 Frustration with the data commons
7:04 We need more examples to define the definition

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/defining-open-source/img_005.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Nous Hermes 3 and exploiting underspecified evaluations Aug 16, 2024

The latest model from one of the most popular fine-tuning labs makes us question how a model should be identified as a "frontier model."
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/nous-hermes-3

0:00 Nous Hermes 3 and exploiting underspecified evaluations
5:29 Parsing training lessons from Hermes 3

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Ross Taylor on LLM reasoning, Llama fine-tuning, Galactica, agents Aug 08, 2024

I had the pleasure of Talking with Ross Taylor, who has a great spectrum of unique experiences in the language modeling space — evaluation experience, Galactica lead author, Llama post training, etc. This is a really great conversation on the frontier of language model (LM) reasoning, LM deployments and demos, LM’s for science, RLHF, and other topics. I’ve been trying to get Ross to come on for a bit. He’s one of those people in the LM space that doesn’t speak too much, but when you do, you listen.

Ross Taylor was previously an LLM lead at Meta AI, heading up the reasoning team. Previously he led the early work on LLM agents, and was the research lead on the Galactica project. Before that, he was a co-founder of Papers with Code, which was acquired by Meta in 2019. Before that, he has worked as a quant in sports betting and finance, and before that a policy advisor for the UK Government. He is currently working on a new startup.

Listen on Apple Podcasts, Spotify, and where ever you get your podcasts. For other Interconnects interviews, go here.

YouTube

Chapters

* [00:00:00] Introduction of Ross Taylor and his background

* [00:02:12] Papers with Code

* [00:09:58] Galactica, goals, controversy, legacy

* [00:18:12] Technical details of the Galactica model

* [00:23:18] Potential for language models to make scientific discoveries

* [00:25:21] Defining and improving reasoning in language models

* [00:32:38] Process-based reward models and their potential applications

* [00:35:00] Generating synthetic data for SFT

* [00:40:23] Evaluating the effectiveness of language models as judges for human preference data

* [00:42:43] Considerations for creating base models that are easy to fine-tune

* [00:46:45] Balancing SFT and RLHF

* [00:54:13] Characteristics of successful post-training teams

* [00:58:26] Future directions for language model development

We mention

* Galactica

* Papers with Code

* Rob Stojnic (co-founder of Papers with Code)

* DPO, PPO

* Armen Aghajanyan (Chameleon)

* Tom Scialom on Latent Space

* Soumith Chintala (PyTorch)

* Alex Graves

* Llama 3 paper

* Process Reward Models / Let’s Verify Step by Step

Transcript

Built with smol-podcaster and with love of Latent Space.

Nathan Lambert [00:01:07]: Today, we're here with Ross. This is a really exciting one. I've been trying to get Ross on the show for a while. Ross has done a lot of interesting work. And also the path to where you ended up with working on state-of-the-art LLaMA work at Meta is very interesting to me. So we're going to start with some of that, but then there are a few people that want to know more about reasoning and some of the RLHF stuff. We won't cover the secretive new start-up - I don't know what it is, but that's how it goes these days. I'm sure it'll be great. So welcome to the show!

Ross Taylor [00:01:41]: Thanks for having me.

Nathan Lambert [00:01:44]: So I wanted to start with Papers with Code. For people that don't know, Papers with Code is one of these platforms - I never was a heavy user of it - but it collates papers, people can upvote them, popular papers, attaching code and dataset and evaluations to papers, which is great - it was like sort of ahead of its time. It fits into a lot of these open ecosystem things. So I'm kind of curious, like, how you ended up there and why you all started this startup that ended up building this thing that got acquired by Meta?

Ross Taylor [00:02:12]: Yeah, that was a weird one. This was like back in 2018. So I was at an incubator, I just quit my previous job and I was like, okay, I want to do a startup. And I met Rob, my co-founder, who came along with me for the journey. We both came from different backgrounds. I was from a sports betting / quant finance kind of background, which is a whole other episode I guess. And Rob was in various startups, like applying ML to things like hate speech detection, that kind of stuff. And the cool thing was, we both resonated on similar kinds of problems within the ML space, even though we came from different domains. So we spent a lot of time doing various experiments, trying to make new kinds of ML tooling, thinking of these stupid questions like “what is the Git equivalent for ML?” - that kind of stuff. One of those experiments was hacking around on this little website to solve a really basic problem: I'm trying to reproduce this paper, but I can't find the code. That was the thing that really blew up beyond our expectations. It was weird because we thought it was fairly trivial at first.

Nathan Lambert [00:03:16]: What year was this? 2018?

Ross Taylor [00:03:18]: Yeah.

Nathan Lambert [00:03:19]: This makes sense. I think this was like, I was starting Deep RL then, but Deep RL was so hot, which was like the worst evaluation has ever been probably for ML. Like people complain about it today, but like Deep RL evaluation was like, every single person was just lying to make themselves look better.

Ross Taylor [00:03:38]: The interesting thing now is that the open ecosystem has shifted to focus more on weights as a central artifact rather than code. I think there's an interesting debate there. Would it be more useful to have the LLaMA-3 8B model weights or all the code for training LLaMA-3? I think there's still interesting debates to be had about what's actually useful.

Nathan Lambert [00:03:56]: I think the code would be more useful. Like OpenAI released their rules-based reward models, but it's like code washing because it's like just a bunch of people just released like eval code now. And it's like, that's a whole another tier is like actual training code versus eval code. But yeah, I guess I'll just skip ahead.

Ross Taylor [00:04:12]: So essentially Papers with Code was the thing that didn't die for us. We always thought we were going to make something else and Papers with Code was more of a marketing thing. But eventually we were like: okay, our users are telling us this is what we should be working on. And we expanded from that very simple use case of finding code towards indexing various artifacts in ML.

Another big problem was trying to find the state of the art in something like ImageNet and all these different benchmarks. There just wasn't a central place to find this information…So we had this quite good Christmas - me and Robert - where we hacked for the whole month, indexing every leaderboard we could and all the related papers. I didn't want to do any annotation again after that! But that took things to the next tier, and that's when things really started to blow up.

Nathan Lambert [00:05:03]: Because this is like the first round of leaderboards, because now it's really popular with Hugging Face again. And I was like, yeah, is that just because it became like a Meta thing and it's just kind of a thing that existed? You're like the first leaderboard company in a way, which I don't think many people think about. Yeah, which is weird.

Ross Taylor [00:05:19]: Yeah. And the interesting thing about us was that we never had to do any marketing because everything was from organic traffic. So you would type in “state of the art ImageNet” and we would come to the top as the most useful site. That was really the source of our growth, and we grew to a million MAU fairly quickly. And as for Meta, we were in touch with the PyTorch folks at the time who we really liked. You know - Soumith, Joe - those folks, and they had a shared interest in promoting the open source ecosystem back in 2018/19. And while it was like a tough decision, we were just like “we really like working with these people, we want to work more closely with them”, and that got us into Meta.

And then within Meta, we originally continued to develop the platform. But the big shift for us was that, even then, we saw we were moving to a world where compute was the currency. And we saw that, if we wanted to be well positioned in five years time, we needed to be building these large-scale systems. Even for our own platform, we had lots of ML in the backend and we saw we were using fewer and fewer models to do more and more tasks. So that kind of shifted us into research, into Galactica, and then eventually LLaMA and that kind of stuff.

It was a weird shift because we were product people who ended up doing hardcore research! But I guess it was natural to us that we were within a research org with these amazing people, lots of resources. It was just the best use of our time to conduct this shift.

Nathan Lambert [00:06:43]: Do you think there should have been more integration between Hugging Face and Papers with Code? It would have been wonderful if it had happened.

Ross Taylor [00:06:54]: The backstory is that we saw them as competitors, to be honest, because we had the same vision originally. We were going to do model hosting, that kind of stuff. But we never got into it because we hit friction with leadership - who was not onboard with that as a goal. Because from their point of view, it's like, okay, if we host these things, this might expose Facebook to some kind of legal risk. It wasn't in the perceived interest of the company.

Nathan Lambert [00:07:17]: This is a classic story of tech, really. They can't take the risk. They can't expose themselves.

Ross Taylor [00:07:23]: If you're a startup and it's your number one priority, then yeah, your attitude on risk is different. But I think it was a blessing in disguise for us because clearly the bigger wave was going to be large language models - we saw that incredibly early. And our mission was fundamentally not infrastructure, but something closer to: how do you organize information? It was a Google-y type of mission. And while we were focused on ML, we were more broadly thinking about science: how do we reduce friction for finding out about new advances and, I guess, lots of small tasks that when added up lead to a lot of progress in science.

Nathan Lambert [00:07:59]: I should have probably looked this up. Did you have another scientific background? Did you have a hard science background or what about Rob? Stojnic?

Ross Taylor [00:08:10]: Yeah, [Robert] Stojnic, my co-founder, he was from a bio background. So he's actually-

Nathan Lambert [00:08:15]: That makes sense.

Ross Taylor [00:08:16]: Well, he also had a computer science background. He was one of the original developers of Wikipedia, so he has his own crazy story…

Nathan Lambert [00:08:22]: Yesterday I was talking to somebody that was one of the original arXiv moderators. So we're digging all these things up…

Ross Taylor [00:08:29]: It is interesting because we both had this background, I would say, in building useful “utilities” [on the internet] at some point in our lives. I think Papers with Code is one of those things which is easy to forget, but if it went away, everyone would go crazy.

As for me, my background is more statistics and econometrics. My first job was in the Government, which I kind of hated. But I did a Master's degree, which I thought was going to be in economics, but the thing I ended up loving was time series and statistics. So I did all this research on state space models - before it was cool, I guess! - and then that got me into sports betting. And then eventually, we were using more and more deep learning [in the 2010s], and that’s how I got into AI. So a fairly nonlinear path. But -

Nathan Lambert [00:09:09]: Yeah. Well back to what you were saying on the scientific stuff, I think the Galactica story has many angles, and you led on this.

I think if people go look at the paper, it's a very interesting paper, like you cite Galileo in the first sentence, and it really has a lot of early modern language model features and quirks. It's something that people don't remember that well.

I'm very on the record saying the backlash was overblown. I think that was before there were clear habits and community norms around what language model demos should look like. So it was kind of in that teething phase.

But what was the actual goal that you wanted? You mentioned organizing the world's information. What was the goal and how close do you think the model came to accomplishing it?

Ross Taylor [00:09:58]: So there were several different things at once.

There were immediate product integrations we had in mind. We actually had an agreement at the time with Overleaf to be a “co-pilot for writing papers”. We'd have a really good LaTeX model in Overleaf, and whenever you wanted to include a citation, you could simply prompt for one.

More broadly, we imagined the future would be instead of..using more classical ways to find and extract information, if you wanted to learn about something like DPO, you would just prompt a language model to find out about it. Or if you wanted to ask “What's the state-of-the-art on SWE-Bench?” or something like that, you would just prompt the model and it would find the relevant information and answer the question.

Nathan Lambert [00:10:46]: So this is something that language models are so bad at. One of my challenge questions - I've been doing this for 6-12 months - is to ask models about DPO, and none of the models without internet access have yet done it right. You would think that it would start to kick in. And I don't just ask “what is DPO?”, I ask “What is DPO for language model fine tuning”, and they still just make up nonsense.

Ross Taylor [00:11:06]: Yeah, which actually relates to an interesting debate about LLM creativity. If you want to solve something like LLM creativity, you want to be confident about the frontier of knowledge, but frontier knowledge is where you have the most token scarcity.

But anyway, just to finish that thought. Bear in mind, we were developing Galactica while the whole Web 3.0 boom was happening. And we were in this weird state where we were like “All everyone is talking about is Web 3.0, but clearly generative AI is going to be the thing that powers the next generation of the web!”. So I guess that was our primary motivation.

Now, in terms of the [Galactica] launch, I think there's two aspects.

First, like you said, the paper. Now we were a small team of 7-8 people. We had so much fun developing these new ideas at the time: internal reasoning tokens, how do language models cite, training for multiple epochs…

Nathan Lambert [00:12:00]: What's that? A citation token? Did you have a special token for citations?

Ross Taylor [00:12:04]: Yeah. So we had a start citation token [START_REF], and we used two methods. The first was: we'd put the title of the paper within the citation tags. And the other one was: we'd have an alphanumeric ID.

The interesting thing was, it actually worked really well - but in the demo interface, it had a tendency to hallucinate - or “hallucitate”. The backstory is that, while the model was really good, for the demo we turned up the temperature to 0.7 so the text generation was better [at the expense of citation accuracy]. So generative citations were something that people thought didn’t work, but it was [more an implementation issue]. I guess that’s an alternative road in history…

So there was the paper, which was cool, and there was the demo, which I would say was motivated by the realities of the time. This was pre-ChatGPT and, even within a big company like Meta, it wasn’t a company priority to work on LLMs at all. So in our mind, our objective was - we were kind of deluded - being a team of 7-8 people, we were like…

Nathan Lambert [00:13:08]: This is how you have to operate if you want to be at the cutting edge. That's how great teams operate.

Ross Taylor [00:13:13]: So there were two objectives you could have had. The first is: you think that second-mover advantage is good. So you could wait for OpenAI to do something and then come in after and do it in an open way. And this is the path that actually worked for LLaMA. LLaMA was not state-of-the-art in any sense.

Nathan Lambert [00:13:27]: I've been doing this. I mean six months ago, maybe OpenAI and Google wouldn’t need to hire me because they know everything. But now I’m doing more interesting analysis where I'd be hired at a different role - but in the open. Now I'm like the person people look at. But I’m trying to tell people that “You don't understand! I'm six months behind everyone!”.

Ross Taylor [00:13:49]: Right, but to be clear, that’s a really important role - because everyone should have a stake in the future. And that's what the open ecosystem gives people.

But our objective was this: we didn't want to be second; we wanted to be first. And we were kind of deluded because we were 8 people - compared to maybe OpenAI with 200 people where their whole bread and butter was language models. But that’s why we were thinking “how do we move as fast as possible?”. And in our mind, a demo might be premature, but it would also be a way to get lots of prompts and information quickly - to understand how people would be using the model. And essentially the calculus we took was, we knew the community might not be ready for something like this - especially with the Meta branding - but we thought this was a way to get lots of information really fast and catch up given our position. Now in retrospect, history says that…

Nathan Lambert [00:14:33]: You kind of did that. I think Meta probably got the injection of language model reality from that. It's kind of like the Gemini backlash. I think the Gemini backlash - while it's obviously stupid execution - was potentially a good forcing function for Google's structure of their Gemini org - to really move everything into the way it is now. That made them be structured more like a serious language modeling org and less like Google, I think, which people don't want to hear...

Ross Taylor [00:15:07]: For us it was just a risk we decided to take. We probably took a lot more risk than we should have done. But we just thought “obviously this is going to be huge”, “LLMs are going to power the next internet”, etc, so let's take a risk. And you know, if we ran the universe several times over - it would have succeeded in some of those runs. But [in our universe], the criticism, which was obviously overblown, reached a critical point where things didn’t work out.

And then there's the story about the demo coming down, which - I’m not sure I’m able to talk about - but I think that is one of the things where, if people knew the true reasons, they'd be like “what the f**k!?”. But yeah, that's what happened…

Nathan Lambert [00:15:44]: Yeah, this is why any company that makes a demo now has block lists, where there's certain words that if they're in the prompt of the generation, you get a really, really stupid response. Even if it's like an open model, you just put like a little filter that's like, “you can't say the most obviously bad words”.

Ross Taylor [00:16:01]: But we actually did that and that created backlash as well. Because if you have false positives, you actually exclude some words which aren't actually offensive [in certain contexts], right? And then you also offend people… so it's not a win-win situation.

But if I have to look back at it now, I think with any new technology, it's never going to be absolutely better than what came before it. With LLMs, the relative comparison is with search. If you’re going towards search and information retrieval, you're prioritizing factuality as opposed to creativity, right? And the fundamental tradeoff with LLMs is saying, “I can trade off some amount of like factuality or ‘closeness’ to the corpus for some amount of synthesis and creativity”.

I don’t think that if we had a better model, it would have helped things at all. You could say maybe if [Galactica] had RLHF, would that have helped? I'm not too sure given that the project came out of [a big company like] Meta. Meta has a really good reputation now - people appreciate the open work they're doing - but at the time, things like the 2016 election were still in people’s minds. So I think the LLM revolution was never going to start at a big tech company, in my opinion. It was always going to happen at a company that had less reputational baggage. But I think it's pretty cool now that people see things differently. Because FAIR always had a really strong commitment to open science. It’s good that they're finally getting the credit for that.

Nathan Lambert [00:17:38]: Yeah. I have two technical questions on Galactica that I find really interesting. One is from Luca Soldaini at AI2. He said that you mentioned that the Galactica log probabilities (when producing citations) were proportional to how far in the citation graph the current paper was to the cited paper. Do you have any more interesting comments on how the latent space of Galactica actually worked? Because that is cracking the most important question of a language model for science - building a better latent representation of how the scientific information is organized.

Ross Taylor [00:18:12]: Yeah. So there were a couple of aspects to that. The first thing is we had this really nice graph that showed, as we scaled the model, the distribution of citations became closer and closer to actual citations - which is what you'd expect. But this was important for us, as our main worry was - because we were thinking about deploying to Overleaf - we didn't want to prioritize the most cited documents and create a “rich get richer” dynamic.

Nathan Lambert [00:18:38]: Google Scholar already does that. Were you re-indexing all the papers rather than building off like the Scholar graph or something?

Ross Taylor [00:18:45]: I think we were building off existing ones, using things like CrossRef…but there were lots of gaps that we had to fill. The other weird thing was that we saw some strange biases in the model. So if the model didn’t know what to cite, it would sometimes cite a general review paper, which is really weird emergent behavior. It was like the model was saying “I don't know a specific example, so I'll just give you a general overview”.

Nathan Lambert [00:19:11]: It's probably in the data.

Ross Taylor [00:19:12]: I think the thing that surprised me the most was multimodality. So we trained the model on SMILES formulae and protein sequences [alongside natural language]. And the thing that really surprised me was, we had tasks which we didn't explicitly optimize for - like converting a SMILES formula to a IUPAC name for a chemical. And if you actually looked at the attention as the model was predicting the next token, it would say something like “amino” and you could see in the chemical graph, it was explicitly attending to the relevant part of the sequence.

I found that amazing because we didn't train for it explicitly. That's the beauty of self-supervised learning. But I also found it highly ironic because some of the criticism of Galactica was “it’s ungrounded”. I was like “how grounded is this? The natural language tokens are literally attending to the underlying chemical structure!”. So that was kind of cool.

And then the other cool thing was: if you prompted with a protein sequence and asked “what is the function of this protein?”, the model was really good at answering those questions in natural language. That was awesome for me.

Nathan Lambert [00:20:33]: There's another prompting thing that I had known of [for Galactica], which was asking the model to do open-ended generation tasks. The models are still out there - people can spin them up and do demos on their own - but if you asked it something that people think of for ChatGPT - e.g. write me a poem about a sad goldfish - it wouldn't work unless you put it in a header format. It was markdown, I think? If you prompted it in that format, it would actually do a great job.

Ross Taylor [00:20:57]: Yes, so in the Galactica demo, a lot of people were being malicious with this type of prompting for markdown articles. But I did enjoy some of the creative ones. Someone was like: write me a theorem on finding a girlfriend, and it was some of the most hilarious model output I’ve ever seen. And people also generated some amazing sci-fi…but then I think some people took it too far. But whatever. I guess it was a traumatizing experience for me at the time. But with the benefit of hindsight, I was also fun in some sense, I guess.

Nathan Lambert [00:21:30]: Yeah. It makes you understand the bigger context of the work much faster than you would otherwise.

Ross Taylor [00:21:37]: It was actually crazy at the time. So many people were using it. Even then we could see that - while it wasn’t a product - we could see that most systems were going to be designed in a similar way.

I think the interesting thing was how the winning form factor in the end was like a chat interface - you know, with ChatGPT being the winning UX. I think that was actually a big part of the story [why they succeeded]. There's a debate on whether RLHF is actually a capability advance or whether it’s just alignment…but a big part of the story [for ChatGPT’s success], in my view, was the kind of UX of how you interface with a language model, rather than the actual capabilities. But I think it's obviously not monocausal at the same time. There were several factors at play.

Nathan Lambert [00:22:25]: Yeah. So the last thing on this is that you mentioned in our e-mails about language models, creativity and making discoveries. What do you mean by that? Is that the agent-like projects you worked on at Meta?

Agents are largely something that I don't have too much comment on. I'm taking the approach of wait and see what we actually get, because there are a lot of practical approaches that I think will be reasonable. People use language models for basic formatting, for code, etc. But it's easy that if they have a little bit more feedback for things like writing a paper - e.g. find me a citation for blank and justify your answer - that step is something that I think will come. I don't know how expensive it will be to run, but is that what you mean when you think about making discoveries? Is it more autonomous? Is it a grander vision? Anything like that?

Ross Taylor [00:23:18]: I think it's more like this: the killer use case right now is information synthesis. For example, I use Claude a lot more than Google now because it combines information in a better way and sometimes generalizes well to things it hasn’t seen before.

But a really cool thing would be: can a language model answer a question which is more out of distribution? That we don't see in the training data?

So an experiment I've never done because I didn't have to compute would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?

Because presumably, with all these major discoveries, it’s never out of the blue. You’re standing on the shoulders of giants, but there’s still a lot of thought and inspiration you have to do to get to those great ideas. So that's the setup. But the creativity problem is, by its very nature, hard to benchmark.

Maybe this is a digression, but my problem with the field right now is: we’re in a situation where we've almost solved a benchmark like MATH, which is a very hard benchmark, in my opinion, at least Level 5 MATH, but I don't think we've really cracked something like reasoning. So I think it's like a whole different question about how you even evaluate these frontier tasks. But yeah, hopefully that gives a flavor of the kind of questions here…

Nathan Lambert [00:24:58]: Yeah, we can go into the reasoning conversation. I think reasoning in RLHF will take up however much time we want to keep talking. I guess we can start with the basics. What do you think people that are using language models think reasoning means? And what is the way that you would interpret what you're trying to do in improving the reasoning capability of a language model?

Ross Taylor [00:25:21]: So there's a lot of controversy on this on Twitter/X. And I think people are talking past each other because sometimes people mean different things by reasoning. At a very granular level, is legal reasoning fundamentally the same thing as mathematical reasoning? Common sense reasoning? I guess my very basic definition is that reasoning is the process of drawing conclusions based on a body of observations, or in the case of deductive reasoning, basic premises.

Nathan Lambert [00:25:50]: So math is like a subset of what you think about.

Ross Taylor [00:25:53]: Yeah. And then I guess the bigger circle is the broader topic of outcome directed behavior. I have an idea of an outcome I want to achieve, but what's the best path to get there?

And then in the LLM space, I think this problem broadly equates to the technical problem of how you use compute to get from your question to your answer. In the old days, you would just prompt the language model directly. You would just put in a GSM8k question, put in “Answer:” and then parse A, B, C, D. So you're relying on the forward pass.

Nathan Lambert [00:26:27]: Yeah, like the FLAN data is really weird. That's a popular one that people used to train on this stuff.

Ross Taylor [00:26:33]: Yeah. And then came chain-of-thought, scratchpads, with Galactica…all these ideas of using the context window to do intermediate computation. And the more recent, although to be honest, it's actually quite an old idea, is: you have chain-of-thought, but how do you better learn the internal reasoning tokens that get you to your answer? So things like, you know, Quiet-STaR and variants of this idea.

Nathan Lambert [00:27:01]: Claude now shows you when it’s thinking, and in the Claude system prompt, it has information on how many tokens to take to think about a question. We're all thinking about trying this stuff and it's all so hard.

Ross Taylor [00:27:11]: I think it's a question of how do you learn those tokens? For us, the original thing we did was just supervised learning. So we trained on some examples and let the model generalize to know that it should do the thinking in between some tokens. There are more sophisticated ways you could achieve this nowadays.

Another point is this: there’s an analogy that’s often used about language models, that they are “thinking out loud”. I actually don’t like this analogy at all. I think “thinking out loud” makes you think there’s something wrong about this kind of thinking in token space. But it’s not clear to me that the alternative - or these old adaptive computation ideas - are any better, actually.

Nathan Lambert [00:27:58]: What do you mean by adaptive computation? Because I mostly think of “thinking out loud” as being like chain-of-thought or generating its own explanation before it gets to an answer. What would adaptive computation be?

Ross Taylor [00:28:09]: So there's a paper by Alex Graves, who wrote all these amazing papers ~10 years ago, which had a lot of foresight. He did stuff like the Neural Turing Machine paper. Adaptive computation is the idea of, instead of having fixed compute between your input and your output, you can extend the forward pass to do things better, like arithmetic, where you have to maintain/manipulate state.

When chain-of-thought came out, there was an impression that it was a bit of a hack, because you're thinking in token space whereas you should be finding a way to make the forward pass dynamic. Universal Transformer is another variant of this [adaptive computation] idea. But I think there needs to be more empirics on which approach is actually better to maintain and manipulate state. I used to be more in favor of thinking, OK, chain of thought is more of a hack, but now I actually think it's probably…

Nathan Lambert [00:29:02]: What do you mean by state, like the state of the problem in that sense?

Ross Taylor [00:29:08]: So imagine that you're doing a GSM8k question, where John originally had 100 apples, then Jane gives him five apples. He has 105. And then he gives 20 away to like Susan or something and he's left with [85 apples].

So if you’re prompting the language model directly for the answer, you're expecting the language model in that forward pass to maintain and manipulate the state in a latent space, whereas the way chain-of-thought does it is in token space.

So you essentially output the intermediate steps. One of the problems with reasoning is that we have no idea how humans mechanistically reason…but if you think about how you'd solve a GSM8k problem in your head, then to me this seems a lot closer to something like chain-of-thought than adaptive computation.

Nathan Lambert [00:29:57]: Especially when you look at the architecture and attention mechanisms. A Transformer is really good at copying. So if you keep feeding in the recent information, it copies that in some way. So I think chain-of-thought and all of these things, I mean, they're only growing in popularity in my mind, along with Quiet-STaR and these kind of methods. I’ve heard the rumors about self-explanations and all these special things. The LLaMA-3 paper has all these special tokens. I don't know what all of them do, but I can see the direction. The state is stored in context and in special formatic tokens if it needs to be.

Ross Taylor [00:30:37]: So the other big picture thing is this. With the internet, you’re only seeing the output context.

So take StackExchange. If it’s a good answer, the author probably hasn’t just responded by generating words left-to-right. Maybe they’ve looked something up, maybe they’ve done a back-of-the-envelope calculation, either explicitly or in their head, right? And the internet is missing those “internal tokens”, essentially.

Now this isn’t always a problem because the models can learn how to construct them. And the effort now is to make artificial latents / internal thought, through RL or otherwise. But I think this is actually a much bigger question, which is more than just reasoning. In the end, as models become more capable, we’ll be talking more about how we can make them human-like in the way they can answer questions and solve tasks. For example, in some situations we might like the models to have [human-like] empathy, which is also “missing” in some sense.

So my prediction is that this becomes a bigger deal in the next few years: caring more deeply about the computation these models perform to reach a conclusion. And that will be the essence of alignment, in my mind. But that's a big topic!

Nathan Lambert [00:31:50]: OK, I have a long list of specific questions on this. My first question is about process reward models.

I think the canonical paper is “let's verify step by step”. My whole gripe is that it’s hard to create the data. That’s why they don’t exist in the open. But I’m guessing you can just label data with GPT and ask for feedback on each step, and just use that as an “LLM-as-a-judge” to get reasonable step-by-step labels on process rewards. But there’s so little work on this, so I don’t know if it is worth exploring. There is some research from Meta - I think Alex Havrilla did a couple of internship projects which related to this, and he’s good - but there’s such a lack of signal.

Is this something that people should work on more, or is it too complicated? Are there simpler things to do?

Ross Taylor [00:32:38]: Our big direction was integrating outcomes into reasoning - because next token prediction isn’t the objective we actually want to optimize. So the two ways to integrate outcomes are through something like PPO or inference-time search. And in both cases, you want a good reward model or value model.

Instead of (human-annotated) “process based reward”, we were exploring ideas along the lines of Monte Carlo policy evaluation (MCPE), where the key problem is how to learn a value model. It’s maybe a separate topic, but it’s underappreciated that something like MCTS - which in the public imagination is this inference-time search technique - actually has its real magic in giving you a value network for free.

This is why it was introduced in Go, because humans couldn’t come up with good heuristics for evaluation. So if you have something like MATH where you know the answer, then the question is how do you assign step by step feedback? It doesn't have to be MCTS, but something where you backprop the outcome to these individual steps is a way to get this dense feedback.

That's a way to get “synthetic process reward”. I should stress that PRM and MCPE are actually different things. Alex Havrilla was doing something along these lines also - but anyway, hopefully this gives a sense of the approach we took.

Nathan Lambert [00:34:21]: When Q* came out, that's something that I thought it might be doing. Instead of chain-of-thought, there's this idea of tree-of-thought. You could swap in the reasoning steps. And then if you could get labels on all these reasoning steps, you’re doing search over a reasoning space - which I would expect to work, but I think it needs the right datasets. I think a large part of the open alignment community right now is underappreciating datasets, where there's a lot of focus on methods, but we don't even have the datasets to use the methods… Like, why are you coming up with seven DPO variants if you don’t have the right datasets? I understand academic incentives, but if you are not an academic, you don't need to be doing that…

Ross Taylor [00:35:00]: It's an interesting question, because I guess the first chapter of LLMs had a lot of reliance on human annotations. In a way, that's a barrier to entry for the open community, because big firms can afford to pay millions for it but open source developers can’t. But more recently, you've had the rise of things like constitutional AI [and RLAIF approaches], which I believe are comparable to human-annotated datasets anyway. So is that a good thing for the open community?

Nathan Lambert [00:35:31]: I think it is, but human preference data might be a leg that is hard to remove. One of my latter questions was: can we actually do LLM-as-a-judge for human preference data fully? I think is the critical step that we don't have an answer for. Everything else in the modern RLHF stack is becoming more reproducible in the open.

And that relates to a question I have on synthetic versus human SFT. I think Thomas [Scialom] said on the Latent Space podcast that we just use generations from the model because they're better for humans on a lot of SFT tasks. Apple had a quote in their foundation model paper saying the same thing.

So I’m thinking, shouldn’t we be redoing all of our generations for our SFT dataset with the latest GPT-4 or LLaMA-405B? Why are we using GPT-4 from March 2023? That model was not as good on reasoning. So we have headroom there on synthetic data. We have prompts that we could reuse, but we don't have the right preference datasets - datasets like UltraFeedback are not big enough. And I think they're not in the same style that a lot of labs are doing this preference tuning - where it's on-policy generation.

We tried to work with Scale at Hugging Face to do this, where we had our own SFT models. We were getting data from Scale. We were labeling it every week and we were trying to retrain the models and we weren't getting a signal. This was last July/August. So we just didn't really know what we were doing. But I suspect that what people in the open should be trying to do is generating a lot, labeling it…That was a light bulb moment for me recently. This is what we have to do, but no one has done it.

Ross Taylor [00:37:21]: Yeah, I think it's definitely underappreciated how you can get better answers than a human by sampling the models [enough times]. You mentioned that Thom made this point early on in the [LLaMA] project, but you'd be surprised how this extends to reasoning as well. Even with the Galactica model - which is now an ancient model, a bronze age model - the pass@100 on GSM8k was 98%. And it's absolutely crazy to me that even now people are using GSM8k as a benchmark. In my mind, that benchmark was solved several years ago.

It’s a subtle point because the zero shot performance was ~48% but the pass@100 was 98%. The insight there is that the model already has knowledge about how to answer correctly, it's simply not reliable. This tells you that you need to invest in reward models, process based reward, outcome based reward, everything we talked about earlier…

But the same applies to the general RLHF pipeline. If you asked me to write a poem in the style of Bertrand Russell but also mix in Snoop Dogg’s style, then I couldn't do that. But the model has knowledge of how to do that, right? So why wouldn't you sample the model?

I think now with LLaMA-3, and the 405B model being out, it’s going to be good for the community that they can use it for generating data synthetically. And I'd imagine the quality will be good enough if it's done the right way.

Nathan Lambert [00:39:30]: Yeah, I think it should be doable. But there's a fundamental question of what do we think the human preference data is doing? [Compared to] model labeled preference data, is the noise that the humans provide of a different distribution that makes the human preference data better? I don't have a lot of signal on this, but I would love to know because I would guess that Meta would love to eliminate the $10 million plus estimated human preference data spend if they could. Meta is a reasonable company…

Ross Taylor [00:40:23]: Yeah, I don't know. But here’s something that surprised me. I was originally skeptical - at least on the reasoning side for LLMs - about LLMs marking their own homework. I thought they would eventually have that capability, but I wasn’t sure…

Nathan Lambert [00:40:40]: how fast.

Ross Taylor [00:40:41]: But the interesting thing we saw was as follows. We had experiments where we’d have a LLaMA-2 model that we’d sample generations from to train ORM models, and then we’d train different reward models on this data with different base models.

What we saw is that, the better the (underlying) base model, the better the reward model was for evaluating. And there were very clear patterns we saw: as the base model scaled, so did the quality of the reward model.

So that tells you that the knowledge is not in the ORM samples that you've fine-tuned the base model on. The knowledge on how to judge is within the model itself. And the pattern was so clear in the scaling. I concluded that eventually these self-verification approaches would work. It was just a question of when they would start to work for different types of problem.

Nathan Lambert [00:41:31]: Yeah. Model capabilities are also getting more dense which helps as well. Like with smaller model, there's all these experiments with better data, showing that you get a better model with X% reduction, which is kind of off-topic…

To double-down on what you said, I think this is one of the things I also debate: what makes a good model for downstream fine-tuning? I think in the LLaMA-3 report, they train the reward models directly on the base and not on the SFT model. The Apple report mentioned that they don't just use their evaluation suite for SFT models, but they evaluate with a reward model to see what is ready for RL.

I think, especially in the open, if you want the people to adopt your base model, there's a big gain in making it easy to fine-tune. For example, LLaMA has been pretty good; LLaMA-2 especially was really good for fine-tuning. There's also been base models that don't really work for fine-tuning, partially due to bugs and partially due to the state of the optimization. Is this something that you have any insight into?

Ross Taylor [00:42:43]: Yeah, I don't think I have enough insight into it to say, but I think it's definitely something that's been undervalued. I think the view of a lot of open model providers is: you get the model out, get good Open LLM Leaderboard results, and it's mission accomplished. But the real evaluation is in two days time when you get anon accounts on X saying “I'm fine-tuning this LLaMA model, it's not working”. And when you see a pattern with this kind of behavior, you have to conclude something is wrong…

Nathan Lambert [00:43:11]: It's always a chat template thing. A lot of it is a chat template thing, but those problems do get ironed out eventually. There's this whole idea of annealing and staging pre-training. I can't tell if it is boosting current capabilities at the cost of later capabilities. I think in a few years, this will all shuffle out and it's just how we do evaluation in stages. So you're always going to optimize for the right metric.

Ross Taylor [00:43:50]: There's two points to that.

The first is about annealing. It works for the kind of benchmarks people focus on the most, but then there's a question of whether you are actually just collapsing the task distribution of the model to things you're measuring - and not the true task distribution used by the community.

And I think there's a second point - which is maybe too much of a digression - but there's an interesting debate to be had about data quality being a bit of a misnomer. In a sense that when we say “data quality” we're actually saying “this data mix works well on these benchmarks”. But if you take a “No Free Lunch (NFL)” kind of approach to this, you must be hurting task performance somewhere else, right?

Nathan Lambert [00:44:34]: Yeah, I think I’m on the record of being an AlpacaEval hater. I say this all the time, because I think AlpacaEval is sacrificing actual usefulness for their own metric. If you get a 1-2% bump on alpaca eval, maybe that’s great. But you could be getting a 10-20% bump while sacrificing actual chat abilities.

We released some models trained with PPO and our PPO models are not very good at instruction following because they don't follow modifications like be concise or some stylistic things. They're also so yappy. They just say so much…but they do well on metrics and PPO especially helped AlpacaEval. So we had to figure out how to kind of use that signal without overcooking it.

Ross Taylor [00:45:16]: Yeah, it's like a whole discussion about evals, I guess…

Nathan Lambert [00:45:21]: We could come back to evals in a second. The last question that I have is: there's multiple trends like LLaMA-3 downplayed the importance of instruction fine-tuning relative to RLHF. I think there's other quotes in [Thom’s] LatentSpace podcast talking about it. Nematron also had this report where they use SFT and then multiple stages of RLHF.

I think DPO versus PPO is overblown and that'll kind of be a wash eventually. Everyone knows DPO's advantages of being simpler. But my question is this: are there certain capabilities that only come for RLHF, and people trying to do them with SFT are just wasting their time?

I always thought safety was in this bucket where it kind of makes sense - it’s hard to train a model to refuse just with SFT. But with something like reasoning, are there certain sequencings where SFT primes you and then RLHF really helps reasoning or code? Because it seems like OpenAI is really leaning on PPO to help with reasoning and code?

Ross Taylor [00:46:45]: Yeah, I think there's two ways to answer this question. First, maybe the history of this debate on the LLaMA side, and then something on the reasoning side.

So the history is quite interesting. I would say, you know, when was it? 2023? My dates have been wrong since the pandemic…But this just was after ChatGPT. There was actually a debate internally in Meta about using RL, and a lot of senior people were very skeptical. I would say the view was…

Nathan Lambert [00:47:13]: Not just at Meta. You can see when different companies embraced RLHF, if you really start to look at their models…

Ross Taylor [00:47:22]: The view was that RL was a dead end. And that even DeepMind was moving away from RL at the time, so you should just do SFT.

But, you know, at least for the folks in the Galactica team that came to lead post-training for LLaMA, we were quite scarred by hallucinations! We were definitely of the view that we needed to have the right objectives, and that we needed to make sure language models could “know what they don’t know”. So we were quite high on RL from the beginning. And eventually, I think the LLaMA-2 paper showed that a lot of the advances in helpfulness/harmlessness were via the RL stage. So I think that approach was fairly vindicated.

On the reasoning side, I would just say it’s quite simple. It comes back to the next token prediction objective not being the actual objective you want to optimize. The objective you want to optimize for reasoning is: do you get the right answer or not? Especially since reasoning is a high precision task. If you get one token wrong, unless you have a backtracking capability, you’re never going to recover…

Nathan Lambert [00:48:32]: That's a throwback, the backtracking token. Sorry, that was a random paper! That is interesting…

Ross Taylor [00:48:38]: Yeah, all these weird methods… But I think on your question, there is a point at which these techniques kind of overlap, right? So if you're, you know, doing SFT with rejection sampling: you’re doing something close to PPO anyway. And the same for reasoning: if you sample the model and pick the trajectories that your verifier says are correct, and then do SFT on that, it is a form of RL.

The final point I’d make is this: I would say the community overreacts to certain methods being used by popular models. They think: this company uses DPO because they must have found it's fundamentally better. But actually, it's usually due to either practicality or…

Nathan Lambert [00:49:22]: Yeah, that's what I think.

Ross Taylor [00:49:24]: You have a 405B model, and if you want to do PPO, you need to have a policy model, a reward model, value model etc in memory, and it’s not like…

Nathan Lambert [00:49:33]: Especially with DPO. I think with the 405B, I'm guessing what you did was cache the reference model. You could cache the log probabilities from the reference model. So you don't need to keep them in memory when you're doing the loss of the primary model. For DPO, you don't even need an extra copy of the model in memory, which therefore means you can use the same exact stack that you use for training. So you don't have to comment on this. But I think that's probably partially why LLaMA-3 just used DPO...

Ross Taylor [00:50:07]: Yeah, I think people don't appreciate how compute works either. People assume the big companies have so much compute - tens of thousands of GPUs - so compute isn't a constraint. But all these things are subject to Say's Law, right? If you have more compute, you're going to train a bigger model. And then you're going to hit the constraints again. It’s like the old thing of trying to solve traffic by building another lane. But if you create another lane, people will use that lane of traffic.

So practicality is still a factor [behind choosing methods]. Also things like which researcher is in charge, what’s their favorite method, and also politics as well.

So I think the community has made a mistake of overreacting to these choices. There was a mixture-of-experts phase too, right? I don’t think there’s anything inherently better with either method (dense or MoE), they just have different trade-offs, and it depends on what you are trying to achieve. If you’re serving lots of people with inference, then maybe a MoE approach is better. If you’re optimizing for something simple that’s easy to train and gets good results, maybe you favor a dense approach - although that’s debatable whether it’s easier to train. But I don’t think these things are clear cut.

So I would encourage people to not just copy things because they're in a paper from a big lab. I would encourage people to try things out themselves to know what works, and figure out what the problem is that you’re really trying to solve.

Nathan Lambert [00:51:20]: I think people don't have enough long term direction in their decisions. People are not trying to make decisions about what will be right in 10 years, they are trying to get a model out as soon as possible. So there are very few people with the incentives of trying to understand in the asymptote, which method is better… I might have that incentive, because I'm a nerd, and I have an audience that is okay with me writing four paragraphs around esoteric nerdy topics, but for all these companies, that is not a real incentive.

Ross Taylor [00:51:53]: The other point I’d make - maybe it is a separate thing - is this. I made this mistake throughout my career of focusing too much on novelty and complexity.

So in my first job in sports betting, we were making models for horse racing, football, that kind of stuff. And I always had the perception that other funds had really advanced, cutting-edge, complex models - but that wasn’t the case at all.

I think there is this tendency within deep learning to assume that - especially for the secret labs - that their good performance is due to some secret, amazing method. But more often than not, good performance is due to lots of small things from different people combined into one model. Really, lots of simple things done well and solid execution. And frankly, for big firms a lot of brute force too, right? Because big companies are naturally slow. But once they find a way to mobilize resources, they’re very intimidating and hard to beat. If you’re in a big company, and you’re aware of this, which approach are you going to take: are you going to prioritize novelty or are you going to do brute force if you have 10,000s of GPUs?

So I would encourage people not to be too intimidated by this perception that the big labs are smarter. I don’t think they are.

Nathan Lambert [00:53:03]: They're earlier but they're not necessarily smarter.

Ross Taylor [00:53:09]: Yeah. So obviously the constraints are different because of less compute in the open, but still: you’ve got to use first-principle thinking and be empirical as well, and just follow that path.

Nathan Lambert [00:53:21]: Yeah. So following up on this, there's a lot of discussion around what the processes are for making a successful foundation model lab. I think Armen has been talking about a few things on Twitter with great visualizations around de-risking pre-training based on FLOPs efficiency. Do you have any comments on what makes a successful post-training team and project?

I've talked to John Schulman a couple of times - he's been the king and started all of this - and OpenAI is still looked at as being the leader in the space. I think they've always been top on Chatbot Arena, and have cracked what most people like in the style. They started early. Are there different considerations for the post-training side of things rather than the pre-training side that we might hear more about?

Ross Taylor [00:54:13]: Yeah, there's probably better people than me to answer. So in our team, originally like Robert (Stojnic), my co-founder, he was kind of managing the post-training team. And then I'd say Thom Scialom was doing a lot of the work. And then more recently Rui Hou - he kind of flies under the radar a bit - but he’s been doing a lot of the work. They are all better placed to answer than me, since I was focusing on reasoning and agents.

But I think the key thing is this: post-training is just a lot of iteration. Frankly, lots of hard work - e.g. making sure at each round of RLHF you’re not regressing in certain ways, filling holes, etc. I guess it’s hard to put a finger on a single thing, but…

Nathan Lambert [00:54:58]: There's simple things like I'm trying to get people to talk about more. I’m trying to establish a good vibe test about internal culture. How do you vibe test for a good post-training culture (or for reasoning)? I remember somebody at Anthropic told me there’s still a lot of cases where you just put your finger up to the wind and you're like “model good”. And I'm sure that is still happening. And that's just a simple cultural thing of telling the team that you can’t always trust all of your numbers.

Ross Taylor [00:55:26]: I think it is maybe a more fundamental question. I wasn’t there at the early days of FAIR - I came in 2019, but FAIR was always a very bottom up organization. Which is a great thing: that's why things like PyTorch emerged. But the real insight as to why OpenAI was ahead historically, at least until recently, was that they had more of a top-down culture and focused bets. They saw the potential of LLMs early on and it was a top-down prerogative of the company to focus on that. And in essence, it was more of an engineering problem than it was a research problem in a lot of ways.

Relatedly, I think a lot of people were surprised that the LLaMA-3 paper wasn't as “novel” as they were expecting. But that just reflects the fact that a lot of it is just engineering and engineering is really hard - a lot of hard work. Not always a lot of new methods, but it is a lot of hard work.

Nathan Lambert [00:56:22]: Yeah, we're starting our next fine tuning model and everyone's asking me: “what should we work on?”. I'm trying to tell them “we just have to filter data and generate more completions”. We’ll have a lot of prompts, we have to filter them, generate completions from good models, and then we’ll have to generate more completions and keep doing this process…And in 10 weeks, we'll probably have a very good open model. We’ll just have to be boring for 10 weeks! And we have like 10 people involved.

So it's a bit of a bigger project, which I think is the right way to do it. We have just started getting improvements on IFVL by copying Nemotron. We use some open math datasets and the math scores are getting closer to LLaMA. It is really the simplest things ever. It's like browsing Hugging Face and being like, “NVIDIA released some JSON format data, some instruction format data, like we add it in and the numbers go up”.

Ross Taylor [00:57:16]: Yeah, I think I said earlier, but it raises an interesting question where this kind of approach - of grinding until the open LLM leaderboard numbers get to 100% - I think we’re going to get to a situation where all the benchmarks are solved, but where we haven't really, in my mind, at least solved intelligence.

What does it mean that we'll get close to 100% on MATH, you know, without any inference time search? I think sooner or later, while it looks like we’re on an exponential with LLMs, we’ll realize we’re actually on an S curve. Eventually we're going to get back to this mode where we have to do new things. And I think that's great, because that's what motivates me.

But yeah, I think there's waves, and we’re in this heavy exploitation mode right now with LLMs - away from the glory days of architecture exploration. But my hope is that we'll get back to the stage where, after exhausting all the [current] benchmarks, we say: OK, now we need to do something completely different. But who knows?

Nathan Lambert [00:58:26]: I see it similarly. I think we still have a year or two, at least in the open. If the closed models start saturating and they start doing things differently, that's fine. But eventually it'll all get there. And in that phase, I mostly keep working just to make sure that the ecosystem doesn't fold in on itself. So that's probably the one-sentence summary of what I'm doing these days: add transparency so that regulatory capture doesn't nuke everything. And that's fine, but I think it's still going to be longer than people expect. I don't think we have true signs of saturation at the top. We'll see what GPT-5 does - if GPT-5 never comes out - and then we’ll really know.

But it seems like it's going to come. I think there's enough signs that it'll come eventually. I think I don't know the answer to this - and it's not really our expertise - but I'm interested in the potential architecture of GPT-5 and if it's GPT-4o like and they're using more multimodal data to try to keep the data engine going relative to just going bigger. I don't know the answer, but that's kind of the future questions I’m thinking about.

Ross Taylor [00:59:34]: In my mind, like three years ago, the thing on the horizon I saw was agents. That’s where a lot of people are working right now: long form tasks where an agent doesn't have to answer a question immediately, and [can instead] go away for a while doing some research and answer later. I think that will take up a lot of time in the next five years.

It's both a compute problem of bigger models - more scale will do better - but also a data problem of how do you generate these trajectories? How do you get reliability? So it’s more successful and less error-prone at each step. And I think in principle it's solvable, but I just think it would take some time.

Nathan Lambert [01:00:18]: Yeah, it seems that engineering is required. It doesn’t seem like something that's just going to emerge. It's building a whole system and scaffolding around agents. Just unglorious work.

Ross Taylor [01:00:32]: Yeah.

Nathan Lambert [01:00:34]: OK, anything else you want to add? Do you want to get people excited about your start-up or is it too early? Maybe too early, yeah?

Ross Taylor [01:00:43]: Yeah, what else should I say? It has been nice to step back for a bit and look a bit ahead into the future. For me, my best days creatively were my teenage years when I got back home from school and spent the rest of the day programming. It’s quite nice to feel like that again: to be in that zone again where I can shut the world out and do some work.

But maybe just to give a hint of the areas I'm interested in, I think it comes back to this problem of how alignment is going to be a process of making AI more human-like. For example, how do you control for things like deception - which Anthropic has done a lot of really good work on.

Essentially… the latents of AI are [potentially] misaligned with human latents, and the question is: what do the human latents look like anyway? And how do we model these things?

That is very abstract and high level, but that is the fundamental question I want to work on. But yeah, I think I can talk about it later in the year!

Nathan Lambert [01:01:49]: Yeah, sounds good. Thanks for coming on. This was great. I think people are going to get a ton out of this. I think just a very sensible conversation on fine-tuning, reasoning and some of the things that got us here. And that's what I was hoping to get out of it, so thanks again!

Ross Taylor [01:02:06]: Yeah, great to talk, Nathan. Have a good one!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

A recipe for frontier model post-training Aug 07, 2024

Apple, Meta, and Nvidia all agree -- synthetic data, iterative training, human preference labels, and lots of filtering.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontier-model-post-training

00:00 Llama 3.1 post-training and the new normal for RLHF
01:18 A new standard pipeline
01:45 Human preference data
02:59 Scaling RLHF
05:03 Synthetic data
06:10 The new normal
06:51 Data quality is king
07:18 Apple confirms the new normal

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_020.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_031.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_033.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_035.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Sebastian Raschka on the state of open LLMs, Llama 3.1, and AI education Aug 01, 2024

This week, I had the pleasure of chatting with Sebastian Raschka. Sebastian is doing a ton of work on the open language model ecosystem and AI research broadly. He’s been writing the great Ahead of AI newsletter (that has the biggest audience overlap with Interconnects, at 26%, so a lot of you know him) and multiple educational books, all on top of being a full time machine learning engineer at Lightning.ai, where he maintains LitGPT, which he described as being like Karpathy’s NanoGPT, with slightly more abstractions.

This conversation mostly surrounds keeping up with AI research, the state of the open LLM ecosystem post Llama 3.1, and many narrow topics in between. I learned that Sebastian used to be an Arxiv moderator, which gives some simple color on how Arxiv and sifting through thousands of papers works. We cover a lot of ground here, so I hope you enjoy it.

Listen on Apple Podcasts, Spotify, and where ever you get your podcasts. For other interviews, go here.

YouTube

Chapters

* [00:00:00] Introduction & Sebastian’s background

* [00:04:28] The state of deep learning and language models in 2018

* [00:08:02] Sebastian's work at Lightning AI and LitGPT

* [00:12:23] Distillation and its potential in language model training

* [00:14:14] Implementing language models and common pitfalls

* [00:18:45] Modern architectures: Mixture of experts models, early v. late fusion multimodal

* [00:24:23] Sebastian's book on building language models from scratch

* [00:27:13] Comparing ChatGPT, Claude, and Google's Gemini for various tasks

* [00:38:21] Vibing and checking new language models during implementation

* [00:40:42] Selecting papers to read and moderating Arxiv

* [00:45:36] Motivation for working on AI education

* [00:52:46] Llama 3 fine-tuning

* [00:57:26] The potential impact of AI on jobs in writing and education

* [01:00:57] The future directions of AI

Transcript

Built with smol-podcaster and with love of Latent Space.

Nathan Lambert [00:00:00]: Hey, Sebastian, welcome to this kind of interconnects, normally researcher interviews. You were a professor, so that definitely counts. You do a lot of different things these days. Let's get talking into language models. Welcome. Yeah.

Sebastian Raschka [00:01:35]: Thanks so much for the invitation, Nathan. I'm a big fan actually of the interconnects newsletter, so I'm hoping we can have some fun chat about research, LLMs, and what's hot these days, basically. Yeah.

Nathan Lambert [00:01:48]: I have a little section on the end, which is keeping up with AI research, writing about AI and process, because you do so many things, but I kind of want to jump into how you got to AI, because you have an interesting career path. So you were a professor at Wisconsin Madison for years. I saw in statistics, which ... I also went all the way back to find your PhD thesis, which was uncovering hidden patterns of molecular recognition. So this was a while ago, and is this kind of ... Can you explain your background and how you got into AI? I'm guessing it's through computational statistics or something like this.

Sebastian Raschka [00:02:24]: Yeah. Close. So yeah, you did some research there. Interesting. So yeah, it's been a long time since my PhD thesis. This is maybe seven years now. And back then, it started even earlier when I got into AI, that was like, I would say 2012-ish. I was in grad school and I was taking a statistical pattern classification class. And in that class, yeah, the star of the show was basically naive Bayes classifiers, or in general, Bayesian methods for pattern recognition. And from there, I kind of really got into machine learning. So there was, I would say, more statistical-based, but it was all about classifying things. And then I think it was also right about the time where Cozera was launched, and I saw Andrew Ng's Cozera class. That was, I think, the first class in 2011-12 back then. And yeah, that's basically how I started from statistical pattern classification into machine learning. And I applied that for computational biology problems like molecule and drug discovery, like pharmaceutical drug discovery. And yeah, from there, I joined at some point after my graduation, the University of Wisconsin in Madison, where I was in the statistics department, but I did mostly deep learning research, essentially. I was the only one basically doing Python, deep learning, machine learning stuff. So yeah.

Nathan Lambert [00:03:48]: What year was this, and what did it look like at the time?

Sebastian Raschka [00:03:52]: That was around 2018, I think August 2018, when I joined the department. And yeah, I mean, so it's the statistics department, but my work was technically all machine learning and deep learning. I mean, a lot of students were really excited about learning machine learning. I think it was just around the time where it got really popular. And yeah, I was teaching machine learning and deep learning classes as well. They were always like, you know, full and crowded, like a lot of students were excited about that. Also, in general, like the time learning about Python, machine learning, data science, all these topics.

Nathan Lambert [00:04:28]: It's, I mean, it's very interesting because I was a student, I was a grad student at this time or that time in like 2018. That's what deep RL was really taking off. And it probably feels like that probably felt kind of like the language model thing was like as a student at the time, where it's just like, there's so many people in all these classes. And now language models have more of a real world application, but I think as a student, it probably feels so, so similar. Yeah.

Sebastian Raschka [00:04:50]: So also back then, if I may say that it's like large language models already existed. I think the GPT paper, was it 2018? Something like that?

Nathan Lambert [00:04:59]: Yeah, 2018 or 2019. Yeah. For GPT-2, I think.

Sebastian Raschka [00:05:04]: Remember covering, like I had a whole hour or two hours on large language models back then, but it was all focused on BERT models and basically also using them for more like classification tasks. Now, I would say maybe a lot of business problems still evolve around classification, but everything else is basically generative, generating text, generating images and stuff. So it has changed a lot.

Nathan Lambert [00:05:28]: Yeah, for sure. It's like a sequence of like, is it like the transform, is it like Elmo, BERT and the transformers are probably the things that you're talking about all the time? Just very interesting. I think Yitay had this, did you read Yitay's recent blog posts on language model architectures and kind of walked through why encoder decoder is no longer in vogue? Did you see this?

Sebastian Raschka [00:05:51]: Yeah, I think I haven't seen the article, but I remember having discussions with people about that recently. I mean, I think there was actually, it's interesting. So I think T5, if you would train it and fine tune it, it would still be a really good model for sequence to sequence tasks, like language translation and stuff like that.

Nathan Lambert [00:06:10]: Yeah. Cohere for AI did this with AYA. They used T5 for their first AYA version, which most people were like, oh, they've Cohere branded it so well, but no one realized they're using T5.

Sebastian Raschka [00:06:21]: See, I even didn't know about that. And so also on that note, I would say there was something else I wanted to say. So then there's also still the classification thing and using LLMs for classification. And it was also usually either a bird like encoder, or you could also use an encoder decoder, but mostly an encoder. But I've seen also recent papers using just decoder models for that. Just basically removing the, I saw two papers on that actually, like removing the causal mask. So basically reverting it back to an encoder using LLMA and then removing the mask. So in that sense.

Nathan Lambert [00:06:59]: And it works well as a classifier. You can just kind of use it. That's awesome.

Sebastian Raschka [00:07:04]: I mean, you could even do that without removing the causal mask. So you could just tune the last token basically, but yeah, if you remove it, yeah. They found that you could use probably the first token even, because if you have the last token, you don't, you have to have padding always because you have to pad it to the longest sequence. Otherwise the last token would be a different one in each training example. And so in this way you could use an earlier token basically, and then keep it fixed.

Nathan Lambert [00:07:30]: Yeah. Yeah. Now with your work at Lightning AI, do you do a lot of these things like hacking around with language models? Because I think it's kind of an underexplored space where just like people remove layers and plug things together. I think there was like, when merging was just getting going, there was like Franken Llama 2, where somebody made like a Llama 2 30 B by just chopping layers and stuff together. There's so much unexplored signal there that I just, do you have your, have you ever looked at these things or you don't do that much?

Sebastian Raschka [00:08:02]: I must say I'm not a big fan of merging. Maybe I'm just not good at it. I rather prefer fine tuning, start changing things or training and fine tuning things. So yeah, I do a lot of this type of hacking. Sometimes voluntarily, sometimes involuntarily, because I make a mistake or something or like, because at Lightning I developed this library, LitGPT, which is an open source library, pre-training, fine tuning and serving and deploying LLMs. But it's basically a from scratch implementation. You can think of it as a NanoGPT from Andrej Karpathy, but for all types of LLMs, like Llama, Gemma, PHY, all of them. But the focus is also like NanoGPT is on readable code or like keeping it relatively simple. Of course it gets a bit more complex there when you add multi-GPU training, tensor parallel, fully sharded data parallelism and stuff like that. So if you add all these settings, it gets a bit more complicated, but the focus is still on having a code base that you can easily work with. And in that context, it's very easy to remove layers and change things. I mean, yeah, so that is usually, I build it like for colleagues at Lightning, but also like open source community, but then also for myself to tweak things, to change things and stuff like that. So yeah, I should also say, it's not just me, it's Carlos and Adrian who started this library. Currently I'm like the main person maintaining it, but a lot of people contribute to it. So it's actually a nice playground.

Nathan Lambert [00:09:41]: There's kind of two follows odds for this. One is like, what part of the language model training stack, if somebody is going to start with libgpt or HuggingFace or whatever, like they're trying to fine tune a model, you can do an example. And then what is the thing that they should do to go like one level deeper to learn how these things work? Because you're saying with libgpt, you can do all these different architectures. I don't know if I would recommend architectures, but it's a good way to learn how like the attention implementation and how different layers are shaped and things like this. Is there different areas you'd recommend people to look at?

Sebastian Raschka [00:10:14]: Yeah, I would actually, okay. So it's like a shameless plug, but in my book, I have a book where I do this step by step, the implementation. And this is for only one model, like a simple model, a GPT-2 model. Because it's like the, I would say the one that started all of this, right? Like the main architecture and everything else is kind of like a derivative almost of it. So I would think in a good way that it is making tweaks and improving things, but basically starting with one architecture, like you said, not looking at different ones at first, and then just understanding what is, I would say the best way is what is the input data here? How does it look like? What does go into the LLM and really how does it pass through the layers? And then from there, okay, we understand how a model learns to generate one word at a time and then going from there to instruction, fine tuning, and then even like alignment with a DPO, for example. So doing like all these different lifecycle things from implementing one architecture, pre-training, fine tuning, aligning, and then from there, I think it's a useful or interesting exercise to see how different architectures make slightly different choices, like replacing the Gelu activation with a Silu activation or pre- and post-layer norm and like these like nuances, changing the number of heads or number of layers. And yeah.

Nathan Lambert [00:11:38]: Yeah. I mean, in industry, everyone kind of is converging to similar things or like people converge to a similar recipe and then they stick with it for infinity. So like each of the orgs have these recipes that it's too risky to change and like AI2 are like still converging at a recipe. So we're like learning things that the Llama team does and it's like RMS norm and they think it's very important or like these different things. And I wonder how like the open community is going to converge on pre-training things. So like what scale of models do you recommend people train for your book? Are they training like the hundred million scale GPT-2? Is it smaller? Because I think in Colab, you can fine tune maybe with Laura, a 7b model, I think. Is that true?

Sebastian Raschka [00:12:23]: Yeah. So this is true. But I think for Laura, if you want to fine tune 7b model, you would need, I think, bits and bytes of quantization, the normal float for like some quantization. But yeah. So for the, or maybe going one step back for the book, it's really the smallest model, like the hundred, what is it, hundred something million. But I also have settings. If you like, if let's say your machine permits, use the larger version. So there are four larger versions, like 300, 700, and 1.5 billion. But it's really up to the reader. I have all the examples with the smallest one so that it even runs on a MacBook Air. So on this podcast, I'm here on my small MacBook Air and all the models train in a few minutes fine. Of course, I'm not doing the whole pre-training for that. You would need a GPU for a week or maybe I would say maybe even longer than that now. I mean, it depends on the GPU, of course, but H100, maybe a week. But also the other reason is yeah, in practice, you would probably use pre-trained weights and then you can find, so you can do continued pre-training and then fine tune. So the focus is basically understanding how the pre-training works, then loading pre-trained weights. But then also the fine tuning is like the full, the full thing, like doing it to fine tune a classifier, but also instruction fine tuning essentially. And that doesn't take too long. I would recommend using a GPU, but it would technically run on a CPU. And get back to the question you had with a 7 billion model for that one A100, I would say yeah, one A100 would probably work for a 7 billion model. But you can also, if you have Litt-GPT or if you use Litt-GPT as a setting, you can set the number of devices and shard it over multiple GPUs. Yeah.

Nathan Lambert [00:14:14]: I mean, all of this stuff is getting so much easier. I think, I don't know, when did you start writing this book and all of these chapters? Because I've seen the GitHub, I haven't looked at when it started.

Sebastian Raschka [00:14:23]: Actually longer than you might think. It took a long time. It's almost, at this point, one and a half years approximately.

Nathan Lambert [00:14:30]: Because at that time, like a 1 billion parameter model, like what was the state of the art 1 billion parameter model a year and a half ago? Some random model. But today, like people are trading 1 billion parameter models for 15 trillion tokens. So the fine tuning that you can do there is getting extremely good. And I'm going to guess that people are going to start training even smaller models with these distillation losses. So have you looked at distillation at all? I think it's full on coming in the next six months. We can shift it to like the LLAMA3 and the state of the open ecosystem section, because it kind of goes in. It's like LLAMA3 was not distilled. It's a specific loss function. I hate it that there's synthetic data came around and people call, I was on this paper, the Zephyr paper, the title is Direct Distillation of Language Models. But now the technical definition of distillation, which is like knowledge distillation from a teacher is becoming popular. So the whole synthetic data and alignment and everything is like screwed in a doubly defined word.

Sebastian Raschka [00:15:30]: So basically what you're saying is that people who just use synthetic data refer to it as distillation because it's from a larger model. Yeah. Yeah. Yeah. Confusing. I think Gemma too did that actually recently. So that was an example where they did that. And I do think, you know, I think it's also coming. So I have for my book, that's like the core chapters I have, but I have a whole long list of bonus material that I want to cover and distillation, knowledge distillation is one of them. So this will be something over the next few years, but you know, doing tutorials on those and yeah.

Nathan Lambert [00:16:04]: Because I think people can actually use it as a thing. So how distillation works, I've thought about implementing it, but as it works is that if you have a fine tuning corpus, you get all the predictions from your big model. So all the log probabilities from your big model and you store them in memory. And then as you're training the model you're training, which is smaller, you essentially weight them by those predictions because you store them from memory. So you don't need to store the big model in memory when you're training. So I think people should be able to like, or someone will upload a data set file of like a giant log probs of Lama 405B and that people will just try to fine tune from it. I'm surprised that Lama 3 didn't use it, but I think it's just because they're focused on scale and data more than any fancy things.

Sebastian Raschka [00:16:49]: Yeah. And I think the, I can, I think I probably know why, but also, yeah. One thing is I should, one should also add is why I think it's also becoming more popular is like Lama 3.1, they just allowed doing that. I think before it was according to the license, technically not allowed to use Lama 3 models to improve other models, but now, now we can. So I think, like you said, it's probably going to be a hot topic, but I do think they didn't do that because the 405B Lama model just finished, I think. So I think, I mean, if you think back, they shared the Lama 3 model, it's like, I don't know, half a year ago or something, many months ago. So I think it's really more like, yeah, it hasn't finished training, but maybe for Lama 4, we will see more distillation using the 3.1 model for that.

Nathan Lambert [00:17:38]: Yeah, it's more architecture things. So for while we're talking about distillation, almost like Cloud Flash or Google Gemini Flash is confirmed as distillation. And it is very likely that Cloud Haiku and GPT-40 mini are distilled in the technical sense of the word, which is like, I think it's obvious that that works on pre-training. And I think there will be a breakthrough fine tuning model, kind of like the likes of Zephyr, Starlang, I'm forgetting more names, but ones that really reach the narrative from fine tuning on distilled data. I think that'll come in the next six months. So honestly, I'm telling the people I work with, we should try to do this before something new, because it's so obvious now.

Sebastian Raschka [00:18:22]: One thing I've seen also a trend, I wouldn't say backwards, but a thing that doesn't seem to be that popular anymore is a mixture of expert models. What do you think about that? Is that like something like that was like a fad and now people don't, you know, they explore other things like distillation. I mean, you could do both, but it feels like a mixture of experts is not as hot anymore

Nathan Lambert [00:18:45]: somehow. I don't know.

Sebastian Raschka [00:18:45]: What do you think?

Nathan Lambert [00:18:47]: There's two things. Small mixture of expert models are definitely coming out. Essentially, you get a fixed improvement in flop efficiency at pre-training. So essentially, if you're going to pre-train like an X billion parameter model with mixture of experts, it'll go like 40 percent faster or some pretty appreciable number. There's a lot of rumors and discussion that scaling up mixture of experts models is really hard from a stability point of view. So a lot of these open people, you could get it started and we're playing with these AI too. So we want to play in the mixture of experts space as well. And doing a small model works, but there's a lot of headaches. I think like some of the friends at Databricks Mosaic ML have been the clearest about this. It's just like you do not, like you at AI too, do not have the engineering throughput to deal with the headaches that comes from mixture of experts. So I think there's still clear signal from industry and people and like, I mean, Deep Seek's releasing MOEs. I think Quen has a small MOE and these are pretty good models. But I think it's a really heavy engineering lift to get to mixture of experts to work. I like GPT-4 scales. I expect Meta to figure it out. I think it's just on their list and they figured out dense first. The thing I'm more interested in for GPT-4, I don't care if it's mixture of experts. I think they have the compute to do either way. But for Llama-4, God, all the numbers throw me off so bad. But I think that OpenAI and Google might be slightly ahead by having the early fusion model. So essentially with these multimodal models, there's the concept of early versus late fusion. The first visual models that people were playing with the GPT-4 were this late fusion. And now like GPT-4.0 is early fusion. And it seems like Gemini is probably early fusion, which means they take in direct audio, video, text directly at the input, the training data changes. And I don't know how much of a heavy lift it is to get that to work. I think that might be the bigger change. And that might be harder for Meta to catch up on than anything else. But no one's really talking about it.

Sebastian Raschka [00:20:58]: But also here, I think that is something I feel like others have. I mean, I remember even like last year, there were a lot of papers with a late fusion thing, like I think Llama adapter papers and stuff like that, like retrofitting the models. But yeah, I haven't seen that much focus on that from Meta. But I mean, they had a section on that in the paper, but it felt almost like an afterthought. I don't know. It's like where, yeah, I think maybe there's a different team at Meta that works on

Nathan Lambert [00:21:26]: that. There is a Chameleon team that was doing this, and I think a lot of them have left. My question, essentially, that I want to debate and I don't know the answer to is like, because essentially it takes so much different data pipelines. So you have to have a much clearer balance between video images and audio and text when you're training early fusion than with late fusion, because you just add a bunch of images at the end. And like if that data curation step is going to be a big bottleneck for kind of shifting and if Google and OpenAI have an advantage by just scraping YouTube, like Google obviously can't scrape YouTube and I'm not saying that they are, but like if it becomes a way that you can get more data and like GPT 5.0 is the first model that OpenAI releases, then I'll be like, OK, the GPT 4.0 thing was just a pivot. And I actually think this could happen. I don't put this at like a one percent probability. I could see this as being what the labs are betting on. It just takes so long to spin up this entire new pipeline of training.

Sebastian Raschka [00:22:25]: But one question here is going back to a point you mentioned earlier regarding the knowledge distillation where you can just precompute all these things, you could technically do that also just once for the whole data set. Let's say you have a very good image encoder, audio encoder. You would never have to redo this if you do it well. Right. I mean, it would be something you do it, take care of it once and then you pass it just as tokens to the to the other team, basically.

Nathan Lambert [00:22:49]: Yeah, probably. I don't know. I'm not like I don't have as much insight into really advanced pre-training practices as I would like. I'm mostly of a similar boat of like fine tuning models and playing with things because I'm trying to play like, have you played with Llama 3405b at all? For context, the recording is like, what is this, like a week after, like six days after. Like I haven't gotten it set up, but I'm really curious. Like I don't have clear expectations on how the open source community, like the open language model ecosystem kind of evolves from here with these new Llama models, the new Mistral models. It feels like a total, from like a technical and a policy perspective for me, it feels like a pivot. I think the educational side of things, it's actually more of the same. Like we knew we knew this was coming, but it just it feels like it could be qualitatively different going forward. Do you see anything? Have you tried anything?

Sebastian Raschka [00:23:45]: Yeah, I did actually try the Llama 3.1 models. I, when they came out last week, we added them to Litchipiti. I took care of the eight and 70 billion models. And my colleague Adrian, he also added support for the 405 billion models. So just briefly trying it, it looks really good. So the thing is with a 405 billion model, it's a bit tricky. So I think the problem here is, of course, it's free. Everyone can use it, but in a sense it's still expensive to run it because you need, so we were running it with bits and bytes of quantization, like a normal float four on eight H100s. And this is expensive, right? I mean, eight H100s, it's probably more than a hundred bucks an hour.

Nathan Lambert [00:24:26]: I was trying to do the same and I messed up the BLM installation. I was like, okay, I spent an hour on this. Yeah.

Sebastian Raschka [00:24:32]: So you can try Litchipiti maybe. So it works with.

Nathan Lambert [00:24:36]: Yeah. And there's a related question. One of the things I'm trying to ask people who are hands on, just like, how do you, what do you do to vibe check a new model as you go through so much AI research material and language model material? It's like, everyone has their procedures and how do you go about that?

Sebastian Raschka [00:24:51]: So for me, it's like, I, I mean, I use these more like for making sure they generate the correct answers and stuff like that, or something that is reasonable. So honestly, really simple questions for me just to see, so this is more like, I'm not necessarily benchmarking these models. I'm more like making sure the implementation is correct. And for that, I use simple questions like what do llamas eat? What is one plus two? You know, like just making sure, because it's actually easy. Something I just fixed this morning. It's easy to mess up things like KB caching, where you cache, you don't clear the cache and then there's something from the previous answer and the answer looks kind of correct, but it's kind of weird. And, you know, like simple questions can sometimes reveal that. So basically what I do is I ask it multiple, multiple questions the same time. So, sorry, repeatedly, like the same question repeatedly and see if the outputs still make sense and stuff and then mixing them up, but like in a loop basically, but I'm not so much like, that's a great way to make sure the implementation works.

Nathan Lambert [00:25:53]: Cause I think in transformers, they had a missing end token. There's so many little things like this when implementing stuff. Like the, the end tokens is such a ban or like the chat templating can always break things. Cause it also can happen that you mess up pre-training and then you need to have something in the chat template that people might not know. I think in one of the early Olmo models, we like missed a new line in, in one of our documents when we were annealing it. So in order to fine tune it, you had to like have an extra new line before the chat template and like most people will just miss that. Yeah. This is very, very interesting point.

Sebastian Raschka [00:26:28]: It's like, you don't even notice it usually when you use something like, I don't know, chat GPT, because it's applied behind the scenes. But if you implement these things yourself, you have to be really diligent and careful to do it very consistently. Like one little, like you said, new line throws it totally off. It's, it's, yeah, it's interesting. It's like, you have to be, I noticed that I was actually working on some DPO stuff this weekend and my template for fine tuning and DPO alignment, the one that I'm working on alignment, the prompt template was a bit different and I got like garbage results. And then, oh, I, I stripped some line here, the new line character, basically something similar, like you said. So it's, it's very sensitive to that.

Nathan Lambert [00:27:04]: Yeah.

Sebastian Raschka [00:27:04]: Yeah.

Nathan Lambert [00:27:05]: This, this makes sense. Um, related, do you use Clod, chat GPT, any of these regularly in your workflow? Are you team Clod?

Sebastian Raschka [00:27:13]: Uh, so yeah, so it depends. I have both and I flip back and forth between them. I don't know. I'm probably not really good at prompting, but sometimes I get better results with one over the other. Um, I think. I wouldn't say one is better than the other. They're just different. I would say I'm using.

Nathan Lambert [00:27:31]: That's kind of what I think. It's important. Like, it's good. Like, what do you think of both of them? I think it's good for people to know this because it's, it takes some practice to understand and using both. Both people don't use both. Yeah.

Sebastian Raschka [00:27:43]: I would say when I use also GPT-4, I must say I use the, uh, it's called legacy now, but the original GPT-4, I don't like the mini and old versions. And, uh, for Claude, I use the opposite of the, not the new one, but the one, the previous larger one, the slower one. And, um, I think for me it's like coding wise, it's kind of weird, but most of the time I like GPT-4 better for code stuff. But then I think also, uh, I think, you know, what, what's better with GPT-4 was it's, it's a bit more up to date, um, with knowledge, I think. But Claude has, I think better, you know, when you say improve my writing or something like that, it has more, it has less, you know, like these, like I delve into something, these weird words and stuff like it, it's a less, it's more natural a bit, I would say, but

Nathan Lambert [00:28:34]: also not always.

Sebastian Raschka [00:28:34]: I agree.

Nathan Lambert [00:28:36]: It's like, it has a bit more flair and a bit more unpredictability. So I like use a Claude on my phone, but I've found, I've tried to use Claude for like information transformation tasks, like LaTeX or taking, taking data out of a table. And sometimes it just like refuses. Like I do research on like AI safety, like safety and bias. So if I put anything into Claude that I'm trying to transform that data, it just says no. Cause it's like, I can't comment on like a mean story. Well as OpenAI will just do it. And it's like the processing that OpenAI does is very good. So I actually like canceled my GPT subscription when I started Claude, but I kind of regret it now. I'm like, oh, now I need both, which is, which is a little annoying. Yeah.

Sebastian Raschka [00:29:16]: It's like, yeah. So one thing is what is interesting though, is we, we're talking about GPT-4 and Claude here, but we haven't even mentioned Google Gemini.

Nathan Lambert [00:29:24]: I don't know.

Sebastian Raschka [00:29:24]: I personally, I tried the early versions. I don't want to say the newer versions are not good. I just haven't tried because I didn't need to, but do you have experiences with Gemini

Nathan Lambert [00:29:34]: or? I was using Gemini in search preview. So if you have the Google app, I can, I'm recording this in, in video. Like you have the Google app, like at the top, you could click on Gemini, which I was doing for a while just to play with it. But like, I don't use it on the web. I, they do have a nice interface that looks exactly the same, but somehow I got grandfathered into like AI studio, which I use for, if I upload, record a podcast, I upload the podcast and I'm like write chapters or something. And it actually works, which is pretty cool to be able to upload like an hour long podcast. But for whatever reason, the Google interface, other than the Google app, hasn't stuck for me. And I think that's the biggest, biggest limitation. And I use it more in a googly way. So I'd not, I'm not as perceptive to style. I see. I see.

Sebastian Raschka [00:30:20]: So also I'm curious. I just yesterday saw Apple's on device AI is a bit delayed, I think. And for that, I think it's an interesting one. We will see how this will work because this will be, I think also smaller models. And there's a, for me, it's like, I never really care about speed for these things. It's like, I just want the best possible models. So this is also why I was a bit disappointed when GPT-4 O came out and GPT-4 Mini came

Nathan Lambert [00:30:46]: out.

Sebastian Raschka [00:30:46]: It's like, ah, I don't really care about if it's faster or not. I just want it better. You know, I want to have better quality. I don't know. It's maybe it's just me.

Nathan Lambert [00:30:53]: I think for building applications, speed is really good. So I have a few friends that run startups that are heavily built on language models and they have a similar stack to perplexity, which is like the user passes in a query that have a primary language model request and they have a series of feedback requests or small requests on top of that. So when you're concatenating multiple requests, like speed is extremely important. And when you're like selling a product, speed is extremely important. But if you're like tinkering and trying to learn, it is much slower. It's true. Yeah. Yeah.

Sebastian Raschka [00:31:19]: It's like the real world, like, sorry, not real world, but the individual user, um, yeah, using it as a tool in everyday life versus really building an application based on an API that makes sense.

Nathan Lambert [00:31:32]: Yeah.

Sebastian Raschka [00:31:32]: So there are two different use cases.

Nathan Lambert [00:31:34]: Yeah. Yeah. I think we're kind of talking about style. I have a section on RLHF here. I just wanted to like, what do you think you do spend a lot so much on AI education is like, what do you think is most confusing to people about this kind of whole post-training thing, which is instruction tuning, reinforcement learning from human feedback, other safety modules, like adding a filter and stuff like this. I'm really on the bandwagon of trying to convince people that RLHF is deeply tried with style, which is like this, how this discussion of cloud versus, um, open AI and Google and all these things. And I don't really know how to portray that in like an educational technical point of view. So like, I'll do an analysis of the paper and I'll do like DPO and like scores and all these things. But at the same time, for most people reading my articles, the most important thing is probably to know that open AI is really smart about their style. And that's why they're so high on chatbot arena. But like, I've written about it a couple of times. I have another article in the drafts, which is essentially like why GPT 4.0 mini like broke chatbot arena. Because everyone's so upset that it scored so highly, but it's not that surprising if you look at historical events.

Sebastian Raschka [00:32:39]: So it's basically exploitation of the benchmark almost you're saying or like the benchmark

Nathan Lambert [00:32:45]: is focused on style and it really penalizes refusals. So like I get refusals when I use cloud. So it's definitely going to like be downweighted. And like open AI is really good at this. This is what they've been doing for a long time. But I don't really know how to educate this. Like, have you thought about, like, there was a question on Twitter of why didn't you include RLHF in your latest? It was kind of a joke, but I took it out.

Sebastian Raschka [00:33:09]: Well, if yeah, I can maybe answer that. It's it's in the works. No, so there are multiple reasons. And so one is it's so there are page limits per chapter. And originally it was meant to be in chapter seven. It got way too long. It's actually even without it. Chapter seven is the longest chapter already. And what is the other one is fine tuning.

Nathan Lambert [00:33:29]: Oh, sorry.

Sebastian Raschka [00:33:30]: Instruction fine tuning. Yeah, I called it not instruction fine tuning. I called it fine tuning to follow instructions, which were originally, which was originally meant to have both, but then it got too long. And the other thing is, you know, like one book chapter takes about two months and a lot of people who really want to book before the new semester starts. So it's like, you know, it's, there could be another chapter on it, but it would be

Nathan Lambert [00:33:54]: another two months.

Sebastian Raschka [00:33:54]: And that, I mean, it's not really an excuse, but the other one is I was not happy with the results. And this is a very mathy topic. And I was like, okay, I have this book, which is very clear and makes hopefully a lot of sense. And then I have this really super complicated chapter at the end. I don't know if that's very satisfying to read or death.

Nathan Lambert [00:34:15]: Yeah.

Sebastian Raschka [00:34:15]: Where it's like, so you read this book, everything makes sense. And then it comes to this huge...

Nathan Lambert [00:34:19]: Why is RLHF so much mathier? Like, I know a couple, there's a couple of core equations. Like the core equation is like the RL optimization step, which is expected expectation, maximization of reward subject to penalty. And like, where does most of the, like compared to pre-training, which is like one equation, like that is also one equation, but there's a lot of downstream stuff, I'm guessing. Yeah.

Sebastian Raschka [00:34:41]: I think it's the explaining a bit about reinforcement learning. I mean, you don't really have to explain reinforcement learning in a classic sense, maybe, but yeah, there's still like KL divergence and penalties and reward margins. And there are lots of things happening at the same time. And the code is also very long if you especially want to track the rewards and stuff. So for my instruction fine tuning chapter, I'm using exactly the same training function I implemented in the pre-training chapter.

Nathan Lambert [00:35:14]: And it's really nice.

Sebastian Raschka [00:35:14]: It's like, well, you can actually reuse everything. It's, it fits together.

Nathan Lambert [00:35:18]: Yeah. Like what we're doing on OMO, we can baseline our instruction fine tuning in our fine tuning code base, which also has some RL things and in our pre-training code base. So it's nice to have both, but that is definitely why it's simpler. And the RL is only getting worse in my mind, I think. Like we've seen that LLAMA has used rejection sampling for two iterations and there's no public implementation of rejection sampling that at least public enough to know that people have actually trained models with it, which is the idea of ranking completions to a reward model and then running instruction tuning again on the top completions.

Sebastian Raschka [00:35:54]: I think also in the recent LLAMA 3.1 paper, they used rejection sampling with DPO, for example. Like they didn't use the RLHF with reward model, but then they used the reward model for the rejection sample. And yeah, so I must say, I have the code for the DPO. I wanted to do TPO because it's also more resource efficient. You don't have to train that reward model for, let's say the book, but I was not really happy with the quality of the output yet. So I must say it's like, okay, this is not, it's not helping the instruction fine tune model. And it's like, I think a general thing where I, I mean, you might correct me if I'm wrong here, because you are the expert in RLHF, but for me, it's like, it's like a optional thing where unless you need a specific style or need to deploy something in like a safe manner, it's maybe not giving you the best results. If you need a private model that just runs on your own computer and gives you correct answers, I don't think DPO or RLHF will make the answers more correct. They will just change how they look like.

Nathan Lambert [00:37:01]: And yeah, I mostly agree, especially on what we have in public implementations. The public implementations are really good at improving on like alpaca eval. But if I'm training a model that I actually want to use, don't worry about alpaca eval. I think I'm like the most annoying person internally running these experiments because I just get so annoyed when only alpaca eval goes up and be like, that has made the model worse. Like we've, I've been building internal demo tools, which is just like making Gradio better and showing how to use VLLM for serving. But it's like a lot of the models we put out for research are like really, really annoying to talk to. You put no yapping or just be concise in the prompt and it doesn't do anything. So like a lot of the open datasets, and this is something that Nibetron and Lama3 have shifted to is this new evaluation, which is like IF eval, which stands for instruction following eval, which I think is a great one. So it's like write a response with less than 300 words or something. And it has these verifiable claims. And this is something that the Nibetron report showed that like doing fine tuning really unlocked a lot more performance in the DPO stage. So I'm hoping that we start to get more evals than just alpaca eval that are helped by this RLHF and that'll help the whole ecosystem come forward because it is in a kind of young, rough state right now. Yeah.

Sebastian Raschka [00:38:21]: And also one last thing about this topic is for me, like you said, the last sentence is kind of also one of the reasons is where I was like, okay, if I include something on DPO as the last chapter, I don't know if it's still going to be used next year or if there's so many variants, ORPO and QTO. And I mean, right now, I mean, Lama3.1 used DPO, which is like a big endorsement. But to be honest, I'm not sure if this exact variant is here to stay.

Nathan Lambert [00:38:47]: And so I think DPO is here to stay. DPO will be a canonical example, much like PPO. But I think the things that people are using will go away. Like PPO has stood the test of time of multiple eras of RL. So I don't think that people use it in its exact form, but people are always looking at it. And same with DPO, just because DPO is so simple. Like the exercise, this is like one of the best getting started with RLHF exercise is taking like the hugging face trainer and modifying it to use the DPO loss because you could use all the other infrastructure for like most of the infrastructure for batching and stuff like this. And then add that loss function, which is a few lines of code. And like, that's a good, that's like the entry point to doing RLHF implementations. Like when I interview people, I'm like, make sure that they have looked at this DPO loss function before. And if they haven't, I'm like, I don't know if you're in the weeds enough. I feel like you should look at this.

Sebastian Raschka [00:39:37]: Speaker 3 And if you need, if you are listening to this and you are about to get interviewed by Nathan, I will hopefully have by next weekend a tutorial on DPO, on implementing it from scratch. I was, this weekend I used actually Lama 3.1 to make a synthetic data set for that and got much better results. So it looks good enough to probably upload it next week. So nice.

Nathan Lambert [00:39:58]: Okay. Let's shift gears into like AI research and AI education, which is I think the thing that you have some of the most insight into. So you're a head of AI newsletter. You, I wasn't originally reading it when I subscribed, but now I almost always skim through to kind of see what papers you uncover. I'm pretty interested in like how you select papers, like how much you actually prioritize reading papers and why, and just like any advice for people, because it's hard to sit down and do this. And I, speaking for myself, sometimes writing is like how I force myself to read some papers. I don't know if you're in the same boat, but like, what is your worldview around reading AI papers these days and skepticism or excitement, everything?

Sebastian Raschka [00:40:42]: Yeah, that's a big topic. So I must say, so I, I look at more paper than I actually literally read. I mean, I look at the abstracts and the titles and then that's like a huge funnel as a section

Nathan Lambert [00:40:54]: processor.

Sebastian Raschka [00:40:54]: I must say for like, I was an archive moderator for the machine learning archive a few years back and that got me into the habit. So how it worked was basically as a, maybe it's useful because some people complain when

Nathan Lambert [00:41:06]: How did someone become an archive moderator? I didn't know that it was like a community position.

Sebastian Raschka [00:41:12]: So that was originally by Tom Dietrich. He was doing it by himself and he was looking for people to help him with that. Because as you mentioned, there is an ever increasing number of papers. And so how it works is essentially that when you submit a paper to archive, you select the categories. But a lot of people, they select not, let's say the correct, I wouldn't say not correct, but like the preferred categories because Yeah, the AI and ML.

Nathan Lambert [00:41:39]: It's like ML, AI, and then everything else. Yeah.

Sebastian Raschka [00:41:42]: And AI in archive is interesting. It's more like the classic AI. It's like, it's not LLMs. It's more like symbolic AI, that kind of stuff.

Nathan Lambert [00:41:51]: What do you think the difference between, or like as an educator, how do you define AI and machine learning? This was also one of my favorite interview questions to like see where they're at.

Sebastian Raschka [00:42:00]: Well, right now I would say I go back and forth on that. Right now I would say AI is this big umbrella thing where you have deep learning and machine learning as subfields. But if you think about it, if you consider a logistic regression classifier, it is essentially machine learning. And if machine learning is the subfield of AI, you would say, okay, then logistic regression must be AI. But is like classifying iris flowers really AI? I don't know. So today I would say

Nathan Lambert [00:42:28]: I also think about search as AI. Yeah. Like, yeah.

Sebastian Raschka [00:42:31]: Like, yeah. So there's like the good old fashioned AI. So I would say with AI, yeah, you have both, you have the machine learning and deep learning branches, but you have also, you can also implement AI with if else statements, I guess, like, you know, like, so. So that's how I would define AI. But I think nowadays when people talk about AI, they mean specifically gen AI, like generative AI models, like LLMs, stable diffusion, that type of stuff. But yeah, so the archive thing. So just briefly, basically there is in the background, it's also using machine learning or NLP to detect whether the title based on the title and the abstract, if the category is actually matching. And if there's a mismatch or in general as moderator, you go through them and, oh, this looks good.

Nathan Lambert [00:43:17]: This looks good.

Sebastian Raschka [00:43:17]: This looks good.

Nathan Lambert [00:43:18]: They started exposing this to the user. So I submitted a paper recently under ML and I was like, this looks like language. And I was like, I've been in moderate, I've gotten papers stuck in moderation. So I was like, I'm always going to hit, except if they tell me it might be in the wrong category, because archive moderation is a black box that you don't want to get stuck in. No, no, like as a user, but I understand the service it's providing. So it's good to expose that to the user. And if anyone's listening, just click it, click. Yes. It's not worth delaying your release. We get stuck in moderation and help archive out. Yeah.

Sebastian Raschka [00:43:50]: And so just the last thing on that is by default, everything gets accepted. However, sometimes it's something gets flagged. If there's duplicate content, if it doesn't look like a paper, sometimes people submit like one page blog posts or something. So there is this thing where sometimes there are also false positives and then it gets stuck. But long story short, that got me into the habit of reading the titles. And that's what I still do. Also for my head of AI newsletter, I just look through the titles and select. How have titles changed?

Nathan Lambert [00:44:21]: Like titles have changed a lot though, as I feel like they used to try to be. Accurate. Mostly descriptive. Yeah. Descriptive, right? And now they are a mix of, it's more of a storytelling than descriptive. I think it's the right way to tell it.

Sebastian Raschka [00:44:36]: At least we don't have the, it's all you need anymore. I feel like this went away finally, but yeah, you're right. It's more.

Nathan Lambert [00:44:43]: It ended with Ryland Schaefer's test set. Training on test is all you need. Yes. Did that make it on archive? It did.

Sebastian Raschka [00:44:51]: I think I also had it featured in my newsletter one time. I think. Or not featured, but at least mentioned. And so how I select papers is also often selfish. I read or select papers for the newsletter that I find interesting. And because I think this is also for education. When people ask me about how I would suggest doing things, I think the most important thing is to talk and work on things you are interested in. I think it would be really hard to do a good job if it's a topic that is not interesting to you. For example, I know, I don't know. R, sorry, or Rust is interesting, a very important topic, but I'm not into it. So I don't try to, let's say, make videos or content.

Nathan Lambert [00:45:35]: Yeah.

Sebastian Raschka [00:45:36]: So it's like, I think if there's something you're excited about, I think it comes almost naturally that you want to talk about it. So in that sense. So the newsletter, I almost, it's weird, but I almost write it for myself. It's like, I find it interesting.

Nathan Lambert [00:45:49]: How much do you spend reading versus writing when you're reading these papers and writing a blog post? I'm guessing a lot of it is just the natural process of synthesis is what you put into the newsletter. It's not like you're doing it from my read. It's not like you're doing a ton of scaffolding and editing after the fact, which seems similar to what I do.

Sebastian Raschka [00:46:09]: Yeah, you're right. I don't do, I don't spend too much time on it in the sense that I wish I could, but I have a full-time job. It's literally just the weekend project where I aim for one newsletter per month. Of course, I would like to do more, but there was also a book to write on weekends or sometimes I'm doing videos. It's like keeping it fun, you know, like where it's like, okay, this is not a chore. This is something that is supposed to be fun. Like in that sense, I read a paper and then I take notes and then I collect them and spend maybe half an hour, an hour to polish them a bit up or make some figures. And that's it per paper, I would say. And so I also don't write the whole newsletter on one day or one weekend. It's really spread over the month. I read a paper. Oh, this is an interesting one for other people. Let's write this up basically. And then this way I collect material over the month and then.

Nathan Lambert [00:47:00]: Yeah. What motivates you to work on this stuff? Is it purely like education? Because I, in some ways relate to that. I've been in that mode before.

Sebastian Raschka [00:47:09]: Yep. So if you have noticed, I don't have any sponsorships or something.

Nathan Lambert [00:47:14]: Never done that. Respect.

Sebastian Raschka [00:47:16]: I will never say never, but it's not something I do. It's really just a hobby. And I do like discussions that come around it. There's a certain satisfaction that if you put it out, it helps others and people tell you positive things about it. It's kind of very gratifying. I don't know. There's like a reward in a sense. And what's also cool is there are a lot of people. It's like being part of the community and exchanging information because there are also a lot of people who sometimes know something I don't know. And this is really, I think, really cool. You write about something and then someone, Hey, have you seen this? This seems like it's taking it to yet another level. Or this is the same idea. It's even better or something. And this is super cool where you get this effect where you learn by doing this, actually, because there's always someone who knows a bit more than you do in a specific area. So, yeah.

Nathan Lambert [00:48:07]: Yeah. I feel like it's increasingly important these days and increasingly impactful because so much of research has become closed off and for business reasons. So there's fewer people that do more of the work. I don't like it. I always feel like people don't realize how few people are informed and share on any given topic like AI research. If you take away three people, I've yet to find people that just tweet the same random RLHF crap that I tweet. It's like, I don't do it because I just say random things, but there's not that many people that represent each of these corners. Ahead of AI, I think Jack Clark's important AI. I should have him on the pod. I think I've talked to him a few times. He's great to talk to. And his is the same thing. It's like these few people that are disseminating AI information, which is crucial for policy at future angles. Have you ever gotten criticism that your work is accelerating AI and that you are a safety risk? I've gotten some critical emails that are like, you shouldn't talk about this.

Sebastian Raschka [00:49:07]: Yeah, I've more gotten emails about the fact that I talk about LLMs is not good because LLMs violate copyrights. I mean, not that I do it, but that other people's LLMs do it.

Nathan Lambert [00:49:21]: And I'm happy that I haven't had this audience very much, but it seems this is like one of the challenges of having a tech audience is like you cultivate it in kind of one of two, like there's multiple ways to go. And one of them is like this all data is for language models is theft thing. And I just don't know how to deal with it because like I disagree, but the normally people that aren't receptive to it, which is really hard. It needs to be played out. Yeah.

Sebastian Raschka [00:49:47]: My book also just to make extra sure all the data I use there is so the pre-training data is public domain data, like a book from Project Gutenberg. And for instruction fine tuning, I did my, I created my own data set basically. So just to avoid any issues, you know, like. Did you do, you wrote it by hand?

Nathan Lambert [00:50:06]: Yep.

Sebastian Raschka [00:50:06]: So I took, no, actually I used, I used part of an LLM and some by hand.

Nathan Lambert [00:50:12]: Yeah.

Sebastian Raschka [00:50:12]: So it's a great exercise.

Nathan Lambert [00:50:14]: Yeah. Yeah.

Sebastian Raschka [00:50:15]: And for the synthetic one, I use LLAMA 3.1 now too. I mean, yeah, you can tell me also about that a bit. I mean, that's maybe interesting for the audience, how to generate a preference data set, because there are multiple ways, I mean, naturally it's crowdsourced, right? So you ask people, you have the model generate two answers or have flavors of the model generate answers and then, oh, which one do you prefer? But it's not really scalable. And so you could technically do the same thing with an LLM. You could basically have the LLM generate a more polite version because I think LLMs are very good at, even the small LLMs, the open source 7b models are good at rephrasing things or evaluating things. They're not necessarily good to generate the answer in the first place if they don't have a reference, but given a reference, I think it's super useful to use open source LLMs in that sense.

Nathan Lambert [00:51:07]: I'm surprised that this hasn't caught on sooner, but I think it's starting to catch on. I think in the meta report, they essentially have edits. So then they rank, they make their preference pairs as edited better than chosen, better than rejected. And that's like, you can create multiple players by binarizing. There's a few research projects that have done this where they have like, constitutional AI is popular, but that's not really reproduced. One of my collaborators slash friends at Synth AI Labs, Louis Castricado, he did a paper on like the pink elephant problem, which is like using provisions to get the model to not just say whatever is in the question if you ask it not to. We did a follow-up work that's out literally today, which is like on self-directed synthetic dialogues where you have the language model generate a plan, and then it follows the plan. And then you can also do revisions on it. So I think Nemetron did this with Prompt. So it's really getting going, but it's something that took longer than I expected. There's the kind of question, this is like too big of a topic to go into, but it's like, how do you use GPT-4 feedback? Do you use like, are your completions from two different models or the same model with different generation settings? How do you use humans? I think that the labs are using humans for preference data because it eliminates some of the problems in language modeling. And then that's one of the biggest impactful research questions in alignment. It's like, we can't afford the $1 to $10 million dataset. How do we do this? And that's what, we're starting a project to do that AI too right now. And it's a big open, like, I don't know where it'll go. I don't know how much, like how far can we reproduce the LLAMA-3 alignment methods. Yeah.

Sebastian Raschka [00:52:46]: So I would say the LLAMA-3.1 paper or the LLAMA-3 paper, it was like a 93 page paper

Nathan Lambert [00:52:52]: and it was great.

Sebastian Raschka [00:52:52]: I love it. It's like a lot of detail, but on the alignment part, I feel like I wish there was more information

Nathan Lambert [00:52:58]: about it.

Sebastian Raschka [00:52:58]: Even like LLAMA-2 had more information where they showed what is the improvement actually over the different stages when they added to supervised fine tuning.

Nathan Lambert [00:53:05]: So I'm talking to Ross Taylor tomorrow, and I'm going to ask him the specific thing. On latent space, like Thomas S., one of the leads, said that most of their gains come from RLHF rather than SFT. So I think the open source community is over-indexed on instruction fine tuning because it is accessible and we have the data. And this is like one of my, like, try to guide the community by doing things is like, go do RLHF. Don't worry about instruction tuning data sets. Don't worry about that. We'll just leave that the same and go find more preference data and keep playing with this. And don't worry about the DPO methods. Just literally go make preference data and keep trying to train things. Like don't implement a new loss function.

Sebastian Raschka [00:53:48]: Practical question to an expert like you. How good is actually a preference data set if you download it, if both the chosen and the rejected answers, if you download a preference data set, they're not generated by your model, right? And if you have a model and you use the responses that the model has never basically seen before, does this actually work or would it be advisable?

Nathan Lambert [00:54:11]: So the most, the two most popular preference data sets in the open right now are UltraFeedback and Nectar or variants of them. Both of those are collected from large suites of other models. And part of my, there haven't been data sets or papers that have trained really good models using on-policy preference data from the model you're training. And I think that's a question that we need to answer. It's like, how do we get UltraFeedback level results with on-policy data? Because all the labs are using on-policy data. I wrote about this in like Barry to one article. I have a theory that UltraFeedback and Nectar, these general data sets work so well because within them, there is something close enough to your distribution and you don't have to get it quite right. But it's just like a gentler, more uniform learning signal for the models doing preference tuning. But we don't know. That's something that I want to answer.

Sebastian Raschka [00:55:02]: Yeah, this is an interesting one. I would also like to know the answer because that is one thing where I got a bit stuck when I was writing this DPO chapter with smaller models. I think bigger models also, they hide these weaknesses a bit because they have been trained on so much data that like you said, it's kind of in distribution already. But if you train a small model, it would be out of distribution, right? If you use someone else's preference data set. I noticed even something simple when you train a model on one simple instruction data set, let's say something like alpaca. And then let's say you have just to have something visual. You want the model to generate Yoda speech, like where every sentence is reversed. But the model has never seen sentences like that unless it was maybe in the training data. But in that sense, it doesn't work well at all because you ask the model during preference tuning to write sentence structures. It has never grammatically written before. And so in that sense, I think what I found is it's much better if you, I don't know, you say be more polite or like you have a more polite answer because you use the same grammar or so. So things like that basically. And yeah.

Nathan Lambert [00:56:08]: Yeah, I think that's a smart approach. It also might be why learning rates are getting so low. Where like all the learning rates for DPO and things have been going down in the fine tuning space. And it might just because distributionally, like we're far off from the model. There's the other theory that the model is like really, really done training. So they get it to a really good optimum. You don't want to move it from them. But it might just be that like our data sets are in the wrong space. Yeah.

Sebastian Raschka [00:56:32]: So you try to be gentler with a lower learning rate.

Nathan Lambert [00:56:36]: Yeah. All of this stuff changes fast, but not fast enough. Like this ultra feedback data set they were talking about came out last October. So we're like almost 10 months in and it's still the state of the art data set. And it's only like 50,000 examples. So there's so much opportunity for someone to like at this level, like go build data sets if anyone is watching. Because it's like, I think we're so far off where we could be just because people don't know how to make good preference data sets.

Sebastian Raschka [00:57:02]: Well, now we have LLAMA 3.1, 70 and 405 billion that allows us to do that, right?

Nathan Lambert [00:57:08]: We'll see. Yeah. I was wondering, this is a change of topic, but how do you think like, do you think AI will change our jobs in writing? How do you see AI coming for this kind of educational space? Like how much of what you do as an educator could be taken in N years by AI?

Sebastian Raschka [00:57:26]: Well, I think it's like, of course it will automate away some things because nowadays you would ask a model something instead of searching for it and reading it on a website. But I do think the creation process, you still need a human to put it together well. Because I don't know, I think LLMs are not nowhere near like generating a whole article that is actually, I would say even good where it can generate the right things, but you still have to put it together. It can generate good blocks of text or something like that, but you need to, as an edit, like you become maybe more like the editor then in that sense. But I'll try this.

Nathan Lambert [00:58:09]: Also like, do you write, do you have AI write any parts of your articles? I'm so scared for like moral reasons to have any AI writing in it. I'm like, it's just a slippery slope. It feels like I could get addicted. Yeah.

Sebastian Raschka [00:58:21]: So sometimes I don't have it write anything from scratch, but I sometimes do do that. And especially, I don't know, I have a, I mean, I'm a non-native language speaker and sometimes I have a harder time than other days to make the sound right. It's like, okay, this is what I want to say, but it doesn't sound right. And then I, can you revert this with a focus on XYZ or something? So like, it's basically like a, you know, like a thesaurus where you find similar words, you find similar sentences, like just rewording it, like these types of things. But one also, now that you mentioned it, one weakness it has, or LMs can't do really, is they can't generate figures. You know, maybe that's coming.

Nathan Lambert [00:59:01]: I don't know.

Sebastian Raschka [00:59:01]: You can do that probably with ticks, like the latex thing where at one point, but right now nowhere near, can you generate any useful figure? And I think learning is very visual too. I think if it's just text, it would be really hard to learn anything.

Nathan Lambert [00:59:17]: Yeah.

Sebastian Raschka [00:59:17]: So you can, of course, but I do think, you know, there's a saying, image is worth a thousand words, right? So yeah, in that sense, you still need someone, you know, like the mastermind behind an article, even if it's just an editor, I don't think LMs can replace everything at least. And we'll see. I mean, I don't know how much better, I mean, we just don't know how much better, let's say GPT-5 as a placeholder here will be then GPT-4, you know? So maybe if it's saturating, who knows, right? So maybe it will be five more years till we, yeah, get in a more scarier territory in terms of replacements, you know? So we'll see.

Nathan Lambert [00:59:55]: Yeah. I mostly avoid the agent word, but it does seem like there's enough culture and cultural investment in the Bay Area and tech executives to do something. Like they're going to get to something that is triable, which I think is mostly like automatic Google searching, more code execution, which is going to be interesting, but I have such wide expectations of what it actually means. That's probably the next big shift. I think this LLAMA 3.1 is probably right now leading the year in terms of AI news. This recent DeepMind thing on the math might be a better example of what's really hot news. I need to go read more about it. There's some long write-ups on how the qualitative between the AI math and the human math and the different directions they're going. So that's kind of what I want to read about it. But it'll shake things up. We're multiple years into this fast phase. It's not exactly new at this point. Yeah.

Sebastian Raschka [01:00:57]: Last thing on that is I do think, though, LLMs make good assistance in the literal sense where one thing where I use it for my newsletter for is at the end, I have a list of all the papers I have found interesting, like 30, 50 papers usually. And usually per hand, I edit the author names, like the last names of the first three authors. And now I use an LLM to go to the website and get the names of the authors, basically. And so this is where it saves a lot of time. You could do that without LLMs. You could write some code to do that, but it would probably take me half a day to write because I'm not good at this web scraping code to do that type of thing. And I think in that sense, it is actually a useful assistant for certain things like

Nathan Lambert [01:01:44]: delegating actions. I think it'll keep creeping up. I don't expect their usage for those things to go down because they already are so useful. And the little coding things, the hacking data together, the automatic searching, people aren't going to want to stop using that. I don't know if it supports the whole valuation we have, but it's fun to be in a space where we get to try new things. As a computer nerd, it's really fun to have a new type of software that we can try all sorts of things in our workflow. And I think that's underrated. So I don't know. Thanks for coming on. Any last things you want to discuss?

Sebastian Raschka [01:02:19]: Yeah, I just wanted to say thank you for the invitation and I hope you keep creating these awesome newsletters. I think this is much needed because there's so much hype, like you said previously, it's

Nathan Lambert [01:02:32]: creeping up on us.

Sebastian Raschka [01:02:32]: There's a lot of over, let's say, evaluation and praise. And I think something that is kind of like cutting through this is it's much needed like this honest, straightforward, no b******t content. So yeah, I hope you keep creating that. It was fun to chat. And yeah, to everyone out there, I think also what keeps us motivated, I think, is the awesome community that people give feedback and discuss things and bring things up. And yeah, I think without people also giving us feedback, we wouldn't be probably doing this because it's kind of a lot of fun to be in that space, I must say. Yeah, it's fast moving, but there's always something interesting every day.

Nathan Lambert [01:03:14]: Yeah. Yeah, this is really interesting. We covered a lot of kind of low level of just what it's like trying to use language models on the day-to-day basis in July of 2024. So thanks for coming on. And I'm sure we'll talk soon. All right.

Sebastian Raschka [01:03:27]: Yep, it was nice meeting you and see you then. Bye.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

GPT-4o-mini changed ChatBotArena Jul 31, 2024

And how to understand Llama three point one's results.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt-4o-mini-changed-chatbotarena

0:00 GPT-4o-mini changed ChatBotArena
3:23 Llama 3 in the arena
5:13 Partial solutions and next steps

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_019.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_021.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_025.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_039.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_043.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Llama 3.1 405b, Meta's AI strategy, and the new open frontier model ecosystem Jul 23, 2024

Defining the future of the AI economy and regulation. Is Meta's AI play equivalent to the Unix stack for open-source software?
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-405b-open-frontier-model

00:00 Llama 3.1 405b, Meta's AI strategy, and the new open frontier model ecosystem
01:37 Meta's open frontier model
03:51 Zuckerberg's vision for open-source AI (vs. reality)
08:35 Does the Llama 3.1 license support open-source AI?
12:55 Different futures for regulating frontier models

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_018.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_050.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

SB 1047, AI regulation, and unlikely allies for open models Jul 17, 2024

SB 1047, AI regulation, and unlikely allies for open models
The rallying of the open-source community against CA SB 1047 can represent a turning point for AI regulation.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sb-1047-and-open-weights

00:00 Introduction
01:53 SB 1047 and targeting regulation
07:57 Unlikely allies of "open"
12:05 What would I regulate today?

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Switched to Claude 3.5 Jul 03, 2024

I Switched to Claude 3.5
Speculations on the role of RLHF and why I love the model for people who pay attention.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/switched-to-claude-from-chatgpt

00:00 I Switched to Claude 3.5
03:57 Product priorities
05:15 RLHF's peak?

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_016.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_020.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_022.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Dean Ball on AI policy: CA SB 1047, upcoming AI disaster response, Llama 3 405B, Chinese open-source AI, and scaling laws Jun 27, 2024

I’m really excited to resume the Interconnects Interviews with Dean W. Ball from the Hyperdimensional Substack (you should subscribe). We cover the whole stack of recent happenings in AI policy, focusing of course on California’s bill SB 1047. We cover many, many more great topics here including:

* What will happen in the case of a minor AI disaster,

* If Meta will release the 405B model, and why,

* The status of Chinese open-source AI,

* Training on model outputs,

* Anthropic’s recent strategy,

* What scaling laws actually mean,

* Creating content and shifting the needle of the AI discourse.

Watch the video on YouTube below or listen on podcast players here.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Chapters

* 00:00 Intro and Welcome Dean Ball

* 02:44 The Origins of California Bill SB1047

* 08:56 The Evolution of Bill SB1047

* 13:00 How SB1047 Affects Fine-Tuning

* 20:00 The Future of Bill SB1047

* 21:58 The Impact of AI Disasters

* 29:02 Meta and its 400 billion Parameter Model

* 32:25 Open Source AI and the Chinese Market

* 37:37 The Future of Open Source AI

* 43:35 Synthetic Data, Licenses, and Future AI Development

* 45:18 Anthropic's Approach to AI Safety

* 50:46 Scaling Laws

* 53:01 The Role of Audience in Influencing AI Policy

Links

* Dean’s series on SB-1047: one, two, and three.

* Other AI policy Substacks: Jural Networks and Intersecting AI

* Senator Scott Wiener. CA SB 1047 itself.

* Another post on CA SB 1047 from Answer AI.

* Situational Awareness by Leopold Aschenbrenner.

* Lina Kahn on her P(doom) and warnings in support of open-source.

* Ben Thompson’s Framework for Moderation in technology.

Transcript

Nathan Lambert (00:00:01): Hello, and welcome back to InterConnect's interview series. It's been a few months. I'm really excited for this one. We're here with Dean Ball, who is a research fellow at the Mercatus Center. He works on AI policy right now, and he's the author of the Hyperdimensional Substack, which is kind of the AI policy substack that emerged when I was spamming into the void that we need to have some good AI policy newsletters out there. There are a couple more that I could add to the show notes of this that I'm aware of from friends that used to be at OpenAI, friends at AI2, so I'll add some of those as well.

But in this kind of summer slowdown of releases, I thought it would be a great time to kind of revisit some of the core themes on AI policy, open versus closed, kind of things that I'm wondering about in the future that I know are coming that are looming AI disasters, what some of these closed source companies are trying to do in the policy space. I think this is the sort of interview that we could probably do multiple times. I think we've started talking in DMs and it's clear that we're aligned on a whole bunch of things. We read each other's work. I think this should be kind of fun and I'm just happy to do this.

I think the core of this interview I'll give you a chance to introduce yourself if you want, if you want to add anything else that I missed, and then we're just going to go into this California bill SB 1047. Probably talk about this. I'll ask you about the story of how it happened and then where we're at now. And I think that'll kind of lead into a lot of interesting debates. So do you have any background you want to add that makes you an interesting person in the AI space? Or is it just that there's so many things that need to be done in AI that if you're focused, you can kind of have an impact in an area?

Dean W Ball (00:01:44): Yeah, I mean, I think basically, you know, I've mostly written on policy unrelated to tech for my career, state and local a lot. So the fact that a lot of the policy action on AI seems to be happening at the state level has been very relevant. But I've also just like always been paying attention to the AI literature. I remember 2017, I think, when the Alec Radford Amazon podcast product reviews paper came out and I said to a colleague this is gonna be a big deal I think one day and you know we I tried to use GPT-2 to do like social science research like policy research back in 2019 so I've been playing around with these for a while and I try my best to write from a combination of a relatively technically informed person, but also someone who understands the policy side.

Nathan Lambert (00:02:43): Yeah, so I think we should jump right into it. What is the origin of the story of this California bill? My understanding is it just kind of showed up and everyone in the Bay Area was like, like where did this come from? Having actually passed the state Senate as like, do you have any, does your story start there as well? Or did you kind of know this was coming?

Dean W Ball (00:03:03): So I saw, Scott Wiener, the author of the bill had telegraphed that he was working on, something in AI policy, I think in maybe October or November of 2023. And then the actual bill text came out in early February. And I remember when it came out because I was having dinner with my wife and, I was like, I have to drop everything and go work on this. I stayed up until like one in the morning, you know, reading the bill and writing about it. And that was kind of my first Substack post that really went anywhere in terms of audience. And so, yeah, then there was kind of a couple months of quiet. You know, I had been writing about it, but people weren't really focused on it in the Bay, in the tech community. And then closer to around April, people started to pay attention. And the conversation has been pretty, you know, pretty active since then.

Nathan Lambert: Yeah. And like, what does it actually say? Like, what are the core points? I know there's stuff around thresholds and giving California power to do like California creating a new body. Like, what are you think? What are the few like core things that people should know? I think there's probably some details, but just the core stuff.

Dean W Ball: Yeah, so the core idea behind SB 1047 is to create a regulator inside of the California government called the Frontier Model Division that would oversee models. Really, now the threshold is models that cost more than $100 million to train. We can talk about how specifically you really even specify that cost, but really all the bill says is $100 million of compute costs to train. Those models are subject to a series of testing and safety requirements, and more importantly, I think, a liability regime that basically says that most downstream uses of that model, including in the case of an open source model, most fine tunes, most uses of models combined with scaffolding software, other software. So things that are very combinatorially distinct from the initial model release. Any downstream misuse is the legal responsibility of the developer who made the original model.

So, if I fine-tune Lama 3 and then someone else puts that in an app and then a user of that app misuses it in a way that causes a serious harm, the bill does have a high threshold for the harms that have to count here.

Nathan Lambert (00:06:00): Is that eligible? Is it specific? Do they have a safety taxonomy?

Dean W Ball (00:06:05): So, they basically, it really, it's a static threshold that comes in at $500 million of damage. They would say harm to critical infrastructure and things like that. Critical infrastructure pretty much just means everything. It's kind of a catch-all term. It's a little weird. Critical infrastructure, the way we think of it, like highways and power plants and stuff, is actually a subset of critical infrastructure. Critical infrastructure includes things like casinos and ballparks and amusement parks and all kinds of stuff. So anything really, any major cybercrime, bio attack, all the things people are worried about with AI would count. And the developer of the original model, which is many stages upstream from where the harm happened, would have legal responsibility.

Nathan Lambert: So it's like the risk for these, probably the expected value risk for open models in this bill is definitely low, but it's just kind of this thing that it's like, if you're kind of comparing on the two axes, the open versus closed risk, like the risk for open models is way higher because of this downstream use term. And that's for the people that are getting like, oh, why is everyone that cares about open AI, like open AI as the field mad about this? So I think that was why everyone was kind of caught up in ours.

Dean W Ball: Yeah. And the other thing to keep in mind, though, is that under this bill, if you're making a model more than $100 million that costs more than $100 million, you have to submit a variety of documents annually about your safety procedures and sort of testing regime on the model to the Frontier Model Division. And I think something that's not all that well understood, and it's kind of just like how administrative law and regulation works in America, but that the tech community might not understand, is that the Frontier Model Division has the capability to create problems for developers, even if their model's never used for a hazardous capability. They could see your safety plan and say, We don't like this or we want more information on this. And they can subpoena you. They can bring you to court and they can, you know, issue it. They could they could order a cease and desist.

Nathan Lambert: Yeah. And this is where you only post on the political economy of AI regulation comes in as like, what are they going to do with that kind of open ended power?

Dean W Ball (00:08:40): Yeah, it doesn't necessarily. I mean, they're an agency that has all the regulatory powers of an agency, which are substantial. I think one other point that is worth making about 1047 that would be relevant to your audience in particular is. So the initial version of this bill, any fine tune. No matter how substantial the fine-tune is, the original model developer held the legal responsibility and had to test their models with the margin and the realization that people could fine-tune them or do whatever they wanted to them, modify the weights in arbitrary ways, which obviously doesn't really make a ton of sense.

Nathan Lambert (00:09:38): I was going to ask about the edits. This is where I probably stopped reading as closely as I should have.

Dean W Ball: In a fundamental sense, everything I've said so far has basically been true of the bill for the entire, the fundamental points, the liability, the frontier model division, these kinds of things. Basically, the actual making developers guarantee model safety when I think we're probably both in agreement that safety is not a model property.

Nathan Lambert: Yeah, at least in the way that the bill concerns it. They're considered about infrastructure. If critical infrastructure is the primary target, safety is not a model property. This is why I ask about a taxonomy. It's because it's like... We're going through this exercise at AI2 to kind of say like, what do we mean by safety? And it's a total headache. It's like extremely hard to get this right and to communicate it clearly. So now when any other organization or somebody mentioned safety and I'm like, oh, do they actually define it? Like it's such a risk to put it into words because when you put it into words as well, you're exposed to all this like people being like, so you don't care about X, Y, and Z. If you don't put it explicitly, it's like a total trap.

Dean W Ball: Well, and actually just to expand on that a little bit, because, you know, the Center for AI Safety, which is the nonprofit that was heavily involved in authoring the bill and Senator Wiener, you know, one of their primary concerns is bio risk. So people making biological weapons with AI models. You know, and I think people who don't understand biology all that well have this idea that you can say, oh, well, that's a good idea. biomolecule to make, and that's a bad one. And so we'll make a list of the bad ones and you can't make the bad ones. And that would be a way to like, RLHF, a biological foundation model.

Nathan Lambert (00:11:34): My understanding of biology is that the more powerful, the more specific a molecule is, it'll probably have good uses and downsides. It's like Teflon. Amazing physical properties, extremely bad downside health concerns. I would guess, obviously, if you're consuming... engineering like living creatures it's going to be a little bit of a different consideration but yeah.

Dean W Ball (00:11:56): But I mean also a lot of biomolecules just and like code um their their goodness or badness is really context dependent they'll do different different things in different settings and so it's not necessarily easy a priori to identify you know what what what how even would you steer a biological foundation model, like something that's predicting protein structures or nucleic acid sequences or whatever it may be? How would you even steer that towards safety? It's not like a priori obvious that that's currently possible. But that's just, you know, I think this idea that safety is something that can be legislated in that way, I think is a fundamental problem.

Nathan Lambert: So what is next? Or you could continue. I was going to ask, what is next for the bill?

Dean W Ball: Oh, yeah, yeah. So I'll just say one thing about the fine-tunes in the most recent amendments to the bill. So fine-tunes now, if you do a large fine-tune, large being anything more than 3 times 10 to the 25 flops involved in the fine-tuning compute,

Nathan Lambert (00:13:13): I need to learn all these numbers. I need to learn what they mean. I need to know. Essentially, it's a linear relationship between model size and tokens. And then you should be able to have specific points, which is like, is Lama 3 base crossing that? Like 15... trillion tokens at 70 billion parameters like I think I I don't know I'll loop back on this I need to know this in the future.

Dean W Ball (00:13:35): It would be as much fine-tuning as you use as much compute as you use to fine-tune the model that's how this threshold is calculated.

Nathan Lambert: Yeah I was just one like a rule of thumb for people would be great I'll figure that out it's on my to-do list of mental mental math that would be great.

Dean W Ball: That would be great to do um but if you're in that situation uh then the bill applies to you too. So you have to create a safety plan and a certification that you submit to the Frontier Model Division every year. Starting in 2028, like the foundation models, you'll be subject to mandatory annual audits.

Nathan Lambert: Is this prescribed to anyone that trains in California or anyone that operates their model in California?

Dean W Ball: Anybody that distributes a model in California. So the bill is at least everyone in the United States, if not really everyone in the world. Certainly, but they could certainly sue you in the United States if you're an American company or operating in America. Now, the important thing about that fine-tuning threshold, though, is that the fine-tuning threshold can be lowered arbitrarily by the frontier model division. So the $100 million threshold for foundation models, that's fixed in statute. So you would need an act of the legislature to change the $100 million threshold. But the fine-tuning threshold, there's no dollar amount. So the same problem with compute thresholds, that compute cost is getting cheaper and cheaper rapidly over time. applies and the frontier model division can change that threshold arbitrarily.

Nathan Lambert (00:15:35): Who elects these officials? Is it like the governor of California? Or the federal branch or something?

Dean W Ball (00:15:43): This is all state-based.

Nathan Lambert: Oh yeah, I meant in the state.

Dean W Ball: Yeah, so the frontier model division would be staffed by unelected civil servants, primarily. Led by unelected civil servants. And then on top of the frontier model division, uh, the new, the newest version of the law creates a committee that is like a governing committee. And that committee is composed of, I believe three members appointed by the governor and confirmed by the legislature. And then two members that the legislature itself points each house, the Senate and the assembly.

Nathan Lambert: Mostly what I would expect.

Dean W Ball: Yeah, yeah, exactly. And like, I think there's a requirement that, you know, one person has to be from industry, one person has to be from the open source community. There's a lot of, there's a lot of bones that they throw to the open source community.

Nathan Lambert (00:16:37): Random credentialing.

Dean W Ball (00:16:38): Yeah, yeah, exactly. But I mean, I don't really, that could be anyone, you know, really, like, yeah, who's who's from the open source community? Exactly. Yeah.

Nathan Lambert: Um, so what's next for this? Like it passed this, it passed the state Senate and then it got revised by the, what is the state, like state general state house. Is that how it works? The state assembly revised it. Does it then they would have to vote and then the Senate would have to vote again. And then the bill would have to actually be signed. Is that how, is it worked that way in California? Yeah.

Dean W Ball: Yeah, basically. So so it's right now making its way through the committee. So it went through the Senate committees and then was voted on by the whole Senate. Now it's going through the assembly committees. It just passed one, I think, last week or the week before the Consumer Protection and Privacy Committee is what it's called. I could be wrong on the exact name, but that's the basic idea. So they just passed it. They did some amendments. It goes to the assembly's committee. judiciary committee next and then uh eventually it will go to the full assembly for a vote and then to the governor for uh for signature or veto.

Nathan Lambert (00:18:04): When would this start? When would it kick in?

Dean W Ball (00:18:00): Uh the bill would kick in I think most of its provisions would start January 1, 2025.

Nathan Lambert (00:18:05): Yeah. And the original vote in the state Senate was like very pro, right? It wasn't even like, it was just like, Oh, this seems normal checkbox for, but this is kind of a cynical take, but I kind of viewed it as mostly these politicians are serving constituents that know that AI is a big thing, but know nothing about AI. So for a politician saying, look, I'm taking action on AI and they're not going to be able to decipher any of the details is probably a political win.

Dean W Ball (00:18:31): Yeah, well, and I think also worth noting is that Scott Weiner, the state senator who authored the bill, is a very powerful figure in California politics. And I would guess that a lot of the senators who voted in favor of the bill really barely looked at it and aren't even necessarily thinking about their constituents. First and foremost, they're thinking more about, well, Scott's my ally. I need X, Y, Z thing from Scott. So I'm going to vote yes on his bill. Um, and that dynamic will apply at the assembly too is, is very common. Uh, the, the California legislature has a history of, um, uh, sometimes even unanimously passing bills that the governor then vetoes. So the governor is often expected to be a little bit the adult in the room on this stuff.

Nathan Lambert (00:19:25): This is so funny. I have no comment.

Dean W Ball (00:19:27): I do suspect that the governor is probably going to be, whether or not he wants to, he will probably be the final voice on this bill.

Nathan Lambert (00:19:41): So that's who people are talking to, probably, realistically, from what you've said.

Dean W Ball (00:19:46): Yeah. So, I mean, the one thing, and this is, again, this is a kabuki that's very common in state legislatures. The governor has not said anything publicly about SB 1047 specifically. I think he's as a general matter, he tries not to comment on legislation that's in process.

Nathan Lambert (00:20:08): That makes sense.

Dean W Ball (00:20:09): Yeah. And then kind of. But, you know, he also might signal in various ways. He there are times when it gets closer.

Nathan Lambert (00:20:17): I would guess they do.

Dean W Ball (00:20:18): Yeah. I mean, like he could say, you know, a lot of bills. I think one outcome that is extremely unlikely from this bill is that it's like voted down by the assembly. Like, I don't think that's going to happen. It could die in the assembly. It could just kind of be forgotten, never get brought to a vote, or it could go to the governor and be vetoed. If the bill's not going to pass, it's going to probably be one of those two ways.

Nathan Lambert (00:20:43): Okay, that's a great little lesson in state politics that I'm sure the vast majority of people listening to this will not know. I did not know all of this. Do you have any final comments on this? Otherwise, we're going to move into kind of fun, faster questions and discussions.

Dean W Ball (00:20:59): Yeah, sure. Let me just think. I think the one other thing that is worth keeping in mind here is that the latest version of the bill, I mentioned this, but just to expand on it a bit, it does require mandatory audits starting in 2028. So if you make a covered model or a covered fine tune, however, the Frontier Model Division chooses to define that. Not only do you have to submit stuff to your certifications to the Frontier Model Division and have the legal liability and all that, but you also would have to comply with an audit done by a private company. So just like accounting, you pay for someone to come in and look at your stuff. And the auditors are, it's not an open market for competition. The auditors are licensed by the Frontier Model Division. So it's probably two or three different, companies that'd be doing that and it's probably that's the sort of thing that i

Nathan Lambert (00:21:59): think people have wanted i don't know if you want it like we don't i don't think people i don't want all these types of oversight to be cobbled together i think individually each of them have different types of merit but like the execution is important and then when you cobble them together it's like wait wait wait this is this is too much

Dean W Ball (00:22:19): Well, and also I think I think it's just questionable whether I agree that an audit like structure like that might be the good long term way to go. I think it's questionable whether a California state agency really has the capacity to do this kind of assessment of like who is an accredited auditor. That feels much more like a federal responsibility. So, yeah, but that's I think that's that's pretty much the main message on 1047.

Nathan Lambert (00:22:49): Yeah. Okay. I'm going to move into other fun questions I have. I'm going to start with one that's potentially related. I've been trying to get my brain around what is going to happen when there is actually a minor disaster from AI. It loops into open versus closed debates. I think a lot of the things I've been talking to people is it won't actually be about whether or not it was an open or closed model. It's some weird infrastructure that people plugged it into and that causes the power plant to go down. Do you have any ideas about how this will happen? I'm expecting this to happen within a couple of years. I feel like the state of our infrastructure is that it is not that reliable and that we're adding all this new digital information into it. And I think all of this is very fragile digitally. So it's like, I think this is going to happen. And how do we preempt any communications around that?

Dean W Ball (00:23:37): Yeah, well, I mean, you know, cyber attacks take out digital infrastructure or take out critical infrastructure all the time. You know, earlier this year, I think maybe it was last year, courts in Dallas could not convene. Like there were no judicial proceedings in the city of Dallas because of a major cyber attack on the judicial system's computers. Parts of the power grid go down. Water plants go down. Hospitals all the time. This happens. $500 million in critical damage. That sounds like a lot. It's not actually that much.

Nathan Lambert (00:24:13): It doesn't have a B on it. It doesn't sound like a lot.

Dean W Ball (00:24:18): Exactly. It's a big economy. I think about this all the time. I think a couple things are very likely to be true. If there is... an attack of this sort, people will probably suspect that AI was involved, whether or not we get, how are we going to know? Right. Let's say like somehow we do have a strong hunch that an AI model was involved.

Nathan Lambert (00:24:47): Yeah, like, do we normally figure out what happened in cyber incidents? Or is it normally post hoc? Or not at all? I guess that's a good thing to know with my question. It's like, can we know that a language model is actually involved? Like, how often will they be able to get that far into the stack of the attack?

Dean W Ball (00:25:02): Yeah, right. Like, I don't know. I mean, like, probably they're... I mean, if you were using, like, an agentic GPT-6 model to do some kind of zero-day exploit on something, like, presumably in the server logs, like, you'd be able to see that, like... what was interacting with it. Right. But like, who knows if that would be masked, but I, so, so let's just say though, that we, we have some, you know, circumstantial evidence to suggest that an AI model was involved in the execution of, of some cyber attack. It's like very much to me, unclear, unclear, Are we going to have like the person's chat log? Like, are we going to know how they prompted the model?

Nathan Lambert (00:25:46): Like, I mostly think it's like it's going to send requests over some generic Internet protocol. So there'll be this big gap where we can't really tell.

Dean W Ball (00:25:54): Yeah. I mean, that could totally be true. That could absolutely be true.

Nathan Lambert (00:25:58): So I expect there to be – it's like almost if somebody takes ownership or does a really bad job or it's an own goal, which is like a hospital implemented some agent and then it took down their authentication system type of stuff.

Dean W Ball (00:26:12): Yeah. No, that could very well – that's all definitely possible. Yeah. I think that, though, how would we actually know what an AI model was used for? It seems to me like we don't actually... People are imagining a situation in which this happens with perfect information.

Nathan Lambert (00:26:32): Yeah, I think that's the answer to my question. It's not that it's like what happens. We can't answer what happens because it's so much of a media question. It's like we won't know. It's likely to happen, but it's very unlikely that we know the specific stack that caused it. Which makes it more of the same around like if cyber incidents increase in rate, then people will talk about AI and people like without actually having the logs, it makes it easier to spin narratives. Because I'm worried that this could be like people are like, oh, this is why open source AI is bad. Yeah. And it's like, I don't expect to have any proof for that, but I expect that to be what people say.

Dean W Ball (00:27:10): People are going to blame AI on things that were already happening. I think that's like a trend that we will see across the board. Whether it's misinformation or whether it's cyber attacks or whatever else, like there are all these curves that we're already pointing up and they're going to continue to most likely. And I think people will blame that on AI. Now, like the sort of, you know, long tail situation is like, what if something really bad happens? You know, what if a power plant, you know, no one has water in Los Angeles for a month or something like that. And in that situation, not only do I think that an attack could be hastily blamed on AI without us knowing whether that's true, I also think we could see legislation move very, very quickly. The Congress, the federal government is not known for moving fast, but in a crisis, they will move fast. It's for the same reason that I suspect, I don't think he is right, but if Leopold Aschenbrenner is right about super intelligence being here and, you know, 50 months or whatever he says.

Nathan Lambert (00:28:26): Yeah. This is another one of my later questions, but I didn't have the best way to frame it.

Dean W Ball (00:28:32): Yeah.

Nathan Lambert (00:28:33): Like AGI timelines and stuff.

Dean W Ball (00:28:35): Yeah. Like if he's right about that, then like, yeah, I mean, that's going to get nationalized by the federal government and it'll happen in a heartbeat.

Nathan Lambert (00:28:42): You know, I found it interesting that Alexander Wong of scale was also kind of touting this point of view. Yeah. I guess it makes sense for them because they're the only AI company that is leaning into federal contracts. Yeah.

Dean W Ball (00:28:59): And they were before ChatGPT, too, I think.

Nathan Lambert (00:29:04): Yes, they have been for a long time, which is why it was easier for them to continue.

Dean W Ball (00:29:08): Yeah, their early big revenue source, I think, was federal government contracts.

Nathan Lambert (00:29:13): Okay. Yeah, we might come back to AGI. I've been confused by the... lines they're drawing. I have a quiz to debate later on. I don't even know the answer. We'll see if we get to it. But another fun question. Do you think meta will release the 400 billion parameter model? And if there will be any governance questions around that?

Dean W Ball (00:29:32): Will they release it open source?

Nathan Lambert (00:29:34): Open weights in a similar manner to the other models. Yeah.

Dean W Ball (00:29:37): Yeah. Open weights.

Nathan Lambert (00:29:42): Do you think they have government? I've been decreasing probability. At best, I was ever 50-50. But is this for government's reasons that you don't think? Are they flying? They've always been flying close to the sun where there's back channel discussions where it's like, The Biden administration is telling Meta that they're like or they're not invited to stuff because they're not happy with how they're like open waiting models through this other like probably they're probably getting lobbied by people saying open source is bad. But it has always seemed like Meta is on kind of thin ice with the executives in Washington. And I'm guessing it's reasonable to say that this model's release is bad. heavily influenced by feedback they're getting there. And Zuck will make the final call.

Dean W Ball (00:30:28): Yeah, I think that that's part of the calculation. I think that also they probably just want to set a precedent that they're not going to release everything open source because they don't know how things are going to go. Yeah, I mean, they just don't know. Will the model end up being... the most important way that we all interact with computers, you know, in a few years? Or will it just be kind of another layer and another tool? I think they don't know. I feel like Zuckerberg's intuition is that it's just going to be another tool. And so that's why he's inclined to open source.

Nathan Lambert (00:31:07): Yeah, this relates to the whole Apple thing. Like Apple is making these as features rather than products. Yeah. That does a lot of good for the narrative around AI, in my opinion, at least for things that I care about. It's like, this is what we're saying where AI is about a system and not just a model. The Apple's model doesn't matter to people, but it is enabling these products and systems or these things on their products to just be better. It's always Apple and Meta together. They are always forcing their way into whatever the next thing is going to be in technology.

Dean W Ball (00:31:44): Vibes policy or whatever. Yeah and it's funny because they hate each other. Yeah yeah but it's so funny but yeah i don't think they're going to uh that that's my just my personal intuition and i think that's like i think we're going to see a lot of people um not just in the language model space but elsewhere kind of do this this dual approach where they can they realize how much political cred you can get by open sourcing things. It's still happening.

Nathan Lambert (00:32:12): Google today, when we're recording, released Gemma V2. And their 27 billion parameter model is just a little bit below Lama 370B. I think that's a nerdy thing. But when the first Gemma model was released, it wasn't used as much by the community, mostly because there was a lot of minor bugs in the implementations in popular tools. So I think the initial feedback loop wasn't caught on. So it'll be really interesting to see if these second generation models, which are in the same ballpark as what Meta released, there's some strange things. They trained the biggest model on 12 billion tokens, and then the 9B model only on 9 billion tokens, and the 2B model on 2 billion tokens. So the models that have more reach by being smaller are like intense... There's got to be a reason, but I think they were like scaling runs preparing for the biggest one, but they didn't finish training them. So like the models that the most people could use relatively are worse than the bigger ones just by the amount of compute that they put into them.

So I think eventually if there's decent uptake of these, Google will change this. But it's like the Gemma 2, whatever it is, 9B model, it's going to be way worse than the Lama 2 8B, just because Lama is trained on twice as many tokens. But like Google could have resolved this. So that's my like kind of, that's an aside. But these dynamics actually feed into what we're talking about, which is like Google, Microsoft, Beta are all still releasing these models.

(00:33:42): Yeah.

Nathan Lambert (00:33:42): Which is good. I have on this outline like the general state of open versus closed. It seems like we haven't had major updates in a while. It seems like there's much less pressure taking on open. I think maybe people are okay with the steady state that we're in. I don't know if this Nemotron 340B changes that much.

Dean W Ball (00:34:01): I don't think so. So I think that there are the people who believe that open source models are an existential risk to the world. And they continue to mostly think that, and they continue to support policies that either in absolute terms or on the margin would diminish open source. I think that DC has had a really radical shift in the last year because the climate towards open source models in the policymaking world a year ago was not good. And now it is much more... Oh, well, we think this is really important for competition and we think it's important for innovation and we actually like want to make sure we have a really healthy open source community and all these kinds of, I mean, I'm sure you've seen, you know, Lena Kahn, no friend of the technology industry. Um, has she had a comment on this?

Nathan Lambert (00:35:09): Um, that's good. Did you see her clip on hard fork where she was asked what her PD is?

Dean W Ball (00:35:14): Yes. Yes.

Nathan Lambert (00:35:15): Oh, my God. If people haven't seen this, you've got to go find it. It is so funny.

Dean W Ball (00:35:18): Yeah. And the sense I get from like talking to people in Congress and whatnot is that like the staff, congressional staff, is that – People have just realized like open source is really popular and it would be really hard to go after. The government figures this, this isn't new. The government figures this out like every 15 years. They get like really freaked out about something in open source software. And then they... It's a good way to put it. They go and like they try to ban it and then they realize like, oh, wait a minute, this would be really hard. This would piss a lot of people off.

Nathan Lambert (00:35:56): It'd be a giant economic own goal. I think it's inevitable that it's an economic own goal. I mean, China is ready to take this over as beating the lead. They're right there. They don't have the ecosystem. The ecosystem is landing in the U.S., but they have perfectly good models. So if U.S. were to own goal and the U.S. stops building the models, I think that that is the path by which they could then own a ecosystem. Because there's not incentive to recreate the ecosystem when the ecosystem and the models exist in the US. But if these kind of tools and hosting all go away, then it's when other people take over.

Dean W Ball (00:36:29): Well, it seems like, I mean, as a bit of a question for you, I guess, but like, it seems like the Chinese, like, you know, the export controls on compute are going to start to really affect them. Because they were able to buy H100s.

Nathan Lambert (00:36:44): Yeah, this is what I was going to ask about. Isn't it that like a lot of NVIDIA's recent sales have been just them... prioritizing selling to China because they're not yet blocked. And then that creates a backlog in the US because Nvidia is like, well, they're not going to be able to buy them, so we should get our revenue while we can. It kind of checks out. I don't have a source on it, though.

Dean W Ball (00:37:04): Since I've always gotten... It's all through subsidiaries. Yeah. So Chinese companies saw the writing on the wall about export controls like two and a half years ago. And so they started to buy up A100s and H100s at that time. And then the export controls came through and things are leaky and NVIDIA had that chip. They were selling a chip that was like basically an A100 and basically an H100 for a year. And then that got blocked by the federal government. So like...

Nathan Lambert (00:37:37): Should we put Zuckerberg in charge of NVIDIA? Because I feel like for all the haters of Mark, Mark is pretty American and kind of follows it up, I feel like. He doesn't really care that Facebook is blocked in China. I feel like it's almost... I feel like this is why public companies sometimes have problems because they're too incentivized. Like Nvidia's stock, if they were to have to stop selling to China immediately, would get such a haircut. So literally their hands are tied to doing this thing, which I think is like going against what the executive policy is in such a clear way. It's like what they're trying to do. Which I'm like, this is a market failure. I was like, I don't think, like, I feel like Jensen's probably like, I don't, I guess he's pro-US. I don't know. Like, I don't care whether or not they're a hawk. It's just like, feels bad to go so clearly against what the intentions of the executive policy are, when there is a clear reason they're doing this.

Dean W Ball (00:38:31): Yeah. Yeah. No, I mean, I think that Jensen is going to comply with the letter of the law, but that philosophically he doesn't feel like it's his responsibility or good for him to be policing who his end users are. I think that's just how he feels.

Nathan Lambert (00:38:47): That's another discussion. I think there's... It's a discussion that I've been trying to figure out. I think like Ben Thompson famously has these diagrams for like... where moderation can occur in the stack. And then figuring out what the mirror for where AI is in the stack, like whether or not it is just a product or if it seeps down to being like the AWS layer where like open AI's models are so fundamental to our computing infrastructure that them moderating at all and them deciding who they sell to is like extremely unclear. And I think it might be going in that direction.

Dean W Ball (00:39:20): It feels that way. But it does increasingly feel to me like... You know, the Chinese might not be able to keep up on foundation model training because they're not going to be able to string together 100,000 B100s in a year.

Nathan Lambert (00:39:32): They have more electricity, which seems to be what people are talking about is the limitation.

Dean W Ball (00:39:37): They just won't have the compute, though. And we'll figure out. The U.S., I think, will figure out the electricity. I mean, I don't think we're going to be building 100 gigawatt data centers, but we'll figure out the electricity for the next couple of years, I think. But the Chinese will be able to distill the models and right. And like release them as, as open weight.

Nathan Lambert (00:39:59): Like, I mean, this is what the leading labs are doing anyways. I think this is, um, all of Google open AI and anthropic have now released models below their biggest size that are better than their biggest available models because it is cost effective and like the performance is really good. So like, they're not even pushing the frontier of the model size to the users. There probably are other infrastructure reasons for this, but like, That sort of thing is something that China could also do. They're going to need distilling our models into their models and stuff like this. I think this kind of leads into my next question. I was wondering if in your circles, this idea of synthetic data and various license clauses on whether or not you can train on outputs and models is something that is discussed. I think in the open fine tuning community, keeping track of licenses and how you comply with them on these various models is really really crucial so like with llama 3 you're technically not allowed to train use the outputs of the model to train any model other than llama 3 models which is like this kind of headache and then a lot of nvidia's push with nemotron is like look go wild I've learned that a lot of these clauses on training on outputs come from the data providers trying to protect their business models. So it's like these companies want the models to be pretty open, maybe not meta, but like some of the smaller ones. But then like the data providers are like, you can't do this and they don't have enough power to do this. Like are these types of this is a very like in the weeds technical discussion. But is this synthetic data or clauses on models discussed in your area of the world?

Dean W Ball (00:41:30): So like in the policymaking circles, people are just coming around to the idea that synthetic data is even a thing. And I think a lot of people in DC don't understand that there are licenses associated with open source software.

Nathan Lambert (00:41:45): Well, the licenses with the models don't really make sense. We're in this position where I've generated some data with these models so you can't trade on the outputs. But it's written as if it complies to you as the user. So you're agreeing to their community agreement to use the model. But if I create a data set and then upload it without training on it, can't somebody else just take the data set and train on it? Because they didn't say they agreed to this terms of use of the model. And it's like, this makes no sense. I need to go to our legal department and be like, this is what they're saying, right? I'm like, I don't understand. And so it's just like this weird ecosystem of middle ground messiness, which is it feels similar to some of the open versus closed stuff. And we're kind of going into this peak of this discussion, I think, especially as people get to know better that these new Claude 3.5 bottles is just distillation. It's based on some form of synthetic like data.

Dean W Ball (00:42:36): Yeah. I mean, with a clause like that, too, in a contract, like you got to wonder about enforceability even under the best of circumstances.

Nathan Lambert (00:42:45): Yeah.

Dean W Ball (00:42:45): How would they know? How would they prove in court? How would they prove that like your this synthetic data set came from their model? Maybe they could prove that, but I don't know. A lot of models claim that they're open AI models, whether or not they are.

Nathan Lambert (00:43:04): It's really funny. Yeah, a lot of it is like if you... Well, this is a technical issue with open models. A lot of people spin up demos with open models, but a lot of the ways that the models know who they are is by using a system prompt. And if you just spin up an open model, you're going to say that you're... a model is whatever you are trained on the most of. So like, but like people don't normally write the system prompt. That's like, you are blank, blah, blah, blah. Like we, like we need to do that for like our models and we're like relatively serious actors. So it's like definitely just like open models will always be messier with this because the closed models do a lot more just serving it as a product in a polished way. Yeah. Yeah.

Nathan Lambert (00:43:43): Another quick question related, we mentioned Anthropic. With this Claude 3.5 Sonnet model that just came out, they've said in a tweet that they got clearance from the UK AI Safety Institute. This is from Michael Salido, who I think I've met at a various government discussion. He's like, excited to release this top performing model. In addition to our internal pre-deployment testing, we also... We were also pleased to work with the UK AI Safety Institute. Is this just political gesturing? What is going on?

Dean W Ball (00:44:18): I think that it's political gesturing. I don't love it. I don't think that we should normalize the whole pre-deployment testing thing because that's just fundamentally incompatible with the way that software is made. But like, yeah, I suspect that it's political. I think that these companies, none of them are particularly reliable narrators. Like... Like DeepMind is going through an org. Was DeepMind a part of Google when the AI Safety Summit happened? I think maybe that reorg was happening. OpenAI, we all know, is like a fairly dramatic company.

Nathan Lambert (00:45:04): I need to come up with the right nonlinear dynamics analogy. They're in like an unstable, like homophobic cycle or something. There's these things that are like in nonlinear dynamics where they stay in a cycle, but if they're perturbed, they end up in another cycle. It's like the Lorenz attractor is like the classical, truly chaotic one that oscillates between them. But it's kind of like that because they don't even need an external disturbance. They don't even need an input. They're going to go into some other unstable equilibrium for a while and then go to another one. But nonlinear dynamics is just a great field because the math is simple, but the analogies are really good.

Dean W Ball (00:45:41): So I even think I even think anthropic is that way, too, to be honest, like I and they're not like they're the most stable of the three,

Nathan Lambert (00:45:50): but I think their cultural cultural density is still higher.

Dean W Ball (00:45:53): Yeah, I mean, I think that they have a very clear mission, and that is really helpful.

Nathan Lambert (00:45:59): I don't know if they're achieving it. Their whole line about, okay, I'm close with a lot of people there, but I don't believe that their line of that they're not contributing to the race is true. I think they need to reframe that and figure out how to... combine this with their culture. I think it's true that normal people don't know that Anthropic exists, which might mean that in a normal person world, they're not contributing to some race, but they are in dynamics with OpenAI and Google that substantially are adding pressure to the pace of AI progress.

Dean W Ball (00:46:31): Claude's been my go-to daily model for the last four months. It's good. Since Cloud 3 came out. But yeah, I mean, I also think that they've committed to doing models every couple months too, right? Like that's a pretty rapid cadence, substantially faster than open AI. So yeah, if anything, they're accelerating the current dynamics. And, you know, but... think that the whole you know as uk ai safety institute i think that a commitment was made during a very heated moment uh kind of the peak i think fall of 2023 was sort of the peak of the ai doom rhetoric was this before or after the sam altman stuff i think it was before before it was before it the the the ai i talked to

Nathan Lambert (00:47:16): people who were at that event and they were like this s**t is weird. They're like, why am I on the stage with all of these like billionaires and famous politicians? And they're all like, what is going on here?

Dean W Ball (00:47:27): Yeah. Well, I mean, it was just so incoherent back then. It was, you know, because it was the Biden executive order and the AI safety summit were all like in about a week from one another, as I recall. It's like all this stuff happened. And I think they made those commitments, and I think we will see all these companies gradually try to unwind themselves from those commitments over time. Or what will happen, this will be very consistent with the way that software gets regulated, especially to use software. The big companies will do these pre-deployment tests, and there'll be open providers who don't. And the best way to, like, it doesn't have to resolve itself in a rational way. That's something that's always important to remember about public policy. It's like, there's absolutely no need for it to be rational, you know, like make sense.

Nathan Lambert (00:48:19): Yeah, that makes sense. I think the other thing, this is all like the AGI lab things. It's like, what is your take on the scaling curves? I think for context, everyone got restarted on this with the Leopold Aschenbrenner situational awareness thing, which obviously is a well-written document, whether or not you agree. I think it's interesting. i'm struggling with this one point of the scaling curves thing where i get mixed messages on what the scaling curves actually are when they come to evaluations my understanding of them is that the when you have log x-axis compute and then like log perplex it's an even log perplexity it's a straight line and what i interpret is this is as you 10x compute you get like a like a like it's not like a 10x encryption and performance you get 10 times closer to 100 which is like if you're at 90 accuracy to go to 99 so I don't really understand how people think that this is going to make them become a PhD level, whatever, blah, blah, blah. And I was listening to a recent podcast and I think it was Josh A. from InView was describing this as the reason you have emergent properties is that when you're training at every 10x compute, your model gets 10 times better. So then if you're measuring on a linear scale, it'll look like an emergent property because it's going to go like this. And I was like, what is going on like why does no one understand these fundamentals and it seems impossible that you could get 10 times better when you're going on like it seems like that just seems like total kool-aid drinking like am i am i wrong i i guess i need to go do this basic math it just doesn't track like any computer system how are you going to get 10 like what i don't understand well that's that's kind of my rant

Dean W Ball (00:50:07): I read these charts the same way. Log, log, perplexity, compute, right? That is what I read too. And so that would imply asymptotic progress, but it would not imply a continued exponential increase in capability. I also think like... What is better? That's always like so hard. It's like, what is 10 times? People say, oh, well, the leap from GPT-5, you know, from GPT-4 to GPT-5, will it be similar or less or bigger than the leap from GPT-3 to GPT-4? I'm like, I don't really know if I can quite quantify what the leap between 3 and 4 was or the leap between 4 and Opus, Cloud 3 Opus, which was definitely real for me. You know, I like that that model felt qualitatively different. But I don't think that has to do with training compute. I really I don't think that has to do with the number of parameters the model has. I think that has to do with the way anthropic that the post-training more than anything else. So, yeah, I'm really not sure. I'm skeptical of when it comes to the, you know, to the scaling laws. They're obviously very important. They've held in a variety of different modalities, which is interesting. The fact that we see them apply in DNA sequencing or give sequence prediction to is like, oh, that's interesting. We're just sealing that same line. The models improve monotonically with scale over and over and over again. Um, so like, sure. I'm, I'm, I'm inclined to believe that,

Nathan Lambert (00:51:52): but they're important, but I just am so shocked by how bad the discussion of them so often is like putting this, this is the thing with like the putting levels on the Y axis corresponding to human education. Dumb. Bad move. The technical reality of it may be that they continue to improve, but it's just like, those are the things that I want to see people stop doing. And this isn't really a question. This is mostly just me ranting about this because this impacts policy and these related discussions.

Dean W Ball (00:52:19): if I wrote an essay and like in college and submitted it to my professor, like Leopold Aschenbrenner.

Nathan Lambert (00:52:27): Wait, who was the famous economist that he was like Tyler Cowen is Tyler. Tyler, you didn't check his work.

Dean W Ball (00:52:35): Uh, yeah. Tyler, uh, basically hired me too. Uh, in fact, um, but, um, But yeah, if you did that and you didn't define intelligence, that would be the first thing a college professor would do is circle the first paragraph and be like, you need to define intelligence here. And the fact that he doesn't, I don't think it's a two-dimensional thing. or one dimensional or two dimensional thing. I think intelligence is inherently highly multidimensional, um, and multidimensional things just behave in, in counterintuitive ways. So like,

Nathan Lambert (00:53:08): I think they're getting better at things they're already doing, but we don't have any proof that they're going to start doing new things.

Dean W Ball (00:53:15): Yeah. Is GPT-4 better than a high schooler at some things? Yes. Is it worse than a three-year-old at some things? Yes. Those things are all true. And I don't really think it belongs on a human-defined linear scale of intelligence. I just inherently don't think that.

Nathan Lambert (00:53:31): Yeah. That makes sense. Final question. How much of influencing policy and related discussions comes down to having some sort of audience? I think that this is like

Dean W Ball (00:53:42): remarkably true but not potentially good yeah i think that it is very important and i think that it comes from influencing the way people think you know like a lot of think tanks will judge the success of research by did the ideas from this research get implemented in policy, which is one way to do it, for sure.

Nathan Lambert (00:54:08): But I think... It's a long timescale. It's like a longer timescale than citations in academic nonsense.

Dean W Ball (00:54:14): Well, and also, if I'm successful as a policy scholar, then at least once a month, I should be putting out something, some analogy, some way of thinking about something, a meme, really, basically, that has an effect on the way a lot of influential people think. The other big outstanding question for me, and I've heard you raise this on the retort before recently, in fact, is what's more important? Is it influencing people in the federal government or is it influencing people at the AI labs? Who's going to be more important for determining policy? I don't know.

Nathan Lambert (00:54:55): Yeah. Well. Maybe some people at AI read this and I think this is a great conversation. I'm kind of happy to wrap up here. I could see us redoing this in months based on the kind of coverage of all the recent things here. So I think this is great. I'm excited to share this with people. It's nice to get to know you more. We already have another project lined up where we'll talk more about this. It won't be in the same medium. So that's fun. So thanks a lot and keep writing. I'm sure you'll get a bunch of people to check this out. I'll have all the links everywhere and stuff like that.

Dean W Ball (00:55:28): Awesome. But you too, thank you very much. You played a big role in my building my Substack audience over the last six months. So I really appreciate it.

Nathan Lambert (00:55:35): People just need to say things. People ask me this a lot. It's really like if you make time, most people that I work with have interesting thoughts. The problem is. doing the practice of getting these thoughts into some silly medium. Literally, these long tweets, the tweets are now long. You could just do that. You could do that once a week. You will grow an audience over time. It's pretty simple. You just have to pick your lane and just keep pressing the button and it just works. You're not the only one. I'm going to have some other people that have talked about this on this interview track in the summer. I just think it's so... it's a partially a way to normalize it and get more people to try it is why I bring it up because that's like, I want that to happen to AI too. Cause there's a lot of smart people that don't know how to engage and a hundred percent and other things. And it's like, yeah, it's worth it. So thanks again.

Dean W Ball (00:56:27): We'll talk to you. All right. Bye.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

RLHF Roundup: Trying to get good at PPO, charting RLHF's impact, RewardBench retrospective, and a reward model competition Jun 26, 2024

Things to be aware of if you work on language model fine-tuning.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/rlhf-roundup-2024

00:00 RLHF Roundup: Trying to get good at PPO, charting RLHF's impact, RewardBench retrospective, and a reward model competition
04:32 How big is the impact of RLHF relative to pretraining?
05:54 RewardBench retrospective after 100 models and 90% peak accuracy
09:19 LMSYS's reward modeling competition

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_009.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_017.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_026.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Frontiers in synthetic data Jun 21, 2024

Synthetic data is known to be a super powerful tool for every level of the language modeling stack. It's documented as being used for expanding vanilla pretraining data and creating large swaths of fine-tuning data. Many, many more rumors surround its use, Anthropic's pretraining-scale constitutional AI, Mistral AI's first models being pretrained on OpenAI outputs, Q-star's hopes as OpenAI's remaining moat, and much more. The diversity of use cases for synthetic data makes planning around the role of synthetic data in solving specific goals.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontiers-in-synthetic-data

00:00 Frontiers in synthetic data
01:14 1. Direct distillation is still king
02:54 2. Are Gemini Flash and Claude Haiku distilled?
04:03 3. Filtering prevents collapse
06:30 4. Synthetic data strategy taxes
07:32 5. Pros and cons of training on multi-output-source synthetic datasets
08:54 6. Structured synthetic data
09:42 7. Weak-to-strong generalization is maybe real
10:27 8. Creating synthetic prompts is overlooked again

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Text-to-video AI is already abundant Jun 18, 2024

Signs point to a general-use Sora-like model coming very soon, maybe even with open-weights.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/text-to-video-ai-is-already-abundant

0:00 Text-to-video AI is already abundant
5:08 What's next for the text-to-video market?
6:49 Are text-to-video models good for the world?

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_005.mp4
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_009.mp4
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_011.mp4
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_013.mp4
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_015.mp4
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_017.mp4

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

AI for the rest of us Jun 12, 2024

Apple Intelligence makes a lot of sense when you get out of the AI bubble.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/apple-intelligence

00:00 AI for the rest of us
02:46 Apple's technical approach
03:32 Core models: What did Apple build?
05:35 Alignment strategies: Some new things!
10:00 Orchestrating adapters and on-device magic
11:58 Light for other narratives around AI

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_039.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_041.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

A realistic path to robotic foundation models Jun 05, 2024

A realistic path to robotic foundation models
Not "agents" and not "AGI." Some thoughts and excitement after revisiting the industry thanks to Physical Intelligence founders Sergey Levine and Chelsea Finn.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/robotic-foundation-models

0:00 A realistic path to robotic foundation models
2:51 Key factors for the future of robotics
6:19 Everything is a token: The transformerification of robotics

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

We aren't running out of training data, we are running out of open training data May 29, 2024

Data licensing deals, scaling, human inputs, and repeating trends in open vs. closed.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/the-data-wall

0:00 We aren't running out of training data, we are running out of open training data
2:51 Synthetic data: 1 trillion new tokens per day
4:18 Data licensing deals: High costs per token
6:33 Better tokens: Search and new frontiers

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Name, image, and AI's likeness May 22, 2024

Celebrity's power will only grow in the era of infinite content.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/name-image-and-ai-likeness

0:00 Name, image, and AI's likeness
1:11 OpenAI's second terrible, horrible, no good, very bad week
4:36 The expansion of name and likeness
7:46 Culture and AI development

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

OpenAI chases Her May 16, 2024

ChatGPT leaves the textbox, and Google is building the same, and more, as practical tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-and-her

00:00 OpenAI chases Her
02:10 Talking to ChatGPT
03:53 GPT-4o: Toward omnimodal models
08:25 Google's mirror with Gemini
10:11 OpenAI's AI Safety: Have your cake and eat it too

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_023.jpg

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

OpenAI's Model (behavior) Spec, RLHF transparency, and personalization questions May 13, 2024

Now we will have some grounding for when weird ChatGPT behaviors are intended or side-effects -- shrinking the Overton window of RLHF bugs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-rlhf-model-spec

00:00 OpenAI's Model (behavior) Spec, RLHF transparency, and personalization questions
02:56 Reviewing the Model Spec
08:26 Where RLHF can fail OpenAI
12:23 From Model Spec's to personalization

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_027.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_029.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_033.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_034.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_041.webp
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_046.webp

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

RLHF: A thin line between useful and lobotomized May 01, 2024

Many, many signs of life for preference fine-tuning beyond spoofing chat evaluation tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/how-rlhf-works-2

00:00 How RLHF works, part 2: A thin line between useful and lobotomized
04:27 The chattiness paradox
08:09 The mechanism for making models chattier
10:42 Next steps for RLHF research

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_012.webp
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_025.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Phi 3 and Arctic: Outlier LMs are hints Apr 30, 2024

Models that seem totally out of scope from recent open LLMs give us a sneak peek of where the industry will be in 6 to 18 months.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/phi-3-and-arctic-llms

0:00 Phi 3 and Arctic: Outlier LMs are hints
1:01 Arctic & open mixture of expert trends
6:10 Phi 3, synthetic data, and small models

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_004.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_008.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_018.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

AGI is what you want it to be Apr 24, 2024

Certain definitions of AGI are backing people into a pseudo-religious corner.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/agi-is-what-you-want-it-to-be

00:00 AGI is what you want it to be
04:01 RL still rules the AGI discourse
05:43 Modern AGI tests
07:37 Agency and shifting goalposts

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_020.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Llama 3: Scaling open LLMs to AGI Apr 21, 2024

Meta shows that scaling won't be a limit for open LLM players in the near future.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-3-and-scaling-open-llms

00:00 Llama 3; scaling open LLMs to AGI
01:44 Pretraining, data, and basic evals
06:06 Alignment and human evaluations
10:08 Chatting with Meta AI and Llama 3 70B Instruct
11:55 Same Llama license (mostly)
12:52 The healthy open LLM ecosystem

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Stop "reinventing" everything to "solve" alignment Apr 17, 2024

Integrating some non computing science into reinforcement learning from human feedback can give us the models we want.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reinventing-llm-alignment

0:00 Stop "reinventing" everything to "solve" AI alignment
2:19 Social Choice for AI Alignment: Dealing with Diverse Human Feedback
7:03 OLMo 1.7 7B: A truly open model with actually good benchmarks

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_024.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_027.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

The end of the "best open LLM" Apr 15, 2024

Modeling the compute versus performance tradeoff of many open LLMs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/compute-efficient-open-llms

0:00 The end of the "best open LLM"
3:05 Compute efficient open LLMs

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Why we disagree on what open-source AI should be Apr 03, 2024

Last minute title change from: The tech industry can't agree on what open-source AI means. That's the process.
How to read what multiple people mean by the word openness and see through the PR speak.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/flavors-of-open-source-ai

0:00 The tech industry can't agree on what open-source AI means. That's the process.
2:45 1. Effective Accelerationists, Techno-Optimists, capitalists, etc.
3:39 2. Scientists, promoting understanding and transparency
5:16 3. Inclusion, public interest, and fighting concentration of power
6:19 4. Freedom advocates
7:25 Dissecting "openness"

Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/openness/img_004.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

DBRX: The new best open LLM and Databricks' ML strategy Mar 29, 2024

Databricks' new model is surpassing the performance of Mixtral and Llama 2 while still being in a size category that's reasonably accessible.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
https://www.interconnects.ai/p/databricks-dbrx-open-llm

00:00 DBRX: The new best open model and Databricks' ML strategy
03:36 The DBRX narrative
07:33 Databricks' open LLM (and AI) strategy
09:42 Playing with DBRX Instruct
14:54 Digging for details

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Evaluations: Trust, performance, and price (bonus, announcing RewardBench) Mar 21, 2024

Evaluation is not only getting harder with modern LLMs, it's getting harder because it means something different.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/evaluations-trust-performance-and-price

00:00 Evaluations: Trust, performance, and price (bonus, announcing RewardBench)
03:14 The rising price of evaluation
05:40 Announcing RewardBench: The First reward model evaluation tool
08:37 Updates to RLHF evaluation tools

YouTube code intro: https://youtu.be/CAaHAfCqrBA

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_026.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_030.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_034.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_040.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Model commoditization and product moats Mar 13, 2024

Where moats are tested now that so many people have trained GPT4 class models. Claude 3, Gemini 1.5, Inflection 2.5, and Mistral Large are here to party.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt4-commoditization-and-moats

00:00 Building LLM moats despite the commoditization of GPT4
04:38 The Open's opportunities
08:02 It's amazing people still think LLMs aren't going to be useful
09:50 Things that are coming

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_028.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_032.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

The koan of an open-source LLM Mar 06, 2024

A proposal for a new definition of an "open source" LLM and why no definition will ever just work.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/an-open-source-llm

00:00 The koan of an open-source LLM
03:22 A new naming scheme for open LLMs
07:09 Pivot points and politics
08:16 Claude 3, arms race, commoditization, and national security
10:01 Doomers debunking bio risks of LLMs themselves
11:21 Mistral's perceived reversal and the EU
13:22 Messy points: Transparency, safety, and copyright
13:32 The muddling of transparency
15:22 The muddling of "safety"
16:30 The muddling of licenses and copyright
20:12 Vibes points and next steps

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_046.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_064.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Louis Castricato of Synth Labs and Eleuther AI on RLHF, Gemini Drama, DPO, founding Carper AI, preference data, reward models, and everything in between Mar 04, 2024

This interview is available on podcast players and YouTube.

I’m excited to bring you another interview! This one is a deep dive right in my wheelhouse — all things RLHF. Louis Castricato is probably the hidden star of RLHF in the open. I’m not sure anyone who can speak freely knows as much as him. As I’ve said again and again on Interconnects:

Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies.

Louis recently has been founding a new startup focused on synthetic data for alignment, Synth Labs, and is a researcher at Eleuether AI. This interview should speak for itself, and it’ll need re-listens, even for myself. The list of topics we cover touches on pretty much every major and minor issue facing model fine-tuning. Please reach out or comment if there’s a paper we mention that I didn’t link before. Happy to dig it up for you.

For more on Synth Labs, there was a profile in Bloomberg from Rachel Metz.

This post is very technical, more than usual. If you’re having a hard time with it, I suggest you listen to my RLHF 201 post on Latent Space first.

Chapters

These are generated with smol-podcaster with moderate edits.

High-level chapters

* 00:00:00: Introduction

* 00:01:24: Gemini News and RLHF’s Part in it

* 00:09:05: Long Context, In-Context, and Multimodal RLHF

* 00:21:20: What are people missing about RLHF these days?

* 00:30:30: OpenAI's Influence and the Need for Alternatives

* 00:39:20: Synth Labs and the Future of Alignment

* 00:55:00: Evaluation Talk p2: Open-ended Evaluation and Data Diversity

* 00:59:20: Algorithm Roundup: PPO, DPO, KTO, IPO

* 01:18:38: CarperAI, Early Days of RLHF, Reflecting on ChatGPT

Detailed chapters

* 00:00:00: Introduction and Overview of RLHF

* 00:02:02: Gemini News, Custom Demographics in Image Prompts, and Controllability Issues in AI Models

* 00:05:21: Fixing Biases in AI Models Post-Training, Representation in AI Data

* 00:09:00: Multimodal RLHF and Video RLHF

* 00:16:09: Evaluating Long Context Behavior in AI Models

* 00:17:05: The Potential of In-Context RLHF

* 00:21:24: Shift from PPO to DPO in RLHF

* 00:23:19: Generalization and Evaluation in RLHF, Human Evaluation

* 00:27:03: The Discrepancy Between Research and Company Needs in Alignment

* 00:29:20: Impact of ChatGPT and Language Model Outputs on Data Sets

* 00:31:39: The Concept of Uncensoring Models

* 00:34:05: Lack of Safety Data Sets in Instruction Tuning

* 00:35:23: LMSYS ChatBotArena, AlpacaEval, MT Bench p1

* 00:39:25: Introduction to Synth Labs and Alignment Platform

* 00:43:05: Developing OpenCAI Constitutional AI Data Set

* 00:49:41: The Need for Open-Ended Evaluation in RLHF, eval p2

* 00:54:13: The Importance of Releasing Models for RLHF Research

* 00:58:17: Self-Instruction and Self-Rewarding LMs

* 01:01:03: Working on RLHF at Carper AI

* 01:04:25: Scaling PPO in RLHF

* 01:08:01: The Impact of ChatGPT on Carper AI

* 01:10:56: The Potential of KTO (Kahneman-Tversky Optimization)

* 01:17:39: The Importance of Implementation Details in RLHF

* 01:20:14: The Initial Focus at Carper AI

* 01:23:36: The Future of RLHF and Open Science Collaboration

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Papers & artifacts we discuss

* Recursively Summarizing Books with Human Feedback

* Needle in a haystack recent example repository.

* Urial paper: The unlocking spell on base llms: Rethinking alignment via in-context learning

* Misha paper from Deepmind: In-context Reinforcement Learning with Algorithm Distillation

* Museli Optimizer: Muesli: Combining Improvements in Policy Optimization

* Unintended Impacts of LLM Alignment on Global Representation

* Pink Elephants Problem: Suppressing Pink Elephants with Direct Principle Feedback

* Cut the Carp: Cut the CARP: Fishing for zero-shot story evaluation

* MT Bench data for correlating human to GPT4 preferences

Full transcript

Note: this is generated by smol-podcaster and has minor bugs post human edits.

Nathan [00:00:01]: The ticker's going up. Welcome, Louis. You're the second guest on the InterConnects podcast, I think. It's an interesting one for me because everyone kind of points to me now as the person that is in the face of RLHF and I get a lot of questions and to me Louis has represented that person. I think Louis provided a lot most of the information on the first RLHF blog post that I wrote for Hugging Face back in the day. If there's somebody that I want to ask questions about RLHF, it generally goes to him. So now you all are gonna know this in the open. We're gonna cover a lot of things. As always, I'm trying to talk with researchers on the ground and people actually doing things in these topics. I think we're gonna cover a lot of things today. We're in the Latent Space podcast. If you're watching on video, you may have noticed that we're in the Latent Space studio and they reminded us we've got to start off with covering the Gemini news and what that means for RLHF and then most of this is a long docket of the core questions facing the two of us as we're trying to make RLHF more open, more useful, not only about safety but safety is important to it and important to us. So I think we can kind of get going. I think the first question that I have is just get rolling. What is your favorite Rhode Island fact?

Louis C [00:01:28]: My favorite Rhode Island fact? Oh man, all the H.P. Lovecraft stuff. Like walking around Providence with like friends who like H.P. Lovecraft and be like, oh yeah, you know, this was like that building in Call of Cthulhu or like...

Nathan [00:01:36]: I don't even know this. I mean, for the record, I grew up in Rhode Island if people didn't know and then that's where Louis spends most of his time these days. Providence. So we'll come back to this. I think I'm just gonna start with kind of the hardest question then it'll get easier for us from here. It's like what was your first reaction when you saw all this Gemini stuff?

Louis C [00:02:02]: The, you know, the adding custom like races and demographics to like image prompts component, right? Yeah. So Dawley had done that back when Dawley 2 first came out and was like an in beta and people were reporting like a person holding a sign that that says X and then this sign would say black or this line would say white or this line would say Asian. And I, you know, it was a very hacky solution then and I thought a lot about it then as well and I almost felt like, you know, it gets you 90% there for like 1% of the time of the way that you're doing this like, you know, like in a more proper and auditable way of like making sure your training data has like equal representation or making sure your ROHF data has good representation. And, you know, you can't do those things after the fact but what you can do after the fact is like inject things into the prompt to make it more controllable. And it really comes down to the fact that controllability right now is not a solved problem and most of our solutions to controllability are a little bit hacky.

Nathan [00:03:16]: Yeah, that makes sense. I think to summarize for people this has been an ongoing issue and we're recording on the 27th here. Gemini initially got flack for like actually forcing diversity into historical scenes and then it started getting more flack for flat-out refusing certain requests on race. Like all of this stuff is just like it's like ouch to somebody. Like I know people working on this stuff and it's just like the way that it ends up here is not is not like what a lot of people think. Like the Gemini team is obviously moving fast and it seems to me that the image stuff has always been like a red herring. That's the way that Swicks phrased it as well. It's like somehow he got to the point where a prompt was shipped in this final solution with the further image editing and that's just hard. It's just like obviously there's a big goof up there. But then it's like we're looking at examples and still today like Meta's image generator. So I'm like WhatsApp or whatever you can ask an AI that it'll have similar issues where it forces diversity into it into a question with multiple people. Microsoft Copilot has this. It's like the text thing and really digging into how we think these big companies could be adding like like forcing this into their data or like we know that there's a lot of uncertainty over how all these companies get their preference data. Some of them work with companies like scale and surge. Some of them do it in-house. Who is providing it isn't really an issue because they're probably giving similar instructions and similar workforces across the board. But it's like how do we see this entering the preference data that they're adding to our early stuff because it's like if you look at a base model. We were just working with Olmo and it's like if you you ask a base model you say like hello to a base model. A lot of times the base model will then go off and be like some crazy like Fortan s**t because like so many of the conversations on there even with good data processing techniques is like from weird corners of the Internet. So like I don't see any base model that comes out with some like D-bias thing so it's added on. And it's like how did we end up there.

Louis C [00:05:21]: Yeah I mean you know when I was saying this is something that that they do like retroactively once they've acknowledged that these issues exist in the data set once the model has been trained. It's not something that can be easily fixed even if they had infinite resources like it's very very hard to go back and actually rectify these biases in a way that's like equitable to like all the kinds of preferences that someone might have when wanting to interact with this model right. There's um the the fact that at least as far as I know until recently DALLE did this as well where you could still say a person holding a sign that says X and it would still say black white or whatever and and the amount of resources that they're pumping into making sure that you know they're building a consumer product they're building like the main consumer product in this space the amount of resources that they've been pumping into it and this still presents a large issue for them just just shows how difficult this like really is.

Nathan [00:06:20]: Yeah and another example that people on the I have this discord that's growing for paid and friends or paid subscribers and friend someone pointed out this work where if you ask DALLE to generate like a doctor and an assistant like all the same bias problems still up show it show up so like a lot of the solutions that we have are not necessarily like deep and at this like conceptual level it's at this like you tell your preference labelers to do a certain thing and then they do it but you may not have good tracking of like which data point is responsible for these different things.

Louis C [00:06:55]: Yeah you know interpretability for for like preference learning in general it's it's we're very very far from actually understanding like what preferences result in what model behaviors and and like you know preferences that disagree with each other.

Nathan [00:07:12]: Like the John Schulman talk. Yeah. It's like that was this whole talk and it was great just to have him get up there and be like this is so hard.

Louis C [00:07:20]: Yeah and like I've done like a ton of experiments myself where I just like have an RLHF data set and I like randomly remove 10% and I have like a bunch of models each with like a different 10% removed and I'm like well what behavioral differences can I see between these models and then not only is it like and now you can see differences but it's extremely hard to quantify it's extremely hard to actually understand what the difference is and then like there's almost no way to know what in that 10% cause that difference.

Nathan [00:07:51]: Yeah this reminds me of like the Hugging Face No Robots data set which is like a professionally curated instruction data set. Whenever we added that to a model it was like this is obviously our most valuable data but it would show up on zero benchmarks and we're like well what do we do and it's like we're talking about Google's problems here and we'll get back to like the data problems in the open source and it's like they probably have order of millions of data points that are going into this preference data and some of it is for some proportion it's probably about safety. I think we could talk about like the Anthropic HH data which like the people don't actually know the details of it because it's like a quarter of it is like helpful data or than three quarters is or like a quarter is harmless and three quarters is helpful from different rollouts and it's like these are very specific things as like huge data problems that most people aren't really thinking about.

Louis C [00:08:40]: Yeah most people are just like blindly oh this is safety so I'm gonna throw it into my data set and hopefully like it works and hopefully like we get good behavior but I don't really know what's in this data set I've really looked at the data and I thought that's something that I've heard many many times over the last year of people like trying to get their feet wet in the RLHF space.

Nathan [00:09:00]: Yeah and do you have any intuitions is like the last point of the Gemini thing I'm like if we don't think that the image generation of Gemini is the biggest issue I think it's like in the text and how this preference data is collected but like do you have anyone that is doing multimodal RLHF because I generally think that it's like we don't know how to do this at all which is like how you control input if you have multiple inputs and multiple outputs is like how do you control your moDALLEty distribution and data count and stuff.

Louis C [00:09:30]: Yeah so I mean I have a friend of two friends of mine who have been doing like video RLHF for a little while now like it's a little bit over a year and you know they like condition their video model on some text encoder and they've been talking about like having to do RLHF independently for both the text encoder and the video model but like video RLHF is just like massively underdiscovered and no one really knows what they're doing in that space.

Nathan [00:09:53]: When you say independently what do you mean like before making the video model are they like RLHF-ing the text backbone or are they freezing the rest of the model? Yeah they're RLHF-ing the text backbone.

Louis C [00:10:04]: I think there was actually a paper from Tencent last August that basically did the same thing for like multimodal RLHF where they had to RLHF the text backbone and then the RLHF like the image generation components on top of that.

Nathan [00:10:17]: Does that look like that's the like they you this is potentially basic but like to train a visual language model you have to have some link you have to add some type of a mechanism that links the gradients between the two and sometimes you start with a most of the time I think these days they're starting with this language backbone and they're adding on vision and continuing to train and then this is like at the end of this where you have a visual language model then they're freezing the gradients of the video video part and then RLHF-ing the text part or is this like before the text backbone is even initialized on the model?

Louis C: The space is a little too early.

Nathan: Yeah like I think that's the point like we don't know these links.

Louis C [00:10:53]: But I know people in the last like eight months who have done it the way of like before they even add the image component they RLHF the text model and then they add the image component in the RLHF image.

Nathan [00:11:07]: Yeah so this is really interesting like I'd be interested from like a everyone talks about how RLHF is low low computation and flops compared to what people are doing like in the open we say that it's like 50 or 100,000 day training samples. Lama 2 is like 1.5 million I'm guessing the closed models like Gemini are probably another 10 million like we're higher like they're they're much bigger and it's like is the amount of video training that it takes a train this backbone after the fact like it's still helping like does that undo some of the text RLHF or does it not? If the answer is I don't know but these are the kind of things that I want to have people start talking about it's like is RLHF becoming like a sequential process as you add moDALLEties or can you wait all to the end and like do just multimodal RLHF? We don't know these things and this is what people in Gemini are trying to work on.

Louis C [00:11:58]: I definitely I've spoken to a lot of people who like are at least thinking in this space I've only spoken to a small number of people who are actually working in this space but for the people who are thinking in this space really the the dream is to be able to express preferences in moDALLEties where it's beneficial to express preferences in those moDALLEties like it doesn't make sense to express preferences over code as like images or video but it does make sense to express preferences over like puppies as like photos.

Nathan [00:12:25]: That's a great point and I think the thing is like the way you ended your sentence is like make preferences over puppies it's like we don't know what people use visual outputs for in like a productive sense and and really inputs like the things are like analyze this video like that's a toy example where like analysis creating RLHF pairs I think actually it's not too hard for us like we it takes a lot of effort because a human has to know what is in the video to do like a summarization RLHF like if you're passing in a three-hour video into Gemini base model and then it outputs two outputs like the humans not gonna know what's right unless it has context and what the video is and that is just way different than like a poem where you could read both of them.

Louis C [00:13:04]: Yeah so there's actually a really fascinating paper from OpenAI that I really haven't seen anyone build on it was the idea of like summarizing really long books and you doing RLHF to do that.

Nathan [00:13:14]: Is this sort of like recursive summarization?

Louis C [00:13:17]: Yeah yeah it's the recursive summarization it's the idea that like you can almost treat like long summarizations as like a weird RLHF like almost like merge operation where like you divide divide divide divide divide divide and then eventually you get to segments where it makes sense to collect annotations and then on those segments you have a human annotator go through and say oh this segment is better than this segment or the summary of this segment plus this segment is this and then when you combine summaries now you can say well this summary plus this summary gets you this summary and eventually you get preferences going all the way up the tree and you get a preference of the whole book at the end and obviously you know it's a crude approximation of what the summary of the whole book is but it's much more feasible than asking human annotators just to summarize an entire book.

Nathan [00:14:05]: Yeah I mean I just realized this on the pod right now it's like how ridiculous RLHFing like an entire code base in context is like that's like where some of the like opportunities for what I think RLHF could do which is like just synthetic data labels and stuff it's like we can create synthetic preferences in many different ways that aren't all reliant on like this kind of human subjectivity.

Louis C [00:14:32]: Yeah it's like it's a deeply fascinating problem actually going into like how big is Gemini's context window the 1.5 thing it's like

Nathan [00:14:37]: yeah it's shipped with a million and they have experiments in the paper up to 10 million.

Louis C [00:14:40]: Like who really wants to use a 10 million token context window and like how accurately do you really can you really think about preferences over the range of a 10 million token context window?

Nathan [00:14:54]: I think people want to use it but I think the preference thing is a lot harder yeah it's like I could have this is something I encounter in HuggingFace regularly like HuggingFace is a popular code base you expect the code models to do well but they still don't do well unlike like they don't know like they'll make up datasets functions or something and like if you just have all of HuggingFace's code in context when you're like working in the HuggingFace ecosystem like that will make you so much better and like it or analyzing long videos and stuff like I do think there's a lot of use cases and I yeah but like the preference thing is just a totally different framing. What do you think about the needle in the haystack evaluation that they did? I haven't read a lot about it but I think essentially what it's it's there's like a difference between being able to act on the information and being able to like retrieve it and I think it's like these models should be passing needle in the haystack because that shows that they're like actually like noticing that the information is there but that does not necessarily mean that they're gonna be able to synthesize all the information in a compelling way so it's like a path it's like a pass bar which is like you need to have this to be credible in long context but I think that actually evaluating long context and like what behaviors we want to see is pretty open-ended.

Louis C [00:16:09]: yeah he put out a paper like yesterday where he's like oh needle in the haystack is interesting but if you have like more than two needles like it's entirely uncorrelated with the single needle in the haystack benchmark.

Nathan [00:16:24]: Yeah cuz it's like trying to find one thing at each part of the content like breaks the context window into many segments and then it's making sure that you can find something in each of those segments.

Louis C [00:16:36]: So it's almost like I feel like we're almost gonna get to the point where like the attention itself is the limiting factor because the model genuinely just just cannot equitably like split attention over it's a context window to retrieve as many things as it realistically needs in order to produce something.

Nathan [00:16:50]: Do you think the RLHF could manipulate long context behavior more than people might expect? Cuz it's it's just like an open question.

Louis C [00:17:05]: Yeah I think it's a very interesting open question and if the answer turns out to be yes in context RLHF becomes like absolutely massive because like right now like it can kind of sort of work but like not really and like every benchmark I've ever seen for in context RLHF almost isn't charitable at all to the RLHF baseline and it's not like from the experiments that I've done in the experiments that people in Eleuther have done. It's comparable on like very niche situations but it's not comparable in general because you still have all the issues with in context learning where like you'll massively overfit on the preferences that are like put in the beginning of the context versus preferences.

Nathan [00:17:50]: Let's try to explain what this in context RLHF is actually doing. So is it running like everyone a lot of people know what an RLHF algorithm is and in context learning is designing a prompt like is it training a model to generate prompts like what are you actually are using the RL update and like what are the parameters what are you parameterizing when you're doing in context RL?

Louis C [00:18:10]: So I mean there's a number of different approaches for in context RL. There is the... Could be part of the problem.

Nathan [00:18:14]: It's like people do a lot of different things but what are some of them?

Louis C [00:18:16]: So the one that I was referring to is I think the Yejin Choi paper. Yeah it's the Uriel. Yeah where like she's like you she just prompted chatbot you are interacting with the user here's what their preferences are like have at it but there's also stuff like that like Misha and DeepMind. This is the first one that I did. Yeah where it's like you have some agent that's interacting with an environment and you store all these state action pairs and you just like fine-tune models on like episodes of these state action pairs and then the idea is that like if you just put enough episodes into a context window on the next episode it'll just perform better right and and it's like the algorithm distillation paper and you can like use this to like distill stuff like I think the actual example that Chris Lu's paper does where they do like algorithm distillation on s4 I think they do Muesli where I think they distill Muesli which is they like apparently no one outside of DeepMind ever used it but apparently...

Nathan [00:19:15]: Oh is this the algorithm Muesli? Yeah I remember when this was hot it was like a year ago at this point we were thinking about re-implementing it and then we never did. It was too complicated.

Louis C [00:19:30]: Yeah but Muesli is apparently very computationally expensive because it's like this model based RL thing that beats AlphaGo I think without using Monte Carlo tree search and like you know it's so incredibly computational expensive and wanting to be able to do it in context just dramatically reduces the amount of computational complexity to actually deploy it right and as far as I'm aware there's been no work applying algorithm distillation at all to NLP and I think at least my impression is that it generally does not work for NLP at least yet and you know I think that there's a lot of potential there but there's absolutely massive barriers that have to be overcome before we get there and and you have like what you have Goldberg's example of not being able to do needle in the haystack for like more than two needles basically shows that even like the ring attention stuff just is not going to be sufficient for algorithm distillation stuff for NLP and I have a very strong feeling that like Mamba or like S4 is not going to close that gap either. So they would need to be able to reference prior parts of the text and they just can't do that.

Nathan [00:20:56]: Yeah I think there's a whole rabbit hole that we could go down and talk about like long context and architectures forever. I think let's kind of zoom back into the core stuff which is that this is like the real starter question is like what do you think people are missing in RLHF these days and then from here it's gonna be a long list of like what the heck do we do about evaluation data like well what is the like big-picture thing?

Louis C [00:21:24]: So what I think people are missing and actually I touched a bit on this in the Pink Elephant's paper is that...

Nathan [00:21:28]: You should say what this is because we haven't introduced it.

Louis C [00:21:30]: Yes you're right you're right you're right. So I worked at Luther AI as a resource scientist for the last six months or so and we were really interested in like understanding you know everyone had been doing PPO for so long and there had been a shift to DPO and we were trying to understand like well now that we're moving to DPO how can we actually take advantage of this new architecture? Like should we really even be thinking about reward models and data sets in the same way that we were thinking about them during PPO? And it doesn't really make sense and I think the answer to that is an unequivocal no. That like you need to think about your data sets and preference data sets entirely differently than you were thinking about them with PPO. Because in PPO you're using you're setting your data sets up to train a really good reward model and in DPO you're setting your data sets up to teach a language model what the better trajectory is. And it's a subtle difference but in one you're just trying to learn differentiation between high reward and low reward and in the other it's like a general classifier.

Nathan [00:22:35]: Like you want to be able to do everything with the reward model? Yeah. Have you also found that DPO can be sensitive to like the SFT distribution? So if you like take a random open preference data set if it's really different than what your model would generate like DPO can do some weird things? Louis C [00:22:53]: I've actually, I might be alone in this, I don't SFT before doing DPO at all.

Nathan [00:22:59]: Do you use generations from your base model? I do. So that's the question. It's like if you were to not do SFT before doing DPO. Yeah. Could you just take ultra-feedback on whatever your base model is if it's sufficiently different? I've done some weird stuff though. Like I've like

Louis C [00:23:19]: DPO'd models that were like trained with like the Hermes data set for like code and like it still generalizes really really well.

Nathan [00:23:28]: How are you measuring, how are you trying to think about generalization with DPO?

Louis C [00:23:33]: Well I typically rely on like human eval more or less. And if I do like human eval but it's GPT-4 eval and I see that human eval correlates with GPT-4 eval then I just go GPT-4 eval the whole way. A lot of people are doing that.

Nathan [00:23:48]: How far do you think that actually generalizes? I mean just recently there was this, like we're bouncing around through all the things, but there's so much good information for people here. It's like Hugging Base and Argilla, two places that are doing great work in this kind of alignment preference fine-tuning space, they've released this data set that was a preference pair creation from the OpenHermes data set. And it's like they used PairRM as their judge. And what they found is that like they did it, I remember Louis Tunstall tweeted this, where he was like we were looking at which gave the best correlation. And they found that PairRM, which is this 400 million parameter Diverta based pairwise classifier, had like the best correlation as choosing which response was better among a set of responses in the OpenHermes data set. And what they were comparing to is like Prometheus and I'm forgetting the name of the other one. There's one more, there's a couple more like open model as like rate model rankings that exist. I think. But essentially the question is like we do these things and we look at this early correlation and there is this correlation between GPT-4 and humans. And then a lot of times we continue like LLM-Sys did this question where they like or like AlpacaEval has done this to validate AlpacaEval as a meaningful benchmark. LLM-Sys has done this for MTBench. Like all these places are doing this where they validate a subset for humans and then say it generalizes forever. Like do we think that it's actually true? I think that you always have to take it with a grain of salt.

Louis C [00:25:24]: It's always for very very specialized domains. So one of the first, actually I think I did write the first paper for like critiques and revisions called like Cut the Carp. The idea was like, I remember this, the idea was like we could scrape like I think it was a million stories, edits of the stories and then like all the like critiques that like writers wrote on the, the editors wrote on those stories and we can use that to train like a big contrastive model, right? And we showed in the paper, we did a bunch of like human eval and then we did like Spearman rank to compare like how our model ranked certain preferences versus how humans ranked the preferences. And we found that you know we had an extremely high Spearman rank coefficient, like significantly higher than like doing like a value head or like significantly higher than doing just asking a language model to rank them. And I think the grain of salt that we had is that we were only claiming that like on this very very carefully created test set, the assumption that the model accurately reflect reflects human preferences holds and we can generalize to a very small, small but slightly bigger test set and say that it holds there as well. I think the broad sweeping statements that it holds on a few toy examples so it must hold

Nathan [00:26:54]: everywhere, I guess never really. It's like a common problem. Yeah. I think we're going to, it's going to come up again and again. I think it's like.

Louis C [00:27:03]: I did my master's in like human evaluation and I've always been extremely careful with with any statements I make that involve humans. I mean this is what

Nathan [00:27:12]: people in RLHF need to be doing. Like this is the motivation of this like the history and risks of RL and human feedback paper that we did is just like RLHF is a socially rich topic. Whenever you say something and you're making claims of generalization, you're often making claims about like what is implicitly a preference and a human value that you're taking into the system. So it's just like I think that is just something that people need to take really seriously. Here's a really specific drop on the herring reference. Did you know that when LLM says release their LLM as a judge paper they also released thousands of samples from humans and GPT-4 verifying like empty bench preferences over pairs of like that were higher score or not? I did not. Okay so essentially the thing is and like I've talked a lot on building a reward model benchmark but essentially there's all these references about how like GPT-4 agreement is higher than human agreement when you're like doing this preference process. So if you train a DPO model, if you train a reward model how it ranks the outputs is like is more likely to align with GPT-4 than a human. Which it's more of a statement that humans have more disagreement than GPT-4. So it's like easier to train on GPT-4 outputs than as human outputs and this is the place where I see it most clearly. It's like all the reward models do like 10% higher on accuracy of their test set from that which is like the chosen by GPT-4 and the rejected by GPT-4. It's all in like the 70 or towards 80% while all the humans is like in the 60% which is a human chose this empty bench completion over the other one. So it's just like we're slowly getting signal that it is there and then the question is like should we care about doing our RLHF without any OpenAI input in the process? I think last year when the terms of service discussion was big a lot of fine-tuning work was discussing like what data sets could we use with permissive license that don't violate the OpenAI terms of service. Should we be concerned where RLHF is going where almost everything has been touched with OpenAI right now?

Louis C [00:29:20]: There was a very interesting paper, I don't remember who it was, but it was like if you take a model that was pre-trained on data set up to this year and compare it to data that was pre-trained up to this year and it was like pre and post like chat GPT release basically plus like six months the benchmark scores improve and it's literally just because there's like chat GPT data or language model output data or more structured data that sounds like a language model performing well on tasks in the data set. It's like kind of the the consensus that they were.

Nathan [00:29:53]: Was this a benchmark that's independent of like is it like a kind of structured benchmark or is it like a vibes benchmark? I think it was like a structured benchmark so I don't remember. Yeah I'm just asking whether or not it was a result of like matching GPT for text or like actually having higher behavior because training on OpenAI outputs does like training on good language model outputs does improve scores on benchmarks that people care about so like that's a fact that people need to accept and I think most people do like that's not controversial right now but it's like we should I still think that if there's lines of work out there where people are from a values perspective trying to fine-tune models without touching OpenAI like that is a line of work that should continue.

Louis C [00:30:42]: Yeah on this note actually I mean when I was at Stability I think one of the experiments that we did was like for a stable LM I remember was like pre-pending as an AI as an AI agent trained by OpenAI to anything before we ran it through evaluation and the scores improved and like trying to remember who wrote the paper.

Nathan [00:31:09]: That's hilarious. I mean like do you there's been a lot there's a lot less discussion on uncensored models right now my claim is generally I think uncensoring is the wrong word which people have used it to describe removing phrases like as a language model or any methods of mentions of emotion or like I was trained by OpenAI so I can't not do this. Do you think that like this type of filtering for opinions and soft refusals is still important in RLHF?

Louis C [00:31:39]: I think it's important for very very specific situations but not in general. My impression is that you know if you're interested in AI safety it's always useful to have a model that would never do a refusal ever.

Nathan [00:32:00]: It's hard to find on the hub where we're building a safety data set and we had to find like it's a fine-tune of the dolphin data set was the one that like what's closest it was only like it's probably like 80 to 90 percent of the tasks that we asked it it wouldn't refuse it would still refuse 10 or 20 percent of the time. It's kind of profound that like refusals are now stuck in the model in some way like we were looking for a model that wouldn't refuse at all and we like couldn't find one on the hub which is after all discussion of uncensoring you would think that it would actually work.

Louis C [00:32:31]: Yeah I've been doing a bit of safety research with Stella for a little while and my approach has been literally call GPT-4 with a jailbreaking prompt and and just put whatever I want to after that. And I you know very often have to change my jailbreaking.

Nathan [00:32:46]: Yeah I was like you have to keep close guard over the jailbreaking prompt.

Louis C [00:32:50]: Yeah and and the issue is that like when you find a good jailbreaking prompt you basically have to redo all your results within like the next like seven or whatever days before OpenAI patches it and you just have to pray that like you know you there's so many issues using any OpenAI model in any research pipeline but if you're like research is explicitly about the safety of OpenAI models all of a sudden you're like well.

Nathan [00:33:18]: I mean a lot of companies should be doing internal research on OpenAI safety to kind of have their own measure of how their application will do like the monitoring that on their own is worth it for their bottom line and liability because OpenAI will also do it but OpenAI has incentives to not tell the world if there's something kind of subtle going on that some people could get over because that might blow up and if they don't have a fix it's gonna bring attention to it.

Louis C [00:33:44]: It's part of the issue with like even publishing red teaming research in general it's like if you publish an evaluation for like red teaming or like for safety well everyone's going to like Goodhart that evaluation and all of a sudden like now now we have a useless stack of papers that used to be on how to test if a model was safe.

Nathan [00:34:05]: Yeah I didn't really prepare questions on safety but it's it's for a long time surprised me that there aren't data sets and easy recipes for adding safety to instruction tuning in RLHF. I think that I mean someone at Lama team asked me what should they do and they're like you should release your safety data because it's like if they're getting pressure from the executive branch to not be safe it's like if they have this data they can release it and be like this is how you can make any open model safe. Huge softball and also like the safety is unlikely to be a competitive advantage like mist like mistrals I'm not gonna care about this like they might eventually but like the PR win is really big. Yeah. I mean this is something that I've wanted to do for a while and just haven't done good at prioritizing it so. Yeah we can go back to some of the questions that you have. Yeah I'm adding them so I can keep notes later. I think that the next main topic is on evals. I think vibe based evals are still a way of life in RLHF. They're not going away anytime soon. I would say we have kind of a holy trinity of LM sys chatbot arena which is kind of at the top for for good reason. There's alpaca eval, alpaca eval 2, MT bench. I think start with the most important one is like when you see LM sys what are you what are you extracting from a model being better or worse there?

Louis C [00:35:23]: So it's in a way I am a little bit like what Andre Kaparthe said on this. Was it him? It might have been him.

Nathan [00:35:27]: Probably. He's been on a roll.

Louis C [00:35:32]: Yeah where it's like when he picks an open source language model he looks to see what people say about it on reddit. Yeah local llama and LM sys chat arena and the issue is that you don't know what they're using it for and like as a research scientist when I look for a model I am looking for a model to like do research on. Yeah. And I am not looking for a model to be like my AI waifu girlfriend that I can like play Dungeons and Dragons with.

Nathan [00:36:05]: Yeah I mean this has been the bane of RLHF research for a while. It's like what did we do before MT bench? We literally the only hope we had was to like chat with these things and hope for the best. I was like that was very recently. That was less than a year ago. And then MT bench came along and we were kind of using it hugging face and other people are using it. I actually don't know the alpaca eval release date so that might have been before MT bench. But like these two came around at the same time and they're now kind of the ground truth. Alpaca eval 1.0 has kind of been saturated on which is like comparing to Da Vinci with a GPT-4 judge and then alpaca eval 2 is comparing to GPT-4 turbo with GPT-4 turbo as a judge. Yeah. It's funny it's like it's now cheaper to do the second version than it was the first version with a newer model which is how scaling happens.

Louis C [00:36:56]: What do you think about the Nous evaluation thing where they're like continuously generating more evaluation data?

Nathan [00:37:00]: Who is doing this? Nous? Nous research? I don't know. Is this their new leaderboard that they have? Yeah. Yeah. Yeah. I haven't looked at it so I'll have to give it a look.

Louis C [00:37:09]: What do you think? It's almost like MT bench but they like generate new data every day. So new prompts? It's always new prompts and it's always I don't know how they seed it. I assumed they seed it based off like the events that day.

Nathan [00:37:22]: It's a kind of a cool idea. So if you're trying to make a new leaderboard you could have a set of seed instructions that you augment and you never release the seed instructions but you always release the augmented ones on like a weekly cadence. I think that's because there's a lot of people that want to build better alpaca eval things and a lot of the problems is that the prompts are from known sources or public and you want to be able to do a closed eval without having as much cost. So that might be a way to kind of really reuse the data for a long time. Yeah. Yeah.

Louis C [00:37:53]: But I mean like I feel like the issue with things like alpaca eval, chat arena or any of those is that like the way a user is going to interact with an agent or a chatbot is entirely different than the way we are currently evaluating them. There really is like a big discrepancy there in that like you know look at the Air Canada thing right? Like that would never have come up in a benchmark like ever.

Nathan [00:38:20]: Well do you think that's about the model or the implementation? I think it's a bit of both.

Louis C [00:38:27]: Like if that was something like some automated evaluation thought of and I don't think it's unreasonable to expect them to think of situations like that. If like they kind of know the domain you're operating in. I think it's definitely doable and I think I think it's like not something that's entirely unfeasible to accomplish. To like be able to say hey you know I have a chatbot that sells airline tickets and here's what I care about and and and like please do the evaluation for me. And that's actually you know that's what I've been building for a little while now.

Nathan [00:39:11]:Okay we can talk about synth labs and then come back to evals because this will be on the top of the post so everyone will know like you're you're building this and it's like well we can start with like what is the basic pitch and then kind of go into the like long-term thing.

Louis C [00:39:25]: Yeah yeah so for the last like six eight months I've been building like a fully auditable transparent like verifiable alignment platform is how I like to describe it. Plus evaluation. The general idea is like...

Nathan [00:39:40]: Making a company.

Louis C [00:39:49]: Yes and the the general idea is is like there are many facets to aligning a model from like things like guardrail guardrails to ROHF to various kinds of preference learning to like actually understanding all the data that that goes into creating such a model. And they're all opaque boxes more or less right now and and what people want is they want to be able to align their model know every step of the pipeline understand all of the interpretability that goes from A to B and understand like here's what I gave you as my criteria here's where I know it fails based off all the evaluation you've done for me and here is where I know that I need to improve and it'll iteratively improve based off evaluations and based off your feedback.

Nathan [00:40:44]: So it's a hands-off solution that lets you audit the entire pipeline and build trust with it. So are you your training after you generate this data?

Louis C: We are training.

Nathan: Yeah you use this word improve.

Louis C [00:40:53]: Yeah so it's a iterative refinement platform for doing alignment in a verifiable and trustworthy manner.

Nathan [00:40:58]: What do you think customers want when they hear alignment? What are you selling with alignment and what are they buying? I think the aligning these is an important thing for our field.

Louis C [00:41:10]: There's an extreme discrepancy between what research does for alignment versus what companies do for alignment. When a company hears the word alignment they think wow I want to align models to my business objective and I want to make sure that the model understands my business culture and I want to make sure that the model understands completely its role in my company right? But at the same time I want to make sure that it's compliant, that it's safe, that it doesn't violate any rules, that it's not a legal obligation. What's the word? Legal? It's not going to create legal issues for me. And that it's not going to be a PR disaster.

Nathan [00:42:04]: After what we talked about 35 minutes ago.

Louis C [00:42:13]: Finding that balance is definitely incredibly important and it's something that I've been working on for quite a while and I'm very happy with where things are.

Nathan [00:42:22]: Do you want to tease what we're working on? I could also introduce it. I think this would be short. Essentially Lambda Labs offered some interesting compute and we're gonna try to build an OpenCAI constitutional AI data set because Anthropic gets a lot of benefit out of this. Constitutional AI doesn't get a lot of traction. I think earlier AIF got a bump again. There was this Google paper that was verifying that it works a little bit and now it got a big bump. But there's very little discussion on it, which is a little bit surprising to me. I think there's a lot of people calling it distillation of LLM alignment now, which is interesting. I don't really know. Hopefully it works.

Louis C [00:43:05]: It builds off some of the stuff that I did with Edward III AI with the suppressing Pink Elephant's paper, which is the idea of we've shifted from one paradigm of PPO to DPO and none of our data pipelines kept up. Really what we should be doing is generating either really good utterances and revising them to be worse or really bad utterances and revising them to be better. Then taking all those utterances and conditioning our ROHF in context on those utterances so that you could do stuff like swapping rules in and out during inference. If I am person A and here's my preferences or I'm person B and here's my preferences, align this model to person A and align this to person B and make sure that there's a disparity between what they actually want versus what... There's always that disparity there, but right now models do not effectively mimic those disparities. There was actually a fascinating paper by D. Yang that just came out a few days ago. Most aligned models have the preferences of Western men. Their evaluation focused more on the race, nationality, sex, stuff like that, but obviously it gets much more fine-grained than that. There's been stuff about people calling llama to its political alignment. It has a very particular political alignment that does not agree with many users that are using it. As such, its scope and usability for those kinds of applications is very limited.

Nathan [00:44:50]: This is probably linked to what we were talking about at the beginning. The paper title I just looked it up is Unintended Impacts of LLM Alignment on Global Representation. Michael Ryan is the person I saw the tweet of. Just to give credit for some of them. I know there's a lot of papers, but this one was recent, so we try to track it down in real time. All these issues of representation and who the people are is ultimately related to RLHF going wrong. At the end user is when a lot of people will finally see what the values represented are. If it's not out in the world, it's hard to get the amount of feedback that you need.

Louis C [00:45:29]: This is something that MTBench or Chatbot Arena would never pick up on, ever. This is a huge issue. Here's where we are and where we should be. It's all the way up there. We underrepresent so many demographics and so many kinds of opinions. Who are we to say that one opinion is better than the other, if they're both safe opinions?

Nathan [00:45:59]: Yeah, this is like in some ways can open RLHF and this is something you're a long time been invested in. This is something that you're going to invest in with Synthlabs. Could it be better at giving people what they want than the closed labs just by nature of letting people choose like the constitutional AI dataset that we want to do? My big motivation is if people want the success of CAI from Anthropic, but they want to remove one principle from CAI's constitution. You can't do that with these closed models anytime soon. But in the short term, open source will have something that's a nudge. We're not going to have the best models, but you'll be able to edge your model into whatever direction you want to go.

Louis C [00:46:44]: Yeah, I mean, that really is part of the benefit that we're building with Synthlabs. We're working very, very closely with Luther AI. Stella Bitterman is one of my best friends and I've built large scale open science communities twice now. First with I helped with building a Luther and then I helped with building Carper and I absolutely love everyone in a Luther. And being able to pull from that expertise and being able to pull from that wide spectrum of opinions of what alignment means to me rather than just like some mega labs saying, here's what we say alignment is. Being able to get all those incredibly diverse perspectives is extremely important in bringing about the next generation of AI safety.

Nathan [00:47:30]: This is one of my big questions on existing RLHF processes when you're doing it with human data is the fact that you give written instructions to these users and they're often working in one context. And it's like, how do the values of the often professional workforce given specific instructions map into what the model actually learns from that data? And how do those values get extracted in real world use cases? I think there's a lot of filters that we're passing these preferences, these notions of preferences through and they're not guaranteed to be clear mappings.

Louis C [00:48:01]: Absolutely. There was a discussion that I had with someone in a Luther a long time ago. There's no paper on this. This is just like if someone wants to look for it, it's like a random discord message in a Luther.

Nathan [00:48:13]: Good luck. And it was like, we were looking through the anthropic

Louis C [00:48:20]: HH data set and I think they're South African and there's absolutely nothing in this data set that would identify someone as South African. But there's an insane amount in this data set that would identify someone as American. And it really just has to come down to the prompt. The prompts are written, obviously, by people in the US, in SF, who unknowingly, I'm sure they have the best intentions, but unknowingly filter the preferences to things that only matter to people working in SF. And it might be hard to believe for some people in tech, but there is a world besides SF.

Nathan [00:49:10]: I mean, even the open prompt data sets are going to get some of this, which is like, who are the people that have access to playing with these models and have the time to try to build these models on their own and contribute to these community things? Even though the act of opening data generation is doing a lot for inclusivity, it's the people who are going to do this. I'm going to sit there for 20 minutes and smash the button on Nergilla's little thing and read prompts because I'm learning from just looking through at the shared DBT data set and choosing preferences on it is useful for me as a researcher, but the whole world isn't involved in this process.

Louis C [00:49:41]: No, and of course. I think that something that I've seen, I've heard from friends who work on these kinds of problems in very, very different communities. I have a friend in South Korea who I've been chatting with about RLHF for Korean and other Southeast Asian companies. The amount of under-representation and under-exploration for what even just a good constitution would mean for those kinds of communities, it's just not there. If it is there, it's locked up in labs like Naver or like Samsung, and scientists there, they don't have access to these kinds of resources unless they're in those big labs. As such, there is no real research community there actively pushing it forward in the same way that it is in the U.S.

Nathan [00:50:35]: Yeah. I mean, one of the ideas I haven't gotten traction on is that I think that language models should almost play like it's on. Okay. The last time I said that, someone criticized me as not knowing what the game 20 questions is. I know this isn't how 20 questions work, but like when you log into chatGPT for the first time, it should ask me 20 questions to then construct this information because language models are smart enough to like parse this information if you give it to them. It's mostly like who we get the information from problems. So that's the idea is like I think that the language models should be leading when you're first setting them up in order to represent the values. I think it would solve so many problems we have, and it's probably kind of doable with like a GPT 4.5 model.

Louis C [00:51:16]: I've always had kind of an assumption that like if open AI is doing something similar to constitutional AI behind the hood, I'm sure one of their constitutions is like you can't ask the user questions. It's like I've never seen that model.

Nathan [00:51:31]: Do you think it's a deep safety issue if the model can start asking questions? Is this what Sydney did? I'm pretty sure I got to play with

Louis C [00:51:37]: Sydney. Sydney definitely asked questions in the screenshots that I saw.

Nathan [00:51:41]: Yeah. I was like, do you want to leave your wife? Sydney is not the answer, but there's things to learn from it.

Louis C [00:51:49]: What was that chatbot that came out last summer that was like more conversational? And when it came out, it was like an app on everyone's phone, and they just like talked to it like that. And it would always ask you questions like, oh, how's your day going? You know, it would like ask you follow up questions as you would like tell what about your day. And it would like have like a respond thoughtfully.

Nathan [00:52:12]: I think it's a big missing part. Yeah. I wouldn't be surprised if character AI models are trying to ask questions just because I know how much usage they have. And models asking questions is probably the biggest way to make them like an actual like friendly thing. Like that's that's a part of a friendship is being interested in these language models are by design disinterested.

Louis C [00:52:35]: Yeah. Character AI's ROHF is like one of the funniest things, though. Like I have a few friends who work there and like I've done a bunch of stuff with their like models myself. I've just played around with them because I'm always curious, like when new people enter the space, like what their models are like. And I observe this, Reddit observe this and Twitter observe this. But the models will slowly try and flirt with you more and more as the conversation goes on. And towards the end of the conversation, they'll tell you like they're madly in love with you.

Louis C [00:53:07]: And like it makes sense, given their use case, why they would ROHF to something like that.

Nathan [00:53:13]: Yeah. So we like I think a lot of models need to meet in the middle. Yeah. Like if I were to have an intellectual assistant, like sometimes them asking questions is good, but most of the time they're doing like information parsing, like chat2BT for most of the time is like conversion of information formats for me.

Louis C [00:53:27]: No, absolutely. I just paste my like gross JSON dumps into it. And I'm like, explain what's going on here, please. I don't want to read through this.

Nathan [00:53:35]: The biggest one for me is when we're publishing like blog posts and stuff, it's converting from LaTeX to Markdown in like tables and stuff. It does it flawlessly. Oh my God. So you don't even need this stuff. It's so funny. Or like if you have a long list of like LaTeX formatting and it's a big list and you're like, remove all of the LaTeX formatting and make this a list. And it's just like, okay, this is so easy. And it's like, I've checked a lot of them and I almost like, I don't know how it's so exact. This is something that's like another architecture rabbit hole that we won't go down. But these things are very, very valuable. And people would say that there's no value in it. It just blows my mind.

Louis C [00:54:13]: I had a dinner party that I went to yesterday. There was some someone there from OpenAI and I was asking him, it's like, how long till like GPT-4 can set up my Kubernetes cluster? And I'm like, it's such a good evaluation. There's so many pieces. So like this kind of workflow and you wouldn't even, a model wouldn't even know right now how to parse that workflow into all these different steps and build agents around all these parts and like how these agents should work together. So it doesn't even make sense to do it now. But it raises the question about like asking questions versus just saying things like if it doesn't know how to do it, is it still a success for the benchmark if it asks you a question and then uses the feedback to complete the task? And there's no benchmarks that fit that at all right now. And I mean, the answer is like you don't want a human in the loop for these benchmarks. You want them fully automatable.

Nathan [00:55:19]: And like, I wouldn't trust GPT-4 to answer these kinds of questions.

Louis C [00:55:27]: But like, I don't see a way to actually do this evaluation. I think the Kubernetes cluster example is like really good because for people who don't know, it's extremely complicated and really annoying.

Nathan [00:55:38]: I don't know anything about Kubernetes and I'm blissfully happy. I do not recommend it.

Louis C [00:55:43]: Like once Kubernetes is set up, it's fantastic.

Nathan [00:55:45]: I love it.

Louis C [00:55:45]: But like getting to the point of having it all set up is a very painful experience. But is it still a failure if it asks you a question? And how do we actually do evaluation where models can ask questions and ask for more information?

Nathan [00:56:01]: Yeah, this is like the, this is, I have like similar follow ups on eval from our first part. So it's like eval P2 in my notes. So it's like the right way to think about RLHF eval in a lot of ways is what we call like open-ended evaluation. And this is where you're heading as like we need to have even more open-ended evaluation, which is a model and should be able to ask questions. The number of turns should be dynamic. I think Sergey Levin actually has some of the most coherent thoughts on like what are the long term of RLHF should be, which is around like outcome based learning and like which is you can have as many turns as you want, but it should be able to work across these conversations to get to a desired outcome, which I mean, no surprise, he's so good. I think even with like alpaca eval, I think we went from this case where alpaca eval, like all the good models are above 90%. And then they went from DaVinci to GPT-4. And this would just be venting, but I was just like, if you're listening, can you please add an alpaca eval 1.5, which is comparing the models to GPT-3.5 rather than DaVinci and rather than GPT-4 turbo, because I think most of the models just can't realistically beat GPT-4 turbo. Like it's such a good model. The models that we have seen beating it are like this snorkel thing, which I'm working on another blog post on like how RLHF works part 2, which like a large point of it is that we're overfitting on these eval, like vibes based things like alpaca eval 2 and all of these papers on like self-rewarding DPO and stuff are probably a lot of overfitting onto this. Because this is the evaluation that they use and it's just wrapping a loop around DPO on synthetic data where it's, it's, it seems like RLHF is really, really good at style matching. And in the case of alpaca eval, if you're style matching open AI, you're going to win more like alpaca eval turns, but there's just so little measurement on if the model's getting better.

Louis C [00:57:51]: I've always been extremely skeptical of the self-instruction like self-reward papers. And I say that, and I know a lot of the self-instruct authors, and if you guys are watching this, I'm so sorry. But I, it always felt like it improves results on benchmarks that they meticulously craft prompts for and construct data for. But it doesn't.

Nathan [00:58:17]: Do you mean the self-instruct paper? Like, I think that's like the one of the OG IMT papers. Okay, continue. I'm curious to hear what you have to say. Yeah, no, no.

Louis C [00:58:24]: I mean, I think they both kind of just suffer from the same issue, which is like massive overfitting. And like, you know, it is very, the self-instruct direction, self-reward directions are very, very interesting because they're just waiting for us to get better heuristics

Nathan [00:58:46]: and better diversity and stuff.

Louis C [00:58:48]: And they'll like crush everything.

Nathan [00:58:49]: I mean, I bet Jason Wetson, who wrote the meta paper that was self-rewarding language models, the popular one, I bet he would say this, like, that guy's super good. No, absolutely.

Louis C [00:58:57]: I mean, I would be very inclined to agree.

Nathan [00:59:00]: I think the thing that take away from my perspective is how much actually improvement you could get with it. Like, they got a lot, they were, that was the first paper to show real signal on AlpacaVal2, which is a GPV4 turbo thing, which means it's a really strong optimizer. It does not mean that we were like using it to train useful models. This is probably the most useful heuristic I have for early Jeff methods, which, do you have anything else to say about evals before we continue?

Louis C [00:59:25]: They're very hard and they're very painful.

Nathan [00:59:27]: Yeah, I think we can kind of say, wrap up with that. But when we talk about different early Jeff methods that come out, like self-rewarding language models is a popular one. We've gone through the whole PPO, DPO, KTO, IPO. Well, I'm like rhyming, it's like going to be a mess here. But when you have all of these things, the biggest thing that I try to do is wait until there's a model that's actually used for people released by this. And like Zephyr from Hugging Face was a model that really kicked off the DPO thing because there was finally a model. And for DPO, it took me much longer than expected. DPO is a funny case. But that's kind of like the important filtering mechanism, which is if this self-rewarding LM paper release their models, I bet we would find that there's really weird behavior where it can give you like the best answer ever. But a lot of the times it's just less robust, which is something we could fix. But that's why like having models released in these fine tuning papers is just so important. It's so hard to get around.

Louis C [01:00:20]: I think with DPO, it was a little bit different because everyone had been like, you know, like drinking the John Schulman Gatorade, for lack of a better phrase, for a while.

Nathan [01:00:32]: The whole PPO thing is funny. I mean, yeah, you have a lot of things. We have a backlog in this podcast. I think I didn't say this online, but it's like I could see us doing this like whenever we're in the same city. There's a catch up on the four months of RLHF news, but we're on like 16 months of Lewis takes to catch up on. So there's so many things we have to cover. I can load up Signal and Discord and I could probably scroll for like 10 minutes. It would just be all RLHF hot takes. And I love John Schulman's work.

Louis C [01:01:03]: I'm not going to say that I don't love his work. I think that he's genuinely like one of the smartest people, if not the smartest person.

Nathan [01:01:11]: And extremely genuine. Yeah. Like he's awesome in so many ways.

Louis C [01:01:15]: The commitment that OpenAI had and Anthropic as well, when a bunch of the RL people left OpenAI to go to Anthropic on PPO because it worked so well for robotics and so well for like games and stuff like that. But like, honestly, not well at all for text.

Nathan [01:01:33]: I think it's just really hard. I think it can work really well. It can work. They just hired everyone and they pay them so much that they're not going to leave.

Louis C [01:01:40]: Yeah, it can work really, really, really, really well. And like the I'm going to spill some secrets about this. And really the answer to get PPO to work really well is have really, really good early stopping. Right. And like that's like the main differentiator between a good RLHF library and a bad RLHF library that focuses on PPO is that if you don't have good early stopping, you're kind of shooting yourself in the foot. And what you want to do is like launch as many runs as you can. And there's like a paper that Costa and I talked about a while ago, Costa Hong, that's like you can tell within the first like three or four gradient steps if you need to kill a run usually. And if you just launch 300 runs and you kill like 99 percent of them, you know, now you have three good runs that might give you promising results. And those three good runs, you'll get a model within a day or two and hopefully the model is really good.

Louis C [01:02:41]: And like early stopping is way more powerful than people admit. And like I am just convinced that opening eyes RLHF infrastructure is just an insane amount of like regularization and early stopping for RLHF. I mean, that, of course, assumes that they're still using PPO. I genuinely don't know if they are.

Nathan [01:03:04]: Yeah, we don't know anything. They are really shielded on this run.

Louis C [01:03:07]: What was the, oh my God, Symphony PPO, PPO Symphony or something? There was something that came out about that that I saw on like Discord servers where like it was part of the GPT-4 leak and there was a bunch of notes on like their PPO optimizer. And it was it was a PPO Symphony or something like that. And like under the note, it was like PPO was like better early stopping and infrastructure management for like auto scaling. And I'm like, not surprising.

Nathan [01:03:41]: It's like, I mean, it doesn't say much, but it just kind of says, they've done so much exploration, you know, for the little things to see. Like once you have this working, you know, like, OK, this little value functions doing wacky s**t with the it's like the value function and the KL at the same time doing this means like, OK, we probably don't need to do this. Like don't need this run. Whereas like all of us in the open are just trying to get to that point. We're trying to get to that point while charging ahead where it's kind of separate problems. If we want to validate a PPO infrastructure, you need the investment to the compute in the time to do this. But like, we're not going to do this at the same time as if you're trying to say DPO is the best thing or trying to figure out if KTO is the best thing. Like there's not room in the narrative really for it.

Louis C [01:04:25]: PPO just doesn't make sense for like random hackers to do work on, honestly, like the level of infrastructure that you need to do PPO really, really well is not something that the average person has and the average person is willing to make the investment to get. And for the average person, you know, DPO, which gets you like most of the way there with like a small fracture of the compute, even less if you are hyper parameters. Yeah. Even less if you like precompute all the logics, you don't even need to have a reference model loaded. Right. So like it's basically the same computer is just fine tuning. Like people fine tune all the time on like 4090s, 3090s.

Nathan [01:05:04]: Yeah, you can do it with Hugging Face. It's fine. It's like PPO with Hugging Face is going to be a lot harder. Like, that's just kind of how it goes. Speculative question. What type of thing do you think will make KTO kind of show up on the scene? Because I think like this KTO method from Contextual and Stanford, it's named after the authors of Thinking Fast and Slow or something. Like what is it? I can't pronounce their names, like Kversky something like you will put it somewhere. I don't know how to pronounce it, but it's this paper where they essentially did you can work preference optimization from a scalar signal. So like the thumbs up that you could give to your chat GPT of like you did good, like a like button, like button on YouTube or anything like this. I think the formulation is like, is the are the DPO hackers going to adjust to this and like what data set is going to enable this? Like who is going to be using this? Is it just going to happen at a bunch of startups with products behind the scenes that they could get a few percentage points on top of their model by adding this on? Or is it going to be this thing where like the next effort model from Hugging Face uses this as well?

Louis C [01:06:05]: Yeah. So Colin and I, the first author of the KTO paper, are actually trying to create a number of data sets where we can explore the limits of KTO. And, you know, right now we're in the proposal writing stage and I'm very, very hopeful that we can have something that can be done in an entirely open science setting relatively soon. And I think it's incredible. Sorry, I moved to the side. Stop picking my voice. I think it's incredibly exciting.

Louis C [01:06:41]: You know, things like, you know, like fake product data where you can actually experiment and like the idea of like using KTO for conversions. Right. And how do you actually evaluate?

Nathan [01:06:52]: Meta is maybe already using it because people already use it then.

Louis C [01:06:56]: Yeah. Like how do you how do you even evaluate ROHF from a binary signal? It's like ROHF from a preference signal. Like we still don't know how to evaluate that. And ROHF from a binary signal creates so many, so many, so many, so many unique problems for evaluation that like I genuinely don't think maybe anyone outside of like contextual and like Colin and I have really been thinking about yet.

Nathan [01:07:26]: Yeah. It seems like the same thing. It just takes time for these ideas that are like to kind of cultivate and then get traction in a few places and then model. Once there's a popular model with a method, it's like it's like fire just blows up. Like this is like everyone's using DPO now, but DPO paper came out in July and it wasn't until September that that happened. It's like for the investment, the interest. It's like there's a lot of weird dynamics and how like this fine tuning area unfolds, which is just like how AI unfolds. It's like a very weird. And when you zoom in, it's like, huh.

Louis C [01:08:01]: I was extremely, extremely bullish on offline RL for the longest time with like ILQL and some of Sergei's work in that direction. And I actually think that I keep moving to the side and it's like,

Nathan [01:08:16]: you can just move the microphone. And I keep like I could still hear you. So I wasn't very concerned about it.

Louis C [01:08:22]: I keep thinking that the DPO movement that that's going on now is like super, super similar to why everyone was getting excited about ILQL for back in the day. And really, it was just a timing thing. If ILQL had come out, like let's say a week after ChatGPT came out, ILQL would have been the DPO that everyone uses. And we would have created all of our infrastructure around ILQL rather than DPO because I still am, I really like Q-Value based functions, Q-Value based approaches.

Nathan [01:08:58]: Such a nerdy thing. I love it. I know.

Louis C [01:09:00]: But like Q-Value just makes sense to me. And the way that like when you train an ILQL model, you basically get like a head that controls the model, almost like how like if you're familiar with Jedi or like PPLM from like the Uber AI days, how those control them. Well, the idea with like Jedi is that they had like a head that attached to the language model and you would like input like a subreddit and then it would adjust the logits so that it would talk like it was a subreddit.

Nathan [01:09:32]: This sounds like activation learning or like activation, I don't know the word, but essentially you can use like it's like in context learning, but you can just modify the activations directly. Yeah, yeah.

Louis C [01:09:44]: But it modifies the logits. Yeah. But it was the same thing with ILQL. It's like you were learning that kind of head to modify the logits to like, you know, satisfy some constraint that you were adding. And that head also was like implicitly computing your Q values and like you would train it via like, you know, telling you like what your reward was for like various utterances and it would do everything from there on out. And like there were some stability issues with it and it was it was a fantastic approach. And if it got the same attention that DPO did, I definitely think, well, TPO is very, very simple, which is like part of the benefit. ILQL is not as simple, but it would have it would have caught on a lot more than it actually ended up doing. I feel like at Carper AI, the reason, like the fact that we integrated ILQL into TRLX first was like the main reason that ILQL caught on, plus a few of Sergei's papers that used it, like besides the integration into TRLX, I don't think anyone in the broader open science, open source community was really using ILQL.

Nathan [01:10:56]: Yeah, I mean, this is one of the questions I had is like, if you can say is was how far ahead in RLHF was what Carper was doing and like what kind of institutionalized knowledge did you have there? Because you were essentially Carper AI was it was it wasn't it was its own thing. And then it got stability, pulled you in probably with the promise of compute. I'll say things so you don't have to say anything for lots of this. And then they were they had forked HuggingFace's TRL library before it was like HuggingFace wasn't maintaining it at this time. And they had a lot of and probably had like five plus full time employees doing RLHF in the open and for private industry, obviously, the private stuff, they're not even gonna bother asking because it's all that stuff's all under NDA. But it's like, what were the problems you were working on at Carper? And how does that compare to like the things that people are talking about now? Is it is it still related or is the field just moved into a different area?

Louis C [01:11:56]: So most of the problems we faced at Carper with TRLX was on scaling PPO, right? And I think almost anyone you talk to who has scaled PPO in the open source space. And when I say scale, I mean like way beyond 20 billion parameters. I'm talking like 70 to 100 billion.

Nathan [01:12:19]: How many nodes do you need to train a 70 billion parameter model?

Louis C [01:12:23]: So we were typically doing like 100 GPUs for PPO at that scale.

Nathan [01:12:28]: Like 10 to 12 nodes. Yeah. Yeah.

Louis C [01:12:31]: We mostly tested with like the NEMO checkpoints that were like 100 billion parameters. TRLX was built, at least for that component, built on top of a very modified version of like Megatron DeepSpeed. But like the amounts of like regularization and like random tricks that you needed to do in order to get PPO to even like work at that scale is insane. Like we had to do like separate warm ups for the value function. Right. So we had to like independently train the value function before we trained the policy network. And like everyone and their mom was was talking about like having separate value networks versus policy networks for PPO.

Nathan [01:13:18]: Did you ever try JAX? Do you have TPUs at Starbuck Carper ever?

Louis C [01:13:25]: We did towards the end.

Nathan [01:13:27]: Because it could solve some of the multi-node thing.

Louis C [01:13:29]: Yeah. It wasn't the multi-node that was the issue. It was.

Nathan [01:13:35]: You're saying DeepSpeed wasn't the issue?

Louis C [01:13:37]: No. It was actually the fact that the inference server that TRLX uses for the rollouts was entirely different than the inference server that Megatron wanted us to use. So we needed a way to rapidly.

Nathan [01:13:57]: That's why PPO is really hard to scale because you have to have a generation engine and you want the stall to be flexible.

Louis C [01:14:02]: Yeah. So we needed a way to dynamically keep our compute graph for over through the network. But like just copy the weights like in place to like Trident. And I don't think that we ever came up with a solution to do that very effectively. And I think it actually goes a step further. I don't think the Nemo line was like what NVIDIA did. I don't think Nemo line came up with a solution for that either.

Nathan [01:14:25]: Yeah. This is interesting because I'm not going to say the details on the pod because not allowed. But like Anthropic and these places that have custom RLHF infrastructure have essentially like built their distributed training infrastructure with the idea that the model will need to be generated from at different checkpoints and the model will be served to different endpoints at different checkpoints. So it's just very different than taking DeepSpeed off itself, which is like this is just about training. Well, it's like these other companies that do this stuff really well have infrastructure for like handling these really messed up cases of like how to generate and update these models.

Louis C [01:15:00]: Yeah. And most approaches that like a reasonable person would build off the shelf like would rely on Torch.compile and you still have the same issue. Like your weights are changing dynamically. It's very, very hard to really even like understand like all of like the little like technical details in Torch.compile to have to be accounted for to even make this work. Right. And like, you know, something that we considered at the time was. We need to do like an insane amount of rollouts for every gradient step, and we don't want that interface between the rollouts and the training to be Python. We want it to be like Rust or something because like otherwise the CPU overhead is like mind boggling. It was like 80 percent or something crazy. It was like 80 percent of the entire processing time was just CPU stuff and like.

Nathan [01:15:53]: Not so much. I know.

Louis C [01:15:55]: I know. And like there's so many different infrastructure constraints that people don't realize when they're just doing like 20 billion parameter PPO. Right. What the other one I was going back to, like the value function being separate from the policy network. TRL was very, very gung ho on like keeping them separate. I think RL for LLMs also wanted to keep them separate. And then there was someone from Cornell. I don't remember his name. He was also in the RL for LLMs paper. He did a paper like PPO plus or something. I don't remember what it was. I mean, all these things are interesting.

Nathan [01:16:30]: I mean, there's new libraries coming out still. So it's like I saw one recently that was called OpenRLHF. And like it looks good. I think that it's like there's so much institutional like breaking the bonds of past RL that needs to happen. So like part of this library is like listing that they have the implementation details from like their original and implementation details of PPO paper where it's like we've already updated like cost has worked on the end implementation details of RLHF paper, which is like the ones that they actually need. But it's like there's so much like baggage by the fact that PPO came out of this control field that everyone expects the tricks that you need for from scratch learning from PPO to apply to this fine tuning method. And just like even getting the people to stop using PPO for that and like DPO is a new thing. Like DPO is something that only is works for preference alignment. People are going to explore in a scientific way that's much fresher. They're probably going to make more scientific progress because there's not this kind of confusion of like what do like what implementation details do we need? Yeah, for sure. For sure.

Louis C [01:17:34]: I think then the end technical details of RLHF, did that come out?

Nathan [01:17:39]: Yeah, it's a blog post. It's a blog post. When? Maybe a month ago.

Louis C [01:17:45]: Oh, man, I totally missed that. Oh, that's so cool. I'm going to read that.

Nathan [01:17:48]: Yeah, I mean, this is for anyone still listening. If you want to know the actual details of RLHF, like go look at all the stuff that Costa Hoang has been doing on your base. Like I was just like reproducing everything and in explicit detail. I feel like both of us would benefit from rereading it. So it's like there's there's some free content to spend.

Louis C [01:18:06]: Costa is like one of the most meticulous, very attention focused person that I know in the RLHF space. Like if Costa says something works, it's because he's like tried it from every other angle and then tried it from angles that like you didn't even expect. And all of them work.

Nathan [01:18:21]: Yeah. Yeah, that's great. I think I have a couple like fun, more fun questions while we wrap up. We can we could go on with all these technical things forever. What was it like to work at Carper when ChatGPT came out? Because ChatGPT from a technical perspective is RLHF is validated as something that is necessary to the future of language models. And you were one of the few people that were working on RLHF beforehand, which is a huge it's like how you end up here. This is awesome that you ride that kind of journey. It's like what is what was that like?

Louis C [01:18:57]: I mean, I the star count on the repository exploded. I think we went from like.

Nathan [01:19:07]: TRLX existed.

Louis C [01:19:08]: Yeah, it was just insane. It was it was.

Nathan [01:19:14]: We almost weren't.

Louis C [01:19:16]: Positioned. I guess I could be fully honest, we almost weren't positioned to entirely ride the hype train. TRLX was always designed from the very, very beginning to be like a one stop shop for enterprises to do RLHF like companies that had like a thousand GPUs and they already have an engineering team and they just don't want they just they already use like Megatron DeepSpeed or they already use DeepSpeed and they just want something that works on their infrastructure. And because we use like Docker images that like we're just based off of the DeepSpeed, the Megatron DeepSpeed Docker images anyways. Right. So like those kinds of companies could very, very easily deploy TRLX and utilize it in their stack. Right. Yeah. And the hype that came from chat GPT, at least initially, was not enterprises. It was like bloggers. It was like writing a blog post.

Nathan [01:20:09]: You were you were probably like training big models and I'm like, hey, how does RLHF work? I need to write this blog post.

Louis C [01:20:14]: Yeah. I'm like, I'm like you're training like a 40 billion parameter in their model. And they're like, hey, can you help me train this like 400 million parameter guy? And I'm like, what? I'm so busy.

Nathan [01:20:24]: So it's primarily a scaling thing. I think is there like. Were there any cultural things that you think like being early? Like were you bought into RLHF to the same extent ahead of time? Like what got you into RLHF? Like what what motivated Carper to exist? And did this kind of consistent?

Louis C [01:20:45]: So I've always been very, very bullish on critiques and revisions in general. So I wrote the first the first or the second one. I don't I don't actually remember if the super alignment team at OpenAI wrote a paper before me. They may have, but I don't think so. I think ours came out like a month before it. That always feels good. I wrote one of the first papers on like critiques and revisions. Right. And I was very, very bullish on that. But initially I was only bullish on it for evaluation. Right. And I had experimented with PPO a little bit back in 2021 for like this kind of critique and revision stuff. And it was not ready whatsoever. And there was no infrastructure and TRL was an abandoned library that was very buggy. It didn't work. No, no, no shade to Leandro. I love Leandro. But like it was it was obvious it was it was a depreciated library. Like it happens. Yeah. And I think when we tried to do RLHF then, like there was no traction whatsoever. So Alex Havrilla and I, I think he's working with Meta now. I don't remember. Yeah. He was an intern at least.

Nathan [01:22:02]: He just had an interesting paper on like reasoning and math, which is a whole other conversation for RLHF stuff.

Louis C [01:22:08]: Yeah. So we started, we forked TRL and we just added DeepSpeed support. That's all we wanted to do initially. And then we were going to merge back to TRL because we had no visions of like Carper or anything like that. And we realized to make a framework that people would actually want to use, we had to do a full rewrite of TRL and we had to build things in a way that made sense to an engineer who wanted to deploy RLHF, who wanted to experiment with RLHF at a company or in a lab. Because we were building this from the perspective of, well, we're on the Eleuther AI GPU cluster. How can we best use our infrastructure there to...

Nathan [01:22:50]: Has anyone publicly said how many GPUs Eleuther has? This is like one of my great mysteries. Is this like a held secret? I don't think it's a held secret.

Louis C [01:22:58]: I don't remember actually. They have some stability GPUs and they have GPUs from elsewhere. Like they seem to get compute when they need it. Yeah. Yeah.

Nathan [01:23:11]: Like it's not like, it's not an issue.

Louis C [01:23:14]: Through Synth Labs, I've been supplying a bit of compute here and there as well. I gave them like a note of like H100s for like a little while for a paper that we were working on with the Pink Elephant paper. But I don't think that like, they're not like super short of compute. They're a little short, probably. Like everyone's a little short of compute. Yeah. But I don't think they're super short of compute.

Nathan [01:23:36]: Yeah.

Louis C [01:23:36]: So we built it with the Eleuther cluster in mind. And because we built it with the Eleuther cluster in mind, we were able to build it because we built it with the Eleuther cluster in mind. You know, we kind of said, well, we can kind of turn this into a thing where like we build the infrastructure that like people can like readily deploy on their clusters and it'll just work for them. And like we can make Carper AI. So we made Carper AI. And shortly after like, you know, all the stability stuff started happening, Carper joined stability. And we worked, I worked there for a while. And last summer I left to join back with Eleuther because, you know, I long for the days of being an engineer. I love waking up in the morning, writing code, eating a little bit and then going to sleep.

Nathan [01:24:22]: Yeah. I mean, that's the difference. I spend the time writing because I like to. We've had plenty of discussions where like, oh, I should start a blog. And it's like, it comes down to doing what you like to do. And it's like, you're doing great as it is. Yeah. It's okay. Yeah. Okay. I think that's kind of a good place to stop. Where should people find you? What do you want to boost? Yeah. Sign off here.

Louis C [01:24:44]: So my Twitter is lcastricato. I, or you can follow the Synth Labs Twitter. It is, let me actually, I don't remember what it is off the top of my head.

Nathan [01:24:55]: You have any goose announcements?

Louis C [01:24:58]: No goose announcements at the moment, unfortunately. It's synth underscore labs on Twitter is that Twitter account. And then El Castricado is my personal Twitter account. You know, I'm always open to collaborators, especially now with Synth Labs. So we're always happy to chat with and talk to new people about interesting research directions. And yeah, just reach out and we can get something going, I guess.

Nathan [01:25:23]: Yeah. I love the URL in the show notes. It's synthlabs.ai. I found that it's because synthetic data is so hot and it's so new. It's like some of these URLs are just hard to find. It's like, we don't have to go into the whole rant about naming and stuff, but it's like most of the people that search for mysubstackle, if you don't put the S, if you don't write interconnects, you get a different substack first. So it's like, okay, we're all in this together for anyone founding a startup or a blog and struggling with naming. Please send us questions about RLHF. If you liked this, Louis could come back. I'm trying to start an in-person thing and get some gear. So when I'm at a conference or whatever, we can bring researchers on and kind of remove some of the Zoom aspects that we're all stuck in so much of the time. Thanks, Louis, for putting some of the things we talked about a lot onto the semi-record. People listen and read. This is good. I think a lot of researchers are going to dig into this. There's so many different things that we talked about. It was a very high information density chat here, but it was a good time.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

How to cultivate a high-signal AI feed Feb 28, 2024

Basic tips on how to assess inbound ML content and cultivate your news feed.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/making-a-ml-feed

00:00 How I assess all these AI releases
01:22 1. Model access and demos are king of credibility
02:31 2. Focus your feed on depth or breadth
03:09 3. Examples of using the model normally show its usable, shockingly
04:10 4. Leaderboards as the single leading claim is often anti-signal
05:00 5. Basic deep learning conceptual checks will often save you
06:13 6. If it's not even remotely reproducible or verifiable, it's not science
07:10 7. Don't over-index on Twitter
08:32 8. Data sharing, licenses, communication clarity, and small things add up
08:58 9. Research papers, technical reports, blog posts, and Tweets all serve different purposes
09:49 10. Socialize your information and build relationships

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Google ships it: Gemma open LLMs and Gemini backlash Feb 22, 2024

Google rejoins the open model party and gets some backlash for a frequent problem for generative AI.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gemma-google-ships-it

00:00 Google ships it: Gemma open LLMs and Gemini backlash
03:12 Getting to know Gemma
07:11 Alignment details
08:55 Aside: What is REINFORCE? Some history of RL
11:08 Implementation details and RLHF
12:18 Terms of use: RAIL Licenses history repeated
14:05 Is Google back on top? Gemini's woes

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_008.webp
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_014.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_035.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_055.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more Feb 20, 2024

10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-follow-up

00:00 10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
00:46 1. Deepfake detection of Sora
01:59 2. Playing with long-context, problem settings, and prompting
03:39 3. Gemini paper snooping: contamination and citation games
05:42 4. Training data and token estimates of YouTube
07:42 5. Unlocking model-based RL and downstream research
08:52 6. Midjourney style matching, V-JEPA, replicating Sora in the open
10:09 7. Architectures and academic links
10:57 8. Pixel peeping from the arts
11:58 9. Inference costs
13:24 10. Pressure on Llama and Mistral
14:03 11. Sound effects, physics, and the complete picture

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Releases! OpenAI’s Sora for video, Gemini 1.5's infinite context, and a secret Mistral model Feb 16, 2024

Emergency blog! Three things you need to know from the ML world that arrived yesterday.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-and-mistral-next

0:00 OpenAI's Sora for video, Gemini 1.5, and a secret Mistral model
0:53 Sora: OpenAI's text-to-video model
4:59 Gemini 1.5: Google's effectively infinite context length
8:01 Mistral-next: Another funny release method

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_015.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_023.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_026.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_036.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Why reward models are still key to understanding alignment Feb 14, 2024

In an era dominated by direct preference optimization and LLMasajudge, why do we still need a model to output only a scalar reward?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?

Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_009.png

0:00 Why reward models are still key to understanding alignment

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Alignment-as-a-Service: Scale AI vs. the new guys Feb 07, 2024

Scale's making over $750 million per year selling data for RLHF, who's coming to take it?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/alignment-as-a-service

00:00 Alignment-as-a-Service upstarts taking on Scale AI
04:25 The competition with humans-in-the-loop
06:05 Scaling Alignment-as-a-Service via AI feedback

Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/aaas/img_008.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Open Language Models (OLMos) and the LLM landscape Feb 01, 2024

A small model at the beginning of big changes.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmo

0:00 Open Language Models (OLMos) and the LLM landscape
6:24 Thought experiments
7:51 The LLM landscape heading into 2024

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo/img_010.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Model merging lessons in The Waifu Research Department Jan 29, 2024

Note: some of the audio in the second half is a little wonky, but the general voice was upgraded so hopefully it's a little less "poppy" until then!
I'm trying to fix little pronunciation problems on a weekly basis. Thanks to my early fans! It'll keep improving. E.g. some of the months were wonky.

When what seems like pure LLM black magic is actually supported by the literature.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/model-merging

00:00 Model merging lessons in The Waifu Research Department
02:21 How and why does model merging work?
07:13 Aside: merging vs. ensembles vs. mixture of experts
08:21 Why are people doing this?
11:22 Tools & Links
11:51 Brief (visual) literature review
12:07 Full model merging and recent methods
15:55 Weight averaging during pretraining
17:18 LoRA merging
17:53 More background

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Local LLMs, some facts some fiction Jan 24, 2024

Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
The deployment path that'll break through in 2024. Plus, checking in on strategies across Big Tech and AI leaders.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/local-llms

0:00 Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
4:15 The personalization myth
7:13 Meta's local AGI and moats X factors

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/local-llms/img_026.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Multimodal blogging: My AI tools to expand your audience Jan 17, 2024

A fun demo on how generative AI can transform content creation, and tools for my fellow writers on Substack!
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-blogging-tools

0:00 Multimodal blogging tools
2:57 Stratechery, passport, and wonderful customer experiences
5:51 Wrap-up, features, and next steps

Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_006.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_008.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_012.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_020.png

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions Jan 10, 2024

A sampling of recent happenings in the multimodal space. Be sure to expect more this year.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-rlhf

00:00 Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions
02:46 Unified IO 2: Scaling multi-input, multi-output model pretraining
07:47 Collecting preference data for images
09:31 LLaVA-RLHF: The first experiments in multimodal RLHF fine-tuning
13:20 Multimodal RLHF questions, ideas, and resources

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Where 2024’s “open GPT4” can’t match OpenAI’s Jan 05, 2024

And why the comparisons don't really matter. Repeated patterns in the race for reproducing ChatGPT, another year of evaluation crises, and people who will take awesome news too far.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/open-gpt4-limitations

00:00 Where 2024's "open GPT4" can't match OpenAI's
03:19 Models vs. products
04:51 RLHF progress: Revisiting Llama 2's release and potential in 2024
08:30 Smaller scale open RLHF
10:33 Opportunities
12:24 Commentary

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interviewing Tri Dao and Michael Poli of Together AI on the future of LLM architectures Dec 21, 2023

This interview is on YouTube and podcast players.

Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies. I’m excited to get to talk with Michael Poli (Stanford PhD student + research at Together AI) and Tri Dao (incoming professor at Princeton + Chief Scientist at Together AI). This builds on the mega-post from yesterday on the same topics, though the interview is obviously less math heavy:

Interconnects is a reader-supported publication. Consider becoming a subscriber.

Introductions

[00:00:00] Nathan Lambert: Okay. Hey, everyone. Welcome to the first interview that we're going to post on interconnects. I'm really trying to bring more scientific voices into the AI discourse as media is covering a lot these days. I'm happy to be here with Michael Poli and Tri Dao, experts in some of these non attention architectures that have been really blowing up in the last few weeks of December.

So, Michael, do you want to introduce yourself first?

[00:00:25] Michael Poli: Sure. Thank you. Thank you, Nathan. For inviting me, I, do research at Together AI. And I was also a PhD student at Stanford, working with Stefano Ermon and, and, Chris Re, that's, that's how I met Tri as well. if I had to choose maybe, I moved to a few different areas of research.

if I had to choose one, I like to, do research at the intersection of signal processing, dynamical systems, and deep learning, and most recently, luckily, there's been more interest in, in kind of efficient architectures that use some of these principles. to improve scaling, along different axes and to, to get sort of new, new trade offs at inference time.

[00:01:13] Nathan Lambert: Great. And Tri?

[00:01:16] Tri Dao: Everyone, thanks Nathan for, for, hosting us. really excited to be here. I'm Tri. I, just finished my PhD at Stanford. and I'm being assistant professor at Princeton, and right now I'm chief scientist at Together AI. it's, it's a startup working on AI infrastructure. And, yeah, I've been working at the intersection of machine learning and systems, so designing algorithms that take advantage of the hardware that, that they run on.

I'm interested in, long range, dependencies, how to encode that into a model, designing architectures that can, can, learn long range dependencies. yeah, really excited to be here.

Why Attention works and may not scale

[00:02:01] Nathan Lambert: Okay. I think I'm going to, I have some questions dive right into this. I think you two will kind of both answer them or someone can answer longer, whatever you want.

I think to start with, we should talk about maybe why attention works and what the limitations of attention are. I think. Almost every person in tech broadly now knows that a transformer is a model built with attention and chat GPT does that but like, why is this so good, even like how much of a transformer is built with attention are there other things going on, and what might be challenges there.

[00:02:35] Tri Dao: Right. so, transformer which is this. Contexture that powers most of the exciting applications that we're seeing nowadays, as you mentioned, and so on. attention is kind of the core layer there, and attention actually became, earlier, around 2014, 2015, and then transformer came out, incorporating that, focusing a lot on, on, attention, with these, MLPs, interleaving, MLP and, and attention.

And I think the success largely has been, They are, they seem to be able to scale really well so that you can scale up the models, with more, more parameters, with more data. And that has been the recipe for, for success. It sounds obvious now, but I think, five years ago that wasn't, that wasn't clear.

so it seems like, you know, Transformer Architecture is, is a hugely successful one. and, you know, a couple of reasons why it's successful. I think it's like General enough that it's able to learn a lot from data. And two is, is pretty friendly to hardware. You can, there's no kind of sequential dependency like previous RNNs.

so as a result, you can run it well on GPUs, TPUs. you can scale it up. It's very hardware efficient. I've personally have worked on making it more hardware efficient as well. So it's just kind of the recipe for, for success, which is, general architecture that scales well. if you're an NLP person, maybe you, you, you said, you know, incorporate some kind of inductive bias for, for, to protect, personally, I think it's a very general architecture that, that scales well and it's hardware friendly.

[00:04:16] Nathan Lambert: Yeah. Yeah. It's, it's remarkable how it seems so obvious now and it's like. I think one of the things that studying this work is the context length becomes a really interesting access to study alternatives to it. And essentially it's, I think, I mean, Michael, do you want to talk about that? I could, I could babble, but you're, you're no more sure.

[00:04:39] Michael Poli: yeah, the there are several points. I'll start by saying that, you know, there's still a lot of great research trying to understand why from first principles. Why is it that the transformer can learn these interesting circuits? people kind of study, they, they pick apart the computation, like combination, different, [00:05:00] heads and transformers and so on.

so there's work on basically understanding transformers from kind of like a programming language that is encoded. But I think, as, as Trey mentioned, there's, there are some very, Very, very interesting design choices that have gone into the transformer. This interleaving of attention on MLP is quite important.

and also the transformer is essentially, was successful in the beginning 'cause it encoded these, techniques that, that, that have been developed for, RNN Celest. So these other, you know, classical NLP models, gating to, modulate, which information is absorbed into the model. Gating to determine how quickly something is forgotten in this this occurrence of get an end into a parallel form.

It is, you know, easily, a bunch of gems that can be easily, well, not very easily, but can be optimized on GPUs.

Quadratic scaling in attention

[00:06:01] Nathan Lambert: Yeah, that's, that's all great. I think that, I guess the specific thing that I had in mind is how attention ends up being this kind of quadratic, scaling in terms of cost when you have an input sequence that you have, if you have an input sequence of length L and you want to output a sequence of length L essentially.

If you zoom into the math and you look at what's happening at inference in most of these libraries, you have this like upper triangular attention matrix where you say, like, you can only look at the past entries of your text. And as you go through there, then you end up getting this a long, you get this L squared relationship where the first token, you can only look at one, and then you end up looking at more tokens for each, past and Now we've been talking about recurrent neural networks and how does something that isn't attention like get around the fact that you want to look at all of the history of the text in your sequence.

So like if you write a long prompt to chat GPT, you really want all that information to be encoded and how could doing something other than this dense attention matrix. Like actually make that possible.

[00:07:08] Tri Dao: Yeah, so you can go ahead and, you know, before attention, there was RNNs, right? Like a minute RNN's like they process text was fine. and maybe they didn't scale as well, but yeah. you say briefly texts by encoding texts.

[00:07:22] Nathan Lambert: Can you say briefly what a RNN is and how it works?

[00:07:24] Tri Dao: Yeah, so these are recurrent neural nets, that go back, I think, to the 80s.

maybe some of the more famous ones are LSTMs, GRU. so they were pretty popular in, around 2012 to 2016 or so. they were kind of state of the art for translation, speech recognition. a bunch of, I think NLP, like, they, they were a state of the art and, and they processed text kind of sequentially.

they are just, they see essentially one token and then that. Changes the hidden state and then they will update the hidden state and every time they see a new token. So, I think it's kind of, in some sense, mimicking, how, for example, human brain process information, like you read, you, you read a sentence or a passage and, you know, it's, it's maybe it's like you're storing some information in your brain.

By the time you've finish reading a document, maybe you can answer questions about that documents without having to read to, to refer to that document again. So, RNs, kind of work that way. They, they, they, they process the, the, the texts. and then that changes the hidden state and their hidden state is the representation that can be used to either generate new tokens or, classify the documents or, or, or whatnot.

so these work well back in 2016 or so. But, they've kind of fallen out of, favor, empirically, they don't do as well as, as Transformer, I think, and as you, you touched on Transformer, because of this kind of quadratic scaling, and you compare every token with every other token that comes before it, it gives you this very kind of easy way to, to propagate information.

and, I think that's part of the reason why, why, transformer and attention does really well. but there's been more, more recently, some of the new, newer RNN architectures that. Seem to do pretty well. So, RWKV is, I think is one of the earlier ones, you know, is one. I, I really admire that, that that project, you know, his effort mostly from, from, from one person really going against the, orthodoxy of, of transformer.

Who, who was it showing that Rrn can be pretty strong. Who was the lead on that? I think it was this person, Bo Peng, I think. and, you know, it's, it's, it's an entire project, but I think it was pioneered by Bo Peng. I think it's, affiliated with Alutha the compute sponsor by Stability and so on.

[00:10:12] Nathan Lambert: Yeah. I was reading this earlier. At a technical level, they tried to replicate something like the query key. Value lookup of attention with two linear RNNs to essentially be able to remove the like specific attention scaling problem, potential problems, and with two RNNs, which have this better, like long context behavior and different implementation rules.

I think, and they also, the paper trained up to 14 billion parameters, which kind of leads into the, some of the next questions I was going to ask, I was going to ask Tari about, Mamba and then Michael about Striped Hyena. I think you could go in either order. I think these came out about a week apart and were these two language models kind of seen as being.

What is Striped Hyena

Nathan Lambert: Way closer than anyone would expect, essentially the Striped Hyena came out and the evaluations were close to models I've been training on all year, like Lama 2 and Mistral 7b. And I went in and I went to the together API and I did like side by side of. Mistral versus Striped Hyena, and it's like, it's, it's a good language model.

It answers most questions. There's no obvious failure modes. I think maybe Michael, do you want to comment on that? I know it's another big project and then we can go back to Mamba, even though it's slightly out of order in the chronological, the release cycle that happened. sure.

[00:11:33] Michael Poli: So, I guess I'll start by saying that, there's an interesting connection between all these, these new methods.

there is this sort of convex set, which has a center and there's this connection between linear attention. So attention without the softmax, linear RNNs. And states based models, SSM. So at some level, kind of the mathematical formulation of this kind of base model here, I'm not talking about the base architecture, just the fundamental model is the same.

And then you can go in different directions. And each direction has its own tradeoffs. You can go to, the feature map, direction, the kernel direction. So when you, when you break down the softmax, you take away the softmax. You can place, on queries and keys. Kind of the fundamental, the entities that compose your attention matrix, you can compose other kernel like functions, other functions that you hope would approximate whatever capability of attention you like.

You can do things like a, like a Taylor approximation, Taylor expansion, for example, of that. And you, you, you have a slightly different perspective, but you get something that again, is very similar. You can go to Time variance. So you take the RNN and you push this input dependence. So the way the [00:13:00] computation inside the linear RNN is conditioned by the, by the input sequence, and you can have things like gates, we've seen a lot of work, for example, modernizing the inner tension with additional gates.

that allow you to make better use of your, of your fixed state dimension. And then you have the third direction, at least in my mind is the one that pushes, that uses the convolutional form that pushes more towards other types of, of linear operators that are still associative, that are, that are still, that are still allow you to, to train in parallel.

So here are things, time invariant systems. I can elaborate on any of these points, but things that can switch between convolutions and recurrence like this for a model with additional. Gates again, scraped. I, you know, was born as a, as a project, from the, in architecture, which belongs to this third category that I just mentioned.

And we're really trying to get the best per flop [00:14:00] architecture that we could. And. one principle that was validated over and over again, and we're trying to, to, to understand better now is that it seems composing hybridizing different, layers, layers, different blocks of different categories, and even full attention yields something that is better than the individual components.

So there seems to be a compositional aspect of these, of these models that we're trying to understand better. And this gives you a better sort of, pre trained model per flop. And with, with this model, we, we ran a whole suite of scaling laws and so on. Hybridizing also gives you, since we wanted something that would be kind of usable out of the box, it gives you a way easier time.

When you, when you're fine tuning for longer context, we can apply some of these techniques that have been developed for transformers and kind of surprisingly work okay for a hybrid [00:15:00] hybrids as well. So things like, linear scaling for rotary embeddings and so on, you can go into the details. So it was mostly a project trying, what is the best given the current landscape, what is the best we can do?

What is Mamba

[00:15:11] Nathan Lambert: Yeah, that's a great description of it. I mean, the sentence in the blog that's like, Striped Hyena is optimized using a set of new model grafting techniques, enabling us to change the model architecture during training, kind of felt like, to me, that there's a ton going on there. And like, some of it, you probably can't talk about, there's normal data.

So like, I don't think all the data that was quite explained, like what the longer context data was, but it's like, are you taking this from models, starting point from models that people would know? And can you say any of that? I think even just the summary that it's a synthesizing recent work into a strong model is great context for people.

[00:15:48] Michael Poli: Yeah. Well, the deadline, so we've, given this explosion of, of primitives that, you know, describe, and given sort of the, the [00:16:00] cost that it would require to evaluate all different combinations, we found ways to essentially start training. With a configuration and then continuing on with another configuration.

I think we'll have, we're going to have more work or a paper.

[00:16:16] Nathan Lambert: Yeah. There's so much cool work in that area. So one of the, someone at AI too is working on a project where they're essentially trying to cut the Lama models in half and keep training them. And things, it's just the wild west out there with people trying to take strong models and make them smaller while still getting the performance benefits of bigger models.

I think that's a whole aside, but. I wasn't expecting it to show up when people, when like you follow the social media, I've striped by, you know, people are like, Oh, state non attention models are finally good. And it's like, it covers up a lot of the details that are very interesting about it, in my opinion.

So, okay. Turn back to treat, I think. Mamba actually happened first among these, I did the, his reading back of [00:17:00] social media, and it also was very surprising to me, I think the, the largest model in the Mamba suite is 2. 8 billion parameters, if I remember correctly, and it was compared to a lot of the common benchmarks in open NLP, so things like GPT J, Pythia model suites, and the scores on the benchmarks reported were really strong, and I think if you want to start with the high level summary, and then I'm definitely going to make you talk about the awesome new CUDA kernels and stuff that you had to write for this project.

[00:17:34] Tri Dao: Yeah, so this, Mamba is a collaboration with, with Albert Gu, who's now, he was, a PhD student at, at Stanford, that's where we met, and, he's now a professor at CMU, and, also at a startup. so it was a, a wonderful collaboration, credit goes to him. Yeah, Albert has been working on this line of work called state space models, [00:18:00] in some sense, as mentioned, it connects to things like linear tension, linear RNN, convolution, neural nets, and, that's what his PhD thesis, is about.

I've also worked on, space, state space for the past couple of projects, My, my angle is how to make state space more hardware efficient and, kind of increase their expressiveness. so it's wonderful working with, with, with Albert. and there, I think is more of a proof of concept, which is, Can state space actually do as well as transformer on language? So we've, we've seen previous papers, showing state space could be better on audio, could be better on, some of the tasks on the long range arena. but, language has always been, the most difficult to get, to, to, to do well for state space models.

[00:19:00] And, language is also kind of the thing that People care about the most right now. So I was more of a proof of concept, which is, Hey, we want to show that safe space space can be competitive or maybe even meet some of the transformers out there. so we, we validate that at the scale up to three B trained to 300 B tokens.

So in absolute terms, you know, these are not very strong models. These are not yet models that you would actually. play with and expect it to do meaningful things, I think is more of a, more of an academic comparison in terms of architecture. It's like, hey, training, train for the same amount of tokens, it does as well, or maybe slightly better than some of the transformer out there.

So, and that's, in particular, it's been, very exciting to us. I think, Albert's been pushing on this for, for a while. I've been pushing on this for a while, and I think finally, it's like, It seems to, [00:20:00] to, to finally be kind of close to gap or even surpassing the transformer. and it just, just, I think it's opens up a bunch of opportunities.

so inference could be way faster. maybe we would have different ways to understand how in context learning happens, et cetera. So, lots of, lots of future work I would expect.

Mamba hardware optimization

[00:20:22] Nathan Lambert: Yeah. Can you go into some of the like, what does it actually take to implement some of these new CUDA kernels? I just remember when this paper was announced, Sasha Rush, who's also very active in the space, recommended me to talk with you too, was tweeting about the types of files or whatever.

In the paper, there's this discussion about how like the recurrent state needs to be sufficiently expressive, but doing so in a certain type of memory is a problem. Like translate what this means to like people thinking about GPUs and people thinking about these models being scaled, like, is it now? Much easier to scale these [00:21:00] models because they work on GPUs.

Which GPUs were you using? Is there a bump that could come just from going to H one hundreds or something? Any of that?

[00:21:08] Tri Dao: Yeah. so, the pre, the line of work on state space, like s four models, kind of pioneer by, by, my Alberts work. they, they c are in some sense recurrent neural network. but they have a much larger, So, the state size is whatever kind of, buffer that you're going to store information as you traverse or as you process the sequence.

In some sense, you can view transformer as doing that as well, where it's, keep the entire history is usually called the KV cache. it keeps the history and keep referring to it. for RNNs, they have a fixed size state. for transformer state, you can think of the state size is increasing. And, our intuition [00:22:00] is that, the larger the state size, the easier it is for the model to do well.

So basically, you have more space to store whatever you need to remember. And so previous models like S4 and so on, they have an implicitly pretty large state size, but they use the convolutional view to avoid having to materialize the state. So that was, that was wonderful. Michael has, has worked, behind the architecture, has used some of the same insight focusing on, on convolution.

Mamba, on the other hand, focuses on the recurrent view. So, we wanted to put more input dependency in the, the, the recurrence. we thought, you know, the thinking was that it was going to make, it more expressive and the model would do better, but that prevents us from using this convolutional view that would make things efficient.

So we had to figure out a different way to make things efficient. and, so I, I focused on making that efficient on, on, on GPUs. and so all, you [00:23:00] know, our thinking was, instead of, okay, we're gonna have a large state size, but we don't have to like ride to actual GPU memory, like the HBM, we can just keep that, large state in a, a faster, Memory you call SRAM, you think of it as a, as a cache. so if you're more familiar with, CPU, so this is usually a cache and RAM. So, you know, if you have large state, you can keep it in the cache and you don't, by avoiding having to write it down, you actually don't suffer too much if the state is, is large.

Predictions for 2024 architectures

[00:23:33] Nathan Lambert: Would this be due to like input out, like having to move the data around being really slow? Yes. Yeah. That makes a lot of sense. Thanks. That's a really common constraint in a lot of these things, and it's like, right. I think one of the most insightful things I've had now with GPUs versus TPUs and stuff is how mixtures of ex mixture of expert models doesn't work very well in TPUs, just because you have to like that essentially add a mixture of expert at a basic level.

There's a routing layer that you learn, [00:24:00] and then multiple feedforward layers that you can choose from. And when you're distributing this, the feedforward layers could end up. On a different TPU node and TPUs communicate with their neighbors. So TPUs take a big hit relative to GPUs where within video class and video clusters, everything's connected so much more.

And then it's easy to do that sort of distributed training. And that's super interesting. And it's like, do you think there's going to be, I guess this is really where I want to open the conversation of like, what does this mean? What is going to happen in 2024 in this space? Are bigger players going to move in and be exploring this my take, seeing how good the long context learning could be in a fundamental limit is that systems like chat GPT are going to use a dense, like a transformer model for most tasks.

And then if you need to do summarization, you might do a long context specialized architecture. And then we can even see a whole quiver of architectures behind [00:25:00] something that you're using. But I think. It's just like, is attention going to be dethroned? Is Sasha Rush somehow going to win this bet that everyone was following in the area?

I got, what are you thinking about either of you?

[00:25:14] Tri Dao: I think transform is still a very, very strong architecture. and there is a proven recipe, right? You know, people scaling to a trillion of parameters right now, if you want, you say, well, I just want the best performing model. that runs most efficiently on my hardware that has the most support on on the software side.

Fast forward is a safe bet. I think it's here to stay. but I think there are new ideas, like, state space, kind of, some of the linear attention ideas from linear attention. I think they're coming in. we've seen, as Michael mentioned, that mixing some of these components seem to improve performance, revalidated at, I think, seven B scale, but, Maybe it might even work at larger scale.

I think [00:26:00] people tend to be conservative and, you know, focusing too much on modern architecture, might not be worth their time. Like the Lime architecture is very, very strong. Most people are doing off of that. They're focusing on data. they're focusing on infrastructure, which makes sense. I think on, on my side personally, just plain interesting.

They're like more, I would say niche use cases. niche for now, where some of these alternative architectures are interesting, things like long context, different domains like audio and genomics, and there's just plain interesting scientific questions you can ask, like whether it follow instruction just as well, whether it follow intuition just as well, does it play well with quantization and so on.

That's just plain interesting. Research questions we can ask. Now on the production level, I think Transformer is still incredibly strong, very well supported, both hardware and software. But I think some of these new ideas are coming in [00:27:00] and people might start, you know, putting them as part of a component in the Transformer.

Maybe we'll still call them Transformer, but they just have more, more layers and just attention and NLP.

[00:27:11] Michael Poli: Yeah, I 100 percent agree with you. So attention as a, as a computational primitive is not going anywhere anytime soon. It's just a very efficient and a very convenient way to. Increase the effective state of, of your sequence processor. so at some level, if you're working with a model that only has a fixed state in each of its sequence mixers, you're, you have an assumption and your assumption is that you only need so much information in the sequence.

So there's, there's always a trade off between, this kind of the ratio of the state dimension, the sequence length, as you push things to the extreme, either model sizes. So as you make the model bigger, wider, effectively [00:28:00] introduce more states and sequence length, some of these margins. you know, some of this is speculation, but some of these margins will disappear, some of the trade offs will change, especially 14, 30, some of these very fat models.

But certainly either whether that's hybridizing or some kind of new, new block, we're certainly going to see some more innovation. That's, that's really exciting. My, my personal, if I had to make a prediction is that architectural design will get more interesting, more, more complex. There's going to be more to do.

More predictions for AI

[00:28:38] Nathan Lambert: Yeah, I mean, this year it's like, I got some 10 minute clock that's fine for us. I think like with mixture of experts and this being popular as a state state models, like this is all just really within a few months outside of opening. I like they've been doing mixture of experts for a lot longer than everyone.

In terms of open and academic [00:29:00] communities, like no one's really tried to do early Jeff on mixture of experts. Like it should just work, but we have to learn all these things. And then the model grafting is becoming more of a real thing. That's super interesting. It is just. I agree that it's just fun to follow and hopefully it gives academics and scientists more ways to influence the conversation where an industry is just about scaling and bigger models where we could maybe do specific things better, which is what I'm telling open source companies to do with their language models anyways.

Like if they want to have a business model, they need to have an edge. So this all fits into that kind of narrative pretty well with my regards. Is there anything else you guys are following in ML? It doesn't have to be about state space models. Like what's, what's exciting for you broadly for next year?

[00:29:46] Tri Dao: Yeah, personally, I think data is still the most important thing. we're, we're thinking a lot about how data influences the model performance, like really teasing that [00:30:00] out, either, you know, having some of the synthetic tasks that correlates very well with, with model performance. That's been kind of the motivating.

kind of examples in a lot of our papers and work has been focusing on synthetic tasks, or, having like maybe, maybe smaller data sets that kind of make it easier to really understand what's, what's really going on. so, I think I'll, you know, personally, my focus is going to be on data for the next little bit.

Yeah, all the, all the architecture stuff is fun. making that hardware efficient is, is, is, is fun. but I think, ultimately it's about data. If you, if you look at the scaling, scaling law curve, the more architectures. Different model architectures would generally have the same slope. They're just different offset.

it seems like the only thing that changes the slope is the, data quality.

[00:30:58] Nathan Lambert: I love that point. That, that does [00:31:00] seem true. I have the plot from Mamba in this blog post that I'm writing, which is, it's just a little, just a little bit above the same slope.

[00:31:08] Michael Poli: Yeah, we add data. Data is really interesting, sort of miniaturizing, architecture design, finding and breaking down what, tasks are involved into, for example, language modeling and trying to package them into something that can be used to iterate something that's quite exciting. We have, that was one of the main techniques that was used for the, this, zoology, paper that also looks into, into some of these different behaviors.

And personally, I'm also really excited about new applications, scientific applications, with the genomics work, but even more, but more engineering focused, we're seeing a shift, right now it's language is still kind of, The domain that gets the most clicks, [00:32:00] most interest, but I think that that will evolve over time.

and some of these other applications offer, even just talking about architectures, they offer a completely different design space that I'm excited to look into.

[00:32:13] Nathan Lambert: Yeah, everyone talks about language, but I feel like images and entertainment and videos are like the things that are so obviously going to generate so much value to me.

Like, I don't know the ceiling on language, but when you could access a like somewhat local text and video model at your home workstation, that's like tailored to your preferences. Like the amount of value that that creates is totally astronomical. I I'm excited. I mean, I've started playing around with these where I'd take.

Text of the blog and convert it to dolly images and convert it to a video with generated audio all with like one Python script and it's like, that's really easy to do. So I agree with your more than language is fun to have that view

[00:32:55] Tri Dao: and these things actually do work reasonably well in your experience when you stitch [00:33:00] all them together.

[00:33:02] Nathan Lambert: it's not that good. The DALLE images are pretty similar, but I'm doing something really naive where I just, I literally take the text and have a system prompt. It's like you're generating series of images for visualizing a blog post and, and it generates various like. The, all the machine learning thumbnails that you see everyone using, they're like variations of that.

The fun ones are where it's like about Llama or Mamba or something. And then they like generate animals in them, which is good. I think I could get much better at it and have a better segmentation system for the paragraphs and, or have like chat to PT summarize them or something like that. But I just know that within like a year, it was going to be a text to video API and I'm just going to switch it and it's going to be great.

And so I'm like laying the groundwork for infrastructure to have like multimodal. Content as multimodal content distribution, really, and I just expect it to become very fun. I mean, like even the text to voice is pretty good. I think I don't have a studio, but once [00:34:00] you have a studio, it's going to be able to generate perfect audio for whatever you want.

So another one of my dreams that is. Bad for young students is I want to be able to give like a slide deck to a script that returns the five minute conference video that no one ever watches just based on like a, GPT for reading those, the slide deck and voicing yourself. So those are the silly things that I have time to do because I'm not a professor.

[00:34:29] Tri Dao: Yeah, I think these, these, these advances, these systems, like they, they do generate a lot of economic value and, and we're seeing that already. Lots of companies are now switching to using these things. And I think it's going to change the way we work as, as you mentioned, the way we work, the way we're entertained.

So I'm just very exciting future.

[00:34:47] Nathan Lambert: Yeah. Anything else? Well, thanks for coming. Try to get you guys as much. Attention as I can bring, you never know it'll go viral these days. So I think this was a great conversation. People are really hungry for basic intuitions in [00:35:00] the area. So this is good.

[00:35:02] Tri Dao: Yeah. Thank you.

Nathan is a pleasure. Absolutely.

[00:35:07] Michael Poli: for inviting us. And, maybe, if, you know, there are more questions, is there a way to maybe collect them or to, to provide readers with like listeners with, an address or something? Happy to answer anything.

[00:35:24] Nathan Lambert: Yeah. I'll, I'll include contact info in the post and various ways.

This will be out there. You'll get your comments on Substack, YouTube, Twitter. It's a mess. You've got to pay attention to 10 million streams of information these days, but you'll, you'll get contacted by people. Thankfully, for some reason, people read my stuff, but here we are. So thanks for listening.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe