Debates and assumptions surrounding RLHF
RLHF is rooted in the intellectual history of utilitarianism and the von Neumann-Morgenstern (VNM) utility theorem, which reflects long-running debates over whether preferences can be measured at all and over the different kinds of math used to model them. This raises questions about the inductive bias of a preference model and about RLHF's assumption that preferences can be accurately measured.
swyx
Yeah, you have a nice chart of the intellectual history of RLHF that we’ll send people to refer to either in your paper or on the YouTube video for this podcast. But I like the other slide that you have on the presumptions that you need to have for RLHF to work. You already mentioned some of those. Which one’s underappreciated? This is the first time I’ve come across the VNM utility theorem.
Nathan Lambert
Yeah, I know. This is what you get from working with people. My co-host on The Retort is a sociologist by training, so he knows all these things, like who the philosophers are behind ideas like utilitarianism. But there's a lot that goes into this. There are even economic theories where it's debated whether preferences exist at all, and different types of math for whether or not you can actually model preferences. So it's pretty obvious that RLHF is built on the math that assumes you can model any human preference. But this is the sort of thing that's been debated for a long time. All the work here is the stuff people hear about in their AI classes, like Jeremy Bentham and the hedonic calculus; that's the side of the work that assumes preferences can be measured. And this is why I kind of go on a rant and say that in RLHF, calling things a preference model is a little annoying, because there's no inductive bias for what a preference is. If you were to learn a dynamics model for a robotic system, hopefully that actually mirrors the world's dynamics in some way. But with a preference model, it's like, oh, I don't know what this model is actually capturing.
Insights on Process Reward Models and Human-Centric RL in NLP
Process reward models reward each step in chain-of-thought reasoning, providing more granularity for considering different states. There is ongoing debate about whether chain-of-thought reasoning is more like reinforcement learning (RL). The comparison of pre-deep-RL versus deep-RL work shows that the methods now used in NLP originated outside of NLP and before deep learning became prevalent. Human-centric RL has a human give a score as the reward for an agent's action, rather than using a predefined reward function.
Nathan Lambert
There's work that I mentioned on one slide called process reward models, which essentially rewards each step in the chain-of-thought reasoning. It doesn't really give you the interaction part, but it does make things a little more fine-grained, where at least you have many states from your initial state. That formulation, I don't think people have fully settled on. There's a bunch of great work out there; even OpenAI is releasing a lot of this, and Let's Verify Step by Step is their pretty great paper on the matter. I think in the next year that will get made more concrete by the community, including whether you can cleanly argue that chain-of-thought reasoning is more like RL. We can talk about that more later; it's a more advanced topic than we should spend all the time on.
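To make the step-level reward idea concrete, here is a minimal sketch, not taken from the talk or from Let's Verify Step by Step itself, of how a process reward model scores each step of a chain-of-thought solution rather than only the final answer. The `score_step` stub is a hypothetical stand-in for a learned per-step reward model.

```python
from typing import List


def score_step(problem: str, steps_so_far: List[str], step: str) -> float:
    """Placeholder for a learned process reward model that rates a single
    reasoning step in context. A real PRM would be a fine-tuned LM head."""
    return 0.9 if step.strip() else 0.0  # dummy heuristic so the sketch runs


def score_chain_of_thought(problem: str, steps: List[str]) -> List[float]:
    """An outcome reward model gives one score per solution; a process reward
    model gives one score per intermediate step, so each step acts like a
    state that can be credited or penalized on its own."""
    return [score_step(problem, steps[:i], step) for i, step in enumerate(steps)]


if __name__ == "__main__":
    solution = ["Let x be the unknown.", "2x + 3 = 7, so 2x = 4.", "Therefore x = 2."]
    print(score_chain_of_thought("Solve 2x + 3 = 7", solution))
```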
swyx
RLHF for decision making. You have a slide here that compares pre-deep RL versus deep RL.
Nathan Lambert
This is getting into the history of things, which shows that the work people are using now really came from well outside of NLP, and it came before deep learning was big. The step from this paper, TAMER, which is from 2008, involves names that are still really relevant in human-centric RL, Bradley Knox and Peter Stone. If you have an agent take an action, you just have a human give a score from zero to one as a reward, rather than having a reward function. And then, with that classifier, you can train a policy that learns to take actions to maximize that reward.
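As a rough sketch of that loop, assuming a toy discrete environment and a placeholder `human_score` function in place of a real person clicking a zero-to-one rating; this illustrates the idea rather than reproducing the original TAMER algorithm.

```python
import random


def human_score(state: int, action: int) -> float:
    """Stand-in for a person rating the agent's action from 0 to 1.
    In a TAMER-style setup this label replaces an environment reward."""
    return random.random()


def train_human_centric_rl(num_steps: int = 100) -> dict:
    """Learn a model of the human's score and act greedily against it."""
    reward_model = {}  # maps (state, action) -> predicted human score
    state = 0
    for _ in range(num_steps):
        # pick the action the learned model currently predicts the human will like
        action = max(range(4), key=lambda a: reward_model.get((state, a), 0.5))
        score = human_score(state, action)  # human feedback used as the reward
        old = reward_model.get((state, action), 0.5)
        reward_model[(state, action)] = old + 0.1 * (score - old)  # simple update
        state = (state + action) % 10  # toy transition dynamics
    return reward_model


if __name__ == "__main__":
    print(len(train_human_centric_rl()))
```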
Modeling Pairwise Preferences and Aggregation
The pairwise preference approach, specifically the Bradley-Terry model from the 1950s, gained popularity as other methods failed. It relies heavily on the aggregation of preferences, which is not always accurate due to individual differences. The approach aims to model preferences around correctness and style rather than controversial or deeper notions of preference.
Nathan Lambert
The answer is really kind of no. A lot of people tried that, it didn't really work, and that's why they tried this pairwise preference thing, and it happened to work. This Bradley-Terry model comes from the 1950s; it's from the fields I was mentioning earlier. And it's wild how much of this sticks around. The screenshot I have in the slides is from the DPO paper, I think it might be the appendix, but it's still really present in the literature of what people are doing for RLHF. So that's a fun one to know.
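For reference, this is the standard Bradley-Terry formulation most RLHF reward models use, written as a small PyTorch sketch rather than anything from the slides: the probability that the chosen completion beats the rejected one is the sigmoid of the reward difference, and the reward model is trained on the negative log-likelihood of the human's pairwise choices.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    Training the reward model means minimizing the negative log-likelihood
    of the annotators' pairwise choices."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# toy example: scalar rewards the model assigned to a batch of three pairs
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, 1.1])
print(bradley_terry_loss(r_chosen, r_rejected))  # lower when chosen outscores rejected
```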
swyx
I'll point out one presumption that this heavily relies on. You mentioned this as part of your six presumptions that we covered earlier, which is that you can aggregate these preferences. This is not exactly true among all humans, right? I have a preference for one thing, you have a preference for a different thing. And coming from economics, which you mentioned earlier, there's a name for this: Arrow's impossibility theorem, which I'm sure you've come across.
Nathan Lambert
Yeah, it's one of the many things we throw around in the paper. Right. Do we just ignore it? We just aggregate, yeah. I think the reason this really gets done, at a deep level, is that you're not actually trying to model any contestable preference. You're not trying to go into things that are controversial or anything; the notion of preference stays around correctness and style rather than any meaningful notion of preference, because otherwise these companies don't want to do this at all. I think that's just how it is. And it shows if you look at what people actually do. So I have a bunch of slides on the feedback interface.
swyx
And they all publish this.
Does RLHF Improve Evaluation Metrics?
RLHF has not clearly been shown to improve evaluation metrics, judging from what OpenAI has demonstrated externally. While it can change your language model substantially, it may not always produce the desired results. It is not necessarily intended to enhance multiple-choice reasoning capabilities, though it might help in certain preference setups. Overall, RLHF is a powerful yet complex tool.
Nathan Lambert
It's really that RLHF has not been shown to improve capabilities yet. One of the fun ones is from the GPT-4 technical report. They essentially listed their kind of bogus evaluations; it's a hilarious table, because it's the LSAT, AP exams, and then AMC 10 and AMC 12, which are kind of reasonable evals in language model land. And they just showed that RLHF doesn't improve their evaluation metrics. We don't know if internally they have other ones; they probably do. But from what OpenAI has shown us externally, RLHF improves some metrics and decreases others, and no one outside can really see. I do think it does things that they care about, but RLHF is not an easy tool to make numbers go up with. It's a powerful tool to change your language model, but as we've seen with Llama and safety RLHF, that doesn't always mean people are going to be happy with those changes, or that it's going to do exactly what you want.
swyx
Well, I think this is intuitive. A lot of these tests are multiple choice, and RLHF isn’t necessarily intended to improve your multiple choice reasoning capabilities.
Nathan Lambert
Yeah, I think it is reasonable, but I don't think a lot of people have connected the dots there. And what would that even be as a preference data point? What if your preference data was between a correct and a wrong answer? It could conceivably do it, but I just don't think that's remotely what it's actually doing. It's much better at being a sommelier. Yeah. That was the weirdest one included in the GPT-4 evals.
swyx
Yeah, I just see that, the last three down there.
Nathan Lambert
That’s really funny.
swyx
I can’t even taste it. You can’t even taste it. Cool. Emerging directions.
How to Use RLHF to Make the Model Better Without Using PPO
Rejection sampling and best-of-n sampling can improve the quality of generated answers by spending more inference compute, guided by a preference dataset. The approach, used in OpenAI's Let's Verify Step by Step paper among others, involves generating multiple responses to a prompt, passing them through a reward model, and selecting the one with the highest scalar score. It is a logical and effective technique used by many researchers and can enhance outputs without relying on PPO.
Nathan Lambert
Yeah, so this is essentially how to use RLHF-like things to make the model better without using PPO, because PPO is kind of a nightmare to scale. The first thing I started with is the idea of rejection sampling and best-of-n sampling. I think best-of-n sampling is what people often encounter first: you take a prompt, you generate, like, 10 or 20 responses to it, you pass them through a reward model, the reward model assigns a scalar to each of them, and you pick the one with the highest number as the one you answer the question with. It seems pretty logical to people because it's just spending more inference-time compute to make your outputs better, and it works in a lot of settings. The Let's Verify Step by Step paper that I talked about from OpenAI uses it; lots of papers use it. It's just a good thing to know you can do: you can spend more inference compute, based on a preference dataset, to make your answers better.
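A minimal sketch of best-of-n sampling as described above, with placeholder `generate` and `reward_model` callables standing in for a real language model and a trained reward model; the names are illustrative rather than any specific library's API.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> str:
    """Spend extra inference compute: sample n completions, score each with
    the reward model, and answer with the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))


# dummy stand-ins so the sketch runs; a real setup calls an LM and a trained RM
dummy_generate = lambda p: f"answer-{random.randint(0, 999)}"
dummy_reward = lambda p, c: random.random()
print(best_of_n("What is RLHF?", dummy_generate, dummy_reward, n=8))
```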
Rejection Sampling for the RLHF Process
Rejection sampling puts best-of-n sampling in a feedback loop: instead of returning only the single best answer, you keep the best few answers and apply instruction tuning on that dataset. Llama 2 started its RLHF process with rejection sampling to get a signal out of preference data, which went into a reward model used for ranking. The method is easier to implement than PPO and uses the ordinary autoregressive loss, making it a much easier starting point than full RL at scale. Offline RL is also a relevant approach for RLHF, since the model doesn't have to generate data; instead it looks at existing data and backpropagates through the reward model.
Nathan Lambert
The thing people are more confused about is rejection sampling, because Meta talked about it in Llama 2. Essentially, rejection sampling is putting something like best-of-n sampling in a feedback loop. Instead of just returning the best answer to a user, you take the best few answers and apply instruction tuning on that dataset. You do the instruction tuning, then you can collect more preference data, train a new reward model, rank some new outputs, and do instruction tuning again. So essentially, Llama 2 started its RLHF process with this to get some signal out of preference data. That preference data went into a reward model, and the reward model did a good enough ranking that it was essentially super-powered instruction tuning based on rewards. It works pretty well and is much easier to implement than PPO, because it's still instruction tuning, so it's the same autoregressive loss. It's easy to plug into things like Transformers, and a lot easier to start with than whatever freaking mess doing RL at scale is going to be. A quick nod, too, that offline RL is something people talk about for RLHF, essentially because your model doesn't have to generate. In that case, you look at existing data and backpropagate through your reward model directly, whereas in PPO you have the step of needing to generate everything and pass it through the reward model.
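And a rough sketch of a single rejection-sampling round in the spirit of what's described above: best-of-n style filtering produces a dataset that then goes through ordinary instruction tuning. The helpers are placeholders, and the real Llama 2 recipe has more moving parts.

```python
from typing import Callable, List, Tuple


def rejection_sampling_round(prompts: List[str],
                             generate: Callable[[str], str],
                             reward_model: Callable[[str, str], float],
                             n: int = 8,
                             keep: int = 1) -> List[Tuple[str, str]]:
    """One round: sample n completions per prompt, keep the top-scoring ones,
    and return (prompt, completion) pairs for standard instruction tuning
    with the usual autoregressive loss."""
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        candidates.sort(key=lambda c: reward_model(prompt, c), reverse=True)
        dataset.extend((prompt, c) for c in candidates[:keep])
    return dataset

# The outer loop alternates: fine-tune on the returned dataset, optionally
# collect new preferences and retrain the reward model, then run another
# round with the improved policy.
```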
Specific Feedback for Domain-Specific AI Development
Different feedback types, such as written feedback, labeling multiple scores, and multiple pairwise preferences per completion, are expected to be used for different domains. Chain-of-thought reasoning and process reward models suit math but may not be ideal for poetry; as AI tools improve, they become more domain-specific. Constitutional AI generates preference data by having a second model evaluate the outputs of the first model based on principles drawn from sources such as the UN Declaration of Human Rights and the Apple terms of service.
Nathan Lambert
Feedback types are probably going to come into play. There are papers on written feedback, or on labeling multiple scores or multiple pairwise preferences for every completion. That's coming. It's also related to what we mentioned with process reward models, where you're labeling each step in the chain-of-thought reasoning just to make the problem more specific. It seems very likely that different feedback will be used for different domains. Chain-of-thought reasoning is great for math, and that's where these process reward models are being designed. Probably not great for things like poetry, but as any tool gets better, it gets more specific. Then we get into more of a talking point, which I think is fun. The next one I have is constitutional AI. I think this is something that people really just kind of misunderstood. I think most people thought that constitutional AI was creating the preference data based on the specific principles in some explicit way. What did you two think of constitutional AI?
swyx
Yeah, I'll be the dumb person and you correct me. As far as I understood, Anthropic came out and said that the best way of generating this sort of preference data, or alignment, is to give a second model a constitution with which to evaluate the first model's outputs. Yeah. The constitution is unspecified, but it draws from the UN Declaration of Human Rights and the Apple Terms of Service, for some reason.
Guided Sampling and Implicit Values in AI Model Training
The labeling model picks between two candidate completions based on principles sampled from the constitution, producing a new preference dataset. The process is less explicit than expected, relying on averages and scale to incorporate the principles. It is essentially the familiar RLHF setup of instruction tuning and preference data collection, except an AI model provides critiques based on sampled constitutional values. Framed this way it seems more tractable, though actual practice may deviate from what the paper states.
Nathan Lambert
Which is essentially: pick between these two answers based on this principle. So they're sampling from the principles in their constitution and from two candidate completions, A and B. The AI model is given the context of a certain principle and picks the A or B preference, and the new preference dataset is just the two completions, without the context of the principle. With this sampling idea, they're drawing from something like 30 principles and a wide dataset of candidate completion pairs across the different prompts. So to me it's very loose; the values are not explicit in this, it's just how the sampling is guided. It's a very machine learning approach, because it relies on averages and scale to get the principles in there, but it's way less explicit than I thought it was going to be. I kind of thought there was some feedback step on the preference data that checked whether the principles were satisfied, or something like that. It's really just a modification to the RLHF setup we've talked about, with instruction tuning and preference data collection, where there's an AI model providing critiques, and a lot of those critiques are based on sampling of constitutional values. It almost sounds more tractable that way. But even as I say, oh look, I figured it out, I'm guessing they do different things now than they said in the paper. That paper is from 2022.
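Here is a schematic of the preference-labeling step described above; it is an illustration rather than Anthropic's actual pipeline or prompt format. One principle is sampled, an AI labeler picks between two candidate completions under that principle, and only the resulting preference pair is kept.

```python
import random
from typing import Callable, Dict

PRINCIPLES = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most honest and transparent.",
    # ... roughly 30 such principles make up the constitution
]


def label_with_constitution(prompt: str,
                            completion_a: str,
                            completion_b: str,
                            ai_labeler: Callable[[str], str]) -> Dict[str, str]:
    """Sample one principle, ask the labeler model to pick A or B under that
    principle, and keep only the preference pair; the principle itself is not
    stored with the resulting data."""
    principle = random.choice(PRINCIPLES)
    judgement = ai_labeler(
        f"{principle}\n\nPrompt: {prompt}\n(A) {completion_a}\n(B) {completion_b}\n"
        "Which response is better? Answer A or B."
    )
    if judgement.strip().upper().startswith("A"):
        chosen, rejected = completion_a, completion_b
    else:
        chosen, rejected = completion_b, completion_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```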
Anticipating the need for superalignment and AI overlords
Advances in AI are reaching a point where manual human preference data collection cannot scale, which means trusting AI to model human preferences. This leads to the idea of superalignment: preparing for a future in which the AI being controlled is smarter than we are. Since humans may no longer be fully in control at the point of superintelligence, the proposed approach is weak-to-strong generalization, a concept that links back to constitutional AI.
Nathan Lambert
They probably publish less than they know. So I think they have things that are pretty cool that they're doing internally.
swyx
And I'll summarize for listeners who may not have seen the paper because, you know, it's impossible to keep up with everything. I do think that what constitutional AI and RLAIF represent is that we're starting to reach a point where it's just impossible for manual human preference data collection to scale, and the only way to scale this is to trust our AI overlords to model our human preferences. Constitutional AI was the first version of this. What the second version, weak-to-strong generalization, anticipates is a future need for superalignment, where the thing we're trying to control is smarter than us. So you take GPT-2 and use it to supervise GPT-4, trying to get the strong model to outperform its weak teacher, because that's what we're going to have to do in the future when we're no longer fully in control.
Nathan Lambert
Are we the metaphorical GPT-2?
Preparing for Super Intelligence
The advance toward superintelligence may leave humans with little say in the decision-making process, raising the question of how to control such systems; weak-to-strong generalization appears to be the proposed answer. Constitutional AI and superalignment are seen as closely linked, with a call for clearer communication from the superalignment team. The focus then shifts toward making safe models more useful and the emergence of direct preference optimization.
swyx
No, we're not even in the process anymore at the point of superintelligence. They're prepping, and they're saying this will happen, and humans will be so far out in the dust that we just have no say in this debate. How do we still control systems then? Weak-to-strong generalization seems to be the answer, and I see a lineage from constitutional AI to this. Yeah, constitutional AI and superalignment are very conceptually
Nathan Lambert
Linked. It's a group of people with a very similar intellectual upbringing who have worked together for a long time, coming to the same conclusions in different ways. And I understand the argument, and I mostly just don't. I think I'm just waiting to see more from the superalignment team, because I couldn't quickly put together in my head, looking at weak-to-strong generalization, exactly how it all fits. But I'm also not a safety researcher. I think that could be feedback for them: I understand what synthetic data means and all of this, so how could they communicate that a little more specifically in this context? Because I want to know what they think, which is why I like that Pareto-optimal framing, because it
swyx
Steers the debate away from x-risk to, no, this makes the models more useful, and we can all get past that. I agree. I think the last kind of emerging direction that I have might
Nathan Lambert
Just be this debate, and you can control how long we talk about this, which is about direct preference optimization. You could go read my blog post on this. I had...
DPO Models Expected in the Next Six Months
DPO models are expected to be the most common in the next six months, and are likely what most people will encounter. PPO still has advantages in certain settings, such as code, where whether the code runs can serve as the score, while DPO requires constructing comparable data by hand. The DPO authors, Rafael, Eric, and Archit, are recommended for further insights on the topic, and their method is defended as an excellent, mathematically grounded study in language models.
Nathan Lambert
The models kind of work off of each other. So in a lot of ways, I think DPO will still be what people see, but in some ways it's probably slightly more constrained. There are other settings where you could think of PPO working nicely, like code, where whether your code runs is the score you give it; you'd have to do canned things to get DPO the same kind of data. So there are specific cases where the DPO formulation is a little bit harder, but I expect to see more DPO models than anything else in the next six months. That's probably what most people need to know unless they're an RLHF expert. And I would love to learn more about PPO from a lot of authors in this space, and the DPO authors are great to talk to; you can reach out to all three of them.
swyx
So, as of the time of recording, we're actually about to publish our NeurIPS recap, where we talk to the authors. So for people who are listening to this in the future, you can refer to that episode. Yeah, so, Rafael,
Nathan Lambert
Eric, and Archit. I've talked to all of them at good length, and they're all fantastic. They'll say similar things, and they'll also defend their method, because it's an awesome paper. If you want to learn what a good mathy, but still experimental, paper in language models looks like, the DPO paper is a really good one to spend more time on.
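For readers who want the equation being gestured at: the core DPO objective, written as a small PyTorch function over per-sequence log-probabilities. Computing those summed token log-probs under the policy and the frozen reference model is assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: increase the margin by which the policy prefers the chosen answer
    over the rejected one, relative to a frozen reference model, with no
    explicit reward model or RL rollouts."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# toy tensors standing in for summed sequence log-probs of one preference pair
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)
```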
swyx
Yeah. When I asked them questions about it, they just kind of gestured at the poster and said, look at the equation, just stare at it.
Nathan Lambert
And yeah, that's my...
The Allen Institute for AI's Transition to Model Releases
The Allen Institute for AI, founded by Paul Allen of Microsoft, is transitioning from being a highly academic lab known for publishing hit research papers to also releasing models. Under new CEO Ali Farhadi, the institute aims to move from a papers-only focus to releasing models, being active in policy, and collaborating with for-profit institutions.
Nathan Lambert
So it is a really cool idea, and that's the type of thing that academia can still do really well, and hopefully continues to do. Yeah. One thing I wanted to make sure we cover before we leave this topic, you know,
swyx
One of the DPO models that were trained, apart from Zephyr and Mixtral, which are two of the more high-profile ones, is Tulu from the Allen Institute, and you're one of the few people in a place to explain it. So, funny enough, what's the Allen Institute doing here? And what's the backstory?
Nathan Lambert
Yeah, so the Allen Institute for AI... I think the 10-year birthday is in January; there's a special event for that.
swyx
And also, people should know, this is Paul Allen from Microsoft.
Nathan Lambert
Yeah, Paul Allen owns everything in Seattle. Not literally; I mean, he's passed, and his estate is still operating in a lot of great ways. But the Allen Institute is mostly known as a super academic lab, one that has more resources than most of academia and publishes hit after hit of research papers. They're trying to move more in the direction of releasing models, and this is part of why I joined. It came from talking with the new CEO, Ali Farhadi, and I don't know if I pronounce the last name right, but he's trying to move from an org that does papers only to something that does papers, releases models, is active in policy, and maybe helps work with these for-profit institutions that don't have an established place they could all go through to do...
The Importance of Evaluations and Adoption in Open Models
Evaluating open models against a reference like DaVinci-003 with a GPT-4 judge is crucial, with a focus on win-rate calculations and custom prompt sets like MT-Bench. The source of the prompts, such as Self-Instruct, Vicuna, Koala, and the other AlpacaEval sources, shapes the evaluation, but ultimately the proof of a good model lies in people actually using it. The Zephyr model from Hugging Face showed the impact of a well-received open release, quickly getting integrated into various products and applications.
Nathan Lambert
DaVinci-003, which is one of OpenAI's older instruction models; AlpacaEval calculates the win rate that GPT-4 assigns between the new model and DaVinci-003. So it has many more prompts than MT-Bench; MT-Bench is a set of custom prompts they made just to take a stance on what a good chat model is. AlpacaEval sources its prompts from Self-Instruct, which is a popular paper from AI2, plus Open Assistant, Vicuna, Koala, and Anthropic's helpful and harmless data. So AlpacaEval draws from sources that people know and love, while MT-Bench is its own thing. We were more focused on MT-Bench at Hugging Face; at AI2 we're a little bit more focused on AlpacaEval, but it really can go either way. These are kind of table stakes to saying that you have a good RLHF model: you should be able to get a pretty good score on both of them. And then the proof is in people actually talking to it. I think the Zephyr model from Hugging Face was a kind of step change in people's perception of open models, and it got integrated into a bunch of products within a few weeks. You.com was experimenting with it, and I saw some Substacker using it as a writing-feedback bot instead of ChatGPT. That's what happens when a good open release is out there: the evaluations are good and people pick it up. And the evaluations are just...
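As a schematic of the win-rate idea described above (this is not the alpaca_eval library's actual API, and all of the callables are hypothetical): a judge model such as GPT-4 compares the candidate model's answer against a reference model's answer for each prompt, and the win rate is simply the fraction of comparisons the candidate wins.

```python
from typing import Callable, List


def win_rate(prompts: List[str],
             candidate: Callable[[str], str],
             reference: Callable[[str], str],
             judge_prefers_candidate: Callable[[str, str, str], bool]) -> float:
    """Fraction of prompts where the judge (e.g. GPT-4) prefers the candidate
    model's answer over the reference model's answer (e.g. DaVinci-003)."""
    wins = sum(
        1 for p in prompts
        if judge_prefers_candidate(p, candidate(p), reference(p))
    )
    return wins / max(len(prompts), 1)
```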
Balancing Synthetic and Human Data in RL
The discussion turns to the challenge of data problems in RLHF, where fixing one problem often surfaces a new one. Automating and optimizing the model over a longer loop would look more like classic RL, but feels years away. There is an open debate between synthetic and human data, with both likely to coexist for a while, and growing ambition in the field to start companies.
Nathan Lambert
It's probably whack-a-mole, where they're like, oh, there's this problem, we have the data, we can fix this, and then some new problem pops up after doing RLHF, and they're studying this. If you could really figure it out, this is where things start to look more like RL: you could automate it, over a longer time frame of optimizing the model. It would be cool, but I feel like I'm years away from ever actually working on this. But we can try to get details from people who are excellent. Awesome. Anything else that we missed?
Alessio Fanelli
I think we covered a lot of it.
Nathan Lambert
I mean, I'm good. I would ask you guys whether you know companies that are doing this and things, but I know some that are in the RLHF-as-a-service space that will become busy, I think for good reason, just because... There are companies doing RLHF as a service. Yeah, both of them are. It depends on whether synthetic data is going to win over human data. If human data is the real winning feature in the end, it's a big capital investment, so it kind of makes sense as a VC model anyway, but there are going to be both of them for a while.
Alessio Fanelli
It'd be cool. You see a lot of people, because I know Louis Castricato is starting a company, is there a lot of ambition in this field to start companies, or is this more of a research-driven part of the
Nathan Lambert
Stack that maybe it just stays there? There definitely is, because I know my former colleague Nazneen Rajani from Hugging Face is also starting a company in this space. The...
