• Episode AI notes
  1. RLHF is built on the assumption that human preferences can be accurately measured, an assumption that has been debated for a long time.
  2. Process reward models reward each step in the chain of thought reasoning, providing more granularity for considering different states.
  3. Pairwise preferences, modeled with the Bradley-Terry model, became the standard approach after attempts to collect direct scalar scores did not work well.
  4. RLHF can change language models, but it does not always produce the desired results and is not designed to enhance multiple-choice reasoning capabilities.
  5. Rejection sampling and best-of-n sampling are important techniques for improving outputs in text generation based on preference datasets.
  6. Different feedback types, such as written feedback and pairwise preferences, may be used for different domains in AI development.
  7. Advancements in AI are leading to the need for super alignment, where the AI being controlled is smarter than humans.
  8. Guided sampling is used to pick preferences based on principles from a constitution, resulting in a new preference dataset.
  9. DPO models are expected to be more prevalent in the next six months, and the DPO paper provides insights into language models with a strong mathematical foundation.
  10. The Allen Institute for AI is transitioning from solely publishing research papers to also releasing models and being active in policy.
  11. Evaluations of open models are crucial, and a good open release can quickly get a model integrated into various products and applications.
  12. Solving data problems in RL involves balancing synthetic and human data, and there is growing ambition in the field to start companies and entrepreneurial endeavors.

  • Debates and assumptions surrounding RLHF Summary: RLHF is rooted in the intellectual history of utilitarianism and the von Neumann-Morgenstern (VNM) utility theorem, which reflects debates over whether preferences can be measured at all and over the different types of math used to model preferences. This raises questions about the inductive bias of a preference model and the assumption in RLHF that preferences can be accurately measured.

    Speaker 2
    Yeah, you have a nice chart of the sort of intellectual history of RLHF that we will send people to refer to, either in your paper or in the YouTube video for this podcast. I like the other slide that you have on the presumptions that you need for RLHF to work. You already mentioned some of those; which ones are underappreciated? This is the first time I’ve come across the von Neumann-Morgenstern utility theorem.
    Speaker 1
    Yeah, I know. This is what you get from working with people. My co-host on the podcast, The Retort, is a sociologist by training, so he knows all these things and who the philosophers are that founded these different ideas, like utilitarianism. But there’s a lot that goes into this. Essentially, there are even economic theories where there’s debate over whether preferences exist at all, and different types of math for whether you can actually model preferences at all. So it’s pretty obvious that RLHF is built on the math that assumes you can actually model any human preference. But this is the sort of thing that’s been debated for a long time. All the work that’s here, the stuff people hear about in their AI classes, like Jeremy Bentham and the hedonic calculus and all these things, is the side of the work that assumes preferences can be measured. And this is where I kind of go on a rant and say that in RLHF, calling things a preference model is a little annoying, because there’s no inductive bias of what a preference is. If you’re learning a dynamics model for a robotics system, hopefully that actually mirrors the dynamics of the world in some way. But with a preference model, I don’t know what the model is really capturing.
  • Insights on Process Reward Models and Human-Centric RL in NLP Summary: The process reward models reward each step in the chain of thought reasoning, providing more granularity for considering different states. There is ongoing debate about whether chain of thought reasoning is more like reinforcement learning (RL). The comparison of pre-deep RL versus deep RL shows that the current work in NLP originated from outside of NLP and before the prevalence of deep learning. Human-centric RL involves having a human give a score as a reward for an agent’s action, rather than having a predefined reward function.

    Speaker 1
    There’s work that I mentioned on one slide called process reward models that essentially rewards each step in the chain of thought reasoning. It doesn’t really give you the interaction part, but it does make things a little more fine-grained, where you can at least think about having many states from your initial state. That formulation, I don’t think people have fully settled on. I think there’s a bunch of great work out there; even OpenAI is releasing a lot of this, and Let’s Verify Step by Step is their pretty great paper on the matter. I think in the next year that’ll probably get made more concrete by the community, like whether chain of thought reasoning really is more like RL. We can talk about that more later; that’s a more advanced topic than we should probably spend all the time on.
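    To make the distinction concrete, here is a minimal Python sketch (not OpenAI’s actual PRM): a hypothetical score_step function stands in for a trained process reward model that scores every reasoning step, in contrast to an outcome reward that only scores the final answer.

```python
# A minimal sketch, not OpenAI's PRM: a process reward model gives one scalar
# per reasoning step, while an outcome reward scores only the final answer.
from typing import Callable, List

def process_rewards(steps: List[str], score_step: Callable[[str], float]) -> List[float]:
    """One reward per chain-of-thought step (fine-grained credit over many states)."""
    return [score_step(s) for s in steps]

def outcome_reward(steps: List[str], score_final: Callable[[str], float]) -> float:
    """A single reward for the whole solution (coarse credit)."""
    return score_final(steps[-1])

# Toy scorer standing in for a trained PRM: it just checks for an explicit equation.
toy_scorer = lambda step: 1.0 if "=" in step else 0.0
steps = ["Let x be the unknown.", "2x + 3 = 7", "x = 2"]
print(process_rewards(steps, toy_scorer))  # [0.0, 1.0, 1.0]
print(outcome_reward(steps, toy_scorer))   # 1.0
```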
    Speaker 2
    RLHF for decision making, you have a slide here that compares pre-deep RL versus deep RL.
    Speaker 1
    This is getting into the history of things, which is showing that the work people are using now really came from well outside of NLP, and it came before deep learning was big. It starts from this paper, TAMER, which is from 2008, with some names that are still really relevant in human-centric RL: Bradley Knox and Peter Stone. If you have an agent take an action, you would just have a human give a score from zero to one as a reward, rather than having a reward function. Then with that classifier, you can do something with a policy that learns to take actions to maximize that reward.
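    A toy sketch of that human-centric loop, in the spirit of TAMER rather than its actual code; the state and action names and the running-average update below are illustrative assumptions, not the published method.

```python
# Illustrative toy loop in the spirit of TAMER, not the original code: a human
# rates each action in [0, 1], we keep a running average of those ratings, and
# the policy picks the action with the highest predicted human score.
import random

human_scores = {}   # (state, action) -> running average of human ratings
counts = {}

def update_human_model(state, action, score):
    key = (state, action)
    counts[key] = counts.get(key, 0) + 1
    prev = human_scores.get(key, 0.0)
    human_scores[key] = prev + (score - prev) / counts[key]

def act(state, actions, epsilon=0.1):
    # Mostly exploit the learned model of human reward; explore occasionally.
    if random.random() < epsilon or not human_scores:
        return random.choice(actions)
    return max(actions, key=lambda a: human_scores.get((state, a), 0.0))

# One interaction step: the agent acts, the human types a rating in [0, 1].
state, actions = "s0", ["left", "right"]
chosen = act(state, actions)
update_human_model(state, chosen, score=0.8)  # pretend the human rated it 0.8
```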
  • Modeling Pairwise Preferences and Aggregation Summary: The pairwise preference approach, specifically the Bradley-Terry model from the 50s, gained popularity as other methods failed. It heavily relies on the aggregation of preferences, which is not always accurate due to individual differences. This approach aims to model preferences based on correctness and style rather than controversial or meaningful notions of preference.

    Speaker 1
    And the answer is really kind of no; a lot of people tried that and it didn’t really work. And then that’s why they tried this pairwise preference thing, and it happened to work. This Bradley-Terry model comes from the 50s; it’s from those fields I was mentioning earlier. And it’s wild how much this happens. The screenshot I have in the slides is from the DPO paper, I think it might be the appendix, but it’s still really around in the literature of what people are doing for RLHF. Yeah. So that’s a fun one to know.
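    For reference, here is a minimal sketch of the Bradley-Terry objective as it is commonly used to train RLHF reward models; the scalar rewards below are toy stand-ins for a reward model’s outputs on (prompt, completion) pairs.

```python
# Sketch of the Bradley-Terry reward-model objective used in RLHF:
# P(chosen beats rejected) = sigmoid(r_chosen - r_rejected), and we minimize
# the negative log-likelihood of the human's pairwise choices.
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards for a batch of 4 preference pairs. In practice these
# come from a reward model applied to (prompt, completion) pairs.
r_chosen = torch.randn(4, requires_grad=True)
r_rejected = torch.randn(4, requires_grad=True)
loss = bradley_terry_loss(r_chosen, r_rejected)
loss.backward()  # gradients would flow into the reward model's parameters
```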
    Speaker 2
    I’ll point out one presumption that this heavily relies on. You mentioned this as part of your six presumptions that we covered earlier, which is that you can aggregate these preferences. This is not exactly true among all humans, right? I have a preference for one thing, you have a preference for a different thing. And coming from economics, you mentioned economics earlier, there’s a theorem for this called Arrow’s impossibility theorem, which I’m sure you’ve come across.
    Speaker 1
    Yeah. Yeah. It’s one of the many kind of things we throw around in the paper. Right.
    Speaker 2
    Do we just ignore it? Yeah.
    Speaker 1
    You just, yeah, just aggregate. Yeah. Yeah. Yeah. I think the reason this really is done on a deep level is that you’re not actually trying to model any contestable preference in this. Like you’re not trying to go into things that are controversial or anything. It’s really the notion of preference is trying to stay around like correctness and style rather than any meaningful notion of preference. Cause otherwise these companies, they don’t want to do this. Like at all. I think that’s just how it is. And it’s like, if you look at what people actually do, so I have a bunch of slides on the feedback interface. And they all publish this.
  • 1min Snip Summary: RLHF may not improve evaluation metrics, according to what OpenAI has shown externally. While it can change your language model, it may not always produce desired results. It is not necessarily intended to enhance multiple-choice reasoning capabilities, but it might have potential in certain preference scenarios. Overall, RLHF is a powerful yet complex tool.

    Speaker 1
    But it’s really like, RLHF is not shown to improve capabilities yet. I think one of the fun ones is from the GPT-4 technical report. They essentially listed their kind of bogus evaluations, it’s a hilarious table, because it’s like the LSAT and AP exams. And then AMC 10 and AMC 12 are kind of reasonable evals in language model land. But they just showed that RLHF doesn’t improve their evaluation metrics. We don’t know if internally they have other ones; they probably do. But from what OpenAI has shown us externally, RLHF improves some metrics, it decreases some metrics, and no one can really tell. I do think it does things that they care about. But RLHF is not an easy tool to make numbers go up with. It’s a powerful tool to change your language model. But as we’ve seen with Llama and safety RLHF, that doesn’t always mean people are going to be happy with those changes or that it’s going to do exactly what you want.
    Speaker 2
    It’s like, well, I think this is intuitive. A lot of these tests are multiple choice, and RLHF isn’t necessarily intended to improve your multiple-choice reasoning capabilities.
    Speaker 1
    Yeah, I think it is reasonable. But I don’t think a lot of people have connected the dots there. And like, what if one of your preference data points was between a correct and a wrong answer? It could conceivably do it. But I just don’t think that is remotely what it is actually doing.
    Speaker 2
    It’s much better at being a sommelier. Yeah.
  • 1min Snip Summary: Using rejection sampling and best-of-n sampling can improve the quality of generated answers by spending more inference compute, based on a preference dataset. This approach, discussed in the Let’s Verify Step by Step paper from OpenAI, involves generating multiple responses to a prompt, passing them through a reward model, and selecting the one with the highest scalar value. It’s a logical and effective technique used by many researchers and can enhance outputs without relying on PPO.

    Speaker 2
    That was the weirdest one.
    Speaker 1
    That was included in the GPT-4 report. Yeah, dude, I just saw that in the last three down there. That’s really funny. Can’t even taste it. Cool, emerging directions. So this is essentially how to use RLHF-like things to make the model better without using PPO, because PPO is kind of a nightmare to scale. The first thing I started with is the ideas of rejection sampling and best-of-n sampling. I think best-of-n sampling is what people often encounter first, which is the idea that you take a prompt, you generate like 10 or 20 responses to it, and you pass them through a reward model. The reward model assigns a scalar to each of them, you pick the one with the highest number, and that’s the one you answer the question with. It seems pretty logical to people because it’s just spending more inference-time compute to make your outputs better. And it works in a lot of things. The Let’s Verify Step by Step paper that I talked about from OpenAI uses it. Lots of papers use it. It’s just a good thing to know that you can do: you can spend more inference compute, based on a preference dataset, to make your answers better.
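    A minimal sketch of best-of-n sampling as just described; generate and reward_model are assumed stand-ins for your own language model and trained reward model.

```python
# Best-of-n sampling: sample n completions, score each with a reward model,
# and answer with the highest-scoring one.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]   # sample n completions
    scores = [reward_model(prompt, c) for c in candidates]         # one scalar each
    return candidates[scores.index(max(scores))]                   # answer with the best one

# Usage sketch: more inference-time compute (larger n) for hopefully better outputs.
# answer = best_of_n("Explain RLHF in one paragraph.", my_generate, my_reward_model)
```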
  • Rejection Sampling for RLHF Process Summary: Rejection sampling involves putting best-of-n sampling in a feedback loop: keep the best few answers, then apply instruction tuning on that dataset. Llama 2 started its RLHF process with rejection sampling to get a signal out of preference data, which went into a reward model for ranking. This method is easier to implement than PPO and uses the same auto-regressive loss as instruction tuning, making it much easier to start with than RL at scale. Offline RL is also a relevant approach for RLHF, as the model doesn’t have to generate data; instead it looks at existing data and backpropagates through the reward model.

    Speaker 1
    The interesting thing that people are more confused about is rejection sampling, because Meta talked about it in Llama 2. Essentially, rejection sampling is putting something like best-of-n sampling in a feedback loop. Instead of just returning the best answer to a user, you take the best few answers and then you apply instruction tuning on that dataset. Then you do the instruction tuning, and then you can collect more preference data, train a new reward model, rank some new outputs, and do instruction tuning again. So essentially Llama 2 started their RLHF process with this to get some signal out of preference data. That preference data went into a reward model, and the reward model did a good enough ranking that it was essentially super-powered instruction tuning based on rewards. It works pretty well and is much easier to implement than PPO, because it’s still instruction tuning, so it’s the same auto-regressive loss. It’s easy to plug into things like transformers and stuff like that, and a lot easier to start with than whatever freaking mess doing RL at scale is going to be. So that’s one. A quick nod that offline RL is something people talk about for RLHF, essentially because your model doesn’t have to generate in that case; you look at data and it backpropagates through your reward model directly. In PPO you have the step of needing to generate everything and pass it through the reward model.
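    A rough sketch of one round of that loop, following the Llama 2 description rather than Meta’s actual code; generate, reward_model, and finetune are assumed stand-ins.

```python
# One round of the rejection-sampling loop sketched above: rank sampled
# completions with a reward model, keep the best few per prompt, and run
# ordinary instruction tuning (auto-regressive loss) on the filtered set.
def rejection_sampling_round(prompts, generate, reward_model, finetune,
                             n_samples=8, keep_top=2):
    filtered = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda c: reward_model(prompt, c), reverse=True)
        filtered.extend((prompt, c) for c in ranked[:keep_top])   # keep the best few
    finetune(filtered)   # standard instruction tuning, no PPO
    return filtered

# Between rounds you can collect new preference data, retrain the reward model,
# and call rejection_sampling_round again on fresh generations.
```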
  • Specific Feedback for Domain-Specific AI Development Summary: Different feedback types such as written feedback, labeling multiple scores, and pairwise preferences are expected to be used for different domains in AI development. Chain of thought reasoning and process reward models are suitable for math but may not be ideal for poetry. As AI tools improve, they become more domain-specific. Constitutional AI involves generating preference data by having a second model evaluate the outputs of the first model based on principles drawn from sources such as the UN Declaration of Human Rights and the Apple terms of service.

    Speaker 1
    Different feedback types are probably going to come into play. There are papers on things like written feedback, or labeling multiple scores, or multiple pairwise preferences for every completion. That’s coming. It’s also related to what we mentioned with process reward models, where you’re labeling each step in the chain of thought reasoning, just to make the problem more specific. It seems very likely that different feedback will be used for different domains. Chain of thought reasoning is great for math, and that’s where these process reward models are being designed. Probably not great for things like poetry, but as any tool gets better, it gets more specific. Then we get into more of a talking point, which I think is fun. The next one I have is constitutional AI. I think this is something that people really just kind of misunderstood. I think most people thought that constitutional AI was doing something where it creates the preference data based on the specific principles in some way. What did you two think of constitutional AI? I’ll be the dumb person and you correct me.
    Speaker 2
    As far as I understood, Anthropic came out and said that the best way of generating this preference data or alignment is to give a second model a constitution to evaluate the first model’s outputs. The constitution is unspecified, but it draws from the UN Declaration of Human Rights and the Apple terms of service, for some reason.
  • Guided Sampling and Implicit Values in AI Model Training Summary: AI models use guided sampling to pick preferences based on principles from a constitution, resulting in a new preference dataset. The process is less explicit than expected, relying on averages and scale to incorporate the principles. The approach is similar to the RLHF setup with instruction tuning, where an AI model provides critiques based on sampling of constitutional values. The process seems more tractable this way, but it may deviate from the stated approach in the paper.

    Speaker 1
    Which is essentially like: pick between these two answers based on this principle. So they’re sampling from the principles in their constitution, and from two options of completions, A and B. The AI model is given the context of a certain principle to pick the A or B preference, and then the new preference dataset is just the two completions, without the context of the principles. So with this sampling idea, they’re sampling from like 30 principles and a wide dataset of two candidate completions across different prompts. To me, it’s very loose; the values are not explicit in this, it’s just kind of how they’re guided. It’s a very machine learning approach, because it is relying on averages and scale to get the principles in there. But it is way less explicit than I thought it was going to be. I kind of thought there was this feedback thing in the preference data, or a check to see if the principles were satisfied, or anything like this. It’s really just a modification to the RLHF setup that we’ve talked about, with instruction tuning and preference data collection, where there’s an AI model providing critiques, and a lot of those critiques are based on sampling of constitutional values. It almost sounds more tractable in that way, but I would also guess, while I just say, oh look, I figured it out, that they do different things than they said in the paper. This paper is from 2022.
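    A hedged sketch of that guided sampling, assuming a hypothetical ai_judge helper rather than Anthropic’s implementation: one principle is drawn per comparison, the judge picks A or B under that principle, and only the resulting preference pair is kept.

```python
# Guided sampling sketch: sample a constitutional principle, ask an AI judge to
# pick completion A or B under it, and store only the resulting preference pair
# (the principle itself is dropped from the saved dataset).
import random

def build_ai_preference_dataset(prompt_pairs, principles, ai_judge):
    dataset = []
    for prompt, (completion_a, completion_b) in prompt_pairs:
        principle = random.choice(principles)   # e.g. one of ~30 constitutional principles
        choice = ai_judge(prompt, completion_a, completion_b, principle)  # returns "A" or "B"
        chosen, rejected = ((completion_a, completion_b) if choice == "A"
                            else (completion_b, completion_a))
        # The values enter only implicitly, through averaging over many sampled principles.
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```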
  • Anticipating the need for superalignment and AI overlords Summary: Advancements in AI are leading to a point where manual human preference data collection cannot scale, necessitating trust in AI to model human preferences. This has led to the concept of superalignment, where we prepare for a future in which the AI being controlled is smarter than us. The idea is to practice now by having a weaker model supervise a stronger one, since humans may no longer be fully in control at the point of superintelligence. The potential solution appears to lie in weak-to-strong generalization, and this concept is linked to the evolution from constitutional AI to superalignment.

    Speaker 1
    Less than they know. So I think they probably have things that are pretty cool that they’re doing internally.
    Speaker 2
    I’ll summarize for listeners who may not have seen the paper, because, you know, it’s impossible to keep up on everything. I do think that what constitutional AI and RLAIF represent is that we are starting to come to a point where it’s just impossible for manual human preference data collection to scale, and the only way to scale this is to trust our AI overlords to model our human preferences; constitutional AI was the first version of this. What the second version, or what weak-to-strong is, is anticipating a future need for superalignment, where the thing that we’re trying to control is smarter than us. So you take GPT-2 and try to use it to teach GPT-4, something smarter than itself, because this is what we’re going to have to do in the future as well, when we’re no longer fully in control. Are we the metaphorical GPT-2? Or no, we’re not even in the process anymore; at the point of superintelligence, they’re prepping. And they’re saying this will happen, and humans will be so far in the dust that we just have no say in this debate. How do we still control systems then? Weak-to-strong generalization seems to be the answer. And I see a lineage from constitutional AI to this.
    Speaker 1
    Yeah, constitutional AI and superalignment are very conceptually linked.
  • Preparing for Super Intelligence Summary: The advancement towards superintelligence may render humans powerless in the decision-making process, leading to the need for control systems and weak-to-strong generalization. Constitutional AI and superalignment are seen as closely linked, with a call for clearer communication from the superalignment team. The focus is shifting toward the debate about making safe models more useful, and the emergence of direct preference optimization.

    Speaker 1
    Yeah, constitutional AI and superalignment are very conceptually linked. It’s a group of people with a very similar intellectual upbringing who have worked together for a long time, coming to the same conclusions in different ways. And I understand the argument, and I mostly just… I think I’m just waiting to see more from the superalignment team, because quickly looking at weak-to-strong generalization, I just didn’t really put together in my brain exactly how it all fits. But I’m also not a safety researcher. But I think that could be feedback for them: I understand what synthetic data means and all of this, but how could they communicate that a little bit more specifically in this context?
    Speaker 2
    Because I want to know what they think about this. I like that Pareto optimal thing, because it shifts the debate away from x-risk to, no, this makes safe models more useful, and we can all get ahead of that.
    Speaker 1
    I agree. I think the last emerging direction that I have might just be this debate, and you can control how long we talk about it, which is about direct preference optimization. You could go read my blog post on this. I had tried to
  • DPO Models Expectation in the Next Six Months Summary: DPO models are expected to be more prevalent in the next six months, as they are perceived as the primary approach by most people. However, PPO also has potential in certain code scenarios and may require less data manipulation there. The authors of the DPO paper, Rafael, Eric, and Archit, are recommended for further insights on the topic, and their method is defended as an excellent study in language models with a strong mathematical foundation.

    Speaker 1
    The models kind of work off of each other. So in a lot of ways, I think DPO will still be what people see, but in some ways it’s probably slightly more constrained. There are other ways you could think of PPO working nicely, like in code, where whether your code runs is the score you give it; you have to kind of do canned things to get DPO to have the same data. So there are specific cases where the DPO formulation is a little bit harder. But I expect to see more DPO models than anything else in the next six months. That’s probably what most people need to know, unless they’re an RLHF expert. And I would love to learn more about PPO from a lot of authors in the space. The DPO authors are great to talk to; you could reach out to all three of them. As of the time of recording, we’re actually about to publish our newest recap where we talk to the authors.
    Speaker 2
    Yeah. So for people who are listening to this in the future, you can refer to that episode.
    Speaker 1
    Yeah. So Rafael, Eric, and Archit, I’ve talked to all of them at good length and they’re all fantastic. They’ll say similar things, and they’ll also defend their method, because it’s an awesome paper. If you want to learn what a good mathy, but still experimental, paper in language models looks like, the DPO paper is a really good one to spend more time on. Yeah, when I asked them questions about it, they just kind of gestured at the poster and said, look at the equation, just stare at it, and you’ll see it. That’s my criticism
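    Since the advice is to stare at the equation, here it is written out: a standard statement of the DPO loss, with toy tensors standing in for per-sequence log-probabilities from the policy and a frozen reference model.

```python
# The DPO objective sketched as code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # L = -log sigmoid( beta * [ (log pi(y_w|x) - log pi_ref(y_w|x))
    #                          - (log pi(y_l|x) - log pi_ref(y_l|x)) ] )
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of 4 preference pairs; in practice these log-probs come from summing
# token log-probabilities of the chosen and rejected completions under each model.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen, ref_rejected = torch.randn(4), torch.randn(4)
loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```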
  • The Allen Institute for AI’s Transition to Model Releases Summary: The Allen Institute for AI, founded by Paul Allen from Microsoft, is transitioning from being a super academic lab known for publishing hit research papers to releasing models. Under new CEO Ali Farhadi, the institute aims to move from a focus on research papers only to also releasing models, being active in policy, and collaborating with for-profit institutions.

    Speaker 1
    Like, it is a really cool idea. And like, that’s the type of thing that academia still can do and can do really well. And hopefully continues to do. Yeah.
    Speaker 2
    One thing I wanted to make sure I cover before we leave this topic: one of the DPO models that was trained, apart from Zephyr and Mixtral, which are two of the more high-profile ones, is Tulu from the Allen Institute.
    Speaker 1
    And you are maybe one of the few people placed to explain it. So funny. Like, what’s the Allen Institute doing here and, you know, what’s the backstory? Yeah. So the Allen Institute for AI, I think its 10-year birthday is in January, and there’s a special event for that. Also, people should know this is Paul Allen from Microsoft. Yeah, Paul Allen owns everything in Seattle. Not literally; I mean, he’s passed, and his estate is still operating in a lot of great ways. But the Allen Institute is mostly known as being a super academic lab, where they have more resources than academia and publish hit after hit of research paper. And they’re trying to move more in the direction of releasing models. This is part of why I joined: talking with the new CEO, Ali Farhadi, I don’t know if I pronounce the last name right, but he’s trying to move from an org that does papers only to something that does papers, releases models, is active in policy, and maybe helps work with these for-profit institutions that don’t have an established place they could go through for these new things.
  • The Importance of Evaluations and Adoption in Open Models Summary: Evaluating open models is crucial, with a focus on win-rate calculations against OpenAI models such as davinci-003, using GPT-4 as a judge, and custom prompt sets like MT Bench. The source of the prompts, such as Self-Instruct, Open Assistant, Vicuna, Koala, and AlpacaEval’s mix, influences the evaluation, but ultimately the proof of a good model lies in people’s actual interactions with it. The Zephyr model from Hugging Face exemplified the impact of a well-received open release, as it quickly got integrated into various products and applications.

    Speaker 1
    davinci-003, which is one of OpenAI’s older instruction models, and calculating the win rate that GPT-4 sees between the new model and davinci-003. So it has many more prompts than MT Bench; MT Bench is custom prompts that they made to take a stance on what a good chat model is. AlpacaEval’s prompts come from sources that people know and love: Self-Instruct, which is a popular paper from AI2, Open Assistant, Vicuna, Koala, and Anthropic’s helpful-and-harmless data. MT Bench is its own thing. We were more focused on MT Bench at Hugging Face; at AI2 we’re a little bit more focused on AlpacaEval, but it really can go either way. These are kind of table stakes for saying that you have a good RLHF model: you should be able to have a pretty good score on both of these. And then the proof is in people actually talking to it. So I think the Zephyr model from Hugging Face was a kind of step change in people’s perception of open models that got integrated into a bunch of products within a few weeks. Like, you.com was experimenting with it, and I saw some Substacker was using it as a writing-feedback bot instead of ChatGPT. But that’s what happens when a good open release is out there now: the evaluations are good, people pick it up, and the evaluations are just enough
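    A minimal sketch of that win-rate style of evaluation; judge_prefers_new is an assumed callable wrapping the judge model (GPT-4 in AlpacaEval’s case), and the model callables are stand-ins.

```python
# Win-rate evaluation: a judge model compares your model's answer with a
# reference model's answer on each prompt; the win rate is the fraction of
# prompts where the new model's answer is preferred.
def win_rate(prompts, new_model, reference_model, judge_prefers_new):
    wins = 0
    for prompt in prompts:
        new_answer = new_model(prompt)
        ref_answer = reference_model(prompt)
        if judge_prefers_new(prompt, new_answer, ref_answer):   # e.g. GPT-4 as judge
            wins += 1
    return wins / len(prompts)

# A high win rate against a baseline like davinci-003, plus a solid MT Bench
# score, is roughly the table stakes described above; real usage is the final test.
```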
  • Balancing Synthetic and Human Data in RL Summary: The speaker discusses the challenges of solving data problems in RL, where fixing one problem often leads to new issues. They express interest in automating and optimizing the model but feel that it is years away. They highlight the debate between synthetic and human data, suggesting that both will coexist for a while. Additionally, they note the growing ambition in the field to start companies, referencing individuals who are venturing into entrepreneurial endeavors.

    Speaker 1
    Where they’re like: oh, there’s this problem, we have the data, we can fix this. And then it pops up some new problem after doing RLHF, and they’re studying this. And if you could really figure it out, this is where things start to look more like RL: you could automate it, and it’s just a longer timeframe of optimizing the model. It would be cool, but I feel like I’m years away from ever actually working on this. We can try to get details from people who are. Excellent. Awesome.
    Speaker 3
    Anything else that we missed? I think we covered a lot of it.
    Speaker 1
    I mean, I’m good. I would ask you guys if you know companies that are doing this and things. But I know some that are in the RLHF-as-a-service space, which will become busy, I think for good reason, just because, like, this company is doing RLHF as a service, yeah, both of them are. It depends on whether synthetic data is going to win over human data. If human data is the real winning feature in the end, it’s a big capital investment, so it kind of makes sense as a VC model anyway, but there are going to be both of them for a while. It’d be cool.
    Speaker 3
    Do you see a lot of people, because I know Louis Castricato is starting a company, is there a lot of ambition in this field to start companies, or is this more of a research-driven part of the stack that maybe just stays there?
    Speaker 1
    There definitely is. Because I know my former colleague, Nazneen Rajani from Hugging Face, is also starting a company in the space.