Optimizing Model Size and Scale The size of machine learning models is a crucial factor for efficiency and practicality. Confusion arises as some models are small yet powerful while others, especially those for music, need to consider the speed of generating tokens per second. Larger models are generally better, but there are challenges in running them locally or achieving optimal performance in audio applications. Shrinking models while maintaining performance is a key area for improvement, although scaling up models remains a direct way to enhance their performance.
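The streaming constraint in the summary above can be made concrete with a little arithmetic. This is a minimal sketch, not a description of Suno's system: the frame rate, tokens per frame, and generation speeds are hypothetical numbers chosen for illustration, and the function names are invented here.

```python
def required_tokens_per_second(frames_per_second: float, tokens_per_frame: int) -> float:
    """Tokens the model must emit per wall-clock second to keep up with playback."""
    return frames_per_second * tokens_per_frame


def real_time_factor(generation_rate: float, required_rate: float) -> float:
    """Ratio of generation speed to playback demand; 1.0 or more means streaming keeps up."""
    if required_rate <= 0:
        raise ValueError("required_rate must be positive")
    return generation_rate / required_rate


def can_stream(generation_rate: float, required_rate: float) -> bool:
    """True when audio is generated at least as fast as it is listened to."""
    return real_time_factor(generation_rate, required_rate) >= 1.0


# Hypothetical audio tokenization: 50 frames per second, 4 tokens per frame.
required = required_tokens_per_second(frames_per_second=50, tokens_per_frame=4)

# Hypothetical generation speeds (tokens per second) for two model sizes.
candidates = {"smaller model": 400.0, "larger model": 150.0}
for name, rate in candidates.items():
    factor = real_time_factor(rate, required)
    print(f"{name}: real-time factor {factor:.2f}, can stream: {can_stream(rate, required)}")
```

Under these made-up numbers the larger model falls below a real-time factor of 1.0, which is the kind of pressure that caps model size when output has to be streamed.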
Alessio Fanelli
Then once you get the final model, I would love to learn more about the size of these models, because people are confused when Stable Diffusion is so small. They're like, oh, this thing can generate any image, is it possible that it's, you know, a couple of gigabytes? And then the large language models are like, oh, these are so big, but they're just text in them. What's it like for music? Is it in between? And as you think about, yeah, you mentioned scaling and whatnot. Is this something that you see it's going to be easy for people to run locally or not?
Mikey Shulman
Our models are still pretty small, certainly by tech standards. I confess, I don't know as well the state of the art on how diffusion models scale, but our models scale similarly to text transformers. It's like bigger is usually better. Audio has a couple of weird quirks, though. We care a lot about how many tokens per second we can generate because we need to stream new music as fast as you can listen to it. And so that is a big one that I think probably has us never get to a 175 billion parameter model, if I'm being honest. Maybe I'm wrong there, but I think that would be technologically difficult. And then the other thing is that so much progress happens in shrinking models down for the same performance in text that I'm hopeful, at least, that a lot of our issues will get solved and we will figure out how to do better things with smaller models or relatively smaller models. But I think the other thing, it's a blessing and a curse, I think, the ability to add performance with scale. It's like a very straightforward way to make your models better.
Embrace What Feels Organic and Obvious The journey of creating a music company began organically with the success of a speech model on GitHub, steering the focus towards music. The decision was not a result of a specific "aha" moment but rather a natural progression towards what felt right. Despite still collaborating on speech models, the company's primary focus remains on music because it has the potential to evoke emotions and make a positive impact. The company acknowledges the significance of focusing on audio enhancements, as there is immense potential for growth and improvement in this domain.
Mikey Shulman
That's kind of what steered us there. You know, in fact, the first thing we ever put out was a speech model. It was Bark. It was this open source text-to-speech model. And it got a lot of stars on GitHub. And that was people telling us even more like go do speech. And like, we almost couldn't help ourselves from doing music. And so, I don't know, maybe it's a little bit serendipitous, but we haven't really, like, looked back since. I don't think there was necessarily, like, an aha moment. It was just, like, organic and just obvious to us that this needs to, like, we want to make a music company.
Shawn "swyx" Wang
So you do regard yourself as a music company because as of last month, you're still releasing speech models with Parakeet. We were? Oh yes, that's right.
Mikey Shulman
So that's a really awesome collaboration with our friends at NVIDIA. I think we are really, really focused on music. I think that is the stuff that will really change things for the better. I think, you know, honestly, everybody is so focused on LLMs for good reason and information processing and intelligence there. And I think it's way too easy to forget that there's this whole other side of things that makes people feel, and maybe that market is smaller, but it makes people feel and it makes us really happy. So we do it. I think that doesn't mean that we can't be doing things that are related, that are in our wheelhouse, that will improve things. And so, like I said, audio is just so far behind. There's just so much more to do in the domain more generally.
Music and Images: A Contrast in Social Modality The joy in music comes from creating the sound, while in images, it often lies in consuming them. Music has a unique social aspect where people share the same experience simultaneously, unlike images where individual perspectives vary. Additionally, music often shows a gap between taste and ability for individuals, which is not as prevalent in images.
Shawn "swyx" Wang
The dream here to be, I don't know if it's too coarse of a grain to put it, but like, is the dream here to be like the Midjourney of music?
Mikey Shulman
I think there are certainly some parallels there because especially what I just said about being an active participant. Yeah. The joyful experience in Midjourney is the act of creating the image and not necessarily the act of consuming the image. And Midjourney will let you then very kind of quickly share the image with somebody. But I think ultimately that analogy is like somewhat limiting because there's something really special about music. I think there's two things. One is that there's this really big gap for the average person between kind of their tastes in music and their abilities in music. That is not quite there for most people in images. Like most people don't have like innate tastes in images, I think in the same way people do for music. And then the other thing, and this is the really big one, is that music is a really social modality. If we all listen to a piece of music together, we're listening to the exact same part at the exact same time. If we all look at the picture in Alessio's background, we're going to look at it for two seconds. I'm going to look at the top left where it says Thor. Alessio is going to look at the bottom right or something like that. And it's not really-
Episode AI notes
- Optimizing model size and scale is crucial for efficiency and practicality in machine learning, with challenges in running larger models locally or achieving optimal performance in audio applications.
- The journey of creating a music company evolved organically from the success of a speech model on GitHub, focusing on music due to its potential to evoke emotions and make a positive impact.
- Music and images have contrasting social modalities, with music offering a unique shared experience and individual connection, emphasizing the importance of creating new and original music with AI.
- Continuous enhancement of AI models for better audio quality and music creation is important, with a shift towards providing diverse and interactive music experiences that engage users in music creation.
- Enabling collaborative entertainment experiences, such as a Twitch stream where viewers control the game state of Pokemon, and envisioning collaborative concerts where the audience influences the music, showcases innovative, immersive entertainment concepts.
- Individual connection with music is highlighted, emphasizing the personal and unique meaning individuals can find in songs, and the positive feedback music can bring to fans.
- Understanding the limitations of quantitative benchmarks and the importance of incorporating values beyond quantitative metrics in decision-making processes is crucial in evaluating content impact.
- First principles thinking is essential in addressing machine learning challenges, especially with the complexity of large models and the need for intuitive problem-solving approaches.
Creating New and Original Music with AI The focus is on using AI to create new and original music rather than remixing existing songs due to potential copyright issues. The aim is to allow people to generate music that is fresh and not just replicated versions of existing songs. While AI can imitate different artists, it is believed that true music interaction in the future should involve creating unique music that establishes a real connection between the artist and the audience.
Alessio Fanelli
Yeah, I'm really curious to see how people are going to use this to resample old songs into new styles. I think that's one of my favorite things about hip-hop. You have A Tribe Called Quest, they had the Lou Reed Walk on the Wild Side sample in Can I Kick It? It's like Kanye sampled Nina Simone on Blood on the Leaves. It's a lot of production work to actually take an old song and make it fit a new beat. And I feel like this can really help. Do you see people putting existing songs, lyrics and trying to regenerate them in like a new style?
Mikey Shulman
You know, we actually don't let you do that. And it's because if you're taking someone else's lyrics, you didn't own those. You don't have the publishing rights to those. You can't remake that song. I think in the future, we'll figure out how to actually let people do that in a legal way. But we are really focused on letting people make new and original music. And I think, you know, there's a lot of music AI, which is artist A doing the song of artist B in a new style. You know, let me have Metallica doing Come Together by the Beatles or something like that. And I think this stuff is very viral, but I actually really don't think that this is how people want to interact with music in the future. To me, this feels a lot like when you made a Shakespeare sonnet the first time you saw ChatGPT, and then you made another one, and then you made another one, and then you kind of thought, like, this is getting old. And that doesn't mean that GPT is not amazing. GPT is amazing. It's just not for that. And I kind of feel like the way people want to use music in the future is not just to remake songs in different people's voices. You lose the connection to the original artist. You lose the connection to the new artist because they didn't really do it. So we're very happy to just let people do things that are a flash in the pan and kind of stay under the radar.
Continuous Improvement and Diverse Music Experiences The focus is on continuous enhancement of AI models for better audio quality and music, drawing from the open-source community. There is a shift from simply converting text to music to providing diverse and interactive music experiences that engage users in music creation through various methods.
Mikey Shulman
There's a lot. I think from the model side, it's still really early innings, and there's still so much low-hanging fruit for us to pick to make these models much, much better, much, much more controllable, much better music, much better audio fidelity. So much that we know about and so much that, again, we can kind of borrow from the open source Transformers community that should make these just better across the board. From the product side, we're super focused on the experiences that we can bring to people. And so it's so much more than just text to music. And I think, I'll say this nicely, I'm a machine learning person, but machine learning people are stupid sometimes. And we can only think about models that take X and make it into Y. And that's just not how the average human being thinks about interacting with music. And so I think what we're most excited about is all of the new ways that we can get people just much more actively participating in music. And that is making music, not only with text, maybe with other ways of doing stuff that is making music together. If
Enabling Collaborative Entertainment Experiences The speaker suggests creating a Twitch stream where viewers collectively control the game state of Pokemon, demonstrating the potential for collaborative entertainment. They envision a collaborative concert where the audience influences the music, requiring either musically skilled participants or an artist adept at deciphering audience cues. This concept aligns with utilizing technology, like AI, to enhance communication of artistic ideas, paving the way for innovative, immersive entertainment experiences. The ultimate goal is to evolve from traditional individual-centric performances to continuous, interactive entertainment encounters.
Shawn "swyx" Wang
Yeah. I think at a minimum, you guys should have a Twitch stream that's just like a 24-hour radio session. Have you ever come across Twitch Plays Pokemon? No. Basically, everyone in the Twitch chat can vote on the next action that the game state makes, and they sort of wire that up to a Nintendo emulator and play Pokemon the whole game through the collaborative thing. It sounds like it should be pretty easy for you guys to do that, except for the chaos that might result. But like, I mean, that's part of the fun.
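The chat-voting idea described above can be sketched as a simple plurality vote over one window of chat messages. This is an illustrative assumption about how such a system might tally input, not the actual Twitch Plays Pokemon implementation; the function name, the allowed action set, and the sample messages are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def next_action(messages: Iterable[str], allowed: Set[str]) -> Optional[str]:
    """Pick the most-voted valid action from one voting window of chat messages.

    Messages that are not valid actions are ignored. Ties go to the action
    that was voted for first, because Counter.most_common keeps first-seen
    order among equal counts.
    """
    votes = [m.strip().lower() for m in messages]
    valid = [v for v in votes if v in allowed]
    if not valid:
        return None  # nobody cast a valid vote in this window
    return Counter(valid).most_common(1)[0][0]


# Hypothetical action set and chat window.
ACTIONS = {"up", "down", "left", "right", "a", "b"}
window = ["up", "A", "UP ", "lol", "left", "up"]
print(next_action(window, ACTIONS))
```

The same tally could in principle choose among musical options (a hypothetical next section, tempo, or style) instead of game inputs, which is the collaborative use the conversation goes on to explore.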
Mikey Shulman
I agree 100%. One of my like key projects or pet projects is like, what does it mean to have a collaborative concert? Maybe where there is no artist and it's just the audience, or maybe there is an artist, but there's a lot of input from the audience. You know, if you were going to do that, you would either need an audience full of musicians, or you would need an artist who can really interpret the verbal cues that an audience is giving or nonverbal cues. But if you can give everybody the means to better articulate the sounds that are in their heads toward the rest of the audience, like which is what generative AI basically lets you do, you open up way more interesting ways of having these experiences. And so the collaborative concert is one of the things I'm most excited about. I don't think it's coming tomorrow, but we have a lot of ideas on what that can look like.
Shawn "swyx" Wang
Yeah, I feel like it's one stage before the collaborative concert is turning Suno into a continuous experience rather than like a start and stop motion. I don't know if that makes sense.
Individual Connection with Music The snippet discusses the unique and personal connection individuals can have with music. It highlights a specific instance where a person found deep meaning in a song that may not resonate with everyone in the same way. This individual experience is seen as a beautiful aspect of music appreciation, enabling people to create their own profound connections with songs. The snippet also touches on the positive feedback received from fans, emphasizing how music can bring joy and different forms of experiences to people.
Alessio Fanelli
We had a few more notes from random community tweets. I don't know if there's any favorite fans of Suno that you have or whatnot. DHH, obviously, notorious tweeter and crowd inflamer, I guess. He tweeted about you guys. I saw Blau as an investor. I think Karpathy also tweeted something.
Shawn "swyx" Wang
Return to monkey.
Alessio Fanelli
Yeah, yeah, yeah. Return to monkey, right?
Shawn "swyx" Wang
Is there a story behind that?
Mikey Shulman
No, he just made that song and it just speaks to him. And I think this is exactly the thing that we are trying to tap into that you can think of it. This is like a super, super, super micro genre of one person who just really liked that song and made it and shared it. And it does not speak to you the same way it speaks to him. That song really spoke to him. And I think that's so beautiful. And that's something that you're never going to have an artist able to do that for you. And now you can do that for yourself. And it's just a different form of experiencing music. I think that's such a lovely use case.
Alessio Fanelli
Any fun fan mail that you got from musicians or anybody that really was a funny story to share?
Mikey Shulman
We get a lot and it's primarily positive. And I think on the whole, I would say people realize that they are not experiencing music in all of the ways that are possible and it does bring them joy.
Quantitative Benchmarks and Corporate Values It is crucial to understand the limitations of quantitative benchmarks, as what we measure might no longer be a good metric once it is optimized for. While objective benchmarks and quantitative metrics are valuable, they should not be the sole criteria as they might not encompass everything important. Aesthetics matter in the corporate world, implying that the ultimate goal is to bring content that resonates with individuals on an emotional level. It is essential to acknowledge that subjective judgment plays a critical role in evaluating the impact of content. Emphasizing the significance of incorporating values beyond quantitative benchmarks, such as an awareness of Goodhart's law and natural experiments, in decision-making processes can lead to more holistic and successful outcomes.
Alessio Fanelli
Awesome. Yeah, it's a good perspective. I know we covered a lot of things. I think, before we wrap, you have written a blog post at Kensho about Goodhart's law impact in ML, which is, you know, when you measure something, then the thing that you measure is not a good metric anymore, because people optimize for it. Any thoughts on how that applies to, like, LLMs and benchmarks and kind of the world we're going into today?
Mikey Shulman
Yeah, I mean, I think it's maybe even more apropos than when I originally wrote that, because so much, we see so much noise about pick your favorite benchmark and this model does slightly better than that model. And then at the end of the day, actually, there is no real world. There is no real world difference between these things. And it is really difficult to define what real world means. And I think to a certain extent, it's good to have these objective benchmarks. It's good to have quantitative metrics. But at the end of the day, you need some acknowledgement that you're not going to be able to capture everything. And so at least at Suno, to the extent that we have corporate values, if we don't, we don't have corporate, we're too small to have corporate values written down. But something that we say a lot is aesthetics matter. And that the kind of quantitative benchmarks are never going to be the be all and end all of everything that you care about. And as flawed as these benchmarks are in text, they're way worse in audio. And so aesthetics matter basically is a statement that like at the end of the day, what we are trying to do is bring music to people that makes them feel a certain way. And effectively, the only good judge of that is your ears. And so you have to listen to it. And it is a good idea to try to make better objective benchmarks, but you really have to not fall prey to those things. I can tell you, you know, it's kind of another pet peeve of mine. Like I always said, economists do make really good machine learning engineers, and it's because they are able to think about stuff like Goodhart's law and natural experiments and stuff like this that people with machine learning backgrounds or people with physics backgrounds like me often forget to do. And so, yeah, I mean, I'll tell you at Kensho, we actually used to go to big econ conferences sometimes to recruit, and these were some of the best hires we ever made.
The Importance of First Principles Thinking in Addressing Machine Learning Challenges Giant models are prone to overfitting and are poorly understood. The ability to think about problems from first principles and intuitively is crucial. An example is the case of question answering models exceeding human capabilities, until a clever individual recommended presenting questions without answers, creating a substantial gap between machines and humans. This kind of first principles thinking is natural to social scientists and essential in addressing machine learning challenges.
Mikey Shulman
I think it's not only the human feedback. I think you could think about this just in general, you have these like giant, really powerful models that are so prone to overfitting, that are so poorly understood, that are so easy to steer in one direction or another, not only from human feedback. And your ability to think about these problems from first principles, instead of like getting down into the weeds or only math, and to think intuitively about these problems is really, really important. I'll give you like, just like one of my favorite examples. It's a little old at this point, but if you guys remember like SQuAD and SQuAD 2.0, the question answering data set. Yeah, exactly. And so, you know, the benchmark for SQuAD 1, eventually the machine learning models start to do as well as a human can on this thing. And it's like, oh, now what do we do? And it takes somebody very clever to say, well, actually, let's think about this for a second. What if we presented the machine with questions with no answer in the passage? And it immediately opens a massive gap between the human and the machine. And I think it's like first principles thinking like that, that comes very naturally to social scientists, that does not come as naturally to people like me. And so that's why I like to hang out with people like that.
Shawn "swyx" Wang
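The effect described above, where adding questions with no answer in the passage reopens the gap, can be illustrated with a toy scoring sketch. Everything here is an assumption for illustration: the four gold answers and both sets of predictions are invented, an empty string stands for "no answer in the passage", and the scoring is a simplified exact match rather than the official SQuAD evaluation (which also normalizes text and reports F1).

```python
from typing import Sequence


def exact_match(prediction: str, gold: str) -> bool:
    """Simplified exact match: case-insensitive, ignoring surrounding whitespace."""
    return prediction.strip().lower() == gold.strip().lower()


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of questions where the prediction exactly matches the gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        raise ValueError("need at least one question")
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Invented toy data: the first two questions are answerable, the last two are not.
golds = ["Paris", "1969", "", ""]

# A system that always extracts some span, as older benchmark-tuned systems could.
always_answers = ["paris", "1969", "London", "1975"]

# A reader that recognizes when the passage does not contain the answer.
abstains_when_unsure = ["Paris", "1969", "", ""]

print("answerable only, always answers:", accuracy(always_answers[:2], golds[:2]))
print("with unanswerable, always answers:", accuracy(always_answers, golds))
print("with unanswerable, abstaining reader:", accuracy(abstains_when_unsure, golds))
```

On the answerable subset both systems look identical; only the added unanswerable questions separate a system that has learned to always produce a span from one that can decline to answer.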
Well, I'm sure you get plenty of that in Boston. And as an econ major myself, it's very gratifying to hear that we have the perspective to contribute.
Mikey Shulman
Oh, big time. Big time. I try to talk to economists as much as I can. Excellent.
