Making Transformers Sing With Mikey Shulman of Suno

Mar 26, 2024 · 19 min read

Optimizing Model Size and Scale
Model size is a crucial factor for efficiency and practicality. People are often confused that an image model like Stable Diffusion is only a couple of gigabytes while text-only language models are enormous. Music models scale much like text transformers, where bigger is usually better, but they also have to generate tokens fast enough to stream audio as quickly as it is played, which makes very large models hard to use. Shrinking models while maintaining performance is a key area for improvement, although scaling up remains a direct way to make models better.
Alessio Fanelli
Then once you get the final model, I would love to learn more about the size of these models, because people are confused when Stable Diffusion is so small. They’re like, oh, this thing can generate any image, how is it possible that it’s, you know, a couple of gigabytes? And then the large language models are like, oh, these are so big, but they just have text in them. What’s it like for music? Is it in between? And as you think about, yeah, you mentioned scaling and whatnot. Is this something that you see is going to be easy for people to run locally or not?
Mikey Shulman
Our models are still pretty small, certainly by text standards. I confess, I don’t know as well the state of the art on how diffusion models scale, but our models scale similarly to text transformers. It’s like bigger is usually better. Audio has a couple of weird quirks, though. We care a lot about how many tokens per second we can generate because we need to stream new music as fast as you can listen to it. And so that is a big one that I think probably has us never get to a 175 billion parameter model, if I’m being honest. Maybe I’m wrong there, but I think that would be technologically difficult. And then the other thing is that so much progress happens in shrinking models down for the same performance in text that I’m hopeful, at least, that a lot of our issues will get solved. And we will figure out how to do better things with smaller models or relatively smaller models. But I think the other thing, it’s a blessing and a curse, I think, the ability to add performance with scale. It’s like a very straightforward way to make your models better.
Time 0:08:31
2024-03-21
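The streaming constraint mentioned above (generate music at least as fast as it is listened to) comes down to a simple ratio. The sketch below is illustrative only: the audio token rate and the model throughputs are hypothetical numbers chosen for the example, not figures from Suno, and the function names are made up here.

```python
def real_time_factor(generated_tokens_per_sec: float,
                     audio_tokens_per_sec: float) -> float:
    """Seconds of audio produced per second of wall-clock generation."""
    if audio_tokens_per_sec <= 0:
        raise ValueError("audio_tokens_per_sec must be positive")
    return generated_tokens_per_sec / audio_tokens_per_sec


def can_stream(generated_tokens_per_sec: float,
               audio_tokens_per_sec: float) -> bool:
    """True when generation keeps up with playback (factor of at least 1)."""
    return real_time_factor(generated_tokens_per_sec, audio_tokens_per_sec) >= 1.0


# Hypothetical assumption: one second of audio is represented by 150 tokens.
AUDIO_TOKENS_PER_SEC = 150.0

# Hypothetical throughputs for a smaller and a larger model on the same hardware.
small_model_ok = can_stream(300.0, AUDIO_TOKENS_PER_SEC)  # 2.0x real time
large_model_ok = can_stream(90.0, AUDIO_TOKENS_PER_SEC)   # 0.6x real time
print(small_model_ok, large_model_ok)
```

Under these assumed numbers the larger model falls behind playback, which is the kind of pressure that keeps audio models from growing as freely as text models.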
Embrace What Feels Organic and Obvious
The company’s first release was Bark, an open-source text-to-speech model that earned a lot of stars on GitHub. That success pointed towards speech, yet the team could not help doing music. The decision was not the result of a specific ‘aha’ moment but a natural progression towards what felt right. Despite still collaborating on speech models, the company’s primary focus remains music because it makes people feel something. The team also sees audio as a domain that is far behind, with a great deal of room for improvement.
Mikey Shulman
That’s kind of what steered us there. You know, in fact, the first thing we ever put out was a speech model. It was Bark. It was this open-source text-to-speech model. And it got a lot of stars on GitHub. And that was people telling us even more, like, go do speech. And, like, we almost couldn’t help ourselves from doing music. And so, I don’t know, maybe it’s a little bit serendipitous, but we haven’t really, like, looked back since. I don’t think there was necessarily, like, an aha moment. It was just, like, organic and just obvious to us that this needs to, like, we want to make a music company.
Shawn ‘swyx’ Wang
So you do regard yourself as a music company, because as of last month, you’re still releasing speech models with Parakeet.
Mikey Shulman
We were? Oh yes, that’s right. So that’s a really awesome collaboration with our friends at NVIDIA. I think we are really, really focused on music. I think that is the stuff that will really change things for the better. I think, you know, honestly, everybody is so focused on LLMs for good reason and information processing and intelligence there. And I think it’s way too easy to forget that there’s this whole other side of things that makes people feel, and maybe that market is smaller, but it makes people feel and it makes us really happy. So we do it. I think that doesn’t mean that we can’t be doing things that are related, that are in our wheelhouse, that will improve things. And so, like I said, audio is just so far behind. There’s just so much more to do in the domain more generally.
Time 0:12:01
2024-03-21
Music and Images: A Contrast in Social Modality
As with Midjourney, the joy lies in the act of creating rather than in consuming the result. The analogy is limited, though, for two reasons. First, most people have a large gap between their taste in music and their ability to make it, which is less true for images. Second, music is a social modality: people listening together hear the exact same part at the exact same time, whereas people looking at the same picture each focus on different parts of it.
Shawn ‘swyx’ Wang
The dream here to be, I don’t know if it’s too coarse of a grain to put it, but like, is the dream here to be like the Midjourney of music?
Mikey Shulman
I think there are certainly some parallels there, especially because of what I just said about being an active participant. Yeah. The joyful experience in Midjourney is the act of creating the image and not necessarily the act of consuming the image. And Midjourney will let you then very kind of quickly share the image with somebody. But I think ultimately that analogy is like somewhat limiting because there’s something really special about music. I think there’s two things. One is that there’s this really big gap for the average person between kind of their tastes in music and their abilities in music. That is not quite there for most people in images. Like most people don’t have like innate tastes in images, I think in the same way people do for music. And then the other thing, and this is the really big one, is that music is a really social modality. If we all listen to a piece of music together, we’re listening to the exact same part at the exact same time. If we all look at the picture in Alessio’s background, we’re going to look at it for two seconds. I’m going to look at the top left where it says Thor. Alessio is going to look at the bottom right or something like that. And it’s not really
Time 0:21:59
2024-03-21
Episode AI notes

Optimizing model size and scale is crucial for efficiency and practicality in machine learning, with challenges in running larger models locally or achieving optimal performance in audio applications.
The journey of creating a music company evolved organically from the success of a speech model on GitHub, focusing on music due to its potential to evoke emotions and make a positive impact.
Music and images have contrasting social modalities, with music offering a unique shared experience and individual connection, emphasizing the importance of creating new and original music with AI.
Continuous enhancement of AI models for better audio quality and music creation is important, with a shift towards providing diverse and interactive music experiences that engage users in music creation.
Enabling collaborative entertainment experiences, such as a Twitch stream where viewers control the game state of Pokemon, and envisioning collaborative concerts where the audience influences the music, showcases innovative, immersive entertainment concepts.
Individual connection with music is highlighted, emphasizing the personal and unique meaning individuals can find in songs, and the positive feedback music can bring to fans.
Understanding the limitations of quantitative benchmarks and the importance of incorporating values beyond quantitative metrics in decision-making processes is crucial in evaluating content impact.
First principles thinking is essential in addressing machine learning challenges, especially with the complexity of large models and the need for intuitive problem-solving approaches.
Time 0:00:00
2024-03-25

Creating New and Original Music with AI
The focus is on using AI to create new and original music rather than remixing existing songs due to potential copyright issues. The aim is to allow people to generate music that is fresh and not just replicated versions of existing songs. While AI can imitate different artists, it is believed that true music interaction in the future should involve creating unique music that establishes a real connection between the artist and the audience.
Alessio Fanelli
Yeah, I’m really curious to see how people are going to use this to resample old songs into new styles. I think that’s one of my favorite things about hip-hop. You have A Tribe Called Quest, they had the Lou Reed Walk on the Wild Side sample on Can I Kick It? It’s like Kanye sampled Nina Simone on Blood on the Leaves. It’s a lot of production work to actually take an old song and make it fit a new beat. And I feel like this can really help. Do you see people putting existing songs, lyrics and trying to regenerate them in like a new style?
Mikey Shulman
You know, we actually don’t let you do that. And it’s because if you’re taking someone else’s lyrics, you don’t own those. You don’t have the publishing rights to those. You can’t remake that song. I think in the future, we’ll figure out how to actually let people do that in a legal way. But we are really focused on letting people make new and original music. And I think, you know, there’s a lot of music AI, which is artist A doing the song of artist B in a new style. You know, let me have Metallica doing Come Together by the Beatles or something like that. And I think this stuff is very viral, but I actually really don’t think that this is how people want to interact with music in the future. To me, this feels a lot like when you made a Shakespeare sonnet the first time you saw ChatGPT, and then you made another one, and then you made another one, and then you kind of thought, like, this is getting old. And that doesn’t mean that GPT is not amazing. GPT is amazing. It’s just not for that. And I kind of feel like the way people want to use music in the future is not just to remake songs in different people’s voices. You lose the connection to the original artist. You lose the connection to the new artist because they didn’t really do it. So we’re very happy to just let people do things that are a flash in the pan and kind of stay under the radar.
Time 0:36:00
2024-03-25
Continuous Improvement and Diverse Music Experiences
The focus is on continuous enhancement of AI models for better audio quality and music, drawing from the open-source community. There is a shift from simply converting text to music to providing diverse and interactive music experiences that engage users in music creation through various methods.
Mikey Shulman
There’s a lot. I think from the model side, it’s still really early innings, and there’s still so much low-hanging fruit for us to pick to make these models much, much better, much, much more controllable, much better music, much better audio fidelity. So much that we know about and so much that, again, we can kind of borrow from the open source Transformers community that should make these just better across the board. From the product side, we’re super focused on the experiences that we can bring to people. And so it’s so much more than just text to music. And I think, I’ll say this nicely, I’m a machine learning person, but machine learning people are stupid sometimes. And we can only think about models that take X and make it into Y. And that’s just not how the average human being thinks about interacting with music. And so I think what we’re most excited about is all of the new ways that we can get people just much more actively participating in music. And that is making music, not only with text, maybe with other ways of doing stuff, that is making music together. If
Time 0:38:35
ai-ux
2024-03-25
Enabling Collaborative Entertainment Experiences
Swyx suggests a 24-hour radio-style Twitch stream and points to Twitch Plays Pokemon, where viewers in chat vote on the game’s next action, as a model for collaborative entertainment. Mikey envisions a collaborative concert in which the audience influences the music, something that would ordinarily require either an audience full of musicians or an artist adept at interpreting audience cues. Generative AI gives everyone the means to articulate the sounds in their heads, which opens up more interesting shared experiences. A nearer-term step is turning Suno from a start-and-stop tool into a continuous experience.
Shawn ‘swyx’ Wang
Yeah. I think at a minimum, you guys should have a Twitch stream that’s just like a 24-hour radio session. Have you ever come across Twitch Plays Pokemon? No. Basically, everyone in the Twitch chat can vote on the next action that the game state makes, and they sort of wire that up to a Nintendo emulator and play Pokemon the whole game through. The collaborative thing. It sounds like it should be pretty easy for you guys to do that, except for the chaos that might result. But like, I mean, that’s part of the fun.
Mikey Shulman
I agree 100%. One of my like key projects or pet projects is like, what does it mean to have a collaborative concert? Maybe where there is no artist and it’s just the audience, or maybe there is an artist, but there’s a lot of input from the audience. You know, if you were going to do that, you would either need an audience full of musicians, or you would need an artist who can really interpret the verbal cues that an audience is giving, or nonverbal cues. But if you can give everybody the means to better articulate the sounds that are in their heads toward the rest of the audience, like which is what generative AI basically lets you do, you open up way more interesting ways of having these experiences. And so the collaborative concert is one of the things I’m most excited about. I don’t think it’s coming tomorrow, but we have a lot of ideas on what that can look like.
Shawn ‘swyx’ Wang
Yeah, I feel like one stage before the collaborative concert is turning Suno into a continuous experience rather than like a start and stop motion. I don’t know if that makes sense.
Time 0:40:37
2024-03-25
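The chat-voting mechanic described for Twitch Plays Pokemon can be sketched as a simple tally over one voting window. This is a minimal illustration of the idea as stated in the conversation, not the actual implementation of that stream; the action names and the `next_action` helper are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical set of inputs an emulator might accept.
VALID_ACTIONS = {"up", "down", "left", "right", "a", "b", "start"}


def next_action(chat_messages: Iterable[str]) -> Optional[str]:
    """Return the most-voted valid action in this window, or None if no votes."""
    votes = Counter()
    for message in chat_messages:
        candidate = message.strip().lower()
        if candidate in VALID_ACTIONS:  # ignore ordinary chatter
            votes[candidate] += 1
    if not votes:
        return None
    # most_common orders equal counts by first appearance, so ties go to
    # the action that was voted for earliest in the window.
    return votes.most_common(1)[0][0]


window = ["up", "UP ", "left", "lol", "a"]
print(next_action(window))
```

The same shape would apply to an audience steering music: replace the button names with whatever musical choices the audience is allowed to vote on.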
Individual Connection with Music
The snippet discusses the unique and personal connection individuals can have with music. It highlights a specific instance where a person found deep meaning in a song that may not resonate with everyone in the same way. This individual experience is seen as a beautiful aspect of music appreciation, enabling people to create their own profound connections with songs. The snippet also touches on the positive feedback received from fans, emphasizing how music can bring joy and different forms of experiences to people.
Alessio Fanelli
We had a few more notes from random community tweets. I don’t know if there’s any favorite fans of Suno that you have or whatnot. DHH, obviously, notorious tweeter and crowd inflamer, I guess. He tweeted about you guys. I saw Blau as an investor. I think Karpathy also tweeted something.
Shawn ‘swyx’ Wang
Return to monkey.
Alessio Fanelli
Yeah, yeah, yeah. Return to monkey, right?
Shawn ‘swyx’ Wang
Is there a story behind that?
Mikey Shulman
No, he just made that song and it just speaks to him. And I think this is exactly the thing that we are trying to tap into. You can think of this as, like, a super, super, super micro genre of one person who just really liked that song and made it and shared it. And it does not speak to you the same way it speaks to him. That song really spoke to him. And I think that’s so beautiful. And that’s something that you’re never going to have an artist able to do for you. And now you can do that for yourself. And it’s just a different form of experiencing music. I think that’s such a lovely use case.
Alessio Fanelli
Any fun fan mail that you got from musicians or anybody, really, any funny story to share?
Mikey Shulman
We get a lot and it’s primarily positive. And I think on the whole, I would say people realize that they are not experiencing music in all of the ways that are possible and it does bring them joy.
Time 0:43:54
2024-03-25
Quantitative Benchmarks and Corporate Values
It is crucial to understand the limitations of quantitative benchmarks: by Goodhart’s law, once a measure is optimized for, it stops being a good metric. Objective benchmarks and quantitative metrics are valuable, but they cannot capture everything important, and they are even more flawed in audio than in text. At Suno, a frequently repeated value is that aesthetics matter: the goal is to bring people music that makes them feel a certain way, and the only good judge of that is listening. Mikey also credits economists with being strong machine learning engineers because they think readily about ideas such as Goodhart’s law and natural experiments.
Alessio Fanelli
Awesome. Yeah, it’s a good perspective. I know we covered a lot of things. I think, before we wrap, you have written a blog post at Kensho about Goodhart’s law’s impact in ML, which is, you know, when you measure something, then the thing that you measure is not a good metric anymore, because people optimize for it. Any thoughts on how that applies to, like, LLMs and benchmarks and kind of the world we’re going into today?
Mikey Shulman
Yeah, I mean, I think it’s maybe even more apropos than when I originally wrote that, because so much, we see so much noise about pick your favorite benchmark and this model does slightly better than that model. And then at the end of the day, actually, there is no real-world difference between these things. And it is really difficult to define what real world means. And I think to a certain extent, it’s good to have these objective benchmarks. It’s good to have quantitative metrics. But at the end of the day, you need some acknowledgement that you’re not going to be able to capture everything. And so at least at Suno, to the extent that we have corporate values, if we don’t, we don’t have corporate, we’re too small to have corporate values written down. But something that we say a lot is aesthetics matter. And that the kind of quantitative benchmarks are never going to be the be all and end all of everything that you care about. And as flawed as these benchmarks are in text, they’re way worse in audio. And so aesthetics matter basically is a statement that like at the end of the day, what we are trying to do is bring music to people that makes them feel a certain way. And effectively, the only good judge of that is your ears. And so you have to listen to it. And it is a good idea to try to make better objective benchmarks, but you really have to not fall prey to those things. I can tell you, you know, it’s kind of another pet peeve of mine. Like I always said, economists do make really good machine learning engineers, and it’s because they are able to think about stuff like Goodhart’s law and natural experiments and stuff like this that people with machine learning backgrounds or people with physics backgrounds like me often forget to do. And so, yeah, I mean, I’ll tell you at Kensho, we actually used to go to big econ conferences sometimes to recruit, and these were some of the best hires we ever made.
Time 0:48:37
2024-03-25
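One simple mechanism behind the Goodhart’s law concern above is selection on a noisy metric: if the best of many candidates is picked by benchmark score, the winning score tends to overstate the winner’s true quality. The simulation below is a toy sketch under stated assumptions (true quality and benchmark noise are both drawn from normal distributions, and all names and parameters are invented for illustration); it is not a model of any real benchmark.

```python
import random


def selection_gap(num_models: int = 20, trials: int = 2000,
                  noise: float = 1.0, seed: int = 0) -> tuple:
    """Average benchmark score and average true quality of the benchmark winner."""
    rng = random.Random(seed)
    total_benchmark = 0.0
    total_true = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(num_models):
            true_quality = rng.gauss(0.0, 1.0)
            # The benchmark sees true quality plus measurement noise.
            benchmark = true_quality + rng.gauss(0.0, noise)
            candidates.append((benchmark, true_quality))
        # Pick the candidate with the highest benchmark score.
        best_benchmark, best_true = max(candidates)
        total_benchmark += best_benchmark
        total_true += best_true
    return total_benchmark / trials, total_true / trials


avg_benchmark, avg_true = selection_gap()
print(f"winner's benchmark score: {avg_benchmark:.2f}")
print(f"winner's true quality:    {avg_true:.2f}")
```

In this toy setup the winner’s benchmark score is consistently higher than its true quality, which is one reason a small benchmark lead may not translate into a real-world difference.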
The Importance of First Principles Thinking in Addressing Machine Learning Challenges
Giant models are powerful but prone to overfitting, poorly understood, and easy to steer in one direction or another. The ability to think about problems intuitively and from first principles is therefore crucial. One example is SQuAD: question answering models eventually matched human performance on the original benchmark, until someone proposed including questions with no answer in the passage, which immediately opened a large gap between machines and humans. This kind of first principles thinking comes naturally to social scientists and is essential in addressing machine learning challenges.
Mikey Shulman
I think it’s not only the human feedback. I think you could think about this just in general, you have these like giant, really powerful models that are so prone to overfitting, that are so poorly understood, that are so easy to steer in one direction or another, not only from human feedback. And your ability to think about these problems from first principles, instead of like getting down into the weeds or only math, and to think intuitively about these problems is really, really important. I’ll give you like, just like one of my favorite examples. It’s a little old at this point, but if you guys remember like SQuAD and SQuAD 2.0, the question answering dataset. Yeah, exactly. And so, you know, the benchmark for SQuAD 1, eventually the machine learning models start to do as well as a human can on this thing. And it’s like, oh, now what do we do? And it takes somebody very clever to say, well, actually, let’s, let’s think about this for a second. What if we presented the machine with questions with no answer in the passage? And it immediately opens a massive gap between the human and the machine. And I think it’s like first principles thinking like that, that comes very naturally to social scientists, that does not come as naturally to people like me. And so that’s why I like to hang out with people like that.
Shawn ‘swyx’ Wang
Well, I’m sure you get plenty of that in Boston. And as an econ major myself, it’s very gratifying to hear that we have the perspective to contribute.
Mikey Shulman
Oh, big time. Big time. I try to talk to economists as much as I can. Excellent.
Time 0:50:48
2024-03-25
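The SQuAD example above can be illustrated with a toy scorer. The sketch below is a simplified assumption-laden illustration, not the official SQuAD evaluation script: the two examples, the crude exact-match rule, and both "systems" are hypothetical. An empty gold answer marks a question that has no answer in the passage.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Crude exact match after trimming and lowercasing (real scoring normalizes more)."""
    return prediction.strip().lower() == gold.strip().lower()


def accuracy(predict, examples) -> float:
    """Fraction of examples where the system's answer matches the gold answer."""
    correct = sum(exact_match(predict(q, p), gold) for q, p, gold in examples)
    return correct / len(examples)


# Toy data: (question, passage, gold answer); "" means unanswerable.
EXAMPLES = [
    ("Who wrote Hamlet?", "Hamlet was written by Shakespeare.", "Shakespeare"),
    ("Who wrote Hamlet?", "Paris is the capital of France.", ""),
]


def always_answers(question: str, passage: str) -> str:
    """Naive system: always extracts something (here, the passage's last word)."""
    return passage.rstrip(".").split()[-1]


def can_abstain(question: str, passage: str) -> str:
    """Same extractor, but abstains when the question's topic is absent."""
    topic = question.rstrip("?").split()[-1]
    if topic.lower() not in passage.lower():
        return ""  # abstain: no answer in the passage
    return always_answers(question, passage)


print(accuracy(always_answers, EXAMPLES), accuracy(can_abstain, EXAMPLES))
```

A system that never abstains looks fine when every question is answerable and loses ground as soon as unanswerable questions are added, which is the gap the revised benchmark exposed.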


Author: Latent Space: The AI Engineer Podcast

Type: Podcast

Listen to episode (share.snipd.com)
