ElevenLabs' Mati Staniszewski: How Voice Becomes the Interface for Everything

Mati Staniszewski, co-founder and CEO of ElevenLabs, joins Sequoia partner Andrew Reed at AI Ascent 2026 to talk about how a four-year-old company built a frontier audio AI business with just over 400 people and over $400M in revenue. He explains why audio was overlooked in 2022 when the rest of AI was chasing text and images, why ElevenLabs chose to monetize from day one rather than raise indefinitely, and why he believes voice will be the primary interface for agents, robots, and the next generation of computing. Also: why emotional intelligence is the next frontier in voice, and what happens when one voice agent realizes it's talking to another.

Published: May 8, 2026
Uploaded: Jun 11, 2026

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:43

[00:02] So I love [00:03] line charts and bar graphs as much as the next guy, probably more. The story of ElevenLabs is also interesting from a human perspective, as you started the company with a childhood friend. So maybe take us back to 2022 or earlier and just tell the human side of the ElevenLabs story to start. [00:33] All the names in Polish are complicated, luckily, for us. But we met in high school, became best friends, took all the same classes together. And then through the years, did everything together. So we traveled together, studied together, worked together. And time is on our side. We are still best friends. It's working out. [00:51] The inspiration for starting ElevenLabs came from where we are both from. We are both from Poland, [00:56] the suburbs of Warsaw. And there's a very peculiar thing in Poland. If you watch any foreign movie in the Polish language, [01:04] all the voices, whether that's a male voice or a female voice, get narrated by one single character. So as you can imagine, a pretty terrible experience. You have literally one voice [01:15] narrating everything. It is usually also kept in monotone on purpose, so you are meant to interpret your own emotions for that content. And while we grew up with this, this is still happening today for the majority of content. And that kind of opened our eyes to the idea that [01:30] one of the clear things across the audio domain, in the future, will be this ability for everybody to speak any language with the same emotion, the same intonation. And we started diving deeper into that problem.

1:44-3:18

[01:44] And we realized the problem of audio exists in so many other domains, too, whether that's narrating the content around us, whether that's [01:51] books not being available in audio form, whether that's the news articles that we could read, whether that's that language barrier, or, as we heard in the previous conversations, the future where humanoids, the robots, are around us. Voice will be the primary interface to a lot of the technology, and that's something we would love to fix and solve. [02:09] Excellent. And ElevenLabs builds frontier models for audio. I think there's a paradigm now where, to build a frontier model, you have to start with [02:19] hundreds of millions or billions of dollars, and then figure out [02:22] the rest later. ElevenLabs did not take that path. [02:25] Maybe talk a little bit about [02:27] your approach towards building this company, why this hasn't been replicated, is that even possible in 2026, [02:34] et cetera. [02:35] Yeah, I think that continues that great luck in timing, because we started in 2022. For those of you working in the domain at the time, that was a year of crypto and metaverse. Nobody was working on the AI side yet. [02:51] Even further, [02:52] people were starting to work, of course, on the text models, on the visual models. But audio as a domain was still considered a big niche. There were so few researchers in the space working on it. So for us, that was a good part of picking that domain where, one, we were excited about where that future was going. Two, we felt that people around just didn't realize the value of that domain. And three, the requirements of what you needed to solve were very different. The audio models were smaller.

3:19-5:08

[03:19] So you don't need as much compute as you need for some of the other sister domains. The data needs are big, but [03:26] while there's a lot of audio data, we knew that to actually get that audio working, you would need to figure out how to transcribe a lot of that data and annotate all the data, which we knew we could do. And then ultimately, it all boiled down to the [03:38] architectural side: can we solve that part in a good way? And here, my co-founder is one of the smartest people I know, and a great researcher, and has been able to assemble some of the best people in audio to help us. And we took a slightly untraditional approach at the time. We started in London. We had a lot of people between London and Warsaw, [04:00] and started the company in a completely remote way. So we wanted to hire the best researchers wherever they were. We were going through the classic GitHub scraping and trying to reach people based on their work instead of based on their presence. And based on that work, we would reach out to those people. We would always share our samples and try to get them to join the team. And that's how we assembled the first set of people, who we think are some of the best researchers in that audio domain. [04:27] Through the years, they still help us crank a lot of those models into production. [04:33] Then we launched the product. I think the slightly different approach we took was monetizing very quickly, so trying to [04:39] get some of the revenue stream back, so we can fund a lot of the work and the models. We try to stay [04:48] healthy on the margins so we can continue investing, with the assumption that it's better for us to figure out that stream and be able to be independent in the development. But then as the ambitions grew, we knew that we needed to train models. So we, of course, brought a lot of money externally as well. And I think, projecting to today, one thing that's clear for us is...

5:08-6:48

[05:08] There are still so many of those niches that people don't tackle that you can start with and then step by step start opening them up. [05:17] And I think a lot of [05:18] customers see ElevenLabs through their narrow needs, right? [05:23] Maybe take a [05:24] zoomed-out view: what is the suite of models that ElevenLabs works on? How do you prioritize them? How do you organize R&D, etc.? [05:32] And so [05:34] we started with the first text-to-speech model, so the model that could finally understand the context of what's being written. And based on that context understanding, get the right emotion, the right intonation from text. So if it was a happy sentence, you get that happiness out. If it's a dialogue, it can pronounce the dialogue out. And then we continuously started adding to that. So it started with the problem of breaking down language barriers. [05:58] The things you need to solve [06:00] dubbing are [06:02] transcription, so understanding, then the translation, and then text-to-speech. So we first solved text-to-speech. Then we knew we needed to add the other component, which is speech-to-text and being able to transcribe content in a great way. Then how we combine those models together. [06:16] So those were kind of the first three models in the first couple of years. And then, of course, the other things started happening across the space, which is that a lot of the reasoning models started becoming quick enough and smart enough at the same time where you could imagine those interactive experiences being possible. And that's where we started launching [06:33] more of the real-time streaming models across audio, and then combining those into conversational experiences. So we added effectively all the stack, all the turn-taking and orchestration, to create a voice engine for a voice agent.

6:48-8:18

[06:48] And then on the other side, as we realized that emotionality is something we can solve, we added one of the hardest modalities in audio, which is music and being able to produce music. So today we span the entirety of audio research, whether it's text-to-speech, speech-to-text, combining those models together in localization with dubbing, in orchestration with the voice engine, and then being able to do that across music as well. [07:13] And what's the... [07:15] With all those things and all that interesting development work, [07:20] what's the "oh wow" moment in terms of what these products are capable of that you can remember? You know, there are so many, and the bar kind of changes for all of us. The first moment for us was... [07:31] Well, the first one for us: they always use my voice as a testing voice, because it has this weird accent. And the first time was when we could replicate my voice based on a good sample. That was a first wow moment for me. And you always go through this moment of, like, this is not what my voice sounds like. And then you listen to yourself side by side, and it's, like, definitely what it sounds like. [07:52] Unfortunately. Then the second moment was when we first got it to laugh. [07:58] And people were like, OK, this is actually the thing that makes the whole experience more human: the laughter, the pauses, the umms, the imperfections. So we started getting those out. And that was the moment for us because we made it to the top of Hacker News with the "first AI that can laugh" model, which was a very proud moment for us.

8:19-9:49

[08:19] And then, of course, over the years, that extended. You might remember in 2023, 2024, there was a Javier Milei speech that went viral where he could speak other languages. So it was translated into English. And it was the first time where you could still hear his voice out there. [08:37] So that's the kind of continuous wow moment of something that was completely impossible before. And we saw that happen time and time again with Narendra Modi, with President Zelensky, all the way to recently one of the, I feel like, pinnacles of voice performance, Matthew McConaughey giving his newsletter and his iconic lines in Spanish and Portuguese, where for the first time his family members who speak those languages could hear him speak those languages too. [09:07] On the recent pieces, the two that we are excited about bringing to production: [09:11] I think the first one is finally figuring out the emotional intelligence in that interactive experience. So in the voice agent experience, it doesn't only get the right intonation and emotion, but can understand the other side. So if somebody is stressed, it gets that and delivers that soothing, reassuring emotion. If someone is excited, maybe it matches that. If someone speaks slowly, it makes sure to slow down. [09:41] a path to solving, which will be just a continuous step change to what's possible. And then the second one--

9:49-11:25

[09:49] which will apply there, but also apply to general audio spaces, is audio general intelligence, where you can combine audio models together in one stream. So you could theoretically have a model that narrates, then pauses, and, let's say, starts singing with that same continuous voice. And that's something that's extremely hard to combine today, and something that will be possible, I think, very... [10:12] very soon. [10:14] You mentioned voice agents. And it seems like everybody, at least on the customer side, is buying a voice agent. [10:21] And I think intuitively, you think customer support, [10:24] the old phone tree replacement. What's actually going on in the world of voice agents, and what do you think are the most [10:31] interesting overlooked opportunities, [10:33] spots where startup founders should focus? [10:35] Yeah, of course, customer support is probably the one that everybody has heard of and knows very well. I think the second thread we are seeing is an increasing shift to revenue-generating opportunities where voice agents can act in sales, whether it's inbound or outbound sales. It doesn't replace the entire experience but takes and amplifies part of that experience. Maybe a good example is Deliveroo, where Deliveroo will have voice agents that contact the restaurants to capture their opening times. [11:05] And based on their opening times, they can update the riders and drivers, and of course, the people ordering, on when to get to that work, all the way through to the inbound sales where [11:16] increasingly people, and a good example is Deutsche Telekom, will be contacting to inquire about the service or to buy a product.

11:25-12:58

[11:25] Instead of going through the dropdown, instead of going through the form, you can speak with the voice agent to leave that information. We do it ourselves, too, so we have good metrics and understanding of what's happening there. One, of course, it's so much simpler and quicker to go through instead of going through that form. But the second thing that started happening in that inbound sales flow is we had a lot more information that people started leaving, because they would speak about the use case they're coming with, [11:55] and we could combine that and then just deliver such a much better experience afterwards. On the overlooked side, [12:01] I think my favorite examples are that citizen support, education, and health care will completely change. On citizen support, all of us would benefit from just generally better education, [12:14] government access, whether that's understanding how to fill in the taxes that I think many of you went through earlier this month, all the way through to just learning what the policy is for travel abroad and how that might affect the space. We've recently seen that work deployed in the government of Ukraine, who we think is one of the most advanced governments on that front. [12:36] We traveled to Ukraine, working with their team. And what they are trying to solve is: they have a government app, which every citizen can access and get information about what's happening. [12:48] Given the front line and the lack of that access, they wanted to figure out a new channel for people to be able to call in and get that information. So they effectively created a voice agent.

12:58-14:36

[12:58] where you can call in and get the information about what's happening on the front line. You can get education help and some of the lectures delivered to your kids, all the way through to proactive engagement about staying safe and staying out there. And maybe a last example on the education front, and that's probably my favorite one as I think about that changing: just how incredible would it be to have someone who is an incredible teacher available 24/7, [13:28] all the way through to Richard Feynman. And you can learn physics with them on the headphones while you are teaching that subject or learning that subject. And that's something that we are seeing pockets of. A great example is MasterClass, where MasterClass, of course, collaborates with incredible teachers to deliver static lectures. But recently, they launched an interactive version of that. So I don't know if that will be a good reference for this audience, [13:58] that can teach you cooking. So while you are in the kitchen, he can shout at you effectively to get better. Or maybe a better one is Chris Voss, where you can, of course, learn negotiation, but you can learn by negotiating with Chris live on the phone to get better, which I thought was phenomenal. [14:17] Having negotiated against Mati a number of times around financing rounds, I understand now. I think it helps you to say this, but I think the opposite is true. [14:27] I have some more questions. I want to save time for the audience as well. Maybe one: as Konstantine mentioned, more than 100 million of net new ARR in Q1.

14:36-16:11

[14:36] Obviously, the business is going very well. [14:39] And you're sort of pioneering the startup founder building a foundation model and applications. [14:44] Any counterintuitive lessons about building a company in this era that the founders in the audience might want to take home with them? [14:51] Thank you. [14:52] So, just for reference, we are [14:55] just over 400 people, over 400 million in revenue, [14:59] but still keep the teams extremely small, so [15:02] the rough, a little bit arbitrary, cap is less than 10 people. That's for each of the research and product teams. Even the go-to-market, ops, and talent teams are all smaller than that size. Most people will have 10 direct reports or so. So it keeps it relatively flat and allows us to move a little bit quicker. One thing that we've done in this model, and very surprisingly, this is a very similar model to what we've actually seen with the government of Ukraine: [15:27] each of the teams, even the teams that aren't technical teams, will have engineers within them. [15:33] So our people team, our go-to-market team, our legal team will have an engineer [15:38] in that team that helps to build, of course, automation, and upskill, uplevel the rest of the people. And recently, that really helped because, as I'm sure many of you are going through, everybody will be... [15:49] vibe coding and coding a lot of the help, even if they are not technical. So now that kind of shifted the responsibility, not responsibility, but shifted the requirement of how good the review needs to be for a lot of that work. You have security and infrastructure implications. You will want to make sure that the output is right. And I think on the engineering side, you can put that expectation. On the non-engineering side,

16:11-17:50

[16:11] the ability to do that is relatively hard. So that technical resource in those teams helped us a lot, too, [16:17] to figure this out. And in general, there's just so much incredible work you can do by having .twitter.step, [16:24] scraping on the hiring and recruiting front, and analyzing what worked in the past to improve in the future, whether that's [16:30] upskilling the legal team on how to use those tools and then figuring out ways of-- we recently introduced this scoring system. For those on the go-to-market, on the sales side, you frequently will end up in this negotiation with your sales team of: can I give indemnity provisions? What's the liability cap? Can I give this set of clauses? And then you kind of need to draw the line of how many things you give. And I ended up being in so many of those conversations about whether we gave a lot already or we didn't. So now we introduced the scoring system that you [17:00] can give per size of the customer. You can just give a few of those points out and in. [17:06] We just made it so much easier. And of course, that's fully automated now with how we work across that team. So that was one of the unintuitive-- [17:13] Small teams. [17:15] Bringing technical talent into the non-technical teams, keeping it relatively flat. We also have no titles, which allows us to bring people in and really optimize for the impact that they are having. And then you can grow as quickly as you want. The tenure will not define this. [17:32] And many more. So we'll see. It's a four-year-old company, so we'll see if that helps. [17:37] Any questions? [17:40] Oh, no. OK, Sonia. Are you seeing people deploy voice agents to actually negotiate on their behalf? And then when you...

17:50-19:46

[17:50] Are you starting to see agents actually negotiate with agents? [17:53] Sorry, I do three-part questions. When that world happens, [17:59] do you think the agents are actually talking to each other the way that humans talk to communicate and negotiate, or do you think it's [18:05] beep boop beep boop? Do you think it's all done instantaneously? Like, what's that world gonna look like? [18:10] - Okay. [18:10] So one, early inklings of that. We haven't seen any truly successful ones on the negotiation front. It was more, you know, kind of order taking. What's the price? Can we capture that? And then it kind of goes back to the team. So not real negotiation. But there are a few startups that we see, especially on any organizational shifts of: can I organize this event? Calling a lot of places, getting the price, and then calling again with our budget. So that is happening. And I think this will shift. I think emotional intelligence will [18:38] be the big part that will start being important in a lot of that work, where it's not only the content that matters, but how you deliver, when you pause. [18:47] And then maybe the extreme version of that, which agents [18:50] are not-- like, most people wouldn't do it, and they are not good at that. Today, you will see a lot of interruptibility built in, where a human can interrupt the agent. [19:01] But with negotiation, you also want the opposite, where the agent will interrupt the human. It's kind of the extreme version of that. On the second part, on the agent-to-agent part, [19:08] some of you might have seen this: we did a hackathon over a year and a half ago, and that was exactly the case, where an agent was speaking with another agent, they detected that they are [19:19] both agents, and they swapped over to a different language. And also, like, a more efficient transmitter of information than just the classic spoken word. And I think this will happen 100%. Like, the big question will be: will it really be voice, or will it be another transmission of information? It depends truly on what the infrastructure is built for. And I think this will define that.

19:46-21:25

[19:46] that experience. [19:48] Adios. [19:50] - Yeah. [19:51] - I'll see you at the catch box. - Hey, I'm curious how you're thinking about the need for voice in the future where agents do more and more of the work. [20:02] So basically, what are the kind of use cases, maybe, where human conversation... I think it's more of a follow-up to the last question. [20:09] Like, [20:10] first, [20:11] you, all of us, will have so many different devices around us. [20:16] A step from that, you will have robots around us. So of course, voice will be such an important interface to instruct [20:22] and be able to interact with those devices. In many ways, I feel like we see a lot of developments of intelligence. But then the real bottleneck of the future will be how we communicate with that intelligence. And I think the voice and visual part will be a big unlock to be able to actually get the most of that intelligence value in those settings, which isn't [20:43] yet possible. But on the flip side, [20:46] the value of the human-to-human interaction will only increase, [20:51] whether that's events like this one, whether that's events with your favorite artist, will increase in value [20:59] with that ability of having voice all around you. [21:04] But trust will be such a big part and something we optimize for between the agent and human in the future, where all of us will have a voice agent, for example, to call and book a restaurant or give information to a health care appointment. All of that will require such a high degree of trust that this is you and authenticated you. So there'll be a level of--

21:26-22:56

[21:26] encoding and decoding for real, then encoding and decoding for watermarked, [21:30] opted-in human, and then by default, everything else will be fake, which is kind of the opposite of how it is today. Today you detect for [21:37] AI; [21:38] in the future you will detect for real, authenticated AI and assume everything else is fake. [21:45] If you could pass it. [21:48] Thank you. [21:48] Thank you. [21:50] Andrej spoke earlier about jagged intelligence. Do you see similar [21:54] odd places in audio where models are good and bad that you might not expect, and, [21:58] yeah, what are they? [22:00] Thank you. [22:02] - Yeah. [22:02] There's still so much on the bad side. [22:07] We spoke a little bit about where we see the voice agents working, so this combination of the models together. Support settings work really well, work reliably. [22:17] Early sales starts working, but the moment you start swapping to a true emotional interaction, [22:23] it's not yet working. It doesn't get the emotion that well. It's slightly too slow. [22:28] So that is still, I think, a big step change that should work. [22:34] The same will apply in a very different domain on the music side. I think on the music side, [22:40] you can get... [22:42] you can get good production music. You cannot get top-charts music, even with artist input. I think this will change over the next year or two. Yeah, of course. Andrej's take was that the reason for that was that the labs were basically training for the stuff that

22:57-24:35

[22:57] has economic value where you're training your models. [22:59] Is that true of you? Are you basically training for the things that make the most money? Or is it that there are some challenges that are genuinely harder than others? [23:08] We try to train the models, build the product, and the ecosystem that will drive, of course, the biggest impact [23:14] for all our customers, all users, which should correlate, of course, with the revenue in the long term. So that long term perspective, it's going to be minimal in the next few years, so not next year. [23:27] So frequently, we will train the models that might not provide that value in the short term. Or even a step before, we'll spend so much time labeling the data, not only the what of audio, but also the how of audio. Like, what emotions did I use? What is my voice described as? What is this music described as? So we assembled a team of now 1,000-plus people that have been voice coaches, musicians, artists before, that can help us annotate that behind the scenes. And that will not provide value in the next [23:53] six to 12 months, but we think it will in the next 12 to 24. And then you're going to collect that data, which frequently just isn't that accessible as well. [24:04] Last one and then we'll go to lunch. [24:08] Thank you. [24:09] Thank you. [24:10] Hey, you hear me? Thanks. I'm a big fan of yours and ElevenLabs. Thank you. [24:15] From the model layer perspective, what do you think are [24:20] the moats here with audio models? The labs are [24:24] going there, not going there. What are the kind of, you know, in this sausage making of making a really good frontier audio model, what are the main defensible parts there?

24:35-26:23

[24:35] Yeah, dear. [24:37] So of course, we do a variety of models. And recently, I had the pleasure of meeting Jensen. And he was commenting on a few of those models. And he said that our speech-to-text models are technology. And text-to-speech is artistry. And we are all artists. So he gained a client for life. But of course, we do believe there is a little bit of that, too: to really fix text-to-speech and fix that emotionality, you will need to be really focused on that space. You really need to get in front of users, collect the data, [25:07] the preferences, use that to fine-tune the models. And then there is the domain specificity in how you actually bring those models to production. Health care is very different from financial services, which is very different from education or experiences. So that's on the model layer. I think there will be continuous advantage: if you actually care about the quality, actually spending the time on the model work will help you keep that advantage. That's to your point. [25:32] A lot of use cases will use a model as just a small part of their stack. And that's where we spend a lot of time going beyond the research, on the product side, on how you understand a user's problem, the workflow that they need. And voice agents is combining the audio models with knowledge and bringing that inside of the system, how you bring it outside with telephony systems so you can interact across channels, how you evaluate, test, and monitor. And then as you create, whether that's in the agent space, [26:02] you build the ecosystem. And that's what we hope to build across ElevenLabs, a place where, whether it's distribution and brand that people can trust, the platform where you have a pre-existing set of work that you can start off, whether it's a template for creating an agent, templates for creating a workflow in a creative space, or whether that's a voice. And we had the pleasure now of having over

26:23-26:45

[26:23] 20,000 voices that people created and contributed, that you can use across languages, styles, and voices. And I think that will be an increasingly important layer of how you are able to cater to that diversity, make it easy for people to start and really understand that. [26:37] That worked all. [26:39] All right. I'm going to hand it back to Konstantine. Mati, thank you. Andrew, thanks for being a partner. [26:44] Amazing. Thank you, guys.