Nicholas

How Google’s Nano Banana Achieved Breakthrough Character Consistency

When Google launched Nano Banana, it instantly became a global phenomenon, introducing an image model that finally made it possible for people to see themselves in AI-generated worlds. In this episode, Nicole Brichtova and Hansa Srinivasan, the product and engineering leads behind Nano Banana, share the story behind the model’s creation and what it means for the future of visual AI. Nicole and Hansa discuss how they achieved breakthrough character consistency, why human evaluation remains critical for models that aim to feel right, and how “fun” became a gateway to utility. They explain the craft behind Gemini’s multimodal design, the obsession with data quality that powered Nano Banana’s realism, and how user creativity continues to push the technology in unexpected directions—from personal storytelling to education and professional design. The conversation explores what comes next in visual AI, why accessibility and imagination must evolve together, and how the tools we build can help people capture not just reality but possibility. Hosted by: Stephanie Zhan and Pat Grady, Sequoia Capital

Published Nov 11, 2025
Uploaded Jun 11, 2026

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:35

[00:00] There's something about visual media that really excites people. It's the fun thing, but it's not just fun. It's exciting. It's intuitive. The visual is [00:10] the basis of so much of how we as humans experience life that I've loved how much it's [00:16] moved people. I think we're really now making it possible to tell stories that you never could. In the way that the camera allowed anyone to capture reality when it became very accessible, you're kind of capturing people's imagination. You're giving them the tools to get the stuff that's in their brain out on paper visually, in a way that they just couldn't before because they didn't have the tools or the knowledge of the tools. That's been really awesome. [00:57] Today we're talking with Nicole Brichtova and Hansa Srinivasan, the team behind Google's Nano Banana image model, which started as a 2 a.m. codename and has since become a cultural phenomenon. [01:10] They walk us through the technical leaps that made single-image character consistency possible: how high-quality data, long multimodal context windows, and disciplined human evals enabled reliable character consistency from a single photo, [01:24] and why craft and infrastructure matter as much as scale. [01:28] We discuss the trade-offs between pushing the frontier versus broad accessibility, and where this technology is headed:

1:35-3:05

[01:35] multimodal creation, personalized learning, and specialized UIs that marry fine-grained control with hands-off automation. [01:43] Finally, we'll touch on what's still missing for true AGI and the white space where startups should be building now. [01:49] Enjoy the show. [01:50] Nicole and Hansa, thank you so much for joining us today. We're so excited to be here to chat a little bit more about Nano Banana, which has taken the world by storm. We thought we'd start off with a fun question. What have been some of your own personal creations using Nano Banana, or some of the most creative things you've seen from the community? [02:10] Yeah, so I think for me, one of the most exciting things I've been seeing, [02:17] which didn't occur to me but is very obvious in hindsight, is the use with video models to get actually consistent cross-scene [02:25] character and scene preservation. How fluid is that workflow today? How hard is it to do that? So what I've been seeing is people are really [02:34] mixing the tools and using different video models from different sources. And so I think it's probably not very fluid. I know there are some products out there that are trying to integrate with multiple models to make this more fluid, but [02:45] I think [02:47] the difference in the videos I've been seeing from before and after the Nano Banana launch has been pretty remarkable. It's much, much smoother, and much more like what you'd want in the video creation process, with scene cuts that feel natural. Yeah, so that's been cool. And I don't know why it

3:05-4:36

[03:05] didn't totally occur to me that people would immediately do that. Yeah, one of my favorite ways that I didn't expect is how people have hacked around the model to use it for learning new things or digesting information. I met somebody last week who has been using it to create sketchnotes of these varied topics. It's surprising because text rendering is not where we want it to be, but this person has hacked together these massive prompts that [03:35] get the model to output something that's coherent. And he's used it to try to understand the work that his father's doing. His father is a chemist at a university, and it's a super technical topic. And so he's been feeding his lectures [03:49] to Gemini with Nano Banana, and then getting these sketchnotes that are very coherent and visually digestible. And for the first time, I think in decades, they've been able to have a conversation with each other about his dad's work. And that was really fun, [04:03] and something that I didn't see coming. That's very cool. I think people are really working around... [04:08] This model is amazing, but obviously it's not perfect. We have a lot of things we want to improve, and I've been astounded by the ways people have found [04:17] to work [04:18] with the model in ways we didn't anticipate, and give inputs to the models in ways we didn't anticipate, to bring out the best performance and unlock these things that are kind of mind-blowing. In the building of it, [04:31] was there a moment, like an aha moment, where you kind of felt, [04:35] wow, this thing's going to be pretty good?

4:37-6:15

[04:37] We just talked about it. Yeah, I think Nicole had the aha moment. I had one. We always have an internal demo where we play with the models as we're developing them. And I just took an image of myself and said, hey, put me on the red carpet. Just a total vanity prompt, right? And then it came out and it looked like me. And then I compared it to all the models that we had before, and no other model actually looked like me. I was so excited. [05:07] And then people looked and they were like, okay, yeah, we get it. You're on the red carpet. [05:11] And then I think it took a couple of weeks of other people being able to take their own photos and play with it and just kind of realize how magical that is when you get it to work. And that's kind of the main thing that people have been actually doing with the model, right? Turning yourself into a 3D figurine, where it's you on a computer, you on a toy box, and then you as the figurine. So like you three times. [05:37] new ways and almost kind of enhance your own identity has just been really fun. And that for me was like, oh, man, this is awesome. What was it about what Nano Banana did with you on the red carpet that was miles better than what everyone else has? It looked like me. [05:51] And it's very difficult to judge character consistency on faces of people you don't know. So if I saw an AI version of you, I might be okay with it. But you would say, oh, no, parts of my face are not quite right. You can really only do it on yourself, which is why we now have evals on many team members where it's their own faces.

6:21-7:56

[06:21] whether or not someone looks like you. Yourself and like faces you're familiar with. I think like when we started doing it on ourselves and it's like, I see Nicole a lot. So like Nicole versus like random person we might eval on, right? It's, it's just a very big difference in terms of judging the model capabilities and [06:38] Yeah, I think it's one of those things that it's like... [06:40] So fun that preservation of the identity is so fundamental to these models actually being useful and exciting, but is [06:48] surprisingly... [06:50] Tricky. [06:51] Uh, and that's why we see a lot of other models, not quite hitting it. Well, I was going to ask you, I would imagine that character consistency is not just an emergent property of scale. And so maybe two questions. One, I'm sure there's stuff you can't tell us, but what can you tell us about how you achieved it? And then two, was that an explicit goal heading into the development of this model? Yeah. [07:14] Yeah, so I would say, I mean, yeah, I think there's definitely... [07:16] things that are tricky to say here, but I would say, um, [07:19] there's like sort of different genres of ways to do image generation. Um, and so that, so that plays, that definitely plays a part, uh, [07:28] in how good it is. And I think it was definitely a goal from the beginning. It was definitely a goal because we knew it was a gap with the models that we released in the past. And generally consistency for us was a goal because every time you're editing images, right, like you want to preserve some parts of it and then you want to change something. And prior models just weren't very good at that. And that makes it not very useful in professional workflows, but it also doesn't make it useful for things like character consistency. And we've heard this

7:58-9:30

[07:58] know, trying to advertise their products and putting them in lifestyle shots. It has to look like your product, like 100%. Otherwise, you can't put it in an ad. So we knew there was demand for it. We knew the models had a gap. And we felt like we had the right recipe, both in terms of the model architecture and the data, to finally make it happen. I think what surprised us was just how good it was when we actually finally built the model. Yeah, right. Because I think we felt like we had the recipe, exactly as Nicole said, but [08:26] until you finish training and [08:30] you're actually using the model, you don't know how close you're going to get to that goal. And I think we were all surprised by that. [08:36] Yeah, and I think the other thing is, if we think about what people expect out of editing, when you edit in your phone apps or Photoshop, you expect a high degree of preservation of things you're not touching. [08:48] Depending on how the models are made and [08:52] the design decisions behind them, [08:55] that's very tricky to do. But it's something people... [08:58] It's one of those things where [09:02] it's shockingly technically difficult, even though it's something I think a layperson who's using the models would expect to be the basic thing about editing: you don't mess with the things you don't want to be messed with. Yeah. Back to that moment where you saw yourself on the red carpet, [09:17] and wow, that's actually me. [09:21] And it took some of your colleagues a couple of weeks to have the same experience because they tried it with their own photos. [09:26] The question is, beyond [09:27] hey, that's actually me, the qualitative test:

9:31-11:17

[09:31] Is there some sort of an eval that you can put against that to make it quantitative, to show that we have achieved the thing that we set out to achieve here? [09:39] Yeah, so I actually think [09:41] face consistency, exactly for the reason Nicole said, is quite hard. It's quite hard for other people to do. I will say in general, I think what we found with image generation [09:52] in particular, that's unlocked a lot for us, is that [09:54] human evals are important. I think they're foundational. We have a team that works on helping us build good tooling and good practices for evals, and [10:06] having humans actually eval these things that are very subtle. If you think about image generation, [10:12] faces, aesthetic quality, these are things that are very hard to quantify. And so I think human evals have been a big [10:19] game changer for us. [10:20] I think it's a combination. There's human evals. There is the very technical term, eyeballing of the model results by different people. And there's also just community testing. When we do community testing, we start internally. We have artists at Google and at Google DeepMind who play with these models. Our execs will play with these models. And that really helps build that qualitative narrative around why this model is actually awesome. [10:50] if you just look at the quantitative benchmarks, you could say, oh, it's 10% better than this model that we had before. And that doesn't quite grok that emotional aspect of, oh, I can now see myself in new ways, or I can now finally edit this family photo that I cut up when I was five years old and probably shouldn't have. People have done that. When I'm then able to restore it, I think you really need that qualitative user feedback in order to be able to like tell

11:20-12:49

[11:20] of the gen AI and AI capabilities, but I think it's especially true of [11:26] visual media, where it's very subjective, versus if you think about something like math and [11:31] logical reasoning, where [11:34] you can really ground it in an answer, right? And so it's easier to have these very objective, automated, quantitative evals. [11:42] To get to that level of character consistency from just one 2D image of someone is really, really hard. Can you walk us through, maybe a little bit, what the technical breakthroughs are that helped you drive to that level of character consistency that we actually haven't seen anywhere else? [11:59] I mean, I think a key thing is having good data [12:03] that teaches the models to generalize. [12:05] And the fact that it's a Gemini model, it's [12:10] a multimodal foundational model that's [12:15] seen a lot of data and has good generalization capabilities. And I think that's kind of the secret sauce, right? You really need [12:23] models that generalize well to be able to take advantage of that for this. Right, yeah. And I think the other nice part about doing this in a model like Gemini is that you also get this really long context window. So yes, you can provide one image of yourself, but you can also provide multiple. And then on the output side, you can also iterate across multiple turns and actually have a conversation with the model, which wasn't possible before, right? One, two years ago, we were fine-tuning

12:53-14:42

[12:53] that looked like you. And that's why it never took off in the mainstream, right? Because it's just too hard. - Too much work. - And you don't have that many images of yourself. It's too much work. And so I think it's both. As Gemini gets better in general, you benefit from that multimodal context window, and you benefit from the long output and the ability to maintain context over a long conversation. And then you also benefit from actually paying attention to the data and focusing on the problem. A lot of the things we get better at [13:22] come down to there being a person on the team who's obsessed with making them work. We have people on the team who are obsessed with text rendering, and so our text rendering just keeps getting better because that person is just obsessed with the problem. Yeah. It's not just about throwing high [13:36] quantities of data in, right? I think one thing that's really important is this attention to detail [13:46] and quality in all the things you're doing with the model. There are a lot of small design decisions and decision points at every step. And [13:55] I think that detail-orientedness about high-quality [13:59] data and selections is really important. Yeah. It's the craft part of AI, which we don't talk about a lot, but I think it's super important. Yeah. How big was the team that worked on it? [14:10] To ship it, it took a village. Yeah, especially because we ship across many [14:15] products. So there's the core modeling team, and then there are our close collaborators across all the surfaces. Yeah. When you put them all together, you easily get into dozens and hundreds, but the team who works on the model is much smaller. And then the people who actually make all the magic happen. 
And we had a lot of infrastructure teams like optimizing every part of the stack to be able to serve the demand that we were seeing, which was really awesome. But really like to ship it, we were joking that it

14:45-16:17

[14:45] this, do you build it with particular [14:47] personas or particular use cases in mind, or do you build it more [14:52] with a capability-first mindset, and then once the capabilities emerge, you can map it to personas? [14:59] It's a little bit of both, I would say. Before we start training any new model, we kind of have an idea of what we want the capabilities to be. And some design decisions, like how fast it is at inference time, also impact which persona you're going after. So this model, because it's kind of a conversational editor, we wanted it to be really snappy. Because you can't really have a conversation with a model if it takes a minute or two to generate. [15:29] image models versus video models. You just don't have to wait that long. And so to us, from the beginning, it felt like a very consumer-centric model. But obviously, we also have developer products and enterprise products, and all of these capabilities end up being useful to them. But really, we've seen a ton of excitement on the consumer side in a way that I think we haven't before with our image models, because it was very snappy, and it kind of made these [15:59] accessible through a text prompt. And so that's kind of how we started it out, but then obviously it ends up being useful in other domains as well. Yeah, and I think one of the [16:09] differences in philosophy... So previously we'd worked on the Imagen line of models, which were straight image generation, and I think one of the

16:18-18:06

[16:18] big philosophical goal changes [16:20] in these Gemini image generation models is that [16:23] generalization is a more foundational capability. So there are things where we want this model to be good at this, like [16:35] representing people and letting them edit their images and have it look like themselves. But I think there's also a lot of things that [16:42] are emergent from the goal of just having a [16:45] baseline capable model that reasons about visual information. One thing that's surprised me, I guess as a callback to the earlier conversation, is that people can put in math problems, like a drawing of a math problem, and ask it [16:59] to [17:00] render the solution, right? So you can put in a geometry problem and say, what is this angle? And [17:07] that's an emergent thing of a foundationally capable model that has [17:13] reasoning, mathematical understanding, and visual understanding. So I think [17:17] it's both. Yeah. Can you maybe share, I just had a curiosity, what's a good way to understand the family mapping and the relationship between Gemini powering Nano Banana, Veo, and all these other adjacent products and models that are all driven by and benefit from the generalization and the scale of Gemini itself, how you co-develop, and then where you want to take it from here? [17:42] Um, [17:43] our goal has always been to build the single most powerful model that can do all these things, right? You can take in any modality and you can transform it into any modality. And that's the North Star. We're obviously not quite there yet. And so on the way there, we had a lot of specialized models that just got you great results in a specific domain. So Imagen was an example of that for image generation.

18:07-19:57

[18:07] Veo is an example of that for video generation and editing. [18:11] And so I think we're both developing these models to push the frontier of that modality, and you get really useful outputs out of that, right? A lot of filmmakers are using Veo in their creative process. [18:22] But you're also learning a lot that you can then bring back into Gemini and make it good at that modality. Image is always a little bit, I think, ahead of the curve because you just have one frame, right? It's cheaper, both to train and at inference time. So a lot of the developments you see in image, I expect you to see in video six, 12 months down the line. [18:52] We're now moving closer to Gemini [18:55] and to that vision of that single most powerful model. And you will see that, I think, with some of the other modalities. And along the way, we'll release these experiences that are just really powerful and really exciting in that modality. So Veo 3 was really awesome because it brought audio into video generation, right, in a way that we haven't seen before. Genie 3 was really awesome because it lets you navigate a world in real time. And so in order to push that frontier, [19:25] model. And so to some extent, these specialized models are kind of a testing ground. But I would expect that over time, Gemini should be able to do all these things. That's so interesting. [19:36] Okay, we've got to ask you about the name. It's an amazing product, and I suspect that the name gave it a little bit of a boost because it's so easy to remember and so distinct. So was it a happy accident, or is there some creative genius who knew that this was going to be just the right name?

20:06-21:47

[20:06] And part of that is you give a code name. If anyone hasn't used LMArena, [20:10] you get to put in your prompt, and you'll get back two responses from two models. They have code names until they're [20:16] publicly released. And I think it was like, [20:20] we had to... [20:22] we were going out at like 2 a.m., and [20:25] Nicole is our wonderful PM. There's another PM you have, Nina. And [20:29] someone messaged her being like, [20:31] what do we name it? And she was really tired and exhausted, and [20:37] this was the stroke-of-genius name that came to her at 2 a.m. This is you? It was not me. It was somebody on my team who named the model. I can't take credit for this. Another one of our PMs. But what was really awesome was, A, it was really fun. I think that really helps. It's easy to pronounce. It has an emoji, which is critical for branding. She didn't overthink it. And what was awesome is everybody just went with it once it went live. [21:07] felt very Googly and very organic and ended up looking like a stroke of marketing genius. But no, it was a happy accident, and it just sort of worked out and people loved it. And so we leaned into it, and now there are bananas everywhere when you go into the Gemini app, which we did because [21:22] people were complaining that they were having a really hard time finding the model when they came into the app. Yeah. And so we just made it easier. Yeah. [21:30] Yeah, exactly. Publicly people were like, Nano Banana, Nano Banana, how do I use Nano Banana? I had someone at Google I work with be like, how do I use Nano Banana? I was like, it's Gemini. It's right there. Just ask for an image.

21:49-23:21

[21:49] Yeah. But I think that's the thing. I think Google's always had this really fun brand, right? It's been a consumer-oriented company [21:58] since its inception, and I think it was really nice [22:01] to play on that [22:03] image people have of Google as a [22:07] fun place, fun company, [22:10] and have this fun name. It's also just a really nice path to [22:15] fun being kind of a gateway to utility, right? I think Nano Banana, and just the model in general and what you can do with it, like put yourself on the red carpet or do all the childhood dream professions you had, is a really fun entry point. But what's been awesome to see is that once people are in the app and they are using Gemini, they start to use it for other things that then become useful in their day-to-day life. Like you use it to study and solve math problems, or you use it to learn about something else. [22:45] not just with the naming, but also just with the products that we built, because it kind of gets people in, gets them excited. And then it helps them discover other things that the models are awesome at. Yeah. [22:54] I think other users, like my parents and their friends, are using it. I think it's because they had this reputation. It was really easy. It was really fun. It felt unintimidating to try. Mm-hmm. [23:06] You try it and you're like, actually, this works very easily. It's very easy to interact with. [23:12] There's no, like, technological... [23:15] You know, technology, I think, can sometimes [23:17] be intimidating to people, especially AI right now. And I think

23:21-24:52

[23:21] the chatbot naturalness has [23:23] broken a lot of those barriers, but maybe more so with younger people. Yeah. And I think this fun... [23:29] Yeah, my mom [23:30] was making these images and having a great time, and [23:35] then realized she can use it to remove people from the background of her images. These very practical things, right? It started very silly, turned very practical. [23:43] Then people can use it to realize, actually, it can [23:46] give them the diagrams or help them understand stuff. [23:49] I think there's also a big accessibility component. Yeah. Where do you want to take it from here? Maybe both from a model side and from a product side. [23:57] On the product side, I think there's [24:00] kind of a couple of areas. On the consumer side, I still think we have a long way to go to just make these things easier to use. Right. You will notice that a lot of the Nano Banana prompts are like 100 words long, and people actually go in and copy-paste them into the Gemini app and go through the work to make it work because the payoff is worth it. But I think we have to get past this prompt engineering phase for consumers and just make things really easy for them to use. I think on the professional side, yeah, [24:27] we need to get into much more precise control, robustness, and reproducibility to make it useful in actual professional workflows. Right. So yes, we're very good at editing consistency and not changing pixels, but we're not 100 percent there. And when you're a professional, you need to be 100 percent there. Right. You really need these precise, maybe even gesture-based controls over every single pixel in the frame.

24:53-26:24

[24:53] So we definitely need to go in that direction. And then I think there's like a general direction that I'm really excited about, which is just about visualizing information. So the example I had about sketchnotes at the beginning and somebody kind of hacking their way around using Nano Banana for that use case, you could just imagine being able to do that for anything. Right. And a lot of people are visual learners. [25:23] you to consume, right? So sometimes it's a diagram, sometimes it's an image, and sometimes maybe it's a short video, right? That you want to learn about some concept that you're learning in a biology class or something like that. So I think that's like a completely new domain that I'm really excited about, just these models getting better and getting past the point where, you know, 95% of the outputs that you get out of these models are just text, which is useful, but it's not how [25:53] then are you alluding to the fact that you might want to vertically integrate and build a little bit more product around it? And also, are you alluding to the fact that maybe... [26:01] The way you interact with some of these models isn't just through pure language and prompting over time, but more UI. [26:07] Yeah, yeah, I definitely think [26:10] The chatbots, I think, are an easy entry point for people because you don't have to learn a new UI. You just talk to it and then you say whatever you want to do. Right. I think it starts to become a little bit limiting for the visual modalities. And I think there's.

26:24-27:58

[26:24] a ton of headroom to think about, like, [26:26] what is the new visual creation canvas for the future? And how do you build that in a way that doesn't become overwhelming? Because as these models can do more and more things, it's very hard to explain to the user, in something that's very open-ended, what the constraints are and how you work around them. How do you actually use it in a productive way? So I'm really excited about people building products in those directions. And for us, we have a team called Labs [26:54] at Google that's led by Josh Woodward. They do a lot of this kind of frontier thinking and experimentation. They work with us really closely, where they take our frontier models and they think about, what's the future of entertainment? What's the future of creation? What's the future of productivity? And so they've built products like NotebookLM, and Flow on the video side. And I'm excited that maybe Flow could become this place where you could do some of this creation and think about what that looks like in the future. I think in the short [27:24] this model has things that it's not perfect at. And so in the short term, it's obviously... [27:29] it should work the way you expect it to every time, not just a lot of the time, and really make it so seamless, [27:37] and fix all these small things where it's just a little bit inconsistent in its performance. [27:44] I think long-term, [27:46] Nicole covered that, which is, to me, it's in order to have [27:52] that reality of really [27:54] rich multimodal generation. So like right now, if you ask

27:58-29:45

[27:58] Gemini to explain something, [28:00] it'll usually just explain in text unless you ask it for images. But if you think about the platforms that have really taken off in the last [28:07] 10, 20 years for learning, right? We think of Khan Academy, which started on YouTube. We think about... [28:14] Wikipedia has a lot of images. It's very image-focused. If you look up any math thing, you like diagrams. [28:19] That should become more of [28:23] a natural part of the flow and a part of the way you use these models. And to enable that from a modeling point of view, it goes back to what we were talking about, this multimodal understanding and seamless generalization between modalities. [28:38] Maybe the other interesting area, as we think about these models being more proactive at pulling in, whether it's code or images or video, when it's appropriate for the user intent... I started out as a consultant in my career, and so obviously I made a lot of slide decks in my time. I still do. And I think there are some of these use cases where you don't actually really want to be in the weeds of creation. What you really want is, let's say you're updating your stakeholders on how a project is going, right? [29:08] in some context. Maybe it's meeting notes. Maybe it's a couple of bullet points. Maybe it's some other deck that you've created in the past. [29:16] And then you maybe just want Gemini to go off and do all the work for you, right? Pull that deck together, format it, create appropriate visuals to make it really easy to digest. And that's something that you probably don't want to be involved in, and it gets more into these agentic behaviors. 
Versus, I think, for some of these creative workflows, you actually want to be creating, you want to be in the weeds, you want to think about what the UI looks like that makes it easy for a user to accomplish the goal. And so if I'm designing my house,

29:45-31:18

[29:45] and I'm actually into designing my house, then I probably actually want to play with it and play with textures and different colors and what would happen if I remove this wall. And so I think there's kind of this spectrum of very hands-off, just let the model go off and pull in relevant visuals, materials, for a task that makes sense, all the way to how do you actually make a creative process more fun and remove the tedious parts and remove the technical barriers that exist today with tools that we have. It's like this mix of giving the user fine-grained control, [30:15] control they want, but also at the other extreme, having the model be able to [30:21] understand the user request and anticipate, right? Like, [30:25] the need and the outcome that it should be and do all the intervening work in between. [30:30] Yeah, it's almost like when you actually hire a professional for something today, right? Like when you hire a designer, you give them a spec and then they go off and then they do all that awesome work that they do because they have all this expertise. And so these models should be able to do that. And they can't really do that in many domains today. What do you think the next competitive battleground is in this world? [30:52] I think. [30:53] there's still work to be done on making these models more capable. And so this idea of having a single model that can take anything and transform it into anything else, I think nobody has really figured that out. But I do think in order to actually... [31:07] drive adoption, there's probably two things. One is user interfaces. [31:11] Like we still rely very heavily on the chatbots. And we talked about this, like it's useful for some things and it's a great entry point.

31:18-33:02

[31:18] but it maybe isn't useful for all the things. And so I think starting to think much more deeply about, [31:24] Who are the users? What are they trying to do? How can the technology be helpful? And then what product do you build around it to make that happen? It's probably one. Do you think five or 10 years from now, the frontier will be advancing as quickly as it has advanced over these last few years? Five to 10 years from now feels like 20 years from now. Just the space, and you guys probably see this too, like the space is moving really quickly. Yeah. And it's, [31:52] If you asked me two years ago, I would have told you the space is moving really quickly. If you ask me today, I will tell you it's moving faster than it was two years ago. [32:01] Okay, I'm going to ask you a very different question. [32:05] So I know Google's very sort of careful and very concerned about deepfakes and that sort of thing. And I have to imagine when you saw how capable this model was, there's a big conversation about, okay, well, how are we going to make sure people don't use it in the wrong sorts of ways? How does that sort of a conversation go inside of Google? And are you guys sort of like happy with where it ended up? [32:29] I think it's an ever-evolving frontier, also. Because it's this mix of... [32:35] you want to give people the creative freedom to be able to use these tools, right? And you want to give users control to be able to use these tools in a way that doesn't feel overly restrictive. And you want to prevent the worst harm, right? I think that's always the balance that we spend a lot of time talking about. And so obviously, when you look at the outputs of the model, there's a visible watermark that says it's been generated with Gemini. So that immediately indicates that it's AI content.

33:05-34:58

[33:05] produced with our models, image, video, audio, there's SynthID embedded, which is invisible watermarking. And so those are kind of the visible ways or invisible ways in which we verify that our content is AI generated. We're very invested in it and we believe that it is really important to give users those tools to be able to understand that when they're seeing something, it's not a real video or it's not a real image. [33:30] And then obviously, when we develop these models, we do a ton of testing internally and also with external partners to kind of find as the models get more capable, you find new attack vectors, right? And like new ways that you have to mitigate for. And so that is like a very important part of model development for us. [34:00] don't create harm, but also still give users the creativity and the control in order to make these models usable in a product. [34:08] I mean, I think it's a very, very hard balance to strike, right? Because... [34:14] You will always have people using a tool in good faith. You'll also always have people using it [34:18] in bad faith. [34:20] And I think [34:22] I think it's hard. It's like, is it a tool? Is it something that has responsibility? So I think we take this very seriously. [34:31] users obviously are also responsible for what they do with the model. But SynthID really is an important technology that lets us like release these capabilities to people and have some faith in that we can still verify, right? And have a tool to combat the risk of misinformation. But it's a super tricky conversation. And I think it's one that I've seen everyone take very seriously. There's a lot of...

34:58-36:36

[34:58] A lot of conversations about how to balance both. [35:01] Is that the standard now across the industry? SynthID. It's a Google standard. It's the Google standard. I believe there's like every Google... [35:09] Like Imagen, the Imagen line, Veo, they all have SynthID when you use them in any product surface. [35:16] All right, you told us we can't go five to ten years down the road because things are moving too fast. We'll go one to three years down the road. Thank you. [35:25] Two questions. One... [35:28] uh, [35:29] What will be possible that we can only dream about today? [35:33] uh [35:34] and two, [35:36] What will the resulting change be to the way that we all live our lives? [35:40] I really hope that a year or [35:43] Two from now... [35:44] You could really get... [35:46] personalized tutors, personalized textbooks in a way. There's no reason why you and I should be learning from the same textbook if we have different learning styles and different starting points, but that's what we do now. That's how our learning environment is set up. And I think... [36:03] across all these breakthroughs, like that should be very possible, where you have an LLM tutor that just figures out your learning style, what are the things you like, maybe you're into basketball, and so I need to explain physics to you with basketball analogies, right? And so I'm really excited about that. [36:19] learning just becoming way more personalized. And that feels very achievable. And we obviously have to make sure that we don't hallucinate and there's like a high bar for factuality. And so we need to ground in sort of real world content, but that I'm really excited about. And that really, I think, just

36:36-38:18

[36:36] it removes a lot of barriers for people, right? To your question on like what the impact is going to be. I think it just becomes much more, it becomes much easier to learn basically anything in a way that's very tailored to you that you just can't do right now. Could that be a Google product surface? [36:54] Okay. [36:55] Somebody should look into it. [36:57] Yeah, and I think for the way it'll... [37:00] change [37:02] how we live and work. I think, I think we, we, [37:05] I think working on these technologies... [37:09] I've already seen how it changes the way we work, right? Because we obviously use them. [37:14] a lot. I'm getting married, we made our save the dates with our model. [37:19] And [37:19] So what I really think we'll see is and just work the amount part of I think the reason that the innovation has accelerated is we have these models. You have like code assistants, you have just like you can use models to like filter things, to analyze huge amounts of data like. [37:39] it's drastically increased our own workflows. Like what I can do this year versus two years ago is just like an order of magnitude more work. And I think that's true of the tech industry. It's not true of a lot of other industries, just because that integration into their workflows or into their tooling hasn't happened. [38:01] So I think, you know, some people are like, oh, [38:04] it's going to replace me. But at least what I've seen is it really just actually changes the amount of work an individual can get done. What that means for businesses or economically, I'm not sure, but I think it means we will just see people be more empowered

38:18-39:40

[38:18] to hopefully do more in the same amount of time. Like maybe you don't have to, you know, I have friends who are in consulting and spend a lot of time. They're like, I should spend a lot of time, like two hours making slides, tweaking, moving logos around. And like, hopefully they won't have to do that. They can actually spend time, [38:35] thinking about what the content of the slides should be working with clients. [38:40] And I think that that's hopefully what we will see in like one to two years. [38:45] Given the trajectory that you see in these capabilities, are there interesting areas that you think startups should go do that Google itself might not get into? Yeah. [38:54] I think there's a ton of spaces, even just in the creative tools. Like, I think there's a ton of [39:00] room for people to figure out like what do these UIs of the future look like? Like what is the creative control? How do you bring everything together? We see a lot of people in the creative field work across LLMs, image, video, and music [39:14] in a way where they have to go to four separate tools to be able to do that. So like, a lot of people ideate with LLMs, right? Like, give me some concepts, like use an idea that I have. Once you're happy with that, you take it to an image model, you start to think about where are the key frames that I want to have in my video, you'd spend a lot of time iterating there, then you take it to a video model, which is yet another surface. And then at some point, you want to have sound and music and mix it all together. And then you actually want to do maybe

39:44-41:16

[39:44] tools. That feels like these kind of workflow based tools are probably going to spin up for a lot of different verticals. So creative activity is just one example of it. But, you know, maybe there might be one for consultants so that you can more efficiently make slide decks and presentation and pitch decks to clients. And so I think there's a lot of opportunity there that, you know, some of the big companies may not go into. Yeah. There's a lot of like, how do we make this technology useful for [40:12] X workflow, right? Like sales finance, like, [40:17] I'm saying a lot of things I don't know about in companies like financial workflows, but I imagine there's like a lot of tasks that could be automated, could be made. [40:25] much more efficient. Yeah. And I think startups are in a good position to really like go understand the specific client use case need that niche need. [40:36] and do that application layer, right? Versus what we really focus on is the fundamental technology. [40:42] Um, [40:42] I think I'm just really excited... [40:46] by the number of people [40:48] who've been excited. [40:50] Thank you. [40:51] by this model. Yeah. If that makes sense. Like, [40:54] A lot of people in my life, like a lot of aunts, uncles, parents, like friends, like they've used chatbots. They ask it. [41:01] things, they get information. My mom loves to ask chatbots about health information, but [41:07] There's something about visual media that really excites people. It's like the fun thing, but it's not just fun. It's exciting. It's intuitive.

41:17-42:48

[41:17] the visual... [41:18] basis so much of how we as humans experience life that I think I've loved how much it's [41:24] moved people like emotionally in excitement wise like i think that's [41:30] been the most exciting part of this for me my kids love it yeah he uh my my three-year-old son tied our dog leash which is this like fraying you know brown rope like over himself so he looked like a warrior i took a picture of him and turned him into this warrior super yeah exactly it makes him feel superhuman yeah and my husband will read so he uses google storybook to to [41:55] read him these stories about lessons that he learned in school. You know, if he, if there was like an incident on the playground with another kid or adjusting to a new school and it's made, I mean, it's made these characters that look like him and my husband and me and our dog and our, and our daughter in these fun stories and lessons that we're trying to teach him to the personalization that you talked about. So I really, really love this future. It's, it's going to be totally [42:25] This is a story for, you know, one or five people that you would have never had made. Right. Like and other people probably don't want to read it. I would love to if you want. Yeah. But I think we're really now making it possible to like tell stories that you never could. And in a way where like the camera allowed anyone to capture reality when it became very accessible, you're kind of capturing people's imagination.

42:55-43:37

[42:55] in a way that they just couldn't before because they didn't have the tools or they didn't have the knowledge of the tools. Like that's been really awesome. [43:02] That's a nice way to put it. Thank you so much. Thank you for having us. It was awesome to have you. [43:07] Thank you.
