
Gemini 3 Pro finally here, Opus 4.5 surprises everyone, GPT 5.1, and more | Episode 5 Rate Limited

By Rate Limited

Summary

Key takeaways

  • **Codex fixes truncation**: Codex now truncates tool output by token limits instead of line-based rules after feedback, with a config flag `tool_output_token_limit = 25000` to raise the limit from the 2.5k-token default. [01:35], [01:49]

  • **GPT 5.1 frustrates coders**: Adam found GPT 5.1 slower than Composer 1, with worse design chops and no real improvements; Ray noted it reverts to old Tailwind v3 despite rules, preferring training-data habits. [03:13], [04:28]

  • **Gemini 3 excels at plans**: Adam switched to Gemini 3 Pro for phenomenal planning, execution across TypeScript/Python/React/Vue/PHP, and steerability without GPT's stubbornness, though it occasionally forgets tools and adds redundant comments. [10:15], [11:07]

  • **Opus 4.5 tops benchmarks**: Opus 4.5 scores #1 on repo bench with thinking enabled, outperforming Gemini 3 (ranked 26th); it's a smarter Sonnet, great for multi-domain engineering and testing, with less cleanup than Sonnet 4.5. [28:11], [24:12]

  • **Legacy code resists AI**: AI struggles with legacy codebases from decades ago with lost context, misleading tests, and hidden interdependencies; engineers fear changing code humans barely understand, amplified by AI's blind spots to proprietary flaws. [51:05], [52:16]

Topics Covered

  • Full Video

Full Transcript

Ladies and gentlemen, you are tuned in to the Rate Limited podcast with your host Ray Fernando. We also have Eric Provencher, the founder of Repo Prompt. And we now have Adam Larson of GosuCoder. We have made it over the 1,000 subscriber mark and can afford Adam's last name. Welcome to the show, ladies and gentlemen.

This show is about AI practitioners right in the weeds of everything, building every single day.

There have been so many new model releases in the last couple of weeks: GPT 5.1, Gemini 3, Opus 4.5, Grok is in the mix, Kimi has some crazy stuff

going on. We also have some updates on the last episode with Codex and their truncation issue, a whole bunch of other things with Dwarkesh's podcast talking about ASI with Ilya, and so much

more. So ladies and gentlemen, we're excited to have you on the show. This is going to be a really packed episode, so make sure you stay buckled up, follow those timestamps, and we'll catch you soon.

Yeah, thanks for that intro, Ray.

So, uh, to start things off, I just wanted to do a quick follow-up on last week's episode, where we talked a little bit about issues around Codex truncating context with tool calls. So

if you're not familiar or you didn't listen to that episode, the key issue is that, unlike other AI coding agents, Codex was doing an optimization where, if a tool call returned too much (say it's going off and reading a file or reading some bash output), it would basically cut off the middle part of the output to fit some line-based rules. And so

following that discovery, I had opened a GitHub issue, made some noise about it, and tweeted about it. And they actually addressed it in last week's release, and now all of the truncation is done by token limits, which is a big step up. And

they added a configuration flag which you can use as well in the config file. So if you are feeling that this is an issue for you, you can reach out in the comments, or we can put a little note in there to say what the flag is, and you can go ahead and set it if you want. But you can actually fix it yourself if it's a problem, and it's great to see their response to feedback. So, I

just wanted to put a little close on that saga. It's nice to see it done.
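The mechanism Eric describes, cutting the middle out of oversized tool output by a token budget rather than a line count, can be sketched roughly like this. This is a toy illustration, not Codex's actual code; whitespace splitting stands in for a real tokenizer:

```typescript
// Sketch of token-budget "middle truncation" for tool output: keep the head
// and tail and elide the middle so the total stays under a token limit.
// Whitespace-split is a crude stand-in for the agent's real tokenizer.
function truncateMiddle(text: string, tokenLimit = 2500): string {
  const tokens = text.split(/\s+/).filter(Boolean);
  if (tokens.length <= tokenLimit) return text; // under budget: pass through

  const keep = Math.floor(tokenLimit / 2);
  const head = tokens.slice(0, keep).join(" ");
  const tail = tokens.slice(tokens.length - keep).join(" ");
  const omitted = tokens.length - 2 * keep;

  // The elision marker tells the model how much it isn't seeing.
  return `${head}\n[... ${omitted} tokens truncated ...]\n${tail}`;
}
```

Per the episode summary, the related Codex config flag appears to be `tool_output_token_limit = 25000` in the config file, which raises the 2.5k-token default discussed below.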

It's too bad the limits are still quite low, at 2.5k tokens, which is not a lot. But

anyway, it's good to see them fix that. And since that happened, actually, that release that fixed it for the Codex CLI came alongside Codex 5.1 Max, which is a mouthful of a model name, to say the least.

And you know, one thing we didn't mention in the last episode was that GPT 5.1 came out right as we recorded. So that's a crazy thing.

So I wanted to take a second here to talk to both of you guys, Adam and Ray. Let's start with you, Adam. What are you thinking on this new GPT 5.1, and Max, if you've tried that one?

>> Yeah, it was very interesting, cuz we recorded and then that very day GPT 5.1 came out. I have only messed with GPT 5.1, but as many of you know, over the last nine or ten days I have been very heads down coding. So, to me, that's what kind of proves out whether a model is valuable or not. 5.1

was kind of frustrating to work with for me. I didn't feel like much had changed.

I felt like its design chops had actually gotten worse. Uh, to give you a little bit of what I was doing before: I was using Composer 1 pretty much solely, because it's so fast. I understand the codebase. I'm kind of feeding it what I want to do. GPT 5.1 is so much slower. And

because of that, my iteration speed comes way down. But I also didn't feel like it was doing things that made sense. Something just felt off about the model. Ray, I showed you that thing earlier today. From a design chops standpoint, it's just weird. It does some weird things that I would not expect a model to do. Now, some people I've heard love the model, but for me, I got very frustrated with it and quickly went off of it. I don't know if you guys

agree or disagree. Would love to hear that.

>> Well, what were your thoughts on it, Ray? Did you get much chance to test it out, or were you focused on other models for the last couple weeks?

>> I tried it out, and compared to Sonnet 4.5 high, which I've been using a lot in Droid, I felt that the model just wasn't fast enough for me or [laughter] accurate enough. I think there were a lot of things. I'm using a lot of Tailwind classes, I'm using Tailwind v4, I'm using the whole tokenizing system, and for whatever reason it keeps reverting back and inserting stuff from the older training set. And that's

kind of what I felt was a little bit shocking, because I have rules files, I have various things in my codebase, and it seems to prefer a lot of things in its training set, like Tailwind v3 stuff, all the older stuff. I haven't really checked, but I think the cutoff date for this model is actually still older, you know, for the base training data, and it seems to be reinforcing older habits, things that I wrote rules for that it surprisingly didn't follow, which GPT 5, the 5.0 model, did follow well. And so after a couple rounds of that, I just said, there's already other good models in the pipeline, so might as well just use them. And so I just kind of stopped using it right

away. Yeah.
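For context on the v3-versus-v4 drift Ray describes: Tailwind v3 configures the theme in a JavaScript `tailwind.config.js`, while v4 moved to CSS-first configuration. A minimal sketch of the v4 style a rules file would be pushing for (values are illustrative):

```css
/* Tailwind v4: CSS-first configuration, no tailwind.config.js required */
@import "tailwindcss";

/* Design tokens live in CSS via @theme */
@theme {
  --color-brand: #6366f1;
  --font-display: "Inter", sans-serif;
}
```

A model stuck on v3 habits will instead emit a `tailwind.config.js` with `module.exports = { theme: { extend: { ... } } }` plus `@tailwind base; @tailwind components; @tailwind utilities;` directives at the top of the stylesheet, which is exactly the regression worth checking for in review.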

>> Yeah. Well, I mean, those are interesting thoughts there. Using it for design: I've never found the GPT models to be the right ones to use for that purpose, so that's quite expected to me. But I did get a chance to use it quite a bit, just more in ChatGPT, and now with 5.1 Pro that came out, I've seen quite a step up in terms of how thorough the model is. I think the base 5.1 is faster, and I think that's nice, and I tend to use these models not so much for design expertise but for detailed planning and

architecture, and I still think these models are best-in-class at that. That's been the case pretty consistently. What I did find interesting, though, is when 5.1 came out, I put it on the benchmark, repo bench, which is the Repo Prompt benchmark, and it actually shot up to the top. It was number one. But what was very interesting to me as well is that only the low version was at the top, the low-thinking version, not the high-thinking version. There's a lot of speculation as to why that might be, and my thinking on it tends to come down to

this. So when I look at GPT models compared to Claude and other models, what I've noticed is that they are quite sensitive to noise in their context window. If you have a lot of unrelated context in your context window, that does tend to distract the 5.1 models quite a lot. And the repo bench really stress tests that. And then

also, I've noticed that GPT models in general are just not as good as Claude models at doing file editing, and there are other models that are doing really well at that. I've seen it be quite sloppy at file editing as well. But with low thinking it seems to do better. It gets less in its own way. And when you have less reasoning, you're adding less noise into the context. If you think about reasoning, it's like you're adding tokens to the context window to think about the problem before you get to an answer. All of this reasoning sits in the context window, and if the model is more sensitive to noise in its context window, then its own thinking can get in the way of a good answer. Which is surprising, and it doesn't happen with Claude models, which is interesting; they are much better at dealing with that noise in the context window. So I just wanted to put a little caveat on that, a little thought there.

Now, in terms of the Max models: have you guys tested those ones out in Codex? Have you played with the Max models at all?

>> No. No. It's a no for you, Adam, and you, Ray. None at all. Okay.

>> Yeah. I'd love to hear your thoughts.

Yeah.

>> Yeah. So, I mean, I played with it a fair bit. I used it. I find that, unlike when you're prompting Claude or some other more general-purpose agent models, which tend to be more flexible and conversational, where you can kind of run with them, try some things out, course correct, and keep going, GPT models are a little bit stubborn in terms of trying things. If you're deep in a context window and you're like, "Okay, actually, let's try this thing instead," it's like you're fighting its inertia to try and do something else and pull away from that thing that you've set. It's very sensitive to the accumulated work that it's done and the accumulated instruction. So the bigger your use of the context window is, and the more detailed your prompts are, the harder it is to get it to change course when you're deep in the context window.

And that remains true with Max. But I find with Max, if you do give it a good plan, it's quite the robot. It's able to go ahead and execute beautifully and get some work done.

But again, down to the file editing things I mentioned, it is a little bit sloppy on files, especially if you use tabs in your code. What is interesting to know: I don't use Windows, but they did say that when that model came out, they had trained it on PowerShell environments. So if you are on Windows, this is the first model that I would consider usable on Windows in the native terminal, which is really interesting. So yeah, good thing to note. And it is a good model, smart, very slow though, and if you're good with that, it's worth looking at. Cool. So that

was a little wrap on the GPT series.

What happened next is Gemini came out with their Gemini 3 model, and that was a big drop in the industry. It really shook things up. A lot of people had a lot to say about it. But before we get into the vibes, I want to hear: Adam, did you get a chance to play with it much?

>> A ton.

>> Okay, great.

>> Out of all the three models that came out, that's the one I've used the most.

>> Okay, great. And what were you thinking?

So, the way that I would put this: up until that point, we were heads down on a tight deadline. So we're doing Composer 1. I'm hitting GPT 5.1, giving up on that, going back to Sonnet. So I'm on Composer 1 and Sonnet. Then Gemini 3 Pro came out, and it's all about whether I feel like I'm getting done what I need to. I felt like the model was phenomenal, honestly. I felt like I was able to one- and two-shot plans; I was getting really great plans back and actually executing on them. And honestly, the speed, while slower than Composer by a long margin, didn't matter as much, because it was still snappy enough, and the quality I was getting was extraordinary.

Now, I was doing a lot. Just the list of things I was working on as part of this project: a lot of data engineering stuff, TypeScript, Python, we had React portions, Vue portions, and some PHP stuff that was actually in there. So we're working across a lot of systems. I never ran into anything that it did a horrible job at. It did a great job letting me steer it. So, for example, to your point, Eric, about what drives me nuts with the GPT models: they get very stubborn, almost to the point where they start ping-ponging back and forth, where I tell it to do something and then it forgets that it's supposed to be doing this new thing and goes back to the old thing. I never felt that with Gemini 3. But there were a couple things with Gemini 3 that were slightly irritating.

Every once in a while, in a context flow, it would just forget how to call tools. It would talk to itself about the tools it should be calling, and then it would think it was calling tools, but all of that just got put in text. Once that happened, there was no way to correct it. So if I was working on something, that's kind of just a broken chat at that point, and I'd have to start over. That's kind of frustrating. Then you've got to rebuild all that back up again, get your context where you want it, and go. Not the end of the world. Across the days that I used it, it only happened maybe three or four times, but it was enough to be irritating.

The other thing that drives me nuts: if I tell it to go do something, it's a very comment-hungry model. I like comments, but I don't like comments that [snorts] basically regurgitate what I told it to do. It's just a really annoying thing. No matter how I did it, it would just leave comments in there.

>> Like, leave code comments, or would it just comment that it's thinking?

>> Leave code comments. So, for example, I'll be like, "Hey, I want to switch this thing up," and then it would put a comment in there: "Switching this thing based on request." That's not what I want in my code. You know what I [laughter] mean? I don't want that in there. Maybe I'm just being picky. And that's

>> Well, you know, when Gemini 2.5 came out, it was notorious for the same problem. So it's interesting that that persisted through here. Anyway, keep going. I just wanted to mention that.

>> Yeah. I guess the final thing I would say is I really, really love the model. I think it does a great job at frontend stuff. I think it's fast compared to the quality you're getting out of it. I think it does a great job at context gathering and building plans. You know, there are some quirks with it, but when we get to the next model, we'll talk about that. You could look at my usage: it goes Composer 1, Sonnet 4.5, a little bit of GPT 5.1, back to Sonnet 4.5, then Composer 1, to all Gemini 3. And then the second Opus came out, I swapped over to Opus. And Opus is

>> Well, let's come back to Opus on that one.

>> I won't touch any, but it's just a tough one.

>> Yeah. Yeah. Okay. And Ray, how did you feel about Gemini?

>> I used it in AI Studio. I also used it in Droid. And I tried to use it in Antigravity. First, with AI Studio, I found it to be extremely impressive at design and at understanding really large, complicated instruction sets. I actually put it up against Opus.

I think at that time I did Opus 4.1, and I was basically doing a, um, Kilauea, sorry, the big island here in Hawaii is currently erupting, and I wanted to figure out the time window at which it would do these things. So I had these really, really long prompts and a lot of research and USGS government documents. I wanted to see, can I vibe code something together? Like, if I was a government employee, you have to look through five different sites and be a geologist to determine this five-day window of whether it's going to erupt and analyze this data. And so, you know, Opus did its little design thing, and on the web it just looked like a really simple app. I threw that into AI Studio, and in one prompt I thought, let me just see how much it can chew on just from the web. It did what I hoped, like back in the day with o1, when GPT would just take in all the different instructions.

>> Mhm.

>> Not only did it make the site in just one prompt, but it was already mobile friendly. It had this data connection. It had so many details that I was like, how did it do this? I didn't even tell it a design style. It came up with a beautiful, cohesive design, and a design hierarchy as well. And there I was like, this is amazing. So I tried a similar prompt inside of Droid, the CLI, and it didn't come out the same way. I also tried it inside of v0 as well, and it didn't come out the same way either. And I was a little surprised. It got me thinking there's something in AI Studio that didn't have

>> How did you put the context into AI

Studio?

>> Uh, in AI Studio I just literally copied and pasted the entire big prompt.

Yeah.

>> Yeah. So, on that note, I think that's a common thread for most coding tools: when you include the full context of your project, as much as you can (and that's the basis of what I built with Repo Prompt), you get better results out of these models. They just see more of what you need them to see, and less of what you don't. They don't see threads of tool calls, and they give better output; they're not reading small slivers, they're getting the full picture and are able to work. So I just want to comment on that: I don't think it's just AI Studio. I think it's just the way you're prompting it. Yeah.

>> Yeah. I always do that, especially across all the different tools I use, because I get to see what the actual agents, the harnesses or whatever is sitting on top of the model, are doing. Are they picking things apart and then only sending certain things in that prompt? And there's also a hierarchy that they can apply. That's what's been interesting about Repo Prompt, right? Because you reveal different pieces of that and just say, here is your entire prompt that you're going to send raw to the model. And so that got me really thinking. For Antigravity, I just couldn't get past the rate limits. With one prompt I was pretty much done.

>> It looks so promising. But I just recently heard that if you have a Workspace account, so I have my domain and .com in a Google Workspace, I don't have the same limits as just a regular Gmail account. Interesting.

>> And so somebody said if you just switch to your Gmail account, you're probably actually going to have higher limits. Now you tell me, bro. So

>> I I have

>> So I mean, that's a big story about Gemini: the best place to use it typically hasn't been Google's tools. I mean, AI Studio has been great; it's nice that you get a lot of usage for free. But Antigravity, the CLI, they put such tight rate limits on. And on the CLI in particular, only as of Monday this week, where we're recording on the 26th, did they roll it out fully; they had a wait list, and only now have they rolled it out, so you don't need a wait list. As long as you're on a paid plan, you have access. Before that, you'd sign up on the paid plan, and they'd say, hey, you get more Gemini 3 usage on the CLI and on Jules, and Antigravity is not in that list, but then it wasn't true, because you needed to be on a wait list, and now it's finally true. If you had the Ultra plan, that was like $300, or 300 Canadian, like $200 a month, $250 or something, I don't know, then you got immediate access, but that was a whole rigamarole to get access to. And then I saw earlier this week they actually pulled all Gemini 3 usage from Jules. So now it's back to 2.5 for them. And then Antigravity, you can't pay for it; it's only the free tier. So

that's kind of annoying too, if you want to get use of it. So I actually did try Antigravity a fair bit too. And I think the cool thing about it is that it comes with a Chrome extension that you can set up, and then when you're working on a web app, it's able to run the web app in Chrome, inspect what it's doing, take screenshots, scroll through your page, click around, and then review its work automatically in a tight loop. And I found that was helpful. It didn't work perfectly when I tested it. There were some clear issues; it took screenshots at weird places. So I think there's some work to be done there, but I think in the future that kind of integration is going to be important. And of course, I just hit errors right away on Antigravity and I wasn't able to use it. And surprisingly, Sonnet was working better than Gemini 3 on there, which is very surprising to me.

>> Yeah. I also want to add to what Adam was saying earlier about comments. That's the other thing I kept seeing with this model. It wasn't as crazy as the previous 2.5 Pro, but they would show up. And also, it's extremely convincing at writing code; if you don't review it, it will convince you that what it's doing is the right technique. You have to be very, very careful with it. And I don't know how, but I got a little cautious. I was like, this looks really good, and then I actually ran through the data model. And I

was like, wait. I use Convex a lot for the database, and I actually have specific instructions; the reason I use Convex is that everything can be reactive. And I noticed that the Gemini models got extremely aggressive about, hey, I can write all the state on your frontend, so that it'll be doing all the stuff, and, see how smart I am, here are some comments that lead you through it. And I was like, that's not what I want. Everything's on the backend. Convex has just a single query that I can pull in, and the app refreshes. I don't need you to write extra things to justify why you're so great at doing this. I understand that, but that's not needed for this framework.
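The pattern Ray is defending, one backend-owned reactive query instead of mirrored frontend state, boils down to a single subscription point. This is a plain-TypeScript sketch of that shape, not Convex's actual API (names are illustrative):

```typescript
// A minimal reactive store: one backend-owned value, and UI code subscribes
// to it instead of copying state into its own variables. Illustrative only;
// Convex's real client API differs.
type Listener<T> = (value: T) => void;

class ReactiveQuery<T> {
  private listeners = new Set<Listener<T>>();

  constructor(private value: T) {}

  subscribe(fn: Listener<T>): () => void {
    this.listeners.add(fn);
    fn(this.value); // deliver the current value immediately
    return () => {
      this.listeners.delete(fn); // unsubscribe handle
    };
  }

  // Only the backend mutates; every subscriber refreshes automatically.
  set(value: T): void {
    this.value = value;
    this.listeners.forEach((fn) => fn(value));
  }
}
```

In real Convex code the equivalent move is subscribing the UI to a single server query and letting backend mutations drive every re-render, rather than having the model duplicate that state into frontend stores.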

>> Patting itself on the back like oh I did such a good job. Oh look at that.

>> Yeah. It's like the Obama giving-himself-a-medal meme. That's what I felt. I was like, okay.

>> Oh man.

>> I was like, that was interesting. But, you know, I'm using this framework, and it's not in the training set as often. That's what's been really interesting.

>> Always a big challenge.

>> It is.

>> It's Yeah. So,

>> I agree with you. Day one of Gemini 3, by the way, was pretty rough, I think, for everyone. At least for me. I used it a few times, was able to get some prompts through, then it errored out. So on day one I had been a little bit frustrated just with the availability. So I was like, okay, give it a day or two to smooth out. You know, you mentioned all of the capacity issues that they're having across the board.

>> I kind of wonder if Opus 4.5 has alleviated some of that for them.

>> I wonder, because, you know, when a new model comes out, literally everybody's hammering it 100%; that's all we want to use. So I do wonder what it's at now, because I was doing some evals this morning, getting ready for my December evals, and I haven't hit any API errors or anything yet. So, I don't know. We'll see.

>> Well, you know, over the API I hadn't had any issues with it. Generally it was quite stable. It was really just when you were trying to use it with their paid plans and subscriptions that they were really trying to load balance. And I really feel like they're pulling allocation from different places, and you can really feel the Google bureaucracy in where they're putting usage in that distribution. So yeah, I think over the API, if you're doing your benches there, you should be pretty set.

>> And Cursor was great, by the way. There were a couple things I didn't like in Cursor, but otherwise it was just solid across the board for Gemini 3. I had a question on your plans, Adam, because I noticed my Gemini plans, compared to the Sonnet or Opus plans, weren't as comprehensive, even inside of Cursor, cuz I love Cursor's plan mode. Did you notice a difference between the two?

>> Yeah, when we get into Opus, we'll talk about a lot of that stuff too, when Eric leads us in that direction. I will say Gemini 3 does do good plans, but you do need to iterate on them a bit more than what I had to do with Opus, for example.

>> Mhm. So, I just want to put a little cap on the Gemini story here. The story's been all over the place: in these different CLIs, people are trying it in different places and getting different results.

I wanted to make a quick note: there's the Amp CLI tool. They had made Gemini 3 their default. They're big on having the best model as your default; they make executive decisions for you and take away the model picker. And they had made it the default. They were very confident. There were some trade-offs with it, but they felt it was so good that it was worth switching away from Sonnet for. And then I noticed, I think it was yesterday, that the creator of the Ghostty terminal, who's one of the biggest users of Amp I've seen on X, well, they had just released an Opus 4.5 experiment, and he immediately switched over and was like, "Oh my god, thank you for getting rid of this Gemini for me." So that brings us into Opus 4.5.

And I wanted to get your thoughts; that one just dropped this week, on Monday. Big release, made a big splash. So what are you all thinking? Adam, you've been holding yourself in. You want to get into the Opuses. So tell us, what are you thinking?

>> So again, this is at a time when I'm heads down coding. I hear about it. My initial reaction is that it's going to be too expensive for me to use. So, you know, it's like, oh, cool, a coding model, one that's probably out of reach for most people. Then I had a chance to look at the pricing. So the first thing I want to touch on is that the pricing, while higher, is so much better, so much better than the previous Opus version. So then I was like, "All right, I'm going to just use Opus from here on out and give it a try." And I didn't look back, honestly. You look at my

back, honestly. Like, you look at my usage, it was I was not jumping back to Gemini 3. Opus 4.5 was just solid. like

Gemini 3. Opus 4.5 was just solid. like

it uh plan mode it we the group that I was working with we were we were like testing some pretty hard like multi-dommain things that we're rolling out and it would come back with a plan

and we're like dang that is actually really good and then we'd have to go build it and we'd do like one or two follow-ups and it was like building just fairly complex systems to the point where we're like one of the guys on my

team was like well we don't even get to do anything hard anymore like that that's literally what he said. Yeah.

>> So, regardless of all that, I think Opus 4.5 is awesome. I hope they don't mess it up. I hope the pricing stays good. The speed in Cursor in particular seemed good. The speed in Claude Code felt slower to me, which I find kind of odd. I don't quite understand why Claude Code wouldn't be faster.

>> It could be a difference in token budgets, too. I'm curious what their thinking budgets are set to.

>> Mhm. And again, it is more expensive, but the cache read pricing is good. On design, I've heard some people say it's not good at design. I don't agree with that. I think it's amazing at design, like doing front-end stuff, but I think its magic is being able to work across domains and come up with great technical solutions. And yeah, right now it is my favorite model. But to be very clear, in terms of time, I probably have about two to three times as much time with Gemini 3 as I do with Opus 4.5, so I've worked across more projects and more things with it. That may change as I get into other things I haven't done with it yet. But I freaking love it. I think it's awesome.

>> Yeah. Right. And Ray, what are you thinking about it?

I immediately started throwing really difficult tasks at it, and I was just enjoying its thinking through these different issues. I was a little impressed because it came back really fast with a solution. I'm like, did you actually think through this? And I looked through it, and actually, yeah, it just makes a lot of sense. So one of the issues I'm facing is a performance issue. What I'm doing is loading, you know, 10 GB files in the browser, and I'm doing a bunch of WebAssembly-type problems: literally memory in, memory out as fast as possible, doing some processing and some tight loops. I have to really whiteboard this stuff out myself and think about it thoroughly, and then say, okay, I'm not a WebAssembly engineer; maybe I worked on some C stuff back in the day, but this is the web. So this was a fun problem to throw at Opus 4.5, and it came back with results almost too fast. Like, how do you know this? A lot of my benchmarks for memory just started going down, and I was like, "Wow, this is super duper impressive." I don't know what is in that model, but I was immediately impressed compared to Gemini 3. I was like, "Okay, this is a guy I want to really work with and talk with."

>> All right.

>> And then I started pointing it at all my other performance issues in the browser and saying, "Okay, I want you to do a thorough review." I actually ran the same prompt inside Droid and Cursor, because I wanted to see which agent would be more thorough at this type of workflow. And I was actually underwhelmed with Cursor's output. It seemed to just race to a conclusion, like "here you go," and it's the same model, with the extra-thinking Max mode brain in Cursor. Inside Droid I just had it on high, and it seems to think a little bit longer than the Cursor one, not in wall-clock time, but in how long it deliberates. I don't know; there is a difference between the harnesses. I'm actually running a similar task right now in Claude Code, the actual app, and Claude Code does take forever. It does seem to take extra long to do these tasks. Yeah.

>> So I've only used it in Claude Code myself, though I did bench it over the API. And just to caveat, I want to start with this: if you add thinking, Opus 4.5 scores number one on my bench, Repo Bench. It's the top model. What's interesting is that when Gemini 3 released, it actually scored number 26 on that bench, which was very surprising; I had to extend the leaderboard because it wasn't fitting on there, which was very upsetting.

I reached out to the Google team on that one to figure out what's going on. I think it's down to file editing quality. There's something about Claude, both in terms of how it understands its context and how it can edit files so cleanly and clearly, and how, even deep in the context window with all this junk in there, their training has some special sauce where it's able to stay coherent and on track, pivot, and stay nimble. So there's something really special about that. I think Claude models in general, and Opus 4.5 is no exception, are some of the best engineering models: able to debug an issue, find the data, be data-driven, find information, look at the data. How do we debug this? How can we check? Testing, checking its work, looking it over: okay, we're good with that. And it's just better at using the terminal, better at finding its way around and doing work. There's something really magical about that. I don't think I'm going to turn to Opus for deep thinking, deep engineering tasks, like thinking through the architecture of the best way to do some really complicated thing. I still turn to the OpenAI models for that, especially now that 5.1 Pro is out; that one stays very good for that. But I think there's nothing better than Opus 4.5 if you're using a coding engine and a terminal. It's just so good. It's nice to talk to.

It's fairly priced. And what was great with Claude Code, if you're a Claude Code user on the Max plan, is that they actually matched Sonnet's usage: Opus 4.5 costs the same usage as Sonnet, so you don't burn through your limits any faster. You're able to fully switch over, and they made it the default. So that's great. I want to say one thing, though. While Opus 4 and 4.1, to an extent, felt like a different model, like Opus the bigger brain, Opus 4.5 feels like a smarter Sonnet. It feels like we're lifting up the bar of Sonnet, like here's a model that's adding some IQ points to Sonnet. It's able to do good work. One last point I wanted to throw in there: turn on thinking for Opus. There are some notes on X where someone mentioned they felt Opus 4.5 without thinking was less smart than Sonnet without thinking, and that's actually the same result I got on Repo Bench. So it's interesting to consider: if you want it to be as smart as it can be, you've got to turn on the thinking.

Something I also want to add is that skills have been important for me, because I've just been adding them in and sort of treating skills as these empty areas, especially since I'm working with something like Convex, which isn't really in the training set. As I go through things, or as I discover and do research, I have the Opus model write its skill file for me, so it can go back later and just load it in when it needs it. So if I'm doing, say, this specific data state management thing that's sort of unique, whenever it encounters that in the app, I have a little rules file that says, hey, go take a look at the skills if you're going to be working in here, even though I'm not specifically using any keywords.
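As a concrete sketch of that pattern (the layout and frontmatter below follow Anthropic's published Agent Skills format, but the file paths and all the Convex-specific content here are hypothetical):

```markdown
<!-- .claude/skills/convex-state/SKILL.md — written by the agent itself -->
---
name: convex-state
description: Notes on this app's Convex data/state management patterns. Load before editing state code.
---

# Convex state management

- Findings the model recorded during earlier research sessions go here,
  so a fresh session can load them instead of rediscovering them.
```

A small rules file then nudges the agent toward it even when the prompt never uses a matching keyword:

```markdown
<!-- excerpt from a project rules file (e.g. CLAUDE.md) -->
If you are working anywhere in the state-management code, read the
skills under .claude/skills/ first, even if the request never
mentions Convex by name.
```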

So it's kind of a double trigger: in its thinking it goes, oh yeah, I should probably load that in. Very good. And I think that's going to be an interesting approach; I almost call it skill hacking. Can you stack skills in repos and have the agent build them itself? Because it does seem to follow those tool calls, the skills, and the other little additions. And I see it more on the web, especially when you create a Claude project, even for non-coding tasks. Opus 4.5 is just more thorough at those than I've seen even Sonnet 4.5 be.

>> Definitely. Yeah.

>> To add on to that, I would just say Opus 4.5 is so much more enjoyable to work with than Sonnet 4.5. And you guys know this; everybody probably has to do this. When I was bouncing between Composer 1 and Sonnet 4.5, every time I went to do some sort of code review to see if I was actually going to push it up and PR it, there were like five bajillion markdown files that Sonnet 4.5 decided to make that I had to go remove. You've got to go through and clean up a bunch of stuff. Opus doesn't seem to do that. A small thing, but it's very annoying how many tokens Sonnet 4.5 wastes on extra things you don't want or need it to do. The other thing I found Opus 4.5 does is that it picked up the way we do testing very well. I would talk about a feature, and I wouldn't even ask it to write the unit tests, but it would get to a point where it's like, "Hey, would you like me to go do that?" And it understood the structure of this massive, absolutely massive codebase well enough to set things up properly, run the particular test, and iterate on it. It's honestly so much fun to work with.

>> Yeah, that's what I was going to ask.

>> All Cursor, for that particular case.

Yeah.

>> Mhm. Interesting. Yeah. You know, it's interesting: there is a note they mentioned when they released Sonnet, sorry, Opus 4.5, that you have to be a little less harsh and intense in your prompting for it. Like, where you would tell Sonnet it must do X, Y, Z, you tell Opus it should do X, Y, Z. So just think about how you're prompting the model and how you're building your files.

>> From my experience as well, it does tend to better adhere to those rule files, so that's something to think about. You don't want to overdo it on the rules; make sure the model is still able to steer in the right direction. It seems to be better at instruction following, so you've got to watch out for that.

>> Eric, I wanted to throw a wrench in here. I know it's a little bit out of order, but I was thinking that last time we talked about, if we had $200 to spend, what specific models we'd be using, and since then we haven't had a lot of air time, or code time, with Composer 1. I know Adam got to use it a little bit, and I'm curious from Adam's perspective: you've been using Composer 1, Gemini 3, now Opus 4.5. Pricing, speed versus accuracy. I feel like everyone's now converging on speed and accuracy at the same time. What are you reaching for now, Adam, if you had the same $200? Kind of curious.

>> See, it's so tough. Right now I am actually paying for the $200 a month Cursor plan because of just how good I think Cursor is now, which, if you go back in time, I've been very critical of Cursor and felt like they've done some things that were very anti-consumer. I feel like they've hopefully learned from that and turned things around. I still despise their credit system. I absolutely wish we could move away from that. Just give me API pricing, some transparent way to understand how you're billing me, not tell me how many tokens I used with a credit system behind it. So I guess if I only had $200 to spend and didn't need to code beyond the capacity of Cursor's $200 a month plan, I'd probably do that. But you are limited, because you can burn through that $200 very quickly, especially if you're just using Opus 4.5. For example, I burnt through about $75 on the Cursor plan in just two days.

>> it's worth noting too, Adam, that they're currently billing Opus at half price on Cursor.

>> So that's also going to suck when that changes. That's a good point. So if I were to spend $200 today, and Cursor was out of the question because it wasn't enough usage, it would 100% still be the Claude $100 a month plan. And then I would love to have some way to get Gemini 3, but everything I'm hearing about access to the Gemini 3 plan just seems kind of odd right now, with the way you authenticate through Google in the terminal. So I'd probably still do the OpenAI Codex $20 a month plan and then fill in the rest with some API usage. That's the way I would do it.

>> So, just a note, to go back to the Gemini story: with the $20 a month Google plan now, you do get a decent amount of CLI access, and they have opened up the gates to everyone on the CLI.

>> So all the limits are lifted now?

>> Well, I haven't used it enough to hit the limits.

>> Okay.

>> The thing that's a little annoying, though, with the Gemini CLI, and I've mentioned this to their team on Twitter, is that they're a little sneaky with the model routing. Like, you're using the Gemini CLI, then you run into some limit, and oh, now you're on 2.5 Pro. Oh, and now you're on 2.5 Flash for some reason. But it turns out that if you start the Gemini CLI and pass --model with the Gemini 3 Pro preview model, then you're only going to use that model. And I was like, why can't I just do that from inside the CLI? Why do I have to set this launch flag? There are a lot of strange decisions by Google around this release, and that's one of them. But there are ways around them, and $20 does give you a fair bit, and you still have the Gemini app and the Gemini website; you can still use those. Then AI Studio also gives you a lot of Gemini 3 usage for free. So there are a lot of places to use it for free, depending on how you hit the limits. Might be worth trying. Anyway, just wanted to caveat that. Ray, how has your spending changed since those releases?

>> I'm very biased because, you know, I have used Droid and I love it so much. I reached out to them to sponsor my show, and I deeply like the way they handle this infinite context thing; that's kind of why I've been so biased, and I use Cursor and Droid both a lot. For me it might be: if I had infinite money, which one would I reach for all the time?

Adam is really convincing me to go all in on Cursor and Gemini 3, because I think that's a really interesting perspective, especially the things you're calling out there, Adam, about this longer context thing and the other things it picks up as it goes for that type of experience. The thing I find enjoyable about Droid is that they have this unique compression thing, which I also want to test against Claude Code. Apparently Claude Code has this new capability now where you can have a really long chat conversation and it's supposed to do this nice compression, and I think it's built into the way the model does it, or the tool calls. It's not really clear what's happening there, but it feels like you should be able to have a longer conversation. They just announced that.

>> That was actually for Claude AI, not Claude Code, so the website. And I think they've already been doing some work on culling some tool calls with micro-compacts, but the big change there is on the website or the desktop app. So that's worth noting.

>> Oh dang, okay. Yeah, because I've had like 10-million-token sessions in Droid, and it stays on the spec file, it stays anchored with all my rules files, it just keeps going and going. For me it's like, okay, I have a feature and I can see it from planning to ship, with MCP tool calls and everything, all in one session. And so that's what's been interesting for me. I haven't seen any others do this; I think Augment Code may have been doing something similar.

>> I need to try them again and see what they've been up to. But that's...

>> They're also moving toward a complicated credit system. So good luck figuring that out if you do move over to them.

>> Not to go off the rails there, but it's even worse than some of the other ones, I think. And Warp also did the same thing recently. I mean, I think they have to, but it should not be credits. It needs to be cost-plus. Anyway, to your point, it is interesting that you work that way, where you're actually trying to run these very long, deep-context conversations.

My workflow typically is that I try to break things up to the point where I'm constantly starting new chats. If you were to go look at my Claude Code history over a day, I'd probably have 20 chats, maybe even more; I haven't looked. Same thing in Cursor: I end up with a bunch of tabs in the Cursor IDE, because I like working on a very focused feature or thing, and then I like to say, okay, now we're going to edit that thing. So I go to a new context, tee it up with the context I need, and then I go. So it is interesting. I've never thought about it from the standpoint of wanting a 10-million-token or super long chat. I actually get kind of annoyed when they start doing context compression, so I just try to avoid that, honestly. When I see any sort of "you've got 2% left," I'm like, all right, time to start a new chat and keep working.

>> Yeah. Previously, for my workflow, I would just open new windows all the time, and I was actually extremely diligent: 50% was my marker, in Claude Code and even in Cursor. But I had the same problem; I'd have 50 chats. I'd have a plan chat, then a handoff where I'd write the markdown file, you know, waste a bunch of tokens there. I actually have documentation workflows with a spec file and a progress file, and I would hand the progress file and the spec file into every single new chat. I was just doing this over and over again, and I got kind of exhausted, because I just want to do a bug fix, and then another bug fix, and another iteration. I had so many chats, and I was like, "Okay, I have to go back forty different chats because I'm still going to go back to phase two of the plan and continue it." Because I like to do this waterfall method of plan, build, iterate, iterate, iterate, then fix, polish, ship it, and then go back to phase two and keep doing that. And I'd have these diffs, in essence; I also started doing stacked diffs. Every feature gets a PR, then I add another PR on top of that, and I keep stacking PRs until I have the whole feature set. Then I can ship it and close the PRs, and everything gets built on the server, too. So it was like this.
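The stacked-diff flow Ray describes can be sketched with plain git branches, each layer based on the one below it. This is a generic illustration in a throwaway repo; the branch names are made up, and real stacks usually go through a forge's PR UI or a dedicated stacking tool:

```python
# Demo of a stacked-diff branch structure in a throwaway git repo.
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    """Run a git command inside the demo repo, failing loudly on error."""
    subprocess.run(["git", "-C", repo, *args], check=True,
                   capture_output=True, text=True)

git("init", "-q", "-b", "main")
git("config", "user.name", "demo")
git("config", "user.email", "demo@example.com")
git("commit", "-q", "--allow-empty", "-m", "init")

# Each layer branches off the previous one; each would get its own PR
# targeting the branch directly below it in the stack.
stack = ["main", "feat/part-1", "feat/part-2", "feat/part-3"]
for base, branch in zip(stack, stack[1:]):
    git("checkout", "-q", "-b", branch, base)           # stack on the layer below
    git("commit", "-q", "--allow-empty", "-m", branch)  # PR: branch -> base

# Merge from the bottom of the stack up once the whole feature set is approved.
count = subprocess.run(
    ["git", "-C", repo, "rev-list", "--count", "feat/part-3"],
    check=True, capture_output=True, text=True).stdout.strip()
print(count)  # -> 4 (init commit plus three stacked feature commits)
```

The payoff is that each PR stays a small, reviewable diff while the top of the stack always contains the whole feature set.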

>> And all these are...

>> Yeah, it probably comes from working at Apple, old school. It's like: ship, test, iterate, keep doing this thing again. And for me it was just a little bit exhausting keeping track of all the chats, because the UI for chat history is only scrolling in a timeline, where I kind of need more of a grid or something. Yeah.

>> Yeah. I usually just kill them; I imagine when I go to a new chat, all my other ones no longer exist, because I feel like I've moved past that point. So it's very interesting. I don't know; I think I need to think a little bit more about how you're working and see if that makes sense for some of the things I'm working on.

>> Well, hold on. I think it really depends on the tool you're using. Droid put a lot of elbow grease into making that possible. But you're fighting a little bit against the gravity of the model's context window, and these models are just not great at doing this kind of thing, so you really have to work around them to make it possible. But I get it; this is a lot nicer of a workflow, to just keep going and pick up where you left off. I just think that right now, models do better early in the context window, and if you're always full of stuff, even if you're compressing all the way, there's stuff in there that may not need to be there. So it's a little tricky. I'm a little bit more along your approach too, Adam, but I do want to mention as well that if you're using Opus 4.5, it is a lot more resilient to noisy context windows and is able to keep going further. I have found in Claude Code in particular that if I'm doing compacts a few times, it's able to pick up the thread and keep going. However, I don't trust it to write a bunch of novel code at that point in the context window, after a few compacts; you've got a lot of junk in there.

What I do, though: I'm a big fan of combining models, and this week, with the Repo Prompt MCP setup in Claude Code, what I've been doing a lot is saying, hey, get a second opinion from GPT-5.1 on this idea before you proceed with your implementation. Because when you're deep in a context window and you've been through compacts, the model has a lot of understanding of what you've been doing, so it knows which files to pull out and give to GPT-5.1 to think about, and it's able to think through problems that way. And I found a few times it caught a lot of issues with what Opus was trying to do, even with a detailed plan. Opus made some syntax issues, or, one issue, it was trying to iterate over a dictionary, pull stuff out, and mutate it while iterating, which, if you've used a lot of languages, you know mutating a dictionary during iteration is a known cause for a crash in a native language. I didn't see it right away, and then it gave it to 5.1, and it spotted it right away. So having these models check each other's work is very helpful.
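The class of bug Eric describes is easy to demonstrate in Python (a generic sketch, not the actual code from his session): CPython raises if the dictionary changes size while you're iterating over it, and the usual fix is to iterate over a snapshot of the keys.

```python
scores = {"a": 1, "b": -2, "c": 3}

# Buggy pattern: deleting keys while iterating over the dict itself.
# CPython raises RuntimeError: "dictionary changed size during iteration".
try:
    for key in scores:
        if scores[key] < 0:
            del scores[key]
except RuntimeError as err:
    print(err)  # dictionary changed size during iteration

# Safe pattern: iterate over a snapshot of the keys, then mutate freely.
scores = {"a": 1, "b": -2, "c": 3}
for key in list(scores):
    if scores[key] < 0:
        del scores[key]

print(scores)  # -> {'a': 1, 'c': 3}
```

In a language like C++ the equivalent mistake (erasing from a container while holding a live iterator) can be a crash rather than a clean exception, which is the "native language" failure mode Eric alludes to.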

Yeah. Any closing thoughts on Opus before we move on to the next topic? No? All right. So this is a small call-out I just wanted to make. Someone on my Discord this morning was reaching out to me, trying to figure out the right setup to get this working for the benchmark in Repo Prompt. He tested Kimi for Coding, which is a new model that kind of just showed up; they didn't really talk about what it is. But surprisingly, on Repo Bench it scored 72 or 73%, which is on par with Opus non-thinking. That is very surprising to me; the other Kimi models didn't do nearly as well. So I don't know what special sauce is going on there, and it's TBD with more to come, but if you're interested in Kimi, it's worth taking a look. What's interesting to me is that it's the highest score for any non-Western model. To me, that's a big shot across the bow for the Western labs in coding workflows; Kimi has something cooking there, and it's worth paying attention to, especially now that they've got their nice Black Friday plans. Worth thinking about if you're interested. Any other thoughts on that? I guess you guys haven't tried that one out; I haven't really played with it either. So I just wanted to make a little call-out for it.

Cool. So, next thing: as we're wrapping up here, I just wanted to take a little moment for us all to go around and do a little tour of your terminal setups. I know Adam's using Cursor a lot lately, and Ray is using Droid in the terminal. What's your terminal setup looking like these days, guys? How are you working with these models?

>> I guess I can go.

>> Yeah, go for it.

>> For me it's literally just zsh in the shell. [laughter]

>> Right on.

>> I don't know, I'm so old school. I think it's just kind of been simple that way. I don't want to overcomplicate it.

>> I do want to learn a little bit more, to be honest. I know Ghostty's come out. I know there's a whole bunch of others; somebody talked about Kitty, and people have rewritten stuff in Rust and so forth. And I know for a lot of these terminals, all of this is text, and some terminals can be optimized to re-render for graphics work, but for me, text is only updating on the screen. My screen refresh is maybe 120 or 60 Hz. I'm not playing games in the terminal with text and stuff like that. My CLIs are only drawing so many frames per second, and that's all I need. But I'm really curious: does anyone else use anything else?

else? That's, you know, >> I'm Yeah, I just want to make a quick mention on this. So, so while that default terminals here on Mac OS worth mentioning, uh, Ray, cuz you know, the terminals do change per per OS. But on

Mac OS, the default terminal doesn't have tabs and that alone is like a big blocker to using it for me. So, just I switched over to Ghosti pretty much just for that. And Ghosti is an open source

for that. And Ghosti is an open source terminal. Uh, it's more performant than

terminal. Uh, it's more performant than the the built-in one. And the thing that I like a lot about it beyond the tabs um which you know I'm opening tabs all the time starting new chats you know with

these models um is that it is more performant and so if I'm pasting a lot of text which I do quite a lot with these with these coding agents you feel the difference like if you use the built-in terminal you're pasting a lot

into cloud code it can buckle down cl the terminal and you can feel there's like a big lag for that text to make its way through but if you're using go it's instant and that alone worth worth the switch for. Yeah.

switch for. Yeah.

>> Okay. Hey, I think you're convincing me.

>> Yeah, right on.

>> Um, yeah, tell us, Adam. What's fun?

>> I mean, I agree with you, Ray. If I'm in an IDE, I use just the basic bash or zsh or whatever I happen to be using at the time. I kind of have them all set up.

The actual best terminal for me, a terminal emulator really, would be what they do with warp.dev. And I know I'm not a big fan of their pricing, all the stuff they've changed, and the AI side of it, but the terminal experience is amazing. Even the most minor things: being able to select text in the middle of something you pasted and edit it. I know it's minor, but you actually have an editor, you know what I mean? And then you've got tabs built in, and there's some AI assistance in there. Honestly, I really love the terminal experience they've got. So that's my go-to.

>> Very good. Yeah, I'll mention a little shout out. One thing that a lot of folks in my Discord have been doing is running tmux, which is just a terminal multiplexer if you're not familiar. They want to see a dashboard. They want to be like those finance guys that have 18 screens at the same time; they want to watch eight, twenty, thirty coding agents running at once, filling their whole screen. It's a little bit intense for me personally; I'm not running that many at the same time. But if you're one of those people, you might want to look into tmux. It could be worth your time.

>> Cool.

>> That's hilarious.

>> Yeah. All right. Well, we're coming up on time, I think, for today. I just want to do a quick final shout out as well. Adam spent a little time at a conference this week. Any highlights you want to share from the AI engineering conference?

>> Yeah. So it was AI Native DevCon in New York. It was kind of a short trip because I was in Atlanta; I had to fly up. Honestly, there were some amazing people there that I got to talk to, both about the talk I gave, which is really about how we are evaluating LLMs' ability to code, why it's such a difficult problem, and some of the things that I've done and discovered so far. All of that was great. There was one talk in particular I thought was very interesting; it was from somebody from Open Hands, I think one of the founders. The talk was called "AI Hates Legacy Code." And, you know, the world I've lived in for the last three years: legacy code is like 20 or 30 years old, and I've been working with code that's less than five years old, so it's very new. And it really made me think, because I have actually worked with codebases like that. And I know, Eric, you're probably in the same place coming from Unity.

>> Like, there are code bases that have been around for decades and have stuff that's still in use,

>> you know, where the people that wrote that code are no longer at the company. So you have no context around it. So the point that he was trying to make, and there's a bunch of it, but my big takeaway was: it's not that the people working in legacy code bases are all afraid of AI. They're afraid of changing the code themselves, because they don't understand the context.

>> Yeah.

>> So it's like, when you have AI going in there and messing with something that you don't really fully understand, that no one at the company fully understands, but you know it kind of works.

>> It's just a very different paradigm to think about.

>> Well, the best part about this is that the tests are often completely misleading and will give you no signal whatsoever as to what the code is doing or whether it's working, because it's a feature built 20 years ago by someone, and then a bunch of other features are built on top of it, and you have no idea what the downstream impacts of one innocuous-seeming line of code are. All the tests are fine. You don't know what you actually just did there. So you've got to watch out.

>> Yeah, and I'd love to add some insight to that; let me close out here real quick and then I'll pass it to you. Because I think the big takeaway for me is that I've been thinking about it so much in terms of building startups and writing new code. It is a very interesting mind shift to think about what it means to work in very old code that a human is scared to actually edit, and then what happens when you put AI in there. And Eric made the exact point that the guy from OpenHands did, which is: you make a change and it literally breaks things that you don't think are related at all. So is it really that these engineers are afraid of AI, or is it more that they're just afraid of changing anything in that code anyway, and AI just adds another layer of uncertainty to the whole thing?

>> So anyway, having worked in some large code bases very recently over the last nine days, I totally get it. I'm starting to come around to why it's such a difficult thing, because you can't find the person that wrote the particular thing you're trying to debug or figure out. Anyway, Ray, I'll pass it to you. But I thought that was fascinating, and I know people can resonate with it.

Mhm.

>> Yeah, I think that's a great topic that I'd love to deep dive on further, because in a lot of my experience this also happens. I used to work at a company where the codebase is 30-plus years old too, and there are file systems that are 20, 30-plus years old. I worked on a project where the file system literally got rewritten, and I worked on the update stack, so you're talking about updates that have version mismatches and all this legacy type of thing. One of the most valuable things, because the company notoriously kept the teams extremely tiny, is the value of a QA engineer, and a QA engineer that wasn't necessarily just writing tests. The value is in the integration and in actually understanding the procedures behind the company's output.

>> Every single detail matters, from when it leaves a factory to when it actually goes through all these different processes. These people are extremely valuable and add that extra layer that an engineer really can't think about day-to-day when they're writing code. And it's actually a different brain that's required for thinking about these problems, because they'll approach it more from a user-centric standpoint.

>> And so the bugs that they find there actually have a higher priority for fixing, and they have these established tests, things that they check daily, weekly, monthly. And so it gets me thinking that maybe we are also going to turn into QA engineers too,

>> and start to think about it that way, and the AI can help us actually grab the relevant pieces of code or maybe debug faster. But you have to think about these things in a more procedural, business-operations type of way: okay, what parts of my software stack are the highest impact for customers and have to be tested every day, and do we have humans around them actually going through these workflows if we don't have some type of test or some type of first-gate signal? Because if a build breaks, now you have 20,000, 30,000, however many engineers you have across the globe, like at Microsoft, and these people are going to wait on a build, and that's wasted money. And so some of these tests can help find problems: find the smoke, and the human goes and takes a look at the fire to see what the bigger problem is. Because there have been some big, massive bugs, or refactorings where we forgot literally one function and it just broke things in ways where you're like, oh yeah, of course we didn't think that was relevant, but now there's this really weird use case with the old version of some other thing that, when it gets introduced into the new ecosystem, throws everything for a wrench. It's like,

>> how are you going to update older code from a thing that doesn't even take software updates anymore?

>> Yeah. The other thing is, and you guys have definitely dealt with this, there are times that a service exists that has a bug in it, and getting that updated or changed in some way is just not going to fit the timelines around what you need to release.

>> But it's not like a major bug. You can kind of work around it. So you make decisions to now build around this flaw.

>> Mhm.

>> Eventually that flaw gets fixed. What happens to the stuff that you actually did to work around it? So there are all of these little micro-decisions that happen to try to get something out the door, and then people forget about it. They forget that, hey, if I update this thing to make it work the way it should, the correct way, we're going to break all these people over here that had to work around the way it was before.

>> So anyway, when I think about the pockets of people that are AI coders: there are people like us that are fully bought in. We love it. It works really well, especially on fairly new code bases that you're building up. There are the vibe coders that just want to make zero-to-one stuff. Then there are the people that are working in these banking core systems that were made 50 years ago. And I have a lot of empathy for those people being scared of putting AI on their systems, where before, you know, I was of the mindset that everybody should be doing it. But I've really started to try to look at it from the perspective of: if I as a human am scared to change something, putting AI on it is actually going to make it more tricky.

>> There's a couple of other axes there too, though. You know, an AI model is only as good as its training data, and a lot of these enterprise legacy code bases just don't exist on the web. So the AIs just don't have any idea how they work. And it's not just the languages, either. You might have a shortage of COBOL online, which definitely is the case. But it's also that each enterprise codebase is its own beast. Even though it shares technical stacks with other things that are on the web, the code itself is an intricate machine that the AI wasn't trained on. And to your point, Adam, there are these flaws that are kind of baked in and worked around that the AI has no idea about. And at the end of the day, an AI model only knows what fits in its context window. So if it's poking around trying to find some things, it'll make judgment calls with what it can see, but it can't see everything. It's just impossible.

So, you know, for now, this is an inherent limitation, and you have to work with it to find out what you should give it as information. It has to complement your own knowledge. It's a very tricky thing working in these code bases with AI models, and you have to treat it almost like you're training up a new hire, except every time you use it, which is kind of upsetting. So you can't just vibe your way through it, which is a big issue.

>> No, you can't vibe your way through these old code bases.

>> Yep. Yeah.

>> But I thought it was a really cool perspective.

>> Yeah.

>> Anyway, I think that's uh there were other good talks but I think that was probably the one that stood out the most.

>> Very good. Well, thanks for sharing that with us, Adam. And I think honestly that's a good place for us to wrap for today. So that is episode five of Rate Limited. Thank you so much for joining us this week. I've been your moderator today, Eric. Thanks to Ray as well, and to Adam. Hope you join us next time. Don't forget to like and subscribe. Thank you so much. Peace out, everyone. Take care. Take it easy.
