Gemini 3 Pro finally here, Opus 4.5 surprises everyone, GPT 5.1 and more | Episode 5 Rate Limited
By Rate Limited
Summary
## Key takeaways
- **Codex fixes truncation**: Codex now truncates tool output by token limits instead of line-based rules after feedback, with a config flag `tool_output_token_limit = 25000` to raise the limit from 2.5k tokens. [01:35], [01:49]
- **GPT 5.1 frustrates coders**: Adam found GPT 5.1 slower than Composer 1 with worse design chops and no real improvements; Ray noted it reverts to old Tailwind v3 despite rules, preferring training-data habits. [03:13], [04:28]
- **Gemini 3 excels at plans**: Adam switched to Gemini 3 Pro for phenomenal planning, execution across TypeScript/Python/React/Vue/PHP, and steerability without GPT's stubbornness, though it occasionally forgets tools and adds redundant comments. [10:15], [11:07]
- **Opus 4.5 tops benchmarks**: Opus 4.5 scores #1 on Repo Bench with thinking enabled, outperforming Gemini 3 (ranked 26th); it's a smarter Sonnet, great for multi-domain engineering and testing, with less cleanup than Sonnet 4.5. [28:11], [24:12]
- **Legacy code resists AI**: AI hates legacy codebases from decades ago with lost context, misleading tests, and hidden interdependencies; engineers fear changes humans barely understand, amplified by AI's blind spots to proprietary flaws. [51:05], [52:16]
Full Transcript
Ladies and gentlemen, you are tuned in to the Rate Limited podcast with your host Ray Fernando. We also have Eric Provencher, the founder of Repo Prompt. And we now have Adam Larson, GosuCoder. We
have made over the 1,000 subscriber mark and can afford Adam's last name. Welcome
to the show, ladies and gentlemen. This
show is going to be about AI practitioners right in the weeds of everything, building every single day.
There have been so many new model releases in the last couple of weeks: GPT 5.1, Gemini 3, Opus 4.5, Grok is in the mix, Kimi has some crazy stuff going on. We also have some updates from the last episode on Codex and their truncation issue, and a whole bunch of other things, like Dwarkesh's podcast talking about ASI with Ilya, and so much more. So ladies and gentlemen, we're excited to have you on the show. This is going to be a really packed episode, so make sure you stay buckled up, follow those timestamps, and we'll catch you soon. Yeah, thanks for that intro, Ray.
So, uh, to start things off, I just wanted to do a quick follow-up on last week's episode, where we talked a little bit about issues around Codex truncating context with tool calls. If you're not familiar or you didn't listen to that episode, the key issue is that, unlike other AI coding agents, Codex was doing this optimization where, if a tool call returned too much (like it's going off and reading a file or reading some bash output), it would cut off the middle part of the output to fit some line-based rules. Following that discovery, I had opened a GitHub issue, made some noise about it, and tweeted about it. And they actually addressed it in the release last week, and now all of the truncation is done by token limits, which is a big step up. And
they added a configuration flag you can use in the config file as well. So if you're feeling that this is an issue for you, you can reach out in the comments, or we can put a little note in there with the flag you can go ahead and set if you want. But you can actually fix that yourself if it's a problem, and it's great to see their response to feedback. So I just wanted to put a little close on that saga. It's nice to see it done. It's too bad the default limit is still quite low, at 2.5k tokens, which is not a lot. But anyway, it's good to see them fix that.
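For anyone who wants to set this themselves, the flag lives in the Codex CLI's TOML config file (commonly `~/.codex/config.toml`). The key name below is the one quoted in this episode's notes, so double-check it against the docs for your installed CLI version:

```toml
# Raise the per-tool-call output cap from the default (~2,500 tokens).
# Key name as cited in this episode; verify against your Codex CLI version.
tool_output_token_limit = 25000
```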
Since that happened, that release that fixed it for the Codex CLI came alongside GPT-5.1-Codex-Max, which is a mouthful of a model name, to say the least. And one thing we didn't mention in the last episode was that GPT 5.1 came out right as we recorded. So that's a crazy thing.
So I wanted to take a second here to talk to both of you guys, Adam and Ray. Let's start with you, Adam. What are you thinking on this new GPT 5.1, and Max, if you've tried that one?
>> Yeah, it was very interesting, cuz we recorded and then that very day GPT 5.1 came out. I have only messed with GPT 5.1, but as many of you know, over the last nine, ten days I have been very heads-down coding. So, to me, that's what kind of proves out whether a model is valuable or not. 5.1 was kind of frustrating to work with for me. I didn't feel like much had changed. I felt like its design chops had actually gotten worse. Uh, to give you a little bit of what I was doing before: I was using Composer 1 pretty much solely because it's so fast. I understand the codebase. I'm kind of feeding it what I want to do. GPT 5.1 is so much slower, and because of that, my iteration speed comes way down. But I also didn't feel like it was doing things that made sense. Something just felt off about the model. Ray, I showed you that thing earlier today. From a design-chops standpoint, it's just weird. It does some weird things that I would not expect a model to do. Now, some people I've heard love the model, but for me, I got very frustrated with it and quickly went off of it. I don't know if you guys agree or disagree. Would love to hear that.
>> Well, what were your thoughts on it, Ray? Did you get much chance to test it out, or were you focused on other models for the last couple weeks?
>> I tried it out, and compared to Sonnet 4.5 high, which I've been using a lot in Droid, I felt that the model just wasn't fast enough for me, or [laughter] accurate enough. I'm using a lot of Tailwind classes. I'm using Tailwind v4, I'm using the whole tokenizing system, and for whatever reason it keeps reverting back and inserting stuff from the older training set. And that's what I found a little bit shocking, because I have rules files, I have various things in my codebase, and it seems to prefer a lot of things from its training set: Tailwind v3 stuff, all the older stuff. I haven't really checked, but I think the cutoff date for this model's base training data is actually still older, and it seems to be reinforcing older habits, things that I wrote rules for that 5.1 surprisingly didn't follow, which the previous model, GPT-5, did follow well. And so after a couple rounds of that, I just said, there are already other good models in the pipeline, so might as well just use them. So I just kind of stopped using it right away. Yeah.
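For context on the v3/v4 split Ray is fighting: the two major versions look quite different right at the stylesheet entry point, which is exactly the kind of surface where a model trained mostly on v3 code keeps reverting. A rough sketch of the difference:

```css
/* Tailwind v4: CSS-first setup, a single import, design tokens defined in CSS */
@import "tailwindcss";

@theme {
  --color-brand: #0f766e;
}

/* Tailwind v3 (the older pattern models keep emitting): three directives,
   with configuration living in a separate tailwind.config.js file */
/*
@tailwind base;
@tailwind components;
@tailwind utilities;
*/
```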
>> Yeah, well, that's interesting. Using it for design: I've never found the GPT models to be the right ones for that purpose, so that's quite expected to me. But I did get a chance to use it quite a bit, just more in ChatGPT, and now with 5.1 Pro that came out, I've seen quite a step up in terms of how thorough the model is. I think the base 5.1 is faster, and that's nice, and I tend to use these models not so much for design expertise but for detailed planning and architecture, and I still think these models are best-in-class at that. Um, so
that's been the case pretty consistently. What I did find interesting, though, is when 5.1 came out, I put it on Repo Bench, the Repo Prompt benchmark, and it actually shot up to the top; it was number one. But what was very interesting to me as well is that only the low-thinking version was at the top, not the high-thinking version. There's a lot of speculation as to why that might be, and my thinking on it tends to come down to this. When I look at GPT models compared to Claude and other models, what I've noticed is that they are quite sensitive to noise in their context window. If you have a lot of unrelated context in your context window, that does tend to distract the 5.1 models quite a lot, and Repo Bench really stress-tests that. I've also noticed that GPT models in general are just not as good at file editing as Claude models, and other models that do really well at that task. I've seen it be quite sloppy at file editing as well. But with low thinking it seems to do better. It gets less in its own way. When you have less reasoning, you're adding less noise into the context. If you think about reasoning, you're adding tokens to the context window to think about the problem before you get to an answer. All of that reasoning sits in the context window, and if the model is sensitive to noise in its context window, then its own thinking can get in the way of a good answer. Which is surprising, and that doesn't happen with Claude models; interestingly, they are much better at dealing with that noise in the context window. So I just wanted to put a little caveat on that, just a little thought there. Um, now in terms of the Max models, have you guys tested those ones out in Codex? Have you played with the Max models at all?
>> No. No. It's a no for you, Adam, and you, Ray. None at all. Okay.
>> Yeah. I'd love to hear your thoughts. Yeah.
>> Yeah. So, I mean, I played with it a fair bit. I used it. I find that, unlike when you're prompting Claude or some other more general-purpose agent models, which tend to be more flexible and conversational so you can kind of run with them, try some things out, course correct, and keep going, GPT models are a little bit stubborn about trying things. If you're deep in a context window and you say, "Okay, actually, let's try this thing instead," it's like you're fighting its inertia to try and do something else and pull away from the thing you've set. It's very sensitive to the accumulated work it's done and the accumulated instructions. So the bigger that use of your context window is, and the more detailed your prompts are, the harder it is to get it to change course when you're deep in the context window. That remains true with Max, but I find that if you do give Max a good plan, it's quite the robot: it's able to go ahead and execute beautifully and get some work done. But again, down to the file-editing things I mentioned, it is a little bit sloppy on files, especially if you use tabs in your code. What is interesting to note, and I don't use Windows, but they did say that when that model came out they had trained it on PowerShell environments. So if you are on Windows, this is the first model that I would consider usable in the native terminal, which is really interesting. So yeah, good thing to note. It is a good model, smart, very slow though, and if you're good with that, it's worth looking at. Cool. So that was a little wrap on the GPT series.
What happened next is Gemini came out with their Gemini 3 model, and that was a big drop in the industry. It really shook things up. A lot of people had a lot to say about it. But before we get into the vibes, I want to hear: Adam, did you get a chance to play with it much?
>> A ton.
>> Okay, great.
>> Out of all the three models that came out, that's the one I've used the most.
>> Okay, great. And what were you thinking?
>> So, the way that I would put this is: up until that point, we were heads-down on a tight deadline. So we're doing Composer 1. I'm hitting GPT 5.1, giving up on that, going back to Sonnet. So I'm on Composer 1 and Sonnet. Gemini 3 Pro came out, and it's all about whether I feel like I'm getting done what I need to. I felt like the model was phenomenal, honestly. I felt like I was able to one- and two-shot plans; I was getting really great plans back and actually executing on them. And honestly, the speed, while slower than Composer by a long margin, didn't matter as much, because it was still snappy enough and the quality I was getting was extraordinary. Now, I was doing a lot. Just the list of things I was working on as part of this project: a lot of data-engineering stuff, TypeScript, Python, we had React portions, Vue portions, and some PHP stuff that was actually in there. So we're working across a lot of systems. Never ran into anything that it did a horrible job at. It did a great job letting me steer it. So,
for example, to your point, Eric, which drives me nuts about the GPT models: they get very stubborn, almost to the point where they start ping-ponging back and forth, like I tell it to do something and then it forgets that it's supposed to be doing this new thing and goes back to the old thing. Never felt that with Gemini 3. But there were a couple things with Gemini 3 that were slightly irritating. Every once in a while in a context flow, it would just forget how to call tools. It would talk to itself about the tools it should be calling, and then it would think that it's trying to call tools, but it's just, imagine all of that getting put in text. Once that happened, there was no way to correct it. If I was working on something, that's kind of just a broken chat at that point, so I'd have to start over. That's kind of frustrating. Then you've got to rebuild all that back up again, get your context where you want it, and go. Not the end of the world. Across the days that I used it, it only happened maybe three or four times, but it was enough to be irritating. The
other thing that drives me nuts: if I tell it to go do something, it's very much a comment-hungry model. I like comments, but I don't like comments that [snorts] basically regurgitate what I told it to do. It's just a really annoying thing. No matter how I did it, it would just leave comments in there.
>> Like code comments, or does it just comment that it's thinking?
>> Code comments. So, for example, I'll be like, "Hey, I want to switch this thing up," and then it would put a comment in there: "Switching this thing based on request." That's not what I want in my code. You know what I [laughter] mean? I don't want that in there. Maybe I'm just being picky. And that's
>> Well, you know, Gemini 2.5, when that came out, was notorious for the same problem. So it's interesting that that persisted through here. Anyway, keep going. I just wanted to mention that.
>> Yeah. I guess the final thing I would say is: I really, really love the model. I think it does a great job at frontend stuff. I think it's fast compared to the quality you're getting out of it. I think it does a great job at context gathering and building plans. There are some quirks with it, but when we get to the next model, we're going to talk about that. You could look at my usage: it goes Composer 1, Sonnet 4.5, a little bit of GPT 5.1, back to Sonnet 4.5, then Composer 1, then all Gemini 3. And the second Opus came out, I swapped over to Opus. And Opus is
>> Well, let's come back to Opus on that one. I
>> I won't touch on it, but it's just a tough one.
>> Yeah. Yeah. Okay. And Ray, how did you feel about Gemini?
>> I used it in AI Studio. I also used it in Droid. And I tried to use it in Antigravity. First, with AI Studio, I found it to be extremely impressive at design and at understanding really large, complicated instruction sets. I actually put it up against Opus; I think at that time I did Opus 4.1. I was basically doing, um, a Kilauea, uh, sorry, the big island here in Hawaii is currently erupting, and I wanted to figure out the time window at which it would do these things. So I had these really, really long prompts and a lot of research and USGS government documents. I wanted to see: can I vibe-code something together? Like, if I was a government employee, you have to look through five different sites and be a geologist to determine this 5-day window of whether it's going to erupt and analyze this data. And so, you know, Opus did its little design thing, and on the web it just looks like a really simple app. I threw that into AI Studio, and in one prompt I thought, let me just see how much it can chew on just from the web. It did what I thought, like back in the day, o1 and GPT would do: just take all the different instructions.
>> Mhm.
>> Not only did it make the site in just one prompt, but it was already mobile-friendly. It had this data connect. It had so much detail that I was like, how did it do this? I didn't even tell it a design style. It came up with a beautiful, cohesive design, a design hierarchy as well. And I was like, this is amazing. So I tried a similar prompt inside of Droid, the CLI, and it didn't come out the same way. I also tried it inside of v0 as well, and it didn't come out the same way either. I was a little surprised; it got me thinking there's something in AI Studio that didn't have...
>> How did you put the context into AI Studio?
>> Uh, in AI Studio I just literally copied and pasted the entire big prompt. Yeah.
>> Yeah. So, on that note, I think that's a common thread for most coding tools: when you include the full context of your project, as much as you can, and that's the basis of what I built with Repo Prompt, you get better results out of these models. They just see more of what you need them to see, and less of what you don't. They don't see threads of tool calls, and they give better output; they're not reading small slivers, they're getting the full picture and are able to work. So I just want to comment on that: I don't think it's just AI Studio. I think it's just the way you're prompting it.
>> Yeah. I always do that, especially across all the different tools I use, because I get to see what the actual agents, the harnesses or whatever is sitting on top of the model, are doing. Are they picking things apart and only sending certain things in that prompt? And there's also a hierarchy they can apply. That's what's been interesting about Repo Prompt, right? Because you reveal different pieces of that and say: here is your entire prompt that you're going to send raw to the model. And so that got me really thinking about Antigravity. I just couldn't get past the rate limits. With one prompt I was pretty much done.
>> It looks so promising. But I just recently heard that if you have a Workspace account, so I have my domain on Google Workspace, I don't have the same limits as just a regular Gmail account. Interesting.
>> And so somebody said if you just switch to your Gmail account, you're probably actually going to have higher limits.
>> Now you tell me, bro. So
>> I I have
>> So, I mean, that's a big story about Gemini: the best place to use it typically hasn't been Google's tools. I mean, AI Studio has been great; it's nice that you get a lot of usage for free. But Antigravity and the CLI, they put such tight rate limits on. And on the CLI in particular, only as of Monday this week, we're recording on the 26th, they had a wait list, and only now have they rolled it out so you don't need a wait list. As long as you're on a paid plan, you have access. Before that, you'd sign up on the paid plan and they'd say, "Hey, you get more Gemini 3 usage on the CLI and on Jules," and Antigravity is not in that list, but then it wasn't true because you needed to be on a wait list, and now it's finally true. So if you had the Ultra plan, that was like $300, or 300 Canadian, like $200 a month, $250 or something, I don't know, then you got immediate access, but that was a whole rigmarole to get access to. And then I saw earlier this week they actually pulled all Gemini 3 usage from Jules. So now it's back to 2.5 for them. And then Antigravity, you can't pay for it; it's only the free tier. So that's kind of annoying too if you want to get use of it. So I actually did try Antigravity a fair bit too. And I think the cool thing about it is that it comes with a Chrome extension that you can set up, and then when you're working on a web app, it's able to run the web app in Chrome, inspect what it's doing, take screenshots, scroll through your page, click around, and then review its work automatically in a tight loop. And I found that was helpful. It didn't work perfectly when I tested it; there were some clear issues, like it took screenshots at weird places. So I think there's some work to be done there, but in the future that kind of integration is going to be important. And of course, I just hit errors right away on Antigravity and wasn't able to use it. And surprisingly, Sonnet was working better than Gemini 3 on there, which is very surprising to me.
>> Yeah. I also want to add to what Adam was saying earlier about comments. That's the other thing I kept seeing with this model. It wasn't as crazy as the previous 2.5 Pro, but they would show up. And also, if you don't review its code, it's extremely convincing at persuading you that the technique it's using is the right one. You have to be very, very careful with it. I got a little cautious: I was like, this looks really good, and then I actually ran through the data model. And I was like, wait. I use Convex a lot for the database, and I have specific instructions; the reason I use Convex is that everything can be reactive. And I noticed that the Gemini models got extremely aggressive about, "Hey, I can write all the state on your front end, so that it'll be doing all the stuff, and I'm being smart. See, look what I did. Here are some comments that kind of lead you through it." And I was like, that's not what I want. Everything is on the back end. Convex has just a single query that I can pull in and refresh the app. I don't need you to write extra things to justify why you're so great at doing this. I understand that, but that's not needed for this framework.
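The pattern Ray is describing, one backend-owned query driving the UI instead of duplicated front-end state, looks roughly like this. This is a framework-agnostic sketch (Convex's actual `useQuery` hook gives you the subscription for free; the class and the "tasks" query here are illustrative, not Convex's API):

```typescript
type Listener<T> = (value: T) => void;

// Minimal stand-in for a reactive backend query: one source of truth,
// subscribers re-render whenever the server pushes a new result.
class ReactiveQuery<T> {
  private listeners: Listener<T>[] = [];
  constructor(private value: T) {}

  subscribe(fn: Listener<T>): void {
    this.listeners.push(fn);
    fn(this.value); // deliver the current value immediately
  }

  // The backend pushes updates; the client never mirrors this state.
  push(next: T): void {
    this.value = next;
    for (const fn of this.listeners) fn(next);
  }
}

// Hypothetical "tasks" query: the UI derives everything from it directly,
// rather than copying results into its own front-end state.
const tasks = new ReactiveQuery<string[]>(["write show notes"]);
let rendered: string[] = [];
tasks.subscribe((t) => { rendered = t; });

tasks.push(["write show notes", "edit episode 5"]);
```

With Convex, the `push` side is handled by the database itself: any mutation that changes a query's result re-runs it and re-renders subscribers, which is why hand-rolled client state on top of it is redundant.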
>> Patting itself on the back, like, oh, I did such a good job. Oh, look at that.
>> Yeah. It's like that Obama meme where he gives himself a medal. That's what I felt. I was like, okay.
>> Oh man.
>> I was like, that was interesting. But, you know, I'm using this framework and it's not in the training set as often. That's what's been really interesting.
>> Always a big challenge.
>> It is.
>> Yeah. So, I agree with you. Day one of Gemini 3, by the way, was pretty rough, I think, for everyone. At least for me. I used it a few times, was able to get some prompts through, then it errored out. So on day one I had been a little bit frustrated just with the availability. So I was like, okay, give it a day or two to smooth out. You know, you mentioned all of the capacity issues that they're having across the board.
>> I kind of wonder if Opus 4.5 has alleviated some of that for them, just because, you know, when a new model comes out, literally everybody's hammering it 100%; that's all we want to use. So I do wonder what it's at now, because I was doing some evals this morning, getting ready for my December evals, and I haven't hit any API errors or anything yet. So, I don't know. We'll see.
>> Well, over the API I hadn't had any issues with it; generally it was quite stable. It was really just when you're trying to use it with their paid plans and subscriptions that they were really trying to load-balance. I really feel like they're pulling allocation in different places, and you could really feel the Google bureaucracy in where they're putting usage in that distribution. So yeah, I think over the API, if you're doing your benches there, you should be pretty set.
>> And Cursor was great, by the way. Like, I like Cursor; a couple of things aside, it was just solid across the board for Gemini 3.
>> I had a question on your plans, Adam, because I noticed my Gemini plans weren't as comprehensive as the Sonnet or Opus plans, even inside of Cursor, cuz I love Cursor's plan mode. Did you notice a difference between the two?
>> Yeah, when we get into Opus, we'll talk about a lot of that stuff too, when Eric leads us in that direction. But I will say, Gemini 3 does do good plans, but you do need to iterate on them a bit more than what I had to do with Opus, for example.
>> Mhm. So, I just want to put a little cap on the Gemini story here. The story's been all over the place: people are trying it in different CLIs, in different places, getting different results. I wanted to make a quick note that there's the AMP CLI tool. They had made Gemini 3 their default. They're all big about having the best model as your default; they make executive decisions for you and take away the model picker. And they had made it the default. They were very confident. They had some trade-offs with it, but they felt it was so good that it was worth switching away from Sonnet for. And then I noticed, I think it was yesterday, that the creator of the Ghostty terminal, who's one of the biggest users of AMP I've seen on X, so they had just released an Opus 4.5 experiment, and he immediately switched over and was like, "Oh my god, thank you for getting rid of this Gemini for me." So that brings us into Opus 4.5.
me." Um, so that brings us into Opus 45.
Uh, and I I wanted to to kind of get your your thoughts on so that one just dropped this this week on Monday. big
release, made a big splash. Uh, so what what are you what are you all thinking?
Adam, you've been, you know, holding yourself in. You're like, you wantanna
yourself in. You're like, you wantanna you want to get into the opuses. So tell
us what are you thinking?
>> So again, this is at a time when I'm heads-down coding. I hear about it. My initial reaction is, it's going to be too expensive for me to use. So, you know, it's like, oh, cool, a coding model, one that's probably out of reach for most people. Then I had a chance to look at the pricing. So the first thing I want to touch on is the pricing: while higher, it is so much better, so much better than the previous Opus version. So then I was like, all right, I'm going to just use Opus from here on out and give it a try. And I didn't look back, honestly. You look at my usage: I was not jumping back to Gemini 3. Opus 4.5 was just solid. In plan mode, the group I was working with, we were testing some pretty hard multi-domain things that we're rolling out, and it would come back with a plan and we're like, dang, that is actually really good. Then we'd go build it, do one or two follow-ups, and it was building fairly complex systems, to the point where one of the guys on my team said, "Well, we don't even get to do anything hard anymore." That's literally what he said. Yeah.
>> So, regardless of all that, I think Opus 4.5 is awesome. I hope they don't mess it up. I hope the pricing stays good. The speed in Cursor in particular seemed good. The speed in Claude Code felt slower to me, which I find kind of odd. I don't quite understand why Claude Code wouldn't be faster.
>> It could be a difference in token budgets, too. I'm curious what their thinking budgets are set to.
>> Mhm. And again, it is more expensive, but the cache reads are good. The design, I've heard some people say it's not good at design. I don't agree with that. I think it's amazing at design, like doing front-end stuff, but I think its magic is being able to work across domains and come up with great technical solutions. And yeah, it is right now my favorite model. But to be very clear, in terms of time, I probably have about two to three times the amount of time with Gemini 3 that I do with Opus 4.5, so I've worked across more projects and more things. So, you know, it may change if I get into other things that I haven't done with it yet. But I freaking love it. I think it's awesome.
>> Yeah. Right. And Ray, what are you thinking about it?
>> I immediately started throwing really difficult tasks at it, and I was just enjoying its thinking through these different issues, and I was impressed because it came back really fast with a solution. I'm like, did you think through this? And I looked through it and was like, actually, yeah, it just makes a lot of sense. So one of the issues I'm facing down is a performance issue. What I'm doing is loading like 10 GB files in the browser and doing a bunch of WebAssembly-type problems. So literally just memory in, memory out as fast as possible, doing some processing and some cool loops. And I have to really whiteboard this stuff out myself and think about it thoroughly, and then say, okay, I'm not a WebAssembly engineer. Maybe I worked on some C stuff back in the day, but this is the web, right? So this was a fun problem to throw at Opus 4.5, and it came back almost too fast with the results, and I'm like, how do you know this? A lot of my benchmarks for the memory, everything just started going down. I was like, "Wow, this is super duper impressive." And I don't know what is in that model, but I was immediately impressed compared to Gemini 3. I was like, "Okay, this is a guy I want to really work with and talk with."
>> All right.
>> And then I started pointing it at all my other performance issues in the browser and saying, "Okay, I want you to do a thorough review." And I actually did the same prompt inside of Droid and Cursor. I wanted to see which agent would be more thorough at this type of workflow. And I was actually underwhelmed with Cursor's output. It seemed to just get to a conclusion super fast, like here you go, and it's the same model, with the extra thinking max mode brain for Cursor. And then inside of Droid I had it just on high, and it seems to take extra long, not timing-wise, but it seems to think a little bit longer in Droid compared to the Cursor one. And I don't know, there is a difference in the harness. I'm actually running a similar task right now in Claude Code, the actual app, but Claude Code does take forever. It seems like forever these days, but yes, it seems to take extra long to do these tasks. Yeah.
>> So I've only used it in Claude Code myself, though I did bench it over the API. And just to caveat, I want to start this off: if you add thinking, Opus 4.5 scores number one on my bench, Repo Bench. It's the top model. What's interesting is that when Gemini 3 released, it actually scored number 26 on that bench, which was very surprising. I had to extend the leaderboard because it wasn't fitting on there, which was very upsetting.
I reached out to the Google team on that one, figuring out what's going on. And I think it's down to just file editing quality. There's something about Claude, both in terms of how it understands its context and how it can edit files so cleanly and clearly, and how, deep into the context window with all this junk in there, it's able to stay coherent and on track and, you know, pivot and stay nimble. They have some special sauce in their training around that. So there's something really special about Claude models in general, and Opus 4.5 is no exception. I think it's one of the best engineering models: able to debug an issue, find the data, be data-driven, find information, look at the data. Oh, how do we debug this? How can we check? Testing, checking its work, looking through, okay, we're good with that. And it's just better at using the terminal, better at finding its way around and doing work. So there's something really magical about that. I don't think I'm going to turn to Opus for deep thinking, deep engineering tasks, like thinking through the architecture of what's the best way to do this really complicated thing. I still turn to the OpenAI models, especially now that 5.1 Pro is out. That one stays very, very good for that. But Opus 4.5, I think there's nothing better than it if you're using a coding engine and a terminal. It's just so good. It's nice to talk to. It's fairly priced. And what was great with Claude Code, if you're a Claude Code user on the Max plan, is that they actually matched the usage to Sonnet's, so you're not burning your limits any faster. You're able to really just fully switch over, and they made it the default. So that's great.
I want to say one thing, though. While Opus 4 and 4.1, to an extent, felt like this is a different model, Opus the bigger brain, Opus 4.5 feels like a smarter Sonnet. It feels like we're lifting up the bar of Sonnet, like here's a model that's maybe adding some IQ points to Sonnet. It's able to do good work. And one last point I wanted to throw in there: turn on thinking for Opus. There are some notes on X where someone mentioned they felt that Opus 4.5 without thinking was less smart than Sonnet without thinking, and that's actually the same result I got on Repo Bench as well. So it's interesting to consider: if you want it to be as smart as it can be, you've got to turn on the thinking.
Something I also want to add is that skills have been important for me, because I've just been adding them in, sort of treating skills as these empty areas, especially since I'm working with something like Convex, which is not really in the training set. And so as I go through things, or as I discover and do research, I then have the Opus model write its skill file for me, so it can go back later on and just load it in when it needs it. So if I'm doing this specific data state management thing that's sort of unique, whenever it encounters this in the app, I just have a little rules file that says, hey, go take a look at the skills if you're going to be working in here, even though I'm not specifically using any keywords. So it's kind of a double way of getting it to think, oh yeah, I should probably load that in, that type of thing.
Very good. And I think that's going to be an interesting approach, what I'd almost call skill hacking. Can you stack skills in repos and have the agent build them itself? Because it does seem to follow those tool calls and the skills and other little additional things. And I see it more on the web, especially when you create a Claude project, and even for non-coding tasks. It's just extremely thorough at those with Opus 4.5, more than I've seen even with Sonnet 4.5.
>> Definitely. Yeah.
>> I was going to add on to that. I would just say Opus 4.5 to me is so much more enjoyable to work with than Sonnet 4.5.
And, you guys know this, everybody probably has to do this: when I was bouncing between Composer 1 and Sonnet 4.5, every time I would go to do some sort of code review, to see if I'm actually going to push it up and have it PR'd, there are like five bajillion markdown files that Sonnet 4.5 decided to make that I have to go remove. You've got to go through and clean up a bunch of stuff. Opus doesn't seem to do that. It's a small thing, but it is very annoying how many wasted tokens Sonnet 4.5 spends on extra things that you don't want or necessarily need it to do. The other thing I found Opus 4.5 do is it picked up the way we do testing very well. I would talk about a feature, and I wouldn't even ask it to go write the unit tests for it, but it would get to a point where it's like, "Hey, would you like me to go do that?" And it understood the structure of this massive code base, like absolutely massive code base, well enough to set things up properly and then run the particular test and iterate on it. It's honestly so much fun to work with.
>> But yeah, that's what I was going to ask...
>> All Cursor for that particular case. Yeah.
>> Mhm. Interesting. Yeah. You know, it's interesting. There is a note they mentioned when they released Opus 4.5: you have to be a little less harsh and intense in your prompting for it. Like where you would say Sonnet must do XYZ, Opus should do XYZ. So just think about how you're prompting the model, how you're building your files. It does tend, from my experience as well, to better adhere to those rule files. So that's something to think about. You don't want to overdo it on the rules, and make sure the model is still able to steer in the right direction. It seems to be better at instruction following, so you've got to watch out for that.
>> Eric, I wanted to throw a wrench in here. I know it's a little bit out of order, but I was thinking that last time we talked about, if we had $200 to spend, what specific models we're using. And I think since then we didn't have a lot of air time, or, you know, code time, with Composer 1. I know that Adam got to use it a little bit, and I'm kind of curious from Adam's perspective. You've been using Composer 1, Gemini 3, now Opus 4.5. Pricing, speed versus accuracy. I feel like everyone's now converging on speed and accuracy at the same time. What are you reaching for now? Like if you had the same $200, Adam? Kind of curious.
>> See, it's so tough. Right now I am actually paying for the $200 a month Cursor plan because of just how good I think Cursor is now, which, if you go back in time, I've been very critical of Cursor and felt like they've done some things that were very anti-consumer. I feel like they've hopefully learned from that and they've turned things around. I still despise their credit system. I absolutely wish that we could move away from that. Just give me some transparent way to understand how you're billing, not tell me how many tokens and have the credit system behind it. So, I guess if I only had $200 to spend and I didn't need to code beyond the capacity of Cursor's $200 a month plan, I probably would do that. But you are limited, because you can burn through that $200 very quickly, especially if you're just using Opus 4.5, for example. In a matter of two days, I burnt through like $75 on the Cursor plan.
>> It's worth noting too, Adam, that they're currently billing Opus at half price on Cursor.
>> So, that's going to also suck when that changes. That's a good point. So, if I were to spend $200 today and Cursor was out of the question because it wasn't enough usage, it would 100% still be the Claude $100 a month plan. And then I would love to have some way to get Gemini 3, but everything I'm hearing about access to the Gemini 3 plan just seems kind of odd right now, with the way you authenticate through Google in the terminal. So I'd probably still do the OpenAI Codex $20 a month plan and then fill in the rest with some API usage. That's the way I would do it.
>> So, just a note, to go back to the Gemini story: with the $20 a month Google plan now, you do get a decent amount of CLI access, and they have opened up the gates to everyone on the CLI.
>> So, all the limits are lifted now, like we're...
>> Well, I haven't used it enough to hit the limits.
>> Okay.
>> The thing that's a little annoying though with the Gemini CLI, and I've mentioned this to their team on Twitter, is that they're a little sneaky with the model routing. Like, you're using the Gemini CLI, and then you're going to run into some limit, and oh, now you're on 2.5 Pro. Oh, and now you're on 2.5 Flash for some reason. But it turns out that if you start the Gemini CLI and you pass dash model and put Gemini 3 Pro preview, then you're going to be only using that model. And I was like, why can't I just do that from inside the CLI? Why do I have to set this launch flag? There are a lot of strange decisions by Google around this release, and that's one of them. But there are ways around them, and $20 does give you a fair bit. And you still have the Gemini app and gemini.google.com, whatever, you can still use that. But then AI Studio also gives you a lot of Gemini 3 usage for free. So there's a lot of places to use it for free, depending on how you hit the limits. Might be worth trying that. Anyway, just wanted to caveat that. Ray, how has your spending changed since these releases?
>> I'm very biased because, you know, I have used Droid. I love it so much. I reached out to them to sponsor my show, and I deeply like the way they handle this infinite context, and that's kind of why I've been so biased. I use Cursor and both of them a lot. For me it might be like, if I had infinite money, which one would I reach for all the time? Adam is really convincing me to just go all in on Cursor and Gemini 3, because I think that's a really interesting perspective, especially the things you're calling out there, Adam, about this longer context thing and these other things that it's picking up as it goes, for that type of experience. And the thing that I find enjoyable about Droid is that they have this unique compression thing that I also want to test with Claude Code. Apparently Claude Code has this new capability now where you can just have this really long chat conversation and it's supposed to do this nice compression, and I think it's built into the way the model does it, or the tool call. It's not really clear what's happening there, but it feels like you should have a longer conversation.
>> So they just announced that, but that was actually for Claude.ai, not for Claude Code, so the website. And I think they've already been doing some work on culling some tool calls with micro-compacts, but the big change there is on the website or the desktop app, so that's worth noting.
>> Oh dang, okay. Yeah, because I've had like 10 million token sessions in Droid, and it sticks to the spec file, it stays anchored with all my rules files, it just keeps going and going. And for me it's like, okay, I have a feature and I can see it from planning to ship, you know, MCP tool calls and everything, all in one session. And so that's what's been interesting for me. And I haven't seen any other... I think Augment Code may have been doing something similar.
>> I need to try them again and see what they've been up to. But that's...
>> They're also moving towards the complicated credit system. So, good luck figuring that out if you do move over to them.
>> And, not to go off the rails there, but it's even worse than I think some of the other ones. And Warp also did the same thing recently. I mean, I think they have to, but it should not be credits. It needs to be cost-plus. Anyway, to your point, it is interesting that you work that way, where you actually are trying to run these very, very long, deep-context conversations. My workflow typically is, I try to break things up to the point where I'm constantly starting new chats. So if you were to go look at my Claude Code history over a period of a day, I'd probably have 20 chats, maybe even more. I haven't looked. Same thing in Cursor, I end up with a bunch of tabs in the Cursor IDE, because I like working on a very focused feature or thing, and then I like to say, okay, now we're going to edit that thing. So I go to a new context, I tee it up with the context I need, and then I go. So it is interesting. I've never thought about it from the standpoint of wanting a 10 million token or super long chat. I actually get annoyed when they start doing context compression, so I just try to avoid that, honestly. When I see any sort of "you've got 2% left," I'm like, all right, time to start a new chat and start working.
>> Yeah. Previously, for my workflow, I would just open new windows all the time, and I would actually be extremely diligent: 50% was my marker, in Claude Code and even in Cursor. But I had the same problem, I'd have 50 chats. I'd have a plan chat, then I'd have a handoff, I'd write the markdown file, you know, waste a bunch of tokens there. And then I actually have documentation workflows that have a spec file and also a progress file, and I would hand the progress file and the spec file into every single new chat. And I was just doing this over and over and over again, and I got kind of exhausted, because I just want to do a bug fix, and then another bug fix, and another iteration. So I had so many chats and I was like, "Okay, I have to go back 40 different chats because I'm still going to go back to phase two of the plan and I need to continue phase two." Because I like to do this waterfall method of plan, build, iterate, iterate, iterate, then fix, polish, then ship it, then go back to phase two, and just keep doing that. And I'd have these diffs, in essence. I also started doing stacked diffs. So every feature gets a PR, then I add another PR on top of that, and I just keep PRing until I have the whole feature set, and then I can ship it and close the PRs, and everything gets built on the server, too. So it was like this.
>> And all these are...
>> Yeah, it probably comes from just working at Apple, old school wise. It's like ship, test, iterate, keep doing this thing again. And maybe for me it was just a little bit exhausting keeping track of all the chats, because the UI for chat history is just only scrolling in a timeline, where I kind of need more of a grid or something. Yeah.
>> Yeah. I usually just kill them. Like, I imagine when I go to a new chat, all my other ones no longer exist anymore, because I feel like I've moved past that point. So it's very interesting. I don't know. I think I need to think a little bit more about how you're working and see if that makes sense for some of the things I'm working on.
>> Well, hold on. I think it really depends on the tool that you're using. Like, I think Droid put a lot of elbow grease into making that possible. But you're fighting a little bit against the gravity of the model's context window, and these models just aren't built for it, they're just not great at doing this kind of thing. So you really have to work around them to make this possible. But I get it, it's a lot nicer of a workflow to just keep going and pick up where you left off. But right now, models just do better early in the context window, and if you're always full with stuff, even if you're compressing all the way, there's stuff in there that may not need to be. So it's a little tricky. So, you know, Adam, I'm a little bit more along your approach too. But I do want to mention as well, if you're using Opus 4.5, it is a lot more resilient to noisy context windows and it's able to keep going further. And I have found that in Claude Code in particular, if I'm doing compacts a few times, it's able to pick up the thread and keep going. However, I don't trust it to write a bunch of novel code at that point in the context window, after a few compacts, because you've got a lot of junk in there. What I do do, though, so I'm a big fan of combining models together, and lately this week what I've been doing a lot, with the Repo Prompt MCP setup in Claude Code, is I'll be like, hey, get a second opinion from GPT 5.1 on this idea before you proceed with your implementation. Because when you're deep in a context with a lot of context and you've been through compacts, the model has a lot of understanding of what you've been doing, so it's able to know what files to pull out and give to GPT 5.1 to think about, and it's able to use that to think through problems. And I found a few times it caught a lot of issues with what Opus was trying to do, even with a detailed plan. Opus made some syntax issues. Like, one issue is it was trying to iterate over a dictionary and pull stuff out and mutate it while iterating, which, if you've used a lot of programming languages, mutating a dictionary during iteration is a known cause for a crash in a native language. I didn't see it right away, and then it gave it to 5.1, and it spotted it right away. So having these models check each other's work is very helpful stuff.
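[Editor's note: a minimal sketch of the bug class described here, with hypothetical names rather than the actual code from the session. In Python the failure mode is a RuntimeError rather than a native crash, but the fix is the same: snapshot before you mutate.]

```python
# Mutating a dict while iterating over it: in CPython this raises
# RuntimeError ("dictionary changed size during iteration") instead of
# crashing outright as it can in a native language.
def drain_naive(scores: dict) -> None:
    for key in scores:          # iterating over the live dict
        if scores[key] < 0:
            del scores[key]     # raises RuntimeError on the next step

def drain_safe(scores: dict) -> None:
    # Snapshot the keys first, then mutate the original dict freely.
    for key in list(scores):
        if scores[key] < 0:
            del scores[key]

data = {"a": 1, "b": -2, "c": 3}
drain_safe(data)
print(data)  # {'a': 1, 'c': 3}
```

This is exactly the kind of defect a second model reviewing the diff can flag even when the first model's plan looked fine.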
Yeah. Any closing thoughts on Opus before we move on to the next topic?
>> Uh, no.
>> All right. So this is a small call-out I just wanted to make. There's someone on my Discord this morning who was reaching out to me, trying to figure out the right setup to get this working for the benchmark in Repo Prompt. But he tested Kimi for Coding, which is a new model that kind of just showed up, and they didn't really talk about what it is. But surprisingly, on Repo Bench it scored 72 or 73%, which is on par with Opus non-thinking, which is very surprising to me. The other Kimi models didn't do nearly as well. So I don't know what special sauce is going on there, and it's TBD with more to come, but if you are interested in Kimi, it's worth taking a look. What's interesting to me is that it's the highest score for any non-Western model. So that to me is a big shot across the bow for the Western labs for coding workflows, that Kimi has something cooking there and it's worth paying attention to, especially now that they've got their nice Black Friday plans. Worth thinking about if you're interested in that. Any other thoughts on that? I guess you guys haven't tried that one out, and I haven't really played with it either, so I just wanted to make a little call-out for it.
Um, cool. So, next thing, I just wanted to take a little moment for us all to go around as we're wrapping up here, a little tour of what your terminal setup is like. I know Adam's using Cursor a lot lately, and Ray is using Droid in the terminal. What's your terminal setup looking like these days, guys? How are you working with these models?
>> I guess I can go.
>> Yeah, go for it.
>> Um, so for me it's literally just zsh in the shell. [laughter]
>> Right on.
>> I don't know. I'm so old school. I think it's just kind of been simple that way. I don't want to overcomplicate it.
>> I do want to learn a little bit more, to be honest. I know Ghostty's come out. I know there's a whole bunch of others, somebody talked about Kitty, and people have rewritten stuff in Rust and so forth. And I know for a lot of these terminals, all this is text, and some terminals can be optimized to re-render for graphics work. But text is only updating on the screen, and my screen refresh is maybe 120 Hz or 60 Hz. I'm not playing games in the terminal with text and stuff like that. So, yeah, I don't know. My CLIs are only drawing so many frames per second, and that's all I need. But I'm really curious, does anyone else use anything else? That's, you know...
else? That's, you know, >> I'm Yeah, I just want to make a quick mention on this. So, so while that default terminals here on Mac OS worth mentioning, uh, Ray, cuz you know, the terminals do change per per OS. But on
Mac OS, the default terminal doesn't have tabs and that alone is like a big blocker to using it for me. So, just I switched over to Ghosti pretty much just for that. And Ghosti is an open source
for that. And Ghosti is an open source terminal. Uh, it's more performant than
terminal. Uh, it's more performant than the the built-in one. And the thing that I like a lot about it beyond the tabs um which you know I'm opening tabs all the time starting new chats you know with
these models um is that it is more performant and so if I'm pasting a lot of text which I do quite a lot with these with these coding agents you feel the difference like if you use the built-in terminal you're pasting a lot
into cloud code it can buckle down cl the terminal and you can feel there's like a big lag for that text to make its way through but if you're using go it's instant and that alone worth worth the switch for. Yeah.
switch for. Yeah.
>> Okay. Hey, I think you're convincing me.
>> Yeah, right on.
>> Um, yeah, tell us, Adam. What's fun?
>> I mean, I agree with you, Ray. So, if I'm in an IDE, I use just the basic Bash or zsh or whatever I happen to be using at the time. I kind of have them all set up.
The actual best terminal for me, a terminal emulator really, would be what they do with warp.dev. And I know I'm not a big fan of their pricing and all the stuff they've changed and the AI side of it, but from the terminal experience side, it's amazing. Just the most minor things, like being able to select text in the middle of something you pasted in and edit it. I know it's minor, but you actually have like an editor, you know what I mean? And then you've got tabs built in. There's some AI assistance in there. Honestly, I really love the terminal experience they've got. So that's my go-to.
>> Very good.
Yeah. I'll mention a little shout-out. One thing that a lot of folks in my Discord have been doing is they run tmux, which is just a terminal multiplexer if you're not familiar. They want to see a dashboard. They want to be like those finance guys that have 18 screens at the same time. They want to watch these eight, twenty, thirty coding agents running at the same time, filling their whole screen. It's a little bit intense for me personally; I'm not running that many at the same time. But if you're one of those people, you might want to look into tmux. It could be worth your time.
>> Cool.
>> That's hilarious.
>> Yeah. All right. Well, you know, we're coming up on time, I think, for today. I just want to do a quick final shout-out as well. Adam, you spent a little time at a conference this week. Any highlights you want to share from the AI engineering conference?
>> Yeah. So, it was AI Native DevCon in New York. It was kind of a short trip because I was in Atlanta and had to fly up. Honestly, there were some amazing people there that I got to talk to, both after the talk I gave, which is really about how we are evaluating LLMs' ability to code and why it's such a difficult problem, and some of the things that I've done so far and discovered. All of that was great. There was one talk in particular that I thought was very interesting. It was from somebody from OpenHands, I think one of the founders of OpenHands. The talk was called "AI Hates Legacy Code." And, you know, a lot of the world that I've lived in in the last three years, honestly, has been... legacy code is like 20, 30 years old, and I've been working with code that's less than five years old. So it's very new. And it really made me think, because I have actually worked with those kinds of code bases. And I know, Eric, you're probably in the same place coming from Unity.
>> There are codebases that have been around for decades, still running stuff written by people who are no longer at the company. So you have no context around it. The point he was trying to make, and there was a bunch to it, but my big takeaway, was that it's not that people working in legacy codebases are all afraid of AI. They're afraid of changing the code themselves, because they don't understand the context.
>> Yeah.
>> So it's like: when you have AI going in there and messing with something that you don't really fully understand, that no one at the company fully understands, but you know it kind of works.
>> It's a very different paradigm to think about. The best part about this is that the tests are often completely misleading and give you no signal whatsoever as to what the code is doing or whether it's working, because it's a feature built 20 years ago by someone, and then a bunch of other features are built on top of it, and you have no idea what the downstream impact of one innocuous-looking line of code is. All the tests are fine; you don't know what you actually just did there. So you've got to watch out.
>> Yeah, I'd love to add some insight to that after you talk about it.
>> Let me close out here real quick and then I'll pass it to you. The big takeaway for me is that I've been thinking about this so much in terms of building startups and writing new code, and it's a very interesting mind shift to think about what it means to work in very old code that a human is scared to actually edit, and then what happens when you put AI in there. Eric made the exact point the guy from OpenHands did: you make a change and it literally breaks things that you don't think are related at all. So is it really that these engineers are afraid of AI, or is it more that they're afraid of changing anything in that code anyway, and AI just adds another layer of uncertainty to the whole thing?
>> Having worked in some large codebases very recently over the last nine days, I totally get it. I'm starting to come around to why it's such a difficult thing, because you can't find the person who wrote the particular thing you're trying to debug or figure out. Anyway, Ray, I'll pass it to you. But I thought that was fascinating, and I know people can resonate with it.
Mhm.
>> Yeah, I think that's a great topic I'd love to deep dive into further, because a lot of my experience matches this. I used to work at a company where the codebase was 30-plus years old too, with file systems that were 20 or 30 years old, and I worked on a project where the file system literally got rewritten. I worked on the update stack, so you're talking about updates with version mismatches and all these legacy-type things. One of the most valuable things, because the company notoriously kept the teams extremely tiny, was the QA engineer, and not necessarily a QA engineer who just wrote tests. The value is in the integration and the actual understanding of the procedures behind the company's output.
>> Every single detail matters, from when it leaves the factory to when it actually goes through all these different processes. These people are extremely valuable and add that extra layer an engineer really can't think about day-to-day when they're writing code. It's actually a different brain that's required for thinking about these problems, because they approach it more from a user-centric standpoint.
>> And so the bugs they find have a higher priority for fixing, and they have these established tests and things they check daily, weekly, or monthly. It gets me thinking that maybe we're also going to turn into QA engineers too,
>> and start to think about it that way, with AI helping us grab the relevant pieces of code or maybe debug faster. But you have to think about these things in a more procedural, business-operations type of way: okay, what parts of my software stack are the highest impact for customers and have to be tested every day, and do we have humans actually going through those workflows if we don't have some type of test or first-gate signal? Because if a build breaks, now you have 20,000 or 30,000, however many thousands of engineers you have across the globe, like Microsoft, and those people are going to wait on a build, and that's wasted money. Some of these tests can help find problems, find the smoke, and then the human goes and takes a look at the fire to see the bigger problems. There have been some massive bugs or refactorings where we forgot literally one function and it broke things in ways where you go, oh yeah, of course we didn't think that was relevant, but now there's this really weird use case with the old version of some other thing that, when it gets introduced into the new ecosystem, throws a wrench into everything. It's like,
>> how are you going to update older code from a thing that doesn't even take software anymore?
>> Yeah. The other thing is, and you guys have definitely dealt with this, there are times when a service exists that has a bug in it, and getting that updated or changed in some way just isn't going to fit the timelines around what you need to release.
>> But it's not a major bug; you can kind of work around it. So you make decisions to build around this flaw.
>> Mhm.
>> Eventually that flaw gets fixed. What happens to the stuff you actually did to work around it? There are all these little micro-decisions that happen to try to get something out the door, and then people forget about them. They forget that, hey, if I update this thing to make it the way it should be, we're going to break all these people over here who had to work around the way it was before. So anyway, when I think about the pockets of people that are AI coders, there are people like us who are fully bought in. We love it. It works really well, especially on fairly new codebases where you're building things up. There are the vibe coders who just want to make zero-to-one stuff.
>> Then there are the people working in these core banking systems that were made 50 years ago. And I have a lot of empathy for those people being scared of putting AI on their systems. Before, I was of the mindset that everybody should be doing it, but I've really started to look at it from the perspective of: if I as a human am scared to change something, putting AI on it is actually going to make it trickier.
>> There are a couple of other axes there, too. An AI model is only as good as its training data, and a lot of these enterprise legacy codebases just don't exist on the web, so the AIs don't have any idea how they work. And it's not just the languages, either. You might have a shortage of COBOL online, which definitely is the case, but it's also that each enterprise codebase is its own beast. Even though it shares technical stacks with other things that are on the web, the code itself is an intricate machine the AI wasn't trained on. And to your point, Adam, there are these flaws that are caked in and worked around that the AI has no idea about. At the end of the day, an AI model only knows what fits in its context window. So if it's poking around trying to find things, it'll make judgment calls with what it can see, but it can't see everything. It's just impossible. For now, this is an inherent limitation, and you have to work with it to figure out what information you should give it. It has to complement your own knowledge. It's a very tricky thing working in these codebases with AI models; you have to treat it almost like you're training up a new hire, except every time you use it, which is kind of upsetting. So you can't just vibe your way through it, which is a big issue.
>> No, you can't vibe your way through these old codebases.
>> Yep. Yeah.
>> But I thought it was a really cool perspective.
>> Yeah.
>> Anyway, there were other good talks, but I think that was probably the one that stood out the most.
>> Very good. Well, thanks for sharing that with us, Adam. And I think honestly that's a good place for us to wrap for today. So, that is episode five of Rate Limited. Thank you so much for joining us this week. I've been your moderator today, Eric. Thanks to Ray and to Adam as well. We hope you join us next time. Don't forget to like and subscribe. Thank you so much. Peace out, everyone. Take care. Take it easy.