Lecture 89: cuTile (from friends at NVIDIA)
By GPU MODE
Summary
Topics Covered
- Part 1
- Part 2
- Part 3
- Part 4
- Part 5
Full Transcript
If that's the case, then over to our two speakers. Let's go.
>> All right.
Do I really go through the agenda?
>> Uh, yeah, like maybe just go through the agenda again.
>> All right. Thanks, everyone, and welcome to this talk about CUDA Tile.
Here is the agenda for today. We're going to first talk about why we got into this big endeavor, which is a major evolution of the CUDA programming model. Then we're going to get into details about the tile fundamentals and the programming model, running through a few examples of how we evolved a GEMM to get the best performance, and explaining how this all extends the CUDA platform as a whole. We're going to introduce TileGym, which we just released, talk about some performance aspects of the system today, and mostly about next steps, because this is only the beginning of this exciting journey with cuTile. We're going to take your questions at the end, but also all along the talk. Don't hesitate to post them in the chat, and Mark is going to interrupt me anytime. I'll be happy to expand, and Jared as well, on anything you want to know more about.
All right. So, programming GPUs has traditionally been divided between two choices. One is operating at the grid level, where you express your program as operations on full tensors in global memory and you rely on the system to handle splitting the work into blocks, dividing the data into tiles, and everything else that is required to exploit the power of GPUs. On the other hand, you have the thread level, where the user has to get into all the nitty-gritty details that come with exploiting those GPUs: managing the threads, the data movement, etc.
What we are introducing now in CUDA is this in-between, tile-level programming model, where the user still has to divide the global workload into blocks and the data into tiles, but then the system makes this all easier and makes the user more productive by mapping everything onto threads at the low level. And this new programming model that we are adding to CUDA is not something brand new. People may already be familiar with Triton, and Philippe Tillet is the person who brought this to GPUs in the first place and made it popular.
So at the grid level, we have always had solutions like cuBLAS, cuDNN, and CuPy; this is all the NumPy-style programming model. PyTorch as well: all those coarse-grained operators make you very productive, but you don't have much control. When you want more control, or you want to do something that the library doesn't allow, you have to go down to the thread level, or you had to, with CUDA C++. There you have to handle all the low-level details of thread synchronization and management, and that can be cumbersome. The tile level is supposed to make you more productive while providing the flexibility that you require to program GPUs.
So why are we doing all this? We want, of course, to make it simpler and more accessible to program GPUs for data-parallel applications. We also want something that is more targetable from any language. With CUDA C++, you had this one option to target our GPUs. By raising the level of abstraction and providing a well-documented spec and hooks to get into the system, we hope that more languages and more DSLs will be able to target CUDA.
Finally, as the tensor cores are getting more and more complex to program, having a stable abstraction will provide us day-zero compatibility with all future GPUs.
CUDA is designed so that the driver will always be able to JIT the program that you write today onto a future GPU architecture. That has been a major aspect of NVIDIA's success with PTX for the last 18 years, and now we're bringing this to tensor cores, and the same guarantee will apply to future GPUs.
So what do we abstract exactly with CUDA Tile?
With SIMT, the unit of execution is the thread. The user has to think about those threads, which are grouped into warps, which are then grouped into blocks, and manage all the synchronization; all those blocks are then grouped into the global launch grid that you're familiar with. On the other hand, with the tile-based programming model, you still have a grid and you launch blocks, but those blocks have virtually a single thread. It's one block of execution and you program it as a single unit. So you don't have to think about threads and warps and all the synchronization.
For the data, with the SIMT programming model, every thread gets some elements of the block. You still think of the block as having some work to do and some data to process, but then you have to dispatch those individual elements to the threads. Whereas with tiles, because you stop your programming at the block level, you just dispatch one tile as a single array, and you program those arrays as single units. The system will map those to threads.
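To make the contrast concrete, here is a minimal sketch in plain Python that emulates both styles for a vector add. It is not cuTile or CUDA code; the names `TILE`, `simt_kernel`, `tile_kernel`, `launch_simt`, and `launch_tile` are hypothetical and exist only for illustration, and the loops stand in for what the hardware would run in parallel.

```python
# Conceptual emulation in plain Python; this is not cuTile or CUDA code.
# All names here are hypothetical and chosen only for illustration.

TILE = 4  # number of elements handled by one block


def simt_kernel(block_idx, thread_idx, a, b, out):
    # SIMT style: every thread computes its own global index and
    # handles exactly one element, with an explicit bounds check.
    i = block_idx * TILE + thread_idx
    if i < len(out):
        out[i] = a[i] + b[i]


def launch_simt(a, b):
    out = [0] * len(a)
    num_blocks = (len(a) + TILE - 1) // TILE
    for block_idx in range(num_blocks):
        for thread_idx in range(TILE):  # the user reasons about each thread
            simt_kernel(block_idx, thread_idx, a, b, out)
    return out


def tile_kernel(block_idx, a, b, out):
    # Tile style: the block loads one tile as a single array, operates on
    # it as a unit, and stores it back. How the elements are spread over
    # threads is not expressed here at all.
    start = block_idx * TILE
    a_tile = a[start:start + TILE]
    b_tile = b[start:start + TILE]
    out[start:start + TILE] = [x + y for x, y in zip(a_tile, b_tile)]


def launch_tile(a, b):
    out = [0] * len(a)
    num_blocks = (len(a) + TILE - 1) // TILE
    for block_idx in range(num_blocks):
        tile_kernel(block_idx, a, b, out)
    return out
```

Both launchers produce the same result; the difference is that the tile version never mentions a thread index, which is the part the speaker says the system takes over.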
So let's look at everything that is involved in this. When you program GPUs, we have the global data in memory. We have to cut it into tiles that blocks will process, and then map those to threads. So you have to think about the tile storage: when those tiles get into the SM, are we using registers, shared memory, tensor memory? What is the layout we're going to use to avoid bank conflicts and maximize memory bandwidth? We have to think about the memory movement strategy: are we using thread-level asynchrony or the TMA engine to handle all these data movements and hide the latency of DRAM accesses? And you also have to think about how you organize your compute strategy: how you organize your threads, and whether you use warp specialization or other advanced strategies for the compute.
With CUDA SIMT, the user has a lot of responsibility. The user has to answer all of those individual questions and code the answers into one perfectly aligned solution to get the best performance.
With the tile programming model, the system takes on more responsibility. As a user, you still have a lot of control, but the system takes control of some other aspects. By construction, that eliminates a bunch of bugs, like race conditions, because shared memory is not exposed to the user. You don't have to synchronize threads. This makes it all safer to program, and that contributes to the productivity aspect. Now, we still want to give control to the user to some extent. We don't want to take away everything and make it a purely opaque system. So Tile IR is designed with a set of hints and knobs that we're going to expose to the user so that you still have ways to instruct the system about all those aspects. What is the memory movement strategy you want for this kernel? What do you want the system to use as a layout? What kind of storage would you prefer for this particular tile or this other tile, whether it's shared memory or tensor memory or that kind of thing? But we designed those as knobs and hints so that we retain portability with future hardware. The hints you give to the system are going to be very effective for the current hardware. But if future hardware has better solutions, or if some memories or the architecture change, the system can discard those hints without losing correctness, and that will give us functional correctness from day zero on any new system.
>> Mehdi, I actually have a lot of questions, but let me know if you'd rather I punt the specific ones to a little bit later.
>> I'm almost at the end of the intro, so yeah, we can pause and take those questions.
>> Okay. So I think hardware portability has been an interesting debate recently. I don't know if you were at the recent Triton conference, but generally there was this sense that it's sort of a mythical idea. It would be great if you could write code and it just ran faster and faster when you upgrade your hardware, but in practice it's not happening because Blackwell is more complicated. Obviously, when you're developing this, you're probably working heavily on Blackwell. So I'm curious where you stand in this debate.
>> Yeah, indeed. I think we have to differentiate between functional portability and performance portability, and the latter is much more difficult to reach.
That's the kind of line we're trying to navigate here. Our foundation and our basis for CUDA Tile is to provide this functional portability: if we release a new GPU, everything should work from day zero. Then we can do more tuning to recover as much performance as possible. That's what drove PTX from the beginning on NVIDIA's hardware. With every new generation, PTX gave us this functional portability. You may not reach peak performance, but you still get a performance boost from the new hardware. That's at least what we were able to achieve for the last 18 years. Now, the knobs and the tuning that we expose are also designed to plug into autotuners. The systems that come with newer GPUs are meant to be plugged into those autotuners to choose the strategies for memory movement, compute, layouts, and storage, so that we can explore with the system what the best mapping is for the newer architecture.
>> But then there's the trade-off, right? I think this is something that, for instance, the Helion team talked about, which is that you get exploding compilation times. In your case, I forget if it was you or the CuTe DSL team, but I do remember seeing some drastic improvements in compilation times. Was that you as well? Is there a trick to make this work?
>> It was the CuTe DSL, probably.
>> Oh, I see. Okay.
>> Yeah. So indeed, compile time can be a problem, but for the kind of hints and knobs that I described, the idea would be that you autotune for some particular architecture and save those knobs. There is a time of exploration to find the best config. If you just take attention: there are so many ways to write an attention kernel, and then there's the current FlashAttention and the latest algorithms. We're not trying again all the possible variants of attention. We know which algorithm is the best. There are some tuning parameters, but we've kind of narrowed it down. I think that's what's going to happen. We're not going to retune through the entire space every time; we're going to evolve it incrementally.
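The "autotune once per architecture and save the knobs" idea can be sketched as a small cache in front of a search. This is an illustration under stated assumptions, not NVIDIA's autotuner: the architecture names, the `SEARCH_SPACE` parameters, and `toy_measure` (a stand-in for timing a real kernel) are all made up.

```python
# Sketch only: architecture names, the search space, and the cost model are
# invented for illustration and do not come from any real autotuner.
import itertools

SEARCH_SPACE = {"tile_m": [64, 128], "tile_n": [64, 128], "stages": [2, 3]}


def candidate_configs():
    """Enumerate every combination of knob values in the search space."""
    keys = list(SEARCH_SPACE)
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        yield dict(zip(keys, values))


def toy_measure(arch, cfg):
    # Stand-in for benchmarking: pretend each architecture prefers one tile
    # size and that fewer pipeline stages are cheaper. Lower is better.
    preferred = {"arch_a": 64, "arch_b": 128}[arch]
    return (abs(cfg["tile_m"] - preferred)
            + abs(cfg["tile_n"] - preferred)
            + cfg["stages"])


def autotune(arch, measure, cache):
    """Return the best config for `arch`, exploring only on a cache miss."""
    if arch not in cache:
        cache[arch] = min(candidate_configs(),
                          key=lambda cfg: measure(arch, cfg))
    return cache[arch]
```

Calling `autotune("arch_a", toy_measure, cache)` twice with the same `cache` dictionary explores the space only the first time. In practice the dictionary could be written to disk so that later runs skip the exploration entirely.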
>> I'd like to hear it. Yeah. Go ahead.
>> Oh, yeah. I was also going to say, I think part of the question there, Mark, in my mind, is the challenge of building something like OpenCL, where you want a programming model that is not just device agnostic but also vendor agnostic. The challenge is the semantic gap. In designing an abstraction, the question is always whether the abstraction can be realized by the target abstract or physical machine that you're trying to lower to. I don't think it's impossible to build a portable abstraction, but the more area you want to cover, the more challenging it becomes. For example, even outside deep learning, shader languages have struggled with this because different GPUs don't have the same memory model. If you want to provide a unified abstraction over all of them, you're either stuck with no guarantee, which defeats your purpose as you stated, or you put a guarantee in, but it becomes so slow that you've traded it for a cost. I think even some of the things happening in Gluon, or Gluon-like abstractions, are around this: they're trying to build synchronization behavior that is platform specific. So I think we have an easier problem. The goals of Triton, or TVM, which I used to work on, or other systems like this, are noble: they try to give you a runs-everywhere experience. But it gets harder because, even though we call many of these things GPUs, they actually have very different semantics in certain areas. I think that's one of the things that makes it challenging.
>> So, I think Mehdi also referred to hints as one way of addressing this, so I'm curious if you could speak about what kinds of specific hints we're talking about. What do they look like?
>> Yeah, we have some examples that I can touch on later where they're annotated. Roughly, right now we have a small set. Partly, as a design-philosophy thing, we're trying to keep the model as abstract as possible so that we have a chance of achieving our goal, and we're being selective about what escape hatches we provide. I'm sure we will provide more over time, but we want to make sure from a design perspective, because we will be committed to them forever, that they're correct. Some of them today are the ability to control the lowering behavior of any memory operation: you can request to never use TMA, to use TMA, or to possibly use TMA. You have the ability to hint on the behavior of memory instructions, like what you expect the dynamic DRAM traffic to be. You also have the ability to hint, at the kernel level, some configuration on how many CTAs it gets mapped to, or whether it's running in sole-tenancy mode on the SMs or not. There are other hints that we have been discussing adding, around buffer reuse and some other things like that. But those are the core early set that we found, at least for some core operations, are enough to get you near state-of-the-art performance. The other thing we'll talk a little bit more about is having some escape hatches. One thing that we've been working heavily on is the ability to interop with existing SIMT code, so you can write a CGA- or CTA-level primitive and then call it in your tile program. We think this is really important because there are things you can't build in this model, like a hash table, or if you want to write a solver or a PRNG, you probably don't want to rewrite that in tile. So those are two ways that we see of addressing some of the gaps, if you will.
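The three-way TMA request described above (never, use, possibly use) can be modeled as a hint that steers lowering without changing results. The sketch below is a hypothetical model in plain Python, not the Tile IR API: `TmaHint`, `Target`, `choose_copy_path`, and `load_tile` are invented names, and the rule that "allow" picks TMA whenever it is available is an assumption made for the example.

```python
# Hypothetical model of "hints, not requirements"; this is not the Tile IR API.
from dataclasses import dataclass
from enum import Enum


class TmaHint(Enum):
    NEVER = "never"    # request that TMA is never used
    ALWAYS = "always"  # request that TMA is used
    ALLOW = "allow"    # let the system decide


@dataclass(frozen=True)
class Target:
    name: str
    has_tma: bool


def choose_copy_path(hint, target):
    """Pick a lowering for a tile load. The hint only steers the choice."""
    if not target.has_tma:
        return "per_thread_loads"  # hint discarded: feature unavailable
    if hint is TmaHint.NEVER:
        return "per_thread_loads"
    # Assumption for this sketch: ALLOW resolves to TMA when it exists.
    return "tma_bulk_copy"


def load_tile(data, start, size, hint, target):
    """Return (tile, path). Every path yields the same tile values."""
    path = choose_copy_path(hint, target)
    if path == "tma_bulk_copy":
        tile = data[start:start + size]  # simulated bulk copy
    else:
        stop = min(start + size, len(data))
        tile = [data[i] for i in range(start, stop)]  # simulated per-element loads
    return tile, path
```

Whatever hint is passed, and whether or not the target supports the feature, `load_tile` returns the same tile; only the reported path differs. That is the property the speakers rely on when they say a future system can discard hints without losing correctness.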
>> But maybe one last question, and then I'll let you guys keep going. One of your colleagues, Matt Pharr, has a wonderful blog post about how auto-vectorization is not a programming model. The reason I'm thinking about it now is that hints are something the compiler can ignore, so you end up in this world where the user is saying something but doesn't really understand what happens; they're kind of trying to trick the compiler, to please it, into doing something specific.
>> Is this something you worry about?
>> Yeah, I'll go first. I worked on TVM for four or five years and was the number two contributor behind Tianqi, and we had taken the Halide approach of doing a scheduling language, which is sort of "hints are the programming model," just to be tongue-in-cheek. You write a very small functional program, and then you use a scheduling language to effectively script the compiler, if you will, to transform that program into your target program. And that suffers from all these challenges. So, not to downplay that worry: I think if it is only hint based, then it becomes very unreliable, and that is the needle we're trying to thread, and we believe we can thread it. One thing that helps, though, and that's slightly different to me from auto-vectorization in, say, C, is this: one of the challenges with a lot of these "convert C to run on my target architecture" compiler efforts is that they never changed the programming model. They're not actually willing to change the program semantics, so the problem is always hard. HLS, high-level synthesis for FPGAs, has this problem. You try to take a C program, which is an imperative, memory-based program, and put it on an FPGA; it's a horrible mismatch with the programming model. Then people have created a cottage industry of solving these problems for themselves, and it's very hard and not very reliable. What we're actually trying to do here, instead of patching the world, is lift the level of abstraction. One thing is that by changing the memory scopes, we give ourselves design freedom on both the software side and the hardware side to evolve the hardware and the programming model. So I think there is a slight difference, and you guys will get to judge whether we've done a good job or not over the next months and years.
Um >> yeah uh to expand on what Jared is saying also like the auto vectorization is not a programming model is really in a context where the compiler has to
raise the level of abstraction and and what they're talking about is this like very the system is full of performance cliff and if you just like are a little
bit out of what the system expect then it totally fails to do what you want and it's terrible. um you could think that
it's terrible. um you could think that we will run into the same kind of issues. I think we can do a better job
issues. I think we can do a better job at providing um predictability like and and smooth out those those those cliff because we are not trying to raise the
level of abstraction in the system. The
system is only lowering the the level of abstraction and so those hints are not helping the system recovering information that is lost in the
abstraction. it's instructing it how to
abstraction. it's instructing it how to map a higher level of abstraction into a lower level of abstraction. And that's
that's that's hopefully um you know we only have half of the problem that they were trying to solve uh with auto vectorization and hopefully it's the simpler part of the problem that we're solving. So it's not that we're much
solving. So it's not that we're much better than what they that all the previous attempt that fails at to auto vectorization. It's that we carved out a
vectorization. It's that we carved out a very much simpler problem so that we don't have to be too smart about it.
>> Excellent. Okay, I'll let you guys keep going again.
Thank you.
>> All right. Thanks. Don't hesitate.
So that was really the end of the first section. We're going to get more into CUDA Tile IR and get a bit more concrete about what we've been doing, so I'm going to talk about Tile IR now. This is what we presented when we talked about Tile IR the first time, at GTC. We showed this picture and explained that Tile IR is really a new abstract machine model for NVIDIA GPUs.
With Tile IR, we can express any program that operates on regular arrays and is data parallel, which is what maps to GPUs normally. We expect that it's going to be easier to achieve peak performance, and for this we're going to invest a lot in the tooling and ergonomics of this new programming model. We just released, yesterday or today I think, an open-source MLIR dialect to make it a more natural target for compilers of DSLs. So if you don't just want to program kernels by hand but want to integrate it into your system, we're investing in tooling to make that much better.
So Tile IR is going to provide you a stable and portable way to target our tensor cores and all the other GPU co-processors. It is future proof, and it's going to evolve very quickly with NVIDIA GPUs in the near future.
This is positioned in the stack just like PTX. It's part of CUDA, and that's what makes it different from other ways to program GPUs. If you think about the Triton compiler or CuTe DSL and that kind of thing, they are built on top of CUDA. Tile IR is inside the platform. That makes a difference because our drivers are able to JIT it, meaning new drivers will be upgraded with new hardware support, new performance tuning, etc. It also means that the debugger and the profilers will natively support it as a first-class product. Any CUDA API that takes PTX will take Tile IR as well. That means that today, with CUDA 13.1, the driver API can load a Tile IR kernel exactly the same way as a PTX kernel. We can put them both in the same fatbin and launch them, and the driver is able to select the right one for the current platform. So it's really inside the platform, and that makes it pretty important to CUDA. We also released a comprehensive specification of the programming model, and Jared is going to give you a more informal presentation with examples later today. But we invite you to also look at the spec and provide feedback on it.
>> Mehdi, why not just extend PTX? Doing this seems like a lot of work, to now support new profilers and debuggers. Could you help us understand a bit more?
>> Yeah, that was definitely something we looked at at the very beginning of Tile IR, which was over two years ago now. We built prototypes trying to do both things, and at some point it just felt that extending PTX would create some sort of Frankenstein model where it would be really hard for us to distinguish, in tile mode, what is under the responsibility of the system and what is not. So instead we went with this clean-slate approach for the new model, so that we also derisk disrupting PTX and what works today. We are instead looking at them as two things that will interoperate and mesh together very well. We are looking into how you can call a PTX function from a Tile IR kernel and that kind of thing. But we believe that the stack, and the tooling, is going to be better and clearer the way we built it.
>> Yeah, to add to that, it's worth remembering that PTX itself is also a virtual abstraction. Part of it, to what Mehdi's saying, is to think about what semantic properties you want from each virtual abstraction. PTX is a SIMT virtual machine, a virtual ISA, where every thread is post-scheduling: each thread knows exactly what set of instructions it will execute, and all the resources have already been allocated to every individual entity in the system. What we're shifting with Tile IR is that each logical thread is pre-scheduling, in some sense: we don't know exactly what every low-level hardware thread will be doing, and that gives us a huge amount of freedom. I think that is the change. If you implement a cooperative primitive in SIMT today, you have to commit resources to it to write that piece of code, because you have to assign every thread what it's doing. But here we have cooperative things, like an MMA or other operations, which we actually have the freedom to readapt to different resource footprints. Back to the previous question you were asking, that's one of the degrees of freedom we have that make the lowering process easier. With auto-vectorization, you've committed to a loop with specific memory effects, and you have to realize that semantics for users. If you don't have to commit to those things, you have a lot of freedom. I think that is why having a separate abstraction here is helpful. If you look at the tensor cores and TMA, they're in some sense co-processors that have a different programming model than the SM itself does. In some ways these are Venn-diagrammed chunks of the programming capabilities of the hardware, and we're trying to group them in ways that are more natural and easier to conceptualize.
>> So a big part of it doesn't seem to be just matmuls, right? What you seem to be hinting at is things like megakernel work, where your scheduling code is kind of complicated, and you're saying: if people are really thinking independently about how threads and warps are cooperating, something's off, this is too hard, and we need to...
>> Yes. And I think there's always the ability for the ultimate skilled humans to then do the best version, to squeeze out the last bit. People are still out there fiddling with SASS, or optimizing assembly in the CPU world too. But to your point, when you start to build bigger systems, let's say you have an LLM inference engine with a bunch of different kernels, you want to be able to develop them independently. Maybe you want to build a megakernel out of them. Maybe you want to do fusion, maybe quantization, maybe a different algorithmic change. By lifting the model up, it's easier to operate at that level. And I think those things push on the big O of the system, or your development speed, much more than individual memory operations, which are in the constant factors at that point when you're developing that way. So that's also part of it: being able to lift the level of abstraction opens up quite a few new things.
>> Yeah, I think that exploring those megakernels and all those combinations was, in the past, just out of reach for most people. Micro-optimizing a kernel as if it runs alone on the GPU is not the same as optimizing it as part of the system or combining kernels together. Having this abstraction, with the system handling more of the low level for you, means you can focus on all those possible combinations and explore them as part of the system.
>> So we have two related questions, from Jack Wolford and anime trailers. I'll try to combine them. I think people have some questions around the composability of this. As in, you call nvcc: do you have to pick a path, or can you have your CUDA kernel here and your tile kernel there? Can they talk to each other? Just any composability.
>> I have one slide on it, if we want to punt it to there, and then when we get into code I can touch on it.
>> So for the purpose of this slide, Tile IR and PTX right now are just different kernels. They can live in the same fatbin, and you can launch them in the same stream, because at this level it is still a GPU executable at the bottom. The question about CUDA interop is a matter of front end and language exposure, which can be made more and more integrated over time without changing these underlying fundamentals of which assembly is used at the bottom.
>> So I was considering punting this, but it's an interesting question, so I'll rephrase it. Aman Chararma is asking: do you think you'll potentially phase out PTX? I'll ask it a bit differently. Jared, you mentioned that you want to be supporting this forever. What gives you that level of confidence to say this is the new programming model that we're going to bet the next 10 years on?
>> Well, I'll tackle the first part first. Like I was saying, there's some sort of Venn diagram arrangement, or enclosing circles, where some programs you want to write are not expressible in tile today, so the PTX model is still there as a bigger semantic model. From that perspective, it's not phased out. In one sense it's more about regrouping which programs are written where. There are a lot of PTX "a"-variant programs today that are non-portable. If you remember, historically, PTX in the non-"a" variants is portable across generations, or forward compatible, and because everyone has dropped down to "a" to get the last level of performance, for a particular kernel that portability guarantee is sometimes now broken. So part of it is migrating certain kernels there. On the confidence: collectively, as a team, we have people who have been doing this in the Google ecosystem, in XLA, on many other NPUs. Maybe this is not a satisfying answer, but from a design perspective, we have confidence that this captures a significant footprint of the programs that people are going to write, and we believe that we have a small enough core that we can evolve it forward to make it the programming model that will last the next decade or two.
I would also add that it's not only that we have people with a variety of backgrounds who have worked in many other ecosystems.
We have also been designing this closely with our hardware architects, thinking about how they see the evolution of GPUs and the
future architectures for the next 10 years, where they want to go, and cross-validating that our assumptions match their expectations and vice versa.
So it has been a very cross-functional effort.
We've also been careful to start with a first version of Tile that is pretty conservative on this. We explored many options that, for now, we tabled
for later. But that's what gives us confidence: it's not just a small team that came up with this and rushed it. As I said earlier, we started this over two years
ago, two and a half years ago I think.
So we took our time, I would say, to get it right.
>> Yeah, exactly. Mehdi's points are great.
And my point about experience is that, for many of us, this is not the first version of this thing that we've built. I think that's also
part of it. At least for me, I've been working on kernel programming stuff for like eight or nine years now, and there are a lot of lessons learned from all the different systems and attempts. That's partly
why I was picking on TVM, which I worked on earlier: we tried to do a lot in the scheduling space, and I feel like we all have some hard-earned lessons there about what works and what doesn't.
>> All right, I have more questions, but I think I'll let you guys keep going.
>> Yeah, we can come back to those later in the talk. All right, so CUDA Tile IR: this is the new virtual ISA. It's a bytecode-based
representation that you get out of it, and it's positioned next to PTX. We also released cuTile, which is a Python DSL that is slightly
higher level than Tile IR but generates Tile IR. So you can write your kernels within Python with cuTile. You can
save the bytecode. You can embed it. You can execute it. You can autotune.
There are a bunch of things, and Jared is going to get into a lot of details.
cuTile is currently our way to program Tile IR. But as part of the platform, Tile IR is much bigger than cuTile.
So we're also going to release a backend for OpenAI Triton that targets Tile IR, so that you get the portability and the day-zero guarantee
that Tile IR offers as a system. That's coming out soon. We think that many existing DSLs are going to be extended
to leverage Tile IR, and we are inviting you to build your own DSLs that target Tile IR as well. Something else we are actively working on is
extending CUDA C++ to have a tile C++ mode. When we get there, and Jared has a teaser later, you're going to be able to mesh tile code in, just like
you've been building CUDA for 18 years, by having tile functions and tile kernels within your source code, all seamlessly integrating with your host
code and your C++ applications.
And to reinforce a point about the platform: today you can embed Tile IR and PTX in a fatbin and launch it using our driver APIs. That means that
today you can take a cuTile kernel, save the Tile IR bytecode, embed it in a C++ application, and launch it from C++ without any Python dependency. Right?
This is one of the powers of being inside the CUDA platform.
All right. With this I'm going to let Jared get into more details about the programming model.
>> Cool, thanks so much, Mehdi. With that I will dive in.
Oh, Mark, are my slides presenting? Sorry, maybe a technical difficulty.
Cool. There we go. Go back one. All right. With that I'm going to jump in.
cuTile is worth talking about. What Mehdi and I have been discussing so far is that we view CUDA Tile as this overall platform effort of bringing tile programming to the CUDA
platform. Although people often refer to CUDA as just CUDA C++, the way we think about it, at least inside of NVIDIA, is that PTX is part of the platform. The developer tools are
part of the platform. The CUDA libraries are part of the platform. The
language exposures are part of the platform. One big push that's been going on inside of NVIDIA in the last few years, if you've been paying attention, is
making Python a first-class citizen. You can see there are the new cuda.core APIs that Leo Fang and his team have been working on, where you can now reach the driver
and its APIs directly from Python. We view our Python cuTile as an instantiation of the same thing:
this is a way to reach the programming model.
There's also the ability to write SIMT kernels today using Numba's GPU mode, to write Python code and turn it directly into SIMT kernels. And so
what cuTile is, is our Python instantiation of Tile. So here's a basic softmax kernel. There are some things that we did differently. It's been popular online, as people have been reacting, to say that this is very
similar to Triton. There is definitely a lot of overlap and inspiration, but the programming model here is a little bit different, and I want to touch on that. One thing to note, as we've been saying, is that
our goal is to bring the abstraction closer to the algorithm that you want. So you can see here a very
simple Python kernel compared to a SIMT kernel. This
is the naive one written in Python. Again, Python is only our
syntax. The change in size between these is really about the programming
model. Same thing here: there's a CuTe CUTLASS kernel over here on the left, and the compression in size is really about the programming
model shift. I really want to make sure that's clear. It's not just about being in Python. It's about the programming model provided by Tile, and the abstraction shift is what allows us
to be concise. So let's go step by step through an example. Here's a basic kernel
doing a softmax. We have two arrays, an input and an output. Both are mutable arrays,
or tensors, in memory, and Tile only has one memory space. So today, if you
want to share a value across tile blocks, it must be in a global array. So we have an input and an in-out parameter here with O, and what we want
to do is get a tile out of one of these global arrays or global tensors. We provide an index, the first argument here, saying which tile we would like to load, and we provide a shape, which describes how
to tile the underlying array. So in this case it will split I into tiles of that shape, and then we will load
the one denoted by zero on the first dimension and by the first grid ID on the second dimension. When we've done that, we end up with what we refer to as a tile. The local tiles are
immutable and local to the logical block. One way to think about
this is that the arrays are how we share values and are what is observable in the system, and the tiles are our local registers, if you will, that we can compute on. So
in this case we take the tile, compute a max over it, compute the exp, the sum, do the division, and then when we're done, at the bottom, we store it back to memory. That in some sense publishes
this tile back to global memory, and it is now visible to other tile blocks. One
thing that's worth noting, though, is that because these things are immutable, what you
have is variable updates. You
can overwrite max here, but you can think of tiles just like scalars
or strings: they are atomic values.
So updating the variable binding in this example does not perform a memory update. It
simply rebinds the name max to a new tile computed by dividing num by den. That is a change from some programming models where everything is
mutable. In this case these are immutable values, and I just really want to drive that point home. So as Mehdi said, we
made this part of the CUDA platform. So
when you produce a tile kernel and compile it via Tile IR today, it's actually possible to launch that kernel directly with cuLaunchKernel. It acts just like a normal SIMT kernel does. Here
again is the Python wrapping of it. You
can see that you write what is equivalent to a global or device function, here softmax, and you use the ct decorator to do that. An
important value that falls out of the design we chose is that device code is
self-contained in this model.
Triton has actually improved this quite a bit in the last year, but
historically managing TMA descriptors, or other host-side resources like that, has been quite painful. If
you want to use TMA, you have to think about the host code. You now have a portability story of understanding both host and device code compilation. You
need to initialize the descriptors and perform updates. All of that we've
made self-contained. So for example,
if the Tile IR compiler chooses to use TMA to realize this program, all the TMA descriptor management is taken care of for you, and it's integrated
as part of the driver and the runtime. It will run the appropriate
host code, initialize the descriptors, pass them through, and handle that complexity for you. That
self-containedness is a nice property we get by integrating into the system. So
we can make some of these sharp edges smoother. Then when you want to
launch, it's just like what you normally do. You copy some arrays over to
the GPU. cuTile has its own launch API, which is a
very thin wrapper around cuLaunchKernel: it does some translation of these input arrays into the low-level arguments that need to be passed to the IR, then
launches the kernel like normal using cuLaunchKernel, and when you're done you copy the result back. So this
is what a complete kernel launch and reading the result back looks like. One thing is that at the DSL
level we've made this work with any of the array-compatible types. So DLPack, PyTorch, all
the frameworks you know and love are easily interconvertible and can be passed directly to the kernel. On interoperation we can talk
a little bit here. Mehdi already said this earlier, but you can write a SIMT kernel and a tile kernel in the same source file, or in different
translation units, and link them together.
You can launch them on the same stream.
You can put them in a CUDA graph. This
is inter-kernel interop, if you want to think about it that way, where you have a tile kernel and a SIMT kernel maybe operating on the same data in global device memory.
But we don't yet have intra-kernel interop. I saw that someone asked a question about this. The design we've been working towards is the ability to
annotate certain sets of device functions and re-export them to Tile IR in a way that lets you call them. One experiment we've been working on internally is allowing you to call
things like CCCL or CUB primitives, and
the ability to bind to other existing CUDA libraries like cuFFT or cuSOLVER.
We believe this is a big feature that people are going to be really excited about in the coming year.
If there are things that you as a user are interested in or excited about, or use cases there, we'd love
feedback. This is something we're hoping to talk more about at GTC this year. So with that, let's jump.
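The softmax walkthrough earlier in the talk (load a tile from a global array, compute on immutable local tiles, store the result back) can be sketched in plain Python. This is only an illustration under stated assumptions: nested tuples stand in for immutable tiles, a list of lists stands in for a global array, and load_tile, store_tile and softmax_block are hypothetical names, not the cuTile API. Each block here handles one band of rows, which simplifies the indexing shown on the slide.

```python
import math

def load_tile(array, index, shape):
    """View `array` (a list of rows) as a grid of tiles of `shape` and
    return the tile at `index` as an immutable tuple of tuples."""
    (ti, tj), (th, tw) = index, shape
    rows = array[ti * th:(ti + 1) * th]
    return tuple(tuple(row[tj * tw:(tj + 1) * tw]) for row in rows)

def store_tile(array, index, tile):
    """Publish a local tile back into the mutable global array."""
    ti, tj = index
    th, tw = len(tile), len(tile[0])
    for r, row in enumerate(tile):
        array[ti * th + r][tj * tw:(tj + 1) * tw] = row

def softmax_block(inp, out, block_id, tile_shape):
    """One logical tile block: load, compute on local tiles, store."""
    x = load_tile(inp, (block_id, 0), tile_shape)
    row_max = tuple(max(row) for row in x)
    num = tuple(tuple(math.exp(v - m) for v in row) for row, m in zip(x, row_max))
    den = tuple(sum(row) for row in num)
    # Rebinding: `num` now names a new tile. The old tile was not mutated
    # and no global memory was written.
    num = tuple(tuple(v / d for v in row) for row, d in zip(num, den))
    store_tile(out, (block_id, 0), num)  # the only write to global memory

inp = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
out = [[0.0] * 4 for _ in range(2)]
for block_id in range(2):  # a grid of 2 blocks, run one after the other here
    softmax_block(inp, out, block_id, (1, 4))
```

The point of the sketch is the split of roles: only store_tile touches the shared array, while every intermediate is a value that gets a new name binding rather than an in-place update.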
>> So Jared, I'm curious to ask a related question: is this that important? Why not just have your PyTorch program, or whatever your calling code is, call
whatever functions it wants?
Is it for fusions?
>> That's actually the meat of it. Imagine you want to do an FFT on a small tile, for example. You don't want to have to write the tile back out to global memory
and then call out, say in a loop, when you're doing some kind of solver or HPC problem. We
want it to be a sort of zero-cost abstraction, if you will, where the callee can operate directly on the data in the memory space where we placed it.
When we do tile compilation and lowering, your tile might be placed in SMEM or registers or TMEM. The idea is that we can pass it directly across the
boundary to the SIMT code without having to re-lay out the tile. So that's what we're driving towards: as much as possible, a zero-cost, low-synchronization way to share data across those functions. There are all
kinds of things that you might want to shell out to, things that you either can't write easily in Tile, or can't write at all, or don't want to write. For example, in CCCL we have all
these very hand-optimized reduction primitives and sorting primitives, and we want people to be able to take advantage of those. That really contributes to the CUDA-X story, the CUDA library story: the
ability to reuse those things. There's
also a ton of stuff that you just shouldn't reimplement, right?
Solvers, randomness. Cloning those is just bad for everybody, in my opinion.
>> You might have more slides about this later, but Matt is asking: give us a sense, for your softmax kernel, of how far off the perf
would be relative to an expertly written one.
>> Yeah, Mehdi has a slide later. I'll just touch on it at a high level.
Our target is to be like PTX. If you've been around NVIDIA for a while, you know that historically PTX was a few percentage points of performance hit over SASS, which is why there are all these famous
stories of Scott Gray at Nervana reverse engineering SASS.
But PTX has closed that gap quite a bit over time, and that's sort of where we are today: for lots of workloads we're within, let's call it, 10 to 20% of what we view as state of the art. Sometimes
we're better, sometimes we match it.
There are some workloads where we're much slower. Obviously that's
not the intent, and some of this is just a matter of optimizing it; this is our first release of the system. Mehdi is going to
touch a little bit more on that. One way to view this, in my mind, is that we want there to be a strong default
fallback baseline for every target and every kernel, because the reality today is that people often write kernels that only run in one place. I've had this happen to me when I was not at NVIDIA:
someone wrote a hand-optimized Hopper kernel, you want to benchmark it on Blackwell, and it doesn't run. Porting it
is extremely hard. That is the world we're trying to help improve.
>> Speaking of compatibility, the first thing I thought when I looked at the snippet was: great, you have an MMA here, so this probably works really well on H100. But
let's say someone wants to run this on Blackwell and would instead want to be calling a tcgen05 primitive.
Presumably what I would expect to see here is some error like, hey, you're on Blackwell, you should be using tcgen05 instead. So that's the
first question. The second
question: considering NVIDIA's hardware changes so much every year, with new intrinsics, how do you handle this? Do you basically expect to be adding new
Pythonic bindings and then, every generation, phasing them out based on what hardware you're running, or is this something that
>> Yeah, so one thing worth addressing is that exposing direct
primitives is more the strategy we've been running with the CuTe DSL. To me, one of the big differences between what we're doing here and the CuTe DSL is the level of abstraction. The CuTe DSL is:
you have everything. We're going to give you all of the primitives, at the lowest level of
abstraction, and you can do what you want
there. Here we're exposing abstracted
things. So there is no tcgen05 operation.
There is an MMA operation, and MMA is forever stable in that sense.
>> It's the most stable algorithm in human history, you know.
>> Right. Well, but MMA itself is abstract.
It takes an A, a B, and an accumulator and produces a new
accumulator. You can argue that we're hiding details from people, but what we're actually trying to do is lift the level of
abstraction such that it doesn't matter what exact tcgen05 or matrix multiplication lowering or compute strategy is best
for your target. For example,
even across 5090 and B100 or B200 it's different, because they have different hardware, right?
>> And so presumably, for things like quantization, you would also expose
dequant and quant primitives?
>> Yes. And that's some of why
the low-precision stuff is coming in a following release: we want to make sure we get the design
really right. You can do FP8
today, but some of the low-bit formats, like 4-bit, 6-bit, NVFP4, MXFP4, are coming in a follow-on release this year, and we've been trying to get the design of the scaled
operations right for that same reason.
>> I see. So anime trailer is asking: they thought CUDA Tile was only supported on Blackwell, but it seems like you guys are hinting that
>> Today, yes. With the first release we're again just trying to phase all this
stuff in. Ampere is coming, Hopper is
coming. Our goal is that by,
let's call it, steady state next year, we will support Ampere onward, and then when a new target comes out we will bring day-one support for it.
>> Okay. And I guess the best place to know what's coming would presumably be the cuTile GitHub repo, for the latest stuff?
>> Yeah. We just released, so we will update people as we communicate. We don't have a product manager here with us, like Rob, who will communicate roadmaps. Stephen always does a huge state
of CUDA talk, and there will also probably be a Tile talk at GTC this year where we'll give another big update.
>> One thing I would say: cuTile is an open source front end for Tile IR, right? It's on GitHub, you can monitor it, you can get everything there. Great project.
The Tile IR support, what you're asking about with new features and new hardware support, is part of CUDA. So
you're going to get it with CUDA 13.2, 13.3. It's the
usual channel and the usual way to get information about CUDA: the developer hub and the release notes. Every time we release a new CUDA version, we
mention everything that is new
there, and that's the best way to get information about new Tile features. Now
of course, because of the portability, this exact matmul that is on the screen today works on Blackwell, and when you get Tile IR with 13.2
it's going to work on Ampere as well,
right? Without changing cuTile. So
cuTile will not tell you that it now supports Ampere. It's going to be directly
supported by the system.
>> Yeah. So whenever the release that ships the support comes, you'll be able to just get up and running. And the cool thing is you can literally take the bytecode; you don't even need the source program at that point, assuming you don't want to
change the tile size or anything.
>> Okay, go ahead. Oh, one last question. Theo's asking: you all must have seen that picture by Phil Tillet where he shows the Pareto frontier of performance versus, basically,
developer experience. Where do you
all feel you stand on that spectrum?
>> That's a good question. I think right now we've been trying to get the experience and the abstraction right
first. That's why,
when people ask about performance, it's a moving target: already what we've shipped versus what we have internally has changed week over week. So
I think we're trying to prioritize that first, because many of the existing tools at NVIDIA are performance first, or that's the way that I think about them.
>> Okay. Got it.
Okay. There are more questions, but I think we'll just stop.
>> Yeah, let me just get in. Hopefully I'll
answer some of them by showing
code. Okay, so we've talked a little
bit about it. Roughly, this is a cartoon of the Tile IR abstract machine.
I'm a compiler PhD, or was, so I often think about things in terms of abstraction and semantics. So what is the abstract machine, really?
Today, the physical machine for the GPU you're running on is SMs and memory spaces and a
compute hierarchy. Tile IR's
abstraction of the machine is this: there's
global memory. Global memory has some
buffers in it, let's say the A, the B, and the C that we just saw. It's
got some code that represents the tile kernels and tile functions that we're going to run. And then it's got n tile blocks, which are the logical threads of the system, each parameterized by a coordinate in the overall grid.
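As a rough mental model of that abstract machine, here is a small Python sketch. It assumes blocks run one after another (real tile blocks have no such ordering guarantee), and launch and record_block are hypothetical names chosen for this illustration. Block coordinates are always 3-tuples, with unused dimensions fixed at 0, matching the behavior the talk describes for the block ID instruction.

```python
from itertools import product

def launch(grid, kernel, *buffers):
    """Run `kernel` once per logical tile block in `grid`.

    `grid` has 1 to 3 extents. Unused dimensions get extent 1, so their
    block coordinate is always 0."""
    gx, gy, gz = tuple(grid) + (1,) * (3 - len(grid))
    for z, y, x in product(range(gz), range(gy), range(gx)):
        # A block sees only its own coordinate and the global buffers.
        kernel((x, y, z), *buffers)

def record_block(block_id, log):
    """A trivial kernel: publish this block's coordinate to a global buffer."""
    log.append(block_id)

log = []  # stands in for a buffer in global memory
launch((2, 3), record_block, log)
```

With a 2D grid of 2 by 3, the kernel body runs six times and every recorded coordinate has a z component of 0.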
So that's what the abstract machine looks like. Let's look at an actual
Tile IR program. We're zooming in on the code part here. This is a
cleaned-up-for-the-slide version of the IR that is output by
lowering the Python kernel I just
showed you. If you run cuTile today,
you can actually get the IR out and inspect it, and you get something
like this. There are a few more things in
the production version.
They're annotated; there are some
assumptions and other things that I've removed here, just as full disclosure in case you see new instructions when you go do this yourself. You can
see that all the arguments have been translated and expanded. So here
we pass all of the input arguments along with their shapes and strides. So that
matmul I showed you is actually fully dynamic over the shape of the input
matrices. What that means is
that we can invoke this, and even though we tune for a particular tile size, it will work on any matrix that you pass in. You could also write a kernel that's
completely statically specified, and you'd get slightly different optimization
behavior, if you want to
optimize further. What you can see here, though, is that all the input arguments come in as tiles.
Every value in Tile IR is effectively one of two things. The first is
a tile; a zero-rank tile is a scalar, so a tile of i64 is a
zero-rank tensor containing exactly one element, which is a single number. You can see the same thing is true for pointers. The second is what we call views, and I'll talk about those in
a moment. There's a family of view
types, but those are the two fundamental kinds of types that you get in Tile IR. So, how does this matrix multiplication lower? Well,
here, at the Python level, if you query the block ID for dimension 0 or 1, you get the
grid coordinates. Here we're going to
use a 2D grid to implement this matrix
multiplication. So what does that lower
to? There's a Tile IR instruction
called get tile block ID, which returns a 3-tuple of the X ID, the Y ID, and the Z ID. If you're not using one of
these, it will be zero when you launch, just like it is in CUDA
today. That is just the
basic setup. If we zoom back in on the
abstract machine, here's another cartoon picture of what a local tile block looks
like. You can see the tile block has
some local state that is only visible to the tile block, like its grid coordinate and the instruction that
it's executing. So in this case, when we
invoke get tile block ID, logically we bind SSA value zero to a zero-rank
tile. You can think about this as storing a zero, as a zero-rank tile, into the
logical register named by zero. We do the same thing for one. Now
let's do that for something a little
more complicated. Here we're going to
allocate a zeros buffer for accumulating the MMA. We're doing a simple K-reduction MMA where we're
going to sum over the K tiles.
In this case, when we allocate the sum, what this looks like is very simple.
It's actually a constant value, allocated using the
constant operation. The constant is an F32 tile containing the constant value 0.0.
So this is not too complicated.
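The register-binding picture can be sketched as follows. This is an illustrative model only: the Tile class, the constant helper, and the "i32" element type for block IDs are assumptions made for the sketch, while the register numbers 0, 1 and 5 and the 64x32 F32 accumulator follow the slide walkthrough.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tile:
    shape: tuple  # () means zero-rank, i.e. a scalar
    dtype: str
    data: tuple   # elements in row-major order

def constant(value, shape, dtype):
    """Build a tile with every element equal to `value`."""
    count = 1
    for extent in shape:
        count *= extent
    return Tile(tuple(shape), dtype, (value,) * count)

# Local state of one tile block: its grid coordinate and its registers.
coord = (0, 0, 0)
registers = {}
registers[0] = constant(coord[0], (), "i32")   # block id x as a zero-rank tile
registers[1] = constant(coord[1], (), "i32")   # block id y as a zero-rank tile
registers[5] = constant(0.0, (64, 32), "f32")  # zero accumulator for the MMA
```

Because Tile is frozen, a later "update" of the accumulator can only bind a register to a new Tile value; the existing one cannot be modified in place.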
What this looks like, again: we're in the zero tile block, pointed to by the grid coordinate (0, 0, 0). The current instruction we're executing is that constant instruction. What
we're going to do is allocate and store a multi-dimensional tensor into logical register 5. So
you can see here that this is a 64x32 tile of F32 values. The type signature, tile 64x32xF32,
is a 64x32 tile of 32-bit floating-point values. And it looks just like this.
That is how execution proceeds: every time we assign one of these in the logical model, we're effectively updating the register with the value, and again these are all immutable values. So if we look at how we go
through matrix multiplication, we use a helper here called num tiles at the DSL
level. What this does is
query the number of tiles, based on viewing the array with the shape that's provided by the second argument.
This turns into some basic arithmetic to figure out the trip count of the loop, because what we're doing is
running the loop once per tile. If we have
k tiles along that dimension, we run the loop k
times. Now, the interesting part here is
how load is lowered. Oh, so here's the loop, right. One thing we're talking about is that we're in MLIR, and all the control flow is dataflow-y,
functionalized, however you want to think about it. For example, structured control flow is in upstream MLIR. This for loop is going to go from
zero to the upper bound, it steps by one every time, and on every iteration it needs to return the next
iteration's value. So in every
iteration we're updating the accumulator buffer, which, if we pop back for a second, is realized by assigning to sum there at
the bottom. The Python compiler
handles this translation from normal Python assignment and control flow down into the MLIR control
flow. So what we're going to do is
run this loop and update the variable
every time. The value from the final iteration is
the final value that we want to store.
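A small Python sketch of that functionalized loop: the loop takes an initial carried value, the body returns the value for the next iteration, and the loop as a whole evaluates to the last one. The names num_tiles, for_loop and body are hypothetical, and the ceiling division inside num_tiles is an assumption about how a ragged last tile would be counted.

```python
def num_tiles(extent, tile_extent):
    """Tiles needed to cover `extent` (ceiling division, an assumption)."""
    return -(-extent // tile_extent)

def for_loop(lower, upper, step, init, body):
    """Functional loop: `body(i, carried)` returns the value carried into
    the next iteration; the loop's result is the last carried value."""
    carried = init
    for i in range(lower, upper, step):
        carried = body(i, carried)
    return carried

K, TILE_K = 100, 32
data = list(range(K))

def body(k, acc):
    chunk = data[k * TILE_K:(k + 1) * TILE_K]  # the k-th tile along K
    return acc + sum(chunk)                    # new accumulator, old one untouched

trips = num_tiles(K, TILE_K)
total = for_loop(0, trips, 1, 0, body)
```

Writing `acc = acc + ...` inside an ordinary Python loop expresses the same thing; the functional form just makes the loop-carried value explicit, which is what the compiler has to reconstruct from the assignment.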
So let's jump ahead to the load. The load is the most interesting part. What the load logically does today, with ct.load, is this: you take an array in the cuTile programming model, you give an index, and you give a shape. That views A as a collection of tiles of that shape, picks the tile at that index, and hands it back to you. What this looks like in Tile IR: the first abstraction that we have is make tensor view. Make tensor view is in some sense the top of the view abstraction hierarchy. You start with a pointer into memory, and this is where we actually differ quite a bit from Triton. We assign a layout, the shapes and strides, to that pointer, and from then on we can logically work with that value as if it were an aggregate in memory. In this case you can see that when we build the tensor view we get out a tensor view, here with the type annotation, and I'm going to zoom in here. When we look at that sequence, you can view any load in the cuTile programming model right now conceptually as these three steps. We start with arg 0 being a pointer to global memory, and we build the tensor view. What the tensor view does is give us the volume of the tensor, based on the strides and the shapes, and that logical register, or SSA value, will then contain this tensor view. The important thing here is that the tensor view is actually only a compile-time concept.
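As an illustration of what "a pointer plus shapes and strides" buys you, here is a small plain-Python sketch. It is not the Tile IR operation: `TensorViewSketch`, its fields, and `offset` are hypothetical names, and a flat Python list stands in for global memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorViewSketch:
    base: int        # stands in for the pointer: an offset into flat memory
    shape: tuple     # extent of each dimension
    strides: tuple   # step, in elements, for each dimension

    def offset(self, *index):
        # Translate a logical multi-dimensional index into a flat offset.
        for i, extent in zip(index, self.shape):
            if not 0 <= i < extent:
                raise IndexError(f"index {index} out of bounds for shape {self.shape}")
        return self.base + sum(i * s for i, s in zip(index, self.strides))


memory = list(range(12))  # flat buffer, the "pointer" target

# The same memory viewed as a row-major 3x4 matrix...
row_major = TensorViewSketch(base=0, shape=(3, 4), strides=(4, 1))
# ...and as its 4x3 transpose, with no data movement, only different metadata.
transposed = TensorViewSketch(base=0, shape=(4, 3), strides=(1, 4))

element = memory[row_major.offset(1, 2)]
same_element = memory[transposed.offset(2, 1)]
```

The view carries no data of its own; it only changes how indices are interpreted against the underlying buffer.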
What it does is just associate metadata with the pointer. One thing that's worth noting: I said this is dynamic because the values for the shapes and the strides come from dynamic values. You can see the type of the tensor view is ? by ? by f32. That's because we've erased the static information, since we don't know it, which allows us to operate on a dynamically sized piece of memory. So the core concept is that a tensor view gives us an arbitrary, statically ranked, dynamically shaped piece of global memory, and it looks like this. Then we make a partition view, which today is the only subview that we have. What a partition view does is provide a perfect, or even, tiling of the underlying view with a given tile size. You can see here that the tile is actually part of the type of the partition view, so it is a statically known property. We know statically that each tile is 64x16, but we can do it over dynamic memory. I keep hammering this because it has sometimes been confusing: even though we have statically sized tiles, you can still write very flexible programs over dynamically sized chunks of memory. Now PV3 will contain this partition view, and we can load and store from the partition view using the load and store operations. The final part is that we do these loads twice and compute the MMA here at the bottom. So what do we need to do for this final step? This is what the MMA looks like. You can see that the MMA itself takes in the two loads that I performed, takes in the accumulator buffer, and produces a new accumulator buffer.
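The load, MMA, and store steps can be sketched with NumPy on the CPU. This is only a model of the semantics described in the talk, not cuTile code: `load_tile`, `store_tile`, `mma`, and `tiled_matmul` are hypothetical helpers, the two outer loops stand in for the grid of blocks, and the sketch assumes the matrix dimensions divide evenly by the tile sizes.

```python
import numpy as np


def load_tile(array, index, tile_shape):
    # View `array` as a grid of tiles of `tile_shape`; return the tile at `index`.
    (i, j), (th, tw) = index, tile_shape
    return array[i * th:(i + 1) * th, j * tw:(j + 1) * tw]


def store_tile(array, index, tile):
    # The tile's own shape implies the partitioning of `array`.
    (i, j), (th, tw) = index, tile.shape
    array[i * th:(i + 1) * th, j * tw:(j + 1) * tw] = tile


def mma(a_tile, b_tile, acc):
    # Partial matrix multiplication plus accumulation; returns a new accumulator.
    return a_tile @ b_tile + acc


def tiled_matmul(a, b, tm, tn, tk):
    m, k = a.shape
    _, n = b.shape
    assert m % tm == 0 and n % tn == 0 and k % tk == 0, "sketch assumes even tiling"
    c = np.zeros((m, n), dtype=np.float32)
    for x in range(m // tm):          # first grid dimension
        for y in range(n // tn):      # second grid dimension
            acc = np.zeros((tm, tn), dtype=np.float32)
            for kk in range(k // tk):
                a_tile = load_tile(a, (x, kk), (tm, tk))
                b_tile = load_tile(b, (kk, y), (tk, tn))
                acc = mma(a_tile, b_tile, acc)
            store_tile(c, (x, y), acc)
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 6)).astype(np.float32)
b = rng.standard_normal((6, 4)).astype(np.float32)
c = tiled_matmul(a, b, tm=4, tn=2, tk=3)
```

Note that the tile sizes are fixed per call while the matrix shapes are read at run time, which mirrors the "static tiles over dynamic memory" point above.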
Roughly, MMA's semantic is: do a partial matrix multiplication and an accumulation. When Mark was asking about tcgen05 on Blackwell: with tcgen05 available, this will in many cases become tcgen05, but it could also be an emulation loop, it could be some other MMA on an older Ampere chip, or any other kind of realization that we choose. One thing that's worth noting here, since I didn't talk about the load before: the load view TKO operation is how we actually get a value out of the partition view. Here we pass the loop index and the grid coordinate, and in the square brackets there we load the tile denoted by that index. You can see that the type signature on the MLIR side takes a partition view and gives us back a tile that's 16x32xf32. Note that it also returns a token. One part that I haven't talked about, and one thing that is different in Tile IR, is that all memory operations suffixed with TKO are what we call token-ordered operations, and memory operations in Tile IR are unordered by default. I'll talk a little bit about the memory model in a moment, but one thing we've done differently from Triton or other systems is that we've actually built and specified a memory consistency model for all the operations. We've designed it to mirror some of the semantics of PTX's weak memory model in order to be compatible with the overall system. Finally, to wrap up the matrix multiplication, we do the same thing for storing that we did for loading. If we look at the IR, we build a tensor view, we build a partition view, and then, symmetric to load view TKO, we have store view TKO. It takes the semantic of a store, the value to store, a partition view, and an index, and it stores the value back into memory. So at some level the kernel and the IR are not that far away from each other. To really drive that point home: much of the front-end abstraction is just a front end, and a lot of the actual semantics are powered by Tile IR itself. So, just to touch on the memory model before we close. This is probably one of the most complicated parts.
>> Before we go to the memory model...
>> Yeah, go ahead.
>> Just a few quick questions going back. First off, that function MMAF: this is not calling into a higher-level library, right? This is just whatever the hardware happens to support over smaller tiles, and this is what both the PTX and the tile programming model are going to be using.
>> Yeah, so MMAF is a built-in. If you go read the Tile IR spec, I think the final number is maybe 100-ish, 105 ops or something like that, and MMAF is one of those ops. There's also MMAI. We decided to split the floating-point and integer operations and type them, so there are two variants. MMAF is roughly: take two tiles of any floating-point type that are compatible in shape, plus an accumulator buffer that's compatible in shape, and return a new accumulator buffer. That's the semantic of the operation. And then the lowering, to your question, is defined by what target you're on, what features are enabled, what memory space the operands are in, and so on and so forth.
>> And specifically for the type, the templated argument, the tile like 64x32xf32: it's not like you're packing everything into 32-bit values. This can just support whatever floating point?
>> Yeah. Thanks for calling that out, because I forgot to say it. There's some compile-time, or JIT-time, parameterization happening. I didn't call out the arguments, but we have three input arguments that are dynamic, A, B, and C, and then we have this TS, which is a constant argument. Just like in Triton or CuTe DSL, these are inlined. But the dtype is another one of these. In this case we've specialized the kernel on the tile triple that we passed, something like (64, 16, 32), and then, because of the dtype of A and B, we specialized the kernel on 32-bit floating point. If you just go back and invoke this kernel with f16, you'll get an f16 version. It could be FP8, and so on and so forth.
>> Last question. What does tile=sum mean, and the...
>> Oh, where, sorry? On this one?
>> No, the one before. The very last line.
>> Oh, okay. Yeah, so store has keyword arguments. tile is the value to store.
>> So...
>> Oh, I see, I see. Okay. Store semantic: out. It's out.
>> Yes, it's out. The store semantic in tile is that you take the global array, then you take an index, and that index is implied to be the index you get when viewing C as if it were tiled by the size of the tile argument. The goal here is to be convenient: if you're storing a tile, you don't have to think about also passing the tile parameters or sizes. All of that is inferred and threaded through.
>> So with a few more characters it would be something like... by sum you mean append, and by tile you... okay, so this is like the style by which you compose what would be other values?
>> Oh, no, no. I think sum might just be a misnomer in the program that is making it more confusing. That could be A or B or sum or any tile that you've computed; it's just the value that you want to store. If you think about it, what store needs is a tile of a specific shape, which implies a partitioning, and then an index to store at. If we look at the IR again, what this becomes is: make a tensor view with the dimensions of C. Then we partition with the size of sum, which in the last slide is 64 by 32. So we build a 64 by 32 partition and then we store at (X, Y). If we think back to this slide, our goal is to compute the (X, Y) tile and store it back to memory. We compute that incremental sum at the bottom of every loop iteration, and when we're done we have the tile specified at (X, Y), where X is the first grid dimension and Y is the second grid dimension.
>> Do you have a global logging flag for people to look at the generated Tile IR from a cuTile program, or do they...
>> Yes, there is a flag. I was very sick last week, and I feel like my brain's cache is missing. It's something like "emit tile", you know, and I believe it's documented in the cuTile repo. If not, we can open a question there or get back to someone on it. We have a slide later about how to export the bytecode from the cuTile front end.
>> There's also an environment variable somewhere that you can set. It's long enough that it doesn't fit in my memory.
>> Yes, but you'd require internal tooling to print the dialect.
>> Yeah. Right. One thing that's nice is that with the dialect we have the disassembler as well. If you get these bytecode blobs, you can disassemble them and look at them in the textual format. One decision we made is that the textual format is not stable, so we're trying to discourage people from using it as an exchange format. It's not something like C++ code, where we want people to build around processing text.
>> Jack Wolfart is saying you can set CUDA tile dump bytecode folder, but that just gives you the bytecode.
>> Right, and then you can run the disassembler, which is now in the MLIR, and which we talked about shipping in the toolkit. We just haven't done it yet. If there's a need and desire for it, that's something we can ship automatically as well in the future. This is the kind of feedback that's great. You guys are in some ways our earliest users.
>> Excellent. Okay. So, Matt in chat really wants to hear about the memory model. So let's keep going.
>> So, Tile IR has a formally defined memory consistency model. Many of the people who worked on the PTX one in NVIDIA's memory consistency group helped with this one. One subtle thing that we've observed in other DSLs, a challenge in this space that is not always clear to people when they first start working on it, is that almost all of the memory consistency model work done for almost all devices is about scalar values. It's about what happens when I write a scalar value to a memory address. The challenge is that by lifting the programming model, our primitive values are now actually aggregate values. So when you go to write a value, you're not doing one write to memory, you're doing many writes to memory. We wanted to define, just as in a normal memory model, the semantics of interleaved memory operations and the expected values, so that people can reason about this. What we've ended up with today is a weak memory model in which all of these weak, unordered memory instructions are suffixed with TKO. We call them token-ordered. By default, any two memory operations are unordered unless you put a token between them, linking them, which orders the operations and allows you to reason about their execution with respect to each other. We didn't actually use any tokens in the last slide because the loads and stores have no aliasing, so no ordering is required to write a correct kernel. The reason we decided on this design is that we believe it lets us sidestep some of the performance penalty of building an abstraction that you talked about earlier, Mark. Part of the performance problem, and the reason you see people exposing things like barriers, is that with a strong memory model, for example if we guaranteed that every value from a store was written before we execute the next load, we would effectively need to synchronize every thread participating in those two computations. What's different here is that threads on a piece of hardware today use caches to enable their consistency. But when I'm using many threads across the system to realize a computation, I don't necessarily guarantee that they observe each other's behavior, so I would need to perform some kind of hard synchronization in order to observe all those updates. By shipping a weaker model we are racy in that sense, but it's defined raciness.
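A toy model can make "unordered unless linked by a token" concrete. The sketch below is plain Python, not Tile IR: `MemOp`, `Token`, and `ordered_before` are hypothetical names, and the model only captures the rule described above, that two memory operations are ordered exactly when a chain of tokens connects them.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count()


@dataclass(frozen=True)
class Token:
    producer: int  # id of the operation that produced this token


@dataclass
class MemOp:
    name: str
    waits_on: tuple = ()  # tokens this operation consumes
    op_id: int = field(default_factory=lambda: next(_next_id))

    @property
    def token(self):
        # Every memory operation yields a token that later operations may consume.
        return Token(self.op_id)


def ordered_before(first, second, ops):
    # `first` is ordered before `second` only if a chain of tokens links them.
    by_id = {op.op_id: op for op in ops}
    pending, seen = [second], set()
    while pending:
        op = pending.pop()
        for tok in op.waits_on:
            if tok.producer == first.op_id:
                return True
            if tok.producer not in seen:
                seen.add(tok.producer)
                pending.append(by_id[tok.producer])
    return False


store_c = MemOp("store C[0, 0]")
load_a = MemOp("load A[0, 0]")                           # no token: unordered
load_c = MemOp("load C[0, 0]", waits_on=(store_c.token,))  # must see the store
ops = [store_c, load_a, load_c]
```

In this model the load of A may be reordered freely with the store to C, while the load of C is pinned after it because it consumes the store's token.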
And then, as I showed here, you can see that the stores and loads take a semantic, so you can actually increase the strength. All of our memory operations support weak, acquire, and release semantics, so you can benefit from this, and we've defined all of it so that you can write programs around the memory system. There are parts of Triton, for example, where the memory model is sort of implementation-defined and you can see it leaking, and this is something that we wanted to address overall. One final bit: the front ends today all implement a strategy for token threading, so we haven't actually exposed tokens to end-user programming yet. They are all inferred by analysis and then inserted for you by the front end. The specification goes into more detail. There is some desire from people at NVIDIA to write a paper about it as well. Hopefully that will happen at some point and you'll be able to get a really deep dive. If you have more questions, I'm sure Simon or some of the other folks who worked on it will be happy to go deeper.
Finally, I just want to touch on a few more little things. You might say, okay, you took all the low-level control from us, so what can I actually do to optimize my kernels? I'm just going to give two quick, simple examples. One is that the effects of the system are still visible to you in many ways. For example, if you want to write a GEMM, the one I just showed you is not the fastest GEMM; there are many ways to do variations on it. Here's a GEMM that swizzles the grid dimensions in order to get better reuse across CTAs. There's no defined behavior here: the CTA rasterization order is something that can vary. So this is an optimization you can explore. What this looks like today, if we zoom in, is that you write this swizzle 2D function, which is just a piece of Python code that maps grid coordinates. This is the kind of optimization that we want researchers or kernel developers to be able to iterate on: you didn't need to drop down into really low-level details to play with something like this. You can just write a little bit of code to make this better and get better locality.
One other thing, going back here: if you look at the top, this is how one of the hints is passed in the DSL today. The decorator at the top is passing one of the CTA hints, and this one is defined only for SM100, so if you're on a different target this hint will not apply. Part of our hint system, as Mehdi talked about, is the ability to have per-target hints, so that a single kernel can run everywhere with different behavior specified. So, one other one.
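The grid swizzle mentioned a moment ago can be sketched in plain Python. The body of the function shown in the talk is not in the transcript, so this is an assumption: `swizzle_2d_sketch` uses a common grouped ordering in which consecutive block ids walk down a small band of tile rows before moving to the next column, so neighbouring blocks reuse the same input tiles.

```python
def swizzle_2d_sketch(block_id, num_tiles_m, num_tiles_n, group_size_m):
    # Hypothetical mapping from a linear block id to a (tile_m, tile_n) pair.
    blocks_per_group = group_size_m * num_tiles_n
    group = block_id // blocks_per_group
    first_m = group * group_size_m
    # The last group may have fewer rows than group_size_m.
    rows_in_group = min(num_tiles_m - first_m, group_size_m)
    within_group = block_id % blocks_per_group
    tile_m = first_m + within_group % rows_in_group
    tile_n = within_group // rows_in_group
    return tile_m, tile_n


num_m, num_n, group_m = 5, 3, 2
mapping = [swizzle_2d_sketch(b, num_m, num_n, group_m) for b in range(num_m * num_n)]
```

With these parameters the first four blocks cover a 2x2 patch of output tiles instead of four tiles along one row, and every output tile is still visited exactly once.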
>> So what is that hint, though? It just says SM100 equals two. What was that?
>> Oh, no. The hint keyword is num_ctas. This sets the number of CTAs being used in the kernel as it's being lowered.
>> Okay. What does the ByTarget mean, then?
>> ByTarget in this case says: only set num_ctas for SM100. The DSL is trying to be friendlier to users. If you set it unconditionally, I believe it will set it for all targets; in this case it narrows it to a single target. And you can provide SM100, SM102, you know, you can put other configurations there in the ByTarget class.
>> So, sorry, what does two mean then?
>> That's the value of num_ctas. You want two CTAs.
>> Oh, I see, I see. Okay.
>> Yeah.
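The per-target behaviour described in this exchange can be modelled in a few lines of plain Python. This is a stand-in written for illustration, not the class from the DSL: `ByTargetSketch`, `resolve`, and the target strings are all hypothetical, and the only behaviour assumed is "use the value for the matching target, otherwise fall back to a default, which may mean the hint is not applied".

```python
class ByTargetSketch:
    """Hypothetical per-target hint: a value per target plus an optional default."""

    def __init__(self, default=None, **per_target):
        self.default = default
        self.per_target = per_target

    def resolve(self, target):
        # Return the hint for `target`, or the default when none was given.
        return self.per_target.get(target, self.default)


num_ctas_hint = ByTargetSketch(sm_100=2)

on_sm100 = num_ctas_hint.resolve("sm_100")   # hint applies: two CTAs
on_other = num_ctas_hint.resolve("sm_90")    # no entry: hint is not applied
```

A single kernel definition can then carry different hints for different targets without any conditional code in the kernel body.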
>> But at the IR level, these actually just become attributes on the function. I should have put an example in here, but I don't have one showing what it looks like when it's rendered. Here's another quick example, and then I should probably hand off to Mehdi: static persistence. If you want to write a persistent kernel, for example, what that looks like today is that you just write this prefix loop. Again, we change a little bit of the block math and put a prefix loop in front of it, and the bottom is roughly the same as the kernel that we just looked at. So there are ways to explore. Our first batch of internal users are exploring the performance of different kernels. It's not just write a kernel and everything will be great. We haven't taken away all the challenge of programming for you; we've just removed some of the details. I really want to make clear that there is still space to explore, iterate, and experiment.
Finally, one big thing: as we've been saying, this is part of a platform, and cuTile Python is just one exposure of tile programming. It's the one that we released as part of CUDA 13.1, but there's more. We've been working on bringing tile programming to NVCC, and it will be part of CUDA C++. We're going to announce more information about it at GTC. Also, someone in research is working on an experimental safe tile-based programming model in Rust. Safety here is Rust safety, in the sense of trying to create memory-safe abstractions that make it easier to write correct programs. We expect more languages outside of NVIDIA, third-party adoption, hopefully in other languages that are used in machine learning and in high-performance computing, and we hope that people have all the tools today to do that. Just to give you a taste of what this looks like in these other languages before we're done, here's the CUDA Tile C++ kernel as it looks today. Note that the syntax might vary before we finally release; this is work in progress. You can see the kernel is a very similar abstraction to Python, and the structure is nearly identical.
One difference here is that, to match C++ semantics a little more closely, we take pointers directly. You can see we take float LHS, float RHS, and then the Tile IR abstractions are more directly exposed. You have a tensor span, which is equivalent to a tensor view. Some of this is the C++ people trying to match what the C++ standard library has these days. They build a partition view, and otherwise it looks very similar. You can see the core of the kernel is the same: get the block ID, get an accumulator, do a loop of loads, do an MMA, do a store. We view this as a way for the same programming model to be explored in C++, and this is one of the places where we see the interop being even more valuable, allowing you to intermix code in C++. Finally, let me show you an example in Rust. It's also very similar. One big difference, though, is that the store doesn't require an index. That's because in the current version, Melly, who is the developer of this, has started to experiment with a safe partitioning API where each tile block gets a slice of the overall memory allocation, which gives you a non-aliasing property. This is another cool experiment that we're playing with, and it will be open source soon.
Finally, this is the slide that Mehdi was referring to earlier: we can export CUDA Tile IR. Here is a basic example, the vector add. We can call compile tile IR from cuTile. We can run that function with some dimensions and some options, then copy out the cubin that was produced by it, and then we can open the bytecode and write it right out. So even though these kernels are written in Python or C++ or Rust or whatever language comes in the future, the bytecode itself can be exchanged free of the language. You can deploy it in a different environment than the one you developed it in, and it'll be compatible with a future GPU. With that, I'm going to hand it back to Mehdi for the close. Thanks so much for your time.
>> Last question. Let me just bring up Mehdi's slides in the meantime. One thing I found interesting when you were showing the cuTile C++ is that at least the inner loop looked like your Python code, but the outer stuff looked more like Tile IR. I'm curious: if people want to write their own bootleg compiler passes, have you noticed whether the early users you've worked with were string-splatting Tile IR code, or more often string-splatting cuTile code?
>> Yeah, that's a good question. I actually think it's all about how you want to build things. If you're an MLIR person, like the Modular folks, or you're another hyperscaler or whatever, you're like, great, I love it that I have MLIR. If you're not an MLIR person, then I think using the kernel authoring language as a codegen target is actually quite attractive. As many people know, and this might not be true anymore but it was true at some point, NVCC was the number one way people were generating code for the GPU: generate C++, compile it, and then run it. For example, Warp, which is a DSL for physics simulation from NVIDIA, still just generates CUDA C++ code, and then NVRTC runs and links it.
So I think we expect that it depends. There's an experimental Inductor backend that's actually using cuTile itself to generate the kernels, because that's more compatible with the way Inductor is built today: Inductor generates kernels as strings and then compiles them that way. So I think it's really about what you like. To Mehdi's point earlier, this is a platform change, and we're trying to make it so that anywhere people are doing this, they have an approach that's compatible with what they're doing. What we've seen is that it's really a matter of where you're working more than anything.
>> You mentioned NVRTC. Does this work with NVRTC? I mean, what's the compiler that's being used under the hood?
>> NVRTC is the library version of NVCC, the C++ front end. As NVCC and CUDA C++ get the capability to accept tile code, the snippets that Jared just showed you, NVRTC will take advantage of this as well. This is CUDA Tile C++, with this new tile global way to write a tile kernel, and NVRTC should be able to accept this kind of kernel when we release the C++ support.
>> Yeah. And bringing it to C++ is in some ways an even bigger change, because it really makes tile first class in all the places where C++ has been before. Making it fit with all the workflows and tools is part of what we've been hard at work on.
>> Last question before we continue to the gym. I'll extend this question from chat: have you considered offering a CPU reference implementation of Tile IR? And my own question would be: can I get the bytecode output without needing a GPU, for local experiments, or should I expect the IR to be the same across GPU generations?
>> Okay, so you have a few questions. For the CPU reference, yeah, we prototyped something like this, but making it a product is another step. I don't know if we have anything concrete, but this is definitely something we're interested in.
>> Yeah, one other version of that: I did an experiment, which is not complete, of writing an interpreter as well. It's something we've thought about. If someone is excited about doing that, you have all the tools to experiment. To Mehdi's point, it's something we haven't really talked about; if there's demand, it'd be great to hear from you, and it's something we can bring in to inform the roadmap.
>> Yeah, certainly for Triton there was tons of demand. I think one of their most popular issues was supporting prints.
>> What was your second question again?
>> Do you need a GPU to see what Tile IR you generate with a given piece of code?
>> In theory we don't need one. You can think of cuTile Python as a template metaprogramming system for Tile IR. What Jared showed earlier was interesting: when you see all those tile sizes in the kernel, they are uppercase by convention, but they are ct.Constant. It's like writing a C++ template, and those would be the C++ template arguments. So it's really a way to metaprogram the system. There's no fundamental reason that you need a GPU connected to your system. We hope we don't need one.
>> Yeah, for what it's worth, as part of testing the MLIR bindings I've been helping Melly, who's working on the Rust bindings, get his stuff ready for open source. I built his Rust crate, just the bindings part, on my Mac and it worked, and my Mac definitely doesn't have an NVIDIA GPU. So you can definitely experiment, and I think that's why the CPU part could be interesting as well.
>> Yeah. What can be difficult is that in Python it's very easy to accidentally import a library that, on load, tries to check whether you have a GPU or gets the CUDA version or something like that, and fails. But that's the kind of thing that, if it happens, we should fix.
>> Yeah. Okay. We can go to the gym then. Thank you.
>> All right. So yeah, we're going to wrap up fairly quickly. I wanted to mention the gym, which is a collection of kernel tutorials and examples that we provide for tile-based GPU programming today. Right now it's all cuTile, which is the only front end we've released for programming Tile IR, but as we get support in C++ you can expect to see some examples popping up there too. We also provided an example of an autotuner that combines not only Tile IR knobs but also, at a higher level, the way you use the metaprogramming aspect of Python to explore the space for your kernels.
We are also looking into making this kind of thing more robust and mature. It may land directly inside cuTile, well integrated with the Python language, the front end, and all the metaprogramming aspects of cuTile. So check it out and let us know what you think about it. This is a sample of the autotuner I just mentioned, which is designed as a decorator with the ability to explore various aspects of the grid and this config. I'm not going to get into the details; it's still in flux and evolving, but we are trying to make it as convenient as possible for users. And we're likely going to invest in solutions that can also be used in C++ land or Rust land. Those are paradigms that we're trying to bring into the system.
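The decorator-style autotuner idea can be illustrated with a few lines of plain Python. This is not the autotuner from the gym: `autotune`, its `configs` and `repeats` parameters, and the toy `chunked_sum` workload are all made up for this sketch, which simply times each candidate configuration on the first call and reuses the fastest one afterwards.

```python
import time


def autotune(configs, repeats=3):
    # Hypothetical decorator: try every config once, then remember the fastest.
    def decorate(fn):
        best = {}

        def tuned(*args):
            if "config" not in best:
                timings = []
                for cfg in configs:
                    start = time.perf_counter()
                    for _ in range(repeats):
                        fn(*args, **cfg)
                    timings.append((time.perf_counter() - start, cfg))
                best["config"] = min(timings, key=lambda item: item[0])[1]
            return fn(*args, **best["config"])

        tuned.best = best  # exposed so callers can inspect the chosen config
        return tuned

    return decorate


CONFIGS = [{"chunk": 1}, {"chunk": 64}, {"chunk": 1024}]


@autotune(configs=CONFIGS)
def chunked_sum(values, chunk):
    # Toy workload whose speed depends on the tunable `chunk` parameter.
    return sum(sum(values[i:i + chunk]) for i in range(0, len(values), chunk))


result = chunked_sum(list(range(10_000)))
```

A real autotuner would also key the cached choice on input shapes and the target GPU; this sketch keeps a single choice to stay short.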
Uh in terms of performance, it's always difficult to talk about performance because we can shoo over a very large sample of kernels and cherrypick you
know all the ones where we do great right and this is what we see often with other you know libraries that will pick against Kublas and find the one kernels
or the few configuration where they can do better than the Kublas uristic.
So we didn't really want to get into this kind of thing, and performance is still very much ongoing work. For the first release we focused on the spec and on the quality of what we were
doing. We know that we still have work to do on performance in some cases, and I know that the next CUDA release is going to be a major step in terms of performance.
But just to show you a few examples, and to show you that I didn't want to cherry-pick only where we shine: the first two columns are FP16
GEMM, and with a very small GEMM, 256 by 256, you see that we are below our goals. We would want to never be below
80% of the state-of-the-art GEMM. On
this one, right now, the system is lagging a little bit behind, but we know we're going to fix it quickly. But you can see that for larger GEMMs we are matching
the state of the art. And to find the state of the art, we autotuned across CUTLASS, cuBLAS, and Triton to find the best GEMM we
could. And Tile IR today is there. And
on FP8 you can see similar things: with a very small GEMM we are at 80% of the state of the art in terms of FLOPS, and with a larger GEMM we just match it.
The last two columns are fused multi-head attention, one causal and the other non-causal. There, one of them is at 90% and the other at 85% of the state of the art.
But this is very much a moving target. This is where we started with our initial release, and our plan is to improve very quickly.
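The "percentage of state of the art" figures above can be read as a ratio against the fastest of several baselines. The sketch below shows that arithmetic only. The library names and timings are placeholders chosen for illustration, not numbers from the talk or from any real measurement, and the helper names are hypothetical.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput of an m x n x k GEMM, counting 2*m*n*k floating-point operations."""
    return 2 * m * n * k / seconds / 1e12


def fraction_of_best(candidate_tflops, baseline_tflops):
    """Ratio of the candidate to the fastest baseline (1.0 means it matches)."""
    return candidate_tflops / max(baseline_tflops.values())


# Placeholder timings in seconds for one problem size: NOT measurements from the talk.
m = n = k = 4096
baseline_times = {"library_a": 0.00025, "library_b": 0.00030}
baselines = {name: gemm_tflops(m, n, k, t) for name, t in baseline_times.items()}
candidate = gemm_tflops(m, n, k, 0.00028)

ratio = fraction_of_best(candidate, baselines)
meets_goal = ratio >= 0.80  # the "never below 80%" goal mentioned above
print(f"{ratio:.0%} of best baseline, meets 80% goal: {meets_goal}")
```

Taking the maximum over baselines is what makes the comparison resistant to cherry-picking: the candidate is always judged against whichever baseline is fastest for that shape.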
So this is our roadmap, what's ahead. We
touched on it a little bit all along the presentation. Our initial release was Blackwell-focused. We wanted to get it right. We already have
Ampere and Hopper internally, but they required a little bit more polish. So
they're going to come in the next few CUDA releases. You can expect Ampere, Hopper, and any GPU above those to be
supported. Something else that you may expect from Tile is that we support our whole family of GPUs. Most
libraries and tooling that you find out there may focus on B200 or GB200.
We also have the graphics cards that you get in your PC, right? Those are
variants that can be a little bit different, and we're going to support those too and make sure that Tile works across the entire
product line that NVIDIA offers.
So we also released cuTile. It's
ready to use: pip install and you're ready to go with CUDA 13.1. It has a bunch of integrations, with DLPack, with PyTorch, with all the NumPy-like
support. It's all open source. The front
end is entirely open source. And we're going to expand this to CUDA C++ very soon. We're going to have an
experimental open-source Rust thing as a research project, and we hope the community will get interested in this and take it further.
We're going to bring a lot more tooling integrations. We are actively
working on the profiler integration and the debugger integration, and a bunch of tooling is going to come out in the next few CUDA releases.
So yeah, a lot of exciting things ahead.
And I just wanted to acknowledge that there have been probably over 100 engineers working on this project, all across the stack: driver, loader,
tooling, compilers. So this has been a
huge effort, and Jared and I are privileged to showcase this to you today. But we're only the
messengers here, and kudos to all the NVIDIA engineers who are behind this large evolution.
It's not often that CUDA gets this kind of refresh. And those are the next steps for you. It's all live, you can use it. We
also released, just today or yesterday, this MLIR-based tooling that includes a bytecode assembler and disassembler. So all the examples you saw from Jared that
show the Tile IR, you can play with them in this GitHub project. On the README page you also have
the instructions for how to take an example of Tile IR assembly, turn it into bytecode, and compile it with tileiras,
which is the equivalent of ptxas. So
tileiras will turn it into a cubin, and then you can launch it with a C++ example. So with this MLIR example, you
don't have any dependency on Python. If you want to play with something, if you want to play with bindings for a language other than
Python, hopefully the tooling will be there to help you get started. And
again, this is only the initial release.
We're looking for feedback. We're
looking for your bugs. So for those two GitHub projects, feel free to use GitHub issues to report issues that you encounter. For the general
CUDA platform, the usual channels of the developer hub and the usual CUDA channels are relevant. And yeah, that's it. Thank you for
attending. We're very excited about CUDA and we hope that you're sharing this excitement with us now.
>> Sweet. Thank you so much, folks. Thank
you, Mehdi. Thank you, Jared. I guess before I get started with my last barrage of questions: I noticed, Jared, you especially were answering a lot of questions in chat. Do you think there are any that are worth
escalating and reiterating on?
>> Yeah, I think if people want to bring any of them up. I think one of them, the question about tuning: again, the way we view the hints in the tuning is that that's the performance part.
Back to what Mehdi was saying, our goal with the portability is that we will be functional. I think the user story, if you will, of that set of features is that you can
just download a repo and get up and running, and then you can look at it and be like, "Oh, it's not quite as fast."
You can start to fiddle with it and you can boost that performance up. But that
first user journey is not always there today. And so to me, that's really the goal of the portability, that part, or what's likely to happen: you download a library that contains a kernel that someone else
wrote, and then it works on the machine that you're on, right?
Pokemon is asking: are you planning on doing more autotuning with Helion?
Yeah. So we've also been talking to the Meta folks, and I think it kind of got disrupted by the holidays. But
there was an experiment where we're going through Helion, through the Inductor backend that we have experimented with, to produce Tile kernels and benchmark against them to understand how good or how bad the performance is.
We actually identified some performance issues and also some benchmarking issues while doing that.
For example, there's a bug at low tile sizes in the Triton benchmarking utility, which is what Helion or someone is using. So we're
experimenting with that. I think part of it is that the search space is
slightly different. That is something that we probably need to do some more work on: the search space of Helion for Triton configurations today is shaped slightly differently than the one
for cuTile. So we need to be thoughtful about that, because today you can't just apply the configuration fully apples to apples and get the same results. Also, the
systems are different, so sometimes knobs have a bigger effect. So there's
homework to do there. But we do kind of see that. There are also some ideas people have internally about doing PGO-style work, profile-guided optimization. So I
think it's something we're excited about. Again: ideas, examples. And if people want to hack on it, we encourage people to build on it and experiment.
And we're happy to engage there.
and you know we're happy to engage there >> but but NCU is still going to like just work out of the box right just because people will need that signal. Yeah, you
can use uh also if anyone hasn't tried it yet, there is a new uh Insight Python uh benchmarking uh toolkit that we released which is like a cool new product feature update which is like stop a decorator on a chunk of Python
code and get a lot of the profiling tools running. Um I think it's a really
tools running. Um I think it's a really cool new like user experience update if you haven't tried it. Um and all that stuff works as well. Today there's like more limited support like we're we are have as part of the road map as well as
Betty touched on to bring so you can kind of do some basic debugging today like you can use a debugger but we're going to bring like tile value inspection and tile stepping uh and other you know tile native experiences.
Um it turns out that debugging is almost harder than building an actual compiler in the first place. So that's something that we are working on you know very hard and it's going to come out next year sometime.
>> Got it. All right. I guess my last question before we close up the
stream: what kinds of community
projects would you be really excited to see? Obviously there are a lot of kernel DSLs. So what do you think would be the kinds of
projects that would showcase the unique strengths of cuTile and that you'd like to see people explore?
>> You want to go first, Mehdi?
>> Sure. I think on my side I'm pretty excited about megakernel explorations and the interaction between autotuning and megakernels, the kind of fusion
we need to do. So I think this is an area where the community of programmers can take over
some exploration and provide some solutions. Yeah, that would be
one area.
>> Yeah, I think one similar to Mehdi's: building systems on top is interesting to me, because once you free yourself from some of these low-level details,
you can build more. Inductor is a great example of this. Before torch.compile happened, people had made N attempts to make this work, and part of what I think has made these things
more successful is that they've had layered abstractions that they can now build on to generate code, so they can focus on one problem to solve. And I think there's some interesting stuff there, whether
it's building, you know, you could build a really easy teaching machine learning framework now that does codegen, you could experiment with fusion, you can do megakernels. Because,
for example, doing fusion today is roughly: as long as there's no aliasing, if you see a load and then a store you could just eliminate the two of them, and you could write an MLIR pass for this really easily. So I think those are the
kinds of things I'm interested in. I also
think just tooling in general: so much of the tooling today is still very low level for developing this stuff and understanding it. And
I think that this provides an opportunity to build new tools and experiments. And then maybe the
third one is, in the agentic world, allowing people to do automatic exploration of ideas, because some of the algorithmic ideas at the
kernel level are actually very simple, but they're very hard to explore. Today, if you're a Tri Dao-level kernel hacker, then you can produce these magical things. Other than
that, I think it's really challenging, and I've met a bunch of people who come from trading or finance backgrounds or research or HPC who have a ton of interesting, cool ideas that they want to work on, but they actually just struggle
to be able to write the programs to do it. Many people outside of deep
learning are not using tensor cores at all, maybe now DLSS or neural rendering, but other than that there's actually a huge impact to make for those people. So I think we
didn't actually talk about it much here, but that's something that we're really excited about, and we want to make this HPC-friendly: bringing FP64
support or stencils. These are also things that are on the roadmap that we didn't talk about as much.
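The fusion remark in the answer above (eliminating a paired load and store when nothing aliases) can be sketched on a toy instruction list. This is not Tile IR and not an MLIR pass: the tuple format, the op names, and `forward_stores` are all hypothetical. The sketch assumes the pair in question is an intermediate buffer that one stage stores and the next stage loads, that such buffers are written before they are read, and that nothing else aliases them.

```python
def forward_stores(ops, temps):
    """Drop store/load round trips through temporary buffers.

    Each op is a tuple (kind, dst, *srcs). For "store", dst is a buffer and
    srcs[0] is the value written; for "load", dst is a value and srcs[0] is
    the buffer read. `temps` names buffers that never leave the fused kernel.
    """
    stored = {}   # temp buffer -> value most recently stored into it
    renamed = {}  # result of an eliminated load -> the value it stands for
    fused = []
    for kind, dst, *srcs in ops:
        srcs = [renamed.get(s, s) for s in srcs]
        if kind == "store" and dst in temps:
            stored[dst] = srcs[0]
            continue  # the intermediate never needs to reach memory
        if kind == "load" and srcs[0] in stored:
            renamed[dst] = stored[srcs[0]]
            continue  # reuse the value that was about to be reloaded
        fused.append((kind, dst, *srcs))
    return fused


# Two stages back to back: tmp = A + B, then C = relu(tmp).
ops = [
    ("load", "a", "A"),
    ("load", "b", "B"),
    ("add", "t", "a", "b"),
    ("store", "tmp", "t"),
    ("load", "u", "tmp"),
    ("relu", "v", "u"),
    ("store", "C", "v"),
]
fused = forward_stores(ops, temps={"tmp"})
for op in fused:
    print(op)
```

After the pass, `relu` consumes `t` directly and the round trip through `tmp` is gone. The no-aliasing condition is what makes this safe: if another buffer could overlap `tmp`, the dropped store might have been observable.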
>> Yeah, I would reemphasize this aspect: there are always going to be experts. Tri Dao is going to bring the new FlashAttention, which is heavily tuned with CuTe DSL, which is
this great low-level programming
language. But that's not for everyone,
right? One of the goals of NVIDIA is to make GPU computing ubiquitous, and for this we want to make it more accessible, and that also means
going beyond these specific kernels that are very important. I mean, I don't want to downplay the importance of LLMs today, but we also want people to go beyond
this and use our GPUs in all possible scenarios where they would be
applicable. That's what CUDA did in the first place when it was introduced, enabling
GPGPU. I remember before CUDA we had to
hack OpenGL to do GPGPU; then CUDA suddenly made it possible. I started
CUDA in 2007 and that was magical to me. I was still a student. But
yeah, that's what I hope we're going to achieve with Tile IR: making GPUs more
accessible to everyone and used even more widely than they are today.
>> Yeah, just to
add one last sentence: I think someone asked in the first stream, when we first started at the beginning, about cannibalizing users or
customers. Going back to that, I think philosophically our goal is this: accelerated computing is important, it's still very hard for people to use, and we want to
make it easier for everyone to be able to write accelerated programs. And I think that to me is a big part of this goal: just increasing the accessibility of these things. Someone
even asked why we're doing it in Rust. Outside of really motivated people, or people who are getting paid quite a lot of money to do these things, many
programmers like to build in the environment that they're comfortable in.
And so if you can reach out from your database system that you wrote in Rust and you can now run a bit of GPU code next to it, I think we'll see cool new systems come out. And
I think we're on the precipice of that as an industry, where we now have single-node machines that are supercomputer-powered, and people spent a whole generation building distributed
systems to get enough compute power to do what they want. But I think there's another wave of systems where you could build your,
you know, ETL engine that is LLM-powered, where it's all running locally with high throughput and low latency. But we haven't seen those systems yet because it's actually kind of hard to build. And so
that's the stuff that I think is most exciting. Maybe just to close: by
making it accessible, people's creativity will be able to shine and we'll get new things out of it.
>> So just because you mentioned the single-node supercomputer stuff, do you have any good examples of comms kernels, like doing tensor parallelism across a node?
>> So you baited yourself.
>> No, no, it's all good. It's the number one question we're getting. So
that's also, I think maybe it wasn't on the roadmap slide, but we talked to a bunch of people about it at NeurIPS. That's also one of the things that we're cooking: bringing both native communication primitives and then also,
as part of the interop, giving you the ability to drop down and write. Oh, that
was a good example for when you were asking why we care about interop: maybe you want to change the comms primitive, and we want people to have control, because you can see there are some solutions that are
black boxes, in the sense that we give you four communication primitives and if they don't work for you then you're kind of out of luck. We want you to have control there, and we think
multi-node is kind of the next frontier, both inside of Tile IR and also, as you can see, across NVIDIA, right? The
scale-up and scale-out worlds are becoming more and more important.
>> Right. Well, I think on that note, folks, thank you so much. This was our last talk of the year. So I'm
really grateful you both made the time before the holidays. Thanks
again. You're both welcome again.
Please, when you do add NVFP4 support and when you do add comms stuff, please come back. We'd love to have you. And
thanks again, folks.
>> Yeah. Thank you so much. Yeah.