Pablo Galindo Salgado - The soul of the beast
By EuroPython Conference
Summary
## Key takeaways
- **Python's Grammar is LL(1): A Double-Edged Sword**: Python's grammar is classified as LL(1), meaning parsing decisions are made by looking at only one token ahead. This simplifies parser implementation and execution speed but introduces complexities for certain language features. [06:11]
- **The 'Soul of the Beast': Python's Official Grammar**: The Python grammar, defined in a specific file, uses a meta-grammar similar to regular expressions to describe the language's structure. This formal definition dictates what constitutes valid Python code. [02:31]
- **LL(1) Limitations: Ambiguity and 'Hacks'**: The LL(1) nature of Python's grammar leads to ambiguities, forcing workarounds like special handling for keyword arguments and making features like parenthesized 'with' statements difficult or impossible to implement directly. [19:39]
- **Adding New Features: A Deep Dive into CPython's Pipeline**: Implementing a new operator like '->' in CPython involves modifying the grammar, tokenizer, Abstract Syntax Tree generator, compiler, and bytecode, demonstrating the interconnectedness of the language's components. [35:46]
- **Concrete Syntax Trees: More Than Just an Intermediate Step**: The Concrete Syntax Tree (CST) generated by the parser retains information about comments and formatting, making it valuable for tools like code formatters that need to reconstruct the original source code accurately. [31:30]
Topics Covered
- Python's Grammar Is Wider Than You Think.
- LL(1) Grammar: Python's Hidden Design Trade-off.
- Parser 'Cheats' to Handle Python's Grammar Ambiguities.
- Why Parenthesized 'With' Statements Cannot Exist.
- CPython's Parser Shapes Python's Core Language Design.
Full Transcript
So, thank you very much for attending. In this talk the idea is that we're going to have a deep dive into what makes Python Python, right? Because there are several pieces that people normally identify as Python. We have the interpreter, but as you know there are multiple versions of it: we have CPython, PyPy, Stackless Python, IronPython. So it seems like the interpreter is usually just implementing something else. Usually languages have what is called a spec: for example, C++ has the C++ standard, which is the kind of book you could kill someone by hitting them with it. But Python is a very interesting case, because CPython is not only one of the implementations of the interpreter, it's also the reference. So the idea here is: okay, what is Python really? When people think about Python — Python the language, the text that you write which is then executed by something — what is this?

But let me do some introductions first. I'm Pablo, I'm a Python core developer, and I work at Bloomberg in the Python infrastructure team. Before that I used to do black hole physics, in case someone is interested in talking about that; I'm always very happy to. I'm one of the authors and the implementer of PEP 570, and I spend my free time fixing bugs on the CPython issue tracker.
Okay, so let's start with a cool question: who thinks this is valid Python code? Raise your hand if you think it's valid... raise your hand if you think it's invalid... this doesn't add up to all the people. So, this is actually valid Python code. If you look at it: this x is a type annotation; property is just property; this thing that looks like a C++ template is actually saying property is less than T, greater than x; that's an unpacking; that's an ellipsis; and that's object. And on the screen, this is actually the AST of that expression. Not only that — it's not just that the parser parses this thing — if you type this in and then call spam, it actually returns, because annotations are not executed when you call the function. So you can call it with no problem; you can try it yourself.
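As a small illustration of that point (this is my own reconstruction, not the exact snippet from the slides; the names `T` and `spam` are just placeholders), the grammar accepts essentially arbitrary expressions in annotation position, because an annotation is stored rather than evaluated when the function is called:

```python
import ast

# An annotation can be any expression: here a chained comparison
# mixing the `property` builtin, a bare name, and an Ellipsis.
src = "def spam(x: property < T > ..., *, y: object = None): pass"
tree = ast.parse(src)

# The parser happily builds an AST; nothing here is evaluated.
print(type(tree.body[0]).__name__)  # FunctionDef
```

`ast.parse` only runs the parser, so even names that don't exist anywhere are perfectly fine in the annotation.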
So, you know, what people normally think of as Python is really a small subset of what Python is, or at least of what is described in the official Python grammar. The idea is that we are going to explore exactly how big the difference is between what people think Python normally is and its real scope, and for that we need to introduce some special topics regarding grammars. So let's start there. The Python grammar describing Python source code lives in a file called Grammar, in the Grammar folder, but first we need to explain how this file is written. In this file you will find rules. A rule is basically a name, a colon, and then a rule description; we are going to see what that is. You can have many alternatives for a rule: the same rule name with two different descriptions. These descriptions are usually called productions, because historically the goal of these formalisms was to generate programs, not to recognize them. Usually, when a rule has two alternatives, people write it like this: the rule name has this alternative *or* this alternative, but it's basically the same thing. Then you will find constructions like this one, which mean "one letter a, or more". It's very similar to regular expressions, as you can imagine: the asterisk means zero or more, and square brackets mean "this chunk here is optional". So this is the complete meta-grammar: the meta-grammar is the grammar that talks about the grammar itself. The top rule is basically the one you start from, and it recurses. This is just for reading the grammar file.
Using this notation we can look at the actual Python grammar, which looks like this. For example, here is file_input, which is one of the start rules. It says: a file written in Python is a NEWLINE, or zero or more statements, and an ENDMARKER. Then you ask: okay, what is a statement? A statement is a simple statement or a compound statement. And what is a simple statement? And so you keep going down, and soon enough you start seeing things you can actually recognize: for example compound_stmt, the bottom one, where you will find if_stmt, while_stmt, for_stmt, and you can probably imagine what those are. Okay, so let's keep looking at this file. For example, take if_stmt: you will find that an if statement is the string 'if' and then the rest. The difference between the strings in yellow and the ones in white — we will see what this means — is that one of them cannot be expanded anymore: the string 'if' is literally the string 'if', not another rule. And you can more or less recognize the structure here: an if is 'if', then something that is probably a condition, then a colon, then a suite, which is basically a block, then an elif with the same structure, and so on. It's very simple, but one of the things we need to understand is how the parser can actually recognize this and decide whether a program is valid or not.
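The shape of the if_stmt rule can be checked against the tree Python actually builds. A minimal sketch (the rule in the comment is my paraphrase of the pre-3.9 grammar file):

```python
import ast

# if_stmt, roughly:
#   if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
src = """
if x > 0:
    y = 1
elif x < 0:
    y = -1
else:
    y = 0
"""
node = ast.parse(src).body[0]
print(type(node).__name__)            # If
# The elif chain is stored as a nested If inside `orelse`.
print(type(node.orelse[0]).__name__)  # If
```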
Okay, more definitions. As I said, the strings in yellow are called terminals, because when you are expanding a rule they cannot be expanded any further: they are just strings. The ones in white are called non-terminals, and they refer to another rule in the file. This is going to be very important in the following slides. Okay, so let's talk about LL(1) grammars. It turns out Python has a sort of weird feature, which is that its grammar is in a category called LL(1). Let's see what this is. There are very few languages with this property, and you will see how it actually percolates through the language. One thing we are going to understand is how something that seems purely technical — the fact that the grammar is LL(1) looks like an unimportant implementation detail — actually has a very deep impact on the language. So what is an LL(1) grammar? Technically the definition is a bit boring: an LL(1) grammar is a grammar that can be parsed by an LL(1) parser. It's like in university, when someone says "an LL(1) grammar can be parsed by an LL(1) parser" and you say: thanks, now I understand everything. So let's see what an LL(1) parser is. The idea is that you parse the tokens you receive Left to right — that's the first L. You do Leftmost derivation — you start expanding the rules from the left — that's the second L. And then, and this is the key point, the (1) means that when you are parsing you can only look at the next token. Which means, for example, that if you are parsing an if statement you can only look at the next token in the stream: you see the 'if', and then, when you need to decide what rule follows, you can only look at the next token, no more. And this is going to be super, super important.
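To make the "one token of lookahead" idea concrete, here is a toy LL(1) recursive-descent parser for a made-up three-production grammar (entirely my own illustration, nothing like CPython's actual machinery): every parsing decision inspects exactly one upcoming token.

```python
# Toy grammar:  stmt: 'break' | 'continue' | 'if' NAME ':' stmt
def parse_stmt(tokens, i=0):
    tok = tokens[i]                       # the single lookahead token
    if tok == "break":
        return ("break",), i + 1
    if tok == "continue":
        return ("continue",), i + 1
    if tok == "if":                       # 'if' NAME ':' stmt
        name = tokens[i + 1]
        if tokens[i + 2] != ":":
            raise SyntaxError("expected ':'")
        body, j = parse_stmt(tokens, i + 3)
        return ("if", name, body), j
    raise SyntaxError(f"unexpected token {tok!r}")

tree, _ = parse_stmt(["if", "x", ":", "break"])
print(tree)  # ('if', 'x', ('break',))
```

Note that each alternative is chosen purely from `tokens[i]`; the parser never peeks further ahead, which is exactly the LL(1) constraint.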
It turns out that Python has two hidden extra rules. In an old checkout of the source code there was a comment explaining this, but it has been lost, probably even before the port to Mercurial. The extra rules are these two. First, Python cannot have empty productions; these are usually called epsilon productions. An empty production is basically a rule that admits, as one of its possibilities, the empty string: the empty string satisfies the rule. It turns out that this makes everything much more complicated, and Python doesn't have them. If you know more or less how this works: when you build an LL(1) parser, you need to create these things called the follow sets and the first sets, but if you don't have epsilon productions, you don't need the follow sets. We are going to explain what these are, so don't worry. And there is another rule, which is not super important, but we will see how it manifests: at some levels of the grammar, only the last alternative can start with a non-terminal. Let's see one example.
Here is more of the grammar, and I want you to focus on this flow_stmt here. You will see that flow_stmt is a break_stmt, a continue_stmt, a return_stmt, a raise_stmt, or a yield_stmt. The break rule starts with a terminal, the word 'break'; continue_stmt starts with a terminal, which is 'continue'; return_stmt the same; but only yield_stmt, the one at the end, starts with something that expands again. This is what I mean: only the last option can be expanded. This simplifies the parser a lot; it isn't required for LL(1), it's just a simplification at some levels of the grammar, but it makes it very fast.
Okay, so let's see what other pieces we need. The first sets. Because of this constraint — when you are parsing, you can only look at the next token to decide which rule to use — if you think about it, it is very important to know, given a rule (imagine if_stmt), which tokens — which strings, which terminals — that rule can start with. This is important because, since you can only look at one token, you can only use that one, and if you see that the rule you are trying to parse doesn't include this token in its first set, you know the input is invalid: you can only look at one token, and this rule doesn't start with that token. So the first set of a rule is basically all the tokens the rule can start with. We see, for example, that a simple statement can start with a lot of tokens, like a string, an asterisk, a lambda; we see that print_stmt can only start with 'print', which makes sense; return_stmt can only start with 'return'; but other ones can start with a lot of things. And this is going to be key in a few slides.
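First sets can be computed mechanically. A sketch for a toy grammar (my own simplification: uppercase names are non-terminals, everything else is a terminal; it relies on the no-epsilon-productions property mentioned above, so only the first symbol of each production matters):

```python
# Toy grammar in the spirit of flow_stmt above (not Python's real rules).
GRAMMAR = {
    "FLOW": [["break"], ["continue"], ["RETURN"]],
    "RETURN": [["return", "EXPR"]],
    "EXPR": [["name"], ["number"]],
}

def first(symbol):
    if symbol not in GRAMMAR:        # a terminal's first set is itself
        return {symbol}
    result = set()
    for production in GRAMMAR[symbol]:
        # No epsilon productions, so only the leading symbol matters.
        result |= first(production[0])
    return result

print(sorted(first("FLOW")))  # ['break', 'continue', 'return']
```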
Okay, so let's see how CPython actually works. It turns out that the parser of CPython is not written by hand: what we have is a parser generator. We have a generator that automatically receives the grammar and generates a parser for you, and this makes everything much, much easier. Let's see how it works; it's actually relatively simple. The global idea is that from the grammar — this text file with the rules — we are going to construct something called a non-deterministic finite automaton, which sounds very scary but is actually pretty simple, and then we are going to simplify those into what is called a deterministic finite automaton.
So let's see what this is. A finite automaton is a finite set of states; it's basically a flowchart, right? You follow arrows and you end up in different states. You have a set of states, which are the places where you can be; you have transition functions, which tell you whether you can move from one state to another; and then obviously you have a start state, and you can have one or more final states. Let's see an example. It looks like this: you start at A — let's say that is the start — and then you have transition functions, this lowercase b and this lowercase c. These are basically words in the Python programs that you write, so lowercase b and lowercase c would be things like 'if' or 'while', and depending on whether you have a b or a c, you move to state B or state C. Pretty easy. And then you can have a final state. This kind of automaton recognizes a language: if I end in the final state, the program is valid; if I am, say, in C and the next token is, I don't know, an 'h', the program is invalid, because I can only move to the final state with a b. That's basically what the parser will do: if the parser can reach the final state, your program is good and it will continue; if the parser sees that it is in D and the required token isn't there, it raises a SyntaxError. It's as easy as that. But it turns out that we have two kinds of automata: the non-deterministic ones and the deterministic ones.
So what is this non-determinism? It turns out there are two kinds. Here is one of them: imagine that you are in one of these states, one of these uppercase As, and it turns out you have two transitions with the same letter. This is one source of non-determinism, because which one do you choose? You could actually choose both; you don't know. The other kind of non-determinism is this scenario: this epsilon, which is the empty string again, is a transition that you can take even if you don't have a token. In this case, if I am in A, even if I don't have any token — or even if I have one, I can choose not to consume it — I can move to C for free. That's another source of non-determinism, because the state can be A, or actually C if you want; you don't know. Cool. So these things are pretty easy to construct, and they're pretty easy to reason about, but when you try to run them there is a problem: it's complicated to implement a runner for them, because they may not terminate, or you can end up in cycles. So the idea is to simplify them into something that is easier to parse. For example, here is one: you have this tiny automaton with some epsilon transitions; if you look at this C, you can transition to C itself via a, but if you choose a you can also go to B. This is a simple one, because it only has three nodes, but it has a lot of non-determinism. And the way we construct a deterministic one from it is actually pretty simple.
If you look at the two possibilities of this non-deterministic one, you can say: okay, if I can transition to both nodes, technically it's as if this automaton is in C and B at the same time. Imagine that you are in A: you can picture a parallel universe in which you go to C, and a parallel universe in which you go to B, and consider the whole multiverse if you want. So it's really as if you had a new node called BC — a set of states — and now you have only one transition, which is a, into BC. For epsilon transitions: because epsilon is free, it's as if this thing is in A and C at the same time, so it's as if you had a node AC, and then whatever transition there was to B, you have one transition to B consuming the token. So applying these two rules is very simple: when you have one of these patterns you collapse the nodes into one, and when you have the other you collapse the second one, and the idea is to apply this iteratively. The price you pay, once you do that for this one, is that you end up with this one, which doesn't have non-determinism but is much more complicated. That's the price: it has more nodes — you sometimes even get empty nodes — and it has the power set of states: all the power-set combinations, so it can have up to 2 to the power of n new nodes. This basically summarizes the difference: a non-deterministic automaton can have epsilon transitions, a deterministic one cannot, so the deterministic one is easier to run; the non-deterministic one is simpler, as we saw, and has fewer nodes, while the deterministic one has more nodes. One technical point is that with a non-deterministic one it is very difficult to implement backtracking — and you sometimes want that for your parsers — while with a deterministic one it is trivial. And the key point here is that the non-deterministic one is super simple to construct from the grammar: you basically just read the rules as you parse them, while for the deterministic one you need the process we saw previously.
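The process just described — merge the targets of a non-deterministic choice into one set-node, and pull in anything reachable for free through epsilon — is the classic subset construction. A compact sketch, using my own made-up encoding of automata as nested dicts (this is not CPython's pgen code):

```python
from collections import deque

# NFA encoding: {state: {symbol: {target states}}}; "" marks epsilon.
def eps_closure(nfa, states):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get("", set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def to_dfa(nfa, start):
    start_set = eps_closure(nfa, {start})
    dfa, queue = {}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        symbols = {sym for s in current for sym in nfa.get(s, {}) if sym}
        for sym in symbols:
            targets = set()
            for s in current:
                targets |= nfa.get(s, {}).get(sym, set())
            nxt = eps_closure(nfa, targets)   # set-of-states becomes one node
            dfa[current][sym] = nxt
            queue.append(nxt)
    return start_set, dfa

# From state 0, 'a' goes to 1 and 2 (non-deterministic),
# and 1 has a free epsilon jump to 3.
nfa = {0: {"a": {1, 2}}, 1: {"": {3}}}
start, dfa = to_dfa(nfa, 0)
print(sorted(dfa[start]["a"]))  # [1, 2, 3]: one merged DFA node
```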
Okay, lots of theory; let's see how it works. Here we can actually see these structures for the actual Python grammar. This is one of the rules that you have in Python: the rule for factor, which includes things like unary plus, minus and inversion, and then refers to more rules. This is the non-deterministic automaton. As you can see, starting from the top, at state 0, some of these arrows don't have any token on them: for example, from state 0 you go to state 8 if you find a minus sign, but from state 0 to state 1 there is an epsilon transition. So if you are in state 0, you don't have any clue whether you have to go right or left. If you apply the process I showed you, you end up with this other one, and now you have interesting things: you don't have epsilon transitions anymore — in state 0 you transition to state 1 if you find one of these three tokens — so you can always decide which arrow to choose. But the process leaves some dead nodes behind: for example, this state 5 is there but is not reachable. So there is an extra step the parser generator does: it takes one of these automata, which is already deterministic, and simplifies it, basically removing nodes that are not used and maybe collapsing some of them. For example, these states 2 and 3 on top of factor are also eliminated, because you cannot reach them. But this is the process: you start with one of these, you make it deterministic, and then you simplify.
Let's see some other examples we have here. For instance, this one is for comparisons, when comparing two things: this is the non-deterministic one, this is the deterministic one, and as you can see the final one is actually very simple — but then this is also a very simple rule. The idea is not that you read and understand all of them; I just want to give you a feel for how the parser simplifies these structures, and how you can just look at the drawing. It's not a formal argument, but you can see more or less how it works: state 0 will always transition depending on the next token. You can actually read off these graphs how the grammar works, even if you never read the grammar file. For example, this is for the decorated rule, and you can see in the final product that you can only decorate three things in the Python grammar: a function definition, a class definition, or an async function definition. If you want to make your own Python language and decorate something else, you just put something here and it will just work; we will see an example like that later. This one is for multiplication and division and everything with the same precedence as multiplication; this is for the print statement; and this is for the if condition. So you can see how they simplify.
I want to show you one of them, which is my favourite. This one is the most complicated one in the grammar, and it is for parsing function definitions — the arguments in a function definition, the positional arguments, the keyword arguments. It looks like this. But don't worry, because this is not the non-deterministic one: this is the deterministic one, after all that complicated process, much simplified. Whoops — this one actually includes everything. Okay, so it looks very complicated, but if you look closely you can see things you will recognize: commas, this is for the new type comments, you recognize the two stars; and you have these attractor-like nodes with a lot of arrows going into them. The reason it looks so complicated is not that the rule itself is that complicated — it really is not; it's actually a bunch of rules — but something we are going to see now. The reason it looks like this is that the fact that the grammar is LL(1) makes the parser super fast — very, very fast — and super simple to implement, at least the parser generator; but writing things that you might think are very simple turns out to be extremely complicated. The fact that we can write this rule at all is technically cheating, because this rule is written in a form that is called completely left-factored: if you write the rules the natural way, you run into something we will see afterwards, which is an ambiguity. You cannot write it that way, so you need to literally expand the rules, going almost token by token, which is something you usually don't do in recursive descent parsers. We'll see a lot of examples of this, and it makes some other things much more complicated. You will see something that a lot of people have requested for the Python grammar, even for 3.9 or 3.8 — for example, if you use Black, this is a feature that Black needs for some kinds of formatting — and it turns out to be impossible to implement with the current parser. We'll see which one.
Okay, so let's talk about this: let's talk about LL(1) limitations. It turns out that LL(1) is sort of good: you get a very fast parser and it's very easy to implement, sort of, and a lot of people argue that the grammar being LL(1) makes the grammar simpler to read. Humans are also parsers, and when you are reading the grammar you don't need context — it's a property of LL(1) grammars — so you don't need to remember a bunch of other things you have seen before. In C or C++, for instance, when you read a name you cannot tell whether it's a variable or a type without knowing what you have read before; here you just need the next token. So sure, it makes the grammar sort of simple, and that's what people recognize as Python: Python is simple to read and simple to reason about, and probably that's very tied to the grammar being LL(1). But now comes the downside: it is very difficult to implement some things. So let's see the limitations we have with this. Imagine I have this rule, with two possible productions. The rule starts with the word 'do' — like a do-while loop — then it goes into a rule called A, and we don't know what A is doing, then the word 'while', and then another expression. And we have another production that has the same terminals, but instead of A it has B; what the two actually do doesn't matter. For A and B we have the first sets here: rule A can start with a lowercase a or b; rule B can start with a lowercase a or c.
Now imagine we have this program: someone writes "do b while c" or something. We need to know which of the two productions this corresponds to, and it's very simple: you go, okay, 'do' — both productions start with 'do', that's nice. Then: is it A or B? We see a lowercase b, and if we go to the first sets, we see that lowercase b is only in the first set of uppercase A, so we know it's the left one. Makes sense: this lowercase b only appears in the first set of the first production, so we know it's the left production. That's nice: this program is valid, and it's the left production. But what happens if someone writes this? It turns out that lowercase a is in both first sets, so you actually don't know which of the two productions to use. This is the ambiguity. This thing, which seems very stupid, is actually the problem: you don't know which one to choose, and you actually need to reject this rule, or the parser dies. Most of the time the parser generator will tell you it cannot produce the parser, but sometimes it actually produces it anyway, which is really serious. So let's see some cases where this actually happens.
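The do/while conflict above is exactly what a parser generator detects by intersecting first sets. A toy sketch (my own code, not CPython's pgen; for brevity it compares only the two alternatives of the one rule, skipping the shared leading 'do'):

```python
# Toy grammar from the do/while example: uppercase = non-terminal.
RULES = {
    "STMT": [["do", "A", "while", "EXPR"], ["do", "B", "while", "EXPR"]],
    "A": [["a"], ["b"]],
    "B": [["a"], ["c"]],
    "EXPR": [["c"]],
}

def first_of(symbol):
    if symbol not in RULES:          # terminal: its first set is itself
        return {symbol}
    out = set()
    for prod in RULES[symbol]:
        out |= first_of(prod[0])     # no epsilon productions
    return out

def conflict(rule):
    alts = RULES[rule]
    # Both alternatives start with the terminal 'do'; the decision point
    # is the second symbol, so intersect the first sets there.
    return first_of(alts[0][1]) & first_of(alts[1][1])

print(sorted(conflict("STMT")))  # ['a']: seeing 'a' after 'do' is ambiguous
```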
For example, this is what happens when you call functions. This is the rule for calling functions: you see that when you are writing an argument, the argument can be a test — and a test is more or less anything that evaluates to an object — maybe with a comprehension; or you can write a keyword argument, which is some name, some variable name, like verbose=True or something, then a test, something that evaluates to an object; and you can do unpacking with dictionaries and unpacking with lists. Well, it turns out that this is actually not the rule; this is false, because this version is invalid. Imagine that you are parsing it: you are here, and you say, okay, I can only look at the next token, let me see the first set of test. The first set of test includes NAME, but the second alternative also starts with NAME. So, seeing a NAME, which of the two do you choose? You cannot distinguish them. This dooms the rule, because you'd say: okay, this is ambiguous, but I *can* write function calls, I can call them — so what is the actual rule? The actual rule is this one, which is very weird, because instead of saying you can put a name, like verbose=True or something like that, it says you can literally put arbitrary Python expressions there. This rule allows things like a list comprehension equals a list comprehension, or "yield e" equals something. But you cannot actually write that, so how does it work? It turns out that the grammar allows it: the parser will parse it perfectly happily. But much later, when we are constructing the abstract syntax tree, we check whether that test is actually just a NAME, and if it's something else we reject it. We are sort of cheating here, because if someone — I don't know, say you send the grammar file to outer space and aliens find it and implement their own Python — they will allow some weird things and wonder what this is. And here you see how Python is coupled with the parser, with the language itself: this is supposed to be the spec, but if you implement exactly this spec, things go wrong. So this is not fully Python, and actually the spec of CPython is essentially a piece of CPython itself. PyPy will do it a different way, and other implementations can do whatever they want — we solve this kind of limitation much later in the pipeline.
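You can watch this two-stage rejection from any Python: the grammar accepts an arbitrary expression left of '=' in a call, and the error only surfaces when the AST is built, still reported as a SyntaxError. (The exact message wording varies between Python versions, so this sketch only checks that the error is raised.)

```python
# A plain keyword argument compiles fine.
compile("f(verbose=True)", "<demo>", "eval")

# An arbitrary expression on the left of '=' gets through the grammar
# but is rejected later, during AST construction.
try:
    compile("f(x + 1 = 2)", "<demo>", "eval")
except SyntaxError as exc:
    print("rejected:", exc.msg)
```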
And it is actually everywhere. For example, this is your favourite operator, and it turns out it has the same problem. This is what is called namedexpr_test: it can be a test — an object — or a NAME, then the walrus operator, then a test. And you have the same thing: this optional part is an optional chunk, so if you have a token, let's say some NAME, and you want to decide whether you are actually in the optional part or in the test, it's the same problem, and as we know from before, test includes NAME in its first set. So this is again an ambiguity: if I give you a NAME, you don't know whether you are on the optional path and are going to use the walrus operator, or whether you are going into the right branch and need to use test. So the rule, again, is test ':=' test, so again you can put a list comprehension, walrus, a list comprehension, and someone will complain much later.
about this so this is actually the thing
that makes us very sad so in Python 3
you can actually have in context
managers you can't have multiple ones in
the same in the same with expression
right this is something that Python to
do have because if because the price you
have six of them you need to indent like
one one every time in Python 2 but
imagine that you can write them but if
you have so many of them doesn't fit on
the same line how do you how do you do
that right so suddenly you need to write
that write that breaking continuation
line which is very weird because
foremothers doesn't know how to deal
with that because that line actually
allows every trained and Tatian in the
next line because literally is then the
same line for the parser which is always
very odd right so what do you want to
run yeah I pull it there yeah I know
actually Gupta's comparing from the same
time the last time I give this talk and
I remember to pull the comma but it is
it seems that not in all these lines
it's actually different font cool well
what do you want to write right you want
to write this so it the same is the same
as the imports Riley you have import and
then you open up on this you
bunch of them and you close the
parentheses and formatters know that how
to format these things because it's the
limited so you want to write that it's
nice right you cannot you cannot because
you could say okay the rule that the
grammar has is the top one right with an
item at least and then you can have zero
or more commas and another item so this
is the rule and then you say okay so
let's write another rule right right you
can have the same or a parenthesis the
same rule and a closed parenthesis right
But it turns out that the FIRST set of with_item includes the open parenthesis, and that open parenthesis is exactly the one you just added. So something very weird happens — which is bad — which is that the parser has a very odd property: if you write this rule in a specific way, the parser will say everything looks fine, and it will just always choose the first production. And you can actually write these things; it turns out that what this allows is one very weird construct, which is 'with', open parenthesis, 'yield', close parenthesis — a generator that yields in the middle of the with. And it turns out that the Python test suite has one instance of this pattern, so when I was playing with this it failed one test that was totally unrelated to the grammar tests; it was in a single unit test.
Right, so in short, the sad story is that we cannot implement this thing with the current parser — it's impossible. I have spent hours writing extremely weird rules for this, like, I don't know, removing indentation so the parser thinks you're writing multiple lines when you're not — it does not work. Instead of saying 'open parenthesis, close parenthesis', I have tried saying 'optionally an open parenthesis and optionally a close parenthesis' and then fixing it up in the AST — you cannot, or at least you cannot easily. So sadly you probably won't have this rule in 3.9, or maybe ever, if we don't change the parser. Sorry.
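As a runnable aside — the context manager `cm` below is just an illustrative stand-in — this is the workaround the LL(1) parser leaves you with: explicit backslash continuations.

```python
import contextlib

@contextlib.contextmanager
def cm(value):
    # Trivial stand-in context manager, for illustration only.
    yield value

# Under the old LL(1) parser, multiple context managers that don't
# fit on one line need explicit backslash continuations:
with cm(1) as a, \
     cm(2) as b:
    total = a + b

print(total)  # 3
```

The parenthesized form `with (cm(1) as a, cm(2) as b):` is exactly the rule discussed above; it only became possible once CPython replaced the LL(1) parser with a PEG parser (PEP 617, Python 3.9), with official support in 3.10.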
Cool. So the idea — the takeaway of this — is that yes, you will see a lot of people saying: oh yeah, the Python grammar is great, because it's one of the few LL(1) grammars out there, and that's an extremely nice property, we should be thankful, it's at the very core and it keeps things very simple. Which is sort of true — don't get me wrong, I think everybody sort of agrees with that. But the problem is that it has two side effects which are not desirable. One of them is that what should be a purely technical thing — it's just an LL(1) parser, and LL(k) parsers can parse LL(1) grammars too, so you could choose a different parser technology — what should have been a technical decision percolates over everything, and it has impacts like this one. This is the end of the pipeline, right? Something that is a very technical choice has the consequence that you cannot allow this rule — which is actually the title of the slide. So it's interesting, because it makes you reflect on what this language, Python, really is, and what is written down about it. So let's continue with the idea — but this is one of the core topics of this talk that I want you to think about, especially when you see new grammar rules in the language, and when you see what people call 'Pythonic' or clear Python: what Python really is, is actually not what you think it is, but what the parser thinks it is — what it allows.
Let's continue. So, what does the parser produce? We saw how the parser is described, how the grammar describes the language — so what is the end result of the parser? The end result of the parser is a parse tree, also called a concrete syntax tree. For example, take 'x + y': the parser generates this tree. These nodes are numbers which are basically defined in a C header, so if you substitute the names in, you will find this — it is basically the tree written out with the rule names. For that expression it goes rule by rule: it starts with eval_input, then testlist, and then it goes down, down, down the grammar until it finds one of the rules that can actually recognize, in particular, the term there, and it goes sideways until it finds the 'x'; then it backtracks — basically it goes up the stack — finds the plus sign, and then goes down again for the other factor. This intermediate tree is still very concrete, very tied to your parser: if I wrote different rules, this tree would be different. The idea is that there is an extra step — which we are not going to talk about here — that converts this thing into an AST, the abstract syntax tree, which is abstracted away from the parser and more linked to the language itself. Instead of all of this, the AST will basically be just a binary expression with three nodes: two for the operands and one for the operator. But the concrete tree is still useful — this is actually what formatters use; in particular, Black uses these trees. The reason it is useful is that at this level you still have information about comments — comments are not part of what you can execute — and line breaks, so you can take this tree and reconstruct your original program, and that is what formatters do, because they don't want to lose your comments or maybe the special indentation you made. So even if this is an intermediate thing, it's still something useful. Okay, so let's see how you make a new grammar rule — let's see if I can do this quickly.
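Before the live demo, an aside you can try at home: the `ast` module exposes the abstract tree just discussed (the concrete tree used to be exposed by the old `parser` module, which was removed in Python 3.10).

```python
import ast

# The AST for "x + y" is a single BinOp node: two operands and one
# operator -- all the grammar-level detour through testlist, term,
# factor, etc. is already gone at this level.
tree = ast.parse("x + y", mode="eval")
print(ast.dump(tree))
```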
Awesome — let's make this bigger. So if you start an interpreter here — this is Python 3.9, because it's the future — it turns out that if you try to use a decorator and you want to put a lambda there, you say okay, '@lambda' something, maybe 32, and it says: oh, this is a SyntaxError — because it turns out you cannot write lambda decorators, or basically arbitrary expressions. So let's actually change that. How do you do it? It's actually pretty simple. This is the CPython source tree. You go to this grammar file, which lives at Grammar/Grammar, you find the rule for the decorator — which is this one — and it says: it is '@' and something called dotted_name. So let's change that to testlist, which is basically everything. And then, okay, here is an extra step, and I'm going to defy all the gods of live coding and do it live: it's actually pretty simple — you see, this code parsing the node is very boring, it checks for '@' dotted_name, so let's change this to testlist as well. First you will fail, because we have not regenerated the parser, so we do 'make regen-grammar', which regenerates the parser, and then we compile. Awesome. Let's see what happens: '@lambda f: 32' on a function — it returns very odd things — and what is this function now? It is 32. Much better, right? Don't do that at home. What?
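For the record, the relaxed decorator grammar demonstrated here eventually became official as PEP 614 ("Relaxing Grammar Restrictions On Decorators", Python 3.9), so on a current interpreter the same trick runs unmodified:

```python
# Valid since Python 3.9 (PEP 614): a decorator can be (almost) any
# expression, including a lambda. The lambda receives the function
# object and returns 32, replacing the function entirely.
@lambda f: 32
def g():
    return "never called"

print(g)  # 32
```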
Okay, awesome. Let's see another one. That one was actually pretty simple — the reason it is simple is that it's just a substitution in a rule, just making the grammar a bit less restrictive, which is always easy to implement. So let's see how to implement a more complicated one — not the most complicated one you could see, but let's implement this: a new operator. It will be an arrow, and it has to be a right arrow, because a left arrow is basically 'less than' followed by 'minus something', and some people use that. So let's implement this. It's going to be something where you put an object on the left, you write the arrow, and you put an object on the right, and this will invoke some magic method — let's call it an arrow method or something like that. So how do you do this? Okay: you go to the grammar file, and you say, let's give the arrow the same precedence as the multiplication sign — so you add the arrow here, next to multiply and divide and all of those. Then you regenerate the grammar, as you just saw. Now, because this is a new token, you need to go to this file and say: okay, this is a new token, it's called RARROW, and it has this arbitrary number here. And then you need to teach the tokenizer to recognize it. It looks like, you know, C code or whatever, but it's actually pretty simple, because the way it works is that there is a switch here which is parsing a token, and you say: do you see a '-'? Good. Do you then see a '>'? Good — then emit RARROW. Pretty simple, right? Okay, good. So with these changes you run the tokenizer, and now it recognizes the arrow as an operator. Cool — so we already have the tokenizer.
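As an aside you can try today: the `tokenize` module shows the tokenizer's view from Python, and `->` already has a dedicated token (RARROW) on stock interpreters, because it is used for return annotations.

```python
import io
import tokenize

# Tokenize the expression "a -> b" and show each token's exact type:
# the '->' comes out as RARROW, a single operator token.
toks = list(tokenize.generate_tokens(io.StringIO("a -> b").readline))
for t in toks:
    print(tokenize.tok_name[t.exact_type], repr(t.string))
```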
So we can move forward. It turns out that now we need to teach the parser to emit this thing, but it turns out this is already done, because the rule for multiplication and division is agnostic of the actual sign you put in the middle, so we inherit that for free. What we need now is to teach Python to emit bytecode for the new operator.
So you need to go to the compiler. The compiler basically takes the AST and emits the bytecode. You go to the compiler code for binary operations — a binary operation is something like 'a + b' — and you will see that this is not in grammar terms anymore; that is already abstracted away. Then you say: okay, if the actual operator is one of these arrow operators, emit this new bytecode that I'm going to call the right-arrow operator. And then you go to the file defining the bytecodes and say: okay, there's a new bytecode target, which is this one — it's just adding the name here. What you're saying is basically that when the compiler sees the token, it emits this bytecode. You don't need to understand every detail; you just need to see that it's actually very simple, even as C code. At this point, if you check the emitted bytecode, you will have an opcode for 'right arrow', and now the thing that is left is teaching Python to call something on the class.
And that's very simple. So this is the eval loop — this is one of the core pieces here — and basically it says: if the target is the right-arrow opcode, grab the object on the left, grab the object on the right — you know, it's a stack machine, so you just pop two from the stack — and then call this function, the binary right-arrow function that I am going to implement; what is left after that is just cleanup. The only important part is the first three lines: get the two operands and call the function. And the function is going to be super simple, because all of the machinery is already implemented. I just need to go to the file defining the type object — so, every class — and say: okay, these are the new magic methods, the normal one and the reversed one — because, you know, it also has to ask the other operand if the left one doesn't have it — called dunder-arrow and dunder-r-arrow, with the signature here; this is just registering them. And the function that I said I needed to implement basically just calls the existing binary-op machinery with this thing here. Again, you don't need to understand everything, you just need to know what to change — it was literally just one line. And just with that, you compile, and you can try it. So you create this Arrow morphism — to make the Haskell people happy — it receives a function that it stores, and it implements this dunder-r-arrow, which is basically a map: it maps the function over whatever is on the other side. And now you can try 'f -> this list', and it will, like, square every element — and it works. I mean, you have to believe me. Well, whatever.
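Since the patched interpreter isn't something you can run at home, here is a sketch of the same demo that emulates the arrow with an existing operator, `>>`, via its reflected hook `__rrshift__` (standing in for the hypothetical `__rarrow__` from the talk):

```python
class Arrow:
    """Stores a function and maps it over whatever appears on the
    left of '>>' -- our stand-in for the talk's new '->' operator."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Reflected hook: invoked for `data >> arrow` because a list
        # doesn't implement __rshift__ for this type.
        return [self.func(x) for x in data]


square = Arrow(lambda x: x * x)
print([1, 2, 3] >> square)  # [1, 4, 9]
```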
Awesome — so it's that simple. Well, it's not always that simple: the two PEPs that were implemented in 3.8 — the walrus operator and PEP 570, the positional-only parameters — those PRs changed like two thousand lines of code. It's not just the grammar: with the walrus in particular the problem is the scoping, there are a lot of scoping rules and it's very difficult; they really did an excellent job there. And it turns out that the slash is actually very difficult to add as well — do you remember that enormous, gigantic automaton for the rule? You need to put something there, and playing this game with the productions is not cool, and you need to keep it more or less fast. Okay, so let's wrap
up. So what did we learn? We learned basically what the Python grammar is, how it is defined, and how it works. We have also learned how this is tied to the CPython implementation — other interpreter implementations will do different things. And you have learned how we generate the parser, and it's actually a very fast parser — I think someone may be able to confirm this, but I think PyPy also has an LL(1) parser. This is an interesting thing to think about, because it turns out that a lot of the packages and other implementations of Python somehow rely on this being LL(1): formatters, packages that give you parsers, implementations of the language, and so on. So imagine that — you know, imagine that you want to put these parentheses around the with statements and we reimplement the parser. We have actually had discussions, and we do have a test implementation of a new parser. But imagine we change this into something more general, like maybe a PEG parser, or LL(k), or something like that: this would basically make life much harder for everyone else, because now all the other LL(1) parsers would not be able to parse these new rules or implement these parenthesized expressions with their parsers.
So it's something to think about, because CPython is sometimes looked at as just, you know, one implementation of the language — the main one, or one of the most used — but it's actually extremely coupled with the language itself. So it's something to consider: every time you change something, it has effects everywhere. And then we learned how all of these things work together — how, using the pieces we have seen before, like the parser and all of that, we can change the language in different ways: more complicated ways, which are more difficult, while the simple ones are great. This is actually a very good sign of good design: the fact that it has been so easy is because there is a strong design connecting all these pieces — it's an old implementation, but it's very strong. And you have also learned something that I think is very important, which is what the limitations of the grammar are. Even if you think a rule is very, very simple — like, you have parentheses, and you may think that putting parentheses around something is super, super easy — you can fall into one of these problems literally everywhere, and it is very difficult to implement new grammar rules while keeping everything consistent, which is also very important. Even if you manage to implement a new grammar rule, the price can be that you limit everything else: for example, the 'yield' expression is also limiting a lot of things, because it forces some paths in the grammar just to be able to work the way it does. So the point is that even if you manage to implement something, you may be limiting other things in the future — something to think about if you want to maintain an LL(1) grammar. And this is everything I have, so I don't know if there is any time for questions, but I hope you enjoyed it.
[Music]
Thank you, Pablo. We have a minute for questions. — So with the 'with' statement and adding the parentheses for the parenthesized with list: that's ambiguous because there is another thing you can do with parentheses. What is the other thing you can do with parentheses there in the 'with' statement? — Yeah. So, for example, this is valid — let me change the slide — a yield, or anything that has parentheses. For example, you can write 'with' and a parenthesized yield, and the parentheses... if you write this, then when the parser sees the open parenthesis it goes into the node that is parsing the yield, and when it later sees the right paren it would say 'this is imbalanced, because I already parsed this one'. Technically you can also do this: you can write an arbitrary number of parentheses here, and also do this — which is a bit silly, but valid as well — and those parentheses belong to the test rule. In particular, if you look at the grammar here, you see the 'atom' rule: this rule includes the open parenthesis, and it goes into a yield expression or a comprehension, which, as you already know, covers everything. So you can basically write an open parenthesis, then whatever, and it goes around again — you can loop. So that's why it clashes.
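The answer can be checked directly: parentheses around a single context-manager expression have always been legal, because they are consumed by the 'atom' rule inside the expression, not by the with statement itself. A small sketch using `contextlib.nullcontext` (Python 3.7+):

```python
from contextlib import nullcontext

# These parentheses belong to the expression (the grammar's 'atom'
# rule), not to the with statement, so they parse fine even under
# the old LL(1) parser:
with (nullcontext(1)) as a:
    print(a)  # 1

# The atom rule is recursive, so arbitrarily many parentheses work:
with (((nullcontext(2)))) as b:
    print(b)  # 2
```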
thank you very much
Unfortunately there's no more time for other questions.
[Applause]