Episode: 1107
Title: HPR1107: Compilers Part 3
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1107/hpr1107.mp3
Transcribed: 2025-10-17 19:03:52

---
[music]
This is what lexical analysis does: it generates tokens based on its input stream. The input stream to a lexical analyzer is typically raw bytes from a source code file. From that come things like keyword tokens, string tokens, integer tokens, floating-point tokens, operator tokens, and so on.

The mechanism that implements a lexical analyzer is called a finite state machine. It's a conceptual machine that can only be in one state at a time. Its current state is changed by applying the rules of the machine to the input stream. As it transitions from one state to another, it eventually settles on a state that outputs a token, indicating that a token was found.
Let's take searching for the keyword break, for instance. Say that the input stream is B, J, B, R, E, A, K. When the B flows by in the stream, we transition from our initial state to the B state. Then a J flows by in the stream and we go back to the initial state. Once again, we find that a B flows by and we transition to the B state. Then an R flows by, we transition to the R state, and so on until we reach the K state, at which point we output a token.
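The walk-through above can be sketched in a few lines of Python. This is a hand-built illustration, not code from the episode; the function name and token string are made up for the example.

```python
# A minimal finite state machine that scans an input stream and emits a
# token each time the keyword "break" is matched. The state is simply
# how many characters of "break" we have matched so far.
def scan_for_break(stream):
    target = "break"
    state = 0          # 0 = initial state; len(target) = accepting state
    tokens = []
    for ch in stream:
        if ch == target[state]:
            state += 1                   # advance toward the accepting state
            if state == len(target):     # reached the K state: output a token
                tokens.append("KEYWORD_BREAK")
                state = 0
        else:
            # no transition for this character: fall back toward the
            # initial state (restarting at the B state if we just saw a B)
            state = 1 if ch == target[0] else 0
    return tokens
```

Feeding it the episode's stream, `scan_for_break("bjbreak")`, bounces to the B state, back to the initial state on the J, then climbs B, R, E, A, K and outputs one token.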
Typically, we don't build these machines directly; rather, they're built by a generator, which takes as input a description of each token. One of the more popular lexical analyzer generators is called Lex.

Let's review.
Our source flows through a lexical analyzer, which matches its input stream against token descriptions using a generated finite state machine, then outputs tokens to a parser. The parser uses shift-reduce parsing to build parse trees until a significant match occurs in the global parse tree. It performs actions to stitch together a syntax tree at each of these match points, and continues until the entire global parse tree is matched.

The theoretical machine that implements a parser is a pushdown automaton, which is just a finite state machine with a stack attached to it to handle the inherent stacking nature of building trees. As with lexical analysis, we typically don't build these machines directly; rather, they're generated. In the case of parsing, one of the more common generators is called Yacc. Yacc takes the language grammar and automatically generates a parser for us, which is damn convenient.
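To make the shift-reduce mechanism concrete, here is a toy sketch in Python for a grammar like the one used later in the episode, E produces E times E, E plus E, or id. Real parsers are table-driven and generated (by Yacc, for instance); this naive version just shifts tokens onto a stack and greedily reduces whenever the stack top matches a production, stitching a tree node at each reduction.

```python
# Toy shift-reduce parser for: E -> E * E | E + E | id
# Tree nodes are tuples: ("E", "id") or ("E", op, left, right).
def parse(tokens):
    stack = []  # holds unreduced terminals and finished tree nodes

    def try_reduce():
        while True:
            if stack and stack[-1] == "id":
                stack[-1] = ("E", "id")                 # reduce E -> id
            elif (len(stack) >= 3
                  and isinstance(stack[-1], tuple)
                  and stack[-2] in ("*", "+")
                  and isinstance(stack[-3], tuple)):
                right, op, left = stack.pop(), stack.pop(), stack.pop()
                stack.append(("E", op, left, right))    # reduce E -> E op E
            else:
                return

    for tok in tokens:
        stack.append(tok)   # shift
        try_reduce()        # reduce as far as possible
    assert len(stack) == 1, "input did not reduce to a single parse tree"
    return stack[0]
```

On the token stream `["id", "*", "id", "+", "id"]` it reduces the multiplication first, simply because that match appears first left to right, giving a plus node at the root with a times node as its left child.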
So we have a syntax tree. What next? Well, semantic analysis. We've already validated the input stream against a context-free grammar. Now we'll validate it against a context-aware grammar called an attribute grammar. What we're going to be doing in semantic analysis is validating that symbols are declared before they are used. At the same time, we're going to generate a symbol table, and then we'll perform type checking. There are any number of semantic checks you can do based on the nature of your language, but we're just going to focus on these two, as they're applicable to most languages and are actually the two major classes.

In the process of validating the semantics of a syntax tree, two things are generated: an attributed syntax tree and a symbol table. These two things are used in the next step of the compiler, code generation.
There are two primary types of attribute grammars: L-attributed grammars and S-attributed grammars. An L-attributed grammar carries information from parent to child, whereas an S-attributed grammar carries information from child to parent. These two directions produce, on a syntax tree, what are called inherited and synthesized attributes, with L-attributed grammars being associated with inherited attributes and S-attributed grammars being associated with synthesized attributes. The production of synthesized attributes is what type checking will do, and the production of inherited attributes is what checking to see if a symbol is declared before it's used, which we'll call scope checking, will do.

An attribute grammar is just a grammar with attributes associated with it. Typically this is something like name-value pairs and rules for generating them. All right, blah, blah, blah. Whatever, attribute grammars are just a convenient way to tell us what to do at each step.
So let's do scope checking. First, we use a data structure called a symbol table. A symbol table is a map of symbol entries within a hierarchical scope, or whatever the basic scope model of your language is, which is typically hierarchical.

We start scope checking going top-down through the syntax tree, so naturally we're going to start at the global scope. When a symbol is declared, we save it in the symbol table within the current scope. If it's already declared within the current scope, we throw an error. When we encounter a function or something else that has a nested scope, we make that our current scope. Now, whenever we encounter a symbol being used, we walk up the symbol table's scope hierarchy in search of its declaration. This will tell us if it's been declared; if it hasn't, we'll throw an error. It will also tell us what scope the symbol is used in. Attaching the symbol table entry to each use of a symbol would make a fine, a fine contribution to the attributed syntax tree. See what I'm getting at here? See what I'm saying? Other information might trickle down the syntax tree, like a function return type, for instance, and everything else in your L-attributed grammar. That's enough of that.
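The scope-checking procedure described above can be sketched with a small hierarchical symbol table. This is an illustrative sketch, not the episode's code; the class and method names are made up for the example.

```python
# A hierarchical symbol table for scope checking. Declaring a symbol
# twice in the same scope is an error; looking up a use walks up the
# scope chain and errors if no enclosing scope declares the name.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}   # name -> entry (type info, etc.)

    def declare(self, name, entry):
        if name in self.symbols:
            raise NameError(f"'{name}' already declared in this scope")
        self.symbols[name] = entry

    def lookup(self, name):
        scope = self
        while scope is not None:        # walk up the scope hierarchy
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise NameError(f"'{name}' used before declaration")

# Typical use while walking the syntax tree top-down:
glob = Scope()                   # start in the global scope
glob.declare("x", {"type": "int"})
func = Scope(parent=glob)        # entering a function nests a new scope
func.declare("y", {"type": "float"})
```

Inside the function's scope, looking up `x` climbs to the global scope and succeeds; looking up an undeclared name throws the error, just as described.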
Now let's talk synthesized attributes. Say we have a context-free grammar: E produces E times E, or E plus E, or id. Let's take the following source code: 3 times 2 plus 5. Imagine the tree that this produces: E on top branching to E plus E, with the left E branching to E times E, and so on.

Now let's take the S-attributed grammar: E.value equals E1.value times E2.value, where E produces E times E. We do the same thing for E plus E, but we add the values together. And finally, we have E.value equals id.value, where E produces id.

The first thing we do is check the frontier of our syntax tree, which is id times id, plus id. We expand on that using E.value equals id.value, and we find E.value 3 times E.value 2, plus E.value 5. We expand E.value equals E1.value times E2.value, and find E.value equals 6, plus E.value equals 5. Finally, we expand E.value equals E1.value plus E2.value, and we find that at the very top, E.value equals 11.
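The bottom-up evaluation just described is a short recursion over the tree. A sketch in Python, with tree nodes represented as tuples (my representation, not the episode's):

```python
# Bottom-up computation of the synthesized "value" attribute for the
# grammar E -> E * E | E + E | id. Nodes are ("id", n) or (op, left, right).
def value(node):
    if node[0] == "id":
        return node[1]                        # E.value = id.value
    op, left, right = node
    if op == "*":
        return value(left) * value(right)     # E.value = E1.value * E2.value
    if op == "+":
        return value(left) + value(right)     # E.value = E1.value + E2.value
    raise ValueError(f"unknown operator {op!r}")

# The tree for 3 * 2 + 5, i.e. (3 * 2) + 5:
tree = ("+", ("*", ("id", 3), ("id", 2)), ("id", 5))
```

Evaluating `value(tree)` synthesizes 6 at the times node and 11 at the root, matching the walk-through above.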
We usually throw a type error if the left child's type under a parent operator does not match the right child's type. That isn't always the case, of course. Depending on your language, there might be room to add another node to the tree for a type conversion between parent and child, for instance.

The funny thing about synthesized attributes is that we can actually compute them at the same time as we do parsing, since we're building the syntax tree bottom-up. Just thought I'd throw that in there. There's a logical separation between parsing and semantic analysis, and that's important to conceptualize, but in practice the lines may be blurred just a little bit.
The goal of everything we've talked about so far is to verify that source code matches our language description. Almost as a byproduct of this verification, we generate an attributed syntax tree and a symbol table. Everything we've done so far is what's called the front end of the compiler. In the next episode, we're going to start talking about the back end of a compiler.

So, that's it. Thanks for listening to this episode, and I look forward to talking with you a bit later about the back end of a compiler. Take care. Bye-bye.
We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever considered recording a podcast, then visit our website to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are crowd-sponsored by Lunar Pages. From shared hosting to custom private clouds, go to lunarpages.com for all your hosting needs. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike license.