Parsing System (v0.5)
Never Used a Parsing Tool?
How a Compiler Works
(Or: How can I use Goldie to create a compiler?)
There's a common misconception that language tools and computer science
theory have advanced enough that creating a compiler can be fully automated.
This is not true. Only certain parts of writing a compiler can be
automated, and it's only these parts that parsing tools (such as Goldie, GOLD,
Lex/YACC, Flex/Bison and ANTLR) automate. A fully working compiler is a rather
complex program made up of many different parts.
Here are the three main parts of a compiler:
-
Frontend:
This is the only part of a compiler that automated parsing tools deal with.
Even so, automated parsing tools only deal with part of the frontend
(see the section "The Frontend: Lexing, Parsing and Semantic Analysis" below).
A compiler takes source code as input and outputs a program.
The frontend is the "input" part. It brings in your source code, attempts to
understand it, turns it into some sort of internal representation, makes sure
everything is correct and gives errors if the source code has a problem.
-
Optimizer:
Optimization is an optional (but very common) step.
It adjusts the program being compiled so it runs faster, or in some cases,
so it takes up less space. Sometimes this is considered to be part of either
the frontend or the backend, or split between both.
Automated parsing tools don't help with this step. You either have to write
the optimizer manually or use an existing one. Either way, this can
be a fair amount of work.
-
Backend:
The (somewhat amusingly named) backend is the "output" part of a compiler.
It takes the internal representation of the code and converts it into either
machine code (possibly for a virtual machine) or another programming language.
In the case of generating machine code, there's a lot of associated
computer science theory (see the books in More Information).
For many languages, the last part of the backend is the linker. The linker
combines the many different compiled parts of a program (usually one for each
original source file) into one single program. Often, this is done by either a
completely separate program (as with many natively-compiled languages,
such as C/C++) or by the host platform itself (as with dynamically-loaded
libraries and many virtual machines such as the JVM and .NET).
As with the optimizer, automated parsing tools don't help with this step.
You either have to write the backend manually or use an existing one.
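To make the division of labor above concrete, here is a minimal sketch in Python of an optimizer and a backend working on a tiny internal representation. Everything in it is an assumption made for illustration: the tuple-based representation, the function names `optimize` and `emit`, and the stack-machine instructions (`PUSH`, `LOAD`, `ADD`, `MUL`) are hypothetical and are not part of Goldie or any other tool. The frontend is skipped entirely; the internal representation is written out by hand.

```python
# Toy pipeline: internal representation -> optimizer -> backend.
# Expressions are nested tuples (a made-up format for this sketch):
#   ("num", value), ("var", name), ("add", left, right), ("mul", left, right)

def optimize(node):
    """Optimizer: constant folding. An operation on two known numbers
    is replaced by its result at compile time."""
    kind = node[0]
    if kind in ("num", "var"):
        return node
    left, right = optimize(node[1]), optimize(node[2])
    if left[0] == "num" and right[0] == "num":
        value = left[1] + right[1] if kind == "add" else left[1] * right[1]
        return ("num", value)
    return (kind, left, right)

def emit(node, out=None):
    """Backend: emit instructions for a hypothetical stack machine."""
    if out is None:
        out = []
    kind = node[0]
    if kind == "num":
        out.append(f"PUSH {node[1]}")
    elif kind == "var":
        out.append(f"LOAD {node[1]}")
    else:
        emit(node[1], out)   # operands first...
        emit(node[2], out)
        out.append("ADD" if kind == "add" else "MUL")  # ...then the operator
    return out

# Internal representation a frontend might produce for: x * (5 + 10)
ir = ("mul", ("var", "x"), ("add", ("num", 5), ("num", 10)))

unoptimized = emit(ir)           # five instructions
optimized = emit(optimize(ir))   # three instructions: 5 + 10 was folded to 15
```

Even in a toy like this, none of the code comes from a parsing tool. A real optimizer and backend do the same kind of work at a much larger scale.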
The Frontend: Lexing, Parsing and Semantic Analysis
What many people refer to as "parsing" is really a few separate steps: Lexing
(or "Lexical Analysis"), Parsing (or "Grammatical/Syntactical Analysis"), AST Creation,
and Semantic Analysis.
-
Lexing:
This separates the source into a series of tokens. For instance,
int numApples = 10 gets converted into
"Keyword 'int', Identifier 'numApples', Equals sign, Number 10".
Goldie does this in the Lexer class by using a
DFA.
Lexers are also sometimes called tokenizers or scanners.
You can view the result of this step using Parse and JsonViewer.
-
Parsing:
This arranges the lexed tokens into a tree. The structure of the tree is
based directly on the rules in the language's grammar.
Goldie does this in the Parser class by using an
LALR(1) algorithm.
You can view the result of this step using Parse and JsonViewer.
Sometimes parsers don't actually build a real parse tree. They may simply
process the tokens as they're being parsed, or they may merge the parsing step
with AST creation (see below) and directly output an AST.
Goldie always builds a parse tree. (Although a future version of Goldie might
provide the ability to omit it.)
-
AST Creation:
This step is optional and is only sometimes performed by automatic parsers.
Goldie doesn't currently perform this (so you'll have to do it yourself),
but it probably will in a future version. Sometimes this is considered
part of either the parsing step or the semantic analysis step.
In this step, the parse tree is converted into an AST (Abstract Syntax Tree).
An AST is like a parse tree, but it more closely resembles the way
humans understand the code. For instance, the parse tree representation of
5 + 10 can be somewhat complex, unintuitive and highly
dependent on the language's grammar. But an AST would most likely represent it
very naturally: With one node for "Addition" that contains two subnodes,
one for "5" and one for "10".
An XML/HTML DOM is a good example of an AST. For another example, see the output
of GenDocs's -ast flag.
-
Semantic Analysis:
This step is generally NOT performed by automatic parsers.
The user of such tools has to perform this step on their own because it's
not as easily formalized as lexing, parsing or AST creation.
In this step, the parse tree or the AST is analyzed and actual meaning is
determined. This often involves extra error
checking. For instance, in statically-typed languages, the type system is
enforced in the semantic analysis phase. This step is also where type-mismatch errors
and "undefined function/variable" errors are generated. Strictly speaking,
anything in the frontend that isn't formally defined by the language's grammar
is considered part of the semantic analysis phase.
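As a rough illustration of the lexing step described above, here is a small Python sketch that turns int numApples = 10 into the same four tokens. It is not how Goldie's Lexer class works: Goldie uses a DFA built from the grammar, whereas this sketch leans on Python's backtracking `re` module. The token names and patterns in `TOKEN_SPEC` are assumptions chosen to match the example.

```python
import re

# Token kinds and their patterns (illustrative assumptions).
# Order matters: "Keyword" must be tried before "Identifier".
TOKEN_SPEC = [
    ("Keyword",    r"\bint\b"),
    ("Identifier", r"[A-Za-z_]\w*"),
    ("Number",     r"\d+"),
    ("Equals",     r"="),
    ("Whitespace", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Split the source into (kind, text) tokens, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"Unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "Whitespace":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = lex("int numApples = 10")
# tokens is now:
# [("Keyword", "int"), ("Identifier", "numApples"), ("Equals", "="), ("Number", "10")]
```

Note that the lexer knows nothing about whether the tokens appear in a sensible order. Deciding that is the parser's job.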
See the
GenDocs source
for an example of lexing/parsing with Goldie and then constructing an AST
and performing semantic analysis.
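For a much smaller illustration of the last two steps than the GenDocs source, the following Python sketch converts a hand-written parse tree for 5 + 10 into an AST and then runs one simple semantic check. The grammar in the comments, the tuple layout of the parse tree, and the names `to_ast` and `check` are all assumptions made up for this sketch. Goldie's actual parse tree types look different.

```python
from dataclasses import dataclass

# AST node types (hypothetical, for this sketch only).
@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Add:
    left: object
    right: object

# Assumed grammar:
#   <Expr> ::= <Expr> '+' <Term> | <Term>
#   <Term> ::= Number | Identifier
# Parse tree nodes are (symbol, payload): payload is the matched text for
# terminals and a list of child nodes for nonterminals.

def to_ast(node):
    """AST creation: drop the grammar-only wrapper nodes."""
    symbol, payload = node
    if symbol == "Number":
        return Num(int(payload))
    if symbol == "Identifier":
        return Var(payload)
    if len(payload) == 1:            # <Expr> ::= <Term>, <Term> ::= Number, ...
        return to_ast(payload[0])
    left, _plus, right = payload     # <Expr> ::= <Expr> '+' <Term>
    return Add(to_ast(left), to_ast(right))

def check(ast, declared):
    """Semantic analysis: report uses of undeclared variables."""
    if isinstance(ast, Var):
        return [] if ast.name in declared else [f"undefined variable '{ast.name}'"]
    if isinstance(ast, Add):
        return check(ast.left, declared) + check(ast.right, declared)
    return []

# Parse tree for: 5 + 10
parse_tree = ("Expr", [
    ("Expr", [("Term", [("Number", "5")])]),
    ("+", "+"),
    ("Term", [("Number", "10")]),
])
ast = to_ast(parse_tree)   # Add(left=Num(value=5), right=Num(value=10))

# Parse tree for: 5 + total, where only numApples has been declared
tree_with_var = ("Expr", [
    ("Expr", [("Term", [("Number", "5")])]),
    ("+", "+"),
    ("Term", [("Identifier", "total")]),
])
errors = check(to_ast(tree_with_var), declared={"numApples"})
```

Both inputs are grammatically valid, so a parser accepts them equally. Only the semantic check notices that total was never declared, which is why this step can't be derived from the grammar alone.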
Using a Parsing Tool
Parsing tools only deal with the frontend, and generally not the entire
frontend: just the lexing, parsing, and maybe AST creation parts.
The rest still has to be written by the user of the parsing tool.
There's a variety of parsing tools available, and they differ in
what parsing algorithm they use, what language is used to perform the
parsing, how the grammar is defined, what special tools are used, and
in various other ways.
Goldie is compatible with the GOLD Parsing System, and both use a DFA
for lexing and LALR(1) for parsing. GoldieLib is designed to be used in
programs written in the D programming language (D version 2). However, the
GOLD/Goldie systems are designed to be easily extended to other "host"
languages via GOLD-compatible "engines", many of which are available
in addition to GoldieLib.
See the How To Use Goldie page for an overview of using Goldie.
Additionally, the GOLD website has a great tutorial on
Getting Started With Parsing
that's geared towards GOLD. It's applicable to Goldie as well, since the two are compatible.
More Information
There are many resources for learning more about compilers and the
computer science theory behind them. Here are some recommended
resources to get started:
Web Pages:
Books:
-
Crafting a Compiler
ISBN-10: 0136067050
ISBN-13: 978-0136067054
All books on compiler theory I've seen target an audience of
mathematicians, computer scientists and/or students of such fields, rather
than programmers. But this
is by far the most programmer-friendly one I've come across.
I found it to be of immense help when writing GRMC: Grammar Compiler.
-
Compilers: Principles, Techniques, and Tools
(i.e., "The Dragon Book")
ISBN-10: 0321486811
ISBN-13: 978-0321486813
Widely considered the classic text on compiler theory. It is, however,
one of the most heavily mathematician/computer-scientist-oriented
compiler books out there.
That said, while it's far from being the most programmer-friendly,
it is very detailed and thorough.
-
GOLD Parsing System: Compiler Books
Links to some other books on compiler theory, and also other parsing systems.