Just as the title states, the following series of articles are my attempt to explain as simply as possible how to practically design and implement a compiler, starting from scratch. The completed compiler, though rudimentary, is able to generate binary code for the Intel x86 architecture, using an assembler as an external tool.
How is a compiler made? This question constantly recurs whenever I offer people a compiler-related project. The following pages present my answer.
The question might imply (incorrectly) that there is only one way to create a compiler. The field of information technology, especially software development, is overrun with development methodologies, each claiming to be the best (perhaps more so than in any other area of knowledge). In reality there are as many ways to make a compiler as there are ways to do any kind of software project. There is no single, optimum approach.
While it is true that a compiler is a difficult project (amongst the most complex that can be conceived, especially if it is self-contained), the difficulty merely means that you have to handle the compiler project as you handle any very complex software development – carefully.
The first compilers, like the first computers, were developed by small teams or even by a single individual. Fortran’s first compiler required many man-years of work, and was written by a team of top IBM professionals; while the Free Pascal compiler was started by a single person, although it is currently maintained by a team of programmers.
This has become possible at present, because recent software development tools are far more advanced than those that were available 30 or 40 years ago when Fortran’s first compiler was developed. Modern IDEs and RAD tools enable us to develop complex software much more quickly. We no longer need to handle complicated command-lines, or have in-depth hardware knowledge, nor do we need to be specialists in assembler or binary code.
Consequently it is perfectly feasible today, with little effort, to develop a simple compiler such as those used as examples in compiler courses.
To keep these chapters as simple as possible, I do not deal here with complex development techniques, modern methodologies, design patterns or context-free grammars. Nor do I use object-oriented programming, or new paradigms. I will simply use structured and imperative programming, at the lowest level. On the one hand, this makes the code examples more readable for beginners in the world of programming; and on the other hand, it facilitates the objective of making a self-contained compiler.
A self-contained compiler is one that can compile itself, using itself own code. Although this might seem to pose a chicken and egg problem, a self-compiling compiler is perfectly possible. There are several modern C, C++, and Pascal compilers. To enable self-compilation you have to begin by developing your compiler using a language for which an existing compiler is available. Then, when the developing compiler reaches the appropriate stage of completion, you can migrate it, and translate the developed code into its own language.
Creating a program with a compiler is like using a factory to create a product. Creating a compiler with another compiler is like using a factory to create another factory. And creating a self-contained compiler is like using a factory to create a factory that makes its own parts. At some stage, the “manufactured” factory will no longer depend on the factory that created it and will become independent. A self-contained compiler will depend only on itself. This implies that a such a compiler must be written in its own language as a creation that exists by feeding on itself with no outside help. This compiler will be alone in the world, without warranty, support, discussion forums, or online help, with no software yet developed for it.
This is the scale of the project I intend to pursue. To date, the author has created compilers and interpreters for various languages and architectures. One of my latest creations is a Pascal compiler for the 6502 microcontroller: https://github.com/t-edson/P65Pas. Despite this, I am not an expert on the subject, though I am now experienced. However, if I succeed, this will be the first self-contained compiler I will have created.
You must not expect to become a compiler expert, just because you are able to create a compiler by following a tutorial. Compiler development is a very extensive and complex topic. If you are interested in it you will need to do a great deal of in-depth study.
These articles show you only how a simple compiler can be created, a compiler specifically designed to teach you about how to develop a compiler. Our compiler (so far it has no name) will be built from scratch, with the help of another compiler, and an assembler. We will not use any modern tools like LEX or YACC or LLVM. These would complicate the project and ensure we missed our goal of making a self-contained compiler, since it would then depend on these external tools. We’ll start by coding the simplest part (the lexical parser routines) and end up by coding the most complex parts (the code generator routines).
We will deliberately sacrifice many possible features, avoiding complicated software architectural designs in order to make the code easy to follow.
Code development will be structured in the following phases:
• Introduction – preparing the working environment (tools)
• Inventing a language
• Creating a lexical analyser
• Completing the lexical analyser
• Starting to generate code
• Declaring variables
• Analysing expressions
• Developing initial instructions
• Testing the expression evaluator
• System functions
• Conditionals implementation
• Loop implementation
• Arrangement management
• Function management
• File handling
• Command line handling
At least at the outset, as the project is in its initial phase, these are the topics I think we need to cover. Writing a complete compiler is an enormous task which can take years of work. Consequently for this project we will pursue considerably reduced goals, namely:
• For the source language (the one we’re going to compile), we’ll use a simple, imperative, typed, language which has very few instructions.
• The high level code will be compiled to assembly code, which will then be converted to binary code, with the help of an assembler and linker.
• Only numeric types will be supported – there will be no complex types or any string type.
• We will write the minimum of code, with little modularity and no dependencies.
• The most ambitious goal is to be able to make the compiler self-contained.
The objective is to be able to build binaries for Intel’s x86 architecture on 32-bit Windows. Initially I planned to compile for 16-bit Intel, since that target is where I have the most experience. However, realizing that 16‑bit code these days is largely obsolete for common PCs, I decided instead to target 32-bit processors. I chose the Windows operating system, because it has well-documented APIs, and there is already a great deal of development for it.
To start creating the compiler source code, we’ll use the Pascal language. This is easy to handle and fits the project’s constraints and objectives well.
To develop a compiler requires a variety of knowledge. In our deliberately simplified case, we require specialized knowledge in the following areas:
- Assembler for Intel x86 32-bit architecture, since this is the target architecture for which we are going to generate code.
- Pascal, at least in its basic Turbo Pascal mode (advanced language features will not be used).
- Compiler theory, at least in terms of understanding the modules a compiler contains and the function of each module.
- The Windows OS, both its file handling and use of its CMD command line.
If you do not understand any of these topics, you are recommended to first review them in order to understand how this compiler is to be developed. Otherwise, you will be working “mechanically” without being clear about the basics of the topic.
Installing needed tools
For this ambitious project you will need to ensure you have three program tools installed on your Windows development machine as follows:
• The Lazarus development environment (https://www.lazarus-ide.org/)
• The MASM32 Assembler (http://www.masm32.com/download.htm)
• A text editor, such as the notepad++ (https://notepad-plus-plus.org/download/v7.6.2.html)
I assume that the interested reader is able to install these tools without problems (there is extensive information to help you on the Web). If you are not able to successfully install these tools, you are unlikely to be able to create a compiler.
We will use the (free) Lazarus environment to develop the first version of our compiler. In other words, we’ll use the Pascal language to create the compiler. But no advanced language objects or features will be used to make it easier to migrate the source code to the new compiler language.
I’ve chosen Pascal since apart from any other advantages, the language we are going to develop look like the standard Pascal. Of course you could use another language and another IDE if you wish, but these articles are based on Pascal and the Lazarus IDE.
We don’t require installation of any packages in Lazarus beyond those that come in its default installation, and we will use Lazarus to build a basic console-type project.
For our assembler, we will use the (free) MASM32 because it comes already prepared with the whole development kit needed of assembler and linker. This makes the installation easy and the tools are designed to work with the 32-bit architecture our compiler is designed for. Again, you can use another assembler if you wish.
We will actually use very little of the IDE that comes with MASM32, using just the assembler and the linker to be able to obtain the binaries, starting with the assembler program that we will generate.
The text editor I’m going to use is Notepad++, but as always, you are free to use an alternative from the multitude of text editors available across the net, many both free and of high quality. Use the editor that is to your taste. You might prefer to use the simple Windows Notepad.exe. No problem. Some people still compile using command-line editors.
I find it more convenient to use the NppExec plug-in, which I find very useful when testing assembly code. For reference only, I’m using a 64-bit Lazarus IDE version 1.8.0, MASM32 version 11, and Notepad++ version 7.2.2.