"An Incremental Approach to Compiler Construction"
Abdulaziz Ghuloum
http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
Can't recommend it enough. This looks interesting:
> Like Ghuloum [2006], this course is different. Our development is by extending the complete, working compiler one small step at a time. At each step we end up with the working compiler, for a subset of the source language. Specifically, the methodology ([Ghuloum, 2006, §2.6]):
> 1. choose a small subset of the source language that is easy to directly compile to assembly;
> 2. write extensive test cases;
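The first increment in that methodology is famously tiny. A sketch of roughly what step 1 looks like (my own illustration in Python, not code from the paper or course; the entry-point name follows Ghuloum's convention):

```python
# Increment 1: the whole "language" is a single integer literal,
# compiled directly to x86-64 assembly that returns it from scheme_entry.

def compile_program(source: str) -> str:
    """Compile a program consisting of one integer literal."""
    value = int(source)  # parsing is trivial at this stage, on purpose
    return "\n".join([
        "    .globl scheme_entry",
        "scheme_entry:",
        f"    movl ${value}, %eax",  # return value goes in %eax
        "    ret",
    ])

# Step 2 of the methodology: test cases, even for a subset this small.
assert "movl $42, %eax" in compile_program("42")
assert "movl $-7, %eax" in compile_program("-7")
```

Each later increment (unary primitives, conditionals, heap objects, …) extends this same working pipeline rather than starting a new phase.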
https://mitpress.mit.edu/9780262047760/essentials-of-compila... - Jeremy Siek's Essentials of Compilation (there's also a Python version)
Admittedly I didn't finish working through it (life got in the way), but I worked through a good chunk of it with a colleague from a non-CS background. He wanted to understand compilers and it seemed to be much more effective than other texts he'd tried to work through. I'll have to look at this one at some point, probably this summer when things are usually calmer for me.
This seems like a misstep that I've seen in a few other compiler implementation courses. For some reason these programming language professors always insist on completing the project in their personal favorite language (Haskell, OCaml, Standard ML, etc.).
As a student this makes the project significantly more difficult. Instead of just learning how to implement a compiler, you're also learning a new (and likely difficult) programming language.
Also, neither OCaml nor SML is hard to learn. Haskell is more challenging, but that's because it's become, in a sense, multiple languages. The core of Haskell is no harder to learn than OCaml or SML, except for reasoning about lazy evaluation and some of its consequences. All the things people layer on top of Haskell do make it more to learn, but what you'd need to reach utility equivalent to SML or OCaml for a compilers course is not that hard.
The compilers class can then be taught in it without worrying much about that problem.
This book is open access, https://github.com/IUCompilerCourse/Essentials-of-Compilatio... both the Python and the Racket versions.
If you are the kind of nerd that has a nerd book club with friends, this book is perfect for it.
I'm porting Dijkstra's algorithm over to C# at the moment, and in the last several hours here are the two most clownish things that have happened:
1) I have:
if (node is null) {
..
}
My IDE is telling me "Expression is always false according to nullable reference types' annotations". Nevertheless it enters that section every time.
2) I have:
SortedSet<int> nums = [];
Console.Out.WriteLine(nums.Min);
You know what this prints? 0
The minimal element of a set which has no elements is 0. Yes, every language has its warts, and anecdotes like this aren't going to change anyone's mind. I only wrote these up because they're hitting me right now in one of those languages that companies use "because anyone can learn them". I like harder languages better than easy languages, because they're easy.
Frankly, I try to avoid languages that don’t have ADTs as much as possible. They are incredibly useful for specifying invariants via your design, and their constraints on inputs lend themselves to easier implementation and maintenance.
I wish there was a course designed somewhere which talked about more ingrained issues: how to structure/design the AST[0], buffer based vs noncontextual tokenization/parser design, index handling and error sync in the parser, supporting multiple codegen architectures, handling FFI, exposing the compiler as an API for external tooling, namespaces and linkage handling etc. etc. etc.
It is refreshing to see how Carbon designed some of its components (majorly the frontend, yet to take a look at the backend) as it touches on some of the subtleties I mentioned. If someone is starting out on writing one, I would recommend taking a look at it or any of the talks.
Always nice to see new material coming up. A few resources I'd like to mention: dabeaz's compiler course, Khoury College's compiler course (in Rust, and previously OCaml, I think), Nora Sandler's book, as well as http://compilerbook.org, which I consider to be the best guide out there for writing small learning compilers; the videos are good as well.
[0]: Some related content that I enjoyed reading: https://lesleylai.info/en/ast-in-cpp-part-1-variant/
That's a completely separate field. If you want to learn about language design you should study that, though it helps if you have some background on both to get started with either one.
While designing a language is by no means trivial, it generally really occupies just a very small fraction of the language/compiler developer's time. And, in most cases, the two things (language design + implementation details) have to walk hand-in-hand, since small changes to the language design can vastly improve the implementation end.
Besides, for someone learning it for the first time, designing a new language seems a bit difficult.
Back in the 2000s there were some CS undergraduate programs that attempted to use Java in the entire curriculum, from introductory courses all the way to senior-level courses such as compilers. There was even an operating systems textbook that had Java examples throughout the text (https://www.amazon.com/Operating-System-Concepts-Abraham-Sil...).
I think using only one language for the entire undergraduate CS curriculum is a mistake. Sure, students don't have to spend time learning additional languages. However, everything has to fit into that one language, depriving students of the opportunity to see how languages better suited to specific types of problems could actually enhance their understanding of the concepts they are learning. In the case of Java, it's a good general-purpose programming language, but there are classes such as computer organization and operating systems where it's important to discuss low-level memory management, which conflicts with Java's automatic memory management.
When it comes to writing compilers, it turns out that functional programming languages with algebraic data types and pattern matching make working with abstract syntax trees much easier. I learned this the hard way when I took compilers in 2009 at Cal Poly. At the beginning of the course, we were given two weeks to write an AST interpreter of a subset of Scheme. My lab partner and I didn’t like Dr. Scheme (now known as Racket), which we “endured” the previous quarter in a class on implementing programming language interpreters, and so we set about writing our interpreter in C++. It turned out to be a big mistake. We got it done, but it took us 40 hours to implement, and we had a complex class hierarchy to implement the AST. We realized that Dr. Scheme’s features were well-suited for manipulating and traversing ADTs. We never complained about Dr. Scheme or functional programming again, and we gladly did our other compiler assignments in that language.
16 years later, I teach Haskell to my sophomore-level students in my discrete mathematics class at a Bay Area community college that uses C++ for the introductory programming course.
Imagine if Java and C had a love child, basically.
MIR is a fantastic piece of engineering.
Honestly the hardest part is representing types. Having played around with other compilers it seems to be a typical problem.
I’m stuck in the minutiae of representing highly structured complexity and defining behavior in c. I can understand why many languages have an intermediate compiler - it’s too much work and it will probably change over time.
In general I'd recommend using Min() (from LINQ), which works as expected.
But this property has this remark: "If the SortedSet<T> has no elements, then the Min property returns the default value of T."
1) Feels like you used the compiler hints incorrectly.
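For contrast, this is an API-design choice other languages make differently. A quick Python analogy (not C#, just to show the alternative semantics): min() refuses to invent a value for an empty collection unless you opt in explicitly.

```python
nums: set[int] = set()

# No silent default: min() of an empty collection raises.
raised = False
try:
    min(nums)
except ValueError:
    raised = True
assert raised

# The default is opt-in and explicit:
assert min(nums, default=0) == 0
```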
From what I've done in toy compilers and seen in actual production-ready compilers, the codegen is still very much tied to one target or another, or else rests on LLVM.
The LSP for Java [2], used in e.g. VS Code's Java plugins, builds on this API.
But, no, I haven’t seen a generalized approach to this architecture discussed in literature.
1: https://github.com/eclipse-jdt/eclipse.jdt.core 2: https://github.com/eclipse-jdtls/eclipse.jdt.ls
Hmm, I wonder if an LLM could sift an "Essentials of Compilation" search [3] for interesting repos?
[1] https://github.com/search?q=Ghuloum&type=repositories https://github.com/search?q=incremental%20approach%20to%20co... [2] just some old bookmarks: https://github.com/namin/inc https://github.com/iambrj/imin https://github.com/jaseemabid/inc [3] https://github.com/search?q=Essentials%20of%20Compilation&ty...
I keep running into situations where I'd like to describe data in a high level. BNF grammars often fit those situations, are more readable than regex's, and could make for nice parsers. One must know how to parse, though. :)
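A small sketch of the knowing-how-to-parse part: a BNF-ish grammar and the hand-written recursive-descent parser that falls out of it almost mechanically (grammar and names are my own, just to show the shape):

```python
#   expr ::= term (("+" | "-") term)*
#   term ::= NUMBER | "(" expr ")"
import re

def tokenize(s):
    return re.findall(r"\d+|[()+\-]", s)

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat(expected=None):
        nonlocal pos
        t = tokens[pos]
        if expected is not None and t != expected:
            raise SyntaxError(f"expected {expected!r}, got {t!r}")
        pos += 1
        return t
    def expr():                      # one function per grammar rule
        node = term()
        while peek() in ("+", "-"):
            op = eat()
            node = (op, node, term())
        return node
    def term():
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return int(eat())
    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input at {peek()!r}")
    return tree

assert parse(tokenize("1+(2-3)")) == ("+", 1, ("-", 2, 3))
```

Each nonterminal becomes a function, and each alternative becomes a branch; that one-to-one mapping is why BNF reads so well as a parser spec.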
Is the project public? Really interested in the AOT support, I've always wanted to see its generated code but didn't find an easy way to dump it.
There’s also a new experimental rewrite of the Nim compiler, called Nimony, which targets a new intermediate representation called NIFC. That is intended to then be transformed to C, LLVM, JavaScript, etc.
IMHO the C family is a symptom of a "low level" of competence. I often get surprised by the superficiality of their arguments. "Sophistry" is a useful concept (in a single word). Ad nauseam?
[1]: http://www.semanticdesigns.com/Products/DMS/LifeAfterParsing...
Lisp Machines? "Separation of Concerns" [1]? Conway's Law?
Hint: You might be interested in Forth Systems and Lisp Machines.
[1]: https://github.com/WeActStudio/WeActStudio.MiniSTM32F4x1 [2]: https://radxa.com/products/zeros/zero3e
You want to generate e.g. x86 + ARM + RISC-V, yeah? And shouldn't that be a result of a modular architecture? Like, your various codegens just take your AST and generate output.
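That modular shape can be sketched in a few lines (my own toy illustration; the AST and instruction choices are simplified, not any real compiler's): each backend is just a function from the shared AST to text, registered under its target name.

```python
# Each backend consumes the same AST; nothing else in the compiler
# needs to know which target was selected.

def gen_x86(ast):
    _op, a, b = ast                  # e.g. ("+", 1, 2)
    return [f"movl ${a}, %eax", f"addl ${b}, %eax", "ret"]

def gen_arm(ast):
    _op, a, b = ast
    return [f"mov w0, #{a}", f"add w0, w0, #{b}", "ret"]

BACKENDS = {"x86": gen_x86, "arm": gen_arm}

def compile_for(arch, ast):
    return "\n".join(BACKENDS[arch](ast))

print(compile_for("arm", ("+", 1, 2)))
```

Adding RISC-V then means adding one function and one dict entry; in practice the hard part is that real backends also disagree about calling conventions, legal instruction forms, and register counts, which is why shared frameworks like LLVM exist.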
Right now you can just break before the (fun_call)() delegate and disassemble the fun_call in gdb.
The basic trick is to add reloc support to the x86 translate code, mark external calls and replace with 0x0 placeholders, and copy out the machine_code and data segment output to an object file.
I can do basic main functions with simple prints calls but not much more. It’s a hack for now but I’ll refactor it until it’s solid.
I think that would be helpful.