"An Incremental Approach to Compiler Construction"
Abdulaziz Ghuloum
http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
Can't recommend it enough. This looks interesting:
> Like Ghuloum [2006], this course is different. Our development is by extending the complete, working compiler one small step at a time. At each step we end up with the working compiler, for a subset of the source language. Specifically, the methodology ([Ghuloum, 2006, §2.6]):
> 1. choose a small subset of the source language that is easy to directly compile to assembly;
> 2. write extensive test cases;
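The first increment in that methodology is famously tiny. A sketch of roughly what step 1 looks like (my own illustration in Python, not code from the paper or course; the entry-point name follows Ghuloum's convention):

```python
# Increment 1: the whole "language" is a single integer literal,
# compiled directly to x86-64 assembly that returns it from scheme_entry.

def compile_program(source: str) -> str:
    """Compile a program consisting of one integer literal."""
    value = int(source)  # parsing is trivial at this stage, on purpose
    return "\n".join([
        "    .globl scheme_entry",
        "scheme_entry:",
        f"    movl ${value}, %eax",  # return value goes in %eax
        "    ret",
    ])

# Step 2 of the methodology: test cases, even for a subset this small.
assert "movl $42, %eax" in compile_program("42")
assert "movl $-7, %eax" in compile_program("-7")
```

Each later increment (unary primitives, conditionals, heap objects, …) extends this same working pipeline rather than starting a new phase.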
https://mitpress.mit.edu/9780262047760/essentials-of-compila... - Jeremy Siek's Essentials of Compilation (there's also a Python version)
Admittedly I didn't finish working through it (life got in the way), but I worked through a good chunk of it with a colleague from a non-CS background. He wanted to understand compilers and it seemed to be much more effective than other texts he'd tried to work through. I'll have to look at this one at some point, probably this summer when things are usually calmer for me.
This seems like a misstep that I've seen in a few other compiler implementation courses. For some reason these programming language professors always insist on completing the project in their personal favorite language (Haskell, OCaml, Standard ML, etc.).
As a student this makes the project significantly more difficult. Instead of just learning how to implement a compiler, you're also learning a new (and likely difficult) programming language.
Also, neither OCaml nor SML is hard to learn. Haskell is more challenging, but that's because it's become, in a sense, multiple languages. The core of Haskell is no harder to learn than OCaml or SML, except for reasoning about lazy evaluation and some of its consequences. All the things people layer on top of Haskell do make it more to learn, but what you'd need to reach utility equivalent to SML or OCaml for a compilers course is not that hard.
The compilers class can then be taught in it without worrying much about that problem.
This book is open access, https://github.com/IUCompilerCourse/Essentials-of-Compilatio... both the Python and the Racket versions.
If you are the kind of nerd that has a nerd book club with friends, this book is perfect for it.
I'm porting Dijkstra's algorithm over to C# at the moment, and in the last several hours here are the two most clownish things that have happened:
1) I have:
if (node is null) {
..
}
My IDE is telling me "Expression is always false according to nullable reference types' annotations". Nevertheless it enters that section every time.
2) I have:
SortedSet<int> nums = [];
Console.Out.WriteLine(nums.Min);
You know what this prints? 0
The minimal element of a set which has no elements is 0. Yes, every language has its warts, and anecdotes like this aren't going to change anyone's mind. I only wrote these up because they're hitting me right now in one of those languages that companies use "because anyone can learn them". I like harder languages better than easy languages, because they're easy.
Frankly, I try to avoid languages that don’t have ADTs as much as possible. They are incredibly useful for specifying invariants via your design, and their constraints on inputs lend themselves to easier implementation and maintenance.
I wish there was a course designed somewhere which talked about more ingrained issues: how to structure/design the AST[0], buffer based vs noncontextual tokenization/parser design, index handling and error sync in the parser, supporting multiple codegen architectures, handling FFI, exposing the compiler as an API for external tooling, namespaces and linkage handling etc. etc. etc.
It is refreshing to see how Carbon designed some of its components (majorly the frontend, yet to take a look at the backend) as it touches on some of the subtleties I mentioned. If someone is starting out on writing one, I would recommend taking a look at it or any of the talks.
Always nice to see new material coming up. A few resources I'd like to mention: dabeaz's compiler course, Khoury College's compiler course (in Rust, and previously OCaml, I think), Nora Sandler's book, as well as http://compilerbook.org, which I consider to be the best guide out there for writing small learning compilers; the videos are good as well.
[0]: Some related content that I enjoyed reading: https://lesleylai.info/en/ast-in-cpp-part-1-variant/
That's a completely separate field. If you want to learn about language design you should study that, though it helps if you have some background on both to get started with either one.
While designing a language is by no means trivial, it generally really occupies just a very small fraction of the language/compiler developer's time. And, in most cases, the two things (language design + implementation details) have to walk hand-in-hand, since small changes to the language design can vastly improve the implementation end.
Besides, for someone learning it for the first time, designing a new language seems a bit difficult.
Back in the 2000s there were some CS undergraduate programs that attempted to use Java in the entire curriculum, from introductory courses all the way to senior-level courses such as compilers. There was even an operating systems textbook that had Java examples throughout the text (https://www.amazon.com/Operating-System-Concepts-Abraham-Sil...).
I think using only one language for the entire undergraduate CS curriculum is a mistake. Sure, students don't have to spend time learning additional languages. However, everything has to fit into that one language, depriving students of the opportunity to see how languages better suited to specific types of problems could actually enhance their understanding of the concepts they are learning. In the case of Java, it's a good general-purpose programming language, but there are classes such as computer organization and operating systems where it's important to discuss low-level memory management, which conflicts with Java's automatic memory management.
When it comes to writing compilers, it turns out that functional programming languages with algebraic data types and pattern matching make working with abstract syntax trees much easier. I learned this the hard way when I took compilers in 2009 at Cal Poly. At the beginning of the course, we were given two weeks to write an AST interpreter of a subset of Scheme. My lab partner and I didn’t like Dr. Scheme (now known as Racket), which we “endured” the previous quarter in a class on implementing programming language interpreters, and so we set about writing our interpreter in C++. It turned out to be a big mistake. We got it done, but it took us 40 hours to implement, and we had a complex class hierarchy to implement the AST. We realized that Dr. Scheme’s features were well-suited for manipulating and traversing ADTs. We never complained about Dr. Scheme or functional programming again, and we gladly did our other compiler assignments in that language.
16 years later, I teach Haskell to my sophomore-level students in my discrete mathematics class at a Bay Area community college that uses C++ for the introductory programming course.
Imagine if Java and C had a love child, basically.
MIR is a fantastic piece of engineering.
Honestly the hardest part is representing types. Having played around with other compilers it seems to be a typical problem.
I’m stuck in the minutiae of representing highly structured complexity and defining behavior in c. I can understand why many languages have an intermediate compiler - it’s too much work and it will probably change over time.
In general I'd recommend using Min() (from LINQ), which works as expected.
But this property has this remark: "If the SortedSet<T> has no elements, then the Min property returns the default value of T."
1) Feels like you used the compiler hints incorrectly.
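For contrast, this is an API-design choice other languages make differently. A quick Python analogy (not C#, just to show the alternative semantics): min() refuses to invent a value for an empty collection unless you opt in explicitly.

```python
nums: set[int] = set()

# No silent default: min() of an empty collection raises.
raised = False
try:
    min(nums)
except ValueError:
    raised = True
assert raised

# The default is opt-in and explicit:
assert min(nums, default=0) == 0
```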
From what I've done in toy compilers and seen in actual production-ready compilers, the codegen is still very much tied to one target or another, or else rests on LLVM.
The LSP for Java [2], used in e.g. VS Code's Java plugins, builds on this API.
But, no, I haven’t seen a generalized approach to this architecture discussed in literature.
1: https://github.com/eclipse-jdt/eclipse.jdt.core 2: https://github.com/eclipse-jdtls/eclipse.jdt.ls
Hmm, I wonder if an LLM could sift an "Essentials of Compilation" search [3] for interesting repos?
[1] https://github.com/search?q=Ghuloum&type=repositories https://github.com/search?q=incremental%20approach%20to%20co... [2] just some old bookmarks: https://github.com/namin/inc https://github.com/iambrj/imin https://github.com/jaseemabid/inc [3] https://github.com/search?q=Essentials%20of%20Compilation&ty...
I keep running into situations where I'd like to describe data in a high level. BNF grammars often fit those situations, are more readable than regex's, and could make for nice parsers. One must know how to parse, though. :)
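A small sketch of the knowing-how-to-parse part: a BNF-ish grammar and the hand-written recursive-descent parser that falls out of it almost mechanically (grammar and names are my own, just to show the shape):

```python
#   expr ::= term (("+" | "-") term)*
#   term ::= NUMBER | "(" expr ")"
import re

def tokenize(s):
    return re.findall(r"\d+|[()+\-]", s)

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat(expected=None):
        nonlocal pos
        t = tokens[pos]
        if expected is not None and t != expected:
            raise SyntaxError(f"expected {expected!r}, got {t!r}")
        pos += 1
        return t
    def expr():                      # one function per grammar rule
        node = term()
        while peek() in ("+", "-"):
            op = eat()
            node = (op, node, term())
        return node
    def term():
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return int(eat())
    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input at {peek()!r}")
    return tree

assert parse(tokenize("1+(2-3)")) == ("+", 1, ("-", 2, 3))
```

Each nonterminal becomes a function, and each alternative becomes a branch; that one-to-one mapping is why BNF reads so well as a parser spec.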
Is the project public? Really interested in the AOT support, I've always wanted to see its generated code but didn't find an easy way to dump it.
There’s also a new experimental rewrite of the Nim compiler, called Nimony, which targets a new intermediate representation called NIFC. That is intended to then be transformed to C, LLVM, JavaScript, etc.
IMHO the C family is a symptom of a "low level" of competence. I often get surprised by the superficiality of their arguments. "Sophistry" is a useful concept (in a single word). Ad nauseam?
[1]: http://www.semanticdesigns.com/Products/DMS/LifeAfterParsing...
Lisp Machines? "Separation of Concerns" [1]? Conway's Law?
Hint: You might be interested in Forth Systems and Lisp Machines.
[1]: https://github.com/WeActStudio/WeActStudio.MiniSTM32F4x1 [2]: https://radxa.com/products/zeros/zero3e
You want to generate e.g. x86 + ARM + RISC-V, yeah? And shouldn't that be a result of a modular architecture? Like, your various codegens just take your AST and generate output.
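That modular shape can be sketched in a few lines (my own toy illustration; the AST and instruction choices are simplified, not any real compiler's): each backend is just a function from the shared AST to text, registered under its target name.

```python
# Each backend consumes the same AST; nothing else in the compiler
# needs to know which target was selected.

def gen_x86(ast):
    _op, a, b = ast                  # e.g. ("+", 1, 2)
    return [f"movl ${a}, %eax", f"addl ${b}, %eax", "ret"]

def gen_arm(ast):
    _op, a, b = ast
    return [f"mov w0, #{a}", f"add w0, w0, #{b}", "ret"]

BACKENDS = {"x86": gen_x86, "arm": gen_arm}

def compile_for(arch, ast):
    return "\n".join(BACKENDS[arch](ast))

print(compile_for("arm", ("+", 1, 2)))
```

Adding RISC-V then means adding one function and one dict entry; in practice the hard part is that real backends also disagree about calling conventions, legal instruction forms, and register counts, which is why shared frameworks like LLVM exist.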
Right now you can just break before the (fun_call)() delegate and disassemble the fun_call in gdb.
The basic trick is to add reloc support to the x86 translate code, mark external calls and replace with 0x0 placeholders, and copy out the machine_code and data segment output to an object file.
I can do basic main functions with simple prints calls but not much more. It’s a hack for now but I’ll refactor it until it’s solid.
I think that would be helpful.