How I wrote a self-hosting C compiler in 40 days

1. peterkelly ◴[14 Dec 15 16:45 UTC] No.10732090[source]▶

For anyone interested in compiler writing and looking for a good resource to start, probably one of the best is the "Dragon Book":

http://www.amazon.com/Compilers-Principles-Techniques-Tools-...

I highly recommend it, but it's heavy stuff. There are probably simpler guides out there that just cover the basics.

replies(6): >>10732136 #>>10732162 #>>10732256 #>>10732890 #>>10733017 #>>10742248 #

2. Ocerge ◴[14 Dec 15 16:52 UTC] No.10732136[source]▶

>>10732090 (TP) #

Indeed it is the standard, but I'm pretty sure I have a permanent twitch in the corner of my eye from my university compilers course and that book. It's probably not something I could ever do for fun, but it's extremely important and useful stuff to know. I'm just too stupid to really grok a lot of the concepts required to write a compiler.

3. sklogic ◴[14 Dec 15 16:55 UTC] No.10732162[source]▶

>>10732090 (TP) #

Please stop recommending the Dragon Book already. It is not just heavy, it is mostly outdated and irrelevant.

replies(7): >>10732214 #>>10732311 #>>10732417 #>>10732559 #>>10732828 #>>10733072 #>>10733258 #

4. fahadkhan ◴[14 Dec 15 17:03 UTC] No.10732214[source]▶

>>10732162 #

Please could you recommend an alternative?

replies(3): >>10732239 #>>10732923 #>>10734295 #

5. landmark2 ◴[14 Dec 15 17:05 UTC] No.10732239{3}[source]▶

>>10732214 #

Modern Compiler Implementation in C

replies(1): >>10732740 #

6. cbd1984 ◴[14 Dec 15 17:08 UTC] No.10732256[source]▶

>>10732090 (TP) #

Jack Crenshaw, of "Let's Build A Compiler" fame, has an interesting critique of works like these:

http://compilers.iecc.com/crenshaw/tutor8.txt

The main point for our discussion is last:

> Desire for Generality

> We have been concentrating on the use of a recursive-descent parser to parse a deterministic grammar, i.e., a grammar that is not ambiguous and, therefore, can be parsed with one level of lookahead.

> In practice, these issues turn out to be considerably less important. Modern languages tend to be designed to be easy to parse, anyway. That was a key motivation in the design of Pascal. Sure, there are pathological grammars that you would be hard pressed to write unambiguous BNF for, but in the real world the best answer is probably to avoid those grammars!
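To make the quoted point concrete, here is a minimal sketch of a recursive-descent parser that decides everything from a single token of lookahead. It is in Python rather than Crenshaw's Turbo Pascal, and the toy grammar and every name in it (`tokenize`, `Parser`, `evaluate`) are illustrative assumptions, not code from the tutorial.

```python
import re

# Hypothetical toy grammar (not from Crenshaw's tutorial):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        # The single token of lookahead: every decision below uses only this.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.factor()
            # Integer division keeps the toy evaluator in whole numbers.
            value = value * rhs if op == "*" else value // rhs
        return value

    def factor(self):
        tok = self.advance()
        if tok == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    """Parse and evaluate one expression, rejecting trailing input."""
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

Each grammar rule becomes one method, and operator precedence falls out of which method calls which. For example, `evaluate("1 + 2 * 3")` parses the multiplication inside `term` before `expr` sees the addition.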

Here's a bit more of what he said, heavily snipped:

> Limited RAM Forcing Multiple Passes

> All the early compiler writers had to deal with this issue: Break the compiler up into enough parts so that it will fit in memory. When you have multiple passes, you need to add data structures to support the information that each pass leaves behind for the next. That adds complexity, and ends up driving the design.

> Batch Processing

> In a mainframe compiler as well as many micro compilers, considerable effort is expended on error recovery ... it can consume as much as 30-40% of the compiler and completely drive the design. The idea is to avoid halting on the first error, but rather to keep going at all costs, so that you can tell the programmer about as many errors in the whole program as possible.

> Large Programs

[Basically comes down to "This is 1980s Micro Land. We don't have mmap or virtual memory in general. The simple way is to keep everything in RAM and encourage small subroutines."]

> Emphasis on Efficiency

[This is an interesting point. He says that we have fast enough CPUs that compilers can emit sub-optimal code and it doesn't matter.]

> Limited Instruction Sets

[Eh. I don't agree that limited instruction sets make compilers more complicated. Even in his design, code generation was a rather trivial part of the entire program.]
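The "keep going at all costs" error recovery described under Batch Processing above is often done with what is usually called panic mode: record the error, skip ahead to a synchronizing token, and resume. A minimal sketch, assuming a made-up statement form `NAME = NUMBER ;` and a hypothetical `parse_program` helper (neither is from Crenshaw's text):

```python
def parse_program(tokens):
    """Parse statements of the form NAME '=' NUMBER ';'.

    Returns (assignments, errors) and reports every bad statement
    instead of halting at the first one.
    """
    assignments, errors = {}, []
    pos = 0

    def expect(predicate, what):
        nonlocal pos
        if pos < len(tokens) and predicate(tokens[pos]):
            pos += 1
            return tokens[pos - 1]
        found = tokens[pos] if pos < len(tokens) else "end of input"
        raise SyntaxError(f"token {pos}: expected {what}, found {found!r}")

    while pos < len(tokens):
        try:
            name = expect(str.isidentifier, "a name")
            expect(lambda t: t == "=", "'='")
            value = expect(str.isdigit, "a number")
            expect(lambda t: t == ";", "';'")
            assignments[name] = int(value)
        except SyntaxError as err:
            errors.append(str(err))
            # Panic mode: skip to just past the next ';' and resume there.
            while pos < len(tokens) and tokens[pos] != ";":
                pos += 1
            pos += 1
    return assignments, errors
```

Calling `parse_program("x = 1 ; y = = ; z 3 ; w = 4 ;".split())` keeps the two valid assignments and reports both broken statements in one run. Even this toy version shows why recovery logic grows quickly: a poor choice of synchronizing token silently swallows valid code.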

Main link:

http://compilers.iecc.com/crenshaw/

replies(1): >>10732377 #

7. barrkel ◴[14 Dec 15 17:15 UTC] No.10732311[source]▶

>>10732162 #

It's fairly heavy in lexing and parsing theory, especially around finite automata, pushdown automata, etc. It's the kind of book you might want to read if you're reimplementing yacc.

Modern code generation has moved on a bit, so I wouldn't dig too deeply into the latter third or so of the book.

All in all, for a hobby compiler, it would be a poor choice; heavy on unnecessary theory in the front end and outdated on back end stuff.

Let's Build a Compiler by Jack Crenshaw is one of my favourite guides for the hobbyist, although it's a bit dated now since it used Turbo Pascal and targeted 16-bit x86.

replies(1): >>10732885 #

8. kevin_thibedeau ◴[14 Dec 15 17:24 UTC] No.10732377[source]▶

>>10732256 #

Many of these issues are still relevant to embedded systems with limited resources.

replies(1): >>10734254 #

9. ◴[14 Dec 15 17:30 UTC] No.10732417[source]▶

>>10732162 #

10. nickpsecurity ◴[14 Dec 15 17:56 UTC] No.10732559[source]▶

>>10732162 #

There are quite a few decent alternatives. I often suggest Wirth's Compiler Construction and Oberon sources because they're straightforward lessons plus give experience with Wirth's style of simple, safe, efficient languages. Then, they can improve the Oberon System or compilers for personal projects and improvement.

That said, I typically recommend compilers get written in an ML or LISP given it's so much easier to do in those languages. Ocaml and Racket are my main recommendations for modern work. One can also use techniques for deriving imperative code from functional programs if one wants to re-implement that compiler in a specific imperative language on a machine. The constructs are even straightforward enough for macro assembly for those that want to understand it down to bare metal.

replies(2): >>10732949 #>>10733059 #

11. fao_ ◴[14 Dec 15 18:22 UTC] No.10732740{4}[source]▶

>>10732239 #

By A. Appel? I heard that the ML version is better since much of the code apparently isn't very idiomatic (It was supposedly translated directly from the ML book). I would probably recommend both that and The Dragon Book, as they cover roughly the same material but in a slightly different manner.

12. hitlin37 ◴[14 Dec 15 18:33 UTC] No.10732828[source]▶

>>10732162 #

Since technologies change before the book is out, how does that hold in the compiler domain? Are there any modern compiler books that use llvm/gcc as the example compiler?

replies(1): >>10732897 #

13. sklogic ◴[14 Dec 15 18:40 UTC] No.10732885{3}[source]▶

>>10732311 #

It does not make any sense to reimplement yacc in the 21st century. There are far more powerful and yet simple parsing techniques, rendering all that automata stuff useless and outdated. Take a look at PEG, Pratt parsing and GLR.
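For anyone who has not met Pratt parsing, here is a minimal sketch of the idea in Python: each binary operator gets a binding power, and one loop decides whether to keep extending the left operand. The operator table, the tuple-shaped tree, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical binding powers for a toy expression language.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20, "^": 30}
RIGHT_ASSOCIATIVE = {"^"}

def tokenize(text):
    # Characters outside the toy alphabet are simply ignored by this tokenizer.
    return re.findall(r"\d+|[-+*/^()]", text)

def parse(text):
    """Return a nested-tuple syntax tree such as ('+', 1, ('*', 2, 3))."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expression(min_power=0):
        tok = advance()
        if tok == "(":
            left = expression(0)
            if advance() != ")":
                raise SyntaxError("expected ')'")
        elif tok.isdigit():
            left = int(tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

        while True:
            op = peek()
            # Stop when the next token is not an operator or binds too weakly.
            if op not in BINDING_POWER or BINDING_POWER[op] <= min_power:
                break
            advance()
            power = BINDING_POWER[op]
            # Right-associative operators let an equal-power operator nest on the right.
            right = expression(power - 1 if op in RIGHT_ASSOCIATIVE else power)
            left = (op, left, right)
        return left

    tree = expression()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree
```

Adding an operator means adding one table entry, with no new grammar rule per precedence level. That is a large part of why the technique is popular for expression syntax.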

replies(2): >>10733238 #>>10733267 #

14. gamapuna ◴[14 Dec 15 18:41 UTC] No.10732890[source]▶

>>10732090 (TP) #

I am working my way through Alex Aiken's Coursera compiler course, almost three-quarters done and really liking it so far. There was another resource posted on HN earlier which takes you through building a compiler for lisp using C. http://www.buildyourownlisp.com/contents

15. sklogic ◴[14 Dec 15 18:42 UTC] No.10732897{3}[source]▶

>>10732828 #

Not sure if there are any books around LLVM or gcc, but there are a lot of up-to-date texts.

Grune et al., "Modern Compiler Design",

Appel, "Modern Compiler Implementation in ML",

and many more.

P.S. Because of a slow-ban, adding another one to this answer. This is the best source on SSA:

http://www.cri.ensmp.fr/people/pop/papers/2006-12-thesis.pdf

replies(1): >>10733679 #

16. Ace17 ◴[14 Dec 15 18:46 UTC] No.10732923{3}[source]▶

>>10732214 #

Dick Grune - Modern Compiler Design

17. sklogic ◴[14 Dec 15 18:49 UTC] No.10732949{3}[source]▶

>>10732559 #

Yes, the functional side of things is interesting and rarely covered in the standard compilers courses.

There is an old but good book, often overlooked: "Functional Programming" by Anthony J. Field and Peter G. Harrison (1988). Despite the title, it's more about various compilation techniques.

Also an old but still relevant (it is missing the STG, but otherwise full of interesting stuff): http://research.microsoft.com/en-us/um/people/simonpj/papers...

It is also worth following this blog: https://wingolog.org/

replies(1): >>10733264 #

18. gnuvince ◴[14 Dec 15 18:59 UTC] No.10733017[source]▶

>>10732090 (TP) #

The Red Dragon Book is great (I like it better than the newer Purple edition), however for a gentler intro, I really like Crafting a Compiler[1] by Fischer et al.: the first chapter does a great job of introducing the role of each part of a "traditional" compiler and chapter two implements a complete compiler for a tiny calculator language. The rest of the book follows the traditional progression (scanner, parser, type checking, IR generation, etc.) It's a bit expensive, which is my only reservation when recommending it for a university class (although students are resourceful if you know what I mean ;))

[1] http://www.amazon.ca/Crafting-Compiler-Charles-N-Fischer/dp/...

19. urs2102 ◴[14 Dec 15 19:06 UTC] No.10733059{3}[source]▶

>>10732559 #

I definitely agree with ML and OCaml, but why do you recommend Lisp for compiler work? I love working with Lisps, but how do you deal with dynamic typing bugs in compiler work? I prefer OCaml/Haskell/ML for its really strong static type checking for compiler work. Just curious though...

replies(2): >>10733223 #>>10734278 #

20. spooningtamarin ◴[14 Dec 15 19:08 UTC] No.10733072[source]▶

>>10732162 #

I learned compilers this way and yes, it's extremely hard. Things got so high-level that I forgot what the real problem was and kept wondering why the algorithm I wrote isn't working.

21. nickpsecurity ◴[14 Dec 15 19:35 UTC] No.10733223{4}[source]▶

>>10733059 #

LISP is easy to parse, transform, build clean-slate, verify, and do fast. My first 4GL tool was a BASIC w/ macros compatible with C and C++. When converted to LISP, it was vastly easier to do transforms and macros plus productivity boost with incremental, per-function compilation. Turns out, Julia did something similar by converting the source into AST's that are LISP expressions with compiler being femtolisp.
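As a rough illustration of the "easy to parse, transform" point, here is a sketch of an S-expression reader plus one tree rewrite, written in Python only for the sake of a neutral example. `read_sexpr` and `fold_constants` are hypothetical names, and this is not how femtolisp or Julia implement anything.

```python
def read_sexpr(text):
    """Read one S-expression into nested Python lists, ints and symbol strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume ')'
            return items
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        try:
            return int(tok)
        except ValueError:
            return tok  # symbols stay as plain strings

    expr = read()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return expr

def fold_constants(expr):
    """Rewrite (+ 2 3) or (* 2 3) on integer literals into the computed literal."""
    if not isinstance(expr, list):
        return expr
    expr = [fold_constants(item) for item in expr]
    if (len(expr) == 3 and expr[0] in ("+", "*")
            and all(isinstance(arg, int) for arg in expr[1:])):
        return expr[1] + expr[2] if expr[0] == "+" else expr[1] * expr[2]
    return expr
```

The whole reader is about twenty lines because the syntax already is the tree. For example, `fold_constants(read_sexpr("(+ x (* 2 3))"))` rewrites the inner product without any separate AST definition.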

As far as safety, you can build checks into your code for protection or use a typed subset of LISP (eg Typed Racket). Shen goes further by embedding a whole sequent calculus for custom, per app/module, type systems.

So, not saying LISP is ideal for compilers but does have some ideal and good attributes. Ocaml is my default given it combines conciseness, correctness, decent performance, ease of learning, and some helpful libs.

22. sparkie ◴[14 Dec 15 19:38 UTC] No.10733238{4}[source]▶

>>10732885 #

Pfft. Who wants to parse in 2015? We all know that constructing your language as a JSON schema is the way forward. Your JSON parser is the only parser you'll ever need.

replies(1): >>10733339 #

23. troydj ◴[14 Dec 15 19:41 UTC] No.10733258[source]▶

>>10732162 #

I wouldn't go so far as to say that the Dragon Book is outdated and irrelevant. (I'm assuming you're referring to the 2nd edition from 2006.) Unless you're focusing on back-end optimization and code generation techniques (something a new compiler writer typically does NOT do), the bulk of the theory and material you'd cover in a first semester compiler course is fairly static.

But if a person is merely looking to bang out a compiler without getting overwhelmed with how to convert NFAs to DFAs for lexing, etc., some good alternative books are:

A Retargetable C Compiler: Design and Implementation, by Hanson and Fraser (http://www.amazon.com/Retargetable-Compiler-Design-Implement...). This book constructs and documents the code for a full C compiler with a recursive descent approach (no flex/lex or bison/yacc). I have some experience augmenting this compiler, so I can vouch for the book's ability to clearly convey their design.

Compiler Design in C, by Allen Holub (http://www.holub.com/software/compiler.design.in.c.html). Downloadable PDF at that link as well. A book from 1990 in which Holub constructs his own version of lex and yacc, and then builds a subset-C compiler which generates intermediate code.

Practical Compiler Construction, by Nils Holm (http://www.lulu.com/shop/nils-m-holm/practical-compiler-cons...). A recent book which documents the construction of a SubC (subset of C) compiler and generates x86 code on the back end.

replies(1): >>10734348 #

24. nickpsecurity ◴[14 Dec 15 19:42 UTC] No.10733264{4}[source]▶

>>10732949 #

I got the 2nd link here before, maybe from you. Might be old but very thorough treatment of about every aspect of functional compilation. I have it saved in djvu and pdf in case I meet someone starting in that area. :)

The other two are new to me. Thanks for those.

25. wtetzner ◴[14 Dec 15 19:42 UTC] No.10733267{4}[source]▶

>>10732885 #

Or even better, GLL.

http://dotat.at/tmp/gll.pdf

replies(2): >>10735271 #>>10735314 #

26. ORioN63 ◴[14 Dec 15 19:54 UTC] No.10733339{5}[source]▶

>>10733238 #

Or a lisp.

27. sgeisenh ◴[14 Dec 15 20:41 UTC] No.10733679{4}[source]▶

>>10732897 #

I'll just add this on: https://www.cs.cmu.edu/~fp/courses/15411-f14/schedule.html

The lecture notes are simple and excellent, though at a very high level of abstraction. Though most of the notes are full of references to the original material, which is super nice.

28. sklogic ◴[14 Dec 15 21:58 UTC] No.10734254{3}[source]▶

>>10732377 #

If you really want to compile something on an embedded system, you'd be much better off with something like Forth anyway.

replies(1): >>10735265 #

29. sklogic ◴[14 Dec 15 22:02 UTC] No.10734278{4}[source]▶

>>10733059 #

ML got ADTs and static type guarantees, but Lisp got macros, which allow you to implement things like http://andykeep.com/pubs/np-preprint.pdf - which is significantly less boilerplate than in ML. An ideal language for compilers would have combined both properties, but to make it happen, the expression problem must be solved first.

See a relevant discussion here from not long ago:

https://news.ycombinator.com/item?id=10712566

replies(1): >>10749258 #

30. xigency ◴[14 Dec 15 22:04 UTC] No.10734295{3}[source]▶

>>10732214 #

Engineering a Compiler 2nd edition, by Keith D. Cooper and Linda Torczon

The dragon books are better used as references to those who are already familiar with all of the high-level concepts in compiler writing.

31. sklogic ◴[14 Dec 15 22:10 UTC] No.10734348{3}[source]▶

>>10733258 #

It is actually an outdated view, to split a compiler into dedicated monolithic front-end and back-end parts. The more modern approach is very much the opposite. It is a nearly continuous, very long sequence of very simple transforms, rewriting the code seamlessly, all the way down from a front-end (i.e., a parser) to a back-end or multiple back-ends. And this approach is very alien to anything you'd find in the Dragon Book.
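A tiny sketch of what "a long sequence of very simple transforms" can look like, in Python for illustration only. The three passes, the tuple-based tree, and the stack-machine instruction names are all assumptions invented for this example; a real pipeline of this kind would have many more passes.

```python
from functools import reduce

def desugar_neg(node):
    """Pass 1: rewrite ('neg', x) as ('-', 0, x) so later passes see only binary ops."""
    if not isinstance(node, tuple):
        return node
    if node[0] == "neg":
        return ("-", 0, desugar_neg(node[1]))
    op, left, right = node
    return (op, desugar_neg(left), desugar_neg(right))

def fold_constants(node):
    """Pass 2: evaluate operators whose operands are both integer literals.

    Assumes pass 1 already ran, so every tuple is a binary operation.
    """
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int):
        return {"+": left + right, "-": left - right, "*": left * right}[op]
    return (op, left, right)

def to_stack_code(node):
    """Pass 3: flatten the tree into instructions for a hypothetical stack machine."""
    if isinstance(node, int):
        return [("push", node)]
    if isinstance(node, str):
        return [("load", node)]
    op, left, right = node
    opcode = {"+": "add", "-": "sub", "*": "mul"}[op]
    return to_stack_code(left) + to_stack_code(right) + [(opcode,)]

# The compiler is just the list of passes, applied in order.
PASSES = [desugar_neg, fold_constants, to_stack_code]

def compile_expr(tree):
    return reduce(lambda ir, run_pass: run_pass(ir), PASSES, tree)
```

Each pass does one thing and can be tested alone; inserting or reordering a pass is a one-line change to `PASSES`. For example, `compile_expr(("+", "x", ("*", ("neg", 2), 3)))` desugars the negation, folds the constant product, and only then emits code.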

As for parsing, as I already said elsewhere in this thread, all the techniques from Dragon Book are not practical any more and are not used in the modern compilers. There are far better ways, which are not covered in the book, and they're far simpler, not even deserving a dedicated book at all.

32. nickpsecurity ◴[15 Dec 15 01:04 UTC] No.10735265{4}[source]▶

>>10734254 #

Or macro-assembler. I think HLA's, Hyde's and the general concept, are underappreciated for this use case.

33. nickpsecurity ◴[15 Dec 15 01:06 UTC] No.10735271{5}[source]▶

>>10733267 #

Best features of both types of parsers? Hell yeah! Never even heard of this one despite all the research updates I did a few years ago. Thanks for the link.

34. sklogic ◴[15 Dec 15 01:13 UTC] No.10735314{5}[source]▶

>>10733267 #

It's an interesting one, I'll add it to my collection.

Although, to be fair, PEG (Packrat) does support left recursion [1], and it can also be combined with Pratt for the binary expressions very efficiently, at O(n).

[1] http://www.vpri.org/pdf/tr2007002_packrat.pdf
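For context on where the O(n) comes from: packrat parsing memoizes each rule's result per input position, so backtracking never re-parses the same span. The sketch below shows only that memoization on a toy, non-left-recursive PEG; it does not implement the left-recursion extension from the linked paper, and the grammar and names are assumptions.

```python
from functools import lru_cache

def make_parser(text):
    """Build a recognizer for a toy PEG over single-digit numbers:

        expr <- term '+' expr / term
        term <- DIGIT / '(' expr ')'

    Each rule maps a start position to the end position of the match, or None.
    """
    calls = {"term": 0}  # counts real (non-memoized) evaluations of term

    @lru_cache(maxsize=None)
    def term(pos):
        calls["term"] += 1
        if pos < len(text) and text[pos].isdigit():
            return pos + 1
        if pos < len(text) and text[pos] == "(":
            end = expr(pos + 1)
            if end is not None and end < len(text) and text[end] == ")":
                return end + 1
        return None

    @lru_cache(maxsize=None)
    def expr(pos):
        # First alternative: term '+' expr
        end = term(pos)
        if end is not None and end < len(text) and text[end] == "+":
            rest = expr(end + 1)
            if rest is not None:
                return rest
        # Ordered choice falls back to a bare term; this lookup hits the
        # memo table instead of parsing the same span again.
        return term(pos)

    return expr, calls
```

A full match is `expr(0) == len(text)`; a smaller result means only a prefix matched. Without the cache, the fallback in `expr` would re-run `term` from scratch at every nesting level, which is where naive backtracking goes exponential.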

replies(1): >>10735857 #

35. wtetzner ◴[15 Dec 15 04:04 UTC] No.10735857{6}[source]▶

>>10735314 #

> Although, to be fair, PEG (Packrat) does support left recursion [1]

Unfortunately it sometimes generates the wrong parse: http://tratt.net/laurie/research/pubs/papers/tratt__direct_l...

replies(1): >>10736567 #

36. sklogic ◴[15 Dec 15 08:12 UTC] No.10736567{7}[source]▶

>>10735857 #

And all the edge cases are related to the binary expressions - and you should be using Pratt for them instead.

Left recursive PEG must be reserved for things like method, structure field and array access, which are ok with that algorithm.

replies(1): >>10738280 #

37. wtetzner ◴[15 Dec 15 15:30 UTC] No.10738280{8}[source]▶

>>10736567 #

Ah, gotcha. I struggled with PEG and left recursion for a while before finding GLL. Do you have any resources on how to combine PEG and Pratt?

replies(1): >>10738685 #

38. sklogic ◴[15 Dec 15 16:31 UTC] No.10738685{9}[source]▶

>>10738280 #

There must have been some papers on it, but I lost track. Will try to dig out something.

Meanwhile, you can take a look at my implementation of such a mix: https://github.com/combinatorylogic/mbase/tree/master/src/l/...

An example of a grammar built on top of it: https://github.com/combinatorylogic/clike/blob/master/clike/... - see the `binary` nodes there, they're translated into Pratt while all the others are Packrat.

39. astrange ◴[16 Dec 15 03:50 UTC] No.10742248[source]▶

>>10732090 (TP) #

Strong disagreement. The Dragon Book is super expensive, spends all its time discussing the easy part (parsing) as if it was complicated, then spends no time discussing modern algorithms in code generation.

Working from an example like LLVM or reading Appel's or Wirth's books is better.

replies(1): >>10742274 #

40. isolate ◴[16 Dec 15 03:56 UTC] No.10742274[source]▶

>>10742248 #

Also Muchnick's book is good for the middle end.

41. nickpsecurity ◴[17 Dec 15 03:15 UTC] No.10749258{5}[source]▶

>>10734278 #

A commenter elsewhere said Racket had ADT's with #lang plai but also with #lang racket via subtypes of struct type with match macro. Said it was same style of programming.

So that's static typing (Typed Racket) and macros for sure. Possibly ADT's depending on whether you object to claim above. Should cover your list of requirements.

replies(1): >>10750279 #

42. sklogic ◴[17 Dec 15 08:38 UTC] No.10750279{6}[source]▶

>>10749258 #

The problem here is that a single variant tag does not uniquely define a type, as all such type systems expect.

replies(1): >>10752296 #

43. nickpsecurity ◴[17 Dec 15 16:15 UTC] No.10752296{7}[source]▶

>>10750279 #

You think it could be modified to handle that?