C4 – C in 4 functions

1. TazeTSchnitzel ◴[04 Nov 14 21:03 UTC] No.8559029[source]▶

Pretty damn impressive.

Though I must wonder: how complete is it? What does it and does it not support? It's at least complete enough to be self-hosting, but beyond that? The code doesn't use that much of C.

replies(2): >>8559058 #>>8559068 #

2. cbhl ◴[04 Nov 14 21:10 UTC] No.8559058[source]▶

>>8559029 (TP) #

Judging from the comments in c4.cpp, it probably only supports enough of a subset to compile itself.

Granted, while building a parser that can parse (let alone compiling) the full C language is nontrivial, any undergrad should be able to build a parser and compiler for a sufficiently simple subset of it. (In my undergrad, we used this subset to build a "compiler" in second year: https://www.student.cs.uwaterloo.ca/~cs241/wlp4/WLP4.html)

replies(2): >>8559064 #>>8560966 #

3. JoeAltmaier ◴[04 Nov 14 21:11 UTC] No.8559064[source]▶

>>8559058 #

You can build a C parser in an afternoon. It only has a few language constructs. Declarations are the hardest. Scanners are readily available for expressions and constants.

replies(2): >>8559077 #>>8560875 #

4. Lerc ◴[04 Nov 14 21:11 UTC] No.8559068[source]▶

>>8559029 (TP) #

It's clear from the huge number of 'else if' statements that it doesn't support switch statements.

This is also a bit of a clue

    p = "char else enum if int return while "
    "open read close printf malloc memset memcmp exit main";

replies(1): >>8559087 #

5. cbhl ◴[04 Nov 14 21:14 UTC] No.8559077{3}[source]▶

>>8559064 #

I would be worried about handling the not-context-free parts of the language: http://trevorjim.com/c-and-cplusplus-are-not-context-free/

How would you handle these?

replies(1): >>8559095 #

6. TazeTSchnitzel ◴[04 Nov 14 21:17 UTC] No.8559087[source]▶

>>8559068 #

Ah, the lack of struct explains why the compiler doesn't use any structs.

7. JoeAltmaier ◴[04 Nov 14 21:18 UTC] No.8559095{4}[source]▶

>>8559077 #

Scanner has to have access to the symbol table, emit 'ident' or 'type' as appropriate. So not a pure context-free parser.

replies(1): >>8559385 #

8. sedachv ◴[04 Nov 14 22:15 UTC] No.8559385{5}[source]▶

>>8559095 #

Yes. This is the reason it is way easier to hand-build a parser for C than try to use any kind of parser generator.

replies(1): >>8559433 #

9. JoeAltmaier ◴[04 Nov 14 22:24 UTC] No.8559433{6}[source]▶

>>8559385 #

Bison and Flex work fine - do they count as 'parser generators'?

replies(1): >>8559824 #

10. sedachv ◴[05 Nov 14 00:05 UTC] No.8559824{7}[source]▶

>>8559433 #

Flex does not work fine, that's why you need the lexer hack: https://en.wikipedia.org/wiki/The_lexer_hack

Non-working (no lexer hack) Flex and Bison specification for C99: https://gist.github.com/codebrainz/2933703 653 lines

Hand-written recursive-descent C parser and pre-processor (something not handled by Lex/Yacc) that produces executable Common Lisp code: https://github.com/vsedach/Vacietis/blob/master/compiler/rea... 935 lines

replies(1): >>8559930 #

11. JoeAltmaier ◴[05 Nov 14 00:40 UTC] No.8559930{8}[source]▶

>>8559824 #

You can put whatever code you want into the Flex phrase rules. The 'lexer hack' can be implemented there. That's what I do whenever I have to parse something like C.

replies(1): >>8560693 #

12. sedachv ◴[05 Nov 14 06:06 UTC] No.8560693{9}[source]▶

>>8559930 #

Exactly. You have to drop hooks into the lexer from the parser, and by the time you're done you end up with just as much code. Only it's slower than a recursive-descent parser would be, and a lot of the time parsing speed really matters because it shows up as user-visible latency.

replies(1): >>8561086 #

13. rui314 ◴[05 Nov 14 07:45 UTC] No.8560875{3}[source]▶

>>8559064 #

C is not a simple language as a CIL guy says [1]. I wrote my own C compiler [2] and I can say that writing a parser was harder than I thought. It would take more than half a day at least.

[1] http://www.cs.berkeley.edu/~necula/cil/cil016.html [2] https://github.com/rui314/8cc

replies(2): >>8560897 #>>8561106 #

14. zik ◴[05 Nov 14 07:56 UTC] No.8560897{4}[source]▶

>>8560875 #

Yes, I also wrote my own C interpreter and the parser was a lot more than a day's work.

15. verytrivial ◴[05 Nov 14 08:34 UTC] No.8560966[source]▶

>>8559058 #

Indeed. The lack of a case statement in the run loop is a bit of a giveaway. Excellent work nonetheless.

16. vidarh ◴[05 Nov 14 09:36 UTC] No.8561086{10}[source]▶

>>8560693 #

While I have a strong dislike for lex/yacc and descendants, the "hooks" you need for C are trivial. As far as I remember the only thing you need is an ability for the lexer to check whether or not a given identifier is a variable or type.

Even that is only needed if you want to report more specific information up to the parser. E.g. Clang doesn't. In Clang it is instead the parser that looks up the information in order to figure out what type of identifier it has received.

replies(2): >>8561477 #>>8563641 #

17. vidarh ◴[05 Nov 14 09:46 UTC] No.8561106{4}[source]▶

>>8560875 #

That first link is almost all about language semantics, not parsing issues.

As for your example, I'll be solomonic and say that you and Joe are right, of sorts (though I do think it'd be more than an afternoon).

It's certainly far more than a days work if you handwrite a lexer and parser that does the amount of additional work that yours do (AST construction; a lot of error reporting and sanity checking). But you can get very far with C very quickly if you use parser generation tools and have prior experience writing compilers and your goal is "just" to get something to parse it as quickly as possible - it's a tiny language.

Of course, in practice most real compilers don't use these parser-generation tools exactly because things like proper error reporting etc. is far harder, and a simple recursive descent parser is so much easier to work with.

18. JoeAltmaier ◴[05 Nov 14 12:30 UTC] No.8561477{11}[source]▶

>>8561086 #

Right but by then the rule matching has been done - now you're writing code to distinguish rules instead of code to handle rules.

The advantage of a parser tool is, it makes the language constructs explicit in the code - here be assignment, here be for() etc. That has some value.

19. JoeAltmaier ◴[05 Nov 14 19:15 UTC] No.8563641{11}[source]▶

>>8561086 #

And what are the alternatives? I've always wanted a parser object (in C++ anyway) that I can add constructions to, feed it scanner output and have it build a symbol table and semantic tree. Does such a thing exist?