What about giving your variables descriptive names?
What about extracting complex conditions into well-named functions to make it clear what is going on (thus defeating the purpose of the "4 functions")?
This list could go on forever...
Writing software is not a contest to see who can write the most code in the most cryptic way.
Though I must wonder: how complete is it? What does it and does it not support? It's at least complete enough to be self-hosting, but beyond that? The code doesn't use that much of C.
EDIT: just make the next() function void and it works.
EDIT2: still no fortune :(
$ ./c4 hello.c
[1] 33920 segmentation fault ./c4 hello.c
Granted, while building a parser that can parse (let alone compile) the full C language is nontrivial, any undergrad should be able to build a parser and compiler for a sufficiently simple subset of it. (In my undergrad, we used this subset to build a "compiler" in second year: https://www.student.cs.uwaterloo.ca/~cs241/wlp4/WLP4.html)
It's worth noting that compilers don't pop out of thin air -- you have to start with something simple in order to compile a more complicated compiler. Bootstrapping your own self-hosting compiler is a useful academic exercise, and you should try it sometime if you haven't already: http://en.wikipedia.org/wiki/Bootstrapping_(compilers)
IMO the real value of exhibits like this is boiling the problem (lexing, parsing, compiling, interpreting) down to its most basic parts. One could easily imagine this same language being implemented over tens or hundreds of class files in a more verbose language.
In any case, for anyone competent enough to work on it, the style is actually quite readable. It's even full of comments.
It's not that bad (or difficult), really. It's a hand-written parser for a subset of C that emits assembly code right away. This is how compilers like Turbo Pascal used to work (see http://www.pcengines.ch/tp3.htm for an explanation of what's happening).
Sure, you could apply cosmetic changes like turning "*++e = bla;" into "emit(bla);", and you could move the cases into independent methods (that are used once), but this isn't meant to be a state-of-the-art compiler, and it won't become one if you apply best practices to it.
If I had to guess, the functions are tied pretty closely to how the author is parsing the file; "next" (next token), "expr" (expression), "stmt" (statement), and "main".
As for the project being called "C in 4 functions"; at best, I'd argue that's just a linkbait-y title since it's not actually C (it's a subset). I don't have a problem with the code _per se_; just the title.
This brings to mind how fuzzy "interpreter" is as terminology. What is a virtual machine that JIT compiles the byte code or the AST? Is it a JIT implementation of an interpreter? What of implementations that only JIT the most frequently used functions? Wouldn't those be half interpreter and half virtual machine?
When it comes down to it, they're all really virtual machines. The real distinction is how we've come to think of different implementations and the representations sub-culturally. For some reason, it makes us feel better when we call certain things interpreters, because of some meaningless (and sometimes factually challenged) competitive instincts concerning implementation speed. (Also, we arbitrarily feel that byte code is somehow more "machine-y" than an AST.)
So do I have a problem with "interpreter"? Only when people correct others as if they're making a correction about something fundamental and factual. In reality, the distinction is between machines that are intended to have the same runtime semantics; it really comes down to which optimizations are present in their implementations. Furthermore, if you look at those optimizations in detail, the distinction gets even hazier.
The same goes for symbolic constants. Sometimes (but not exactly "often"), use of numeric literals can vastly improve the maintainability of some code, assuming the reader understands how to maintain it in the first instance.
As for increasing reader comprehension, carefully thought out comments are a better mechanism by far.
In this case, it is sufficient to know that the file is a compiler/interpreter for its entirety to make sense, assuming the reader has implemented (or at least understood the principles behind) a compiler/interpreter in their past. Expanding the function/variable names, splitting "complicated" expressions out into their own functions, etc., does not magically improve the uninitiated's chance of understanding what is going on.
For example, my entry http://www.ioccc.org/2012/tromp/hint.html is a 25 line "BLC in 7 functions".
Similar to C4, but completely unmaintainable, it compiles Binary Lambda Calculus to bytecode which is then interpreted by a virtual machine.
Edit: OTCC (http://www.bellard.org/otcc/ ) is another extremely tiny (to the point of being obfuscated) example of a compiler using a recursive-descent parser.
Any interpreter vs. VM distinction is mostly a social construct. Looked at technically, it's a mishmash.
Not exactly. You have to remember that language and compiler design require a LOT of work and experience to understand, and that many programmers will only see this as, frankly, spaghetti.
I think it could have used some more block comments, but that's just me.
Non-working (no lexer hack) Flex and Bison specification for C99 (653 lines): https://gist.github.com/codebrainz/2933703
Hand-written recursive-descent C parser and pre-processor (something not handled by Lex/Yacc) that produces executable Common Lisp code (935 lines): https://github.com/vsedach/Vacietis/blob/master/compiler/rea...
What's your favorite shorter intro? I'd especially like to reference other educational compilers that are self-hosting.
For example, a simplified expression grammar with operator precedence can be expressed like this (where {} denotes repetition, implemented with e.g. a while loop in a recursive descent parser (RDP)):
expr ::= term { '+' term | '-' term } ;
term ::= factor { '*' factor | '/' factor } ;
factor ::= '-' factor | <number> | '(' expr ')' ;
Typically this would be parsed in an RDP using expr(), term() and factor() functions, where expr() calls term(), term() calls factor(), and factor() calls expr() if it sees an opening parenthesis. The trouble is that this means every single expression parse needs to call through this deep chain before it can see a single interesting bit of the expression being parsed. Parsing "2 + 3" will result in 5 function calls: expr, term, factor, then term and factor again.
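To make the call counting concrete, here is a minimal, self-contained evaluator in that style, instrumented to count the calls made while parsing; the names and structure are illustrative, not taken from c4:

#include <stdio.h>
#include <ctype.h>

/* Toy RDP for the grammar above; evaluates directly instead of
   emitting code. "calls" counts expr/term/factor invocations. */
static const char *p;       /* parse cursor */
static int calls;

static int expr(void);      /* forward declaration: factor() recurses into expr() */

static int factor(void) {   /* factor ::= '-' factor | <number> | '(' expr ')' */
    calls++;
    while (*p == ' ') p++;
    if (*p == '-') { p++; return -factor(); }
    if (*p == '(') {
        p++;                /* consume '(' */
        int v = expr();
        while (*p == ' ') p++;
        p++;                /* consume ')' */
        return v;
    }
    int v = 0;
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return v;
}

static int term(void) {     /* term ::= factor { '*' factor | '/' factor } */
    calls++;
    int v = factor();
    for (;;) {
        while (*p == ' ') p++;
        if      (*p == '*') { p++; v *= factor(); }
        else if (*p == '/') { p++; v /= factor(); }
        else return v;
    }
}

static int expr(void) {     /* expr ::= term { '+' term | '-' term } */
    calls++;
    int v = term();
    for (;;) {
        while (*p == ' ') p++;
        if      (*p == '+') { p++; v += term(); }
        else if (*p == '-') { p++; v -= term(); }
        else return v;
    }
}

int main(void) {
    p = "2 + 3";
    printf("%d in %d calls\n", expr(), calls);  /* prints "5 in 5 calls" */
    return 0;
}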
The operator precedence parser presented avoids this deeply nested set of calls before it starts parsing interesting things.
This problem is far more acute in a language like C, which has lots of precedence levels, than it is in a language like Pascal, which only has a handful.
C++ compilers nowadays necessarily include a Turing-complete interpreter.
Of course figuring out what it's doing is one thing - understanding why it is done in this particular way is another, and while I was able to find my way around fairly quickly I'd cry if I had to re-implement it. I do love how small it is though, that gives it great educational value.
The 'why' of very low-level tools like this is the sort of thing that needs to be explored at length in a paper or (in this case) a book; otherwise it would swamp the actual code. Sometimes as a learning exercise I'll take something like this and comment the hell out of everything, but the value there is more in writing the comments than in trying to read them again later. Of course this is very much a matter of personal taste.
I'm tempted to link to my draft chapter, but though the code is essentially done the text needs a lot of work.
reverseNaturalSortOrder(listOfItems); // case sensitive sort of items by reverse alphabetical order
or
reverseNaturalSortOrder(listOfItems); // sort this way because the upper layer expects items in reverse order since user requested it
I think it is usually significantly easier to understand what something is doing rather than why it is doing that. To answer the former it usually requires a narrow scope of focus, but the latter requires a very broad scope.
Mind, I mostly do hobby/experimental programming, I've just been doing it for a long time. So I'm not commending this as a work practice or anything - your point about business logic made good sense.
I think this code details a special case of the above though, in that it comments what the enums are instead of just naming the enums. I give that a pass strictly because this code needs to be able to compile itself, and I don't think it supports named enums, so the comment was necessary to make up for that.
There are other aspects that are equally as important, and sometimes even more so. Exploring, learning and prototyping are among them, as are "back of the envelope" constructs.
This is a really cool example that shows how far you can go with tiny amounts of code.
Pithiness is actually one way to make code _more_ understandable, given that the reader is familiar with the subject area.
Don't be a snob. The world is a better place because someone wrote this and not worse off.
Since compiler construction is kind of a deep technical field, documenting this in a pedagogically sane way would have been a huge task.
I'm happy the author took the effort to write and publish this piece; it would be sad if he hadn't, just because it's missing an embedded tutorial.
He's not asking you to maintain it. There are references in this thread that explain what is going on...
I write and maintain C++ production code as a day job, and some of that stuff is an order of magnitude harder to grok than this (no, it's not documented in any _helpful_ way either).
I see no value in naming a variable 'tk', 'pp', or 'bt'. Descriptive names can only make the code more readable with less context. I should not need to understand compilers in detail to know what this program is doing, except that the names are useless. And if I do understand compilers but have not spent half an hour (or probably much more) digesting the exact system by which this operates, I am completely unable to identify bugs or talk about modifying or extending it in any useful way.
Well written comments do not make the code below them more intelligible as text even if they do give you good hints for how to load it into your brain.
And "Expanding the function/variable names, splitting "complicated" expressions out into their own functions, etc." all ABSOLUTELY improve the uninitiated's chance of understanding what is going on. Speaking as one of the uninitiated (though I can mostly make out how this works). Man, I wish I knew what 'tk' was. That whole section at line 204 would be so clear if I could follow the intent.
I agree with you on symbolic constants and arithmetic though.
This code seems to follow a lot of conventions (if I see the var "i" I could bet a million dollars it's an int being used as a counter, probably to walk the positions of an array). It uses plenty of enumerated constants, which is good too.
I've been doing some functional programming, where you find that often types are more important than names. See the "Names are overrated" section of this article [1] ( although this point may not apply to this piece of software... C being the language that it is :p).
[1]: http://techblog.realestate.com.au/the-abject-failure-of-weak...
[1] http://www.cs.berkeley.edu/~necula/cil/cil016.html [2] https://github.com/rui314/8cc
This is not how you code when you work in a team or when you know some other poor soul has to come along and maintain it.
I suggest you take a look at [1] then go and read this excellent book by Martin Fowler: "Refactoring: Improving the Design of Existing Code" [2]
Not all programming is enterprise quality, some programming is intentionally not.
Being so dismissive of that seems a little silly to me.
The demoscene doesn't exist because of a bunch of programmers trying to make the lives of every other programmer more difficult, it exists because people like the challenge.
The functions may be big, but they also don't have all that much duplicate code inside them. The choice of 4 functions isn't arbitrary either - they nicely divide the problem into:
- next() - splits the source code into a series of tokens
- expr() - parses expressions
- stmt() - parses statements
- main() - starts the processing of the source, and also contains the main interpreter VM's execution loop
Code generation is integrated into the parsing, since it's generating code for a stack-based machine and that also very nicely follows the sequence of actions performed when parsing.
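As a toy illustration of that parse-and-emit style (with an invented mini opcode set, not c4's actual code), a parser can push out stack-machine instructions as a side effect of the same expr/term walk, and a small VM loop can then run them:

#include <stdio.h>

/* Invented opcodes, loosely echoing c4's names. Grammar here is just
   single digits with '+' and '*'. */
enum { IMM, PSH, ADD, MUL, EXIT };

int code[64], *e = code;    /* slot 0 unused; we emit with *++e, as c4 does */
const char *p;              /* parse cursor */

void factor(void) {         /* factor ::= digit */
    *++e = IMM; *++e = *p++ - '0';
}

void term(void) {           /* term ::= factor { '*' factor } */
    factor();
    while (*p == '*') { p++; *++e = PSH; factor(); *++e = MUL; }
}

void expr(void) {           /* expr ::= term { '+' term } */
    term();
    while (*p == '+') { p++; *++e = PSH; term(); *++e = ADD; }
}

int main(void) {
    p = "1+2*3";
    expr();                 /* emits: IMM 1 PSH IMM 2 PSH IMM 3 MUL ADD */
    *++e = EXIT;

    /* the VM loop: accumulator a, operand stack, program counter pc */
    int stack[64], *sp = stack, a = 0, *pc = code + 1;
    for (;;) {
        int op = *pc++;
        if      (op == IMM) a = *pc++;
        else if (op == PSH) *sp++ = a;
        else if (op == ADD) a = *--sp + a;
        else if (op == MUL) a = *--sp * a;
        else break;         /* EXIT */
    }
    printf("%d\n", a);      /* prints 7 */
    return 0;
}

Note how the emitted instructions come out in exactly the order the parser visits the operands, which is the point being made above.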
In fact I'm of the opinion that the obsession with breaking code up into tiny pieces (usually accompanied by the overuse of OOP) is harmful to understanding the program as a whole, since it encourages looking at each piece independently and ends up missing the forest for the trees.
In contrast, this is code that is designed to be easily read and understood by a single person, showing how very simple an entire compiler and interpreter/VM can be. It doesn't attempt to hide anything with thick layers upon layers of abstraction and deep chains of function calls, but instead is the "naked essence" of the solution to the problem.
Someone used to e.g. enterprise Java may find this style of code quite jarring to their senses, but that's only because they've grown accustomed to an environment in which everything is highly-abstracted and indirect, hiding the true nature of the solution. Personally, I think the simplicity and "nakedness" of this code has a great beauty to it --- it's a functional work of art.
printf("%8.4s", &"LEA ,IMM ,JMP ,JSR ,BZ ,BNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PSH ,"
"OR ,XOR ,AND ,EQ ,NE ,LT ,GT ,LE ,GE ,SHL ,SHR ,ADD ,SUB ,MUL ,DIV ,MOD ,"
"OPEN,READ,CLOS,PRTF,MALC,MSET,MCMP,EXIT,"[*++le * 5]);
These are all great for dispelling the notion that all compilers somehow necessarily have to be greatly complex and impenetrable to anyone but highly-trained professionals and theoreticians. (Look at the Dragon Book, for example.)
Man, I can't even tell what this is supposed to be. My confusion is entirely founded. My thought process with articles like this goes something like "C in four functions, huh? Sounds like it could be clever. I'll just click and read the explanation... Oh, there isn't an explanation. Well, maybe this file will explain things! ...Nope, it's 500 lines of mostly-uncommented if-else statements. Maybe it's a compiler? I dunno!"
I'm sure there's a subset of the programming community for whom this is crystal clear on first sight, and that's great; but there's a lot more of us who could probably get the joke with a few hints, so it would be nice if you'd help out instead of declaring that if you understand it, it must be easy.
int *a, *b;
int t, *d;
I can't really see how anyone can say that code that uses one-letter variable names (with the exception of the "standard ones", whose meaning is defined at the top) is readable.

Even that is only needed if you want to report more specific information up to the parser. E.g. Clang doesn't. In Clang it is instead the parser that looks up the information in order to figure out what type of identifier it has received.
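(For the curious, the classic example of why a C parser needs that information at all is the typedef ambiguity. A tiny self-contained illustration, unrelated to c4, which has no typedefs:)

/* Why C parsers need type information: the same token sequence means
   different things depending on whether T names a type. */
typedef int T;

int main(void) {
    T * x = 0;  /* declares x as int*, because T names a type here */
    (void)x;
    /* If T and x were ordinary variables instead, "T * x;" would parse
       as a multiplication expression. The parser must consult the symbol
       table (or the lexer must, via the "lexer hack") to decide. */
    return 0;
}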
As for your example, I'll be Solomonic and say that you and Joe are both right, of sorts (though I do think it'd take more than an afternoon).
It's certainly far more than a day's work if you handwrite a lexer and parser that do the amount of additional work that yours do (AST construction; a lot of error reporting and sanity checking). But you can get very far with C very quickly if you use parser-generation tools, have prior experience writing compilers, and your goal is "just" to get something parsing as quickly as possible - it's a tiny language.
Of course, in practice most real compilers don't use these parser-generation tools, exactly because things like proper error reporting are far harder with them, and a simple recursive descent parser is so much easier to work with.
Point being, there are a lot of different parsing approaches, with pros and cons, and it can be useful to treat them separately rather than view them within only a couple of lenses.
1. Readability for modification
2. Readability for the "what"
3. Readability for the "why"
All human-readable description in code is there to make the difference between having a piece of documentation pointing you in the right direction, and having to reverse engineer your understanding. It's an optimization problem with definite tradeoffs. Although CS professors and many tutorials will tend to encourage you towards heavy description, over-description creates space for inconsistent, misleading documentation, which is worse than "not knowing what it does."
When you see code that is dense and full of short variables, it's written favorably towards modification. It is relying on a summary comment for "what" it does, and perhaps on namespaces, scope blocks, or equivalent guards to keep names from colliding. Such code lets you reach flow state quickly once you've grokked it and are ready to begin work, because more of it fits onscreen, and you're assured the smallest amount of typing and thus can quickly try and revise ideas. The summary often gets to stay the same even if your implementation is completely different. And if lines of code are small, you enjoy a high comments-to-code ratio, at least visually.
Code that builds in the "what" through more descriptive variable names pays a price through being a little harder to actually work in, with big payoffs coming mostly where the idea cannot be captured and summarized so easily through a comment.
In your example, one might instead rework the whole layout of the code so:
var urq; /* user request type */
... (we build list "l")
/* adjustments for user request */
{
if (urq=="rns") { /* case sensitive sort by reverse alphabetical order */ ... }
}
If you aren't reusing the algorithm, inline and summarize. And if you're writing comments about "what the upper layer expects", then you (or your team) spread and sliced up the meaning of the code and created that problem; that kind of comment isn't answering a "why," it's excusing something obtuse and unintuitive in the design, and is a code smell - the premise for that comment is hiding something far worse. If the sequence of events is intentionally meaningful and there's no danger of cut-and-paste modifications getting out of sync, it doesn't need to be split into tiny pieces. Big functions can be well-organized and easy to navigate just with scope blocks, comments, and a code-folding editor.

"Why" is a complicated thing. It's not really explainable through any one comment, unless the comment is an excuse like the example you gave. The whole program has to be written towards the purpose of explaining itself (e.g. "literate programming"), yet most code is grown organically in a way that even the original programmers can only partially explain, starting with a simple problem and simple data and then gradually expanding in complexity. Experience (and unfamiliar new problem spaces) eventually turns most programmers towards the organic model, even if they're aware of top-down architecture strategies. Ultimately, a "why" has to ask Big Questions about the overall purpose of the software; software is a form of automation, and the rationale for automating things has to be continuously interrogated.
[1]: https://github.com/rswier/c4/blob/master/c4.c#L492-L498
The advantage of a parser tool is, it makes the language constructs explicit in the code - here be assignment, here be for() etc. That has some value.
First thing to say is that "*++le" is the integer representing the operation to perform. This basically walks through the array of instructions, returning each integer in turn.
Starting at the beginning of the line, we have "printf" with a format string of "%8.4s". This means print out the first 4 characters of the string that I pass next (padded to 8 characters). There then follows a string containing all of the operation names, in numerical order, padded to 4 characters and separated by commas (so the start of each is 5 apart).

Finally, we do a lookup into this string (treating it as an array) at offset "*++le * 5", i.e. the integer representing the operation multiplied by 5 (5 being the number of characters between the start of each operation name). Doing this lookup gives us a char, but we actually wanted the pointer to this char (as we want printf to print out this char and the following 3 chars), so we take the address of this char (the & at the beginning of the whole expression).
It's concise, but not exactly self-documenting.
Does that make sense?
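If it helps, here is the same trick boiled down to a self-contained toy, with an invented three-opcode set (not c4's):

#include <stdio.h>

enum { LEA, IMM, JMP };   /* invented mini opcode set: 0, 1, 2 */

int main(void) {
    int op = IMM;
    /* each name is padded to 4 chars plus a comma, so name n starts at
       offset n * 5; %8.4s prints exactly 4 chars, right-justified in 8 */
    printf("%8.4s\n", &"LEA ,IMM ,JMP ,"[op * 5]);  /* prints "    IMM " */
    return 0;
}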
(I didn't downvote.)
One obsession that I am tired of is people posting "X awesome thing in 120 lines of JavaScript" or "Y in 4 functions". Just because a problem has been reduced to the smallest possible metric doesn't make the result good.
PS: Also you mention negatively connoted terms like "OOP", "Java", "thick layers of abstraction" and "deep chains of function calls", as if you've ascertained that I'm some enterprise developer who doesn't have any C experience and wouldn't know simplicity if it bit me in the ass.
If quite a few people are saying that it's unreadable, you don't just dismiss them as making "unfounded" statements.
> The demoscene doesn't exist because of a bunch of programmers trying to make the lives of every other programmer more difficult, it exists because people like the challenge.
I'm pretty sure that this doesn't need to be stated.
"Toy examples" are often the result of long & deep study and practice of a subject, creating something profound which casual observers are not entitled to instantly understand. In this case, it's a very clever compiler: everybody understands this summary, and if you want "a few contextual comments" beyond the source code itself then you know where to get enough information to learn what you need to understand this.
If you don't "get it", and don't want to "get it" on your own, it's not for you.
So, yes: either you're curious enough to dig into the code and find the relevant explanations somewhere else (the aforementioned Dragon Book and the like), or you won't get it, regardless of how comprehensive the comments are.
I'm against a big list of forward declarations, with tens of different variables, each with a very short name and a comment explaining what the name abbreviates. Just replace the names with the comments, with underscores for spaces, using find and replace. It's a two-minute job for the whole code base, and gives much more readable code almost everywhere.
I agree that functional languages may be a counter-case; but codebases in C and Python (in my experience) benefit greatly from well-named variables.
I didn't say that "verbose" variable names were mandatory everywhere - i and j have their place - but names which are at least pronounceable words are essential, especially if they appear in more places than a five line function.
This project is a special case, certainly, but toy compilers are nothing if not to learn from.
$ time ./c4 c4.c c4.c hello.c
hello, world
exit(0) cycle = 9
exit(0) cycle = 22614
exit(0) cycle = 9273075
real 0m0.067s
user 0m0.067s
sys 0m0.000s
$ time ./c4 c4.c c4.c c4.c hello.c
hello, world
exit(0) cycle = 9
exit(0) cycle = 22614
exit(0) cycle = 9273075
exit(0) cycle = 933197195
real 0m5.834s
user 0m5.827s
sys 0m0.000s
$ time ./c4 c4.c c4.c c4.c c4.c hello.c
Just kidding. :) Amazingly cool! Does anybody have a smaller self-hosting compiler & bytecode VM?

C89 I think has too much complexity for the amount of power it offers. lcc is a nice example: Norman Ramsey said he asked one of the authors what he learned in writing it, and got an answer like "Not to write a compiler in C." But anyway the book about it (https://sites.google.com/site/lccretargetablecompiler/) is very good. http://www.cs.princeton.edu/~appel/modern/ is my favorite general text.
I hope someone will write an explanation. I'm still working on one for my own (quite different) little compiler.
$ time ./c4 c4.c c4.c c4.c c4.c hello.c
hello, world
exit(0) cycle = 9
exit(0) cycle = 22614
exit(0) cycle = 9273075
exit(0) cycle = 933197195
exit(0) cycle = -1428163377
real 9m23.409s
user 9m22.673s
sys 0m0.020s
:)

With code like this, readability obviously isn't a high-priority consideration and is sometimes the exact opposite of the goal, with the impenetrability sometimes being part of the charm. This is Hacker News after all, and if your reaction to a piece of code that describes itself as "an exercise in minimalism" is to leave a snarky comment about the lack of documentation, you should probably check your news elsewhere.
If you have any interest in the subject, the initial comment "just enough features to allow self-compilation and a bit more" should give the purpose of the code away. If not, it ought to have been a clear sign of dragons.
This submission in particular has dubious practical use, and the description, "an exercise in minimalism", is telling of a sort of artistic intent. If you don't understand what it does, how it does it, or if you don't like it, it won't lower my opinion of you in any way, but its inclusion on this site is part of why I like to come here every now and then. I get to discuss subjects that relate to my work and hobbies, but I also get to look at weird alien code and think hard in unfamiliar terms. From what you are saying, I think you can relate.
Personally, I could glance over it and get the idea that it is a C compiler, but if you were to show me some code written with the latest JS MVC or FRP framework, don't hold your breath for me to tell you what it does. I can't say that I fully understand this, and that's why I enjoy the rich discussion the submission spawned here.