But this is wrong. Programmers should be writing parsers all the time!
This project looks neat, I've never thought to use parser combinators for something other than left-to-right string/token stream parsing.
And I like how it uses TypeScript's metaprogramming to generate types from the parser code. I think that would be much harder (or impossible) in other languages, making the idiomatic design of a similar library very different.
Don't get me wrong, I actually love writing parsers. It's just not required all that often in my day-to-day work. 99% of the time when I need to write a parser myself it's for an Advent of Code problem; usually I just import whatever JSON or YAML parser is provided for the platform and go from there.
Isn’t writing code and using zod the same thing? The difference being who wrote the code.
Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.
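For what it's worth, the "parse, don't validate" use of zod looks roughly like this (a sketch from memory of zod v3's API, not the article's library):

import { z } from "zod";

// The schema is effectively a parser: unknown -> Config, or a typed error.
const Config = z.object({
  port: z.number().int().min(1).max(65535),
  host: z.string().default("localhost"),
});
type Config = z.infer<typeof Config>; // { port: number; host: string }

const result = Config.safeParse({ port: 3000 });
if (result.success) {
  // result.data is a Config; nothing downstream needs to re-check it.
  console.log(result.data.host, result.data.port);
} else {
  console.error(result.error.issues);
}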
I was recently thinking about how type safety and validation strategies are particularly thorny in languages where the typings are just annotations, e.g. the TypeScript/Zod or Python/Pydantic universes. Especially in IO cases where the data doesn't originate in the same type system.
In a language like Go (just an example, not endorsing), if you parse something into, say, a struct, you know that worst case you're getting that struct with all the fields set to zero, and you just have to handle the zero values. In TypeScript-likes you can get a totally different structure and run into all sorts of errors.
All that is to say, the runtime validation is always somewhere (perhaps in the library, as they often are?), and the feature here isn't no runtime validation but typed cli arguments. Which is cool and great.
https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va... (2019, using Haskell)
https://www.lelanthran.com/chap13/content.html (April 2025, using C)
Sometimes it's going down to machine code, or rolling your own hash table, or writing your own recursive-descent parser from first principles. But most of the time you don't have to reach that low, and things like parsing are but a minor detail in the grand scheme. The engineer should not spend time on building them, but should be able to competently choose a ready-made part.
I mean, creating your own bolts and nuts may be fun, but most of the time, if you want to build something, you just pick a few from an appropriate box, and this is exactly right.
In the field I work in, zero values are valid, and doing it in Go would be a nightmare.
>> const port = option("--port", integer());
I don't understand. Why is this a parser? Isn't it just a way of enforcing a type in a language that doesn't have types?
I was expecting something like a state machine that takes the command line text and parses it to validate the syntax and values.
> Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.
Yes, judgement is required to make depending on zod (or any library) worthwhile. This is not different in principle from trusting those same things hold for TypeScript, or Node, or V8, or the C++ compiler V8 was compiled with, or the x86_64 chip it's running on, or the laws of physics.
The problem I run into here is - how do you create good error messages when you do this? If the user has passed you input with multiple problems, how do you build a list of everything that's wrong with it if the parser crashes out halfway through?
In short, a great article.
Make the usage string be the specification!
A criminally underused library.
"options that depend on options" should not be a thing. Every option should be optional. Even if you have working code that can handle some complex situation, this doesn't make the situation any less unintuitive for the users.
If you need more complex relationships, consider using arguments as well. Top level, or under an option. Yes, they are not named, but since they are mandatory anyway, you are likely to remember their meaning (spaced repetition and all that). They can still be optional (if they come last). Sometimes an argument may need to have multiple parts, like user@host:port. You can still parse it instead of validating, if you want.
> mutually exclusive --json, --xml, --yaml.
Use something like -t TYPE instead, where TYPE can be one of json, xml, or yaml. (Make illegal states unrepresentable.)
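In TypeScript terms (a sketch, not any particular library's API), a closed union then carries that guarantee for you:

type OutputFormat = "json" | "xml" | "yaml";

// Parse the -t value once; downstream code can only ever see a legal format.
function parseFormat(raw: string): OutputFormat {
  if (raw === "json" || raw === "xml" || raw === "yaml") return raw;
  throw new Error(`-t must be one of json, xml, yaml (got "${raw}")`);
}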
> debug: optional(option("--debug")),
Again, I believe it's called "option" because it's meant to be optional already.
optional(optional(option("--common-sense")))
That might sound messy but to the author's point about parser combinators not being complicated, they really don't take much time to get used to, and they're quite simple if you wanted to build such a library yourself. There's not much code (and certainly no magic) going on under the hood.
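For instance, the core idea fits in a few lines of TypeScript (a toy sketch, not the internals of the library in the article):

// A parser consumes some of argv and either yields a value plus the rest, or fails.
type Parser<T> = (argv: string[]) => { value: T; rest: string[] } | null;

const option = (name: string): Parser<string> => (argv) =>
  argv[0] === name && argv.length > 1
    ? { value: argv[1], rest: argv.slice(2) }
    : null;

// Combinators build bigger parsers out of smaller ones without any magic.
const map = <A, B>(p: Parser<A>, f: (a: A) => B): Parser<B> => (argv) => {
  const r = p(argv);
  return r && { value: f(r.value), rest: r.rest };
};

const integer = (name: string): Parser<number> =>
  map(option(name), (s) => {
    const n = Number(s);
    if (!Number.isInteger(n)) throw new Error(`${name} expects an integer`);
    return n;
  });

// integer("--port")(["--port", "3000"]) -> { value: 3000, rest: [] }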
The advantage of that parsing approach:
It's reasonably declarative. This seems like the author's core point. Parser-combinator code largely looks like just writing out the object you want as a parse result, using your favorite combinator library as the building blocks, and everything automagically works, with amazing type-checking if your language has such features.
The disadvantages:
1. Like any parsing approach, you have to actually consider all the nuances of what you really want parsed (e.g., conditional rules around whitespace handling). It looks a little to me (just from the blog post, not having examined the inner workings yet) like this project side-stepped that by working with the `Stream` type as just the `argv` list, allowing you to say things like "parse the next blob as a string" without also having to encode whitespace and blob boundaries.
2. It's definitely slower (and more memory-intensive) than a hand-rolled parser, and usually also worse in that regard than other sorts of "auto-generated" parsing code.
For CLI arguments, especially if they picked argv as their base stream type, those disadvantages mostly don't exist. I could see it performing poorly for argv parsing for something like `cp` though (maybe not -- maybe something like `git cp`, which has more potential parse failures from delimiters like `--`?), which has both options and potentially ginormous lists of files; if you're not very careful in your argument specification then you might have exponential backtracking issues, and where that would be blatantly obvious in a hand-rolled parser it'll probably get swept under the rug with parser combinators.
TFA links to Alexis King’s Parse, Don’t Validate article, which explains this well. Did you not read it?
It's a genuine pleasure to use, and I use it often.
If you dig a little deeper into it, it does all the type and value validation, file validation, it does required and mutually exclusive args, it does subargs. And it lets you do special cases of just about anything.
And of course it does the "normal" stuff like short + long args, boolean args, args that are lists, default values, and help strings.
Then instead of validating a loose type & still using the loose type, you're parsing it from a loose type into a strict type.
The key point is you never need to look at a loose type and think "I don't need to check this is valid, because it was checked before"; the type system tracks that for you.
What would you do for "top level option, which can be modified in two other ways"?
(--option | --option-with-flag1 | --option-with-flag2 | --option-with-flag1-and-flag2)
would solve invalid representation, but is unwieldy.
Something that results in the usage string
[--option [--flag1 --flag2]]
doesn't seem so bad at that point.

In Python this was a motivating factor for letting functions demand their arguments be passed as named keywords. Something like send("foo", "bar") is easier to understand and call correctly when you have to say send(channel="foo", message="bar").
Whether you do that with Zod or manually or whatever isn't important, the important thing is having a preprocessing step that transforms the data and doesn't just validate it.
The result is that you often still see this kind of defensive programming, where argparse ensures that an invariant holds, but other functions still check the same invariant later on because they might have been called a different way or just because the developer isn't sure whether everything was checked where they are in the program.
What I think the author is looking for is a combination of argparse and Pydantic, such that when you define a parser using argparse, it automatically creates the relevant Pydantic classes that define the type of the parsed arguments.
--option flag1,flag2
(Maybe with another separator, as long as it doesn't need to be escaped.)

Another possibility is to make the main option an argument, like the subcommands in git, systemctl, and others:
command option --flag1 --flag2
This depends on the specifics, though.

So maybe the reason why they were able to reduce the code is because they lost the ability to do good error reporting.
He even gives the example of zod, which is a validation library he defines to be a parser.
What he wants to say : "I don't want to write my own validation in a CLI, give me a good API already that first validates and then converts the inputs into my declared schema"
"Invalid data? The parser rejects it. Done."
"That validation logic that used to be 30% of my CLI code? Gone."
"Mutually exclusive groups? Sure. Context-dependent options? Why not."
For me this really piled on at the end of the blog post. But maybe it's just personal style too.
For parsing specifically, there's literature on error recovery to try to make progress past the error.
That said, I fully agree with the article content itself. It basically just boils down to:
When you create a program, eventually you'll need to process input data and check whether it is valid or not. In a C-like language, you have two options:

void validate(struct Data d);

or

struct ValidatedData;
ValidatedData validate(struct Data d);
"Parse, don't validate" is just trying to say don't do `void validate(struct Data d)` (procedure with `void`), but do `ValidatedData validate(struct Data d)` (function returning `ValidatedData`) instead.It doesn't mean you need to explicitly create or name everything as a "parser". It also doesn't mean "don't validate" either; in `ValidatedData validate(struct Data d)` you'll eventually have "validation" logic similar to the procedure `void` counterpart.
Specifically, the article tries to teach folks to utilize the type system to their advantage. Rather than praying to never forget invoking `validate(d)` on every single call site, make the type signature only accept `ValidatedData` type so the compiler will complain loudly if future maintainers try to shove `Data` type to it. This strategy offloads the mental burden of remembering things from the dev to the compiler.
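In TypeScript the same trick looks roughly like this (a sketch; the names are made up):

type Data = { email: string };

class ValidatedData {
  // Private constructor: the only way to get a ValidatedData is through validate().
  private constructor(readonly email: string) {}

  static validate(d: Data): ValidatedData | null {
    return d.email.includes("@") ? new ValidatedData(d.email) : null;
  }
}

// Downstream code demands the proof instead of re-checking the invariant.
function sendWelcomeMail(to: ValidatedData): void { /* ... */ }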
I'm not exactly sure why the "Parse, don't validate" catchphrase keeps getting reused in other language communities. It's not clear to the non-FP community what the distinction between "parse" and "validate" is, let alone what a "parser combinator" is. Yet somehow other articles keep reusing this same catchphrase.
For instance if validating parameter values requires multiple trips to a DB or other external system, weaving the calls in the logic can spare duplicating these round trips. Light "surface" validation can still be applied, but that's not what we're talking about here I think.
I still wouldn't need to check the inputs again because I know it's already been processed, even if the type system can't help me.
The library in the original post is essentially a Javascript library, but it's one designed so that if you use it with Typescript, it provides that type safety.
Even better, that conversion from interface type to internal type should ideally happen at one explicit point in the program - a function call which rejects all invalid inputs and returns a type that enforces the invariants we're interested in. That way, we have a clean boundary point between the outside world and the inside one.
This isn't a performance issue at all, it's closer to the "imperative shell, functional core" ideas about structuring your application and data.
Not quite that, but https://typer.tiangolo.com/ is fully type driven.
The point is you don’t check that your string only contains valid characters and then continue passing that string through your system. You parse your string into a narrower type, and none of the rest of your system needs to be programmed defensively.
To describe this advice as “vacuous” says more about you than it does about the author.
I'm hung up on the type system because it's a great way to convey the validity of the data; it follows the data around as it flows through your program.
I don't (yet) use TypeScript, but jsdoc and linting give me enough type checking for my needs.
Embedding a second parse step that the first parser doesn't deal with is done, but it's a rough compromise.
It feels like the difficulty in dealing with
[--option [--flag1 --flag2]]
is more to do with its expression in the language parsed into than with CLI elegance.

":3000" -> use port 3000 with a default host.
"some-host" -> use host with a default port.
"some-host:3000" -> you guess it.
It also allows to extend it to other sources/destinations like unix domain sockets and other stuff without cluttering your CLI options.
Also please consider to use DSN or URI to define database configurations. Host, port, dbname, credentials as separate options or environment variables are quite painful to use.
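A sketch of that parse in TypeScript (helper name and defaults made up for illustration):

type Endpoint = { host: string; port: number };

// Accepts ":3000", "some-host", or "some-host:3000" and fills in the defaults.
function parseEndpoint(raw: string, defaults: Endpoint): Endpoint {
  const [hostPart, portPart] = raw.split(":");
  const host = hostPart !== "" ? hostPart : defaults.host;
  const port = portPart !== undefined ? Number(portPart) : defaults.port;
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port in "${raw}"`);
  }
  return { host, port };
}

// parseEndpoint(":3000", { host: "localhost", port: 5432 }) -> { host: "localhost", port: 3000 }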
You need a boundary to convert nice opts into nice types. Like pydantic models could take argparse namespace and convert it to something manageable.
Makes sense, I think a lot of developers would want to complect this problem with their runtime type system of choice without considering the set of downsides for the users
You either get the correctly parsed data or you get an error array. The incorrect input was never represented in code, vs a 0 value being returned or even worse random gibberish.
A trivial example: 1/0 should return DivisionByZero not 0 or infinity or NaN or whatever else. You can then decide in your UI whether that is a case you want to handle as an error or as an edge case but the parser knows that is not possible to represent.
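Something like this, as a sketch:

type DivisionResult =
  | { ok: true; value: number }
  | { ok: false; error: "DivisionByZero" };

function divide(a: number, b: number): DivisionResult {
  return b === 0
    ? { ok: false, error: "DivisionByZero" }
    : { ok: true, value: a / b };
}

// The caller must decide what DivisionByZero means for the UI; a bogus
// 0 / Infinity / NaN never flows into the rest of the program.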
Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.
One of the things I love about clap is that you can configure it to automatically spit out --help info, and you can even get it to generate shell autocompletions for you!
I think there are some other libraries that are challenging it now (fewer dependencies or something?) but clap sets the standard to beat.
My most comfortable tool is Java, but I'm not going to persuade most of the HN crowd to install a JVM unless the software I'm offering is unbearably compelling.
An internal tool at work? Yeah, Java's going to be an easy sell.
I don't think OP necessarily meant it as a political statement.
Although in practice, I find clap's approach works pretty well: define an object that represents the parsed arguments as you want them, with annotations for details that can't be represented in the type system, and then derive a parser from that. Because Rust has ADTs and other tools for building meaningful types, and because the derive process can do so much, this creates an arguments object that you can quite easily pass to a function which runs the command.
In fact, I think something like this already exists. I just can't recollect the project.
A CLI and an API should indeed occupy the same layer of a program architecture, namely they are entry points that live on the periphery. But really all you should be doing there is lifting the raw byte stream you are getting from users to something higher level you can use to call your internals.
So "CLI validation" should be limited to just "I need an int here, one of these strings here, optionally" etc. Stuff like "is this port out of range" or "if you give me this I need this too" should be handled by your internals by e.g. throwing an exception. Your CLI can then display that as an error message in a nice way.
$ ldd /usr/bin/rg
linux-vdso.so.1 (0x00007fff45dd7000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000070764e7b1000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000070764e6ca000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000070764de00000)
/lib64/ld-linux-x86-64.so.2 (0x000070764e7e6000)
The worst is compiling a C program with a compiler that uses a more recent libc than is installed on the installation host.

Sure, but probably at the cost of leaving everything in a horribly inconsistent state when you error out partway through. Which is almost always not worth it.
This approach IMNSHO is much cleaner than the intrication of cmdline parser libraries with application logic and application-domain-related types.
Then one can specify validation logic declaratively, and apply it generically.
This has the added benefit - for a compiled rather than an interpreted library - of not having to recompile the CLI parsing library for each different app and each different definition of options.
But that _is_ parsing, at least in the sense of "parse, don't validate". It's about turning inputs into real objects representing the domain code that you're about to be working with. The result is still going to be a DTO of some description, but it will be a DTO with guaranteed invariants that are useful to you. For example, a post request shouldn't be parsed into a user object just because it shares a lot of fields in common with a user. Instead it should become a DTO with the invariants fulfilled that makes sense for a DTO. Some of those invariants are simple (like "dates should be valid" -> the DTO contains Date objects not strings), and some will be more complex like the "if the server is active, then the port also needs to be provided" restriction from the article.
This is one of the key ideas behind Zod - it isn't just trying to validate whether an object matches a certain schema, but it converts the result into a type that accurately expresses the invariants that must be in place if the object is valid.
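Roughly (zod v3 from memory; the shapes are made up):

import { z } from "zod";

// Dates arrive as strings but leave as Date objects; "active implies port"
// is encoded as a discriminated union instead of an after-the-fact check.
const Server = z.discriminatedUnion("active", [
  z.object({ active: z.literal(false) }),
  z.object({ active: z.literal(true), port: z.number().int().min(1).max(65535) }),
]);

const PostRequest = z.object({
  createdAt: z.string().transform((s) => new Date(s)),
  server: Server,
});

type PostRequest = z.infer<typeof PostRequest>;
// { createdAt: Date; server: { active: false } | { active: true; port: number } }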
zod also allows invalid state as input, then attempts to shoehorn them into the desired schema, which still runs these validations the author was complaining about - just not in the code he wrote.
This clause is abstracting away a ton of work. If you want to compile the latest LLVM and get 'portable C++26', you need to bootstrap everything, including CMake from that old-hat libc on some ancient distro like CentOS 6 or Ubuntu 12.04.
I've said it before, I'll say it again: the Linux kernel may maintain ABI compatibility, but the fact that GNU libc breaks it anyway makes it a moot point. It is a pain to target older Linux with a newer distro, which is by far the most common development use case.
"Well, I already know this is a valid uuid, so I don't really need to worry about sql injection at this point."
Sure, this is a dumb thing to do in any case, but I've seen this exact thing happen.
Type safety isn't safety.
Are you stuck in write-only mode or something? How does this make any sense to you?
$ wget 'https://github.com/BurntSushi/ripgrep/releases/download/14.1.1/ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz'
$ tar -xvf 'ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz'
$ ldd ripgrep-14.1.1-x86_64-unknown-linux-musl/rg
ldd (0x7f1dcb927000)
$ file ripgrep-14.1.1-x86_64-unknown-linux-musl/rg
ripgrep-14.1.1-x86_64-unknown-linux-musl/rg: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), static-pie linked, stripped
The quote here — which I suspect is a straw man — is such a weird non sequitur. What would logically follow from “I already know this is a valid UUID” is “so I don’t need to worry about this not being a UUID at this point”.
Pretty much agreed - once any sort of complicated logic enters a shell script it's probably better off written in C/Rust/Go or something akin to that.
Write your code such that you can load it onto (for example) the oldest supported Ubuntu and compile cleanly and you’ll have virtually zero problems. Again, I know that if your goal is to truly ship something written in e.g. C++26 portably then it’s a huge pain. But as someone who writes plain C and very much enjoys it, I think it’s better to skip this class of problem.
some_cli <some args> --some-option --no-some-option
Before parsing, the argument array contains both the flags to enable and disable the option. Validation would either throw an error or accept it as either enabled or disabled. But importantly, it wouldn't change the arguments. If the assumption is that the last option overwrites anything before it then the cli command is valid with the option disabled.
And now, correct behaviour relies on all the code using that option to always make the same assumption.
Parsing, on the other hand, would create a new config where `option` is an enum - either enabled, disabled, or not given. No confusion about multiple flags or anything. It provides a single view for the rest of the program of what the input config was.
Whether that parsing is done by a third-party library or first-party code, declaratively or imperatively, is beside the point.
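Roughly, as a sketch (assuming the last-flag-wins convention):

type Toggle = "enabled" | "disabled" | "unset";

// Fold a --foo / --no-foo pair into a single tri-state value, last one wins.
function parseToggle(argv: string[], name: string): Toggle {
  let state: Toggle = "unset";
  for (const arg of argv) {
    if (arg === `--${name}`) state = "enabled";
    else if (arg === `--no-${name}`) state = "disabled";
  }
  return state;
}

// parseToggle(["--some-option", "--no-some-option"], "some-option") -> "disabled"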
I use Effect CLI https://github.com/Effect-TS/effect/tree/main/packages/cli for the same reasons. It has the advantage of fitting within the ecosystem. For example, I can reuse existing schemas.
Well that's confused me. I write a lot of scripts in BASH specifically to make it easy to move them to different architectures etc. and not require a custom runtime. Interpreted scripts also have the advantage that they're human readable/editable.
Even in languages like Haskell, "safety" is an illusion. You might create a NumberGreaterThanFive type with smart constructors but that doesn't stop another dev from exporting and abusing the plain constructor somewhere else.
For the most part it's fine to assume the names of types are accurate, but for safety critical operations it absolutely makes sense to revalidate inputs.
That seems like a pretty unfair constraint. Yes, you can deliberately circumvent safeguards and you can deliberately write bad code. That doesn't mean those language features are bad.
I'll keep my templates, smart pointers, concepts, RAII, and now reflection, thanks. C and its macros are good for compile times but nothing much else. Programming in C feels like banging rocks together.
Go programs compile to native executables, but they're still rather slow to start, especially if you just want to do --help.
Zod does take in invalid state as input, but that is what a parser does. In this case, the parser is `any -> T` as opposed to `string -> T`, but that's still a parsing operation.
And don't write programs with languages that depend on CMake and random tarballs to build and/or shared libraries to run.
I usually have a lot fewer issues with dragging a runtime around than with fighting builds.
This is only a problem when the program USES a symbol that was only introduced in the newer libc. In other words, when the program made a choice to deliberately need that newer symbol.
$ pkg install git rust
$ git clone https://github.com/BurntSushi/ripgrep.git
$ cd ripgrep
$ RUSTFLAGS='-C target-feature=+crt-static' cargo build --release
$ ldd target/release/rg
ldd: target/release/rg: not a dynamic ELF executable
$ file target/release/rg
target/release/rg: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, for FreeBSD 14.3, FreeBSD-style, with debug_info, not stripped
> What is ValidatedData? A subset of the Data that is valid?
Usually, but not necessarily. `validate()` might add some additional information too, for example: `validationTime`.

More often than not, in a real case of applying algebraic data types & "Parse, don't validate", it's something like `Option<ValidatedData>` or `Result<ValidatedData, PossibleValidationError>`, borrowing Rust's names. `Option` & `Result` expand the possible return values that the function can return to cover the possibility of failure in the validation process, but that is independent from the possible values that `ValidatedData` itself can contain.
> The way I see it is you use ‘validate’ when the format of the data you are validating is the exact same format you are gonna be working with right after, meaning the return type doesn’t matter.
The main point of "Parse, don't validate" is to distinguish between "machine-level data representation" vs "possible set of values" of a type and utilize this "possible set of values" property.Your "the exact same format" point is correct; oftentimes, the underlying data representation of a type is exactly the same between pre- & post-validation. But more often than not "possible set of values" of `ValidatedData` is a subset of `Data`. These 2 different "possible set of values" are given their own names in the form of a type `Data` and `ValidatedData`.
This distinction is actually very handy because types can be checked automatically by the (nominal) type system. If you make the `ValidatedData` constructor private & the only way to produce one is the function `ValidatedData validate(Data)`, then in any part of the codebase, there's no way any `ValidatedData` instance is malformed (assuming `validate` doesn't have bugs).
Extra note: I forgot to mention that the "Parse, don't validate" article implicitly assumes a nominal type system, where 2 objects with an equivalent "data representation" don't necessarily have the same type. This differs from Typescript's structural type system, where as long as the "data representation" is the same, both objects are considered to have the same type.
Typescript will happily accept something like this because of structural typing:
type T1 = { x: string };
type T2 = { x: string };
function f(arg: T1): void { /* ... */ }
const t2: T2 = { x: "foo" };
f(t2); // OK: T2 is structurally identical to T1
While nominal type systems like Haskell or Java will reject such expressions:
class T1 { String x; }
class T2 { String x; }
void f(T1 arg) { /* ... */ }
// f(new T2()); // Compile error: type mismatch
Because of this, the idea of using a type as a "possible set of values" probably felt unintuitive to Typescript folks, as everything is structurally typed and a different type felt synonymous with a different "underlying data representation" there.

You can simulate this "same structure, but different meaning" concept from nominal type systems in Typescript with a somewhat hacky workaround using Symbol.
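The usual shape of that workaround (a sketch):

// The brand exists only at the type level; at runtime this is still a plain string.
declare const validated: unique symbol;
type ValidatedEmail = string & { [validated]: true };

function parseEmail(raw: string): ValidatedEmail | null {
  return raw.includes("@") ? (raw as ValidatedEmail) : null;
}

// A function taking ValidatedEmail will reject a plain string at compile time,
// even though both are structurally "just strings".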
> The return type implies transformation – a write operation per se, whereas validation is always a read operation only
Why does the return type need to imply transformation and why is "validation" here always read-only? No-op function will return the exact same value you give it (in other words, identity transformation), and Java & Javascript procedures never guarantee a read-only operation.So, having used this thread to rubber-duck about how the principle of "parse-don't-validate" works with the principle of "provide good error messages", I'm arriving at these rules, which are really more about encapsulation than parsing:
1. Encapsulate both parsing and validation in a single function: `parse(RawInput) -> Result<ValidDomainObject,ListOfErrors>`
2. Ideally, `parse` is implemented by a robust parsing/validation library for the type of input that you're dealing with. It will create some intermediate representations that you need not concern yourself with.
3. If there isn't a good parser library for your use case, your implementation of `parse` will necessarily contain intermediate representations of potentially illegal state. This is both fine and unavoidable, just don't let them leak out of your parser.
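A sketch of rule 1's shape, accumulating errors instead of bailing on the first one (names invented):

type ParseResult<T> = { ok: true; value: T } | { ok: false; errors: string[] };

type ServerConfig = { host: string; port: number };

function parseServerConfig(raw: Record<string, string>): ParseResult<ServerConfig> {
  const errors: string[] = [];
  const host = raw.host ?? "";
  if (host === "") errors.push("host is required");
  const port = Number(raw.port);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push(`port must be an integer between 1 and 65535 (got "${raw.port}")`);
  }
  // Either a fully valid domain object or the complete list of problems - never both.
  return errors.length === 0 ? { ok: true, value: { host, port } } : { ok: false, errors };
}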
Right - and one thing that keeps coming up for me is that, if you want to maintain complex invariants, it's quite natural to express them in terms of the domain object itself (or maybe, ugh, a DTO with the same fields), rather than in terms of input constraints.
Why CLIs in particular? Because they usually are smaller tools. For a big, important tool, you might be willing to jump through more hoops (installing the right runtime), but for a smaller, less important tool, it's just not worth it.
This function parses a number in 6502 asm. So `255` in dec or `$ff` in hex: https://github.com/geon/dumbasm/blob/main/src/parsers/parseN...
I looked at several typescript libraries but they all felt off. Writing my own at least ensured I know how it works.
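For a flavour of the shape (this is not the linked code, just an illustrative sketch):

// `$ff` is hex, `255` is decimal; return the value plus whatever input remains.
type NumResult = { value: number; rest: string } | null;

function parseNumber(input: string): NumResult {
  const hex = /^\$([0-9a-fA-F]+)/.exec(input);
  if (hex) return { value: parseInt(hex[1], 16), rest: input.slice(hex[0].length) };
  const dec = /^[0-9]+/.exec(input);
  if (dec) return { value: parseInt(dec[0], 10), rest: input.slice(dec[0].length) };
  return null;
}

// parseNumber("$ff,x") -> { value: 255, rest: ",x" }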
args:
username str # Required string
password str? # Optional string
token str? # Optional auth token
age int # Required integer
status str # Required string
username requires password // If username is provided, password must also be provided
token excludes password // Token and password cannot be used together
age range [18, 99] // Inclusive range from 18 to 99
status enum ["active", "inactive", "pending"]
Rad will handle all the validation for you; you can just write the rest of your script assuming the constraints you declared are met.

C feels a little like survival mode in Minecraft; you have a set of very simple abstractions, a relatively simple language, with which one can build the world (and in many cases, we have).
C++ feels like a complex city builder, with lots of tools, designs, and paradigms available, but also allows one to screw up in bigger ways.
The difference is (a) where and how validation happens, and (b) the type of the final result.
A parser is a function producing structured values - values of some type, usually different from the input type. In contrast, a validator is a predicate that only checks constraints on existing values.
For example, a parser can parse an email address into a variable of type EmailAddress. If the parser succeeds at doing that, assuming you're using a language with a decent type system, you now have a variable which is statically guaranteed to be an email address - not a string which you have to trust has passed validation at some point in the past.
This is part of the "Make illegal states unrepresentable" approach which allows for static debugging - debugging your code at compile time. It's a very powerful way to produce reliable systems with robust, statically proven guarantees.
But as Alexis King (who coined the phrase "Parse, don't validate") wrote, "Unless you already know what type-driven design is, my catchy slogan probably doesn’t mean all that much to you."
I'm just saying that TypeScript and jsdoc don't actually do any runtime enforcement. It's important that the library does that part, with or without types.