
Parse, don't validate (2019)

(lexi-lambda.github.io)
398 points by declanhaigh | 8 comments
bruce343434 ◴[] No.35053912[source]
Note that this basically requires your language to have ergonomic support for sum types, immutable "data classes", and pattern matching.

The point is to parse the input into a structure that always upholds the predicates you care about, so you don't end up programming defensively with ifs and asserts everywhere.

replies(12): >>35054046 #>>35054070 #>>35054386 #>>35054514 #>>35054901 #>>35054993 #>>35055124 #>>35055230 #>>35056047 #>>35057866 #>>35058185 #>>35059271 #
mtlynch ◴[] No.35054046[source]
I get a lot of value from this rule even without those language features.

I follow "Parse, Don't Validate" consistently in Go. For example, if I need to parse a JSON payload from an end-user for Foo, I define a struct called FooRequest, and I have exactly one function that creates a FooRequest instance, given a JSON stream.

Anywhere else in my application, if I have a FooRequest instance, I know that it's validated and well-formed because it had to have come from my FooRequest parsing function. I don't need sum types or any special language features beyond typing.

replies(1): >>35054157 #
jotaen ◴[] No.35054157[source]
My main take-away is the same, though I wonder whether “parse, don’t validate” is the right term for it. To me, “parse, don’t validate” somehow suggests that you should do parsing instead of validation, but the real point for me is that I still validate (as before), plus I “capture”/preserve validation success by means of a type.
replies(8): >>35054350 #>>35054377 #>>35054626 #>>35054751 #>>35055151 #>>35055232 #>>35055382 #>>35056979 #
1. qsort ◴[] No.35054350[source]
It's in the same sense as "whitelist, don't blacklist", or "for the love of god, it's 2023, do not escape SQL".

Don't define reasons why the input is invalid, instead have a target struct/object, and parse the input into that object.

replies(1): >>35055225 #
2. blincoln ◴[] No.35055225[source]
I like this explanation and approach, but how does it solve the first problem described in the article - the case where there's an array being processed that might be empty?

There are plenty of cases in real-world code where an array that's part of a struct or object may or may not contain any elements. If you're just parsing input into that, it seems like you'd either still end up doing an equivalent of checking whether the array is empty or not everywhere the array might be used later, even if that check is looking at an "array has elements" type flag in the struct/object, and so you're still maintaining a description of ways that the input may be invalid. But I'm not a world-class programmer, so maybe I'm missing something. Maybe you mean something like for branches of the code that require a non-empty array, you have a second struct/object and parser that's more strict and errors out if the array is empty?

replies(4): >>35055647 #>>35057974 #>>35058114 #>>35063412 #
3. strgcmc ◴[] No.35055647[source]
Remember, the author of the article constructed a scenario where the "main" function was expected to treat an empty "CONFIG_DIRS" input as an uncatchable IOError; in other words, an empty array was invalid/not allowed, per the rules of this program. Depending on the context in which you are operating, you may or may not have similar rules or requirements to follow.

Empty lists are actually generally not a big deal - they are just lists of size 0, and they support all the same operations as non-empty lists. The fact that a "head" function throws an error on an empty list is really just a specific form of the more general observation that any array throws an index-out-of-bounds exception when given an index that's... out of bounds. So any time you are dealing with arrays, you probably need to think about: "what happens if I try to index something that's out of bounds? Is that possible?"

In this particular contrived example, all that mattered was the head of the array. But what if you wanted to pick out the 3rd argument in a list of command line arguments, and instead the user only gave you 2 inputs? If 3 arguments are required, then throw an IOError as early as possible after failing to parse out 3 arguments; but once you pass the point of parsing the input into a valid object/struct/whatever, from that point forward you no longer care about checking whether the 3rd input is empty or not.
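A hedged sketch of that boundary check in Go (CLIArgs and its three field names are made up for this example; what matters is that the count is checked exactly once, at parse time):

```go
package main

import "fmt"

// CLIArgs holds the three required positional arguments. Once a value
// of this type exists, downstream code never re-checks the argument count.
type CLIArgs struct {
	Input, Output, Mode string
}

// ParseCLIArgs fails as early as possible when fewer than three
// arguments were supplied (argv excludes the program name).
func ParseCLIArgs(argv []string) (CLIArgs, error) {
	if len(argv) < 3 {
		return CLIArgs{}, fmt.Errorf("expected 3 arguments, got %d", len(argv))
	}
	return CLIArgs{Input: argv[0], Output: argv[1], Mode: argv[2]}, nil
}

func main() {
	args, err := ParseCLIArgs([]string{"in.txt", "out.txt", "fast"})
	fmt.Println(args, err)
}
```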

So again, it depends on your scenario. Actually, the more interesting variant of this issue (in OO languages at least) is probably handling nulls: empty lists are valid lists, but nulls are not lists, and usually require different logic (hence why NullPointerExceptions, aka NPEs, are such a common failure mode).

replies(1): >>35059448 #
4. ◴[] No.35057974[source]
5. bcrosby95 ◴[] No.35058114[source]
Depending upon language, and what you're using to hold the array, inheritance.

A 'NotEmpty a' is just a subclass of a potentially empty 'a'. You also get the desirable behavior, in this scenario, of automatic upcasting of a 'NotEmpty a' into a regular old 'a'.
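Go has no inheritance, but the same idea can be approximated with a generic wrapper and an explicit "upcast" method (NonEmpty, Head, and Slice are invented names for this sketch; Go 1.18+ generics assumed — in Haskell this role is played by Data.List.NonEmpty):

```go
package main

import (
	"errors"
	"fmt"
)

// NonEmpty wraps a slice guaranteed to hold at least one element.
// Go lacks subtyping, so the "upcast" back to a plain slice is an
// explicit (and trivial) method call rather than being automatic.
type NonEmpty[T any] struct {
	items []T
}

// NewNonEmpty is the only way to build a NonEmpty; it refuses an
// empty slice, so the invariant holds everywhere else.
func NewNonEmpty[T any](items []T) (NonEmpty[T], error) {
	if len(items) == 0 {
		return NonEmpty[T]{}, errors.New("slice must not be empty")
	}
	return NonEmpty[T]{items: items}, nil
}

// Head is total: it can never fail, because the wrapper is never empty.
func (n NonEmpty[T]) Head() T { return n.items[0] }

// Slice is the explicit "upcast" to an ordinary, possibly-empty slice.
func (n NonEmpty[T]) Slice() []T { return n.items }

func main() {
	n, err := NewNonEmpty([]int{7, 8, 9})
	fmt.Println(n.Head(), len(n.Slice()), err)
}
```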

replies(1): >>35060098 #
6. blincoln ◴[] No.35059448{3}[source]
I see what you're saying, but I'm still not understanding how it becomes a generalizable rule for real-world code without adding a lot of exceptions to the rule, or doing something over-engineered like parsing into an ever more specific set of customized structs/objects in different branches of the code.

Just to be clear, I actually really like the idea of parsing the input into a structure. I do the same thing in a lot of my code. I just don't see how it removes the need to also perform validation in many (maybe most) cases as soon as one gets beyond contrived examples.

The empty array example seems to be a can of worms. Maybe it's specific to the kinds of software that I've written, but in most of the cases I can think of, I wouldn't know if it was OK for a particular array within a structure to be empty until after the code had made some other determinations and branched based on them. And yet, like the example, once it got to the real handling for that case, it would be a problem if the array were empty. So the image in my mind is many layers of parsing that are much more complicated and confusing to read than validating the length of the array.

I still think it's a great idea for a lot of things, just that the "parse, don't validate" name seems really misleading. I might go with something like "parse first, validate where necessary".

7. secdeal ◴[] No.35060098{3}[source]
Not quite: 'a' is the type of the elements 'NonEmpty a' contains.

It is rather a subclass of some kind of 'Iterable a'.

8. lmm ◴[] No.35063412[source]
> There are plenty of cases in real-world code where an array that's part of a struct or object may or may not contain any elements. If you're just parsing input into that, it seems like you'd either still end up doing an equivalent of checking whether the array is empty or not everywhere the array might be used later, even if that check is looking at an "array has elements" type flag in the struct/object, and so you're still maintaining a description of ways that the input may be invalid.

You only check it if it makes a difference to validity. There's no scenario where you keep the array and a parallel flag: either an empty array is invalid, in which case you refuse to construct if it's empty, or an empty array is valid, in which case you don't even check. The same goes for checking whether it has an even number of elements, fewer than five elements, etc. - you don't keep the flag around; you refuse to construct your validated structure (even if that "validated structure" is actually just a marker wrapper around the raw array, if your type system isn't good enough to express the real constraint directly).
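That "marker wrapper" idea can be sketched in Go using the even-number-of-elements constraint as the example (EvenPairs, NewEvenPairs, and Pairs are invented names; the type system can't express the constraint itself, so the constructor is the only gate):

```go
package main

import "fmt"

// EvenPairs is a marker wrapper: its mere existence records that the
// wrapped slice passed the even-length check at construction time.
type EvenPairs[T any] struct {
	items []T
}

// NewEvenPairs refuses to construct the wrapper for odd-length input,
// so no downstream code ever re-checks the length's parity.
func NewEvenPairs[T any](items []T) (EvenPairs[T], error) {
	if len(items)%2 != 0 {
		return EvenPairs[T]{}, fmt.Errorf("need an even number of elements, got %d", len(items))
	}
	return EvenPairs[T]{items: items}, nil
}

// Pairs consumes the invariant: indexing i+1 is always in bounds here.
func (e EvenPairs[T]) Pairs() [][2]T {
	out := make([][2]T, 0, len(e.items)/2)
	for i := 0; i < len(e.items); i += 2 {
		out = append(out, [2]T{e.items[i], e.items[i+1]})
	}
	return out
}

func main() {
	e, err := NewEvenPairs([]int{1, 2, 3, 4})
	fmt.Println(e.Pairs(), err)
}
```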