
Parse, don't validate (2019)

(lexi-lambda.github.io)
398 points by declanhaigh | 28 comments
1. Octokiddie ◴[] No.35055969[source]
I like how the author boils the idea down into a comparison between two approaches to a simple task: getting the first element of a list. The two alternatives are parseNonEmpty and validateNonEmpty. From the article:

> The difference lies entirely in the return type: validateNonEmpty always returns (), the type that contains no information, but parseNonEmpty returns NonEmpty a, a refinement of the input type that preserves the knowledge gained in the type system. Both of these functions check the same thing, but parseNonEmpty gives the caller access to the information it learned, while validateNonEmpty just throws it away.

This might not seem like much of a distinction, but it has far-reaching implications downstream:

> These two functions elegantly illustrate two different perspectives on the role of a static type system: validateNonEmpty obeys the typechecker well enough, but only parseNonEmpty takes full advantage of it. If you see why parseNonEmpty is preferable, you understand what I mean by the mantra “parse, don’t validate.”

parseNonEmpty is better because after a caller gets a NonEmpty it never has to check the boundary condition of empty again. The first element will always be available, and this is enforced by the compiler. Not only that, but functions the caller later calls never need to worry about the boundary condition, either.

The entire concern over the first element of an empty list (and handling the runtime errors that result from failing to meet the boundary condition) disappears for the developer.
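
For reference, the two functions from the article look roughly like this (abridged from the post):

    import Control.Exception (throwIO)
    import Data.List.NonEmpty (NonEmpty (..))

    validateNonEmpty :: [a] -> IO ()
    validateNonEmpty (_:_) = pure ()
    validateNonEmpty []    = throwIO (userError "list cannot be empty")

    parseNonEmpty :: [a] -> IO (NonEmpty a)
    parseNonEmpty (x:xs) = pure (x :| xs)
    parseNonEmpty []     = throwIO (userError "list cannot be empty")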

replies(3): >>35056624 #>>35056955 #>>35058253 #
2. waynesonfire ◴[] No.35056624[source]
Does this scale? What if you have 10 other conditions?
replies(6): >>35056897 #>>35056901 #>>35057510 #>>35057732 #>>35059327 #>>35061385 #
3. ElevenLathe ◴[] No.35056897[source]
I think what you're getting at is that it seems ponderous to have types named things like NonEmptyListWhereTheThirdElementIsTheIntegerFourAndTheOtherElementsAreStringsOfLengthSixOrUnder. The answer is that you shouldn't do that; instead, name it something in the problem domain (of whatever the program is about), like WidgetDescription or whatever.
replies(2): >>35058402 #>>35062502 #
4. dangets ◴[] No.35056901[source]
I would think stacking 10 generic conditions wouldn't scale if you are trying to mix and match arbitrarily. If you are trying to mix NonEmpty and AllEven and AllGreaterThan100 for the List example, then you would get the combinatorial explosion of types.

In practice, I find it is usually something like an 'UnvalidatedCustomerInfo' being parsed into a 'CustomerInfo' object, where you can validate all of the fields at the same time (phone number, age, non-null name, etc.). Once you have parsed it into the internal 'CustomerInfo' type, you don't need to keep re-validating the expected invariants everywhere this type is used. There was a good video on this that I wish I could find again, where the presenter gave the example of using String for a telephoneNumber field instead of a dedicated type: any code that used the telephoneNumber field would have to re-parse and re-validate it "to be safe" that it was in the expected format.

The topic of untrusted external types versus trusted internal types is also explained in the book `Domain Modeling Made Functional`, which I highly recommend.
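
A minimal Haskell sketch of that shape (all names here are hypothetical):

    import Data.Text (Text)
    import qualified Data.Text as T

    -- The constructor would live behind a module boundary, so a
    -- PhoneNumber can only ever be obtained by parsing.
    newtype PhoneNumber = PhoneNumber Text

    data UnvalidatedCustomerInfo = UnvalidatedCustomerInfo
      { rawName  :: Text
      , rawPhone :: Text
      }

    data CustomerInfo = CustomerInfo
      { name  :: Text
      , phone :: PhoneNumber
      }

    parsePhone :: Text -> Either String PhoneNumber
    parsePhone t
      | not (T.null t) && T.all (`elem` "0123456789+- ") t = Right (PhoneNumber t)
      | otherwise = Left "malformed phone number"

    -- Check every field once, at the boundary; downstream code that
    -- receives a CustomerInfo never re-validates.
    parseCustomerInfo :: UnvalidatedCustomerInfo -> Either String CustomerInfo
    parseCustomerInfo u = do
      n <- if T.null (rawName u) then Left "empty name" else Right (rawName u)
      p <- parsePhone (rawPhone u)
      pure (CustomerInfo n p)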

replies(4): >>35057468 #>>35058092 #>>35059831 #>>35063117 #
5. ◴[] No.35056955[source]
6. hansvm ◴[] No.35057468{3}[source]
In Zig at least, the types-as-values framework means it'd be pretty easy to support arbitrary mixing and matching. If nobody beats me to it, I'll make a library to remove the boilerplate for that idea this weekend.
replies(1): >>35105217 #
7. crvdgc ◴[] No.35057510[source]
At least in Haskell, this can be expressed with type classes. For each condition, you can create a (possibly empty) type class that guarantees the condition is met. Then the type-class constraints at the call site will be checked at compile time.
replies(1): >>35062416 #
8. layer8 ◴[] No.35057732[source]
The code size of the “parse” version of a program is at worst linear in the code size of the corresponding “validate” version, so I’d say yes.
9. JamesSwift ◴[] No.35058092{3}[source]
You would also run into a form of "type erasure" in most cases and would need a combinatorial explosion of methods unless the type system was able to 'decorate' the generic parameter with additional information.

e.g. imagine a `parseNonEmpty` and a `parseAllEven` method. Both take a list and return a `NonEmpty` or an `AllEven` type, respectively. If I call `parseNonEmpty` and get a `NonEmpty`, then pass it to `parseAllEven`, I have now erased the `NonEmpty`, as I'm left with an `AllEven`. I would need a `parseAllEvenFromNonEmpty` to take a `NonEmpty` and return an `AllEven & NonEmpty`.
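
A minimal Haskell sketch of that erasure (the wrapper types are hypothetical):

    newtype NonEmptyL a = NonEmptyL [a]
    newtype AllEven     = AllEven [Int]

    parseNonEmptyL :: [a] -> Maybe (NonEmptyL a)
    parseNonEmptyL [] = Nothing
    parseNonEmptyL xs = Just (NonEmptyL xs)

    parseAllEven :: [Int] -> Maybe AllEven
    parseAllEven xs = if all even xs then Just (AllEven xs) else Nothing

    -- Feeding a NonEmptyL through parseAllEven loses the non-emptiness:
    -- the result is only an AllEven, so each combination of guarantees
    -- needs its own wrapper type and its own parser.
    newtype NonEmptyAllEven = NonEmptyAllEven [Int]

    parseAllEvenFromNonEmpty :: NonEmptyL Int -> Maybe NonEmptyAllEven
    parseAllEvenFromNonEmpty (NonEmptyL xs) =
      if all even xs then Just (NonEmptyAllEven xs) else Nothing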

10. epolanski ◴[] No.35058253[source]
Also, parsing composes; validating doesn't.

I once had to implement a feature on a real estate website.

For a given location I would get a list of stats (demographics, cost per square meter, average salary in the area, etc.). Some of those stats themselves contained lists.

At the beginning I modeled everything with arrays in React. This led to passing down lists and having to check at multiple steps whether they were non-empty and handle that case.

Then I modeled everything with a NonEmptyArray guard in TypeScript:

```
interface NonEmptyArray<A> extends Array<A> {
  0: A // tells the type system that index 0 exists and is of type A
}

function isNonEmpty<A>(as: A[]): as is NonEmptyArray<A> {
  return as.length > 0
}
```

Then, after receiving the response, I could parse those lists into NonEmptyArrays and remove all of the emptiness checks inside the React components; handling the fact that some of these lists could be empty trickled up to the outermost component, and everything became very clean and simple to understand/maintain.

11. tialaramex ◴[] No.35058402{3}[source]
And naming is actually a valuable activity. Knowing that this is not merely a NonEmptyListWhereTheThirdElementIsTheIntegerFourAndTheOtherElementsAreStringsOfLengthSixOrUnder but actually a WidgetDescription is a genuine insight.

Deciding this thing is specifically a WidgetDescription, not a Widget or a WidgetLabel, or a WidgetAssociatedText and definitely not a ThingyDescription, can help both users and other developers produce a mental model of what's going on that results in a better experience for everyone.

12. dwohnitmok ◴[] No.35059327[source]
This is where dependent types would shine.

  function processUserInput(input: String, requirement: InputReqs[input]): Unit = ...

  type InputReqs[input] = {
    // We'll say that a Proposition takes Boolean expressions and turns them into types
    notTooLong: Proposition[length(input) < 128],
    authorIsNotBlocked: AuthorIsNotBlocked[input],
    sanitized: Sanitized[input],
    ...
  }
where you might have the following functions (which are all examples of "parse, don't validate"):

  function checkIfAuthorIsBlocked(author: String, input: String): Maybe[AuthorIsNotBlocked[input]] = ...

  // Create a pair that contains both the sanitized String and a tag that it has been sanitized
  function sanitizeString(input: String): (output: String, Sanitized[output]) = ...
where, just from the types alone, I know that e.g. length checking must occur after sanitization (because sanitizeString generates a new `output` that is distinct from `input`), and I don't have to write down in docs somewhere that checking lengths before sanitization might cause a bug because sanitization could change the length of the input.

Note that this is also strictly stronger than a simple precondition/postcondition system or some sort of assertion system because properties of the input that we care about may not be observable at runtime/from the input alone (e.g. AuthorIsNotBlocked can't be asserted based only on input: you'd have to change the runtime representation of input to include that information).

13. lexi-lambda ◴[] No.35059831{3}[source]
> If you are trying to mix NonEmpty and AllEven and AllGreaterThan100 for the List example, then you would get the combinatorial explosion of types.

This is overthinking it. Usually, when people are not used to doing constructive data modeling, they get caught up on this idea that they need to have a datatype that represents their data in some canonical representation. If you need a type that represents an even number, then clearly you must define a type that is an ordinary integer, but rules out all odd numbers, right?

Except you don’t have to do that! If you need a number to always be even (for some reason), that suggests you are storing the wrong thing. Instead, store half that number (e.g. store a radius instead of a diameter). Now all integers are legal values, and you don’t need a separate type. Similarly, if you want to store an even number greater than 100, then use a natural number type (i.e. a type that only allows non-negative integers; Haskell calls this type `Natural`) and store half of (that number minus 102). This means that, for example, 0 represents 102, 1 represents 104, 2 represents 106, 3 represents 108, etc.

If you think this way, then there is no need to introduce a million new types for every little concept. You’re just distilling out the information you actually need. Of course, if this turns out to be a really important concept in your domain, then you can always add a wrapper type to make the distinction more formal:

    newtype EvenGreaterThan100 = EvenGreaterThan100 Natural

    evenGreaterThan100ToInteger :: EvenGreaterThan100 -> Integer
    evenGreaterThan100ToInteger (EvenGreaterThan100 n) = (toInteger n * 2) + 102

    integerToEvenGreaterThan100 :: Integer -> Maybe EvenGreaterThan100
    integerToEvenGreaterThan100 n
      | n <= 100  = Nothing
      | otherwise = case (n - 102) `quotRem` 2 of
          (q, 0) -> Just (EvenGreaterThan100 (fromInteger q))
          (_, _) -> Nothing
Of course, this type seems completely ridiculous like this, and it is. But that’s because no real program needs “an even number greater than one hundred”. That’s just a random bag of arbitrary constraints! A real type would correspond to a domain concept, which would have a more useful name and a more useful API, anyway.

I wrote a followup blog post here that goes into more detail about this style of data modeling, with a few more examples: https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...

replies(1): >>35062414 #
14. cdaringe ◴[] No.35061385[source]
Somebody somewhere in the system will be checking. Checking at the boundary likely scales best, because you don't burn cycles that would otherwise be distributed (and likely redundant) throughout the system. If you have a massive model, it's feasible that it makes sense to defer partial/sub-parsing?

Parser combinators seem to be pretty rippin' fast for the most part, at least the ones I've used in OCaml and Rust.

15. esrauch ◴[] No.35062414{4}[source]
What if you want it to only be less than 100, though? Not everything is so easily expressible the way you're saying.
replies(2): >>35062811 #>>35068165 #
16. dwohnitmok ◴[] No.35062416{3}[source]
This doesn't really work because you still end up needing specific types for the output of every "parser" you have and then you still need a way of combining those types together.

Or you get the ability to forge evidence (e.g. you use the evidence provided by a parser for one integer as evidence for another).

This works better for dependency injection scenarios (the Has* pattern).
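
A small Haskell sketch of the forging problem (names hypothetical): the evidence value carries no link to the integer it was checked against.

    -- The evidence type is not tied to any particular runtime value...
    data IsPositive = IsPositive

    checkPositive :: Int -> Maybe IsPositive
    checkPositive n = if n > 0 then Just IsPositive else Nothing

    -- ...so evidence obtained for one integer can be paired with any other:
    -- if ev came from checkPositive 5, usePositive (-3) ev still typechecks.
    usePositive :: Int -> IsPositive -> Int
    usePositive n _ = n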

17. dwohnitmok ◴[] No.35062502{3}[source]
No, the trickier problem is that without dependent types you are forced into a very specific, linear chain of validation, or else you have to deal with a combinatorial explosion of functions and types.

To take your type as an example, you could imagine a function

  validation : String -> Maybe FinalWidget
but maybe `validation` is really big and unwieldy and you want to reuse parts of it elsewhere so you break it down into a pipeline of

  -- Let's say a RawWidget is, operationally, a non-empty string
  validation0 : String -> Maybe RawWidget
  -- Let's say a RefinedWidget is a string consisting only of capital letters
  validation1 : RawWidget -> Maybe RefinedWidget
  -- A FinalWidget is a non-empty string of capital letters that has no whitespace
  validation2 : RefinedWidget -> Maybe FinalWidget
This is over-constrained. You don't really want to force yourself into a scenario where you must call validation0, then validation1, and finally validation2 because maybe in another code path it's more expedient to do it in another order. But the types don't line up if you do it in another order. And maybe you don't really care about `RawWidget` and `RefinedWidget`, but you're forced to create them just to make sure that you can build up to a `FinalWidget`.

This is where dependent types would really help relax those constraints.

replies(1): >>35062924 #
18. Quekid5 ◴[] No.35062811{5}[source]
You can go as far (or as short) as the application warrants. The more static evidence you want, the more cumbersome it's going to become. That seems like a natural progression, but the point is that you get to choose the trade-off. It's often easier to prove a special case than a more general case.

(Certainly, Haskell is probably not the most concise language for this kind of thing. LiquidHaskell adds interesting proof capabilities wrt. arithmetic.)

Regardless, even just parsing at the boundary and using an opaque type

    MySuperSpecialInt = Int @Even Min(2) Max(100)
(or whatever syntax you want) is still better than just using Int. At least you'll know that you'll always be handed a value that's in range (post-parsing).
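
A minimal Haskell rendering of that idea, assuming the constructor is hidden behind a module boundary:

    newtype MySuperSpecialInt = MySuperSpecialInt Int

    -- The only way to obtain a value is to parse once at the boundary.
    mkMySuperSpecialInt :: Int -> Maybe MySuperSpecialInt
    mkMySuperSpecialInt n
      | even n && n >= 2 && n <= 100 = Just (MySuperSpecialInt n)
      | otherwise                    = Nothing
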
19. Quekid5 ◴[] No.35062924{4}[source]
I don't disagree that dependent types would help (and would be really cool for lots of other uses!), but let's consider the usual validation rules we really need: non-empty, basic interval constraints (non-negative/positive), only contains a certain set of characters... simple stuff like that, usually. If we're going wild, an interesting case would be effectful validation and how that fits in. In practice, what happens with any non-basic validation is that the server says 4xx, try again.

Anyway, validation/parsing is mostly pretty simple stuff where the "validate" bit is a simple function... and function composition works just fine.

(Assuming you can name the result type of your parse/validate individually according to your domain.)
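
For instance, a sketch of that composition (the widget types are hypothetical); each parse step is a simple Maybe-returning function, and Kleisli composition chains them:

    import Control.Monad ((>=>))
    import Data.Char (isUpper)

    newtype NonEmptyText = NonEmptyText String
    newtype CapsText     = CapsText String

    parseNonEmptyText :: String -> Maybe NonEmptyText
    parseNonEmptyText "" = Nothing
    parseNonEmptyText s  = Just (NonEmptyText s)

    parseAllCaps :: NonEmptyText -> Maybe CapsText
    parseAllCaps (NonEmptyText s)
      | all isUpper s = Just (CapsText s)
      | otherwise     = Nothing

    -- the "validate" bits compose into a bigger parse
    parseWidgetName :: String -> Maybe CapsText
    parseWidgetName = parseNonEmptyText >=> parseAllCaps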

replies(1): >>35064351 #
20. ambicapter ◴[] No.35063117{3}[source]
> In practice, I find it is usually something like 'UnvalidatedCustomerInfo' being parsed into a 'CustomerInfo' object, where you can validate all of the fields at the same time

In my experience you usually can't validate them all at the same time. Take the address, for example: you usually don't validate that until after the customer has selected items, and then you find out that some items won't deliver to their area, so whereas you previously had a Valid basket, now it's in an Invalid state.

replies(2): >>35063216 #>>35155316 #
21. initplus ◴[] No.35063216{4}[source]
Whether an item can be delivered to a particular address seems like a separate concern from whether the address is correct/real.
22. dwohnitmok ◴[] No.35064351{5}[source]
Without dependent types you can't do your common constraints in an order-independent way.

You end up with four choices:

1. Have a single function that does all the constraint checking at once

2. Have a single linear order where each constraint check feeds into the next but only in that order

3. Acquiesce to a combinatorial explosion of functions that check every possible combination of those constraints

4. Give up keeping track of the constraints at a type level.

replies(1): >>35075015 #
23. lexi-lambda ◴[] No.35068165{5}[source]
Yes, certainly. Constructive data modeling is useful for many things, but it’s not a panacea. Other techniques are useful in cases where it’s not practical; I discuss some of the tradeoffs in this followup post: https://lexi-lambda.github.io/blog/2020/11/01/names-are-not-...
24. Quekid5 ◴[] No.35075015{6}[source]
I do think you can... just via phantom type parameters and type-level programming. In Scala you'd probably use Refined.

(But I'm no expert, admittedly, and it isn't an actual problem of much consequence in practical programming in Haskell or Scala. Opaque types cover the 80% of the 80-20 split just fine.)

replies(1): >>35075984 #
25. dwohnitmok ◴[] No.35075984{7}[source]
You can't with phantom type parameters and type-level programming alone, although you can get close. Neither Scala's nor Haskell's Refined lets you do what I'm thinking of.

You can get very close with type-level sets, although then compile times probably go through the roof. You're basically emulating row types at that point.

  def wrapIntoRefined(str: String): Refined[String, Unit]

  def validate0[A](str: Refined[String, A]): Either[Error, Refined[String, And[Condition0, A]]]

  def validate1[A](str: Refined[String, A]): Either[Error, Refined[String, And[Condition1, A]]]

  // This requires ordering Condition0 before Condition1 but if we resorted 
  // to a type-level set we could get around that problem
  def process(input: Refined[String, And[Condition1, And[Condition0, Unit]]]): Unit

  // But linearity is still required in some sense. We can't e.g. do our checks
  // in a parallel fashion. You still need to pipe one function right after another
The central problem is if you have two validation functions

  def validate0(str: String): Refined[String, Condition0]

  def validate1(str: String): Refined[String, Condition1]
if you try to recombine them downstream, you don't know that `Refined[String, Condition0]` and `Refined[String, Condition1]` actually refer to the same underlying `String`. They could be refined on two completely separate strings. To tie them to a single runtime String requires dependent types.

You can approximate this in Scala with path-dependent types, but it's very brittle and breaks in all sorts of ways.

> isn't an actual problem of much consequence in practical programming in Haskell or Scala. Opaque types do the 80% bit of 80-20 just fine.

I think this is only true because there isn't a production-ready dependently typed language to show how to use these patterns effectively. In much the same way that "parse, don't validate" isn't a problem of much consequence in older-style Java code because sum types aren't really a thing there, I firmly believe that if there were an ergonomic way of taking advantage of them, these sorts of dependently typed tagged types would show up all over the place.

replies(1): >>35103371 #
26. Quekid5 ◴[] No.35103371{8}[source]
> I think this is only true because there isn't a production-ready dependently typed language [...]

Now this I definitely agree with. I want to see what's possible!

27. hansvm ◴[] No.35105217{4}[source]
https://github.com/hmusgrave/pdv
28. przemo_li ◴[] No.35155316{4}[source]
Valid basket does not imply valid order.

Sometimes just splitting stuff into more types that are appropriate to specific portion of pipeline is a win.