The point is to parse the input into a structure which always upholds the predicates you care about so you don't end up continuously defensively programming in ifs and asserts.
2 years ago: https://news.ycombinator.com/item?id=27639890
3 years ago: https://news.ycombinator.com/item?id=21476261
I follow "Parse, Don't Validate" consistently in Go. For example, if I need to parse a JSON payload from an end-user for Foo, I define a struct called FooRequest, and I have exactly one function that creates a FooRequest instance, given a JSON stream.
Anywhere else in my application, if I have a FooRequest instance, I know that it's validated and well-formed because it had to have come from my FooRequest parsing function. I don't need sum types or any special language features beyond typing.
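(Not the commenter's actual code, just a minimal TypeScript analogue of the same single-constructor idea; the field names and rules are made up for illustration.)
class FooRequest {
  private constructor(readonly name: string, readonly count: number) {}
  // The only way to obtain a FooRequest: parse and check the raw JSON here.
  static parse(json: string): FooRequest {
    const raw = JSON.parse(json)
    if (typeof raw?.name !== "string" || raw.name.length === 0)
      throw new Error("name must be a non-empty string")
    if (!Number.isInteger(raw?.count) || raw.count < 0)
      throw new Error("count must be a non-negative integer")
    return new FooRequest(raw.name, raw.count)
  }
}
The private constructor plays the role of the single Go parsing function: there is no other way to obtain an instance.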
The trade-off is always hard to make. For instance: should I introduce a branded type for unsigned 32bit integer in TypeScript?
declare const U32_BRAND: unique symbol
type u32 = number & { [U32_BRAND]: never }
const u32 = (n: number): u32 => (n >>> 0) as u32
And then any arithmetic on this type has to go back through the constructor:
declare let x: u32, y: u32
y = u32(x + 1)
Don't define reasons why the input is invalid, instead have a target struct/object, and parse the input into that object.
[0] https://lexi-lambda.github.io/blog/2020/11/01/names-are-not-...
if(!needToDoTheThing()) return;
DoTheThing();
We could have written it this way:
if(needToDoTheThing()) {
    DoTheThing();
}
else {
    return;
}
The latter is closer to what a pattern match looks like. But in my experience, the majority of programmers prefer early return. I regularly see people "refactor" if-else to if-early-return, but I've never seen the opposite.

In order to be useful, type systems need to be simple, but there is no such restriction on the rules that govern our expectations of data correctness.
OP is delusional if they think that their approach can be made practical. I mean, what if the expectation for the data is that a value is a prime number? -- How are they going to encode this in their type system? And this is just a trivial example.
There are plenty of useful constraints we routinely expect in message exchanges that aren't possible to implement using even very elaborate type systems. For example, if we want to ensure that all ids in XML nodes are unique. Or that the last digit of SSN is a checksum of the previous digits using some complex formula. I mean, every Web developer worth their salt knows that regular expressions are a bad idea for testing email addresses (which would be an example of parsing), and it's really preferable to validate emails by calling a number of predicates on them.
And, of course, these aren't the only examples: password validation (the annoying part that asks for a capital letter, a digit, a special character? -- I want to see the author implement a parser for the possible inputs to a password field, while also giving helpful error messages such as "you forgot to use a digit"). Even though I don't doubt it's possible to do that, the resulting code would be an abomination compared to the code that does the usual stuff, i.e. just checks if a character is in a set of characters.
def TRUE(a, b):
return a
def FALSE(a, b):
return b
def IF(cond, a, b):
return cond(a, b)
assert IF(TRUE, 1, 2) == 1
assert IF(FALSE, 1, 2) == 2
This gives you the conditional statement in most languages ("cond ? a : b" or "a if cond else b").

It does exactly what it says in the box: turns your TypeScript `types` / `interface` into machine-readable JSON schemas.
The library has a few open issues (does not deal well with some edge cases of composing Omit<> on sum types, and does not support dynamic (const) keys), but compared to manually writing JSON schemas, it's been amazing!
EDIT: I should add that the library supports adding further type constraints that are supported by JSON Schema but not by TS by using JSDoc (for instance, pattern matching on strings, ranges on numbers, etc.).
Dependent type checkers may be hard to implement, but the typing rules are fairly simple, and people have been using this correct by construction philosophy using dependently-typed languages for a while now.
There's nothing delusional about that.
People get too caught up in thinking that the type _has_ to express intricate properties; it doesn't. How am I going to express the expectation that something is prime? With the following closed API:
module Prime where
data PrimeNumber
parsePrime :: Int -> Maybe PrimeNumber
toInt :: PrimeNumber -> Int
Now the problem is that _leaving_ this API forgets information. Whether or not that is a problem is a different question, and very dependent on the context.

The same applies to your comment about passwords. One can quite easily create a closed module that encapsulates a ValidPassword type that simply performs runtime character tests on a string.
I want to stress that this approach is making a trade-off (as I mentioned earlier, leaving the API forgets information, forcing you to re-parse). However, this puts the design somewhere in the middle of the spectrum. At one extreme we have primitive obsession and shotgun parsing everywhere; with this we push the parsing into a sane place and try to hold on to these parsed values as long as possible; and at the other extreme we need dependent types or sophisticated encodings where the value carries a lot more information (and here we get towards propositions as types).
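(A sketch of such a closed module in TypeScript, assuming hypothetical password rules; the brand symbol is not exported, so parsePassword is the only way to produce a ValidPassword.)
declare const VALID_PASSWORD: unique symbol
type ValidPassword = string & { [VALID_PASSWORD]: true }
// Returns the branded type on success, or the list of human-readable problems.
function parsePassword(raw: string): ValidPassword | string[] {
  const problems: string[] = []
  if (raw.length < 8) problems.push("must be at least 8 characters")
  if (!/[A-Z]/.test(raw)) problems.push("must contain a capital letter")
  if (!/[0-9]/.test(raw)) problems.push("must contain a digit")
  return problems.length === 0 ? (raw as ValidPassword) : problems
}
Downstream code takes ValidPassword and never re-checks it; the error branch is where the "you forgot to use a digit" style messages live.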
I had to work with this for a while because I wanted to implement Hello World in Javascript. https://github.com/quchen/lambda-ski/blob/master/helloworld/...
Nice error messages exist there as well.
If you're casting untyped results, you can change one side and not the other and only find out about the problem in production. Or any mistake may simply go unnoticed.
Using a TypeScript-first library allows you to do much more: it supports opaque types, custom constructors and any validation imaginable that can't be expressed in JSON Schema.
## type Silly = Foo | Bar Int | Qux String Silly
## Constructors
def Foo(onFoo, onBar, onQux):
return onFoo()
def Bar(arg0):
return lambda onFoo, onBar, onQux: onBar(arg0)
def Qux(arg0, arg1):
return lambda onFoo, onBar, onQux: onQux(arg0, arg1)
## Values of Silly type are Foo, Bar(x) and Qux(x, y)
## Destructor
def match_Silly(silly, onFoo, onBar, onQux):
return silly(onFoo, onBar, onQux)
You can make a whole language on top of that if you don't mind effectively disabling your CPU's branch predictor.

> in my mind, the difference between validation and parsing lies almost entirely in how information is preserved
“parse don’t validate” is a pithy and easy to remember maxim for this preservation.
Because validation is implicitly necessary for parsing to a representation which captures your invariants anyway, by banning validation as a separate concept you ensure standalone validation doesn't get reintroduced: any validation step outside of a wider parsing process is considered incorrect.
It also doesn't support inlined assertions, referring to existing classes, custom validations, opaque types etc.
Any time I see some regex, I start asking probing questions about the nature of the underlying abstraction.
Being able to deterministically convert something into an AST is the ultimate test of that thing's stability at any scale.
In TypeScript we can define
type Prime = number
function isPrime(value: number): value is Prime {
  // run sieve
}
From here, you may have e.g.
function foo(value: Prime, ...) {
}
And it will be type checked.
function fooOrFail(v: number) {
  if (isPrime(v))
    foo(v)
  else
    throw new TypeError()
}
I'd summarise boolean blindness as: implicit (often unsafe) coupling/dependencies of method results; which could instead be explicit data dependencies. That article's example is 'plus x y = if x=Z then y else S(plus (pred x) y)', which uses an unsafe 'pred' call that crashes when x is 'Z'. It avoids the crash by branching on an 'x=Z' comparison. The alternative is to pattern-match on x, to get 'Z' or 'S x2'; hence avoiding the need for 'pred'.
Another alternative is to have 'pred' return 'Maybe Nat'; although that's less useful when we have more constructors and more data (e.g. the 'NonEmptyList' in this "parse, don't validate" article!)
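(A rough TypeScript rendering of the same contrast, using a discriminated union for Nat; names are illustrative only.)
type Nat = { tag: "Z" } | { tag: "S"; pred: Nat }
// Boolean-blind style: the isZero check and any later unsafe "pred" access
// are only connected in the programmer's head.
const isZero = (n: Nat): boolean => n.tag === "Z"
// Pattern-match style: the "S" branch hands you x.pred directly, so there is
// no partial pred function to misuse.
function plus(x: Nat, y: Nat): Nat {
  switch (x.tag) {
    case "Z": return y
    case "S": return { tag: "S", pred: plus(x.pred, y) }
  }
}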
Unfortunately, it conceals it behind some examples that - while they do a good job of illustrating the generality of its applicability - don’t show as well how to use this in your own code.
Most developers are not writing their own head method on the list primitive - they are trying to build a type that encapsulates a meaningful entity in their own domain. And, let’s be honest, most developers are also not using Haskell.
As a result I have not found this article a good one to share with junior developers to help them understand how to design types to capture the notion of validity, and to replace validation with narrowing type conversions (which amount to ‘parsing’ when the original type is something very loose like a string, a JSON blob, or a dictionary).
Even though absolutely those practices follow from what is described here.
Does anyone know of a good resource that better anchors these concepts in practical examples?
If you nest if/else, you'll quickly approach a point where you have to keep a complex logic tree in your head to determine which states the system could be in inside of any given branch. If you use guard clauses and return early, you'll keep this complexity down to a minimum, since the list of possible states changes linearly with your code instead of exponentially.
I know not everybody likes it, but I think this makes cyclomatic complexity an extremely valuable metric for measuring "ease-of-reading".
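(A tiny illustration of the guard-clause shape in TypeScript; the request shape is hypothetical.)
type Req = { user?: { id: string }; items?: string[] }
function handle(req: Req) {
  // Each guard eliminates one bad state; everything below it can assume the
  // happy path without tracking a tree of nested conditions.
  if (req.user === undefined) return
  if (req.items === undefined || req.items.length === 0) return
  console.log(`processing ${req.items.length} items for ${req.user.id}`)
}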
checkAgainstMySchema: JSON -> Boolean
Or this:
checkedAgainstMySchema: JSON -> JSON
Instead, it's better to use a type signature like this:
checkAgainstMySchema: JSON -> Either Error MyJSON
(Where MyJSON is some type which wraps up your data; which could be the raw JSON, or perhaps some domain objects, or whatever.)

The reason this is better is that it's required for your program to work: if your processing functions take a 'MyJSON' as argument, then (a) your program must call the 'checkAgainstMySchema' function; and (b) you can only run your data processing in the successful branch (since that's the only way to get the 'MyJSON' argument you need).
In contrast, the functions which return 'Boolean' and 'JSON' are not required; in the sense that, we could completely forget to do any validation, and still end up with a runnable program. That's dangerous!
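(Roughly what that looks like in TypeScript; MyJSON, the brand and the field are placeholders.)
declare const CHECKED: unique symbol
type MyJSON = { widgets: string[] } & { [CHECKED]: true }
// Element-level checks elided; the point is the return type.
function checkAgainstMySchema(raw: unknown): MyJSON | Error {
  if (typeof raw === "object" && raw !== null && Array.isArray((raw as any).widgets))
    return raw as MyJSON
  return new Error("does not match schema")
}
// Processing only accepts checked data, so forgetting the check cannot compile.
function processWidgets(data: MyJSON): number {
  return data.widgets.length
}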
Kind of yes, but this discussion is much dependent on definitions of `parse` and `validate`, which the article does not explicitly elaborate on. The chapter "The power of parsing" captures this difference implicitly "validateNonEmpty always returns (), the type that contains no information". Validation, in the context of all of this, can be defined as "checking conformance to a set of rules" while parsing is mostly synonymous with deserialization.
In most practical applications you explicitly do not want to only validate inputs, since you have no need to perform any computation on invalid input anyway. Sometimes you explicitly want to analyze invalid inputs, maybe try to recover some information or do some other magic. Sure then, go and validate input and do that magic on invalid input. In most cases, though, you want to simply reject invalid inputs.
However, when you think about it, that is exactly what parsing does. Validation happens during parsing implicitly: the parser will either return a valid object or throw an error, but parsing has the added benefit that the end result is a datum of a known datatype. Of course it only really works in statically typed languages.
The thing is that it is rather easy to conflate the two. Take for example the following JSON `{"foo": "bar", "baz": 3}`. A "parser" can return 1) a list of `(String, String, String)` 3-tuples for (type, key, value) that downstream has to process again, 2) a full-blown `FoobarizerParams` object, or something in between.
parseNonEmpty [] = throwIO $ userError "list cannot be empty"
How would that interact with a scenario where we want a specific error message if a specific list is empty? E.g., "you want to build the list using listBuilder()". Making illegal states unrepresentable is good advice, but I don't think that does away with the value of good validation.

It is a mistake to do ad-hoc validation. But it makes a lot of sense to have a validation phase, a parse phase, then an execution phase when dealing with untrusted data. The validation phase gives context-aware feedback, the parse phase catches what is left, and then execution happens.
A type system doesn't seem like a good defence against end-user error. The error messages in practice are cryptic. I think the complaint here is about people trying to implement a type system using ad-hoc validation, which is a bad idea.
However, I have had to deal occasionally with http libraries that tried to parse everything and would not give you access to anything that they could not parse. This was incredibly frustrating for corner cases that the library authors hadn't considered.
If you are the one who is going to take action on the data, parse don't validate is the correct approach. If you are writing a library that deals with data that it doesn't fully understand, and you're handing that data to someone else to take action with, then it may not always be the right approach.
There are plenty of cases in real-world code where an array that's part of a struct or object may or may not contain any elements. If you're just parsing input into that, it seems like you'd either still end up doing an equivalent of checking whether the array is empty or not everywhere the array might be used later, even if that check is looking at an "array has elements" type flag in the struct/object, and so you're still maintaining a description of ways that the input may be invalid. But I'm not a world-class programmer, so maybe I'm missing something. Maybe you mean something like for branches of the code that require a non-empty array, you have a second struct/object and parser that's more strict and errors out if the array is empty?
Don't have to be 100% immutable or perfect ADTs: see Rust, Swift, Kotlin. Even TypeScript can do this, albeit it's uglier with untagged unions and flow typing instead of pattern matching.
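(For example, a toy version of that untagged-union, flow-typing style; the result shape here is made up.)
type NameResult = { name: string } | string   // untagged: parsed value or error message
function parseName(raw: unknown): NameResult {
  return typeof raw === "string" && raw.trim().length > 0
    ? { name: raw.trim() }
    : "name must be a non-empty string"
}
const r = parseName("  Ada  ")
if (typeof r === "string") {
  console.error(r)       // flow typing narrows r to the error message here
} else {
  console.log(r.name)    // ...and to the parsed object here
}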
If you're talking about consuming the looser input and producing a definitely-correct output, already, then you're talking about parsing, not validation. Most validation occurs naturally during parsing.
It's built on a particular technical distinction between parsing and validating that (1) is not all that commonly understood or consistently accepted and (2) is not actually explicitly stated in the article!
(Validation: check data assumptions, fail if not met. Parse: check data assumptions, fail if not met, and on success return the data as a new type reflecting the additional constraints on the data, which can therefore be checked at compile time. Notice parsing includes validation, which makes the title of the article quite poor.)
That's important to know because the distinction is only meaningful in the context of certain language features, which may or may not apply.
Also, this is not great general advice:
> Push the burden of proof upward as far as possible, but no further
For one, it's mostly meaningless, since it really just says put the burden of proof in the right place. But it implies that upward is preferable. You really want to push it upward if it's a high-level concern, and downward if it's a low-level concern. E.g., suppose you're working on an app or service that accesses the database, so the database is lower-level. You'll want to push your database-specific type transformations closer to the code that accesses the database.
Honestly, I find this whole thing kind of muddled.
(Also, in my experience, the fundamental limit here isn't validation strategies, but the human ability to break down a problem and logically organize the solution. You can just as easily end up with an unmaintainable mess of spaghetti types as with any other useful abstraction.)
const published = posts.filter(post => !post.draft);
- successfully parsed data objects
- error objects
- warning objects
That way your consumers can themselves decide what to do in the face of errors and warnings.
(Of course one ugly old fashioned way to add optional ‘error’ types to your return signature is checked exceptions, but we don’t talk about that model any more.)
IMO, database code is at exactly the same level of concern as network code or filesystem code. By “upward”, she means push parsing to the boundaries of your program — as close to the point of ingress as possible.
Empty lists are actually generally not a big deal - they are just lists of size 0, and they generally support all the same things you can do with non-empty lists. The fact that a "head" function throws an error on an empty list is really just a specific form of the more general observation that any array will throw an index-out-of-bounds exception when given an index that's... out of bounds. So any time you are dealing with arrays, you probably need to think about: "what happens if I try to index something that's out of bounds? is that possible?"
In this particular contrived example, all that mattered was the head of the array. But what if you wanted to pick out the 3rd argument in a list of command line arguments, and instead the user only gave you 2 inputs? If 3 arguments are required, then throw an IOError as early as possible after failing to parse out 3 arguments; but once you pass the point of parsing the input into a valid object/struct/whatever, from that point forward you no longer care about checking whether the 3rd input is empty or not.
So again, it depends on your scenario. Actually the more interesting variant of this issue (in OO languages at least) is probably handling nulls, as empty lists are valid lists, but nulls are not lists, and requires some different logic usually (and hence why NullPointerExceptions aka NPEs are such a common failure mode).
(In fact you could use an invalid password just fine: unless you're doing something really weird, your code would not misbehave because it's too short or missing digits and symbols. It's only due to security reasons that you choose to reject that string.)
But that doesn't mean that `string -> Password` isn't parsing! As long as you're outputting distinct types for ValidPassword and InvalidPassword, you are still following the advice of this article, because you can make all your internal code use ValidPassword as a type and you will not need to ever check for password validity again.*
Compare that to e.g. adding a { IsValid = true } field to the object, which would require you to defensively sprinkle `if (user.Password.IsValid)` every time you try to actually use the password field.
* One weakness arising from the fact that this is degenerate parsing, i.e. ValidPassword is just a string, is that a very stubborn fool could build a ValidPassword from any arbitrary string instead of using the proper parse function. Depending on your language, this can be prevented by e.g. hiding the constructor so that only parsePassword has access to it.
Consider a 'validating' function
validateEmail :: String -> IO ()
and a 'parsing' function
parseEmail :: String -> Either EmailError ValidEmail
The property encoded by the ValidEmail type is available throughout the rest of the program, which is not the case if you only validate.

If this isn't clear to you, ask yourself why programming languages are parsed and not merely validated. Validation is a subset of parsing, so clearly there's something important added.
But, yeah, the clickbait title put me off, and you're right that the terminology is unhelpful, since the distinction between parsing and validation isn't consistently made, especially in practical work. Virtually all of the "validation" code I've seen in statically typed languages, in the codebases I've worked in, would be "parsing" by this definition.
I genuinely wonder how one would write a proof in something like Agda, that
parseJson("{\"foo\":" + encodeJson(someObject) + "}")
always succeeds.

> The difference lies entirely in the return type: validateNonEmpty always returns (), the type that contains no information, but parseNonEmpty returns NonEmpty a, a refinement of the input type that preserves the knowledge gained in the type system. Both of these functions check the same thing, but parseNonEmpty gives the caller access to the information it learned, while validateNonEmpty just throws it away.
This might not seem like much of a distinction, but it has far-reaching implications downstream:
> These two functions elegantly illustrate two different perspectives on the role of a static type system: validateNonEmpty obeys the typechecker well enough, but only parseNonEmpty takes full advantage of it. If you see why parseNonEmpty is preferable, you understand what I mean by the mantra “parse, don’t validate.”
parseNonEmpty is better because after a caller gets a NonEmpty it never has to check the boundary condition of empty again. The first element will always be available, and this is enforced by the compiler. Not only that, but functions the caller later calls never need to worry about the boundary condition, either.
The entire concern over the first element of an empty list (and handling the runtime errors that result from failure to meet the boundary condition) disappear as a developer concern.
While the article is titled "parse, don't validate", I like its first point, make illegal states unrepresentable, much better.
the difference between validation and parsing lies almost entirely in how information is preserved. Consider the following pair of functions:
validateNonEmpty :: [a] -> IO ()
parseNonEmpty :: [a] -> IO (NonEmpty a)
Both of these functions check the same thing, but parseNonEmpty gives the caller access to the information it learned, while validateNonEmpty just throws it away.
This is what Applicative Functors were born to do. Here's a good article on it: https://www.baeldung.com/vavr-validation-api
Check the types:
public Validation<Seq<String>, User> validateUser(...)
Even though it's called "validation", it's still the approach the OP recommends.

It reads as "given the inputs, you either get back a User, or a Seq of Strings describing the validation errors instead".
Contrast this with the wrong way of doing things:
User user = new User(seqOfStrings);
user.validate();
I don't know if Ikea actually does this, but I just mean as a concept that's one way you can use to imagine it. There are so many examples of this in the wild, for important things, e.g. you can't use the washing machine unless the lid is actually closed all the way.
My bad.
> People get too caught up in thinking that the type _has_ to express intricate properties
Where do you get this from? Did you even read what you are replying to? I never said anything like that... What I'm saying is that the approach taken by OP is worthless when it comes to real-life uses of validation.
So, continuing with your example: you will either end up doing validation instead of parsing (i.e. you will implement the parsePrime validator function), or you will not actually validate that your input is a prime number... The whole point OP was trying to make is that they wanted to capture the constraints on data in a type describing those constraints, but outside of trivial examples such as the non-empty list used by OP, that leads to programs that are either impossible or extremely complex.
> One can quite easily create a closed module that encapsulates a ValidPassword
And, again, that would be doing _validation_ not parsing. I'm not sure if you even understand what the conflict here is, or are you somehow agreeing with me w/o saying so?
Really? How is that degenerate? Compared to what?
My guess is that you just decided to use a dictionary word you don't fully understand.
> In fact you could use an invalid password
Where does this nonsense come from? No. I cannot use invalid password. That's the whole point of validation: making sure it doesn't happen. What kind of bs is this?
> But that doesn't mean that `string -> Password` isn't parsing!
It's doing nothing useful, and that's the whole point. You just patted yourself on the head in front of the mirror for using some programming technique that you believe makes you special, but accomplished nothing of value. That was the original point: if you want results, you will end up doing validation, there's no way around it. You renaming of types is only interesting to you and a group of people who are interested in types, but doesn't advance the cause of someone who wants their password validated.
You just admitted in this sentence that the use of opaque types achieves nothing of value. Which was my point all along: why use them if they are useless? Just to feel smart because I pulled out an academia-flavored ninety-pound dictionary word to describe it?
Now, are there really tools to make type systems with dependent types simple to prove? In reasonable time? How about the effort developers would have to put into just trying to express such types and into verifying that such an expression is indeed accomplishing its stated goals?
Just for a moment, imagine you filing a PR in a typical Web shop for the login form validation procedure, and sending a couple of screenfulls of Coq code or similar as a proof of your password validation procedure. How do you think your team will react to this?
Again, I didn't say it's impossible. Quite the opposite, I said that it is possible, but will be so bad in practice that nobody will want to use it, unless they are batshit crazy.
I have no problems with the way you want to interpret this claim. But, really, I'm responding to the article linked to this thread, which isn't about at which point in application to perform the said validation or parsing.
Yes, it's fine, if you want to validate your input in this way -- I have no problems with it. It's just that you are doing validation, not parsing, at least not in the terms OP used them.
In practice, I find it is usually something like 'UnvalidatedCustomerInfo' being parsed into a 'CustomerInfo' object, where you can validate all of the fields at the same time (phone number, age, non-null name etc.). Once you have parsed it into the internal 'CustomerInfo' type - you don't need to keep re-validating the expected invariants everywhere this type is used. There was a good video on this that I wish I could find again, where the presenter gave the example of using String for a telephoneNumber field instead of a dedicated type. Any code that used the telephoneNumber field would have to re-parse and validate it "to be safe" it was in the expected format.
The topic of having untrusted external types and internal trusted types is also explained in the book `Domain Modeling Made Functional` which I highly recommend.
https://fsharpforfunandprofit.com/posts/designing-with-types...
https://ybogomolov.me/making-illegal-states-unrepresentable/
In my view it's a very bad design for an http library, although it would have been a lot less frustrating if it had at least provided an escape hatch.
Meanwhile, people with more than one bit in their worldview RAM can fall back to validation when it's the only option that makes sense for their domain, and use parsing when it's appropriate, which is, notwithstanding your handful of frankly niche examples compared to the vast bulk of CRUD code, most of the time in practice.
This confusion is, I think, just a question of different conceptions of the system architecture.
Your terminology is drawing from a three-tier architecture [0] with a presentation layer, logic layer, and data layer. Under this model, input (data) is the bottom layer and output (HTTP/GUI) is the top layer, with your application logic in the middle.
On the other hand, she is viewing the system through an inside-outside lens similar to the hexagonal architecture [1]. All input (data) and output (HTTP/GUI) is considered to be up and out of your application logic. Rather than being the middle of a sandwich, the application logic is the kernel of a seed.
This is a common way to view the system when programming in functional languages like Haskell because the goal is usually to push all I/O to the start of the call stack so as to minimize the amount of code that has to account for side effects. The three-tier architecture isn't concerned about isolating effects, so treating the data layer as the bottom layer of the code is reasonable.
In either model, the point is to push validation to the boundaries of your code and rely on the type checker to prove you're using things right within the logic layer.
[0] https://en.wikipedia.org/wiki/Multitier_architecture
[1] https://en.wikipedia.org/wiki/Hexagonal_architecture_%28soft...
Am I mistaken?
My mistake in the above snippets is precisely that TypeScript can not make the type more specific, i.e. Number to Prime, because `type Prime=number` is only creating an alias. I am not creating a type that is a more specific version of number but an alias.
Had I actually created a proper type, the parsing would have been correct. The parsing component is happening in the outer function because at some point I need to make the generic input more specific, and then allow it to flow through the rest of the program. Am I mistaken?
e.g. imagine a `parseNonEmpty` and a `parseAllEven` method. Both take a list and return either a `NonEmpty` or `AllEven` type. If I call `parseNonEmpty` and get a `NonEmpty`, then pass to `parseAllEven`, I now have erased the `NonEmpty` as I'm left with an `AllEven`. I would need a `parseAllEvenFromNonEmpty` to take a `NonEmpty` and return a `AllEven & NonEmpty`.
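One way around that erasure in TypeScript is to make each parser generic and return an intersection, so whatever refinements the input already carries are kept (the brands here are hypothetical):
declare const NON_EMPTY: unique symbol
declare const ALL_EVEN: unique symbol
type NonEmpty = { [NON_EMPTY]: true }
type AllEven = { [ALL_EVEN]: true }
// Each parser adds its own brand on top of whatever T already is.
function parseNonEmpty<T extends number[]>(xs: T): (T & NonEmpty) | undefined {
  return xs.length > 0 ? (xs as T & NonEmpty) : undefined
}
function parseAllEven<T extends number[]>(xs: T): (T & AllEven) | undefined {
  return xs.every(n => n % 2 === 0) ? (xs as T & AllEven) : undefined
}
Feeding a successful parseNonEmpty result into parseAllEven then yields number[] & NonEmpty & AllEven, without needing a combined parseAllEvenFromNonEmpty.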
A 'NotEmpty a' is just a subclass of a potentially empty 'a'. You also get the desirable behavior, in this scenario, of automatic upcasting of a 'NotEmpty a' into a regular old 'a'.
Of course, if you want to share the schema with downstream clients so that other programs can use it, that is a great use case for something like JSON Schema. It is a common interface that allows two different programs—quite possibly written in completely different languages—to communicate using the same format. That’s great! But it’s only half the story, because just having the schema doesn’t help you in any way to make sure the code actually respects that schema. That’s where integration with the language’s type system can help, perhaps by automatically generating types from the schema and then generating parsing/serialization functions that use those generated types.
In Java, you'd implement this by making a class with a private constructor, no mutator methods, and a static factory method that throws an exception if the parsing fails. Since the only way to get an instance of the class is through the factory method, you've made illegal states unrepresentable and know that the class always holds to its invariants. No methods on instances of that class will throw exceptions from then on, so you've successfully applied "Parse, Don't Validate" without needing sum types.
The point of the article isn't the particular implementation in Haskell, it's the concept of pushing all data error states to the boundaries of your code, which applies anywhere as long as you translate it into the idioms of your language.
I once had to implement a feature on a real estate website.
For a given location I would get a list of stats (demographics, cost per square meter, average salary in the area, etc). Of those stats some themselves contained lists.
At the beginning I modeled everything with arrays in react. This led to passing down lists and having to check at multiple steps whether they were non-empty and handle that case.
Then I modeled everything with a NonEmptyArray guard in TypeScript
interface NonEmptyArray<A> extends Array<A> {
  0: A // tells the type system that index 0 exists and is of type A
}
function isNonEmpty<A>(as: A[]): as is NonEmptyArray<A> {
  return as.length > 0
}
Then, after receiving the response with those lists, I could parse them into NonEmptyArrays and remove all of the emptiness checks inside the React components; handling the fact that some of these lists could be empty trickled up to the outermost component, and everything became very clean and simple to understand and maintain.
Everything in that post applies to the most common programming language out there: TypeScript.
And several popular others such as Rust, Kotlin or Scala.
Deciding this thing is specifically a WidgetDescription, not a Widget or a WidgetLabel, or a WidgetAssociatedText and definitely not a ThingyDescription, can help both users and other developers produce a mental model of what's going on that results in a better experience for everyone.
It absolutely is.
> I have no idea why would you question that.
I did not question [that they were different approaches], I explained, through example and counter-example, why they were the same approach. I will try again.
Alexis wrote both 'validate' and 'parse' examples in ML-style types:
validateNonEmpty :: [a] -> IO () // ML-typed 'validate'
parseNonEmpty :: [a] -> IO (NonEmpty a) // ML-typed 'parse'
More from the article: The difference lies entirely in the return type: validateNonEmpty always returns (), the type that contains no information, but parseNonEmpty returns NonEmpty a, a refinement of the input type that preserves the knowledge gained in the type system. Both of these functions check the same thing, but parseNonEmpty gives the caller access to the information it learned, while validateNonEmpty just throws it away.
I chose OO-style types for my samples, because there's a large fraction of HN users who dismiss ML-ish stuff as academic, or "practically useless outside of the most trivial cases".

// OO-typed 'validate' (my straw man)
class User {
// returns void aka '()' aka "the type that contains no information"
void validateUser() throws InvalidUserEx {...}
}
/* OO-typed 'parse' (as per my baeldung link)
* "gives the caller access to the information it learned"
* In this case it gives back MORE than just the User,
* it also gives back 'why it went wrong', per your request above for password validation
* (In contrast with parseNonEmpty which just throws an exception.)
*/
class UserValidator {
Validation<Seq<String>, User> validateUser(...) {...}
}
> But, ML-style types have very limited expressive power

Hindley-Milner types are a goddamned crown jewel of computer science.
The example was written rather badly, though. It should have pointed out that the module was exporting the type and a couple helper functions, but not the data constructor.
But despite that, the key point was correct. Validating is examining a piece of data and returning "good" or "bad". Parsing is returning a new piece of data which encodes the goodness property at the type level, or failing to return anything. It's a better paradigm because the language prevents you from forgetting what situation you're in.
const info = type({ name: "string>0", email: "email" })
impl std::str::FromStr for Foo {
type Err = ReasonsItIsNotAFoo;
fn from_str(s: &str) -> Result<Self, Self::Err> {
/* etc. */
}
}
And then whenever I've got a string which I know ought to be a Foo, I can:
let foo: Foo = string.parse().unwrap_or_else(|_| panic!("This {string:?} ought to be a Foo but it isn't"));
Since we said foo is a Foo, by inference the parsing of string needs to either succeed with a Foo, or fail while trying, so it calls that FromStr implementation we wrote earlier to achieve that.

I couldn't disagree more. A type modeling an HTTP request should model HTTP requests. Not some theoretical description of an HTTP request.
Compare:
function validateNonEmpty<T>(list: T[]): void {
if (list[0] === undefined)
throw Error("list cannot be empty")
}
function parseNonEmpty<T>(list: T[]): [T, ...T[]] {
if (list[0] !== undefined) {
return list as [T, ...T[]]
} else {
throw Error("list cannot be empty")
}
}
function assertNonEmpty<T>(list: T[]): asserts list is [T, ...T[]] {
if (list[0] === undefined) throw Error("list cannot be empty")
}
function checkEmptiness<T>(list: T[]): list is [T, ...T[]] {
return list[0] !== undefined
}
declare const arr: number[]
// Error: Object is possibly undefined
console.log(arr[0].toLocaleString())
const parsed = parseNonEmpty(arr)
// No error
console.log(parsed[0].toLocaleString())
if (checkEmptiness(arr)) {
// No error
console.log(arr[0].toLocaleString())
}
assertNonEmpty(arr)
// No error
console.log(arr[0].toLocaleString())
For me the `${arg} is ${type}` approach is superior, as you are writing the validation once and can pass the precise mechanism for handling the error to the caller, who tends to have a better idea of what to do in degenerate cases (sometimes throwing a full-on Exception is appropriate, but sometimes a different form of recovery is better).

Yes, indeed. This is quite useful! But crabbone isn’t entirely wrong that it isn’t quite what the original article was about.
I’ve written quite a bit of code where constructive data modeling (which is what the original article is really about) was both practical and useful. Obviously it is not a practical approach everywhere, and there are lots of examples where other techniques are necessary. But it would be quite silly to write it off as useless outside of toy examples. A pretty massive number of domain concepts really can be modeled constructively!
But when they can’t, using encapsulation to accomplish similar things is absolutely a helpful approach. It’s just important to be thoughtful about what guarantees you’re actually getting. I wrote more about that in a followup post here: https://lexi-lambda.github.io/blog/2020/11/01/names-are-not-...
Not at all, the article is about pushing complexity to the "edges" of your code so that the gooey center doesn't have to faff around with (re-)checking the same invariants over and over... but its examples are also in Haskell, in which it would be weird to do this without the type system.
In python or java or whatever you'd just parse your received_api_request_could_be_sketchy or fetched_db_records_but_are_they_really into a BlessedApiRequest or DefinitelyForRealDBRecords in their constructors or builder methods or whatever, disallow any other ways of creating those types, and then exclusively using those types.
edit: wait, actually no we agree, I must have glossed over your second sentence, sorry
Instead it's parsing. It takes in a value of one type and returns a value of a different type that is known good. Or it fails. But what it never does is let you continue forward with an invalid value as if it was valid. This is because it's doing more than just validation.
function processUserInput(input: String, requirement: InputReqs[input]): Unit = ...
type InputReqs[input] = {
// We'll say that a Proposition takes Boolean expressions and turn them into types
notTooLong: Proposition[length(input) < 128],
authorIsNotBlocked: AuthorIsNotBlocked[input],
sanitized: Sanitized[input],
...
}
where you might have the following functions (which are all examples of parse, don't validate):
function checkIfAuthorIsBlocked(author: String, input: String): Maybe[AuthorIsNotBlocked[input]] = ...
// Create a pair that contains both the sanitized String and a tag that it has been sanitized
function sanitizeString(input: String): (output: String, Sanitized[output]) = ...
where just by types alone I know that e.g. length checking must occur after sanitization (because sanitizeString generates a new `output` that is distinct from `input`) and don't have to write down in docs somewhere that you might cause a bug if you check lengths before sanitization, because maybe sanitization changes the length of the input.

Note that this is also strictly stronger than a simple precondition/postcondition system or some sort of assertion system, because properties of the input that we care about may not be observable at runtime/from the input alone (e.g. AuthorIsNotBlocked can't be asserted based only on input: you'd have to change the runtime representation of input to include that information).
SQL constraints are certainly useful. But they don’t really solve the same problem. SQL constraints ensure integrity of your data store, which is swell, but they don’t provide the same guarantees about your program that static types do, nor do they say much at all about how to structure your code that interacts with the database. I also think it is sort of laughable to claim that XSL is a good tool for solving just about any data processing problem in 2023, but even if you disagree, the same points apply.
Obviously, constructive data modeling is hardly a panacea. There are lots of problems it does not solve or that are more usefully solved in other ways. But I really have applied it to very good effect on many, many real engineering problems, not just toys, and I think the technique provides a nice framework for reasoning about data modeling in many scenarios. Your comments here seem almost bafflingly uncharitable given the article in question doesn’t make any absolutist claims and in fact discusses at some length that the technique isn’t always applicable.
See also: my other comment about using encapsulation instead of constructive modeling (https://news.ycombinator.com/item?id=35059113) and my followup blog post about how more things can be encoded using constructive data modeling than perhaps you think (https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...).
My point was merely that the examples being presented in Haskell - and in the context of talking about lists in a very functional, lispy cons-ish kind of way, makes it less accessible for programmers who are using more object-oriented type systems.
Just to be clear, I actually really like the idea of parsing the input into a structure. I do the same thing in a lot of my code. I just don't see how it removes the need to also perform validation in many (maybe most) cases as soon as one gets beyond contrived examples.
The empty array example seems to be a can of worms. Maybe it's specific to the kinds of software that I've written, but in most of the cases I can think of, I wouldn't know if it was OK for a particular array within a structure to be empty until after the code had made some other determinations and branched based on them. And yet, like the example, once it got to the real handling for that case, it would be a problem if the array were empty. So the image in my mind is many layers of parsing that are much more complicated and confusing to read than validating the length of the array.
I still think it's a great idea for a lot of things, just that the "parse, don't validate" name seems really misleading. I might go with something like "parse first, validate where necessary".
it's often a matter of experience to get this nuance of programming. you just learn with time that it's very inconvenient to test for emptiness multiple levels deep in the callstack again and again and you go "why can't i just assume good data here?". and then you figure out a way to write the code so you can.
However, TypeScript does not really provide any facility for nominal types, which in my opinion is something of a failure of the language, especially considering that it is at odds with the semantics of `class` and `instanceof` in dynamically-typed JavaScript (which have generative semantics). Other statically typed languages generally provide some form of nominal typing, even gradually typed ones. Flow even provided nominal types in JavaScript! But TypeScript is generally also quite unsound (https://twitter.com/lexi_lambda/status/1621973087192236038), so the type system doesn’t really provide any guarantees, anyway.
That said, TypeScript programmers have developed a way to emulate nominal typing using “brands”, which does allow you to obtain some of these benefits within the limitations of TS’s type system. You can search for “TypeScript branded types” to find some explanations.
Tangentially, in Haskell specifically, I have actually written a library specifically designed for checking the structure of input data and raising useful error messages, which is somewhat ironically named `monad-validate` (https://hackage.haskell.org/package/monad-validate). But it has that name because similar types have historically been named `Validation` within the Haskell community; using the library properly involves doing “parsing” in the way this blog post advocates.
This is overthinking it. Usually, when people are not used to doing constructive data modeling, they get caught up on this idea that they need to have a datatype that represents their data in some canonical representation. If you need a type that represents an even number, then clearly you must define a type that is an ordinary integer, but rules out all odd numbers, right?
Except you don’t have to do that! If you need a number to always be even (for some reason), that suggests you are storing the wrong thing. Instead, store half that number (e.g. store a radius instead of a diameter). Now all integers are legal values, and you don’t need a separate type. Similarly, if you want to store an even number greater than 100, then use a natural number type (i.e. a type that only allows non-negative integers; Haskell calls this type `Natural`) and store half of (that number minus 102). This means that, for example, 0 represents 102, 1 represents 104, 2 represents 106, 3 represents 108, etc.
If you think this way, then there is no need to introduce a million new types for every little concept. You’re just distilling out the information you actually need. Of course, if this turns out to be a really important concept in your domain, then you can always add a wrapper type to make the distinction more formal:
newtype EvenGreaterThan100 = EvenGreaterThan100 Natural
evenGreaterThan100ToInteger :: EvenGreaterThan100 -> Integer
evenGreaterThan100ToInteger (EvenGreaterThan100 n) = (toInteger n * 2) + 102
integerToEvenGreaterThan100 :: Integer -> Maybe EvenGreaterThan100
integerToEvenGreaterThan100 n
  | n < 102 = Nothing
  | otherwise = case (n - 102) `quotRem` 2 of
      (q, 0) -> Just (EvenGreaterThan100 (fromInteger q))
      (_, _) -> Nothing
Of course, this type seems completely ridiculous like this, and it is. But that’s because no real program needs “an even number greater than one hundred”. That’s just a random bag of arbitrary constraints! A real type would correspond to a domain concept, which would have a more useful name and a more useful API, anyway.

I wrote a followup blog post here that goes into more detail about this style of data modeling, with a few more examples: https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...
This is sort of true. It is a good technique, but it is a different technique. I went into how it is different in quite some detail in this followup blog post: https://lexi-lambda.github.io/blog/2020/11/01/names-are-not-...
I think a common belief among programmers is that the true constructive modeling approach presented in the first blog post is not practical in languages that aren’t Haskell, so they do the “smart constructor” approach discussed in the link above instead. However, I think that isn’t actually true, it’s just a difference in how respective communities think about their type systems. In fact, you can definitely do constructive data modeling in other type systems, and I gave some examples using TypeScript in this blog post: https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...
This is quite nice in situations where the type system already supports the refinement in question (which is true for this NonEmpty example), but it stops working as soon as you need to do something more complicated. I think sometimes programmers using languages where the TS-style approach is idiomatic can get a little hung up on that, since in those cases, they are more likely to blame the type system for being “insufficiently powerful” when in fact it’s just that the convenience feature isn’t sufficient in that particular case. I presented an example of one such situation in this followup blog post: https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...
Marshalling the data for platform traversal is also very wise. A library like Xalan/xerces using XSLT is very powerful, or something lightweight like the JSON/BSON parser in libbson.
Accordingly, one must assume the data is _always_ malformed, and assign a scoring system to the expected format at each stage of decoding. i.e. each service/function does a sanity check, then scores which data is critical, optional, and prohibited.
This way your infrastructure handles the case when (not if) someone tries to put Coffee Grounds in your garbage disposal unit. =)
This is similar, and is indeed quite useful in many cases, but it’s not quite the same. I explained why in this comment: https://news.ycombinator.com/item?id=35059886 (The comment is talking about TypeScript, but really everything there also applies to Java.)
Because you wrote a validation function, the exact thing OP told you not to do. Hooray?!
The goal of OP was to create a type that incorporates constraints on data, just like in their example about the non-empty list they created a type that in the type itself contains the constraints s.t. it's impossible to implement this type in a way that it will have an empty list.
You did the opposite. You created a type w/o any constraints whatsoever, and then added a validation function to it to make sure you only create values validated by that function. So... you kind of proved my point: it's nigh impossible to create a program intelligible to human beings that has a "prime number" type, and that's why we use validation -- it's easy to write, easy to understand.
Your type isn't even a natural number, let alone a prime number.
When I wrote GP I had in mind branding as the "right" way to get those benefits - though I was unaware of the name - however I see that it is still subject to the limitations of the TS compiler.
So then going back to the initial snippets, my issue is that the Prime type is essentially behaving like newtype, thus the inner calls can not actually rely on the value actually being prime, yes?
I have to admit that quite a few of the things in the blog are beyond my current understanding. Do you have any recommended reading for post grads with rudimentary understanding of Haskell who would like to get deeper into type systems?
See this fantastic reply by the author:
If I'm understanding the difference correctly, it's that the constructive data modeling approach can be proven entirely in the type system without any trust in the library code, while the Java approach I recommended depends on there being no other way to construct an instance of the class, which can be tricky to guarantee. Is that accurate?
validateEmail : String -> String -- post-condition: String contains valid email
whereas parse looks like:
parseEmail : String -> Either EmailError ValidEmail
There is no problem using the `ValidEmail` abstraction. The problem is type stability: when your program enters a stronger state at runtime (i.e. certain validations are performed at runtime), it's best to enter a stronger state at compile time too (stronger types) so that the compiler can verify these conditions. If you remain at String, these validations (that a string is a valid email) have no compile-time counterpart, so there is no way for the compiler to verify them. So use `ValidEmail` instead.

The problem with your `Prime` type is that it is just a type alias: a new way to refer to the exact same type. It’s totally interchangeable with `number`, so any `number` is necessarily also a `Prime`… which is obviously not very helpful. (As it happens, the Haskell equivalent of that would be basically identical, since Haskell also uses the `type` keyword to declare a type alias.)
As for recommended reading, it depends on what you’d like to know, really. There are lots of different perspectives on type systems, and there’s certainly a lot of stuff you can learn if you want to! But I think most working programmers probably don’t benefit terribly much from the theory (though it can certainly be interesting if you’re into that sort of thing). Perhaps you could tell me which things you specifically find difficult to understand? That would make it easier for me to provide suggestions, and it would also be useful to me, as I do try to make my blog posts as accessible as possible!
What on Earth are you talking about? What dynamic enforcement?
> Though I think it is a bit silly to suggest that I have “never even considered”
In the context of this conversation you showed no signs of such concerns. Had you have such concerns previously, you wouldn't have arrived at conclusions you apparently have.
> a dynamically-typed language
There's no such thing as dynamically-typed languages, just like there aren't blue or savory programming languages. Dynamically-typed is just a word combo that a lot of wannabe computer scientists are using, but there's no real meaning behind it. When "dynamic" is used in the context of types, it refers to the concrete type obtained during program execution, whereas "static" refers to the type that can be deduced w/o executing the program. For example, union types cannot be dynamic. Similarly, it's not possible to have generic dynamic types. Every language thus has dynamic and static types, except, in some cases, the static analysis of types isn't very useful because the types aren't expressive enough, or the verification is too difficult. Conversely, in some languages there's no mechanism to find out exact runtime types because the information about types is considered to be extraneous to the program and is removed from the runtime.
The division that wannabe computer scientists are thus trying to make between "dynamically-typed" and "statically-typed" lies roughly along the lines of "languages without useful static analysis method" and "languages that may be able to erase types from the runtime in most cases". Where "useful" and "most cases" are a matter of subjective opinion. Often times such boundaries lead claimers to ironically confusing conclusions, s.a. admitting that languages like Java aren't statically typed or that languages like Bash are statically typed and so on.
Note that "wannabe computer scientist" also applies to people with degrees in CS (I've met more than a dozen), some of whom had even published books on this subject. This only underscores the ridiculous state this field is in.
> discusses at some length that the technique isn’t always applicable.
This technique is not applicable to the overwhelming majority of everyday problems. It's so niche it doesn't warrant a discussion, but it's instead presented as a thing to strive for. It's not a useful approach, and at the moment there's no hope of making it useful.
Validation, on the other hand, is a very difficult subject, but I think that if we really want to deal with this problem, then TLA+, for example, is a good approach. But it's still too difficult for the average programmer. Datalog would be my second choice, which also seems appropriate for the general public. Maybe even something like XSL, which, in my view, lacks a small core that would allow one to construct it from first principles, but is still able to solve a lot of practical tasks when it comes to input validation.
ML-style types aren't competitive in this domain. They are very clumsy tools when it comes to expressing problems that programmers have to solve every day. We, as community, keep praising them because they are associated with the languages for the "enlightened" and thus must be the next best thing after sliced bread.
On what grounds did you decide that this is the requirement for validation? That's truly bizarre... Sometimes validating functions return booleans... but there's no general rule that they do.
Anyways, you completely missed the point OP was trying to make. Their idea was to include constraints on data (i.e. to ensure data validity) in the type associated with the data. You've done nothing of the kind: you created a random atomic type with a validation method. Your type isn't even a natural number, you definitely cannot add other natural number to it or to multiply etc...
Worse yet, you decided to go into a language with subtyping, which completely undermines all of your efforts, even if you were able to construct all of those overloads to make this type behave like a natural number: any other type that you create by inheriting from this class has the liberty to violate all the contracts you might have created in this class, but, through the definition of your language, it would still be valid to say that the subtype thus created is a prime number, even if it implements == in a way that it returns "true" when compared to 8 (only) :D
My own English parser is telling me it's actually four words, however.
Your mileage may validate that differently. ;)
Some dude comes up with another data definition language (DDL) that uses ML-style types. Everyone jumps from their seats in a standing ovation. And in the end we get another useless configuration language that cannot come anywhere close to the needs of application developers, and so they pedal away on their square-wheeled bicycles of hand-rolled, very custom data validation procedures.
This is even more disheartening because we have already created tools that made some very good progress toward systematic input validation. And they have been with us since the dawn of programming (well, almost: we've had SQL since the early '70s, then we also had Prolog, then various XML schema languages, and finally TLA+). It's amazing how people keep ignoring solutions that achieved so much, compared to ensuring that a list isn't empty... and yet present the latter as the way forward...
Perhaps you are one, perhaps you are not, I don’t know, but either way, you certainly write like one. If you want people to take you seriously, I think it would behoove you to adopt a more leveled writing style.
Many of the claims in your comment are absurd. I will not pick them apart one by one because I suspect it will do little to convince you. But for the benefit of other passing readers, I will discuss a couple points.
> What on Earth are you talking about? What dynamic enforcement?
SQL constraints are enforced at runtime, which is to say, dynamically. Static types are enforced without running the program. This is a real advantage.
> There's no such thing as dynamically-typed languages, just like there aren't blue or savory programming languages. […] The division that wannabe computer scientists are thus trying to make between "dynamically-typed" and "statically-typed" lies roughly along the lines of "languages without useful static analysis method" and "languages that may be able to erase types from the runtime in most cases".
I agree that the distinction is not black and white, and in fact I am on the record in various places as saying so myself (e.g. https://twitter.com/lexi_lambda/status/1219486514905862146). Java is a good example of a language with a very significant dynamic type system while also sporting a static type system. But it is certainly still useful to use the phrase “dynamically-typed language,” because normal people know what that phrase generally refers to. It is hardly some gotcha to point out that some languages have some of both, and there is certainly no need to insult my character.
> This technique is not applicable to overwhelming majority of everyday problems. It's so niche it doesn't warrant a discussion, but it's instead presented as a thing to strive for. It's not a useful approach and at the moment, there's no hope of making it useful.
This is simply not true. I know because I have done a great deal of real software engineering in which I have applied constructive data modeling extensively, to good effect. It would be silly to list them because it would simply be listing every single software project I have worked on for the past 5+ years. Perhaps you have not worked on problems where it has been useful. Perhaps you do not like the tradeoffs of the technique. Fine. But in this discussion, it’s ultimately just your word against mine, and many other people seem to have found the techniques quite useful—and not just in Haskell. Just look at Rust!
> Datalog would be my second choice, which also seems appropriate for general public.
The idea that datalog, a first-order relational query language, solves data validation problems (without further clarification) is so laughable that merely mentioning it reveals that you are either fundamentally unserious or wildly uninformed. It is okay to be either or both of those things, of course, but most people in that position do not have the arrogance and the foolishness to leave blustering comments making an ass of themselves on the subject on an internet forum.
Please be better.
My man... if I, the author of my program, was constructing the input, I wouldn't need no validation. Input isn't meant to be constructed by the program's author, it's supposed to be processed...
Parsing is an additional job on top of validation - providing type-level evidence that the data is good. That's what makes it valuable. It's not some theoretical difference in power. It's better software engineering.
I see you read 'narrowing type conversions' rather literally in my statement - that might be me making an analogy that doesn't go over very well. I literally mean that using 'constructively modeled types' is a way to create true type-narrowing conversions, in the sense that a 'nonempty list' is a narrower type than 'list', or 'one to five' is a narrower type than 'int'.
I find it quite interesting though I never had the time to study it further until now, so any recommendations are appreciated!
Parser combinators seem to be pretty rippin' fast for the most part, at least the ones I've used in OCaml and Rust.
> To some readers, these pitfalls may seem obvious, but safety holes of this sort are remarkably common in practice. This is especially true for datatypes with more sophisticated invariants, as it may not be easy to determine whether the invariants are actually upheld by the module’s implementation. Proper use of this technique demands caution and care:
> * All invariants must be made clear to maintainers of the trusted module. For simple types, such as NonEmpty, the invariant is self-evident, but for more sophisticated types, comments are not optional.
> * Every change to the trusted module must be carefully audited to ensure it does not somehow weaken the desired invariants.
> * Discipline is needed to resist the temptation to add unsafe trapdoors that allow compromising the invariants if used incorrectly.
> * Periodic refactoring may be needed to ensure the trusted surface area remains small. It is all too easy for the responsibility of the trusted module to accumulate over time, dramatically increasing the likelihood of some subtle interaction causing an invariant violation.
> In contrast, datatypes that are correct by construction suffer none of these problems. The invariant cannot be violated without changing the datatype definition itself, which has rippling effects throughout the rest of the program to make the consequences immediately clear. Discipline on the part of the programmer is unnecessary, as the typechecker enforces the invariants automatically. There is no “trusted code” for such datatypes, since all parts of the program are equally beholden to the datatype-mandated constraints.
They are both quite useful techniques, but it’s important to understand what you’re getting (and, perhaps more importantly, what you’re not).
Take for example APIGatewayProxyEvent [1], which has a property `queryStringParameters` with type:
export interface APIGatewayProxyEventQueryStringParameters {
[name: string]: string | undefined;
}
You can then create a branded type like type AuthCodeEvent = APIGatewayProxyEvent & {
queryStringParameters: {
code: string;
state: string;
};
};
The branded type here means that as soon as you verify that the event has the structure above, you can assume that it is correct in the code that handles these specific cases.

Though as the blog author mentioned in the other chain, the TS compiler is not particularly sound, so it's probably entirely possible to mess up the structure and break the type without the compiler knowing about it.
[1] https://github.com/DefinitelyTyped/DefinitelyTyped/blob/mast...
Or you get the ability to forge evidence (e.g. you use the evidence provided by a parser for one integer as evidence for another).
This works better for dependency injection scenarios (the Has* pattern).
For a given call or request, there's input, some work done with that input, and the result. (This is true, whether we're talking about a functional or imperative style.) Your code will have some structure that reflects the work to be done. You want to push your parsing toward the input if it's concerned with the input, and toward the result if it's concerned with the result.
Whether you want to call the processing closer to the input "upward", or "earlier" or whatever, that's fine with me. If you call the processing closer to the input and closer to the result both "upward" then I think it's not a useful metaphor and you should choose a different one.
To take your type as an example, you could imagine a function
validation : String -> Maybe FinalWidget
but maybe `validation` is really big and unwieldy and you want to reuse parts of it elsewhere, so you break it down into a pipeline of:

-- Let's say a RawWidget is, operationally, a non-empty string
validation0 : String -> Maybe RawWidget
-- Let's say a RefinedWidget is a string consisting only of capital letters
validation1 : RawWidget -> Maybe RefinedWidget
-- A FinalWidget is a non-empty string of capital letters that has no whitespace
validation2 : RefinedWidget -> Maybe FinalWidget
This is over-constrained. You don't really want to force yourself into a scenario where you must call validation0, then validation1, and finally validation2, because maybe in another code path it's more expedient to do it in another order. But the types don't line up if you do it in another order. And maybe you don't really care about `RawWidget` and `RefinedWidget`, but you're forced to create them just to make sure that you can build up to a `FinalWidget`. This is where dependent types would really help relax those constraints.
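To make the over-constraint concrete, here is a minimal sketch (assuming Maybe as the result type; the widget representations are hypothetical, chosen only to match the comments above). Kleisli composition gives you back a single `validation`, but only in that one fixed order:

import Control.Monad ((>=>))
import Data.Char (isSpace, isUpper)

-- Hypothetical representations matching the comments above.
newtype RawWidget     = RawWidget String      -- non-empty
newtype RefinedWidget = RefinedWidget String  -- non-empty, capital letters only
newtype FinalWidget   = FinalWidget String    -- as above, and no whitespace

validation0 :: String -> Maybe RawWidget
validation0 s = if null s then Nothing else Just (RawWidget s)

validation1 :: RawWidget -> Maybe RefinedWidget
validation1 (RawWidget s) =
  if all isUpper s then Just (RefinedWidget s) else Nothing

validation2 :: RefinedWidget -> Maybe FinalWidget
validation2 (RefinedWidget s) =
  if any isSpace s then Nothing else Just (FinalWidget s)

-- Only this order typechecks, which is exactly the over-constraint described.
validation :: String -> Maybe FinalWidget
validation = validation0 >=> validation1 >=> validation2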
(Certainly, Haskell is probably not the most concise language for this kind of thing. LiquidHaskell adds interesting proof capabilities wrt. arithmetic.)
Regardless, even just parsing at the boundary and using an opaque type
MySuperSpecialInt = Int @Even Min(2) Max(100)
(or whatever syntax you want) is still better than just using Int. At least you'll know that you'll always be handed a value that's in range (post-parsing).

Parse, Don't Validate (2019) - https://news.ycombinator.com/item?id=27639890 - June 2021 (270 comments)
Parse, Don’t Validate - https://news.ycombinator.com/item?id=21476261 - Nov 2019 (230 comments)
Parse, Don't Validate - https://news.ycombinator.com/item?id=21471753 - Nov 2019 (4 comments)
Anyway, validation/parsing is mostly pretty simple stuff where the "validate" bit is a simple function... and function composition works just fine.
(Assuming you can name the result type of your parse/validate individually according to your domain.)
In my experience you usually can't validate them all at the same time. For example, address. You usually don't validate that until after the customer has selected items, and then you find out that some items won't deliver to their area, so whereas you previously had a Valid basket, now it's an Invalid state.
They're created by something, but that something has more to do with a million blog posts and hallway conversations than it does with the formal RFC process. Certainly for most specifications of this kind, the working code came first and the specification was based largely on discovering what existing implementations did. If what the specification says is different from what HTTP clients send and HTTP servers understand, so much the worse for the specification.
I think your point of view would make more sense looking at the call stack — database access happens deeper than the code that handles the response, so you can’t push it “up” from there. And I mean, sure? But I don’t think that’s an inherently better frame than the one in which external sources are “upward” and your own application code is “downward”.
You only check it if it makes a difference to validity or not. There's no scenario where you keep the array and a parallel flag - either an empty array is invalid in which case you refuse to construct if it's empty, or an empty array is valid in which case you don't even check. Same thing for if you're checking whether it's got an even number of elements, got less than five elements, etc. - you don't keep the flag around, you refuse to construct your validated structure (even if that "validated structure" is actually just a marker wrapper around the raw array, if your type system isn't good enough to express the real constraint directly).
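A minimal Haskell sketch of that "marker wrapper" idea, using an even-length constraint as the example (the EvenList module and names are hypothetical). Because the constructor isn't exported, holding a value of the type is the evidence; no parallel flag exists to fall out of sync:

module EvenList (EvenList, mkEvenList, getEvenList) where

-- The constructor stays private; mkEvenList is the only way in.
newtype EvenList a = EvenList [a]

-- Refuse to construct when the invariant doesn't hold.
mkEvenList :: [a] -> Maybe (EvenList a)
mkEvenList xs
  | even (length xs) = Just (EvenList xs)
  | otherwise        = Nothing

getEvenList :: EvenList a -> [a]
getEvenList (EvenList xs) = xs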
I typically design code around things like this with a sum type like:
data Header = KnownHeader1 | KnownHeader2 | UnknownHeader String String
Then I typically don't offer any extra support or extended functionality for the cases where the type is `UnknownHeader`.

What context is it exactly where they don't matter?
I can tell you in practice, in the real world, they very much do.
> They are useless if you want to ensure that a given number is a prime number.
It's not useless. The point is that once you have a type `PrimeNumber` that can only be constructed after being validated, you can then write functions that exist in a reality where only PrimeNumbers exist.
https://play.haskell.org/saved/gRsNcCGo
> They are useless if you want to ensure that a given number is a prime number.
This is wrong. In the example above `addPrimes` will only take prime numbers.
As such if I make a Jira story that says "add multiply/subtract functions using the PrimeNumber type" I'll know that implementation is simplified by only being able to concern itself with prime numbers.
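For readers who don't open the playground link, here is a minimal sketch of the same idea (not the linked code itself; the module name and the naive primality test are just illustrative):

module Prime (PrimeNumber, mkPrime, addPrimes) where

-- The constructor is not exported, so mkPrime is the only way to obtain one.
newtype PrimeNumber = PrimeNumber Int
  deriving Show

mkPrime :: Int -> Maybe PrimeNumber
mkPrime n
  | isPrime n = Just (PrimeNumber n)
  | otherwise = Nothing
  where
    isPrime k = k > 1 && all (\d -> k `mod` d /= 0) [2 .. k - 1]

-- Downstream code never re-checks primality; it cannot be handed a non-prime.
addPrimes :: PrimeNumber -> PrimeNumber -> Int
addPrimes (PrimeNumber a) (PrimeNumber b) = a + b

Note that addPrimes returns a plain Int: the sum of two primes is generally not prime, so a multiply/subtract story would route results back through mkPrime.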
> Still, perhaps you are skeptical of parseNonEmpty’s name. Is it really parsing anything, or is it merely validating its input and returning a result? While the precise definition of what it means to parse or validate something is debatable, I believe parseNonEmpty is a bona-fide parser (albeit a particularly simple one).
> Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output.
The OP is saying that a validator is a function which doesn't return anything, whereas parsing is a function which returns data. (Or in other words, validation is when you keep passing around the data in the old type, and parsing is when you pass around a new type). It is true that there is code inside the parser which you can call "validation", but the OP is labeling the function based on its signature. This is made more obvious towards the end of the article:
> Use abstract datatypes to make validators "look like" parsers. Sometimes, making an illegal state truly unrepresentable is just plain impractical given the tools Haskell provides, such as ensuring an integer is in a particular range. In that case, use an abstract newtype with a smart constructor to "fake" a parser from a validator.
They are talking about the interface, not the implementation. They are saying that you should pass around a parsed type, even if it's only wrapping a raw value, because it carries proof that this data has been validated. They are saying that you shouldn't be validating this data in lots of different places.
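A small sketch of that interface-level distinction, with a hypothetical Age type (in a real module the Age constructor would not be exported). The check is identical; only the signature, and therefore what downstream code can rely on, changes:

-- Validation: the signature returns no new type, so nothing downstream
-- can tell whether the check was ever run.
validateAge :: Int -> Bool
validateAge n = n >= 0 && n <= 150

-- "Faked" parser: the same check inside, but it returns an abstract newtype,
-- so any function that demands an Age carries proof the check happened.
newtype Age = Age Int

parseAge :: Int -> Maybe Age
parseAge n
  | validateAge n = Just (Age n)
  | otherwise     = Nothing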
> It may not be immediately apparent what shotgun parsing has to do with validation—after all, if you do all your validation up front, you mitigate the risk of shotgun parsing. The problem is that validation-based approaches make it extremely difficult or impossible to determine if everything was actually validated up front or if some of those so-called “impossible” cases might actually happen. The entire program must assume that raising an exception anywhere is not only possible, it’s regularly necessary.
You end up with four choices:
1. Have a single function that does all the constraint checking at once
2. Have a single linear order where each constraint check feeds into the next but only in that order
3. Acquiesce to a combinatorial explosion of functions that check every possible combination of those constraints
4. Give up keeping track of the constraints at a type level.
Obviously not a closed API since the playground only gives you one module, but I wrote an example on the Haskell playground:
{-# LANGUAGE TypeApplications #-}

import Control.Applicative (Alternative)
import Control.Monad (guard)

safeDiv :: (Monad m, Alternative m) => Int -> Int -> m Int
safeDiv x y = do
guard (y /= 0)
pure (x `div` y)
main :: IO ()
main = do
print $ safeDiv @Maybe 1 0
print $ safeDiv @[] 1 0
-- print =<< safeDiv @IO 1 0 -- guard throws an error in IO
Try it out at https://play.haskell.org/saved/a6VsE3uQ

As for the difficulty in applying these ideas in other languages, I am sympathetic. The problem I always run into is that there is necessarily a tension between (a) presentations that are accessible to working programmers, (b) explanations that distill the essential ideas so they aren’t coupled to particular languages or language features, and (c) examples small enough to be clarifying and to fit in a blog post. Haskell is certainly not the best choice along that first axis, but it is quite exceptionally good along the second two.
For a somewhat concrete example of what I mean, see this comment I wrote a few years ago that translates the NonEmpty example into Java: https://news.ycombinator.com/item?id=21478322 I think the added verbosity and added machinery really does detract significantly from understanding. Meanwhile, a TypeScript translation would make a definition like this one quite tempting:
type NonEmpty<T> = [T, ...T[]]
However, I find this actually obscures application of the technique because it doesn’t scale to more complex examples (for the reasons I discussed at quite some length in https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...).

There are probably ways to thread this needle, but I don’t think any one “solution” is by any means obviously the best. I think the ways that other people have adapted the ideas to their respective ecosystems is probably a decent compromise.
If you’d like to learn Haskell, I think https://www.cis.upenn.edu/~cis1940/spring13/ is still a pretty nice resource. It is quick and to the point, and it provides some exercises to work through. There are lots of things in the Haskell ecosystem that you could explore if you wanted to after getting a handle on the basics.
If you want to learn about programming languages and type systems, you could read Programming Languages: Application and Interpretation (https://cs.brown.edu/courses/cs173/2012/book/), which has a chapter on type systems. Alternatively, if you want a more thorough treatment of type systems, you could read Types and Programming Languages by Benjamin Pierce. However, both PLAI and TAPL are textbooks, and they are primarily intended to be used as supporting material in a university course with an instructor. I think PLAI is relatively accessible, but TAPL is more likely to be a challenge without some existing background in programming languages.
Just wanted to say that fp-ts (now effect-ts, a ZIO port to TypeScript) author Giulio Canti is a great fan of your "parse don't validate" article. He's linked it many times in the TypeScript and functional programming channels (such as the fp slack).
Needless to say, both fp-ts-derived io-ts library and effect-ts library schema[1] are obviously quite advanced parsers (and in case of schema, there's decoding, encoding, APIs, guard, arbitrary and many other nice things I haven't seen in any functional language).
> You come off as a crank. [... because of X, Y, Z ]
..
> Please be better. [in the following manner: ... even if takes summarizing what was said]
(But I'm not an expert, admittedly, and it isn't an actual problem of much consequence in practical programming in Haskell or Scala. Opaque types do the 80% bit of 80-20 just fine.)
You can get very close with type-level sets, although at that point compile times probably go through the roof. You're basically emulating row types.
def wrapIntoRefined(str: String): Refined[String, Unit]
def validate0[A](str: Refined[String, A]): Either[Error, Refined[String, And[Condition0, A]]]
def validate1[A](str: Refined[String, A]): Either[Error, Refined[String, And[Condition1, A]]]
// This requires ordering Condition0 before Condition1 but if we resorted
// to a type-level set we could get around that problem
def process(input: Refined[String, And[Condition1, And[Condition0, Unit]]]): Unit
// But linearity is still required in some sense. We can't e.g. do our checks
// in a parallel fashion. You still need to pipe one function right after another
The central problem is if you have two validation functions

def validate0(str: String): Refined[String, Condition0]
def validate1(str: String): Refined[String, Condition1]

and you try to recombine them downstream, you don't know that `Refined[String, Condition0]` and `Refined[String, Condition1]` actually refer to the same underlying `String`. They could be refined on two completely separate strings. To tie them to a single runtime String requires dependent types.

You can approximate this in Scala with path-dependent types, but it's very brittle and breaks in all sorts of ways.
> isn't an actual problem of much consequence in practical programming in Haskell or Scala. Opaque types do the 80% bit of 80-20 just fine.
I think this is only true because there isn't a production-ready dependently typed language to show how to use these patterns effectively. In much the same way that "parse don't validate" isn't really much of a problem of consequence in older style Java code because sum types aren't really a thing, if there was an ergonomic way of taking advantage of it, I firmly believe these sorts of dependently typed tagged types would show up all over the place.
Sure, I agree (or perhaps: what's considered "best practice"; or whatever our existing codebase is doing)
> there's a lot of discussions about if booleans, Eithers/Optionals or exceptions should be used
That's just an implementation detail, and misses the point. For example, all of those can be used to 'validate'; e.g.
- A function/method 'v1: JSON -> Boolean'
- A function/method 'v2: JSON -> JSON', which may throw exceptions
- A function/method 'v3: JSON -> Optional JSON'
- A function/method 'v4: JSON -> Either Error JSON'
The reason these are all bad has nothing to do with the language features or error-handling mechanisms employed. The reason they are bad is that they are all completely unnecessary.
For example, here are a bunch of programs which use the above validators. They're all essentially equivalent, and hence have the same fundamental flaw:
function trigger1(userInput: JSON) {
if (!v1(userInput)) {
print "UNAUTHORISED, ABORTING"
sys.exit(1)
}
else {
launchMissiles(authorisation=userInput)
}
}
function trigger2(userInput: JSON) {
try {
launchMissiles(authorisation=v2(userInput))
}
catch {
print "UNAUTHORISED, ABORTING"
sys.exit(1)
}
}
function trigger3(userInput: JSON) {
v3(userInput) match {
case None => {
print "UNAUTHORISED, ABORTING"
sys.exit(1)
}
case Some(validated) => {
launchMissiles(authorisation=validated)
}
}
}
function trigger4(userInput: JSON) {
v4(userInput) match {
case Left(error) => {
print ("UNAUTHORISED, ABORTING: " + error)
sys.exit(1)
}
case Right(validated) => {
launchMissiles(authorisation=validated)
}
}
}
The reason they're all flawed is that validation can be skipped. In other words, you can write any validation logic; implemented with any mechanism you like; in any language; but your colleague's code might never call it! All of the above 'trigger' functions could be replaced by this, and it will still work:

function trigger(userInput: JSON) {
launchMissiles(authorisation=userInput)
}
In contrast, the 'parse' approach cannot be skipped. Here are some examples:

- A function/method 'p1: JSON -> Either Error MyJSON'
- A function/method 'p2: JSON -> Optional MyJSON'
- A function/method 'p3: JSON -> MyJSON', which may throw exceptions
Here are their corresponding 'trigger' functions:
function trigger5(userInput: JSON) {
p1(userInput) match {
case Left(error) => {
print ("UNAUTHORISED, ABORTING: " + error)
sys.exit(1)
}
case Right(parsed) => {
launchMissiles(authorisation=parsed)
}
}
}
function trigger6(userInput: JSON) {
p2(userInput) match {
case None => {
print "UNAUTHORISED, ABORTING"
sys.exit(1)
}
case Some(parsed) => {
launchMissiles(authorisation=parsed)
}
}
}
function trigger7(userInput: JSON) {
try {
launchMissiles(authorisation=p3(userInput))
}
catch (error) {
print ("UNAUTHORISED, ABORTING: " + error)
sys.exit(1)
}
}
These alternatives are much safer, since the 'launchMissiles' function now takes a 'MyJSON' value as argument; so we can't do `launchMissiles(authorisation=userInput)` (since 'userInput' has the type JSON, which isn't a valid input). Our colleagues cannot skip or forget to call these p1/p2/p3 functions, since that's the only way they can turn the 'userInput' value they have into a 'MyJSON' value they need.

> I don't want to implement that method "checkAgainstMySchema". Because I know there's already a library for that.
No, there isn't. I think you may be confused about how such 'parser' functions should be implemented. Nobody is saying to ignore existing libraries, or roll our own JSON grammars, or whatever. It's purely about how your project's datatypes are constructed. For example, something like this:
function parseMyThing(json: JSON) {
if (SomeExistingJSONSchemaLibrary.validate(json, SomeParticularSchemaMyApplicationIsUsing)) {
return Right(SomeDatatypeIHaveWritten(...))
}
else {
// Or use exceptions, or Optional, or whatever; it doesn't matter
return Left("Invalid")
}
}
(If all of your project's datatypes, schemas, class, etc. were already provided by some existing library, then that project would be a bit pointless!)