Parse, don't validate (2019)

(lexi-lambda.github.io)

398 points declanhaigh | 3 comments | 07 Mar 23 08:47 UTC | HN request time: 0.834s | source

Show context

bruce343434 ◴[07 Mar 23 10:54 UTC] No.35053912[source]▶

Note that this basically requires your language to have ergonomic support for sum types, immutable "data classes", pattern matching.

The point is to parse the input into a structure which always upholds the predicates you care about so you don't end up continuously defensively programming in ifs and asserts.

replies(12): >>35054046 #>>35054070 #>>35054386 #>>35054514 #>>35054901 #>>35054993 #>>35055124 #>>35055230 #>>35056047 #>>35057866 #>>35058185 #>>35059271 #

crabbone ◴[07 Mar 23 12:18 UTC] No.35054514[source]▶

>>35053912 #

It's not just about these limitations.

In order to be useful, type systems need to be simple, but there's no such restrictions on rules that govern our expectations of data correctness.

OP is delusional if they think that their approach can be made practical. I mean, what if the expectation from the data that an value is a prime number? -- How are they going to encode this in their type systems? And this is just a trivial example.

There are plenty of useful constraints we routinely expect in message exchanges that aren't possible to implement using even very elaborate type systems. For example, if we want to ensure that all ids in XML nodes are unique. Or that the last digit of SSN is a checksum of the previous digits using some complex formula. I mean, every Web developer worth their salt knows that regular expressions are a bad idea for testing email addresses (which would be an example of parsing), and it's really preferable to validate emails by calling a number of predicates on them.

And, of course, these aren't the only examples: password validation (the annoying part that asks for capital letter, digit, special character? -- I want to see the author implement a parser to parse possible inputs to password field, while also giving helpful error messages s.a. "you forgot to use a digit"). Even though I don't doubt it's possible to do that, the resulting code would be an abomination compared to the code that does the usual stuff, i.e. just checks if a character is in a set of characters.

replies(10): >>35054557 #>>35054562 #>>35054640 #>>35054916 #>>35054920 #>>35055046 #>>35055734 #>>35055902 #>>35056302 #>>35057473 #

thanatropism ◴[07 Mar 23 13:10 UTC] No.35054920[source]▶

>>35054514 #

The spirit of "parse, don't validate" is -- do all those things (SSN checksums, whatever) at the point where data enters the system, not at the point where it's used.

replies(1): >>35056790 #

crabbone ◴[07 Mar 23 15:56 UTC] No.35056790[source]▶

>>35054920 #

That's not how OP used it. OP wants to express constraints on data through ML-style type system. They didn't even consider obvious competition s.a. systems like SQL constraints or XSL.

I have no problems with the way you want to interpret this claim. But, really, I'm responding to the article linked to this thread, which isn't about at which point in application to perform the said validation or parsing.

replies(1): >>35059337 #

lexi-lambda ◴[07 Mar 23 18:45 UTC] No.35059337[source]▶

>>35056790 #

There is a fairly obvious difference between a dynamic enforcement mechanism like contracts or SQL constraints. Though I think it is a bit silly to suggest that I have “never even considered” such things given the blog post itself is rendered using Racket, a dynamically-typed language with a fairly sophisticated higher-order contract system.

SQL constraints are certainly useful. But they don’t really solve the same problem. SQL constraints ensure integrity of your data store, which is swell, but they don’t provide the same guarantees about your program that static types do, nor do they say much at all about how to structure your code that interacts with the database. I also think it is sort of laughable to claim that XSL is a good tool for solving just about any data processing problem in 2023, but even if you disagree, the same points apply.

Obviously, constructive data modeling is hardly a panacea. There are lots of problems it does not solve or that are more usefully solved in other ways. But I really have applied it to very good effect on many, many real engineering problems, not just toys, and I think the technique provides a nice framework for reasoning about data modeling in many scenarios. Your comments here seem almost bafflingly uncharitable given the article in question doesn’t make any absolutist claims and in fact discusses at some length that the technique isn’t always applicable.

See also: my other comment about using encapsulation instead of constructive modeling (https://news.ycombinator.com/item?id=35059113) and my followup blog post about how more things can be encoded using constructive data modeling than perhaps you think (https://lexi-lambda.github.io/blog/2020/08/13/types-as-axiom...).

replies(1): >>35060678 #

1. crabbone ◴[07 Mar 23 20:24 UTC] No.35060678[source]▶

>>35059337 #

> There is a fairly obvious difference between a dynamic enforcement mechanism like contracts or SQL constraints.

What on Earth are you talking about? What dynamic enforcement?

> Though I think it is a bit silly to suggest that I have “never even considered”

In the context of this conversation you showed no signs of such concerns. Had you have such concerns previously, you wouldn't have arrived at conclusions you apparently have.

> a dynamically-typed language

There's no such thing as dynamically-typed languages, just like there aren't blue or savory programming languages. Dynamically-typed is just a word combo that a lot of wannabe computer scientists are using, but there's no real meaning behind it. When "dynamic" is used in the context of types, it refers to the concrete type obtained during program execution, whereas "static" refers to the type that can be deduced w/o executing the program. For example, union types cannot be dynamic. Similarly, it's not possible to have generic dynamic types. Every language thus has dynamic and static types, except, in some cases, the static analysis of types isn't very useful because the types aren't expressive enough, or the verification is too difficult. Conversely, in some languages there's no mechanism to find out exact runtime types because the information about types is considered to be extraneous to the program and is removed from the runtime.

The division that wannabe computer scientists are thus trying to make between "dynamically-typed" and "statically-typed" lies roughly along the lines of "languages without useful static analysis method" and "languages that may be able to erase types from the runtime in most cases". Where "useful" and "most cases" are a matter of subjective opinion. Often times such boundaries lead claimers to ironically confusing conclusions, s.a. admitting that languages like Java aren't statically typed or that languages like Bash are statically typed and so on.

Note that "wannabee computer scientist" applies also to people with degrees in CS (I've met more than a dozen), some had even published books on this subject. This only underscores the ridiculous state in which this field is.

> discusses at some length that the technique isn’t always applicable.

This technique is not applicable to overwhelming majority of everyday problems. It's so niche it doesn't warrant a discussion, but it's instead presented as a thing to strive for. It's not a useful approach and at the moment, there's no hope of making it useful.

Validation, on the other hand, is a very difficult subject, but, I think that if we really want to deal with this problem, then TLA+ is a good approach for example. But it's still too difficult to the average programmer. Datalog would be my second choice, which also seems appropriate for general public. Maybe even something like XSL, which, in my view lacks a small core that would allow one to construct it from first principles, but it's still able to solve a lot of practical tasks when it comes to input validation.

ML-style types aren't competitive in this domain. They are very clumsy tools when it comes to expressing problems that programmers have to solve every day. We, as community, keep praising them because they are associated with the languages for the "enlightened" and thus must be the next best thing after sliced bread.

replies(1): >>35060984 #

2. lexi-lambda ◴[07 Mar 23 20:47 UTC] No.35060984[source]▶

>>35060678 (TP) #

You come off as a crank.

Perhaps you are one, perhaps you are not, I don’t know, but either way, you certainly write like one. If you want people to take you seriously, I think it would behoove you to adopt a more leveled writing style.

Many of the claims in your comment are absurd. I will not pick them apart one by one because I suspect it will do little to convince you. But for the benefit of other passing readers, I will discuss a couple points.

> What on Earth are you talking about? What dynamic enforcement?

SQL constraints are enforced at runtime, which is to say, dynamically. Static types are enforced without running the program. This is a real advantage.

> There's no such thing as dynamically-typed languages, just like there aren't blue or savory programming languages. […] The division that wannabe computer scientists are thus trying to make between "dynamically-typed" and "statically-typed" lies roughly along the lines of "languages without useful static analysis method" and "languages that may be able to erase types from the runtime in most cases".

I agree that the distinction is not black and white, and in fact I am on the record in various places as saying so myself (e.g. https://twitter.com/lexi_lambda/status/1219486514905862146). Java is a good example of a language with a very significant dynamic type system while also sporting a static type system. But it is certainly still useful to use the phrase “dynamically-typed language,” because normal people know what that phrase generally refers to. It is hardly some gotcha to point out that some languages have some of both, and there is certainly no need to insult my character.

> This technique is not applicable to overwhelming majority of everyday problems. It's so niche it doesn't warrant a discussion, but it's instead presented as a thing to strive for. It's not a useful approach and at the moment, there's no hope of making it useful.

This is simply not true. I know because I have done a great deal of real software engineering in which I have applied constructive data modeling extensively, to good effect. It would be silly to list them because it would simply be listing every single software project I have worked on for the past 5+ years. Perhaps you have not worked on problems where it has been useful. Perhaps you do not like the tradeoffs of the technique. Fine. But in this discussion, it’s ultimately just your word against mine, and many other people seem to have found the techniques quite useful—and not just in Haskell. Just look at Rust!

> Datalog would be my second choice, which also seems appropriate for general public.

The idea that datalog, a first-order relational query language, solves data validation problems (without further clarification) is so laughable that merely mentioning it reveals that you are either fundamentally unserious or wildly uninformed. It is okay to be either or both of those things, of course, but most people in that position do not have the arrogance and the foolishness to leave blustering comments making an ass of themselves on the subject on an internet forum.

Please be better.

replies(1): >>35070695 #

3. thanatropism ◴[08 Mar 23 16:05 UTC] No.35070695[source]▶

>>35060984 #

Since by now we're arguing style, you would achieve your purposes (whatever they are) if you completed certain aggressive sentences, for example:

> You come off as a crank. [... because of X, Y ,Z ]

> Please be better. [in the following manner: ... even if takes summarizing what was said]

↑