
Parse, Don't Validate (2019)

(lexi-lambda.github.io)
389 points by melse | 8 comments
1. wodenokoto ◴[] No.27640366[source]
When I think of validation I think of receiving a data file and checking that all rows and columns are correct and generating a report about all the problems.

Does my thing have a different name? Where can I read up on how to do that best?

replies(4): >>27640640 #>>27640784 #>>27640888 #>>27642921 #
2. quickthrower2 ◴[] No.27640640[source]
I thought of input validation for web forms. Similar thing, I guess. In Haskell you can create a type that you know represents a validated email address, but you still need a validation function of type String -> Maybe Email to actually validate it at runtime.
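The same idea can be sketched in Python with a NewType and a smart constructor that plays the role of String -> Maybe Email (the names and the deliberately naive regex here are illustrative, not from the article):

```python
import re
from typing import NewType, Optional

# Email is just a str at runtime, but a type checker treats it as distinct,
# so only code that went through parse_email can produce one.
Email = NewType("Email", str)

# Deliberately simple pattern; real email validation is far more involved.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def parse_email(s: str) -> Optional[Email]:
    """The moral equivalent of String -> Maybe Email."""
    return Email(s) if _EMAIL_RE.match(s) else None
```

The convention that Email values are only ever created inside parse_email is what makes the type meaningful downstream.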
replies(2): >>27640914 #>>27641008 #
3. foota ◴[] No.27640784[source]
Data validation?
4. rmnclmnt ◴[] No.27640888[source]
You can use the now widely adopted Great Expectations[0] library, which fits exactly this use case for data validation!

[0] https://greatexpectations.io

replies(1): >>27642801 #
5. jacoblambda ◴[] No.27640914[source]
That's just a parser, though. As described in the post, parsers can fail, but importantly, when they succeed they always pass along the result. Validation functions, on the other hand, only confirm that the data is valid.

The argument is that if you need to interact with or operate on some data, you shouldn't design functions that merely validate the data, but rather functions that render it into a useful output with well-defined behaviour.
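A minimal sketch of the distinction, using dates as the example (function names are my own):

```python
from datetime import datetime

def validate_date(s: str) -> bool:
    """Validator: answers yes/no and throws the parsed value away."""
    try:
        datetime.fromisoformat(s)
        return True
    except ValueError:
        return False

def parse_date(s: str) -> datetime:
    """Parser: may fail (raises ValueError), but on success it hands
    the caller a usable datetime instead of just a verdict."""
    return datetime.fromisoformat(s)
```

After validate_date returns True the caller still holds an unparsed string and has to do the conversion again; parse_date does the work exactly once.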

6. WJW ◴[] No.27641008[source]
I think for the use case GP gives, it'd be even better to have a function `String -> Either (LineNumber, String, [Problem]) Email`, so that you can report back which of the lines had problems and what kind of problems. For web form validation you can skip the line number, but it'd still be useful to keep the list of problems, so that you can report back to the user what about their input did not conform to expectations.
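A rough Python rendering of that Either type, with a tuple standing in for the error side (the Problem strings and checks are placeholders of my own):

```python
from typing import List, Tuple, Union

Problem = str  # e.g. "missing @", "empty local part"
# (line number, raw input, accumulated problems)
ParseError = Tuple[int, str, List[Problem]]

def parse_email_line(lineno: int, s: str) -> Union[ParseError, str]:
    """Sketch of String -> Either (LineNumber, String, [Problem]) Email:
    on failure you get back everything needed for a useful report."""
    problems: List[Problem] = []
    if "@" not in s:
        problems.append("missing @")
    elif s.startswith("@"):
        problems.append("empty local part")
    if problems:
        return (lineno, s, problems)
    return s
```

The caller can then partition a file's lines into usable emails and per-line error reports in a single pass.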
7. wodenokoto ◴[] No.27642801[source]
Thanks for the link. It looks really nice.

I see they’ve raised a lot of money. Does anyone know what their revenue model is?

8. geofft ◴[] No.27642921[source]
I think that is in fact validating in the sense that the article means it.

Here's validating a CSV in Python (which I'm using because it's a language that's, well, less excited about types than the author's choice of Haskell, to show that the principle still applies):

    import csv
    import datetime

    def validate_data(filename):
        errors = []
        reader = csv.DictReader(open(filename))
        for row in reader:
            try:
                date = datetime.datetime.fromisoformat(row["date"])
            except ValueError:
                errors.append(("Invalid date", row))
                continue  # can't do the later checks without a parsed date
            if date < datetime.datetime(2021, 1, 1):
                errors.append(("Last year's data", row))
            # etc.
        return errors

    def actually_work_with_data(filename):
        reader = csv.DictReader(open(filename))
        for row in reader:
            try:
                date = datetime.datetime.fromisoformat(row["date"])
            except ValueError:
                raise Exception("Wait, didn't you validate this already???")
            # etc.
Yes, it's a kind of silly example, but the validation routine is already doing the work of getting the data into the form you want, and now you have a DRY problem. What happens if you start accepting additional time formats in validate_data but you forget to teach actually_work_with_data to do the same thing?

The insight is that the work of reporting errors in the data is exactly the same as the work of getting non-erroneous data into a usable form. If a row of data doesn't have an error, that means it's usable; if you can't turn it into a directly usable format, that necessarily means it has some sort of error.

So what you want is a function that takes the data and does both of these at the same time, because it's actually just a single task.

In a language like Haskell or Rust, there's a built-in type for "either a result or an error", and the convention is to pass errors back as data. In a language like Python, there isn't a similar concept and the convention is to pass errors as exceptions. Since you want to accumulate all the errors, I'd probably just put them into a separate list:

    import attr

    @attr.s(auto_attribs=True)  # or @dataclasses.dataclass, whichever
    class Order:
        name: str
        date: datetime.datetime
        ...

    def parse(filename):
        data = []
        errors = []
        reader = csv.DictReader(open(filename))
        for row in reader:
            try:
                date = datetime.datetime.fromisoformat(row["date"])
            except ValueError:
                errors.append(("Invalid date", row))
                continue
            if date < datetime.datetime(2021, 1, 1):
                errors.append(("Last year's data", row))
                continue
            # etc.
            data.append(Order(name=row["name"], date=date, ...))
        return data, errors
And then all the logic of working with the data, whether to actually use it or to report errors, is in one place. Both your report of bad data and your actually_work_with_data function call the same routine. Your actual code doesn't have to parse fields in the CSV itself; that's already been done by what used to be the validation code. It gets a list of Order objects, and unlike a dictionary from DictReader, you know that an Order object is usable without further checks.

(The author talks about "Use a data structure that makes illegal states unrepresentable"; this isn't quite doable in Python, where you can generally put whatever you want in an object, but if you follow the discipline that only the parse() function generates new Order objects, then it's effectively true in practice.)

And if your file format changes, you make the change in one spot; you've kept the code DRY.
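Here's the whole pattern end to end as a runnable miniature, using an in-memory CSV and a dataclass standing in for attrs (a compressed version of the parse() above; field set and sample data are my own):

```python
import csv
import datetime
import io
from dataclasses import dataclass

@dataclass
class Order:
    name: str
    date: datetime.datetime

def parse(lines):
    """Single pass that both converts good rows and accumulates errors."""
    data, errors = [], []
    for row in csv.DictReader(lines):
        try:
            date = datetime.datetime.fromisoformat(row["date"])
        except ValueError:
            errors.append(("Invalid date", row))
            continue
        data.append(Order(name=row["name"], date=date))
    return data, errors

csv_text = "name,date\nalice,2021-06-01\nbob,not-a-date\n"
orders, errors = parse(io.StringIO(csv_text))
# orders[0].date is already a datetime, no re-checking needed;
# errors carries everything a report of bad rows would want to print.
```

Both the error report and the real work consume the output of the same parse call, so a format change only ever touches that one function.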