
Parse, Don't Validate (2019)

(lexi-lambda.github.io)
389 points by melse | 8 comments
1. wodenokoto ◴[] No.27640366[source]
When I think of validation I think of receiving a data file and checking that all rows and columns are correct and generating a report about all the problems.

Does my thing have a different name? Where can I read up on how to do that best?

replies(4): >>27640640 #>>27640784 #>>27640888 #>>27642921 #
2. quickthrower2 ◴[] No.27640640[source]
I thought of input validation for web forms. Similar thing, I guess. In Haskell you can create a type that you know represents a validated email address, but you still need a validation function of type String -> Maybe Email to actually validate it at runtime.
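The same idea can be sketched in Python with a NewType and a smart constructor that plays the role of String -> Maybe Email (the names and the deliberately naive regex here are illustrative, not from the article):

```python
import re
from typing import NewType, Optional

# Email is just a str at runtime, but a type checker treats it as distinct,
# so only code that went through parse_email can produce one.
Email = NewType("Email", str)

# Deliberately simple pattern; real email validation is far more involved.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def parse_email(s: str) -> Optional[Email]:
    """The moral equivalent of String -> Maybe Email."""
    return Email(s) if _EMAIL_RE.match(s) else None
```

The convention that Email values are only ever created inside parse_email is what makes the type meaningful downstream.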
replies(2): >>27640914 #>>27641008 #
3. foota ◴[] No.27640784[source]
Data validation?
4. rmnclmnt ◴[] No.27640888[source]
You can use the now widely adopted Great Expectations[0] library, which fits exactly this use case for data validation!

[0] https://greatexpectations.io

replies(1): >>27642801 #
5. jacoblambda ◴[] No.27640914[source]
That's just a parser, though. As described in the post, parsers can fail, but importantly, when they succeed they always pass along the result. Validation functions, on the other hand, only confirm that the data is valid.

The argument is that if you need to interact with or operate on some data, you shouldn't design functions that merely validate the data, but rather functions that render it into a useful output with well-defined behaviour.
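A minimal sketch of the distinction, using dates as the example (function names are my own):

```python
from datetime import datetime

def validate_date(s: str) -> bool:
    """Validator: answers yes/no and throws the parsed value away."""
    try:
        datetime.fromisoformat(s)
        return True
    except ValueError:
        return False

def parse_date(s: str) -> datetime:
    """Parser: may fail (raises ValueError), but on success it hands
    the caller a usable datetime instead of just a verdict."""
    return datetime.fromisoformat(s)
```

After validate_date returns True the caller still holds an unparsed string and has to do the conversion again; parse_date does the work exactly once.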

6. WJW ◴[] No.27641008[source]
I think for the use case GP gives, it'd be even better to have a function `String -> Either (LineNumber, String, [Problem]) Email`, so that you can report back which of the lines had problems and what kind of problems. For web form validation you can skip the line number, but it'd still be useful to keep the list of problems, so that you can report back to the user what about their input did not conform to expectations.
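A rough Python rendering of that Either type, with a tuple standing in for the error side (the Problem strings and checks are placeholders of my own):

```python
from typing import List, Tuple, Union

Problem = str  # e.g. "missing @", "empty local part"
# (line number, raw input, accumulated problems)
ParseError = Tuple[int, str, List[Problem]]

def parse_email_line(lineno: int, s: str) -> Union[ParseError, str]:
    """Sketch of String -> Either (LineNumber, String, [Problem]) Email:
    on failure you get back everything needed for a useful report."""
    problems: List[Problem] = []
    if "@" not in s:
        problems.append("missing @")
    elif s.startswith("@"):
        problems.append("empty local part")
    if problems:
        return (lineno, s, problems)
    return s
```

The caller can then partition a file's lines into usable emails and per-line error reports in a single pass.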
7. wodenokoto ◴[] No.27642801[source]
Thanks for the link. It looks really nice.

I see they’ve raised a lot of money. Does anyone know what their revenue model is?

8. geofft ◴[] No.27642921[source]
I think that is in fact validating in the sense that the article means it.

Here's validating a CSV in Python (which I'm using because it's a language that's, well, less excited about types than the author's choice of Haskell, to show that the principle still applies):

    import csv
    import datetime

    def validate_data(filename):
        errors = []
        reader = csv.DictReader(open(filename))
        for row in reader:
            try:
                date = datetime.datetime.fromisoformat(row["date"])
            except ValueError:
                errors.append(("Invalid date", row))
                continue  # can't do the later checks without a parsed date
            if date < datetime.datetime(2021, 1, 1):
                errors.append(("Last year's data", row))
            # etc.
        return errors

    def actually_work_with_data(filename):
        reader = csv.DictReader(open(filename))
        for row in reader:
            try:
                date = datetime.datetime.fromisoformat(row["date"])
            except ValueError:
                raise Exception("Wait, didn't you validate this already???")
            # etc.
Yes, it's a kind of silly example, but the validation routine is already doing the work of getting the data into the form you want, and now you have a DRY problem. What happens if you start accepting additional time formats in validate_data but you forget to teach actually_work_with_data to do the same thing?

The insight is that the work of reporting errors in the data is exactly the same as the work of getting non-erroneous data into a usable form. If a row of data doesn't have an error, that means it's usable; if you can't turn it into a directly usable format, that necessarily means it has some sort of error.

So what you want is a function that takes the data and does both of these at the same time, because it's actually just a single task.

In a language like Haskell or Rust, there's a built-in type for "either a result or an error", and the convention is to pass errors back as data. In a language like Python, there isn't a similar concept and the convention is to pass errors as exceptions. Since you want to accumulate all the errors, I'd probably just put them into a separate list:

    import attr

    @attr.s(auto_attribs=True)  # or @dataclasses.dataclass, whichever
    class Order:
        name: str
        date: datetime.datetime
        ...

    def parse(filename):
        data = []
        errors = []
        reader = csv.DictReader(open(filename))
        for row in reader:
            try:
                date = datetime.datetime.fromisoformat(row["date"])
            except ValueError:
                errors.append(("Invalid date", row))
                continue
            if date < datetime.datetime(2021, 1, 1):
                errors.append(("Last year's data", row))
                continue
            # etc.
            data.append(Order(name=row["name"], date=date, ...))
        return data, errors
And then all the logic of working with the data, whether to actually use it or to report errors, is in one place. Both your report of bad data and your actually_work_with_data function call the same routine. Your actual code doesn't have to parse fields in the CSV itself; that's already been done by what used to be the validation code. It gets a list of Order objects, and unlike a dictionary from DictReader, you know that an Order object is usable without further checks.

(The author talks about "Use a data structure that makes illegal states unrepresentable"; this isn't quite doable in Python, where you can generally put whatever you want in an object, but if you follow the discipline that only the parse() function generates new Order objects, then it's effectively true in practice.)

And if your file format changes, you make the change in one spot; you've kept the code DRY.
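Here's the whole pattern end to end as a runnable miniature, using an in-memory CSV and a dataclass standing in for attrs (a compressed version of the parse() above; field set and sample data are my own):

```python
import csv
import datetime
import io
from dataclasses import dataclass

@dataclass
class Order:
    name: str
    date: datetime.datetime

def parse(lines):
    """Single pass that both converts good rows and accumulates errors."""
    data, errors = [], []
    for row in csv.DictReader(lines):
        try:
            date = datetime.datetime.fromisoformat(row["date"])
        except ValueError:
            errors.append(("Invalid date", row))
            continue
        data.append(Order(name=row["name"], date=date))
    return data, errors

csv_text = "name,date\nalice,2021-06-01\nbob,not-a-date\n"
orders, errors = parse(io.StringIO(csv_text))
# orders[0].date is already a datetime, no re-checking needed;
# errors carries everything a report of bad rows would want to print.
```

Both the error report and the real work consume the output of the same parse call, so a format change only ever touches that one function.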