Ask HN: How are you cleaning and transforming data before imports/uploads?

I transform hundreds of tabular sources. For the cleaning / transformation, I found that a very small number of transformations is required, and that we need to review them as a team including business owners. So, I wrote a simple grammar that is very English-like; that gets translated into Polars operations under the covers in Python. It covers 98% + of my ingestion needs, and means that we focus on the needs of the logical data transformations as a team. Business users can easily make changes for sources they manage.

One of the concepts is a “map”, for old values to new values. Those we keep in Excel in Git, so that business users can edit / maintain them. Being Excel, we’re careful to validate the import of those rules when we do a run, mainly to indicate where there’s been a lot of change to identify where there might be an unintended change. Excel makes me nervous in data processing work in general (exploration with Pivots is great, though I’ve moved to Visidata as my first tool of choice). But for years of running in this way we’ve worked around Excel lax approach to data, such as interpreting numerical ID fields as numbers rather than strings.

For output “rendering”, because everything is in Polars, we can most frequently simply output to CSV. We use Jinja for some funky cases.