Ask HN: How are you cleaning and transforming data before imports/uploads?

This was roughly half my job a decade ago. Literally because I split my time 50/50 between two separate projects, one of which was a business analytics automation project that was designed for exactly this purpose. Take all of the data exported from the various project management and issue tracking systems used by different product teams, transform as necessary, then import to a common corporate analytics system upper management used to track progress metrics.

The idea was to replace all of the bespoke custom Excel workflows that various program management teams used to do it themselves, so we were self-funding based on the estimated cost savings of replacing manual work, meaning I had pretty much free reign to do it however I wanted as long as I could get people to use it. I did it entirely in Python, making heavy use of Pandas, creating a toolchain that ingested mapping definitions from the users and relied on fuzzy matching (i.e. stemming, de-capitalization, recognizing word roots) to categorize field names into a common definition set used by the corporate system.

It wasn't a 100% solution, more like 95%, but the reality of this kind of work is we had a real garbage-in problem anyway that no software could solve. Extracting data from an issue tracker relies upon developers using it correctly in the first place. If you have a bunch of workflow transitions but whoever works an issue just transitions them all at once when work is complete, there's no way to know how long everything actually took. You can only extract information that was actually recorded. Coercing dates into a single format standard is fairly simple to do, but if the dates were wrong in the original format, they'll still be wrong in the corrected format.