
36 points by dataflowmapper | 2 comments

Hi all,

I’m curious how folks handle the prep work for data imports/uploads into systems like Salesforce, Workday, NetSuite, or really any app that uses template-based import for data loading, migration, or implementation.

Specifically:

- How do you manage conversions/transformations like formatting dates, getting everything aligned with the templates, mapping old codes to new ones, etc.?

- Are you primarily using Excel, custom scripts, Power Query or something else?

- What are the most tedious/painful parts of this process and what have you found that works?

Really appreciate any insights and am curious to learn from everyone's experience.

chaos_emergent:
As I sit in front of my computer waiting for a transformation-for-import job to complete, I can describe my basic workflow:

1. define a clean interface target - for me, that's the interface I built for my startup to import call data.

2. explore the data a little to get a sense of transformation mappings.

3. create a PII-redacted version of the file, upload it to ChatGPT along with the shape of my interface, and ask it to write a transformation script in Python (a sketch of what that script tends to look like follows this list).

4. run it on a subset of the data locally to verify that it works.

5. run it in production against my customer's account.
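
For concreteness, the script from step 3 usually comes back looking something like this; the file names, columns, and code mapping below are made up for illustration, not from a real job:

  # Hypothetical sketch of the kind of script step 3 produces: map a source
  # CSV of call records onto the target interface. Names are made up.
  import csv
  from datetime import datetime

  # Assumed mapping from legacy outcome codes to the target's disposition values.
  DISPOSITION_MAP = {"CMPL": "completed", "NOANS": "no_answer", "VM": "voicemail"}

  def transform_row(row):
      return {
          "call_id": row["CallRef"],
          # Source dates arrive as MM/DD/YYYY; the target wants ISO 8601.
          "started_at": datetime.strptime(row["CallDate"], "%m/%d/%Y").date().isoformat(),
          "duration_seconds": int(float(row["DurationMin"]) * 60),
          "disposition": DISPOSITION_MAP.get(row["OutcomeCode"], "unknown"),
      }

  with open("calls_redacted.csv", newline="") as src, \
       open("calls_for_import.csv", "w", newline="") as dst:
      reader = csv.DictReader(src)
      writer = csv.DictWriter(dst, fieldnames=["call_id", "started_at",
                                               "duration_seconds", "disposition"])
      writer.writeheader()
      for row in reader:
          writer.writerow(transform_row(row))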

I'm curious - that seems like a reasonably standard flow, and while it involves a bit of manual work, it feels like the right tradeoff between toil and automation. Do you struggle with that workflow, or do you think it could be better somehow?

dataflowmapper:
Thanks for sharing that workflow; for more straightforward flows, that sounds like a decent approach. My main thoughts on where it could be improved, or where I see potential struggles, are when:

- People aren't comfortable or familiar with coding/Python.
- You get into more complex imports like historical data, transactional data, etc. There you might have 15 transaction types that have to be mapped, all with different fields, math, and conditional logic, and the requirements become too much to prompt ChatGPT with effectively; iterating on the Python can get pretty involved.
- The source data structure and transformation needs aren't consistently the same, leading to a lot of 'throwaway' or heavily modified scripts for each unique case.
- Tasks like VLOOKUPs or enriching data come into play, which might add manual coding or complexity beyond a simple 1-to-1 source-to-destination script.

These are the areas where I'm exploring whether a more structured approach could offer benefits in terms of repeatability and accessibility for a wider range of users and more complex scenarios. Appreciate the insight into your process and your thoughts on this.
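
To give a sense of what that manual coding looks like in practice, the VLOOKUP-style enrichment and per-type logic from the last two points often end up as something like the sketch below (every file, column, and transaction-type name here is made up for illustration):

  # Hypothetical sketch: VLOOKUP-style enrichment via a lookup-table join,
  # plus per-transaction-type logic. All names are made up.
  import pandas as pd

  txns = pd.read_csv("legacy_transactions.csv")
  code_map = pd.read_csv("account_code_map.csv")  # columns: old_code, new_code

  # The script equivalent of a VLOOKUP: left join on the old account code.
  txns = txns.merge(code_map, how="left", on="old_code")

  # Per-type conditional logic; real migrations can have a dozen or more branches.
  def net_amount(row):
      if row["txn_type"] == "INVOICE":
          return row["amount"] + row["tax"]
      if row["txn_type"] == "CREDIT_MEMO":
          return -row["amount"]
      return row["amount"]

  txns["net_amount"] = txns.apply(net_amount, axis=1)
  txns.to_csv("transactions_for_import.csv", index=False)
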
sargstuff:
From doing historical data restorations (pre-JSON): using documented industry/standard structures & protocols and normalized data formats/layouts helps out quite a bit.

Note: awk/shell scripts, circa the '80s, because they're relatively portable across platforms (Mac, PC, DEC, IBM (ignoring EBCDIC), etc.)
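
Same idea in Python rather than awk, against a made-up fixed-width layout (just a sketch of the normalize-first step, not any particular restoration job):

  # Hypothetical sketch: normalize a fixed-width legacy export (a common
  # pre-JSON layout) into plain CSV before any further mapping. Layout is made up.
  import csv

  # Made-up record layout: cols 0-9 account, 10-17 date (YYYYMMDD), 18-28 amount in cents.
  FIELDS = [("account", 0, 10), ("date", 10, 18), ("amount_cents", 18, 29)]

  with open("legacy_export.txt") as src, \
       open("normalized.csv", "w", newline="") as dst:
      writer = csv.writer(dst)
      writer.writerow([name for name, _, _ in FIELDS])
      for line in src:
          writer.writerow([line[start:end].strip() for _, start, end in FIELDS])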