
245 points gatesn | 1 comments
Havoc ◴[] No.41840621[source]
Can one edit it in place?

That’s the main thing currently irritating me about parquet

replies(2): >>41841001 #>>41845342 #
aduffy ◴[] No.41841001[source]
You're unlikely to find this with any analytic file format (including Vortex). The main reason is that OLAP systems generally assume an immutable distributed object/block layer (S3, HDFS, ABFS, etc.).

It's then generally up to a higher-level component called a table format to handle the idea of edits. See for example how Apache Iceberg handles deletes https://iceberg.apache.org/spec/#row-level-deletes
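
To make the idea concrete, here is a minimal sketch of how a reader can apply a positional delete on top of an immutable data file. It is loosely modeled on the position deletes in the Iceberg spec (which record a `file_path` and a `pos`), but the class, the function name `read_with_deletes`, and the file name are hypothetical and for illustration only; this is not the Iceberg API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionDelete:
    file_path: str  # data file the delete applies to
    pos: int        # 0-based row position within that file


def read_with_deletes(file_path, rows, deletes):
    """Return the rows of a data file, skipping positions marked as deleted.

    The data file itself is never modified; deletes are applied at read time.
    """
    deleted = {d.pos for d in deletes if d.file_path == file_path}
    return [row for i, row in enumerate(rows) if i not in deleted]


# Hypothetical contents of one immutable data file.
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]

# A separate delete record marks row position 1 of that file as removed.
deletes = [PositionDelete("data-0001.parquet", 1)]

visible = read_with_deletes("data-0001.parquet", rows, deletes)
```

The original rows are still physically present in the data file; only the reader's view changes, which is why this alone does not satisfy a hard-delete requirement.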

replies(2): >>41841714 #>>41848589 #
slotrans ◴[] No.41841714[source]
This is true, and in principle a good thing, but GDPR and CCPA have come into existence since Parquet and ORC were created. Any format we build in that space today needs to support in-place record-level deletion.
replies(3): >>41841878 #>>41844443 #>>41846674 #
1. aduffy ◴[] No.41841878[source]
Yea so the thing you do for this is called "compaction", where you effectively merge the original + edits/deletes into a new immutable file. You then change your table metadata pointer to point at the new compacted file, and delete the old files from S3.
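
A rough sketch of those three steps, assuming a plain dict stands in for the object store and another dict stands in for the table metadata. The function name `compact`, the keys, and the metadata layout are all made up for illustration and do not correspond to any real table format's API.

```python
def compact(store, table_meta, new_key):
    """Merge a data file and its delete files into one new immutable file.

    `store` is a dict standing in for an object store (key -> contents).
    `table_meta` is a dict standing in for the table metadata pointer.
    Returns the new metadata; the caller swaps it in.
    """
    old_data = table_meta["data_file"]
    old_deletes = table_meta["delete_files"]

    # Collect every deleted row position across all delete files.
    deleted = set()
    for key in old_deletes:
        deleted.update(store[key])

    # 1. Write a new immutable file containing only the surviving rows.
    store[new_key] = tuple(
        row for i, row in enumerate(store[old_data]) if i not in deleted
    )

    # 2. Build metadata that points at the compacted file.
    new_meta = {"data_file": new_key, "delete_files": []}

    # 3. Remove the old objects so the deleted rows are physically gone.
    for key in [old_data, *old_deletes]:
        del store[key]

    return new_meta


# Hypothetical store: one data file plus one delete file marking position 1.
store = {
    "data-0001": ("alice", "bob", "carol"),
    "deletes-0001": (1,),
}
meta = {"data_file": "data-0001", "delete_files": ["deletes-0001"]}

meta = compact(store, meta, "data-0002")
```

In a real system the metadata swap would need to be atomic and the old files would only be removed once no reader still references them; the sketch skips both concerns.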

Due to the way S3 and its ilk are structured as globally replicated KV stores, you're not likely to get in-place edits anytime soon, and until the cost structure incentivizes otherwise you're going to continue to see data systems that prefer immutable cloud storage.