245 points gatesn | 8 comments
1. Havoc ◴[] No.41840621[source]
Can one edit it in place?

That’s the main thing currently irritating me about Parquet.

replies(2): >>41841001 #>>41845342 #
2. aduffy ◴[] No.41841001[source]
You're unlikely to find this with any analytic file format (including Vortex). The main reason is that OLAP systems generally assume an immutable distributed object/block layer (S3, HDFS, ABFS, etc.).

It's then generally up to a higher-level component called a table format to handle the idea of edits. See for example how Apache Iceberg handles deletes https://iceberg.apache.org/spec/#row-level-deletes
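To make that concrete, here is a rough sketch of the general idea, not Iceberg's actual API: the data file stays immutable, a separate delete file records the positions of deleted rows, and the reader filters them out at scan time. `DataFile`, `PositionDeletes`, `scan`, and the file path are all made-up names for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """An immutable data file: once written, its rows never change."""
    path: str
    rows: tuple


@dataclass
class PositionDeletes:
    """A side file listing (data file path, row position) pairs to hide."""
    deleted: set = field(default_factory=set)

    def delete(self, path: str, position: int) -> None:
        self.deleted.add((path, position))


def scan(data_file: DataFile, deletes: PositionDeletes) -> list:
    """Merge-on-read: return only rows whose position is not marked deleted."""
    return [
        row
        for pos, row in enumerate(data_file.rows)
        if (data_file.path, pos) not in deletes.deleted
    ]


f = DataFile("s3://bucket/part-0.parquet", ("alice", "bob", "carol"))
d = PositionDeletes()
d.delete(f.path, 1)  # "delete" bob without touching the data file
print(scan(f, d))  # ['alice', 'carol']
```

Note that the deleted row is still physically present in the data file; it is only hidden from readers.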

replies(2): >>41841714 #>>41848589 #
3. slotrans ◴[] No.41841714[source]
This is true, and in principle a good thing, but in the time since Parquet and ORC were created, GDPR and CCPA have come to exist. Any format we build in that space today needs to support in-place record-level deletion.
replies(3): >>41841878 #>>41844443 #>>41846674 #
4. aduffy ◴[] No.41841878{3}[source]
Yeah, so the thing you do for this is called "compaction", where you effectively merge the original + edits/deletes into a new immutable file. You then change your table metadata pointer to point at the new compacted file, and delete the old files from S3.

Due to the way S3 and its ilk are structured as globally replicated KV stores, you're not likely to get in-place edits anytime soon, and until the cost structure incentivizes otherwise you're going to continue to see data systems that prefer immutable cloud storage.
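A minimal sketch of the compaction steps, with a plain dict standing in for the object store and another for the table metadata. The `compact` function, the key names, and the `.compacted` suffix are all hypothetical; in a real table format the pointer swap would be an atomic metadata commit.

```python
def compact(store: dict, table: dict) -> None:
    """Rewrite data + deletes into one new immutable object, then swap the pointer."""
    old_data, old_deletes = table["data"], table["deletes"]
    deleted = set(store[old_deletes]) if old_deletes else set()

    # 1. Write a brand-new object; existing objects are never modified.
    survivors = tuple(
        row for pos, row in enumerate(store[old_data]) if pos not in deleted
    )
    new_data = f"{old_data}.compacted"
    store[new_data] = survivors

    # 2. Swap the table metadata pointer to the new file.
    table["data"], table["deletes"] = new_data, None

    # 3. Only now physically remove the old objects.
    del store[old_data]
    if old_deletes:
        del store[old_deletes]


store = {
    "part-0": ("alice", "bob", "carol"),
    "part-0.deletes": (1,),  # row positions marked as deleted
}
table = {"data": "part-0", "deletes": "part-0.deletes"}
compact(store, table)
print(table)
print(store)
```

After step 3 the deleted row no longer exists in any object, which is the part that matters for erasure requests.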

5. FridgeSeal ◴[] No.41844443{3}[source]
I mean, you can have it; you’ve just got to be happy to bear the cost of rewriting the file every time you mutate a row.
6. runeblaze ◴[] No.41845342[source]
Did not read too deeply into the original post, but if you use Arrow you can (not sure if one ever should) do random lookups into the storage buffer and wipe out bytes + put tombstones, at least if the column is encoded "naively".

Of course, if your Arrow file is in some object store, how you would delete random bytes there is unclear.
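For the in-memory case, something like this sketch is what I have in mind. It does not use pyarrow; it just mimics a fixed-width column as a flat buffer plus a validity bitmap, and the helper names `build`, `wipe`, and `read` are invented for illustration. It only works because every value has the same width, so a dictionary- or run-length-encoded column would not allow this.

```python
import struct

WIDTH = 8  # bytes per value (int64), an assumed fixed-width encoding


def build(values):
    """Pack int64 values into a flat buffer plus a validity bitmap (1 = valid)."""
    data = bytearray(struct.pack(f"<{len(values)}q", *values))
    validity = bytearray([0xFF] * ((len(values) + 7) // 8))
    return data, validity


def wipe(data, validity, i):
    """Zero the bytes of row i and clear its validity bit (the 'tombstone')."""
    data[i * WIDTH:(i + 1) * WIDTH] = bytes(WIDTH)
    validity[i // 8] &= ~(1 << (i % 8)) & 0xFF


def read(data, validity, n):
    """Decode n rows, returning None where the validity bit is cleared."""
    values = struct.unpack(f"<{n}q", data)
    return [
        v if validity[i // 8] >> (i % 8) & 1 else None
        for i, v in enumerate(values)
    ]


data, validity = build([10, 20, 30])
wipe(data, validity, 1)
print(read(data, validity, 3))  # [10, None, 30]
```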

7. mkesper ◴[] No.41846674{3}[source]
You can avoid that if you save only per-user encrypted content (expensive, I know). That way you should only have to revoke that key to remove access to the data. The advantage is that you cannot forget about any old backups, etc.
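A toy sketch of the per-user key idea. The cipher here is a SHAKE-256 keystream XOR built from the standard library purely as a stand-in; it is not authenticated and not something to use in production, where you would reach for a vetted AEAD scheme instead. `KeyStore` and `keystream_xor` are hypothetical names.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHAKE-256 keystream XOR). Stand-in only, NOT for production."""
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class KeyStore:
    """Hypothetical per-user key store; deleting a key 'shreds' that user's data."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user: str) -> bytes:
        # Create a random 256-bit key the first time a user is seen.
        return self._keys.setdefault(user, secrets.token_bytes(32))

    def get(self, user: str):
        return self._keys.get(user)

    def revoke(self, user: str) -> None:
        self._keys.pop(user, None)


keys = KeyStore()
nonce = secrets.token_bytes(16)  # must be unique per encrypted value

# The ciphertext can live in immutable files and in old backups.
blob = keystream_xor(keys.key_for("alice"), nonce, b"alice@example.com")

# While the key exists, the data round-trips.
assert keystream_xor(keys.get("alice"), nonce, blob) == b"alice@example.com"

# Revoking the key makes every copy of the blob unreadable.
keys.revoke("alice")
print(keys.get("alice"))  # None
```

The catch is that the key store itself now has to be mutable, durable, and excluded from the long-lived backups.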
8. Havoc ◴[] No.41848589[source]
I see. Hadn’t made the connection to S3 etc. That makes sense though. Thanks for explaining.