
245 points gatesn | 1 comments
xiaodai ◴[] No.41846329[source]
There are a bunch of these, including fst in the R ecosystem, JDF.jl in the Julia ecosystem, etc.
replies(1): >>41848629 #
danking00 ◴[] No.41848629[source]
Thanks for introducing me to these other formats! I hadn't heard of them yet. All three of fst, JDF, and Vortex appear to share the goal of high throughput (de)serialization of tabular data and random access to the data. However, it is not clear to me how JDF and fst permit random access on compressed data because both appear to use block compression (Blosc for JDF; LZ4 or Zstd for fst). While both Blosc and Zstd are extremely fast, accessing a single value of a single row necessarily requires decompressing a whole block of data. Instead of O(1) random access you get O(N_ROWS_PER_BLOCK) random access.

In Vortex, we've specifically invested in high throughput compression techniques that admit O(1) random access. These kinds of techniques are also sometimes called "lightweight compression". The DuckDB folks have a good writeup [1] on the common ones.

[1] https://duckdb.org/2022/10/28/lightweight-compression.html
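
To make the difference concrete, here is a rough Python sketch, not Vortex's actual implementation. It assumes a non-empty integer column, uses zlib from the standard library as a stand-in for a block codec like LZ4 or Zstd, and uses frame-of-reference plus bit-packing as one example of a lightweight scheme. The function names and the BLOCK_ROWS value are made up for illustration.

```python
import struct
import zlib

BLOCK_ROWS = 1024  # hypothetical block size, chosen only for illustration


def block_compress(values, block_rows=BLOCK_ROWS):
    """Compress a column in independent blocks (zlib stands in for LZ4/Zstd)."""
    blocks = []
    for start in range(0, len(values), block_rows):
        chunk = values[start:start + block_rows]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # little-endian int64
        blocks.append(zlib.compress(raw))
    return blocks


def block_get(blocks, i, block_rows=BLOCK_ROWS):
    """Read row i: the whole block must be decompressed, O(block_rows)."""
    raw = zlib.decompress(blocks[i // block_rows])
    return struct.unpack_from("<q", raw, (i % block_rows) * 8)[0]


def for_bitpack(values):
    """Frame-of-reference + bit-packing: store (value - min) in fixed-width bits."""
    base = min(values)
    width = max(1, (max(values) - base).bit_length())
    packed = 0
    for idx, v in enumerate(values):
        packed |= (v - base) << (idx * width)
    nbytes = (len(values) * width + 7) // 8
    return base, width, packed.to_bytes(nbytes, "little")


def for_get(base, width, buf, i):
    """Read row i by touching only the few bytes that hold its bits."""
    bit = i * width
    first = bit // 8
    last = (bit + width + 7) // 8  # exclusive end of the byte window
    window = int.from_bytes(buf[first:last], "little")
    return base + ((window >> (bit % 8)) & ((1 << width) - 1))


values = [1_000_000 + (i * 37) % 500 for i in range(5000)]
blocks = block_compress(values)
base, width, buf = for_bitpack(values)
```

With the bit-packed layout, the position of row i is computed directly as i * width, so for_get reads a constant-size byte window no matter how many rows the column has. block_get has to inflate every row in the block before it can return one.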

replies(2): >>41849819 #>>41854233 #
1. kwillets ◴[] No.41849819[source]
This paper compares the benefits of lightweight compression and other techniques:

https://blog.acolyer.org/2018/09/26/the-design-and-implement...