245 points by gatesn | 6 comments
1. xiaodai
There are a bunch of these, including fst in the R ecosystem, JDF.jl in the Julia ecosystem, etc.
2. danking00
Thanks for introducing me to these other formats! I hadn't heard of them yet. All three of fst, JDF, and Vortex appear to share the goal of high-throughput (de)serialization of tabular data with random access to the data. However, it is not clear to me how JDF and fst permit random access on compressed data, because both appear to use block compression (Blosc and LZ4/Zstd, respectively). While Blosc and Zstd are extremely fast, accessing a single value of a single row necessarily requires decompressing the whole block containing it: instead of O(1) random access you get O(N_ROWS_PER_BLOCK) random access.
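
A minimal sketch of that access pattern (zlib stands in for Blosc/LZ4/Zstd here, and the block size and helper names are made up; only the cost structure matters):

    import struct
    import zlib

    # Sketch of a point lookup under block compression: reading one value
    # forces decompression of its entire enclosing block.
    ROWS_PER_BLOCK = 4096

    def compress_column(values):
        """Split a column of u64 values into fixed-size blocks, compress each."""
        blocks = []
        for start in range(0, len(values), ROWS_PER_BLOCK):
            chunk = values[start:start + ROWS_PER_BLOCK]
            blocks.append(zlib.compress(struct.pack(f"<{len(chunk)}Q", *chunk)))
        return blocks

    def read_value(blocks, row):
        """Point lookup: O(ROWS_PER_BLOCK) work to recover a single value."""
        raw = zlib.decompress(blocks[row // ROWS_PER_BLOCK])  # whole block
        values = struct.unpack(f"<{len(raw) // 8}Q", raw)
        return values[row % ROWS_PER_BLOCK]

    col = compress_column(list(range(100_000)))
    assert read_value(col, 54_321) == 54_321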

In Vortex, we've specifically invested in high-throughput compression techniques that admit O(1) random access; a sketch follows below. These kinds of techniques are also sometimes called "lightweight compression". The DuckDB folks have a good writeup [1] on the common ones.

[1] https://duckdb.org/2022/10/28/lightweight-compression.html
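
To make the contrast concrete, here is a minimal frame-of-reference (FOR) sketch. FOR is one of the lightweight encodings the DuckDB post covers; this is not Vortex's actual implementation, and real encoders bit-pack the deltas rather than rounding up to whole bytes:

    # Values are stored as a shared reference plus narrow fixed-width
    # deltas, so decoding value i is an indexed read plus an add:
    # O(1), with no block to decompress.

    def for_encode(values):
        ref = min(values)
        deltas = [v - ref for v in values]
        nbytes = max((max(deltas).bit_length() + 7) // 8, 1)  # bytes per delta
        packed = b"".join(d.to_bytes(nbytes, "little") for d in deltas)
        return ref, nbytes, packed

    def for_get(ref, nbytes, packed, i):
        """O(1) random access: slice the i-th delta, add the reference."""
        return ref + int.from_bytes(packed[i * nbytes:(i + 1) * nbytes], "little")

    ref, nbytes, packed = for_encode([1_000_003, 1_000_017, 1_000_002])
    assert for_get(ref, nbytes, packed, 1) == 1_000_017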

3. kwillets
This paper compares the benefits of lightweight compression and other techniques:

https://blog.acolyer.org/2018/09/26/the-design-and-implement...

4. xiaodai
I see. Very nice. So it's a trade-off. I imagine the throughput of these lightweight compression schemes suffers a little. In analytical workloads it's common to do things like compute the mean of a vector or compute the gradient for a batch of data, so random access appears to be less of an issue there.
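
A toy sketch of why scans are indeed unhurt by block compression (zlib again standing in for the real codecs): a whole-column aggregate touches every row anyway, so each block is decompressed exactly once and the cost is amortized across its rows.

    import struct
    import zlib

    values = list(range(100_000))
    blocks = [zlib.compress(struct.pack(f"<{len(chunk)}Q", *chunk))
              for chunk in (values[i:i + 4096] for i in range(0, len(values), 4096))]

    total = count = 0
    for block in blocks:
        raw = zlib.decompress(block)                  # one pass per block
        chunk = struct.unpack(f"<{len(raw) // 8}Q", raw)
        total += sum(chunk)
        count += len(chunk)

    assert total / count == 49_999.5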
5. danking00
We’ll post a blog post soon with specific, benchmarked numbers, but, in this case, you can have your cake and eat it too!

The compression and decompression throughputs of Vortex (and other lightweight compression schemes) are similar to or better than Parquet's for many common datasets. Unlike Zstd or Blosc, the lightweight encodings are generally both computationally simple and SIMD-friendly (see the toy sketch below). We're seeing multiple gibibytes per second on an M2 MacBook Pro on various datasets in the PBI benchmark [1].

The key insight is that most data we all work with has common patterns that don't require sophisticated, heavyweight compression algorithms. Let's take advantage of that fact to free up more cycles for compute kernels!

[1] https://github.com/cwida/public_bi_benchmark
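
A toy illustration of the SIMD-friendliness point (not Vortex's code): dictionary decoding is just an indexed gather, which maps directly onto vector instructions, whereas general-purpose codecs spend much of their time in branchy entropy decoding.

    import array

    codes = array.array("B", [0, 2, 1, 2, 0])    # one narrow code per row
    dictionary = ["red", "blue", "green"]        # distinct values, stored once

    # Scalar loop here for clarity; an engine runs the same gather many
    # lanes at a time with SIMD, which is where multi-GiB/s decode comes from.
    decoded = [dictionary[c] for c in codes]
    assert decoded == ["red", "green", "blue", "green", "red"]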

6. xiaodai
Cool, looking forward to it.