
245 points gatesn | 4 comments
1. jagged-chisel ◴[] No.41840818[source]
“Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.”

So it’s a toolkit written in Rust. It is not a file format.

replies(1): >>41841079 #
2. _willmanning ◴[] No.41841079[source]
Perhaps that verbiage is just confusing. "On-disk" sort of implies "file format" but could be more explicit.

That said, the immediate next line in the README perhaps clarifies a bit?

"Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation."

replies(1): >>41842846 #
3. jagged-chisel ◴[] No.41842846[source]
“Vortex is […] a highly extensible & extremely fast framework for building a modern columnar file format.”

It’s a framework for building file formats. This does not indicate that Vortex is, itself, a file format.

replies(1): >>41843130 #
4. aduffy ◴[] No.41843130{3}[source]
Will and I actually work on Vortex :wave:

Perhaps we should clean up the wording in the intro, but yes, there is in fact a file format!

We actually built the toolkit first, before building the file format. The interesting thing here is that we have a consistent in-memory and on-disk representation of compressed, typed arrays.

This is nice for a couple of reasons:

(a) It makes it really easy to test out new compression algorithms and compute functions. We just implement a new codec and it's automatically available for the file format.

(b) We spend a lot of energy on efficient push down. Many compute functions such as slicing and cloning are zero-cost, and all compute operations can execute directly over compressed data.
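
To make (b) concrete, here is a minimal sketch of the general idea, not Vortex's actual API. It assumes a toy run-length encoding; `RleArray`, `encode`, `slice`, `sum`, and `decode` are all hypothetical names invented for illustration. Slicing only adjusts an offset and length over shared buffers, and `sum` works run by run without ever expanding the data.

```rust
use std::sync::Arc;

/// Toy run-length-encoded array of i64 values.
/// Hypothetical type for illustration only; not a Vortex type.
#[derive(Clone, Debug)]
struct RleArray {
    values: Arc<[i64]>,     // one value per run
    run_ends: Arc<[usize]>, // exclusive end index of each run in the unsliced array
    offset: usize,          // logical start of this view
    len: usize,             // logical length of this view
}

impl RleArray {
    /// Compress a plain slice into runs.
    fn encode(data: &[i64]) -> Self {
        let mut values: Vec<i64> = Vec::new();
        let mut run_ends: Vec<usize> = Vec::new();
        for (i, &v) in data.iter().enumerate() {
            if values.last() == Some(&v) {
                // Same value as the current run: extend it.
                *run_ends.last_mut().unwrap() = i + 1;
            } else {
                values.push(v);
                run_ends.push(i + 1);
            }
        }
        Self { values: values.into(), run_ends: run_ends.into(), offset: 0, len: data.len() }
    }

    /// Slice without copying or decompressing: the buffers are shared,
    /// only the logical window changes.
    fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.len, "slice out of bounds");
        Self {
            values: Arc::clone(&self.values),
            run_ends: Arc::clone(&self.run_ends),
            offset: self.offset + start,
            len: end - start,
        }
    }

    /// Sum directly over the runs: value * (overlap of the run with the window).
    fn sum(&self) -> i64 {
        let (lo, hi) = (self.offset, self.offset + self.len);
        // Index of the first run that ends after `lo`.
        let first = self.run_ends.partition_point(|&e| e <= lo);
        let mut run_start = if first == 0 { 0 } else { self.run_ends[first - 1] };
        let mut total = 0i64;
        for i in first..self.run_ends.len() {
            if run_start >= hi {
                break;
            }
            let run_end = self.run_ends[i];
            let overlap = run_end.min(hi) - run_start.max(lo);
            total += self.values[i] * overlap as i64;
            run_start = run_end;
        }
        total
    }

    /// Full decompression, used here only to check the compressed-domain result.
    fn decode(&self) -> Vec<i64> {
        (self.offset..self.offset + self.len)
            .map(|i| self.values[self.run_ends.partition_point(|&e| e <= i)])
            .collect()
    }
}
```

For example, encoding `[5, 5, 5, 2, 2, 9, 9, 9, 9]` gives three runs; `sum()` on the whole array returns 55, and `slice(2, 7).sum()` returns 27 while still pointing at the same two buffers. A real codec would handle many types and encodings, but the shape of the trick is the same.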

I'd highly encourage you to check out the vortex-serde crate in the repo for file format things, and the vortex-datafusion crate for some examples of integrating the format into a query engine!