
249 points gatesn | 4 comments
the_mitsuhiko ◴[] No.41840459[source]
> One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.

That is quite interesting. One general challenge with Parquet and Arrow in the OTel / observability ecosystem is that the shape of span data is not known up front: spans carry arbitrary attributes, and those can change. To the best of my knowledge, no particularly good solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.

replies(8): >>41840665 #>>41842038 #>>41842282 #>>41842347 #>>41843259 #>>41844697 #>>41846992 #>>41848634 #
1. marginalia_nu ◴[] No.41846992[source]
I've been experimenting with taking this self-description paradigm even further, for a file format I've cooked up for ephemeral data in my search engine.

Basically, since I ended up building a custom library for this, I wanted to solve the portability problem by making it stupidly simple to reverse engineer, so I cooked up a convention where each column (and supporting column) is a file, with a file name that describes its format and role.

So a real-world production table looks like this if you ls in the directory (omitting a few columns for brevity):

  combinedId.0.dat.s64le.bin
  documentMeta.0.dat.s64le.bin
  features.0.dat.s32le.bin
  size.0.dat.s32le.bin
  termIds.0.dat-len.varint.bin
  termIds.0.dat.s64le[].zstd
  termMetadata.0.dat-len.varint.bin
  termMetadata.0.dat.s8[].zstd

The design goal is that just based on an ls output, someone who has never seen the code of the library producing the files should be able to trivially write code that reads it.
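
To illustrate how far the ls output alone can take a reader, here is a minimal Python sketch. It is not the library's code. The field layout `<column>.<part>.<role>.<type>.<container>` is a guess made purely from the listing above, and the names `parse_name`, `decode_fixed`, `ColumnFile` and the field name `part` are hypothetical. It only decodes uncompressed fixed-width columns; the varint and zstd files are recognized but not decoded, since their exact encoding is not stated here.

```python
import re
import struct
from typing import List, NamedTuple

# Assumed layout, inferred only from the file names in the listing:
#   <column>.<part>.<role>.<type>[optional "[]"].<container>
NAME_RE = re.compile(
    r"^(?P<column>[^.]+)\.(?P<part>\d+)\.(?P<role>dat(?:-len)?)"
    r"\.(?P<type>[a-z0-9]+)(?P<array>\[\])?\.(?P<container>bin|zstd)$"
)

# struct format codes for the fixed-width types that appear in the listing.
FIXED_WIDTH = {"s8": "<b", "s32le": "<i", "s64le": "<q"}


class ColumnFile(NamedTuple):
    column: str      # e.g. "termIds"
    part: int        # the numeric field; its meaning is an assumption
    role: str        # "dat" or "dat-len" (presumably per-row array lengths)
    type: str        # e.g. "s64le", "s32le", "s8", "varint"
    is_array: bool   # True when the type is suffixed with "[]"
    container: str   # "bin" (raw) or "zstd" (compressed)


def parse_name(filename: str) -> ColumnFile:
    """Split a column file name into its descriptive parts."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized column file name: {filename!r}")
    return ColumnFile(
        column=m["column"],
        part=int(m["part"]),
        role=m["role"],
        type=m["type"],
        is_array=m["array"] is not None,
        container=m["container"],
    )


def decode_fixed(info: ColumnFile, raw: bytes) -> List[int]:
    """Decode an uncompressed, fixed-width scalar column from raw bytes."""
    if info.container != "bin" or info.is_array or info.type not in FIXED_WIDTH:
        raise NotImplementedError(f"not handled in this sketch: {info}")
    fmt = FIXED_WIDTH[info.type]
    return [value for (value,) in struct.iter_unpack(fmt, raw)]
```

Under these assumptions, reading a column such as `size.0.dat.s32le.bin` would be `decode_fixed(parse_name(name), open(name, "rb").read())`.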
replies(2): >>41847464 #>>41847793 #
2. hiatus ◴[] No.41847464[source]
Do you have a deeper writeup of this anywhere?
replies(1): >>41847912 #
3. gatesn ◴[] No.41847793[source]
Internally the design of Vortex is very similar. The file consists of a whole bunch of "messages" (your files), which then have some metadata attached, and the read logic decides which messages it needs when.
4. marginalia_nu ◴[] No.41847912[source]
Not yet, but I will compile one at some point. I'm in the middle of moving right now so I don't quite have the time to sit down and finish the write-up...