
245 points gatesn | 1 comments
the_mitsuhiko ◴[] No.41840459[source]
> One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.

That is quite interesting. One general challenge with Parquet and Arrow in the OTel / observability ecosystem is that the shape of the data is not really known in advance for spans. They carry arbitrary attributes, and those can change. To the best of my knowledge, no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.

replies(8): >>41840665 #>>41842038 #>>41842282 #>>41842347 #>>41843259 #>>41844697 #>>41846992 #>>41848634 #
robert3005 ◴[] No.41842282[source]
The thing we are trying to achieve is to be able to experiment with and tune the way data is grouped on disk. Parquet has one way of laying data out, CSV is another (though it's a text format, so a bit moot), ORC is another, and Lance has yet another method. The file format itself stores how it's physically laid out on disk, so you can tune and tweak physical layouts to match the specific storage needs of your system (this is the toolkit part, where you can take Vortex and use it to implement your own file format). Having said that, we will have an implementation of the file format that follows a particular layout.
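
To make the "layout stored in the file" idea concrete, here is a toy sketch in Python. It is not Vortex's actual format or API: the JSON footer, the `layout` field, and the `write_file` / `read_file` names are all made up for illustration. The point is only that the reader learns the physical layout from the footer instead of having it hard-coded.

```python
import json
import struct


def write_file(columns, layout):
    """Serialize equal-length integer columns (hypothetical toy format).

    layout is "row" (values interleaved per record) or "column"
    (each column stored contiguously). The choice is recorded in a footer.
    """
    names = list(columns)
    n = len(columns[names[0]]) if names else 0
    if layout == "column":
        values = [v for name in names for v in columns[name]]
    elif layout == "row":
        values = [columns[name][i] for i in range(n) for name in names]
    else:
        raise ValueError(f"unknown layout: {layout}")
    body = struct.pack(f"<{len(values)}q", *values)
    footer = json.dumps({"layout": layout, "columns": names, "rows": n}).encode()
    # The last 4 bytes give the footer length so a reader can find it.
    return body + footer + struct.pack("<I", len(footer))


def read_file(blob):
    """Read the footer first, then decode the body the way it describes."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    footer = json.loads(blob[-4 - footer_len:-4])
    names, n = footer["columns"], footer["rows"]
    values = struct.unpack(f"<{n * len(names)}q", blob[:-4 - footer_len])
    if footer["layout"] == "column":
        return {name: list(values[i * n:(i + 1) * n]) for i, name in enumerate(names)}
    if footer["layout"] == "row":
        return {name: list(values[i::len(names)]) for i, name in enumerate(names)}
    raise ValueError(f"unknown layout: {footer['layout']}")
```

Both layouts round-trip through the same `read_file`, so a writer can switch layouts without the reader changing.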
replies(1): >>41845178 #
1. infogulch ◴[] No.41845178[source]
Wow, I think this is the thing I've wished existed for years! Most file formats leave a huge compression opportunity on the table just because of their choice of physical layout. (I call the simple case "striding order", idk) But getting it right takes a lot of experimentation, which becomes too much churn for applications, and can result in storage layouts that are great for compression but annoying to code against. So the obvious answer (to me at least) is that you need to decouple physical and logical layouts. I'm glad someone is finally trying it!
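
A quick way to see the compression effect of layout is a small Python experiment with `zlib`. This is only a sketch on synthetic data: the two-field record (one noisy byte, one slowly changing flag byte) and the function names are assumptions chosen to make the effect visible, not anything taken from Vortex.

```python
import random
import zlib


def make_rows(n, seed=0):
    """Synthetic records: (noisy byte, flag byte that changes every 1000 rows)."""
    rng = random.Random(seed)
    return [(rng.randrange(256), (i // 1000) % 4) for i in range(n)]


def row_major(rows):
    """Interleave the fields record by record: noise, flag, noise, flag, ..."""
    return bytes(b for row in rows for b in row)


def column_major(rows):
    """Store all noise bytes first, then all flag bytes."""
    noise = bytes(r[0] for r in rows)
    flag = bytes(r[1] for r in rows)
    return noise + flag


def compressed_sizes(n=10_000):
    """Return (row-major size, column-major size) after zlib compression."""
    rows = make_rows(n)
    return (
        len(zlib.compress(row_major(rows))),
        len(zlib.compress(column_major(rows))),
    )


if __name__ == "__main__":
    row_size, col_size = compressed_sizes()
    print(f"row-major: {row_size} bytes, column-major: {col_size} bytes")
```

Both buffers hold exactly the same bytes in a different order. In the column-major one the flag bytes form long runs that the compressor collapses almost entirely, while in the row-major one every flag byte is separated from the next by an unpredictable byte, so the same redundancy is much harder to exploit.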