
245 points gatesn | 1 comments
the_mitsuhiko ◴[] No.41840459[source]
> One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.

That is quite interesting. One general challenge with Parquet and Arrow in the OTel / observability ecosystem is that the shape of span data is not known up front. Spans carry arbitrary attributes, and those can change. To the best of my knowledge, no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.

replies(8): >>41840665 #>>41842038 #>>41842282 #>>41842347 #>>41843259 #>>41844697 #>>41846992 #>>41848634 #
sa46 ◴[] No.41843259[source]
Parquet also encodes the physical layout using footers [1], as does ORC [2]. Perhaps the author meant support for semi-structured data, like the spans you mention.

[1]: https://parquet.apache.org/docs/file-format/

[2]: https://orc.apache.org/specification/ORCv2/#file-tail

replies(1): >>41844396 #
1. danking00 ◴[] No.41844396[source]
Yeah, we should be clearer in our description of how our footers differ from Parquet's. Parquet is a bit more prescriptive; for example, it requires row groups, which Vortex does not. If you have a column with huge values and another column of 8-bit ints, Vortex lets you page them separately if you like.
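
To make the "paged separately" point concrete, here is a rough toy sketch of a footer-described layout. This is not the Vortex format or its API: the `DEMO` magic bytes, the JSON footer, and the `write_file` / `read_layout` / `read_column` helpers are all made up for illustration. It only shows the general idea that, when the footer records each column's pages, a column of huge values and a column of tiny ints need not share row-group boundaries.

```python
import json
import struct

MAGIC = b"DEMO"  # hypothetical trailer bytes, not a real format's magic


def write_file(columns, page_rows):
    """Serialize columns with per-column page sizes and a layout footer.

    columns:   dict of column name -> list of bytes values
    page_rows: dict of column name -> rows per page for that column
    """
    body = bytearray()
    layout = {}
    for name, values in columns.items():
        pages = []
        step = page_rows[name]
        for start in range(0, len(values), step):
            chunk = values[start:start + step]
            # Each value is stored as a 4-byte little-endian length plus payload.
            payload = b"".join(struct.pack("<I", len(v)) + v for v in chunk)
            pages.append({"offset": len(body), "length": len(payload), "rows": len(chunk)})
            body += payload
        layout[name] = pages
    footer = json.dumps(layout).encode("utf-8")
    # File = data pages, footer, 4-byte footer length, magic.
    return bytes(body) + footer + struct.pack("<I", len(footer)) + MAGIC


def read_layout(blob):
    """Recover the physical layout by reading only the tail of the file."""
    if blob[-4:] != MAGIC:
        raise ValueError("not a demo file")
    (footer_len,) = struct.unpack("<I", blob[-8:-4])
    return json.loads(blob[-8 - footer_len:-8])


def read_column(blob, layout, name):
    """Decode one column by visiting only the pages the footer lists for it."""
    values = []
    for page in layout[name]:
        data = blob[page["offset"]:page["offset"] + page["length"]]
        pos = 0
        while pos < len(data):
            (n,) = struct.unpack_from("<I", data, pos)
            pos += 4
            values.append(data[pos:pos + n])
            pos += n
    return values
```

With, say, one row per page for a column of large blobs and all rows in a single page for a column of one-byte ints, the footer ends up listing a different number of pages per column, and a reader can fetch either column without touching the other's pages.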