Third, faster and cheaper storage devices mean that it is better to use faster decoding schemes to reduce computation costs than to pursue more aggressive compression to save I/O bandwidth. Formats should not apply general-purpose block compression by default because the bandwidth savings do not justify the decompression overhead.
Not sure I agree with that. Have a situation right now where I am bottlenecked by IO, not compute. I have made a presentation about it: https://presentations.clickhouse.com/meetup53/optimizations/
Actually, it depends on the ratio between memory speed, the number of memory channels, CPU speed, and the number of CPU cores.
But there are cases when compression by default does not make sense. For example, it is pointless to apply lossless compression to embeddings.
The benchmarks above are usually misleading, because they assume no dependence between blocks, which is nuts. In real scenarios, blocks need to be parsed, they depend on their previous blocks, and you need to carry that context around.
My RAM can deliver close to 20GB/s, and my SSD 7GB/s, and that is all commodity hardware.
Meaning unless you have quite slow disks, you're better off without compression.
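To make the tradeoff concrete, here is a back-of-the-envelope model of a pipelined scan; the 3:1 ratio and the decompression throughput figures are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope scan throughput, with and without block compression.
# All numbers below are illustrative assumptions, not measurements.

def scan_throughput_gbps(disk_gbps, ratio=1.0, decompress_gbps=None):
    """Effective throughput (uncompressed GB/s) of a pipelined scan.

    Reading compressed data effectively multiplies disk bandwidth by the
    compression ratio, but the scan can never run faster than decompression.
    """
    if decompress_gbps is None:        # uncompressed path: disk is the only limit
        return disk_gbps
    return min(disk_gbps * ratio, decompress_gbps)

ssd = 7.0  # GB/s, the commodity NVMe figure from the comment above

print(scan_throughput_gbps(ssd))                                   # no compression        -> 7 GB/s
print(scan_throughput_gbps(ssd, ratio=3.0, decompress_gbps=5.0))   # one core decompressing -> 5 GB/s
print(scan_throughput_gbps(ssd, ratio=3.0, decompress_gbps=20.0))  # several cores          -> 20 GB/s
```

The takeaway: block compression only speeds up the scan when decompression throughput exceeds raw disk bandwidth, which on a 7 GB/s NVMe drive usually means throwing several cores at it.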
You can partition your dataset and process each partition on a separate core, which will produce some massive XX or even XXX GB/s?
> up to a maximum ratio < 3:1
this obviously depends on your data pattern. If it is some low-cardinality IDs, they can easily be compressed at a ratio of 100.
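For anyone who wants to sanity-check that on their own data, here is a rough sketch; the run lengths and cardinality below are made up, and zlib stands in for whatever codec your format actually uses:

```python
# Rough sketch: measure the compression ratio of a low-cardinality ID column.
# The data layout below is synthetic (an assumption); real ratios depend
# entirely on your value distribution and row ordering.
import array
import random
import zlib

random.seed(0)

# One million int64 IDs drawn from only 1,000 distinct values, written out in
# long runs -- roughly what a sorted or clustered ID column looks like on disk.
ids = []
while len(ids) < 1_000_000:
    ids.extend([random.randrange(1_000)] * 10_000)

raw = array.array("q", ids).tobytes()   # plain native int64s
compressed = zlib.compress(raw)         # general-purpose codec standing in for LZ4/zstd

print(f"raw {len(raw):,} B -> compressed {len(compressed):,} B, "
      f"ratio {len(raw) / len(compressed):.0f}:1")
```

On a layout like this (long runs of repeated IDs), the ratio lands far beyond the 3:1 discussed elsewhere in the thread; on shuffled high-cardinality data it will not.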
Yes, but as I mentioned:
> multicore decompression is not really possible unless you manually tweak the compression
That is, there is no stable implementation out there that does it. You will have to do that manually and painfully. In which case, you're opening the doors for exotic/niche compression/decompression, and there are better alternatives than LZ4 if you're in the niche market.
> this obviously depends on your data pattern. If it is some low-cardinality IDs, they can easily be compressed at a ratio of 100.
Everything is possible in theory. Yet we have to agree on what is a reasonable expectation. A compression factor of around 3:1 is, from my experience, what you would get at a reasonable compression speed on reasonably distributed data.
There is also Vortex (https://github.com/fulcrum-so/vortex). That has modern encoding schemes that we want to use.
BtrBlocks (https://github.com/maxi-k/btrblocks) from the Germans is another Parquet alternative.
Nimble (formerly Alpha) is a complicated story. We worked with the Velox team for over a year to open-source and extend it. But plans got stymied by legal. This was in collaboration with Meta + CWI + Nvidia + Voltron. We decided to go a separate path because Nimble code has no spec/docs. Too tightly coupled with Velox/Folly.
Given that, we are working on a new file format. We hope to share our ideas/code later this year.
I don't understand your point. Decompression will be applied to separate partitions using separate cores, the same way as compression.
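As a concrete illustration of that per-partition scheme, here is a minimal sketch; it assumes the python-lz4 package, and the 16 MiB partition size is an arbitrary choice:

```python
# Minimal sketch of per-partition ("manually tweaked") compression: each chunk
# is an independent LZ4 frame, so decompression can fan out across cores.
# Assumes the python-lz4 package (pip install lz4); not tied to any file format.
import os
from concurrent.futures import ProcessPoolExecutor

import lz4.frame

CHUNK = 16 * 1024 * 1024  # 16 MiB partitions; an arbitrary choice


def compress_partitions(data: bytes) -> list[bytes]:
    # Each partition is compressed on its own, with no shared dictionary or
    # cross-block context, so any partition can be decoded independently.
    return [lz4.frame.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def _decompress_one(frame: bytes) -> bytes:
    return lz4.frame.decompress(frame)


def decompress_partitions(frames: list[bytes]) -> bytes:
    # No frame depends on another, so a plain process pool scales with cores.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return b"".join(pool.map(_decompress_one, frames))


if __name__ == "__main__":
    payload = os.urandom(1024) * 200_000   # ~200 MB of repetitive test data
    frames = compress_partitions(payload)
    assert decompress_partitions(frames) == payload
```

The trade-off is what the parent comments describe: you give up some ratio at partition boundaries and you now own a bespoke framing scheme, which is the "manual and painful" part.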
> Yet we have to agree on what is a reasonable expectation. A compression factor of around 3:1 is, from my experience
well, my prod database is compressed at a ratio of 7 (many hundreds of billions of IDs).
2c remark, zero horses in this race: I was surprised how few encodings were in Nimble at release. The skeleton superficially seemed fine, I guess (I don't know), but there's not much meat on the bones. Without nice, interesting, optimized encodings, the container for them doesn't feel compelling. But starting with a few inarguable, clear options makes some kind of sense too, as a tactic.
They claim they're trying to figure out a path to decoupling from Velox/Folly, so hopefully that can come about. I tend to believe so, godspeed.
The "implementation not specification" does seem really scary though, isn't how we usually get breakout industry-changimg successes.
I wish I had the savvy to contrast Lance (v2) vs. Nimble a little better. Both seem to be containerizing systems, allowing streams to define their own encodings. Your comment about metadata + encodings makes me feel like there are dimensions to the puzzle I haven't identified yet (mostly after chugging through VeloxCon talks).
(Thanks for everything, Andy, you're doing the good work (practicing and informing). Very, very excited to see y'all's alternative!!)
Compared to the cluster I was using a decade ago, individual nodes are an order of magnitude faster, they can run an order of magnitude bigger jobs, and local storage is two orders of magnitude faster. Meanwhile, increases in network bandwidth have been modest. I/O has become cheap relative to compute, while network has become a scarce resource.
> All papers published in this issue will be presented at the 50th International Conference on Very Large Data Bases, Guangzhou, China, 2024.
And that's because in the submission guidelines [1],
> The last three revision deadlines will be May 15, June 1, and July 15, 2023. Note that the June deadline is on the 1st instead of the 15th, and it is the final revision deadline for consideration to present at VLDB 2023; submissions received after this deadline will roll over to VLDB 2024.
So whether it is (2023) or (2024) is a little ambiguous.
[0] https://www.vldb.org/pvldb/vol17/FrontMatterVol17No2.pdf
https://twitter.com/andy_pavlo/status/1790524064995688660?s=...