
77 points by eatonphil | 17 comments
1. 0cf8612b2e1e ◴[] No.40371539[source]

  Third, faster and cheaper storage devices mean that it is better to use faster decoding schemes to reduce computation costs than to pursue more aggressive compression to save I/O bandwidth. Formats should not apply general-purpose block compression by default because the bandwidth savings do not justify the decompression overhead.
Not sure I agree with that. I have a situation right now where I am bottlenecked by I/O, not compute.
replies(6): >>40372011 #>>40372288 #>>40372399 #>>40372660 #>>40373077 #>>40373820 #
2. RhodesianHunter ◴[] No.40372011[source]
Is this because you're using some sort of network-backed storage like EBS?
3. epistasis ◴[] No.40372288[source]
This is extremely common in genomics settings, and in the past I have spent far more time allocating disk IOPS, network bandwidth, and memory for the various pipeline stages than I have on CPUs in this space. Muck up and launch 30x as many processes as your compute node has cores, and it's fairly fixable; but muck up the RAM allocation or disk I/O and you may not be able to fix it in any reasonable time. And if you misallocate your network storage, that can bring the entire cluster to a halt, not just a few nodes.
replies(1): >>40372870 #
4. apavlo ◴[] No.40372399[source]
> Have a situation right now where I am bottlenecked by IO and not compute.

Can you describe your use-case? Are you reading from NVMe or S3?

5. zX41ZdbW ◴[] No.40372660[source]
My point is nearly the opposite. Data formats should apply lightweight compression, such as LZ4, by default, because it can be beneficial even if the data is read from RAM.

I have made a presentation about it: https://presentations.clickhouse.com/meetup53/optimizations/

Actually, whether it pays off depends on the ratio between memory speed, the number of memory channels, CPU speed, and the number of CPU cores.

But there are cases where compression by default does not make sense. For example, it is pointless to apply lossless compression to embeddings, since dense float vectors are close to incompressible.
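
If you want to sanity-check that trade-off on your own data, here is a minimal sketch comparing compression ratio and single-core decompression speed for the two extremes mentioned above, assuming the `lz4` PyPI package and NumPy are installed (the test data is made up):

    import time
    import numpy as np
    import lz4.frame  # pip install lz4

    # Made-up test data: repetitive integer IDs compress well,
    # random floats (like embeddings) barely compress at all.
    datasets = {
        "low-cardinality ids": (np.arange(10_000_000, dtype=np.int64) % 1000).tobytes(),
        "float embeddings": np.random.rand(10_000_000).astype(np.float32).tobytes(),
    }

    for name, raw in datasets.items():
        compressed = lz4.frame.compress(raw)
        start = time.perf_counter()
        lz4.frame.decompress(compressed)
        elapsed = time.perf_counter() - start
        print(f"{name}: ratio {len(raw) / len(compressed):.1f}:1, "
              f"single-core decompression {len(raw) / elapsed / 1e9:.2f} GB/s")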

replies(2): >>40372775 #>>40375798 #
6. Galanwe ◴[] No.40372775[source]
Last I checked, you can't get much better than 1.5 GB/s per core with LZ4 (from RAM), with a maximum compression ratio below 3:1, and multicore decompression is not really possible unless you manually tweak the compression.

Benchmarks claiming numbers above that are usually misleading, because they assume no dependence between blocks, which is nuts. In real scenarios, blocks need to be parsed, they depend on previous blocks, and you need to carry that context around.

My RAM can deliver close to 20 GB/s and my SSD 7 GB/s, and that is all commodity hardware.

Meaning unless you have quite slow disks, you're better off without compression.
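
To make the arithmetic explicit, here is a rough back-of-envelope using those figures (they are the parent's numbers plus an assumed 3:1 ratio, not measurements):

    # Back-of-envelope with the figures above (GB/s); assumptions, not measurements.
    ssd_read = 7.0       # raw sequential read from a fast NVMe SSD
    ram_read = 20.0      # raw read bandwidth from RAM
    lz4_per_core = 1.5   # LZ4 decompressed output per core
    ratio = 3.0          # optimistic compression ratio

    # A single decompression stream yields at most 1.5 GB/s of usable data,
    # well below the 7 GB/s the SSD delivers uncompressed.
    cores_to_match_ssd = ssd_read / lz4_per_core   # ~4.7 cores just to break even
    cores_to_match_ram = ram_read / lz4_per_core   # ~13 cores to keep up with RAM
    io_reduction = 1 - 1 / ratio                   # ~67% fewer bytes moved is the upside

    print(f"cores to match uncompressed SSD reads: {cores_to_match_ssd:.1f}")
    print(f"cores to match uncompressed RAM reads: {cores_to_match_ram:.1f}")
    print(f"I/O reduction from compression: {io_reduction:.0%}")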

replies(1): >>40372899 #
7. jltsiren ◴[] No.40372870[source]
I think the idea is that you should design tools and pipelines to take advantage of current hardware. Individual nodes have more CPU cores, more RAM, and more and faster local storage than they used to. Instead of launching many small jobs that compete for shared resources, you should have large jobs that run the entire pipeline locally, using network and network storage only when it's unavoidable.
replies(1): >>40373040 #
8. riku_iki ◴[] No.40372899{3}[source]
> Last I checked you can't get much better than 1.5GB/s per core with LZ4

You can partition your dataset and process each partition on a separate core, which will produce some massive throughput in the tens or even hundreds of GB/s, no?

> up to a maximum ratio < 3:1

This obviously depends on your data pattern. If it is low-cardinality IDs, they can easily be compressed at a 100:1 ratio.
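
A minimal sketch of what that manual partitioning might look like, assuming the `lz4` PyPI package and a home-grown chunk layout (this is not a standard container format):

    from concurrent.futures import ProcessPoolExecutor
    import lz4.frame  # pip install lz4

    def compress_partitions(data: bytes, n: int) -> list[bytes]:
        """Split the data into n independent chunks and compress each one."""
        step = (len(data) + n - 1) // n
        return [lz4.frame.compress(data[i:i + step]) for i in range(0, len(data), step)]

    def decompress_all(parts: list[bytes]) -> bytes:
        """Decompress the independent chunks in parallel, one per worker process."""
        with ProcessPoolExecutor() as pool:
            return b"".join(pool.map(lz4.frame.decompress, parts))

    if __name__ == "__main__":
        # Low-cardinality IDs: a handful of distinct values repeated many times.
        ids = b"".join(i.to_bytes(8, "little") for i in range(100)) * 100_000
        parts = compress_partitions(ids, n=16)
        assert decompress_all(parts) == ids
        ratio = len(ids) / sum(len(p) for p in parts)
        print(f"compression ratio: {ratio:.0f}:1 across {len(parts)} independent chunks")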

replies(1): >>40373006 #
9. Galanwe ◴[] No.40373006{4}[source]
> You can partition your dataset and process each partition on a separate core, which will produce some massive throughput in the tens or even hundreds of GB/s, no?

Yes, but as I mentioned:

> multicore decompression is not really possible unless you manually tweak the compression

That is, there is no stable implementation out there that does it. You would have to do it manually, and painfully. In which case, you're opening the door to exotic/niche compression and decompression, and there are better alternatives than LZ4 if you're in that niche market anyway.

> This obviously depends on your data pattern. If it is low-cardinality IDs, they can easily be compressed at a 100:1 ratio.

Everything is possible in theory, yet we have to agree on what is a reasonable expectation. A compression factor of around 3:1 is, in my experience, what you would get at a reasonable compression speed on reasonably distributed data.

replies(1): >>40373170 #
10. epistasis ◴[] No.40373040{3}[source]
That is exactly right, and optimizing for the current distribution of hardware is always the goal; however, most interesting problems still do not fit on a single node. For example, large LLMs whose training data, or sometimes even the model itself, does not fit on a single node. A lot of the same allocation principles show up again.
replies(1): >>40373205 #
11. miohtama ◴[] No.40373077[source]
Try Blosc, it's faster than memcpy:

https://www.blosc.org/pages/blosc-in-depth/
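
If you want to kick the tires, here is a minimal sketch with the `blosc` Python bindings (the codec choice, typesize, and test data are my assumptions, not anything from the link):

    import numpy as np
    import blosc  # pip install blosc

    # A repetitive integer column: the kind of data Blosc's shuffle filter likes.
    raw = (np.arange(10_000_000, dtype=np.int64) % 1000).tobytes()

    # typesize tells the shuffle filter the element width; cname selects the codec.
    packed = blosc.compress(raw, typesize=8, cname="lz4", shuffle=blosc.SHUFFLE)
    assert blosc.decompress(packed) == raw

    print(f"compression ratio: {len(raw) / len(packed):.1f}:1")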

12. riku_iki ◴[] No.40373170{5}[source]
> Yes, but as I mentioned: multicore decompression is not really possible unless you manually tweak the compression

I don't understand your point. Decompression will be applied to the separate partitions on separate cores, the same way as compression.

> Yet we have to agree on what is a reasonable expectation. A compression factor of around 3:1 is, in my experience

Well, my prod database is compressed at a ratio of 7:1 (many hundreds of billions of IDs).

13. jltsiren ◴[] No.40373205{4}[source]
You mentioned genomics, and that's a field where problems have not grown much over time. You may have more of them, but individual problems are about the same size as before. Most problems have a natural size that depends on the size of the genome. Genomics tools never really embraced distributed computing, because there was no need for the added complexity.
replies(1): >>40373491 #
14. epistasis ◴[] No.40373491{5}[source]
Sure, a 30x human WGS resequencing analysis has gotten pretty trivial over the past decade, but now we also have thousands or millions of them, plus expression data sets, Hi-C, etc., and the question of how to combine them all. There may not be compute clusters in genomics labs anymore, because funding agencies will only pay for cloud and not hardware, but there are lots of people working on large-scale computation that doesn't fit on a single node.
replies(1): >>40373794 #
15. jltsiren ◴[] No.40373794{6}[source]
We actually got a new cluster recently. ~25 nodes with 128 or 192 physical cores, 2 TB RAM, and >10 TB local scratch space each. And most issues arise from the old-school practice of running many small jobs to make scheduling easier. But if you restructure your work to copy the data to local storage, run for a number of hours without accessing the network, and copy the results back to network storage, the issues tend to go away.

Compared to the cluster I was using a decade ago, individual nodes are an order of magnitude faster, they can run an order of magnitude bigger jobs, and local storage is two orders of magnitude faster. Meanwhile, increases in network bandwidth have been modest. I/O has become cheap relative to compute, while network has become a scarce resource.
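
That stage-in / compute / stage-out pattern is essentially three steps wrapped around the existing pipeline. A rough sketch, where all paths and the `run_pipeline` command are hypothetical:

    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    NETWORK_IN = Path("/shared/project/sample_001")            # hypothetical network storage
    NETWORK_OUT = Path("/shared/project/results/sample_001")

    # 1. Stage in: copy the inputs to node-local scratch once, up front.
    with tempfile.TemporaryDirectory(dir="/scratch") as tmp:
        local = Path(tmp)
        shutil.copytree(NETWORK_IN, local / "input")

        # 2. Compute: run the whole pipeline against local disk only;
        #    no network traffic for the hours this takes.
        subprocess.run(
            ["run_pipeline", "--in", str(local / "input"), "--out", str(local / "output")],
            check=True,
        )

        # 3. Stage out: copy only the results back to network storage.
        shutil.copytree(local / "output", NETWORK_OUT)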

16. jauntywundrkind ◴[] No.40373820[source]
I struggle to imagine being bandwidth-limited in this day and age. Kioxia makes some mean SSDs, for not a wild price. A 1U server can fit dozens of these monsters easily.
17. tarasglek ◴[] No.40375798[source]
Wrote a blog post on your other discussion: https://taras.glek.net/post/overlooked-hn-databases-that-pro...