
0cf8612b2e1e:

  Third, faster and cheaper storage devices mean that it is better to use faster decoding schemes to reduce computation costs than to pursue more aggressive compression to save I/O bandwidth. Formats should not apply general-purpose block compression by default because the bandwidth savings do not justify the decompression overhead.
Not sure I agree with that. Have a situation right now where I am bottlenecked by IO and not compute.
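A quick way to check which side of this trade-off you are on is to time raw reads against compressed reads on your own data. A minimal sketch in Python, using the standard-library zlib as a stand-in for whatever block codec a format would use (zstd or lz4 would be the more realistic candidates); the synthetic data, sizes, and temp paths are made up for illustration:

  # Rough check of whether compressed reads pay off on this machine:
  # compare raw read throughput against read + zlib-decompress throughput.
  import os, time, zlib, tempfile

  # ~128 MiB: half repeated random blocks, half highly compressible sequence-like data.
  data = os.urandom(1 << 20) * 64 + b"ACGT" * (16 << 20)
  compressed = zlib.compress(data, 1)  # fast compression level

  def timed_read(path, decompress):
      t0 = time.perf_counter()
      with open(path, "rb") as f:
          blob = f.read()
      if decompress:
          blob = zlib.decompress(blob)
      return len(blob), time.perf_counter() - t0

  with tempfile.TemporaryDirectory() as d:
      raw_path, zip_path = os.path.join(d, "raw"), os.path.join(d, "zip")
      with open(raw_path, "wb") as f: f.write(data)
      with open(zip_path, "wb") as f: f.write(compressed)

      # Caveat: freshly written files sit in the page cache, so this is the
      # best case for the raw read; cold-cache or network reads look worse.
      n, t_raw = timed_read(raw_path, decompress=False)
      _, t_zip = timed_read(zip_path, decompress=True)
      print(f"raw read:          {n / t_raw / 1e6:8.0f} MB/s")
      print(f"read + decompress: {n / t_zip / 1e6:8.0f} MB/s "
            f"(on-disk size {len(compressed) / len(data):.0%} of raw)")

Because the files land in the OS page cache here, this setup flatters the raw read; on a cold cache, a network filesystem, or a genuinely I/O-starved node, the compressed copy can come out ahead, which is the situation described above.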
epistasis:
This is extremely common in genomics settings, and in the past I have spent far more time allocating disk IOPS, network bandwidth, and memory for various pipeline stages than I have on CPUs in this space. Muck up and launch 30x as many processes as your compute node has cores, and it's fairly fixable, but muck up the RAM allocation and disk I/O and you may not be able to fix it in any reasonable time. And if you misallocate your network storage, that can bring the entire cluster to a halt, not just a few nodes.
jltsiren:
I think the idea is that you should design tools and pipelines to take advantage of current hardware. Individual nodes have more CPU cores, more RAM, and more and faster local storage than they used to. Instead of launching many small jobs that compete for shared resources, you should have large jobs that run the entire pipeline locally, using network and network storage only when it's unavoidable.
epistasis:
That is exactly right, and optimizing for the current distribution of hardware is always the goal; however, most interesting problems still do not fit on a single node. For example, large LLMs whose training data (or sometimes even the model itself) does not fit on a single node. Lots of the same allocation principles show up again.
jltsiren:
You mentioned genomics, and that's a field where problems have not grown much over time. You may have more of them, but individual problems are about the same size as before. Most problems have a natural size that depends on the size of the genome. Genomics tools never really embraced distributed computing, because there was no need for the added complexity.
epistasis:
Sure, a 30x human WGS resequencing analysis has gotten pretty trivial over the past decade, but now we also have thousands or millions of them, plus expression data sets, Hi-C, and so on, and the question of how to combine them. There may not be compute clusters in genomics labs anymore, because funding agencies will only pay for cloud and not hardware, but there are lots of people working on large-scale computation that doesn't fit on a single node.
jltsiren:
We actually got a new cluster recently. ~25 nodes with 128 or 192 physical cores, 2 TB RAM, and >10 TB local scratch space each. And most issues arise from the old-school practice of running many small jobs to make scheduling easier. But if you restructure your work to copy the data to local storage, run for a number of hours without accessing the network, and copy the results back to network storage, the issues tend to go away.

Compared to the cluster I was using a decade ago, individual nodes are an order of magnitude faster, they can run an order of magnitude bigger jobs, and local storage is two orders of magnitude faster. Meanwhile, increases in network bandwidth have been modest. I/O has become cheap relative to compute, while network has become a scarce resource.
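
The stage-in / compute / stage-out pattern described above is easy to wrap in a small driver script. A rough sketch in Python, where the network paths, scratch location, and the "run_pipeline.sh" command are all placeholders for whatever your site and tool chain actually use (on SLURM clusters the node-local scratch directory is often handed to the job as $TMPDIR):

  # Sketch of the stage-in / compute / stage-out pattern: copy inputs to
  # node-local scratch, run the whole pipeline there, copy results back.
  import os, shutil, subprocess
  from pathlib import Path

  network_in = Path("/shared/project/sample123")           # network storage (placeholder)
  network_out = Path("/shared/project/results/sample123")  # placeholder
  scratch = Path(os.environ.get("TMPDIR", "/tmp")) / "sample123"

  # Stage in: one bulk copy over the network, then no network I/O until the end.
  shutil.copytree(network_in, scratch / "input", dirs_exist_ok=True)
  (scratch / "output").mkdir(parents=True, exist_ok=True)

  # Run the whole pipeline against fast local storage.
  # "run_pipeline.sh" stands in for whatever tool chain you actually use.
  subprocess.run(
      ["./run_pipeline.sh", str(scratch / "input"), str(scratch / "output")],
      check=True,
  )

  # Stage out: one bulk copy back to network storage, then clean up scratch.
  network_out.mkdir(parents=True, exist_ok=True)
  shutil.copytree(scratch / "output", network_out, dirs_exist_ok=True)
  shutil.rmtree(scratch)

The point is simply that the network is touched twice, in bulk, rather than continuously throughout the run.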