218 points by lapnect | 16 comments
1. romesmoke ◴[] No.42174962[source]
An interesting variation is Dynamic Storage Allocation, in which the rectangles can slide along only one axis.

AKA "static memory allocation", since the heights of the rectangles can be interpreted as buffer sizes, and their widths as lifetimes.

Most of my PhD's effort has been devoted to beating the SOTA in this. The problem's importance nowadays is owed to deep learning's memory wall.
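To make the rectangle picture concrete, here is a minimal sketch (mine, not the commenter's) of a DSA instance in Python: each buffer is a rectangle whose width is its lifetime [start, end) and whose height is its size; an allocation assigns each buffer an offset so that buffers alive at the same time occupy disjoint address ranges, and the objective is to minimize the peak address.

```python
# A DSA instance: each buffer has a lifetime [start, end) and a size.
# An allocation assigns each buffer an offset; two buffers whose
# lifetimes overlap must occupy disjoint address ranges.

def lifetimes_overlap(a, b):
    return a["start"] < b["end"] and b["start"] < a["end"]

def is_valid(buffers, offsets):
    for i in range(len(buffers)):
        for j in range(i + 1, len(buffers)):
            if lifetimes_overlap(buffers[i], buffers[j]):
                lo_i, hi_i = offsets[i], offsets[i] + buffers[i]["size"]
                lo_j, hi_j = offsets[j], offsets[j] + buffers[j]["size"]
                if lo_i < hi_j and lo_j < hi_i:  # address ranges collide
                    return False
    return True

def peak(buffers, offsets):
    # Objective: minimize the highest address used (the arena size).
    return max(o + b["size"] for b, o in zip(buffers, offsets))

buffers = [
    {"start": 0, "end": 4, "size": 64},  # alive for the whole run
    {"start": 0, "end": 2, "size": 32},
    {"start": 2, "end": 4, "size": 32},  # never coexists with the one above
]
offsets = [0, 64, 64]  # the two small buffers share an address range
assert is_valid(buffers, offsets)
print(peak(buffers, offsets))  # → 96
```

The "static memory allocation" reading is exactly this: offsets are decided once, offline, and buffer reuse shows up as rectangles stacked in the same address range at different times.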

replies(2): >>42175495 #>>42178981 #
2. auvi ◴[] No.42175495[source]
nice! do you have your thesis or any related paper available online?
replies(3): >>42177050 #>>42179023 #>>42180210 #
3. ◴[] No.42177050[source]
4. almostgotcaught ◴[] No.42178981[source]
> Most of my PhD's effort has been devoted to beating the SOTA in this. Problem's importance nowadays is owed to deep learning's memory wall.

lol i similarly spent 1 year of my phd trying to finagle this same thing for the same exact reason. got as far as proving that when it's parameterized (like in the case of DNNs with algebraically coupled layers) it's FPT (fixed-parameter tractable), which is like duh, but also as soon as i found that i gave up (not much more to be done <shrug>).

replies(1): >>42180194 #
5. almostgotcaught ◴[] No.42179023[source]
there are a bunch of papers on DSA from the OR people

http://adambuchsbaum.com/papers/dsa-stoc03.pdf

https://link.springer.com/chapter/10.1007/978-3-540-27798-9_...

https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...

there are also some from ML people trying to allocate memory optimally for DNNs:

https://arxiv.org/abs/1907.01989

https://arxiv.org/abs/1804.10001

https://arxiv.org/abs/2001.03288

they all boil down to a couple of greedy heuristics. the most recent "cool" paper was from a group at google

https://dl.acm.org/doi/10.1145/3567955.3567961

basic idea is to use both ILP and heuristics. i asked them to open-source it but no dice :(

replies(1): >>42179064 #
6. p0ckets ◴[] No.42179064{3}[source]
An even more recent one from Google: https://github.com/google/minimalloc
replies(1): >>42179116 #
7. almostgotcaught ◴[] No.42179116{4}[source]
i gave up on the problem about 18 months ago so i didn't keep up with the research area. is this yours? the runtimes are of course very good but i don't see a comparison on how good the approximation is vs telamalloc (or just ILP). i'll say this though: it's miraculous that the impl is so small.
replies(1): >>42186127 #
8. romesmoke ◴[] No.42180194[source]
Hm, you sound like you approached it from the theoretical ML side. My strategy was much less sophisticated: I implemented the Buchsbaum et al. paper you posted in another comment.
replies(1): >>42183186 #
9. romesmoke ◴[] No.42180210[source]
Hopefully I'll defend next summer.

Papers-and-code-wise, I haven't published the "final" thing yet. Right now I'm looking into integrating it with IREE, to get some end-to-end figures.

replies(1): >>42184591 #
10. almostgotcaught ◴[] No.42183186{3}[source]
I don't know what you mean. I studied the problem, implemented several papers, wasn't happy with the results, and then kept pushing on the actual complexity. When I figured out there was an existing bound I gave up because I wasn't a theory student so I didn't have time (or really desire) to try to improve on the bound.
replies(1): >>42183608 #
11. romesmoke ◴[] No.42183608{4}[source]
TBH I don't know what you mean either :D

I know that DSA is NP-complete, and that none of the existing deep learning compilers implement the theoretical SOTA (you're probably referring to the 2+ε bound by Buchsbaum et al.).

I also know that scaling laws and the memory wall are soon gonna overpower the currently used heuristics (e.g., sort by size and do best-fit).

Anyways, I'm glad I "met" someone who has struggled against the same problem. Good luck with your research!
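The "sort by size and do best-fit" family mentioned above can be sketched in a few lines. This is an illustrative toy (first-fit placement after sorting by decreasing size), not any compiler's actual allocator:

```python
# Greedy DSA heuristic: place buffers in decreasing size order; each
# buffer takes the lowest offset that clears every already-placed
# buffer whose lifetime overlaps its own.

def greedy_allocate(buffers):
    order = sorted(range(len(buffers)), key=lambda i: -buffers[i]["size"])
    offsets = {}
    for i in order:
        # Address ranges already claimed by live neighbors, bottom-up.
        blocked = sorted(
            (offsets[j], offsets[j] + buffers[j]["size"])
            for j in offsets
            if buffers[i]["start"] < buffers[j]["end"]
            and buffers[j]["start"] < buffers[i]["end"]
        )
        candidate = 0
        for lo, hi in blocked:
            if candidate + buffers[i]["size"] <= lo:
                break  # fits in the gap below this claimed range
            candidate = max(candidate, hi)
        offsets[i] = candidate
    return [offsets[i] for i in range(len(buffers))]

buffers = [
    {"start": 0, "end": 4, "size": 64},
    {"start": 0, "end": 2, "size": 32},
    {"start": 2, "end": 4, "size": 32},
]
print(greedy_allocate(buffers))  # → [0, 64, 64]
```

Heuristics like this run in low polynomial time but carry no good worst-case guarantee, which is the gap between practice and the 2+ε-approximation literature.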

replies(1): >>42184157 #
12. almostgotcaught ◴[] No.42184157{5}[source]
> TBH I don't know what you mean either :D

Think about the ILP with disjunctions formulation of DSA. It's basically a set of linear equations right? In the case of DNNs, the equations are related: the sizes of the allocations in one layer (in say a CNN) are algebraically a function of the outputs of the previous layer. And so on all the way back to the inputs. So what looks like a set of linear equations in N variables is actually a much smaller set of nonlinear equations in K << N variables.

Okay non-linear equations are hard but that's not actually what I'm getting at - what I'm getting at is the instances of the DSA problem in a production ML pipeline are parametric. So I tried to solve this problem: given a DNN and its parameterized DSA instances, can you do some work offline (to solve "part" of the problem) so that at runtime there's not a lot of work left. And that question/problem/challenge falls into the bucket of FPT.
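For anyone following along, the "ILP with disjunctions" formulation being referred to looks schematically like the following (a generic big-M linearization with $o_i$ the offset and $s_i$ the size of buffer $i$; a sketch, not copied from any particular paper):

```latex
\begin{aligned}
\min_{o,\,z,\,P}\quad & P \\
\text{s.t.}\quad & o_i + s_i \le P,\quad o_i \ge 0 && \forall i \\
& o_i + s_i \le o_j + M\,(1 - z_{ij}) && \forall (i,j)\ \text{with overlapping lifetimes} \\
& o_j + s_j \le o_i + M\,z_{ij} && \forall (i,j)\ \text{with overlapping lifetimes} \\
& z_{ij} \in \{0,1\}
\end{aligned}
```

Each binary $z_{ij}$ picks one side of the disjunction "buffer $i$ sits below $j$, or $j$ below $i$"; the parametric observation above is that the $s_i$ are not independent constants but functions of a few shared shape parameters.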

> Good luck with your research

I changed topics and graduated. No more research for me yay

replies(1): >>42185853 #
13. almostgotcaught ◴[] No.42184591{3}[source]
> Right now I'm looking into integrating it with IREE

clever guy. IREE is just about the only serious/available runtime where you can do this because IREE codegens the runtime calls as well as the kernel code. But you're gonna have to either patch an existing HAL or write a new one to accomplish what you want to accomplish. If you want I can help you - if you go to the discord (https://discord.gg/J68usspH) and ask about this in #offtopic I'll DM you and can point you to the right places.

14. romesmoke ◴[] No.42185853{6}[source]
Wait a minute:

1. Formulating DSA as ILP doesn't scale. TelaMalloc starts from this fact as motivation.

2. There's no 1-1 mapping between tensors and allocations. Buffer re-use is important.

3. There's room for offline DSA instances as well. IREE's Stream dialect has a related transform (LayoutSlices.cpp).

Anyways, I'm no ML expert. DSA is much more generic than deep learning compilers. I can't wait to graduate myself and never hear the involved keywords again.

replies(1): >>42187570 #
15. p0ckets ◴[] No.42186127{5}[source]
Not mine. I just happened to hear about it from a colleague recently.
16. almostgotcaught ◴[] No.42187570{7}[source]
> Formulating DSA as ILP doesn't scale.

it doesn't scale to what? 405B LLMs? no probably not. but i have plots that show not unreasonable solve times for CNNs with ~30k tensors (~10 minutes) using gurobi.

> There's no 1-1 mapping between tensors and allocations. Buffer re-use is important.

yes i'm aware... that's why I made the comment above that you can't pull this trick in e.g., PyTorch.

> 3. There's room for offline DSA instances as well. IREE's Stream dialect has a related transform (LayoutSlices.cpp).

yes again, i'm aware - that's why I made the comment above that IREE is the only place you could pull this trick. LayoutSlices is one place but hooking the allocator in the HAL is simpler if you don't want to fight IREE's various transformations that happen after that.

> DSA is much more generic than deep learning compilers.

yes, that's why I posted the OR literature first...

> I can't wait to graduate myself and never hear the involved keywords again.

amen