
Pixar's Render Farm

(twitter.com)
382 points by brundolf | 11 comments
1. thomashabets2 No.25616274
I'm surprised they hit only 80-90% CPU utilization. Sure, I don't know their bottlenecks, but I understood rendering to be way more parallelizable than that.

I ray trace Quake demos for fun at a much, much smaller scale[0], and have professionally run much bigger installs (I feel confident saying that even though I don't know Pixar's exact scale).

But I don't know state-of-the-art rendering, and I'm sure Pixar knows their workload much better than I do. I would be interested in hearing why, though.

[0] YouTube butchers the quality with its compression, but https://youtu.be/0xR1ZoGhfhc . Live system at https://qpov.retrofitta.se/, code at https://github.com/ThomasHabets/qpov.

Edit: I see people are following the links. What a day to overflow Go's 64-bit counter for time durations on the stats page. https://qpov.retrofitta.se/stats

I'll fix it later.
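
(For the curious: Go's time.Duration is an int64 count of nanoseconds, so it tops out at roughly 292 years — easy to exceed once you sum CPU time across many machines and years of rendering. A minimal sketch of the wraparound:)

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    func main() {
        // time.Duration is an int64 count of nanoseconds, so the
        // largest representable duration is about 292 years.
        maxDur := time.Duration(math.MaxInt64)
        fmt.Println(maxDur) // 2562047h47m16.854775807s

        // Summing per-frame render times past that point silently
        // wraps negative, which is what breaks a stats page.
        total := maxDur + time.Hour
        fmt.Println(total < 0) // true
    }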

replies(5): >>25616362, >>25616369, >>25616380, >>25616401, >>25617648
2. dagmx No.25616362
I suspect they mean core-count utilization (how many cores are busy), not per-core utilization.

I.e. there's some headroom left for rush jobs and as a safety net, because full occupancy isn't great either.

replies(1): >>25617696
3. brundolf No.25616369
My guess would be that the core redistribution described in the OP only really works for cores on the same machine. If a spare core isn't being used by any process on its machine, a process on another machine might still have trouble utilizing it, because memory isn't shared across machines. The cost of loading (and maybe also pre-processing) all of the required assets may outweigh the brief window of compute availability you're trying to exploit.
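
(A back-of-envelope sketch of that tradeoff — the working-set size and link speed below are my assumptions, not from the article:)

    package main

    import "fmt"

    func main() {
        // Assumed figures for illustration only.
        const workingSetGB = 200.0 // guess at per-frame assets
        const linkGbps = 10.0      // 10 GbE between render nodes

        seconds := workingSetGB * 8 / linkGbps
        fmt.Printf("shipping %.0f GB at %.0f Gb/s takes %.0f s\n",
            workingSetGB, linkGbps, seconds)
        // ~160 s just to load assets: a core that's only free for a
        // minute or two is gone before the remote work could start.
    }
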
replies(1): >>25617718
4. No.25616380
5. mike_d No.25616401
Rendering may be highly parallelizable, but the custom bird-flock simulation they wrote may be memory-constrained. This is why having a solid systems team that can handle the care and feeding of a job scheduler is worth more than expanding a cluster.
6. KaiserPro No.25617648
Maxing out a CPU is easy; keeping it fed with data, and being able to write that data back out, is hard.
replies(1): >>25617691
7. thomashabets2 No.25617691
Yes, but the work units (frames) are large enough that I'm still surprised.

Maybe they're not as parallelizable as I'd expect, e.g. if reusing scene-layout work between frames introduces serial dependencies.
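
(For a sense of how little serial work it takes to cap utilization, here's a quick Amdahl's-law sketch — illustrative numbers, nothing to do with Pixar's actual pipeline:)

    package main

    import "fmt"

    // utilization returns the fraction of n cores kept busy when a
    // fraction s of the work is serial, using Amdahl's law:
    // speedup = 1 / (s + (1-s)/n), utilization = speedup / n.
    func utilization(s float64, n int) float64 {
        speedup := 1 / (s + (1-s)/float64(n))
        return speedup / float64(n)
    }

    func main() {
        for _, s := range []float64{0.001, 0.01, 0.05} {
            fmt.Printf("%.1f%% serial -> %.0f%% utilization on 64 cores\n",
                s*100, 100*utilization(s, 64))
        }
        // 0.1% serial -> 94%, 1% -> 61%, 5% -> 24%
    }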

replies(1): >>25617738
8. thomashabets2 No.25617696
Must be RAM then, because CPU is easy to prioritize.
9. thomashabets2 No.25617718
Yeah. With my POV-Ray workload the least efficient part is povray loading the complex scene, and that's not multithreaded. The solution that works for me is to start ncore frames concurrently, or just two but staggered a bit (so there's always one frame doing parallelizable work that can use all the cores, even if the other can't).

But at that point it may run into RAM constraints, or some as-yet-unmentioned inter-frame dependency/caching.
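
(A minimal sketch of that staggering — the command line and the 30-second offset are guesses for illustration, not qpov's real code:)

    package main

    import (
        "fmt"
        "os/exec"
        "sync"
        "time"
    )

    func main() {
        const frames = 100
        const stagger = 30 * time.Second // rough guess at parse time

        sem := make(chan struct{}, 2) // at most two frames in flight
        var wg sync.WaitGroup
        for n := 0; n < frames; n++ {
            sem <- struct{}{}
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                defer func() { <-sem }()
                // Illustrative invocation; real flags will differ.
                cmd := exec.Command("povray", fmt.Sprintf("+Iframe%04d.pov", n))
                if err := cmd.Run(); err != nil {
                    fmt.Println("frame", n, ":", err)
                }
            }(n)
            // Offset the next frame so its single-threaded parse
            // overlaps this frame's parallel trace phase.
            time.Sleep(stagger)
        }
        wg.Wait()
    }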

10. KaiserPro No.25617738
A scene will have many thousands of assets (trees, cars, people, etc.). Each one has its geo, which could run to millions of polygons (although they use sub-ds).

Each "polygon" could have a 16k texture on it. You're pulling TBs of textures and other assets for each frame.

replies(1): >>25623662
11. thomashabets2 No.25623662
Hmm, yes I see. TBs? Interesting. I'd like to hear a talk about these things.

Naively I would expect (as is the case for my MUCH smaller-scale system) that I could compensate for network/disk-bound and non-multithreaded stages by merely running two concurrent frames.

On a larger scale I would expect to be able to identify RAM-cheap frames, and always have one of them running per machine, but at SCHED_IDLE priority, so that it only gets CPU when the "main" frame is blocked on disk or network, or is in a non-parallelizable stage. By starving that frame of CPU, it's much more likely to have work ready during the short intervals when it's allowed to run.
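
(A sketch of that filler-frame idea, using Linux's SCHED_IDLE class via util-linux's chrt — a hypothetical wrapper, not an existing qpov feature, and the wrapped command line is illustrative only:)

    package main

    import (
        "log"
        "os/exec"
    )

    func main() {
        // chrt --idle puts the command in the SCHED_IDLE class;
        // that policy requires a priority argument of 0.
        cmd := exec.Command("chrt", "--idle", "0",
            "povray", "+Icheap-frame.pov")
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }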