
Pixar's Render Farm

(twitter.com)
382 points | brundolf | 30 comments
1. nom ◴[] No.25616292[source]
Oh man, I wanted this to contain so much more detail :(

What's the hardware? How much electric energy goes into rendering a frame or a whole movie? How do they provision it (as they keep #cores fixed)? They only talk about cores; do they even use GPUs? What's running on the machines? What did they optimize lately?

So many questions! Maybe someone from Pixar's systems department is reading this :)?

replies(7): >>25616619 #>>25616668 #>>25616803 #>>25616962 #>>25617126 #>>25617551 #>>25622359 #
2. daxfohl ◴[] No.25616619[source]
And when leasing cores, who do they lease from and why?
3. aprdm ◴[] No.25616668[source]
Not Pixar specifically, but modern VFX and animation studios usually have a bare-metal render farm, and the nodes are usually pretty beefy -- think at least 24 cores / 128 GB of RAM per node.

Usually in crunch time, if there aren't enough nodes in the render farm, they might rent nodes and connect them to their network for a period of time, use the cloud, or get budget to expand the render farm.

From what I've seen, the cloud is extremely expensive for beefy machines with GPUs, but you can see that some companies use it if you google [0] [1].

GPUs can be used for some workflows in modern studios, but I would bet the majority of it is CPU. Those machines usually run a Linux distro and the render processes (like V-Ray / PRMan, etc.). Everything runs from a big NFS cluster.

[0] https://deadline.com/2020/09/weta-digital-pacts-with-amazon-...

[1] https://www.itnews.com.au/news/dreamworks-animation-steps-to...

replies(1): >>25617317 #
4. dahart ◴[] No.25616803[source]
> They only talk about cores, do they even use GPUs?

They’ve been working on a GPU version of RenderMan for a couple of years.

https://renderman.pixar.com/news/renderman-xpu-development-u...

5. mroche ◴[] No.25616962[source]
Former Pixar Systems Intern (2019) here. I was not part of the team involved in this area, but I have some rough knowledge of some of the parts.

> What's the hardware?

It varies. They have several generations of equipment, but I can say it was all Intel-based and high core count. I don't know how different the render infra was from the workstation infra. I think the total core count (aggregate of render, workstation, and leased) was ~60K cores, and they effectively need to double that over the coming years (trying to remember one of the last meetings I was in) for the productions they have planned.

> How much electric energy goes into rendering a frame or a whole movie?

A lot. The render farm is pretty consistently running at high load, as they produce multiple shows (movies, shorts, episodics) simultaneously, so there really isn't idle time. I don't have numbers, though.

> How do they provision it

Not really sure how to answer this question. But in terms of rendering, to my knowledge shots are profiled by the TDs and optimized for their core counts. So different sequences will have different rendering requirements (memory, cores, hyperthreading, etc.). This is all handled by the render farm scheduler.
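A hypothetical sketch of what such a per-shot profile might look like when submitted to the scheduler (all field names and numbers are invented for illustration, not Pixar's actual job format):

```python
from dataclasses import dataclass

@dataclass
class RenderJob:
    """Hypothetical per-shot resource profile a TD might submit to the farm."""
    shot: str
    cores: int           # cores requested per frame
    memory_gb: int       # guarded memory reservation
    hyperthreading: bool
    priority: int        # higher = scheduled sooner

jobs = [
    RenderJob("seq010_shot040", cores=24, memory_gb=96, hyperthreading=False, priority=50),
    RenderJob("seq020_shot120", cores=8, memory_gb=32, hyperthreading=True, priority=80),
]

# The scheduler would pop the highest-priority shot first.
queue = sorted(jobs, key=lambda j: -j.priority)
print([j.shot for j in queue])
```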

> What's running on the machines?

RHEL. And a lot of Pixar proprietary code (along with the commercial applications).

> They only talk about cores, do they even use GPUs?

For rendering, not particularly. The RenderMan denoiser is capable of running on GPUs, but I can't remember if the render-specific nodes have any in them. The workstation systems (which are also used for rendering) are all on-prem VDI.

RenderMan 24, due out in Q1 2021, will include RenderMan XPU, a GPU (CUDA) based engine. Initially it'll be more of a workstation-facing product to let artists iterate more quickly (it'll also replace their internal CUDA engine used in their proprietary look-dev tool Flow, which was XPU's predecessor), but it will eventually be ready for final-frame rendering. There is still some catch-up that needs to happen in the hardware space, though NVLink'ed RTX 8000s do a reasonable job.

A small quote on the hardware/engine:

>> In Pixar’s demo scenes, XPU renders were up to 10x faster than RIS on one of the studio’s standard artist workstations with a 24-core Intel Xeon Platinum 8268 CPU and Nvidia Quadro RTX 6000 GPU.

If I remember correctly, that was the latest generation (codenamed Pegasus), initially given to the FX department. Hyperthreading is usually disabled, and the workstation itself would be 23 cores, as they reserve one for the hypervisor. Each workstation server is actually two+1: one workstation VM per CPU socket (with NUMA configs and GPU passthrough), plus a background render VM that takes over at night. The next-gen workstations they were negotiating with OEMs for before COVID happened put my jaw on the floor.

6. lattalayta ◴[] No.25617126[source]
Also, render farm capacity is usually referred to "in cores", because the farm is usually heterogeneous hardware networked together over the years. You may have some brand new 96-core, 512 GB RAM machines mixed in with some several-year-old 8-core, 32 GB machines. When a technical artist submits work to be rendered on the farm, they often have an idea of how expensive the task will be. They request a certain number of cores from the farm, and a scheduler goes through and tries to optimize everyone's requests across the available machines.
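A toy first-fit sketch of that matching (all machine specs and requests invented): place core/RAM requests onto a heterogeneous farm, trying smaller machines first.

```python
# Invented farm: one old small node, one new big node.
machines = [
    {"name": "old-07", "cores": 8,  "ram_gb": 32},
    {"name": "new-01", "cores": 96, "ram_gb": 512},
]
# (job name, cores requested, RAM requested in GB)
requests = [("fx_sim", 64, 256), ("comp", 8, 16), ("lighting", 8, 24)]

placements = {}
for job, cores, ram in requests:
    for m in machines:
        if m["cores"] >= cores and m["ram_gb"] >= ram:
            m["cores"] -= cores   # reserve the slot
            m["ram_gb"] -= ram
            placements[job] = m["name"]
            break
    else:
        placements[job] = None    # no slot free; the job waits in the queue

print(placements)
```

Real schedulers are far smarter (priorities, preemption, fair-share), but the core idea is the same: requests are matched against whatever mix of hardware happens to be free.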
7. tinco ◴[] No.25617317[source]
Can confirm cloud GPU is way overpriced if you're doing 24/7 rendering. We run a bare-metal cluster (not VFX but photogrammetry), and I pitched our board on the possibilities. I really did not want to run a bare-metal cluster, but it just does not make sense for a low-margin startup to use cloud processing.

Running 24/7, after about three months it's cheaper to have bought consumer-grade hardware with similar (probably better) performance; for "industrial"-grade hardware (Xeon/Epyc + Quadro) the break-even is under 12 months. We chose consumer-grade bare metal.

One thing that was half surprising, half calculated into our decision was how much less stressful running your own hardware is, despite the operational overhead. When we ran experimentally on the cloud, a misrender could cost us 900 euro, and sometimes we'd have to render 3 times or more for a single client, bringing us from healthily profitable to losing money. The stress of having to get it right the first time sucked.
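The arithmetic behind that break-even, with all prices invented for illustration (your node cost, cloud rate, and power price will differ):

```python
# Rough break-even: when does buying a node beat renting a comparable cloud instance?
node_cost = 6000.0          # one-off: consumer CPU + GPU + RAM + disk (invented)
cloud_rate = 2.5            # per hour for a comparable GPU instance (invented)
power_cost = 0.25 * 0.6     # price per kWh * node draw in kW -> cost per hour on-prem

hours = node_cost / (cloud_rate - power_cost)
months = hours / 24 / 30
print(f"break-even after ~{months:.1f} months of 24/7 use")
```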

replies(3): >>25617371 #>>25620467 #>>25621281 #
8. jack2222 ◴[] No.25617371{3}[source]
I've had renders cost $50,000! Our CTO was less than amused.
replies(1): >>25621655 #
9. KaiserPro ◴[] No.25617551[source]
> How do they provision it

Ex-VFX sysadmin here. I'm not sure if they use their own scheduler or not. If they do, they use Tractor (might be Tractor 2 now), which looks after putting the processes in the right places. Think K8s, but actually easy to use, well documented and reliable (just not distributed, but then it scales way higher and is nowhere near as chatty).

They would have a whole bunch of machines, some old, some new, some with extra memory for particle sims, some with extra cores for just plain rendering. Each machine will be separated into slots, which are made up of a fixed number of cores. Normally memory is guarded but CPU is not (i.e. you only get 8 gigs of RAM, but as much CPU as you can consume; context-switching the CPU is fast, memory not so much). I'm not sure how Pixar does it, but at a large facility like ILM/Framestore/DNeg the farm will be split into shows, each with a guaranteed minimum allocation of cores; this is controlled by the scheduler. Crucially, it'll be oversubscribed, so jobs are ordered by priority.
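A toy sketch of that show-aware ordering (all show names and numbers invented): shows that are below their guaranteed minimum jump the queue, and everyone else competes on priority.

```python
# Invented per-show guarantees and current usage.
minimums = {"showA": 100, "showB": 50}   # guaranteed cores per show
in_use   = {"showA": 120, "showB": 10}   # cores each show currently holds

jobs = [
    {"show": "showA", "prio": 90},
    {"show": "showB", "prio": 40},
    {"show": "showA", "prio": 60},
]

def key(job):
    # Starved shows (below their guaranteed minimum) sort first,
    # then jobs are ordered by descending priority.
    starved = in_use[job["show"]] < minimums[job["show"]]
    return (not starved, -job["prio"])

ordered = sorted(jobs, key=key)
print([(j["show"], j["prio"]) for j in ordered])
```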

As for actual hardware provisioning, that's quite cool. In my experience there will be a bring-up script that talks to the iLO/iDRAC/other management system. When a machine is plugged in, it'll be seen by the bring-up script, download the XML/config/other goop that tells the BIOS how to configure itself and boot from the network, connect to the imaging system, and install whatever version of Linux they have.

As for power per frame: each frame will be made up of different plates, so if you have a water sim, that'll be rendered separately, along with other assets. These can then be combined afterwards in Nuke to tweak and make pretty without having to render everything again.

That being said, a crowd shot with lots of characters with hair, or a water/smoke/ice effect, can take 25+ hours per frame to render. So think a 100-core/thread machine redlining for 25 hours, plus a few hundred TB of spinny disk (and then it'll be tweaked 20-ish times).

Optimisation-wise, I suspect it's mostly getting software to play nice on the same machine, or beating TDs into making better use of assets, or adjusting the storage to make sure it's not being pillaged too much.
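Back-of-envelope energy for that 25-hour frame (wattage and frame count are my assumptions, not studio numbers):

```python
# Rough energy arithmetic for a worst-case heavy frame.
node_watts = 500            # assumed draw of a beefy render node under full load
hours_per_frame = 25        # the heavy-frame figure from above
frames = 90 * 60 * 24       # ~90-minute feature at 24 fps

kwh_per_frame = node_watts / 1000 * hours_per_frame
print(f"{kwh_per_frame:.1f} kWh per heavy frame")
print(f"~{kwh_per_frame * frames / 1000:.0f} MWh if every frame were that heavy")
```

In practice most frames are far cheaper, so this is an upper bound on the per-frame side, not an estimate for a whole movie.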

replies(2): >>25617782 #>>25617873 #
10. aprdm ◴[] No.25617782[source]
Out of curiosity, did you move outside VFX, and if yes, to what industry? Have you been enjoying it, and what motivated you?

Cheers

replies(2): >>25617891 #>>25618544 #
11. berkut ◴[] No.25617873[source]
Rumour in the industry is that Pixar don't use Tractor themselves, and have a custom solution in Emeryville :)
replies(2): >>25617901 #>>25624474 #
12. KaiserPro ◴[] No.25617891{3}[source]
I spent tenish years in VFX. I moved away in 2014, because the hours and pay were abysmal. I still love the industry.

I moved to a large, profitable financial newspaper, which had cute scaling issues (i.e. they were all solved, so engineers tried to find new and interesting ways to unsolve them).

I then moved to a startup that made self-building, machine-readable maps, which allowed me to play with scale again, but on AWS (alas, no real hardware). We were then bought out by a FAAMG company, so now I'm getting bored but being paid loads to do so.

Once the golden handcuffs have been broken, I'd like to go back, but only if I can go home at 5 every day...

replies(2): >>25618192 #>>25618560 #
13. KaiserPro ◴[] No.25617901{3}[source]
Lol, that doesn't surprise me at all. Pixar were the last of the original companies to custom make everything....
replies(1): >>25618210 #
14. aprdm ◴[] No.25618192{4}[source]
Interesting, thanks for sharing! I've been in VFX for around six years, and was in the software industry before (and the hardware industry before that).

I find VFX really fun as a job! Sometimes I do think about leaving, mostly for pay reasons, but the pay has been decent enough recently (basically FAANG base pay without RSUs/bonus...).

It's interesting how we have a lot of big-scale problems that go unrecognized; I find the problems really challenging. When I worked in the software industry, we had a team 10x as big for a problem 100x simpler.

Outside of some big tech companies, biology, the oil industry, and finance, I cannot imagine many companies having such big scale in number of cores/memory/disk.

Working in Pipeline I haven't found crazy hours yet; it has mostly been an 8h/day job that I can disconnect from after I'm done. Also, with Covid some people even switched to 4-day weeks, which is quite interesting.

Anyhow, thanks for sharing your perspective!

replies(1): >>25620804 #
15. berkut ◴[] No.25618210{4}[source]
I think it's more that Tractor's not really very highly thought of in the industry :)

It works, and roughly does what it says on the tin, but most of the bigger studios (other than MPC and DNeg who do use it) have better custom solutions.

16. ◴[] No.25618544{3}[source]
17. berkut ◴[] No.25618560{4}[source]
Were you at The Foundry (I think I know who you are)? If so, I think we were both there at the same time!
replies(1): >>25620806 #
18. malthejorgensen ◴[] No.25620467{3}[source]
How do you manage the bare metal cluster? (E.g. apt/yum updates but also networking and such)
replies(3): >>25620727 #>>25621636 #>>25622256 #
19. erosenbe0 ◴[] No.25620727{4}[source]
I'm a bit out of date but if we are talking about rendering (not data retrieval workloads) I believe the best way is fundamentally the same as it was 25 years ago: network boot, mostly network storage, and applying local config overlays based on MAC address or equivalents. Exactly what push or pull techniques are in vogue I am not sure but definitely no running package managers on each node. You want as little as possible locally -- just a scratchpad disk that can be rebuilt automatically in minutes.
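A minimal sketch of the MAC-based overlay idea (MAC addresses and role names invented): a network-booted node identifies itself and gets role-specific config layered on a common base image.

```python
# Invented mapping from node MAC address to config overlays.
overlays = {
    "aa:bb:cc:00:01:02": ["base", "render-node", "high-mem"],
    "aa:bb:cc:00:03:04": ["base", "workstation"],
}

def config_for(mac: str) -> list[str]:
    # Unknown machines get just the base image, rebuilt from scratch.
    return overlays.get(mac.lower(), ["base"])

print(config_for("AA:BB:CC:00:01:02"))
```

The real mechanism might be PXE boot plus a config-management pull, but the lookup-by-identity idea is the same.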
20. KaiserPro ◴[] No.25620804{5}[source]
Maybe we will work together soon!
21. KaiserPro ◴[] No.25620806{5}[source]
If you worked on katana then I think I know who you are too!
22. rbanffy ◴[] No.25621281{3}[source]
> Running 24/7 for three months, it's cheaper to buy consumer grade hardware

If you have a steady load cloud makes little sense. It only makes sense if you have a tight deadline (as is not that uncommon with video and VFX) and can't fit it within your deployed capacity.

23. tinco ◴[] No.25621636{4}[source]
When it was 3 nodes, and then 6 nodes, the answer was: very unprofessionally. I didn't get the budget for a system administrator; I spent all my budget on developers who could build our application and automate our preprocessing, and overlooked system administration skills. So besides the DoE, managing 3 small teams, and being the lead developer, I am also the system administrator.

So, no fancy answer: our 3D experts got TeamViewer access to the nodes running Windows Pro. Sometimes our renders fail on Patch Tuesday because I forgot to reapply the no-reboot hack.

We're professionalizing now at 12 nodes; we got to the point where the 3D experts don't need to TeamViewer in, so we're swapping them to headless Linux. No idea on update management yet, but they're clean nodes running Ubuntu Server.

24. cosmodisk ◴[] No.25621655{4}[source]
I hope you didn't have to bin them, as Vogue did with one of their photoshoots that cost a fortune :)
replies(1): >>25621715 #
25. dtgriscom ◴[] No.25621715{5}[source]
Sounds interesting: reference?
replies(1): >>25644923 #
26. Narann ◴[] No.25622256{4}[source]
Network solutions depend highly on the physical infrastructure, but for setup maintenance you'll often see SaltStack.
27. Narann ◴[] No.25622359[source]
> They only talk about cores, do they even use GPUs?

From my experience (animated movies), GPU rendering is still very experimental because of how poorly it scales, and it's definitely not used on the render farm.

And I'm not even talking about the cost.

GPU rendering demos focus on speed, but one of the biggest problems on a full feature is flexibility. The more complex your image is, the more problems/artifacts you will "create" in it. Your render time can be 100x faster, but if you need to spend two days fixing a problem on each shot, the quality-vs-speed ratio completely falls over.
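The arithmetic behind that trade-off, with invented numbers (render and fixing times are assumptions, not measurements):

```python
# Per-shot wall time: a 100x faster render still loses if each shot
# needs days of artifact fixing.
cpu_render_h = 10.0                   # assumed CPU render time per shot
gpu_render_h = cpu_render_h / 100     # the advertised 100x speedup
gpu_fixing_h = 2 * 8.0                # two artist-days debugging GPU artifacts

gpu_total_h = gpu_render_h + gpu_fixing_h
print(f"CPU: {cpu_render_h:.1f}h  GPU: {gpu_total_h:.1f}h per shot")
```

And that's before counting that fixing time is an artist's attention, while CPU render time is unattended farm time.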

Everything gets easier, slowly, so maybe one day we will have a 100% GPU farm on big-budget projects, but for now CPU is the most predictable way to manage large-scale rendering for both sides (sysadmins/artists).

28. jason_slack ◴[] No.25624474{3}[source]
Do you have a link to Tractor to let folks check it out?
replies(1): >>25629433 #
29. easton ◴[] No.25629433{4}[source]
https://renderman.pixar.com/tractor
30. cosmodisk ◴[] No.25644923{6}[source]
In the documentary: The September Issue