Pixar's Render Farm

(twitter.com)
382 points by brundolf | 24 comments
1. banana_giraffe ◴[] No.25616781[source]
One of the things they mentioned briefly in a little documentary on the making of Soul is that all of the animators work on fairly dumb terminals connected to a back end instance.

I can appreciate that working well when people are in the office, but I'm amazed it worked out for them when people moved to working from home. I have trouble getting some of my engineers to have a connection stable enough for VS Code's remote mode. I can't imagine trying to use a modern GUI over those connections.

replies(6): >>25616815 #>>25616858 #>>25617057 #>>25617074 #>>25618038 #>>25628067 #
2. pstrateman ◴[] No.25616815[source]
I think most connections could be massively improved with a VPN that supports Forward Error Correction, but there don't seem to be any that do.

Seems very strange to me.
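
For illustration, here's a toy XOR-parity sketch in Python of the idea (not any real VPN's scheme): for every group of packets you also send one parity packet, so the receiver can rebuild a single lost packet without waiting for a retransmit.

    # Toy XOR-parity FEC, illustrative only.
    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def make_parity(group):
        # XOR all packets (padded to equal length) into one parity packet.
        size = max(len(p) for p in group)
        parity = bytes(size)
        for p in group:
            parity = xor_bytes(parity, p.ljust(size, b"\0"))
        return parity

    def recover_missing(received, parity):
        # Rebuild the single missing packet (marked None) from the rest plus the parity.
        out = parity
        for p in received:
            if p is not None:
                out = xor_bytes(out, p.ljust(len(parity), b"\0"))
        return out

    group = [b"pkt-one", b"pkt-two", b"pkt-three", b"pkt-four"]
    parity = make_parity(group)
    damaged = [group[0], None, group[2], group[3]]   # packet 2 dropped in transit
    print(recover_missing(damaged, parity))          # b'pkt-two\x00\x00' (padded)

The appeal for interactive protocols like PCoIP is latency: you spend a little extra bandwidth up front instead of stalling on retransmits whenever a packet is lost.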

3. dagmx ◴[] No.25616858[source]
A lot of studios use thin-client / PCoIP boxes from Teradici, etc.

They're pretty great overall, and the bandwidth requirements aren't crazy high, but if you're on a capped plan it will eat through your data allowance pretty quickly. The faster your connection, the better the experience.

Some studios, like Imageworks, don't even have the backend data center in the same location: the thin clients connect to a data center in Washington state even though the studios are in LA and Vancouver.

4. mroche ◴[] No.25617057[source]
The entire studio is VDI-based (except for the Mac stations; unsure about Windows), utilizing the Teradici PCoIP protocol, 10Zig zero clients, and (at the time, at least; not sure if they've started testing the graphical agent) Teradici host cards for the workstations.

I was an intern on Pixar's systems team in 2019 (I'm at Blue Sky now), and we're also using a mix of PCoIP and NoMachine for home users. We finally figured out a quirk with the VPN terminal we sent home with people that was throttling connections, and the experience after that fix is actually really good. There are a few things that can cause lag (such as moving windows of apps like Chrome/Firefox), but for the most part, unless your ISP is introducing problems, it's pretty stable. And everyone with a terminal setup has two monitors, either 2x1920x1200 or 1920x1200 + 2560x1440.

I have a 300Mbps/35Mbps plan (which turns into ~250/35 on VPN) and it's great. We see bandwidth usage ranging from 1Mbps to ~80Mbps on average, with the vast majority being sub-20. There are some outliers that end up in the mid-100s, but we still need to investigate those.
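
For a rough sense of scale (my arithmetic, not a measurement): 20 Mbps sustained works out to about 9 GB per hour, so a full workday in the sub-20 range is on the order of 50-70 GB of transfer - which is why capped plans are the first thing to check.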

We did some cross-country tests with our sister studio ILM over the summer and were hitting ~70-90ms latency, which, although not fantastic, was still plenty workable.

replies(2): >>25617917 #>>25618614 #
5. lattalayta ◴[] No.25617074[source]
That is correct. It's pretty common for a technical artist to have a 24-32 core machine with 128 GB of RAM and a modern GPU. Not to mention that the entirety of the movie is stored on NFS and can approach many hundreds of terabytes. When you're talking about that amount of power and data, it makes more sense to connect into the on-site datacenter.
replies(1): >>25618185 #
6. dgrant ◴[] No.25617917[source]
Hi. I used to work at Teradici. It was always interesting that Pixar went with VDI because it meant the CPUs that were being used as desktops during the day could be used for rendering at night. Roughly speaking. The economics made a lot of sense. A guy from Pixar came to Teradici and gave a talk all about it. Amazing stuff.

Interesting contrast with other companies that switched to VDI where it made very little sense. VMware + server racks + zero clients never made economic sense compared to desktops, at the time. But often there is some other factor that tips things in VDI's favour.

replies(1): >>25618231 #
7. nikon ◴[] No.25618038[source]
Where can I watch the documentary?
replies(1): >>25618320 #
8. __turbobrew__ ◴[] No.25618185[source]
I'm guessing Pixar is using a distributed file system as opposed to traditional NFS? Do you have any idea what storage system render farms tend to use?

At my workplace we have a smallish HPC center and ended up moving off of NFS at about 2PB of storage since we were starting to hit the limits of NFS (think 1TB of RAM and 88 cores on a single NFS server).

replies(1): >>25618539 #
9. mroche ◴[] No.25618231{3}[source]
Yep, all of their workstations were dual-socket servers, where each socket hosted a workstation VM with PCIe passthrough, each getting its own host card + GPU. Each VM had dedicated memory but no ownership of the cores it was pinned to, so overnight, if the 'workstations' were idle, another VM (also with dedicated memory) would spin up (the workstation VMs would be backgrounded), consume the available cores, and add itself to the render farm. An artist could then log in and suspend the job to get their performance back (I believe this was one of the reasons behind the checkpointing feature in RenderMan).
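
To make the hand-off concrete, here's a rough Python/libvirt sketch of what that kind of overnight swap could look like - this is not Pixar's actual tooling, and the domain names and idle heuristic are made up:

    import time
    import libvirt

    def workstation_idle():
        # Hypothetical heuristic; a real setup would check the PCoIP session state or
        # input idle time rather than just the wall clock.
        hour = time.localtime().tm_hour
        return hour >= 20 or hour < 6

    conn = libvirt.open("qemu:///system")
    workstation = conn.lookupByName("artist-ws-01")    # hypothetical domain names
    render_vm = conn.lookupByName("render-node-01")

    while True:
        if workstation_idle() and not render_vm.isActive():
            workstation.suspend()    # background the artist VM; its dedicated memory stays resident
            render_vm.create()       # boot the render VM so its cores can join the farm
        elif not workstation_idle() and render_vm.isActive():
            render_vm.shutdown()     # artist is back: let the render job checkpoint and wind down
            workstation.resume()     # hand the cores back to the workstation VM
        time.sleep(60)

The tricky part is the suspend/resume dance: the render job has to be able to checkpoint so the artist can reclaim the cores without losing completed work, which is where RenderMan's checkpointing comes in.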

The Teradici stuff was great, and from an admin perspective having everything located in the DC made maintenance SO much better. Switching over to VDI is a long term goal for us at Blue Sky as well, but it'll take a lot more time and planning.

replies(1): >>25620189 #
10. banana_giraffe ◴[] No.25618320[source]
There's a little bit of it on Disney+, in one of the extras, called "Soul, Improvised". It's very much not technical; it's more focused on the emotional impact of WFH.
replies(1): >>25619249 #
11. aprdm ◴[] No.25618539{3}[source]
Everywhere I've worked has been traditional NFS, and I've seen more than 3 times the figure you quoted working well. Usually you have different mountpoints/VFSs on different servers for different kinds of files.
replies(2): >>25618878 #>>25643065 #
12. jfindley ◴[] No.25618614[source]
A few years ago I spoke to some ILM people about their VDI setup, which at the time was cobbled together out of Mesos and a bunch of Xorg hacks to get VDI server scheduling working on a pool of remote machines with GPUs (I think they might even have used AWS initially, but I'm not sure - this is going back a fair few years now). I was doing a lot of work with Mesos at the time, and we chatted a bit about this as our work overlapped a fair bit.

Are you still using a similar sort of setup to orchestrate the backend of this, and if so have you published anything about it? I've had a few people ask me about this sort of problem lately and there aren't too many great resources out there I can point people new to this sort of tech towards.

replies(2): >>25619559 #>>25643048 #
13. __turbobrew__ ◴[] No.25618878{4}[source]
Interesting; maybe the scientific computations we are doing are more I/O-intensive than rendering applications? How do studios manage disaster recovery? What happens when a multi-petabyte NFS server keels over? Are there tape drive backups? It seems risky to have such a critical system serviced by only a single node.
replies(2): >>25619450 #>>25622246 #
14. nikon ◴[] No.25619249{3}[source]
Thanks! I just checked it out. So interesting they use Linux for non-developer staff!
replies(2): >>25619457 #>>25643094 #
15. dagmx ◴[] No.25619450{5}[source]
They're serviced by multiple nodes and have very strong backup policies.

At Rhythm and Hues, for example, you could request footage going all the way back to the founding of the studio.

CG work is fairly I/O-intensive for tasks like rendering, where you're reading hundreds or even thousands of geometry caches per frame. But for other things, your I/O isn't as frequent: it's not constant reads/writes, since there are long stretches of computation or artist time between saves and reads.

16. dagmx ◴[] No.25619457{4}[source]
Most of the big animation and visual effects studios are Linux-based.

We even have a reference platform spec for some kind of industry-wide baseline: https://vfxplatform.com/

17. mroche ◴[] No.25619559{3}[source]
I wish I could answer this, but I really can't. Not because of any NDA, just that I don't know. I wasn't involved with the workstation team at Pixar (or ILM at all); I was part of the Network and Server Admin [NSA] team, specifically focused on OpenShift. Pixar uses a lot of tools, and I don't have the full picture of how they all fit together.

Here at Blue Sky we are in our infancy with thin-client-based work. Remote terminals aren't too new, as they were used for contract workers and artists who needed to WFH on the prior show, but we don't have VDI, as we still use deskside workstations. For COVID, the workstations were retrofitted with Teradici remote workstation host cards, and we send the artists home with a VPN client and a zero client, utilizing direct connect. It was enough to get us going, but we have a long road ahead in optimizing this stack and eventually (if our datacenters can handle it) switching over to VDI.

18. a_e_k ◴[] No.25620189{4}[source]
That's one reason for the checkpoint feature, yes, but there are others. A few years back (Dory-era), I participated in a talk at SIGGRAPH '15 about some of them:

https://dl.acm.org/doi/abs/10.1145/2775280.2792573

http://eastfarthing.com/publications/checkpoint.pdf

19. mprovost ◴[] No.25622246{5}[source]
At Weta we divided up the NFS servers into "src" and "dat" - "src" was everything made by artists, and "dat" was the output from the renderwall. We backed up "src" every night, but "dat" was never backed up. Every once in a while there would be some mass deletion event but it was always faster to re-render the lost data than to restore from backups.

Also, none of the high-end commercial filers are single-node - they're all clusters of varying sizes.

20. rperez333 ◴[] No.25628067[source]
I work as a compositor at a visual effects studio that had to adapt, and I can say that I'm impressed too!

The studio internally uses PCoIP boxes, which I don't like due to the tiny added delay (I'm a bit like those developers who complain about milliseconds of latency in their text editors...). Anyway, for the work-from-home setup, we are using NoMachine, which doesn't feel any different from the PCoIP boxes - unless you're using the macOS client, which is much laggier than the Windows or Linux versions.

Actually, I went ahead and tried installing NoMachine on Google Cloud and Amazon AWS CPU-only instances, and got the same responsiveness as my studio setup. No fancy setup or GPU encoding/decoding.

So if you have a Nuke license, you can do some pretty heavy 2D VFX for about 1 USD/hour on a 96-vCPU machine (performance similar to a 32-core AMD) with 196GB of RAM, even without any GPU acceleration.

replies(1): >>25628228 #
21. banana_giraffe ◴[] No.25628228[source]
I've tried a few remote desktop systems. The last one I tried was Parsec, which works well, but always made me feel queasy since it requires you to trust their connection service. (To be clear, I know of no security issues there, I just don't like relying on a third party for my security)

NoMachine looks like a good answer for people like me. Thanks for the pointer, I'll check it out.

22. JustinGarrison ◴[] No.25643048{3}[source]
I worked at WDAS and remember talking to the ILM team testing out the Mesos VDI stuff. AFAIK it never left the POC stage, but it was a really neat demo.

My team at WDAS mirrored pretty closely what Pixar did with VDI, although we didn't fully switch to it for various reasons (power and heat constraints in the datacenter, and price). IIRC the VDI hosts had static VMs, and the Teradici connection manager did all the smarts of routing user requests to a VM. There was no dynamic orchestration for us because we only had 60ish users on full VDI VMs, but even our plans for hundreds of users were still to use Teradici and standard VMs on each host.

We rendered differently than Pixar, which also made our system a bit more static. We didn't have a separate render VM and instead rendered directly on the workstation VM when users were idle or disconnected.

23. JustinGarrison ◴[] No.25643065{4}[source]
Same here. I had some passing information from Pixar, WDAS, and ILM, and they were pretty much all NFS. Lots of NFS caching (Avere) and high-performance NFS appliances in use.
24. JustinGarrison ◴[] No.25643094{4}[source]
I worked at Disney Animation on the Linux engineering team for a few years. The flexibility of Linux was a key enabler for us being able to produce movies the way we did. Artists overall seemed to love the power of the Linux desktop setup we provided.