Apptainer: Application Containers for Linux

(apptainer.org)

cs_throwaway ◴[26 Jun 25 11:51 UTC] No.44386489[source]▶

>>44385742 (OP) #

Funny this is here. Apptainer is Singularity, described here:

https://journals.plos.org/plosone/article?id=10.1371/journal...

If you ever use a shared cluster at a university or run by the government, Apptainer will be available, and Podman / Docker likely won't be.

In these environments, it is best not to use containers at all, and instead get to know your sysadmin and understand how he expects the cluster to be used.

replies(1): >>44387033 #

1. shortrounddev2 ◴[26 Jun 25 13:04 UTC] No.44387033[source]▶

>>44386489 #

Why are docker/podman less common? And why do you say it's better not to use containers? Performance?

replies(1): >>44387172 #

2. kgxohxkhclhc ◴[26 Jun 25 13:22 UTC] No.44387172[source]▶

>>44387033 (TP) #

docker and podman expect to extract images to disk, then use fancy features like overlayfs, which doesn't work on network filesystems -- and in hpc, most filesystems users can write to persistently are network filesystems.

apptainer images are straight filesystem images with no overlayfs or storage driver magic happening -- just a straight loop mount of a disk image.

this means your container images can now live on your network filesystem.

replies(1): >>44387810 #

3. 0xbadcafebee ◴[26 Jun 25 14:30 UTC] No.44387810[source]▶

>>44387172 #

Do the compute instances not have hard disks? Because it seems like whoever's running these systems doesn't understand Linux or containers all that well.

If there's a hard disk on the compute nodes, then you just run the container from the remote image registry, and it downloads and extracts it temporarily to disk. No need for a network filesystem.

If the containerized apps want to then work on common/shared files, they can still do that. You just mount the network filesystem on the host, then volume-mount that into the container's runtime. Now the containerized apps can access the network filesystem.

This is standard practice in AWS ECS, where you can mount an EFS filesystem inside your running containers in ECS. (EFS is just NFS, and ECS is just a wrapper around Docker)

replies(3): >>44388068 #>>44388221 #>>44395112 #

4. jdjcdbxh ◴[26 Jun 25 14:55 UTC] No.44388068{3}[source]▶

>>44387810 #

yes, nodes have local disks, but any local filesystem the user can write to is often wiped between jobs as the machines are shared resources.

there is also the problem of simply distributing the image and mounting it up. you don't want to waste cluster time at the start of your job pulling down an entire image to every node, then extracting the layers -- it is way faster to put a filesystem image in your home directory, then loop mount that image.

replies(1): >>44389930 #

5. NGRhodes ◴[26 Jun 25 15:14 UTC] No.44388221{3}[source]▶

>>44387810 #

At the scale of data we see on our HPC, it is far better performance per £/$ to use Lustre mounted over a fast network. We would spend far too much time shifting data otherwise. Local storage should be used for tmp and scratch purposes.

replies(1): >>44388707 #

6. snickerdoodle12 ◴[26 Jun 25 16:05 UTC] No.44388707{4}[source]▶

>>44388221 #

The docker image is a scratch purpose.

replies(1): >>44389545 #

7. trueismywork ◴[26 Jun 25 17:40 UTC] No.44389545{5}[source]▶

>>44388707 #

Imagine copying an 8 GB image to 96,000 ranks over the network.
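As a rough back-of-the-envelope sketch of the scale described above: the image size and rank count come from the comment, while `RANKS_PER_NODE` is a purely hypothetical value chosen for illustration. It compares the aggregate data moved if every rank pulled its own copy against one copy per node.

```python
# Aggregate data moved when an image is copied once per rank vs once per node.
IMAGE_GB = 8           # image size quoted in the comment
RANKS = 96_000         # rank count quoted in the comment
RANKS_PER_NODE = 96    # ASSUMPTION: hypothetical number of ranks sharing a node


def total_transfer_tb(image_gb: float, copies: int) -> float:
    """Return the aggregate volume in TB (using 1 TB = 1000 GB)."""
    return image_gb * copies / 1000


nodes = RANKS // RANKS_PER_NODE
per_rank_tb = total_transfer_tb(IMAGE_GB, RANKS)
per_node_tb = total_transfer_tb(IMAGE_GB, nodes)

print(f"one copy per rank: {per_rank_tb} TB")
print(f"one copy per node ({nodes} nodes): {per_node_tb} TB")
```

Under these assumptions the per-node figure is two orders of magnitude smaller, but it is still terabytes of traffic at job start.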

replies(1): >>44389888 #

8. 0xbadcafebee ◴[26 Jun 25 18:18 UTC] No.44389888{6}[source]▶

>>44389545 #

It's called caching layers bruv, container images do it. Plus you can stagger registries in a tiered cache per rack/cage/etc. OTOH, constantly re-copying the same executable over and over every time you execute or access it over a network filesystem wastes bandwidth and time, and a network filesystem cache is both inefficient and runs into cache invalidation issues.

9. 0xbadcafebee ◴[26 Jun 25 18:22 UTC] No.44389930{4}[source]▶

>>44388068 #

> yes, nodes have local disks, but any local filesystem the user can write to is often wiped between jobs as the machines are shared resources.

This is completely compatible with containerized systems. Immutable images stay in a filesystem directory users have no access to, so there is no need to wipe them. Write-ability within a running container is completely controlled by the admin configuring how the container executes.

> you don't want to waste cluster time at the start of your job pulling down an entire image to every node, then extract the layers -- it is way faster to put a filesystem image in your home directory, then loop mount that image

This is actually less efficient over time as there's a network access tax every time you use the network filesystem. On top of that, 1) you don't have to pull the images at execution time, you can pull them immediately as soon as they're pushed to a remote registry, well before your job starts, and 2) containers use caching layers so that only changed layers need to be pulled; if only 1 file is changed in a new container image layer, you only pull 1 file, not the entire thing.
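The layer-caching argument can be sketched as a set difference between the layers an image manifest lists and the layers a node already holds. This is a minimal model, not any registry client's actual API; the function name and the digest strings are hypothetical placeholders.

```python
def layers_to_pull(manifest_layers: list[str], cached: set[str]) -> list[str]:
    """Return the layer digests from the manifest that are not cached locally.

    Order is preserved so layers are fetched bottom-up.
    """
    return [digest for digest in manifest_layers if digest not in cached]


# ASSUMPTION: made-up digests standing in for real sha256 layer digests.
cached = {"sha256:base", "sha256:cuda"}
new_image = ["sha256:base", "sha256:cuda", "sha256:pip-v2"]

missing = layers_to_pull(new_image, cached)
print(missing)
```

Note that the unit of reuse in this model is the whole layer: a layer whose digest changed is fetched in full, however small the change inside it.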

replies(2): >>44390593 #>>44390605 #

10. o7ri6246iu45 ◴[26 Jun 25 19:43 UTC] No.44390593{5}[source]▶

>>44389930 #

there generally is no central shared immutable image store because every job is using its own collection of images.

what you're describing might work well for a small team, but when you have a few hundred to thousand researchers sharing the cluster, very few of those layers are actually shared between jobs

even with a handful of users, most of these container images get fat at the python package installation layer, and that layer is one of the most frequently changed layers, and is frequently only used for a single job

replies(2): >>44392072 #>>44395409 #

11. o7ri6246iu45 ◴[26 Jun 25 19:44 UTC] No.44390605{5}[source]▶

>>44389930 #

the "network tax" is not really a network tax. the network is generally a dedicated storage network using infiniband or roce if you cheap out. the storage network and network storage is generally going to be faster than local nvme.

12. 0xbadcafebee ◴[26 Jun 25 22:23 UTC] No.44392072{6}[source]▶

>>44390593 #

Just to review, here are the options:

1. Create an 8gb file on network storage which is loopback-mounted. Accessing the file requires a block store pull over the network for every file access. According to your claim now, these giant blobs are rarely shared between jobs?

2. Create a Docker image in a remote registry. Layers are downloaded as necessary. According to your claim now, most of the containers will have a single layer which is both huge and changed every time python packages are changed, which you're saying is usually done for each job?

Both of these seem bad.

For the giant loopback file, why are there so many of these giant files which (it would seem) are almost identical except for the python differences? Why are they constantly changing? Why are they all so different? Why does every job have a different image?

For the container images, why are they having bloated image layers when python packages change? Python files are not huge. The layers should be between 5-100MB once new packages are installed. If the network is as fast as you say, transferring this once (even at job start) should take what, 2 seconds, if that? Do it before the job starts and it's instantaneous.
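The transfer-time estimate above can be checked with a small calculation. This is only a sketch: the 100 Gbit/s link speed is an assumed figure, not one given in the thread, and the model ignores protocol overhead, layer extraction time, and contention at the registry.

```python
def transfer_seconds(size_mb: float, link_gbit_per_s: float) -> float:
    """Idealised time to move size_mb megabytes over a link of the given speed."""
    size_bits = size_mb * 8_000_000                 # 1 MB = 10^6 bytes = 8 * 10^6 bits
    return size_bits / (link_gbit_per_s * 1_000_000_000)


LINK_GBIT = 100  # ASSUMPTION: hypothetical storage-network speed per node

for size_mb in (5, 100, 8000):  # small layer, large layer, the 8 GB image
    print(f"{size_mb} MB -> {transfer_seconds(size_mb, LINK_GBIT):.4f} s")
```

On an otherwise idle link at that speed, a single layer transfer is well under a second; the open question in the thread is what happens when many nodes pull at once.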

The whole thing sounds inefficient. If we can make kubernetes clusters run 10,000 microservices across 5,000 nodes and make it fast enough for the biggest sites in the world, we can make an HPC cluster (which has higher performance hardware) work too. The people setting this up need to optimize.

replies(1): >>44395206 #

13. lazylizard ◴[27 Jun 25 08:59 UTC] No.44395112{3}[source]▶

>>44387810 #

on a compute node, / is maybe 500gb of nvme. that's all the disk it has.

the users mount their $home over nfs. and get whatever quota we assign. can be 100s of tb.

i actually allow rootless podman to run. but frown at it. it's not very hard for a few jobs to use up all that 500gb if everyone is using podman.

i don't care if you run apptainer/singularity though. since it exists entirely within your own $home and doesn't use the local disk.

14. lazylizard ◴[27 Jun 25 09:15 UTC] No.44395206{7}[source]▶

>>44392072 #

example tiny hpc cluster...

100 nodes. 500gb nvme disk per node. maybe 4 gpus per node. 64 cores? all other storage is network. could be nfs, beegfs, lustre.

100s of users that change over time. say 10 go away and 10 new ones come every 6 months. everyone has 50tb of data. tiny amount of code. cpu and/or gpu intensive.

all those users do different things and use different software. they run batch jobs that go for up to a month. and those users are first and foremost scientists. they happen to write python scripts too.

edit: that thing about optimization.. most of the folks who set up hpc clusters turn off hyperthreading.

15. robertlagrant ◴[27 Jun 25 09:54 UTC] No.44395409{6}[source]▶

>>44390593 #

> container images get fat at the python package installation layer, and that layer is one of the most frequently changed layers

This might be mitigated by having a standard set of packages, which you install in a lower layer, and then installing the frequently changing ones in a higher layer.
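A minimal sketch of why the ordering matters, assuming the usual build-cache behaviour in which a change to one layer invalidates that layer and everything stacked above it. The layer names and sizes are hypothetical and only illustrate the proportions.

```python
def rebuilt_mb(layers: list[tuple[str, int]], changed: set[str]) -> int:
    """Return the MB that must be rebuilt and re-pulled.

    layers is ordered bottom-first as (name, size_mb). Everything from the
    first changed layer upward is counted; layers below it are reused.
    """
    for index, (name, _size) in enumerate(layers):
        if name in changed:
            return sum(size for _name, size in layers[index:])
    return 0


# ASSUMPTION: illustrative sizes, not measurements.
single_layer = [("base", 200), ("all-python-packages", 4000)]
split_layers = [("base", 200), ("standard-packages", 3800), ("job-packages", 200)]

print(rebuilt_mb(single_layer, {"all-python-packages"}))
print(rebuilt_mb(split_layers, {"job-packages"}))
```

With the split layout, a per-job package change touches only the small top layer, while the large standard set stays cached on nodes that already have it.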
