More Nix, fewer containers, btw.
E.g. `docker run -ti nixery.dev/shell/cowsay bash` for on-the-fly containers based on Nix.
We use Singularity on HPC systems (Leonardo, LUMI, Fugaku, NeSI NZ, Levante, etc.), but some devs and researchers have Apptainer installed locally.
We found a timezone bug a few days ago in our Python code (matplotlib, xarray, etc.), but it didn't happen with Apptainer.
As the two code bases are still quite similar, I could confirm that Apptainer fixed it while Singularity CE was still affected by the bug -- Singularity replaces the UTC timezone file with the user's timezone, Helsinki EEST in our case on the LUMI HPC.
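In case it helps anyone debugging something similar, a quick way to see the difference is to compare what each runtime exposes as the container's timezone (the image name here is just a placeholder):

    # check /etc/localtime and the resulting local time inside each runtime
    apptainer exec myimage.sif sh -c 'ls -l /etc/localtime; date'
    singularity exec myimage.sif sh -c 'ls -l /etc/localtime; date'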
So I guess Apptainer is the solution to this use case - has anyone had experience using it to bundle up an AI/ML application for redistribution? Thoughts/tips?
Like, why should I put time into learning this instead of rootless Podman? Aside from the secret management thing it sounds like the same feature set.
The problems are:
1. You can't have apptainers that use each other. The most common case was things like Make, GCC, Git, etc. If Make is in a different apptainer from GCC then it won't work, because as soon as you go into Make it can't see GCC any more.
2. It doesn't work if any of your output artefacts depend on things inside the container. For example you use your GCC apptainer to compile a program. It appears to work, but when you run it you find it actually linked to something in the apptainer that isn't visible any more. This is also a problem for C headers.
3. We had constant issues with PATH getting messed up so you can't see things outside the apptainer that should have been available.
All in all it was a nice idea but ended up causing way more hassle than it was worth. It was much easier just to use an old OS (RHEL8) and get everything to work directly on that.
Perhaps the problems need to be addressed on a more fundamental level.
This paper might help:
Apptainer is not a fork of the old Singularity project: Apptainer is the original project, but the community voted to change its name. It also came under the umbrella of the Linux Foundation:
* https://apptainer.org/news/community-announcement-20211130/
Sylabs (where the original Singularity author first worked) was the one that forked off the original project.
I'm not familiar with it (I don't know if it changed names or I just didn't notice).
Some friction using it, though: is there a good in-depth book about it?
Many container platforms are available, but Apptainer is focused on:
* Verifiable reproducibility and security, using cryptographic signatures, an immutable container image format, and in-memory decryption.
* Integration over isolation by default. Easily make use of GPUs, high speed networks, parallel filesystems on a cluster or server by default.
* Mobility of compute. The single file SIF container format is easy to transport and share.
* A simple, effective security model. You are the same user inside a container as outside, and cannot gain additional privilege on the host system by default. Read more about Security in Apptainer.
[1] https://apptainer.org/docs/user/main/introduction.html
https://journals.plos.org/plosone/article?id=10.1371/journal...
If you ever use a shared cluster at a university or run by the government, Apptainer will be available, and Podman / Docker likely won't be.
In these environments, it is best not to use containers at all, and instead get to know your sysadmin and understand how he expects the cluster to be used.
The most annoying thing is not the lack of privileges, but that the compute nodes have no internet access (because "security") aside from connecting to the head node. So there's the whole song and dance of running the container (or installing conda packages) on the head node so I can download everything I need, then saving the state and running it on the compute node.
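A rough sketch of that dance with Apptainer, assuming a shared home directory (the image and script names are made up):

    # on the head node (has internet): pull the image into a single SIF in $HOME
    apptainer pull $HOME/env.sif docker://python:3.12-slim
    # on the compute node (no internet): run against the already-downloaded file
    apptainer exec $HOME/env.sif python3 my_script.py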
https://www.docker.com/blog/introducing-docker-hardened-imag...
Also I am not sure if apptainers are completely isolated.
Though I suppose with tools like https://containertoolbx.org/ that point becomes moot -- and then, if they move to containers, doesn't it sort of become like toolbx?
To be honest, I think a lot of tools can have huge overlap between them, and I guess that's okay too.
For my workflows on HPC, I use apptainers as basically drop-in replacements for Docker, and for that they work quite well. The biggest benefit is that the containers are unprivileged. This means you can't do a lot of things (in particular complex networking), but it also makes it much more secure for multi-tenant systems (like HPC).
(I know Docker and Apptainer are slightly different beasts, but I’m speaking in broad strokes in a general sense without extra permissions).
I actually really like the nixery.dev idea. Sounds kinda neat.
If I'm being really honest, there are a lot of ways to go about this: there are ways to run Nix inside of Docker, and Docker inside of Nix too.
There are ways to convert Docker images into an OS too, and there are tools like CoreOS.
There is nix-shell, and someone on Hacker News told me about comma, which I am still figuring out (haha! Thanks to them!).
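For anyone curious, both are one-liners to try (assuming Nix, and comma, are already installed):

    nix-shell -p cowsay --run 'cowsay hi'   # ephemeral shell with cowsay on PATH
    , cowsay hi                             # comma: run a package's binary without installing it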
And if one just wants isolation, they can use bubblewrap or pledge (by jart). There is a real beauty and art in this container-esque space, and I truly love it.
I am actually wondering right now whether traefik (as load balancer) + NATS (for a modular monolith) + podman/CoreOS + (Cloudflare tunnels?) + any VPS could be a really good alternative to Kubernetes. You can use Nix to build those containers too, or go the other way around and run NixOS on the VPS with traefik + NATS.
I mean, there is Docker Swarm too if you don't want any of that complexity, but people say it's less actively worked on. Still, I guess there is a sort of fun in reinventing the Kubernetes wheel. I don't have too many problems with Kubernetes, I suppose, because Helm charts exist (I haven't used Kubernetes), though Helm charts are written in Go templates and I think they are a bit clunky. Still, I love Go and feel like I would be okay with writing them. I guess I am just one of the people who believes in scaling horizontally first rather than vertically, until the economics break and it's cheaper to use/learn Kubernetes than not.
[1]: https://docs.fedoraproject.org/en-US/fedora-silverblue/toolb...
apptainer images are straight filesystem images with no overlayfs or storage driver magic happening -- just a straight loop mount of a disk image.
this means your container images can now live on your network filesystem.
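So in practice you can point apptainer at a .sif sitting on, say, a shared project filesystem and run it in place (the paths here are just illustrative):

    # the image is a single file on the shared filesystem; apptainer loop-mounts it directly
    apptainer exec /project/shared/images/analysis.sif python3 run_analysis.py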
Find the code on https://github.com/evertheylen/probox or read my blog post on https://evertheylen.eu/p/probox-intro/
You can use a container as a single environment in which to do development, and that works fine. But they are by definition an isolated environment with different dependencies than other containers. The result of compiling something in a container necessarily needs to end up in its own container.
...that said, you could use the exact same container base image, and make many different container images from it, and those files would be compatible (assuming you shipped all needed dependencies).
If there's a hard disk on the compute nodes, then you just run the container from the remote image registry, and it downloads and extracts it temporarily to disk. No need for a network filesystem.
If the containerized apps want to then work on common/shared files, they can still do that. You just mount the network filesystem on the host, then volume-mount that into the container's runtime. Now the containerized apps can access the network filesystem.
This is standard practice in AWS ECS, where you can mount an EFS filesystem inside your running containers in ECS. (EFS is just NFS, and ECS is just a wrapper around Docker)
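A minimal sketch of that pattern (mount point, image, and paths are illustrative; Apptainer's equivalent of Docker's -v is --bind):

    # host has the network filesystem mounted at /mnt/shared; expose it inside the container
    docker run --rm -v /mnt/shared:/data my-image:latest ./process --input /data/input.csv
    # roughly the same thing with apptainer
    apptainer exec --bind /mnt/shared:/data my-image.sif ./process --input /data/input.csv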
- Need to run more than one activity in a single container (this is an anti-pattern in other container technologies)
- HPC (and sometimes college) environments
- Want single-file distribution model (although it doesn't support deltas)
- Cryptographically sign a SIF file without an external server (see the sketch after this list)
- Robust GPU support
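On the signing point, a minimal sketch using Apptainer's built-in PGP keyring, with no external keyserver involved (the image name is illustrative):

    apptainer key newpair            # generate a local PGP keypair
    apptainer sign my-image.sif      # embed a signature in the SIF
    apptainer verify my-image.sif    # check the signature later, e.g. on the cluster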
There is also the problem of simply distributing the image and mounting it up. You don't want to waste cluster time at the start of your job pulling down an entire image to every node and then extracting the layers -- it is way faster to put a filesystem image in your home directory, then loop-mount that image.
What's the appeal of using this over unshare + chroot to a mounted tarball with a tmpfs union mount where needed? Saner default configuration? Saner interface to cgroups?
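For reference, the DIY approach being described is roughly this (a sketch; the tarball name is made up, the rootfs must contain a shell, and the tmpfs union mount is left out):

    mkdir rootfs && tar -xf rootfs.tar -C rootfs                  # unpack a root filesystem
    sudo unshare --mount --pid --fork chroot rootfs /bin/bash     # new mount+pid namespaces, then chroot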
Just like with Docker, it spins up a Linux VM that integrates with Apptainer. You can install/use it with Lima (much like Docker).
You can also install it with `brew install lima` and then run `limactl start template://apptainer` to get a running Apptainer compatible VM running.
You can achieve that with Docker via `docker save image-name | gzip > image-name.tar.gz` and `docker load --input image-name.tar.gz`.
It likewise doesn't support deltas but there was a link here on HN recently to something called "unregistry" which allows for doing "docker push" to deploy an image to a remote machine without a registry, and that thing does take deltas into account.
I was only partially aware of it as I tend to use Colima more than Lima, but have started to move towards Lima more in general.
That said, I still stick to Docker-style containers personally as they are more widely supported (e.g. VS Code). However, I also work a lot in HPC, so migrating workflows cross-platform to Apptainer containers is a goal of mine.
Process isolation should be the default. You should be able to opt out of certain parts of it as required by your application.
This should not be something you add on top of the OS, nor should it be something that configures existing OS functionality for you. Isolation should be the default.
Only macOS does anything like this out of the box, that I'm aware of, and I'm not sure it is granular enough for my liking as it stands today. I often see apps asking for full disk access or local network access and deny them, because they don't need those things; they maybe need a subset, but I can't allow a subset of "full disk access" or "local network access" if the application is running as myself.
You can absolutely mix and match lots of different binaries from different sources on one Linux system. That's exactly what we're doing now with TCL modules.
> and make many different container images from it
Well yes, that's the problem. You end up either putting everything in one container (in which case why bother with a container?), or with a combinatorial explosion of every piece and version of software you might use.
TCL modules are better. They don't let you cheat like containers do, but in return you get a better system.
On our HPC cluster, each user has a quota of inodes on the shared filesystem. This makes installing some software with lots of files problematic (like Anaconda). An Apptainer image is a single file on the filesystem though (basically squashfs) so you can have those with as many files as you want in each.
Installing the same software normally is easy and works fine though; you just exhaust your quota.
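A sketch of that workflow, assuming the node can pull from a registry (the image name is just an example): the whole environment, with all of its files, counts as a single file against your quota.

    apptainer pull $HOME/miniforge.sif docker://condaforge/miniforge3:latest   # one .sif = one inode
    apptainer exec $HOME/miniforge.sif conda --version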
Doing this across different Linux distributions is inherently prone to failure. I don't know about your TCL modules specifically, but unless you have an identical and completely reproducible software toolchain across multiple linux distributions, it's going to end with problems.
Honestly, it sounds like you just don't understand these systems and how they work. TCL modules aren't better than containers; this is like comparing apples and orangutans.
This is completely compatible with containerized systems. Immutable images stay in a filesystem directory users have no access to, so there is no need to wipe them. Write-ability within a running container is completely controlled by the admin configuring how the container executes.
> you don't want to waste cluster time at the start of your job pulling down an entire image to every node, then extract the layers -- it is way faster to put a filesystem image in your home directory, then loop mount that image
This is actually less efficient over time, as there's a network-access tax every time you use the network filesystem. On top of that, 1) you don't have to pull the images at execution time -- you can pull them as soon as they're pushed to a remote registry, well before your job starts, and 2) container images use cached layers, so only changed layers need to be pulled; if only one layer changed in a new image, you pull just that layer, not the entire thing.
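i.e. something like this can run on each node whenever an image is pushed, well before any job needs it (the registry and image names are made up):

    docker pull registry.example.org/group/analysis:latest   # warm the local layer cache ahead of the job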
I am sure that a lot of people have them deployed and don't even realize it. If you are using Gitea, GitLab, GitHub, or any of their major forks/variations, you probably already have a place to put your images.
So I really don't know what the advantage of 'single file distribution model' is here.
This is probably why people don't bother sharing tarballs of docker images with one another, even though it has been an option this entire time.
What you're describing might work well for a small team, but when you have a few hundred to a thousand researchers sharing the cluster, very few of those layers are actually shared between jobs.
even with a handful of users, most of these container images get fat at the python package installation layer, and that layer is one of the most frequently changed layers, and is frequently only used for a single job
So unfortunately your example doesn't illustrate why Apptainer is a better option.
https://www.redhat.com/en/blog/7-linux-namespaces
After a quick look at the Apptainer documentation, it looks like it minimally takes advantage of user and mount namespaces. So each apptainer gets its own idea of what the users/groups are and what the filesystem looks like.
Flatpak is more about desktop application sandboxing. So while it does use user and mount namespaces like Apptainer, it takes advantage of more Linux features than that to help enhance the isolation.
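Roughly the same building blocks are available from plain util-linux, for comparison (a sketch, not how Apptainer actually invokes them):

    # new user + mount namespaces, no root required; the shell sees itself as root within the namespace
    unshare --user --map-root-user --mount -- /bin/bash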
Which appears to be the opposite of the point of apptainer. Apptainer wants to use containers that integrate tightly with the rest of the system with very little isolation versus Flatpak wants to be maximally isolated with only the permissions necessary for the application.
That isn't to say that apptainer can't use more Linux features to increase isolation. It supports the use of cgroups for resource quotas and can take advantage of different types of namespaces for network isolation among other things.
Now as far as "OSTree vs containers" statement you are replying to... This is kinda misleading.
OSTree is designed to manage binary files in a way similar to how git manages text files. It isn't a type of container technology in itself; it's just used for managing how objects on the filesystem are arranged and managed.
It is used by some flatpak applications, but it is used for things besides flatpak.
The 'containers' he mentioned is really a reference to OCI container image format.
OCI container images are, again, a way to manage filesystem contents, typically used in containers. They aren't a container technology themselves.
They are like tarballs, but for filesystem images.
The OCI image format is essentially a standardized version of the Docker image format.
Due to the popularity and ubiquity of OCI image related tools and hosting software it makes sense for Flatpak to support it.
OCI images, when combined with bootc, also can be used to deploy Linux container images to "bare hardware". Which is gaining popularity in helping to create and deploy "immutable" or "atomic" Linux distributions. Fedora Atomic-based OSes seem to be moving to use Bootc with OCI over pure OSTree approach... although they still use OSTree in some capacity.
Incidentally, Apptainer supports the use of OCI images (in addition to its native SIF) as well as other commonly used container technologies like CNI. CNI is the Container Network Interface and is used with Kubernetes among other things.
In Linux (docker, podman, lxc, apptainer, etc.), containers are produced by combining underlying Linux features in different ways. All of them use Linux namespaces.
https://www.redhat.com/en/blog/7-linux-namespaces
When using docker/podman/apptainer you can pick and choose when and how to use namespaces. For example, I can use just the 'mount' namespace to create a unique view of filesystems, but not the 'process', 'networking', and 'user' namespaces, so that the container shares all of those things with the host OS.
For example, when using podman the default is to use the networking namespace so the container gets its own IP address. When you are using rootless (unprivileged) mode it will use user-mode networking in the form of slirp4netns. This is good enough for most things, but it is limited and slow.
Well, I can turn that off, so that applications running in a podman container share networking with the host OS. I do this for things like syncthing, so that the containerized version runs with the same performance as non-containerized services, without requiring special permissions to set up rootful networking (i.e. macvlans or Linux bridges with veth devices, etc.).
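Concretely, that's just the --network flag (the volume path and image tag here are illustrative):

    # rootless container that shares the host's network stack, so no slirp4netns penalty
    podman run -d --name syncthing --network=host \
      -v $HOME/Sync:/var/syncthing docker.io/syncthing/syncthing:latest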
By default apptainer just uses mount and user namespaces. But it can take advantage of more Linux isolation features if you want it to.
So the process ids, networking, and the rest of it is shared with the host OS.
The mount namespace is like chroot on steroids. It is relatively trivial to break out of chroot jails. In fact it can happen accidentally.
And it makes it easier to take advantage of container image formats (like Apptainer's SIF or more traditional OCI containers).
This is Linux's approach, as opposed to the BSD one of BSD jails, where the traditional limited chroot feature was enhanced to make it robust.
If you want to use toolbx for more isolation, you'll end up turning off a bunch of features and configuring it in weird ways that ultimately defeat the purpose of having toolbx in the first place...
It is a lot easier to just cut out the middleman and use podman directly.
Nix is a huge pain to deal with.
Nix makes me think of the old Zawinski joke of:
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems,"
Except there are fewer upsides to using Nix over something like OCI.
1. Create an 8 GB file on network storage which is loopback-mounted. Every file access then requires pulling blocks over the network. According to your claim now, these giant blobs are rarely shared between jobs?
2. Create a Docker image in a remote registry. Layers are downloaded as necessary. According to your claim now, most of the containers will have a single layer which is both huge and changed every time python packages are changed, which you're saying is usually done for each job?
Both of these seem bad.
For the giant loopback file, why are there so many of these giant files which (it would seem) are almost identical except for the python differences? Why are they constantly changing? Why are they all so different? Why does every job have a different image?
For the container images, why are they having bloated image layers when python packages change? Python files are not huge. The layers should be between 5-100MB once new packages are installed. If the network is as fast as you say, transferring this once (even at job start) should take what, 2 seconds, if that? Do it before the job starts and it's instantaneous.
The whole thing sounds inefficient. If we can make kubernetes clusters run 10,000 microservices across 5,000 nodes and make it fast enough for the biggest sites in the world, we can make an HPC cluster (which has higher performance hardware) work too. The people setting this up need to optimize.
unshare --mount -- /bin/bash
> It is relatively trivial to break out of chroot jails. In fact it can happen accidentally.
Same is true for namespaces actually.
https://www.helpnetsecurity.com/2025/05/20/containers-namesp...
If you're deploying to a server, I don't see a point in setting up a registry, regardless of how trivial it is. It seems even more trivial to just send the deployment package to the server.
It is? I have no issues packing my development containers full of concurrent running processes. systemd even supports running as a "container init" out of the box, so you can get something that looks very similar to a full VM.
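For example, with podman (the image name is illustrative, and this assumes the image actually ships systemd):

    # run systemd as PID 1 inside the container; podman sets up /run, cgroups, etc. for it
    podman run -d --name devbox --systemd=always registry.fedoraproject.org/fedora:latest /sbin/init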