90 points by sugarpimpdorsey | 13 comments
1. teekert ◴[] No.44775397[source]
Perhaps it is worth noting that all the supercomputers I know of (like the Dutch Snellius and the Finnish Lumi) are Slurm clusters with login nodes.

Bioinformaticians (among others) in, for example, University Medical Centers won't get much more bang for the buck than on a well-managed Slurm cluster (i.e. with GPU and fat nodes etc. to distinguish between compute loads). You buy the machines, and they are utilized close to 100% over their lifetime.
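
To make the GPU/fat-node split concrete, here is a rough sketch of a submit helper that routes a job to the right partition. Partition names and resource numbers here are invented; every site configures these differently.

    # Hypothetical helper: route a job to a GPU or big-memory ("fat") partition.
    # Partition names, limits, and the job script are assumptions for illustration.
    import subprocess

    def submit(script, kind="cpu"):
        args = ["sbatch", "--time=12:00:00"]
        if kind == "gpu":
            args += ["--partition=gpu", "--gres=gpu:1", "--mem=64G"]
        elif kind == "fat":
            args += ["--partition=fat", "--mem=1500G", "--cpus-per-task=64"]
        else:
            args += ["--partition=compute", "--mem=16G", "--cpus-per-task=8"]
        subprocess.run(args + [script], check=True)

    submit("align_reads.sh", kind="fat")   # e.g. a memory-hungry assembly step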

replies(4): >>44775708 #>>44775996 #>>44777261 #>>44784010 #
2. JdeBP ◴[] No.44775708[source]
One of the more prominent uses of Slurm to hit the headlines recently is by the data access centres for the LSST data from the Vera Rubin Observatory, such as the U.S. facility at Stanford and the U.K. facility at the University of Edinburgh's Somerville.

* https://developer.lsst.io/usdf/batch.html

* https://epcc.ed.ac.uk/hpc-services/somerville

But they're all over the place, from the James Hutton Institute to Imperial College London.

* https://www.cropdiversity.ac.uk

* https://www.imperial.ac.uk/computing/people/csg/guides/hpcom...

3. janeway ◴[] No.44775996[source]
Yes, I spend the majority of my professional life on similar systems, writing code in vim and running massive jobs via Slurm. It's required for processing TBs of data in secured environments with seamless command-line access. I hate web-based connections or VS Code-type systems. Although I'm open to any improvements, this works best for me. It's like a world inside one's head with a text-based interface.

Graphical data exploration and stats with R, Python, etc. are a beautiful challenge at that scale.

replies(2): >>44779805 #>>44799879 #
4. secabeen ◴[] No.44777261[source]
In HPC, the general rule of thumb is that if you can keep your machine busy more than 40% of the time, it will be cheaper to run on-prem than in the cloud.
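
A back-of-the-envelope version of that rule of thumb, with purely illustrative numbers; plug in your own node TCO and the hourly rate of a comparable cloud instance:

    # Illustrative break-even: at what utilization does on-prem beat on-demand cloud?
    # All numbers below are made up for the example.
    node_tco_per_year = 10_000      # e.g. a 50K node amortized over 5 years, incl. overheads
    cloud_rate_per_hour = 2.85      # a comparable on-demand instance
    hours_per_year = 24 * 365

    break_even = node_tco_per_year / (cloud_rate_per_hour * hours_per_year)
    print(f"break-even utilization: {break_even:.0%}")   # ~40% with these numbers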
replies(1): >>44784601 #
5. sevensor ◴[] No.44779805[source]
Aside from how slow and user-hostile it is compared to a text editor, my biggest complaint about VS Code is the load it puts on the login node. You get 40 people each running multiple VS Code servers and it brings the poor computer to its knees.
replies(2): >>44784127 #>>44799845 #
6. tomcam ◴[] No.44784010[source]
Dutch Snellius sounds like an obscure baseball player from the 1940s
replies(1): >>44784123 #
7. teekert ◴[] No.44784123[source]
Snellius was the Latin name of the Dutch mathematician Willebrord Snel van Royen [0] ("... best known for Snell's law, named after him, which indicates how light rays are broken [refracted] when light passes through different materials").

[0] https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/30660184/...

8. teekert ◴[] No.44784127{3}[source]
Indeed, I know that our sysadmins don't like it either.
9. teekert ◴[] No.44784601[source]
I had 70% in mind, but that certainly sounds reasonable (do you have a source? That would be great for our management). In our university medical center we have a very hard time getting rid of "shadow IT", because a single 50K machine can just process so much data (e.g. Next Generation Sequencing data) and can be amortized over 5 (probably 10!) years.

And then we aren't even talking about the EPD (electronic patient record) servers that are amortized over 4 years and can easily become compute nodes in the cluster for another 6 (the only problem is the bookkeepers, who just can't live with fully amortized hardware! What a world!!)

replies(1): >>44793288 #
10. secabeen ◴[] No.44793288{3}[source]
I don't have any hard data; my role is somewhat HPC-adjacent rather than directly in it, so this is mostly what I've heard. One way to look at it is that most HPC operators are not charged for any of the following for their gear; it's just provided by the organization as part of the larger pool: power, cooling, real estate, networking, security, Silicon Valley SRE salaries, and the 38% cloud-vendor profit margin.

Of course, the organization will eventually pay for some of those, so it's not entirely fair to leave them out of the IT costs. But there are also lots of ways that non-profits don't pay those costs at the same levels the cloud providers do, whether due to differences in overall costs or to providing a lesser level of capability. (As a quick example, cloud providers need extensive physical security for their datacenters. A hospital server needs a locked door, and can leverage the existing hospital security team for free.)

Cloud is great if your need is elastic, or if you have time-sensitive revenue that depends on your calculations. In non-profit research environments, that is often not the case. Users have compute they want done "eventually", but they don't really care if it's done in 1 hour or 4 hours; they have lots of other good work to do while waiting for the compute to run in the background.

11. mattpallissard ◴[] No.44799845{3}[source]
Every job on an HPC cluster should have a memory and CPU limit. Nearly every job should have a time limit as well. I/O throttling is a much trickier problem.

I wound up having a script for users on a jump host that would submit an sbatch job that ran sshd as the user on a random high-numbered port and stored the port in the output. The output was available over NFS, so the script parsed the port number and displayed the connection info to the user.

The user could then run a vscode server over ssh within the bounds of CPU/memory/time limits.
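
A rough sketch of that flow in Python. The partition name, paths, and resource limits are invented, and running sshd unprivileged needs a user-owned host key and sshd_config, which is site-specific and omitted here.

    # Sketch of the jump-host helper: submit an sbatch job that runs sshd as the
    # user on a random high port, then read the port/host back from the NFS log.
    import getpass, os, random, subprocess, time

    port = random.randint(20000, 60000)
    out = os.path.expanduser(f"~/code-server-{port}.log")   # on NFS, visible from the jump host

    batch = f"""#!/bin/bash
    #SBATCH --job-name=code-server
    #SBATCH --partition=interactive
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    #SBATCH --time=08:00:00
    #SBATCH --output={out}
    echo "PORT={port} HOST=$(hostname)"
    # assumes a user-owned host key and sshd_config so sshd can run unprivileged
    exec /usr/sbin/sshd -D -p {port} -f $HOME/.ssh/sshd_config
    """
    subprocess.run(["sbatch"], input=batch, text=True, check=True)

    while True:                     # wait for the job to start and write its first line
        if os.path.exists(out):
            line = open(out).readline().strip()
            if line.startswith("PORT="):
                break
        time.sleep(2)

    print(line)                     # e.g. PORT=41234 HOST=node0123
    print(f"ssh -p {port} {getpass.getuser()}@<host above>   # then point VS Code Remote-SSH at it")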

replies(1): >>44806988 #
12. mattpallissard ◴[] No.44799879[source]
> It’s like a world inside one’s head with a text-based interface.

I had a co-worker describe it as a giant Linux playground.

Another as ETL nirvana.

13. sevensor ◴[] No.44806988{4}[source]
That’s a really cool idea!