Sierra uses Slurm https://en.wikipedia.org/wiki/Slurm_Workload_Manager

evanb · on Oct 30, 2018

It actually doesn't use SLURM, even though SLURM is originally an LLNL development. It uses LSF. I do not know why the shift.

https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-sys...

zandl · on Oct 30, 2018

So your process is limited to the resources of a node right? And coordinating data between jobs is via a shared file system or network messaging between nodes?

evanb · on Oct 30, 2018

Depending on the step a LQCD calculation might run on 4, 8, 16, 32, or even more nodes (linear solves and tensor contractions, for example). It's coordinated with OpenMPI and MPI (or equivalent, see my other comment on the software stack). The results from solves are typically valuable and are stored to disk for later reuse. That may prove impractical at this scale.

I'm not sure how big the HMC jobs get on these new machines---it depends on the size of the lattice (which gets optimized for physics but also algorithmic speed / sitting in a good spot for the efficiency of the machine).

selimthegrim · on Oct 31, 2018

Any idea how it compares to Bridges?

michaf · on Oct 30, 2018

I have read about some difficulties of running SLURM on POWER9 systems, so maybe IBM proposed/insisted to run their own POWER9-tested scheduler?

wickberg · on Oct 30, 2018

Slurm is endian and word-size agnostic; there's nothing about the POWER9 platform that would be insurmountable for it to run on. There are occasionally some teething issues with NUMA layout and other Linux kernel differences on newer platforms, but this tends to affect everyone, and get resolved quickly.

My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.

evanb · on Oct 30, 2018

It could be---I really don't know. SLURM ran the BG/Q very successfully, which was PowerPC.