It actually doesn't use SLURM, even though SLURM was originally an LLNL development; it uses LSF. I don't know the reason for the shift.

https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-sys...


So your process is limited to the resources of a node, right? And coordinating data between jobs happens via a shared file system or network messaging between nodes?


Depending on the step, an LQCD calculation might run on 4, 8, 16, 32, or even more nodes (linear solves and tensor contractions, for example). It's coordinated with OpenMP and MPI (or equivalents; see my other comment on the software stack). The results of solves are typically valuable and are stored to disk for later reuse, though that may prove impractical at this scale.
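
For concreteness, here's a minimal sketch of that pattern: hybrid MPI + OpenMP, with each rank owning a piece of the lattice, a stand-in loop where the real solve (with halo exchanges) would go, and the result written to the shared file system with MPI-IO for later reuse. The file name, local volume, and the "solve" body are made-up placeholders, not anything from a production LQCD code.

    /* Hypothetical hybrid MPI + OpenMP skeleton; not from any real code. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define LOCAL_SITES (1 << 20)   /* sites per rank; illustrative only */

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *field = malloc(LOCAL_SITES * sizeof(double));

        /* Stand-in for the linear solve: OpenMP threads work on the
         * local sub-lattice; a real solver would exchange boundary
         * ("halo") data with neighboring ranks via MPI each iteration. */
        #pragma omp parallel for
        for (long i = 0; i < LOCAL_SITES; i++)
            field[i] = (double)rank + 1.0 / (double)(i + 1);

        /* Store the solve result on the shared file system for reuse;
         * every rank writes its slice to one file with collective I/O. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "solves.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * LOCAL_SITES * sizeof(double);
        MPI_File_write_at_all(fh, off, field, LOCAL_SITES, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(field);
        MPI_Finalize();
        return 0;
    }

Launched with something like "mpirun -np 32 --map-by node" and OMP_NUM_THREADS set per node, this gives the usual one-rank-per-node (or per-socket) layout on machines like these.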

I'm not sure how big the HMC jobs get on these new machines. It depends on the size of the lattice, which is chosen for the physics but also tuned for algorithmic speed and for sitting in a sweet spot of the machine's efficiency.


Any idea how it compares to Bridges?


I have read about some difficulties running SLURM on POWER9 systems, so maybe IBM proposed (or insisted on) running their own POWER9-tested scheduler?


Slurm is endian- and word-size-agnostic; there's nothing about the POWER9 platform that would make it insurmountable to run on. There are occasionally some teething issues with NUMA layout and other Linux kernel differences on newer platforms, but those tend to affect everyone and get resolved quickly.

My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.


It could be; I really don't know. SLURM ran the BG/Q (which was PowerPC) very successfully.



