I'm privileged to be an early-science user of Sierra (and Lassen), pursuing lattice QCD calculations while the machine is still in commissioning and acceptance. It's completely absurd. A calculation that took about a year on Titan last year, we reproduced and blew past in about a week on Sierra.
To add: Titan has 1 GPU per node, and we were getting about 300 GFlop/sec/GPU sustained. Sierra has 4 GPUs per node, and we get about 1.5 TFlop/sec/GPU sustained. (Summit has 6 GPUs per node, also about 1.5 TFlop/sec/GPU sustained.) So performance went up by about a factor of 20 on a per-node basis. The large memory is not just luxurious but essential too---in our applications it has really helped compensate for the comparatively minor improvements in the communication fabric.
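To spell out the per-node arithmetic from the figures above (a trivial check, nothing more):

```python
# Per-node sustained throughput, using only the numbers quoted above.
titan_node_tflops = 1 * 0.3    # 1 GPU at ~300 GFlop/s sustained
sierra_node_tflops = 4 * 1.5   # 4 GPUs at ~1.5 TFlop/s sustained each
print(sierra_node_tflops / titan_node_tflops)  # -> 20.0, the "about 20x" per-node factor
```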
What’s it like to program for it? Do you need to manage the distribution and utilization per node yourself, like a normal HPC farm? Or is it more like a single computer, like a Cray?
So your process is limited to the resources of a node right? And coordinating data between jobs is via a shared file system or network messaging between nodes?
Depending on the step, a LQCD calculation might run on 4, 8, 16, 32, or even more nodes (linear solves and tensor contractions, for example). It's coordinated with OpenMP and MPI (or equivalent; see my other comment on the software stack). The results from solves are typically valuable and are stored to disk for later reuse. That may prove impractical at this scale.
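For a flavor of what that multi-node coordination looks like, here's a minimal mpi4py sketch of a job where each rank owns a slab of the lattice and exchanges boundary data with its neighbors; it's illustrative only, not the actual QMP/chroma code, and the array shape is made up.

```python
# Toy sketch of MPI coordination for a distributed lattice (not the real QMP/chroma code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns a slab of the lattice (shape here is purely illustrative).
local = np.random.rand(16, 16, 16, 32)
up, down = (rank + 1) % size, (rank - 1) % size  # periodic neighbors in one direction

# Exchange a boundary slice ("halo") so each rank can apply a nearest-neighbor stencil.
halo = np.empty_like(local[0])
comm.Sendrecv(np.ascontiguousarray(local[-1]), dest=up,
              recvbuf=halo, source=down)
```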
I'm not sure how big the HMC jobs get on these new machines---it depends on the size of the lattice (which is chosen for the physics, but also for algorithmic speed and for sitting in a good spot for the efficiency of the machine).
Slurm is endian- and word-size-agnostic; there's nothing about the POWER9 platform that would be insurmountable for it to run on. There are occasionally some teething issues with NUMA layout and other Linux kernel differences on newer platforms, but these tend to affect everyone and to get resolved quickly.
My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.
Likely fast interconnects and higher performance (bare metal vs. VMs). Cloud instances tend to have much less consistent performance due to oversubscription of the physical hardware, so you get high "jitter". High-performance computing systems (like this) tend to have a better grasp of the required resources and can push the hardware to the max without oversubscribing too much.
The biggest difference, however, is that the goal for supercomputers like this is as high an average utilization as feasible. The cloud is abysmal for this; it's better suited to bursting to, say, 100k-CPU-core jobs. For systems like this, they want the average utilization to be 80%+ all the time. The cost of constant cloud computing at this scale, even using reserved instances, would be a multiple of the $162 million it cost to build this. Also, the IO patterns you'll see for large amounts of data like this (almost certainly many petabytes) aren't nearly as cost-effective in the cloud as hiring a team to build it yourself.
Not being in industry, I forget that AWS doesn't have 100% utilization 100% of the time. One of the reasons the lab likes us lattice QCD people is that we always have more computing to do, and are happy to wait in the queue if we get to run at all. So we really help keep the utilization very high. If the machine ever sits empty, that's a waste of money.
You're right that the IO tends to be very high performance and high throughput, too.
Yup, totally understood. We have a much smaller (but still massive) supercomputer for $REAL_JOB, where there is always a queue of embarrassingly parallel jobs or ranks and ranks of MPI work to do. When we add more resources, the users can simply run their work faster, but it never really stops no matter how much hardware we add.
As much as people love to hate them, I'd love to see you get IO profiles remotely similar to what you can get with Lustre or Spectrum Scale (gpfs). They're simply in an entirely different ballpark compared to anything in any public cloud.
We're lucky in the sense that the IO for LQCD is small (compared to other scientific applications), in that we're usually only reading or writing gigabytes to terabytes. But also our code uses parallel HDF5, and it's someone else's job to make sure that works well :)
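For anyone curious what "parallel HDF5" looks like from user code, here's a rough sketch with h5py + mpi4py (h5py has to be built with MPI support); the file and dataset names are made up, and the real I/O layer in these codes is C/C++.

```python
# Collective parallel-HDF5 write: every rank writes its own slice of one shared file.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

with h5py.File("propagator.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("solution", shape=(comm.Get_size(), 1024), dtype="f8")
    dset[rank, :] = np.random.rand(1024)  # each rank writes a disjoint row
```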
To add to this, the jitter becomes more and more important the larger the scale you are running. The whole calculation (potentially the whole machine) is sitting around waiting for the slowest single task to finish up. Thus you get counterintuitive architectures, like a dedicated core just to handle networking operations. It seems like a bad deal to throw away 6% of your computation, but the alternative is even worse utilization. Highly coupled calculations are a very different beast because they cannot be executed out of sequence.
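A toy way to see why the tail dominates, under the made-up assumption that per-task times are roughly Gaussian with a little jitter:

```python
# Bulk-synchronous step time is the max over all tasks, so the jitter tail dominates at scale.
import random

def step_time(n_tasks, mean=1.0, jitter=0.05):
    return max(random.gauss(mean, jitter) for _ in range(n_tasks))

for n in (1, 100, 10_000):
    avg = sum(step_time(n) for _ in range(200)) / 200
    print(n, round(avg, 3))  # per-step time creeps up with the number of tasks
```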
Ha, I just replied to your other message asking for an AWS comparison. The answer is yes, the communication has to be substantially better than anything you can order up.
LLNL also has national security concerns that are unparalleled by most AWS applications ;)
Supercomputers tend to need very high bisection bandwidth.
With the Clos network style topologies that are commonplace in large data centers today, I'm not sure one couldn't achieve decent results in the public cloud.
AWS networking is pretty terrible, but in GCP I can get 2 Gbps per core, up to 16 Gbps for an 8-core instance. For any bare-metal deployment, I'm going to be maxed out around 100 Gbps, which will be close to saturating an x16 PCIe bus.
It's hard to find a dual-CPU, frequency-optimized processor with fewer than 8 cores, and I'm not sure that'd be cost-effective. With hyperthreading, that yields 32 usable cores, or around 3.125 Gbps per core.
Even still, I wager they'd go for better density.
Also, I can get 8 GPUs along with that 8-core/16 Gbps instance in GCP. Sounds totally doable to me.
My back-of-the-napkin calculation says that I can get 270 petaflops with 2160 n1-highmem-16 instances, each with 8 V100 GPUs, on preemptible pricing, costing roughly $13k/hr or about $10M/mo.
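Roughly checking that napkin math (all figures are the rough assumptions above, taking V100 single-precision peak as about 15.7 TFLOPS):

```python
# Back-of-the-napkin check of the cloud numbers quoted above (rough assumptions only).
gpus = 2160 * 8                   # 2160 n1-highmem-16 instances, 8 V100s each
peak_per_gpu_tflops = 15.7        # assumed V100 single-precision peak
total_pflops = gpus * peak_per_gpu_tflops / 1000
monthly_cost = 13_000 * 24 * 30   # quoted ~$13k/hr preemptible price
print(round(total_pflops), monthly_cost)  # -> ~271 PFLOPS, ~$9.4M/month
```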
So at roughly $120M/yr, in less than two years you'd exceed the price of the whole thing, and you'd also likely get worse interconnect speed and possibly worse raw computational speed.
If you have a constant load, on-premise is always cheaper. Scientific computation has, for a large enough organization, a 100% load.
If you have a variable load, cloud infrastructure may make sense if you can easily auto-scale.
In my experience, most real-world business applications are multi-tiered applications with variable loads and hence are a good fit for cloud infrastructure.
However, attaining the required application flexibility and KPIs for efficient auto-scaling is quite hard and requires strong functional & technical expertise.
My experience totally reflects this. Most enterprise IT infrastructure is idle the majority of the time.
I'm running infrastructure for a SaaS app in k8s. I feel like I'm doing well sustaining >50% efficiency, i.e. all cores running >50% all the time and more than half the memory consumed for things that aren't page cache. Hard to get better efficiency without creating hot spots.
Cloud is great when you have variable usage. These machines are probably driven near 100% all the time. In that scenario, they are probably more cost-effective than cloud infrastructure.
People use the stack in a variety of different ways---I'll describe my own usage.
There's a message-passing abstraction layer, QMP, sitting over MPI or SMP or what have you (you can compile for your laptop for development purposes, for example). This keeps most of the later layers relatively architecture agnostic.
Over that sits QDP, the data parallel library. Here's where the objects we discuss in quantum field theory are defined. We almost always work on regular lattices. QDP also contains things like "shove everybody one site over in the y direction" (for example).
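As a loose illustration of that kind of data-parallel shift (in numpy on a toy lattice, rather than the actual QDP C++ API):

```python
# "Shove everybody one site over in the y direction" on a toy lattice field.
import numpy as np

field = np.random.rand(8, 8, 8, 16)        # toy field on an (x, y, z, t) lattice
shifted = np.roll(field, shift=1, axis=1)  # every site takes the value of its y-neighbor
```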
Finally, there's the physics/application layer, where the physics algorithms live. I am most familiar with chroma. QUDA is the GPU library and can talk to most application-layer libraries and has at least simple solvers for most major discretizations people want to use (it also has fancier solvers such as multigrid methods for some discretizations).
Code in chroma by and large looks like physics equations, if you had a pain-in-the-ass pedantic student who didn't understand any abuse of notation.
Chroma can be used as a library, so that for your particular project you can do nonstandard things while leveraging everything it can already do.
Other physics layers include CPS, which grew out of the effort at Columbia with QCDSP/QCDOC, MILC (really optimized code for staggered fermions), and others.
As for efficiency: it depends on exactly what code you use and what your problem is. BAGEL, which included hand-coded assembly (http://usqcd-software.github.io/bagel_qdp/), was getting something like 25% of peak, sustained, for linear solves on the BG/Q.
On a POWER8/NVIDIA P100 machine I know QUDA gets 20% of peak, sustained.
I know some Gordon Bell finalists have used these tensor cores for a dramatic speedup. As far as I know, there isn't any QCD code yet. NVIDIA has a small team of developers who write libraries to make QCD screamingly fast https://github.com/lattice/quda/ ; I don't know if/where the tensor cores may be most effectively leveraged.
"It’s not just powerful, it has a stunning memory. There’s enough storage space to hold every written work of humanity, in all languages – twice."
I really wish they just said it has XX TB of memory
Maybe I'm grossly underestimating how much data all human written works would actually occupy... but that sounds like the amount of data I could put on a home NAS (frankly I would have guessed my laptop hard drive until reading that comparison).
"every written work of humanity X times" and "weighs as much as X elephants"? What is this 1895? How about Y hours of HD video and weighs as much as Y cars.
Certificates issued by GeoTrust, RapidSSL, Symantec, Thawte, and VeriSign are no longer considered safe because these certificate authorities failed to follow security practices in the past. Error code: MOZILLA_PKIX_ERROR_ADDITIONAL_POLICY_CONSTRAINT_FAILED
Though it's a bit fuzzy. This article is about "Sierra." The Department of Energy awarded $325 million to build two supercomputers, "Summit" and "Sierra", both at about 150 petaflops. IBM took $325M, so it's $162.5M until someone corrects me.
However, the actual proposal it's under (CORAL) included more than that, approaching almost $2 billion in the RFP. They also added $100M for R&D. These machines are considered NRE (non-recurring engineering) projects and contain more than just the hardware; each is estimated to be budgeted at about $400M per system (operational expenses, etc.).
It would be nice if the article had indicated 1) the incremental improvement over renting ~200K cores and similar memory from AWS, and 2) the cost of this behemoth. I assume there is a significant advantage -- it would be nice to know how much and at what cost.
The cost of Amazon for this type of thing is only useful to burst to 200K cores, this is a supercomputer, which will be heavily used all the time. From a pure economics standpoint, the cloud makes zero sense for this sort of thing.
Also, the performance of 200K cores on Amazon in VMs compared to 200K physical cores is a lot different. These HPC systems are designed to eke out every last 3-5% of performance from the entire thing, something that you simply cannot do even if you try using virtual machines or the cloud.
I maintain a much (much much much much.....much) smaller HPC system at my company and this seems to be a quarterly battle I'm beyond sick of.
Every time a new IT manager sees our costs they immediately declare 'I can get you that on the cloud for a fraction of the cost' and starts trying to decommission the system. Never mind the 20-100x increase in solve time and the multi-TB a day data transfer required. Waste 10 hours in meetings, stave it off once again, and gear up for the next round in January...
It's not remotely a fair comparison, as typical scientific computing requires high-performance communication to run with any modicum of efficiency. It's not just a matter of scale---the architecture of a machine like this differs dramatically from AWS of a similar size. [Never mind the national security concerns.]
Generally InfiniBand [1] or Omni-Path [2] interconnects between the compute nodes - extremely high-throughput and low-latency communication, on the order of 100 Gb/s and 1 microsecond. These problems aren't trillions of completely independent operations, which might work for cloud/distributed architectures. For these types of calculations, compute node 6523 needs to know the data coming from compute node 2447 in near real time to continue working.
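For a sense of scale, the usual way to measure that latency is a simple MPI ping-pong between two ranks; here's a crude mpi4py sketch (the script name and repetition count are arbitrary). On a fabric like the ones above you'd expect round trips on the order of a microsecond, versus the hundreds of microseconds quoted elsewhere in this thread for cloud networking.

```python
# Crude round-trip latency probe between ranks 0 and 1
# (run with: mpirun -np 2 python pingpong.py).
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype="b")
reps = 10_000

comm.Barrier()
t0 = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = time.perf_counter() - t0

if rank == 0:
    print("average round trip:", elapsed / reps * 1e6, "microseconds")
```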
Amazon does not have a product that competes with this; they don't have a low-latency network. Without that, you're limited to coarse-grained parallelism. Supercomputers like this are built for getting near-peak performance across the entire machine by using a low-latency network to accelerate tightly coupled codes.
Modern supercomputers aren't particularly expensive: several tens of millions for the capital cost of the machine, several tens of millions for the storage system, several tens of millions for the space, several tens of millions for the power, and a few million for the support staff. In this case, Sierra probably cost $100M for the base machine and storage services.
What is the distinction between low latency networking commonly used in supercomputers today and cloud computing?
What is the cutoff?
I've commonly seen 500 microsecond latency between certain AWS zones. That's pretty impressive. Inside a given zone, 200-300 microseconds isn't uncommon.
Many HPC codes work on big data in lockstep on all nodes, so virtualization risks a slowdown every time one node is preempted in shared virtualization hosting.
Interesting thing about Summit and Sierra: the CPUs and GPUs are connected via NVLink rather than PCIe, which reduces the well-known cost to move data to the GPU.
I know it’s stupid but I wonder what it would be like to play a video-game that fully utilized this super-computer. I imagine an open world rpg where there are no loading screens and where every npc ai is running all the time. It would be closer to an interactive simulation than to a game. Fun to think about.
The compute platform overview pages [0] list all the HPC systems at LLNL. It's an incredible list, with a huge amount of TFLOPS, GBs, CPU and GPU cores available. The systems range from small 50 TFLOP, 20 node Xeon CPU Linux clusters up to the multi PetaFLOP POWER9 and Nvidia GPU monsters with millions of cores. Not sure what the total compute power available across all lab systems would be, although I guess it will be dominated by recent installations like Sierra - maybe 250 PFLOPS? Anyway, impressive engineering....
Is there a bragging-rights thing to this? Not wanting to be left behind by China? Seriously, it seems like an arms race--which is cool by me; the USA has the money, and computers are it.
Will this computer be used close to 100%, or was the old supercomputer just not enough?
* Electronic Design Automation (ex: mathematically proving chips are correct)
* Protein folding: looking for new chemicals for medicine.
Etc. etc.
There's a huge need to grow our supercomputer capacity. I bet you that every major field has a use for a super-computer.
I think there's an element of bragging rights. But the USA buys "practical" supercomputers most of the time. There are designs that push out more FLOPs but are less useful to scientists.
The hugely powerful interconnect and CAPI / NVLink connections on this supercomputer demonstrate how "practical" the device is. Most people are RAM constrained, or message-constrained, and these are the biggest and best interconnects available in 2018.
Interconnects are NOT a "bragging" metric; very few people look at them. Most people look at the Linpack benchmark (a pure FLOPs measurement). However, experts can tell when a supercomputer is built with a poor interconnect, purely for "bragging rights" reasons.
These machines typically have many thin clients ("nodes"), each running their own kernel, with network-attached storage (but high-performance storage/interconnect/filesystem technology).
According to https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-sys... , there are 4320 such "nodes", so there are 4320 Linux kernels running? That means some coordinating kernels are scheduling the jobs on the other kernels, which sounds suboptimal.
According to https://access.redhat.com/articles/rhel-limits , RHEL 7 on a POWER system can "only" manage 32TB of memory (the whole system has more than a petabyte of memory), assuming they don't run a modified version. So there is definitely more than one OS running, I guess.
Still, the goal of an OS, especially a Unix-like "time-sharing" OS, is to manage resources. I wonder how hard it would be for the Linux kernel to manage the entire system (with some virtual devices to aggregate all the nodes into one system, even if they are on opposite sides of the room), and whether all the code developed for scheduling could be reused at that scale.
The problem is less one of scheduling and more of communication. Typically all nodes will run (facilitated by a program like mpirun or mpiexec) the same executable, which is launched simultaneously under each OS. This is done inside a job scheduler, and the resulting host list defines the communication pattern.
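The SPMD model being described is literally this: every node runs the same program, and behavior differs only by rank. A tiny mpi4py sketch (the script name is made up; on machines like this the launch goes through the batch scheduler rather than a bare mpirun):

```python
# Every rank runs this same script; only the rank number differs.
# Launch e.g. with: mpirun -np 4 python spmd_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
print(f"rank {rank} of {size} running on {MPI.Get_processor_name()}")
```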
By the way, even systems completely across the room may be "close" to each other, depending on the topology of the system. You can imagine this being important for solving a physical system which is periodic in certain dimensions, so there should be little interconnect distance between physically-distant nodes.
At that scale, I find it odd that people didn't come up with an optimized OS that sees and manages all the resources and also plans the communication between them.
It looks more like a private data center with 4000 dedicated machines on the same network running distributed algorithms than a "single supercomputer". Are we just "wow-ing" at what is basically a data center here?
For HPC applications that actually need low-latency coordination between nodes, the application code itself manages communication. The communication can't be better optimized by the OS.
If you have an embarrassingly parallel problem, it will run well on this machine but it will also be a waste of the machine's expensive design. Embarrassingly parallel problems run just as well on generic data center hardware. This machine is built for problems that only parallelize effectively with low-latency coordination between nodes. Such problems come up a lot in scientific/engineering simulations but are comparatively rare in general purpose computing environments. General purpose nodes in a cloud computing environment cannot run some of the harder problems this machine runs, at any price. For any non-trivial parallel computing job there comes a crossover point where adding more nodes makes the total time-to-solution longer rather than shorter. This point comes a lot sooner if you don't have dedicated high-bandwidth, low-latency interconnects between nodes.
Precisely: if we are talking about low latency, why would we let the communication go from the application (in userland) to the kernel, then to the network stack, then be received by the kernel on another node, and then finally be received again by the application in userland? I would imagine, as a first guess, that bypassing the several Linux kernels and directly accessing the remote hardware would be mandatory for the best low latency.
If there is some info on the internet about the software stack/architecture of the entire system, I would like to read up on it. I haven't explored all the links I posted above yet.
I'm nowhere near an expert, and HPC is a really specific use case, but there are surely interesting bits to learn from it.