I'm privileged to be an early-science user of Sierra (and Lassen), pursuing lattice QCD calculations while the machine is still in commissioning and acceptance. It's completely absurd. A calculation that took about a year on Titan last year, we reproduced and blew past in about a week on Sierra.
To add: Titan has 1 GPU per node, and we were getting about 300 GFlop/sec/GPU sustained. Sierra has 4 GPUs per node, and we get about 1.5 TFlop/sec/GPU sustained. (Summit has 6 GPUs per node, also about 1.5 TFlop/sec/GPU sustained.) So performance went up by about a factor of 20 on a per-node basis. The large memory is not just luxurious but essential too---in our applications it has really helped compensate for the comparatively minor improvements in the communication fabric.
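To spell out the per-node arithmetic from the figures above (a trivial check, nothing more):

```python
# Per-node sustained throughput, using only the numbers quoted above.
titan_node_tflops = 1 * 0.3    # 1 GPU at ~300 GFlop/s sustained
sierra_node_tflops = 4 * 1.5   # 4 GPUs at ~1.5 TFlop/s sustained each
print(sierra_node_tflops / titan_node_tflops)  # -> 20.0, the "about 20x" per-node factor
```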
What’s it like to program for it? Do you need to manage the distribution and utilization per node yourself, like a normal HPC farm? Or is it more like a single computer, like a Cray?
So your process is limited to the resources of a node right? And coordinating data between jobs is via a shared file system or network messaging between nodes?
Depending on the step, a LQCD calculation might run on 4, 8, 16, 32, or even more nodes (linear solves and tensor contractions, for example). It's coordinated with OpenMP and MPI (or equivalent; see my other comment on the software stack). The results from solves are typically valuable and are stored to disk for later reuse. That may prove impractical at this scale.
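For a flavor of what that multi-node coordination looks like, here's a minimal mpi4py sketch of a job where each rank owns a slab of the lattice and exchanges boundary data with its neighbors; it's illustrative only, not the actual QMP/chroma code, and the array shape is made up.

```python
# Toy sketch of MPI coordination for a distributed lattice (not the real QMP/chroma code).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns a slab of the lattice (shape here is purely illustrative).
local = np.random.rand(16, 16, 16, 32)
up, down = (rank + 1) % size, (rank - 1) % size  # periodic neighbors in one direction

# Exchange a boundary slice ("halo") so each rank can apply a nearest-neighbor stencil.
halo = np.empty_like(local[0])
comm.Sendrecv(np.ascontiguousarray(local[-1]), dest=up,
              recvbuf=halo, source=down)
```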
I'm not sure how big the HMC jobs get on these new machines---it depends on the size of the lattice (which is chosen for the physics, but also for algorithmic speed and for sitting in a good spot for the efficiency of the machine).
Slurm is endian- and word-size-agnostic; there's nothing about the POWER9 platform that would be insurmountable for it to run on. There are occasionally some teething issues with NUMA layout and other Linux kernel differences on newer platforms, but these tend to affect everyone and to get resolved quickly.
My understanding is that Spectrum (formerly Platform) LSF was included as part of their proposal.
Likely fast interconnects and higher performance (bare metal vs. VMs). Cloud instances tend to have much less consistent performance due to oversubscription of the physical hardware, so you get high "jitter". High-performance computing systems (like this) tend to have a better grasp of the required resources and can push the hardware to the max without oversubscribing too much.
The biggest difference, however, is that the goal for supercomputers like this is as high an average utilization as feasible. The cloud is abysmal for this; it's better suited to bursting to, say, 100k-CPU-core jobs. For systems like this, they want the average utilization to be 80%+ all the time. The cost of constant cloud computing at this scale, even using reserved instances, would be a multiple of the $162 million it cost to build this. Also, the IO patterns you'll see for large amounts of data like this (almost certainly many petabytes) aren't nearly as cost-effective in the cloud as hiring a team to build it yourself.
Not being in industry, I forget that AWS doesn't have 100% utilization 100% of the time. One of the reasons the lab likes us lattice QCD people is that we always have more computing to do, and are happy to wait in the queue if we get to run at all. So we really help keep the utilization very high. If the machine ever sits empty, that's a waste of money.
You're right that the IO tends to be very high performance and high throughput, too.
Yup, totally understood. We have a much smaller (but still massive) supercomputer for $REAL_JOB, where there is always a queue of embarrassingly parallel jobs or ranks and ranks of MPI work to do. When we add more resources, the users can simply run their work faster, but it never really stops no matter how much hardware we add.
As much as people love to hate them, I'd love to see you get IO profiles remotely similar to what you can get with Lustre or Spectrum Scale (gpfs). They're simply in an entirely different ballpark compared to anything in any public cloud.
We're lucky in the sense that the IO for LQCD is small (compared to other scientific applications), in that we're usually only reading or writing gigabytes to terabytes. But also our code uses parallel HDF5, and it's someone else's job to make sure that works well :)
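For anyone curious what "parallel HDF5" looks like from user code, here's a rough sketch with h5py + mpi4py (h5py has to be built with MPI support); the file and dataset names are made up, and the real I/O layer in these codes is C/C++.

```python
# Collective parallel-HDF5 write: every rank writes its own slice of one shared file.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

with h5py.File("propagator.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("solution", shape=(comm.Get_size(), 1024), dtype="f8")
    dset[rank, :] = np.random.rand(1024)  # each rank writes a disjoint row
```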
To add to this, the jitter becomes more and more important the larger the scale you are running. The whole calculation (potentially the whole machine) is sitting around waiting for the slowest single task to finish up. Thus you get counterintuitive architectures, like a dedicated core just to handle networking operations. It seems like a bad deal to throw away 6% of your computation, but the alternative is even worse utilization. Highly coupled calculations are a very different beast because they cannot be executed out of sequence.
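A toy way to see why the tail dominates, under the made-up assumption that per-task times are roughly Gaussian with a little jitter:

```python
# Bulk-synchronous step time is the max over all tasks, so the jitter tail dominates at scale.
import random

def step_time(n_tasks, mean=1.0, jitter=0.05):
    return max(random.gauss(mean, jitter) for _ in range(n_tasks))

for n in (1, 100, 10_000):
    avg = sum(step_time(n) for _ in range(200)) / 200
    print(n, round(avg, 3))  # per-step time creeps up with the number of tasks
```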
Ha, I just replied to your other message asking for an AWS comparison. The answer is yes, the communication has to be substantially better than anything you can order up.
LLNL also has national security concerns that are unparalleled by most AWS applications ;)
Supercomputers tend to need very high bisection bandwidth.
With the Clos network style topologies that are commonplace in large data centers today, I'm not sure one couldn't achieve decent results in the public cloud.
AWS networking is pretty terrible, but in GCP I can get 2 Gbps per core, up to 16 Gbps for an 8-core instance. For any bare-metal deployment, I'm going to be maxed out around 100 Gbps, which will be close to saturating an x16 PCIe bus.
It's hard to find a dual-CPU, frequency-optimized processor with fewer than 8 cores, and I'm not sure that'd be cost-effective. With hyperthreading, that yields 32 usable cores, or around 3.125 Gbps per core.
Even still, I wager they'd go for better density.
Also, I can get 8 GPUs along with that 8-core/16 Gbps instance in GCP. Sounds totally doable to me.
My back-of-the-napkin calculation says that I can get 270 petaflops with 2160 n1-highmem-16 instances, each with 8 V100 GPUs, on preemptible pricing, costing roughly $13k/hr or about $10M/mo.
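Roughly checking that napkin math (all figures are the rough assumptions above, taking V100 single-precision peak as about 15.7 TFLOPS):

```python
# Back-of-the-napkin check of the cloud numbers quoted above (rough assumptions only).
gpus = 2160 * 8                   # 2160 n1-highmem-16 instances, 8 V100s each
peak_per_gpu_tflops = 15.7        # assumed V100 single-precision peak
total_pflops = gpus * peak_per_gpu_tflops / 1000
monthly_cost = 13_000 * 24 * 30   # quoted ~$13k/hr preemptible price
print(round(total_pflops), monthly_cost)  # -> ~271 PFLOPS, ~$9.4M/month
```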
So at roughly $120M/yr, in less than two years you'd exceed the price of the whole thing, and you'd also likely get worse interconnect speed and possibly worse raw computational speed.
If you have a constant load, on-premise is always cheaper. Scientific computation has, for a large enough organization, a 100% load.
If you have a variable load, cloud infrastructure may make sense if you can easily auto-scale.
In my experience, most real-world business applications are multi-tiered applications with variable loads and hence are a good fit for cloud infrastructure.
However, attaining the required application flexibility and KPIs for efficient auto-scaling is quite hard and requires strong functional & technical expertise.
My experience totally reflects this. Most enterprise IT infrastructure is idle the majority of the time.
I'm running infrastructure for a SaaS app in k8s. I feel like I'm doing well sustaining >50% efficiency, i.e. all cores running >50% all the time and more than half the memory consumed for things that aren't page cache. Hard to get better efficiency without creating hot spots.
Cloud is great when you have variable usage. These machines are probably driven near 100% all the time. In that scenario, they are probably more cost-effective than cloud infrastructure.
People use the stack in a variety of different ways---I'll describe my own usage.
There's a message-passing abstraction layer, QMP, sitting over MPI or SMP or what have you (you can compile for your laptop for development purposes, for example). This keeps most of the later layers relatively architecture agnostic.
Over that sits QDP, the data parallel library. Here's where the objects we discuss in quantum field theory are defined. We almost always work on regular lattices. QDP also contains things like "shove everybody one site over in the y direction" (for example).
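As a loose illustration of that kind of data-parallel shift (in numpy on a toy lattice, rather than the actual QDP C++ API):

```python
# "Shove everybody one site over in the y direction" on a toy lattice field.
import numpy as np

field = np.random.rand(8, 8, 8, 16)        # toy field on an (x, y, z, t) lattice
shifted = np.roll(field, shift=1, axis=1)  # every site takes the value of its y-neighbor
```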
Finally, there's the physics/application layer, where the physics algorithms live. I am most familiar with chroma. QUDA is the GPU library and can talk to most application-layer libraries and has at least simple solvers for most major discretizations people want to use (it also has fancier solvers such as multigrid methods for some discretizations).
Code in chroma by and large looks like physics equations, if you had a pain-in-the-ass pedantic student who didn't understand any abuse of notation.
Chroma can be used as a library, so that for your particular project you can do nonstandard things while leveraging everything it can already do.
Other physics layers include CPS, which grew out of the effort at Columbia with QCDSP/QCDOC, MILC (really optimized code for staggered fermions), and others.
As for efficiency: it depends on exactly what code you use and what your problem is. BAGEL, which included hand-coded assembly (http://usqcd-software.github.io/bagel_qdp/), was getting something like 25% of peak, sustained, for linear solves on the BG/Q.
On a POWER8/NVIDIA P100 machine I know QUDA gets 20% of peak, sustained.
I know some Gordon Bell finalists have used these tensor cores for a dramatic speedup. As far as I know, there isn't any QCD code yet. NVIDIA has a small team of developers who write libraries to make QCD screamingly fast https://github.com/lattice/quda/ ; I don't know if/where the tensor cores may be most effectively leveraged.
"It’s not just powerful, it has a stunning memory. There’s enough storage space to hold every written work of humanity, in all languages – twice."
I really wish they just said it has XX TB of memory
Maybe I'm grossly underestimating how much data all human written works would actually occupy... but that sounds like the amount of data I could put on a home NAS (frankly I would have guessed my laptop hard drive until reading that comparison).
"every written work of humanity X times" and "weighs as much as X elephants"? What is this 1895? How about Y hours of HD video and weighs as much as Y cars.
Certificates issued by GeoTrust, RapidSSL, Symantec, Thawte, and VeriSign are no longer considered safe because these certificate authorities failed to follow security practices in the past. Error code: MOZILLA_PKIX_ERROR_ADDITIONAL_POLICY_CONSTRAINT_FAILED
Though it's a bit fuzzy. This article is about "Sierra." The Department of Energy awarded $325 million to build two supercomputers, "Summit" and "Sierra", both at about 150 petaflops. IBM took $325M, so it's $162.5M until someone corrects me.
However, the actual proposal it's under (CORAL) included more than that, approaching almost $2 billion in the RFP. They also added $100M for R&D. These machines are considered NRE (non-recurring engineering) projects and contain more than just the hardware; each is estimated to be budgeted at about $400M per system (operational expenses, etc.).
It would be nice if the article had indicated 1) the incremental improvement over renting ~200K cores and similar memory from AWS, and 2) the cost of this behemoth. I assume there is a significant advantage -- it would be nice to know how much and at what cost.
The cost of Amazon for this type of thing is only useful to burst to 200K cores, this is a supercomputer, which will be heavily used all the time. From a pure economics standpoint, the cloud makes zero sense for this sort of thing.
Also, the performance of 200K cores on Amazon in VMs compared to 200K physical cores is a lot different. These HPC systems are designed to eke out every last 3-5% of performance from the entire thing, something that you simply cannot do even if you try using virtual machines or the cloud.
I maintain a much (much much much much.....much) smaller HPC system at my company and this seems to be a quarterly battle I'm beyond sick of.
Every time a new IT manager sees our costs they immediately declare 'I can get you that on the cloud for a fraction of the cost' and starts trying to decommission the system. Never mind the 20-100x increase in solve time and the multi-TB a day data transfer required. Waste 10 hours in meetings, stave it off once again, and gear up for the next round in January...
It's not remotely a fair comparison, as typical scientific computing requires high-performance communication to run with any modicum of efficiency. It's not just a matter of scale---the architecture of a machine like this differs dramatically from AWS of a similar size. [Never mind the national security concerns.]
Generally InfiniBand [1] or Omni-Path [2] interconnects between the compute nodes - extremely high-throughput and low-latency communication, on the order of 100 Gb/s and 1 microsecond. These problems aren't trillions of completely independent operations, which might work for cloud/distributed architectures. For these types of calculations, compute node 6523 needs to know the data coming from compute node 2447 in near real time to continue working.
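For a sense of scale, the usual way to measure that latency is a simple MPI ping-pong between two ranks; here's a crude mpi4py sketch (the script name and repetition count are arbitrary). On a fabric like the ones above you'd expect round trips on the order of a microsecond, versus the hundreds of microseconds quoted elsewhere in this thread for cloud networking.

```python
# Crude round-trip latency probe between ranks 0 and 1
# (run with: mpirun -np 2 python pingpong.py).
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype="b")
reps = 10_000

comm.Barrier()
t0 = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = time.perf_counter() - t0

if rank == 0:
    print("average round trip:", elapsed / reps * 1e6, "microseconds")
```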
Amazon does not have a product that competes with this; they don't have a low-latency network. Without that, you're limited to coarse-grained parallelism. Supercomputers like this are built for getting near-peak performance across the entire machine by using a low-latency network to accelerate tightly coupled codes.
Modern supercomputers aren't particularly expensive: several tens of millions for the capital cost of the machine, several tens of millions for the storage system, several tens of millions for the space, several tens of millions for the power, and a few million for the support staff. In this case, Sierra probably cost $100M for the base machine and storage services.
What is the distinction between low latency networking commonly used in supercomputers today and cloud computing?
What is the cutoff?
I've commonly seen 500 microsecond latency between certain AWS zones. That's pretty impressive. Inside a given zone, 200-300 microseconds isn't uncommon.
Many HPC codes work on big data in lockstep on all nodes, so virtualization risks a slowdown every time one node is preempted in shared virtualization hosting.
Interesting thing about Summit and Sierra: the CPUs and GPUs are connected via NVLink rather than PCIe, which reduces the well-known cost to move data to the GPU.
I know it’s stupid but I wonder what it would be like to play a video-game that fully utilized this super-computer. I imagine an open world rpg where there are no loading screens and where every npc ai is running all the time. It would be closer to an interactive simulation than to a game. Fun to think about.
The compute platform overview pages [0] list all the HPC systems at LLNL. It's an incredible list, with a huge amount of TFLOPS, GBs, CPU and GPU cores available. The systems range from small 50 TFLOP, 20 node Xeon CPU Linux clusters up to the multi PetaFLOP POWER9 and Nvidia GPU monsters with millions of cores. Not sure what the total compute power available across all lab systems would be, although I guess it will be dominated by recent installations like Sierra - maybe 250 PFLOPS? Anyway, impressive engineering....
Is there a bragging-rights thing to this? Not wanting to be left behind by China? Seriously, it seems like an arms race--which is cool by me; the USA has the money, and computers are it.
Will this computer be used close to 100%, or was the old supercomputer just not enough?
* Electronic Design Automation (ex: mathematically proving chips are correct)
* Protein folding: looking for new chemicals for medicine.
Etc. etc.
There's a huge need to grow our supercomputer capacity. I bet you that every major field has a use for a super-computer.
I think there's an element of bragging rights. But the USA buys "practical" supercomputers most of the time. There are designs that push out more FLOPs but are less useful to scientists.
The hugely powerful interconnect and CAPI / NVLink connections on this supercomputer demonstrate how "practical" the device is. Most people are RAM constrained, or message-constrained, and these are the biggest and best interconnects available in 2018.
Interconnects are NOT a "bragging" metric; very few people look at them. Most people look at the Linpack benchmark (a pure FLOPs measurement). However, experts can tell when a supercomputer is built with a poor interconnect, purely for "bragging rights" reasons.
These machines typically have many thin clients ("nodes"), each running their own kernel, with network-attached storage (but high-performance storage/interconnect/filesystem technology).
According to https://hpc.llnl.gov/training/tutorials/using-lcs-sierra-sys... , there are 4320 such "nodes", so there are 4320 Linux kernels running? That means some coordinating kernels are scheduling the jobs on the other kernels, which sounds suboptimal.
According to https://access.redhat.com/articles/rhel-limits , RHEL 7 on a POWER system can "only" manage 32TB of memory (the whole system has more than a petabyte of memory), assuming they don't run a modified version. So there is definitely more than one OS running, I guess.
Still, the goal of an OS, especially a Unix-like "time-sharing" OS, is to manage resources. I wonder how hard it would be for the Linux kernel to manage the entire system (with some virtual devices to aggregate all the nodes into one system, even if they are on opposite sides of the room), and whether all the code developed for scheduling could be reused at that scale.
The problem is less one of scheduling and more of communication. Typically all nodes will run (facilitated by a program like mpirun or mpiexec) the same executable, which is launched simultaneously under each OS. This is done inside a job scheduler, and the resulting host list defines the communication pattern.
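The SPMD model being described is literally this: every node runs the same program, and behavior differs only by rank. A tiny mpi4py sketch (the script name is made up; on machines like this the launch goes through the batch scheduler rather than a bare mpirun):

```python
# Every rank runs this same script; only the rank number differs.
# Launch e.g. with: mpirun -np 4 python spmd_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
print(f"rank {rank} of {size} running on {MPI.Get_processor_name()}")
```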
By the way, even systems completely across the room may be "close" to each other, depending on the topology of the system. You can imagine this being important for solving a physical system which is periodic in certain dimensions, so there should be little interconnect distance between physically-distant nodes.
At that scale, I find it odd that people didn't come up with an optimized OS that sees and manages all the resources and also plans the communication between them.
It looks more like a private data center with 4000 dedicated machines on the same network running distributed algorithms than a "single supercomputer". Are we just "wow-ing" at what is basically a data center here?
For HPC applications that actually need low-latency coordination between nodes, the application code itself manages communication. The communication can't be better optimized by the OS.
If you have an embarrassingly parallel problem, it will run well on this machine but it will also be a waste of the machine's expensive design. Embarrassingly parallel problems run just as well on generic data center hardware. This machine is built for problems that only parallelize effectively with low-latency coordination between nodes. Such problems come up a lot in scientific/engineering simulations but are comparatively rare in general purpose computing environments. General purpose nodes in a cloud computing environment cannot run some of the harder problems this machine runs, at any price. For any non-trivial parallel computing job there comes a crossover point where adding more nodes makes the total time-to-solution longer rather than shorter. This point comes a lot sooner if you don't have dedicated high-bandwidth, low-latency interconnects between nodes.
Precisely: if we are talking about low latency, why would we let the communication go from the application (in userland) to the kernel, then to the network stack, then be received by the kernel on another node, and then finally be received again by the application in userland? I would imagine, as a first guess, that bypassing the several Linux kernels and directly accessing the remote hardware would be mandatory for the best low latency.
If there is some info on the internet about the software stack/architecture of the entire system, I would like to read up on it. I haven't explored all the links I posted above yet.
I'm nowhere near an expert, and HPC is a really specific use case, but there are surely interesting bits to learn from it.