To add: Titan has 1 GPU per node, and we were getting about 300 GFlop/sec/GPU sustained. Sierra has 4 GPUs per node, and we get about 1.5 TFlop/sec/GPU sustained. (Summit has 6 GPUs per node, also at about 1.5 TFlop/sec/GPU sustained.) So from Titan to Sierra, sustained performance went up by a factor of about 20 on a per-node basis. The large memory is not just a luxury but essential too: in our applications it has really helped compensate for the comparatively minor improvements in the communication fabric.
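
For anyone who wants to check the per-node arithmetic, here is a minimal sketch in Python. It uses only the rounded figures quoted above; the dictionary layout and the function names (`per_node_tflops`, `speedup_over_titan`) are made up for illustration and are not part of any benchmark code.

```python
# Rounded sustained figures quoted in the comment above (not new measurements).
# Throughput is in TFlop/sec per GPU; 300 GFlop/sec = 0.3 TFlop/sec.
machines = {
    "Titan": {"gpus_per_node": 1, "tflops_per_gpu": 0.3},
    "Sierra": {"gpus_per_node": 4, "tflops_per_gpu": 1.5},
    "Summit": {"gpus_per_node": 6, "tflops_per_gpu": 1.5},
}


def per_node_tflops(name):
    """Sustained TFlop/sec per node = GPUs per node * TFlop/sec per GPU."""
    m = machines[name]
    return m["gpus_per_node"] * m["tflops_per_gpu"]


def speedup_over_titan(name):
    """Ratio of a machine's per-node throughput to Titan's."""
    return per_node_tflops(name) / per_node_tflops("Titan")


if __name__ == "__main__":
    for name in machines:
        print(
            f"{name}: {per_node_tflops(name):.1f} TFlop/sec/node, "
            f"{speedup_over_titan(name):.0f}x Titan"
        )
```

With these inputs the Titan-to-Sierra ratio works out to about 20, and the same calculation for Summit gives about 30, since it has 6 GPUs per node at the same per-GPU rate.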