User:Joel welling/Limits to Scaling in High Performance Computing
This article addresses the limitations to software running on very large supercomputers. Supercomputing hardware systems have reached very impressive sizes. The Blue Waters system, scheduled for completion in 2011, is designed to sustain a performance of one petaflops across roughly 40,000 eight-core processors. Software performance for these systems, however, has not scaled up nearly as quickly. This is due in part to the increasing frequency of hardware failures as the number of components in the system increases, but also to the shifting balance between computational demand and network demand as a problem of fixed size is distributed across a growing number of processors.
Canonical Supercomputing System
Very large general purpose computing systems generally consist of at least the following components:
- Local computing nodes consisting of one or more multi-core processor sockets, local RAM and a network interface. These nodes may be of uniform architecture (e.g. the Cray XT3) or a mixed hybrid architecture (e.g. IBM Roadrunner).
- An internal network interconnect, for example an InfiniBand fat tree.
- A parallel filesystem for access to disk files, for example Lustre.
- An interface to an external network.
- Possibly separate front-end nodes for user interaction.
Some systems contain special purpose components like GPGPUs or field-programmable gate arrays which provide high floating point performance at relatively low cost; the scaling limits imposed by these features are not addressed in this article. In the case of Grid computing the performance of the external network interface can also limit scaling.
These systems tend to support symmetric multiprocessing at the socket or node level, but not at the full system level. Thus the software environment can support threaded computing (e.g. OpenMP) at the small scale but only message passing (e.g. MPI) at the scale of the full system, as illustrated by the sketch below.
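The following sketch illustrates this hybrid programming model: MPI ranks span the nodes of the machine while OpenMP threads exploit the shared memory within a node. It is a minimal illustrative example rather than a fragment of any real application; the vector length and the dot-product kernel are arbitrary choices.

<syntaxhighlight lang="c">
/* Minimal sketch of the hybrid model described above: MPI ranks span the
 * machine, OpenMP threads exploit shared memory within a node.  The vector
 * length N and the dot-product kernel are arbitrary illustrative choices. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000  /* elements owned by each MPI rank (assumed size) */

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Request an MPI library that tolerates OpenMP threads within a rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Threaded (shared-memory) portion: each OpenMP thread sums a slice. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += a[i] * b[i];

    /* Message-passing portion: combine the per-rank results system-wide. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product over %d ranks = %g\n", nranks, global);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>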
Performance Measures
The oldest and most often cited measure of supercomputer performance is floating point operations per second, or FLOPS, calculated simply as the peak possible collective rate of floating point calculations for all processors in the system. In late 2008 the performance of the fastest general purpose supercomputers crossed the 1 petaflops (10<sup>15</sup> FLOPS) threshold with the construction of the IBM Roadrunner system.
Real programs perform substantially less well, however. While each core of a modern processor may perform 8 single precision floating point operations per clock, the bandwidth limits of the node's RAM support only a fraction of that rate. The performance of a particular program thus depends on the degree to which its data resides in cache memory, and this varies by program and by problem size. It is therefore generally preferable to test system performance with benchmarks. Commonly used benchmarks include the HPL benchmark used by the Top500 rating of supercomputers, the HPC Challenge Benchmark Suite and the benchmark set used by the National Science Foundation (NSF) in soliciting supercomputer procurements. Each places different demands on the computing, memory and interconnect components of a supercomputer, and thus each will perform best on a different hypothetical architecture. The improvement in performance hoped for as the number of processors in the system increases is known as scaling, and the limits of performance for a particular application or benchmark are set by the processor count at which it ceases to scale.
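The effect of memory bandwidth on achieved performance can be estimated with a simple roofline-style calculation, sketched below. The peak rate and bandwidth figures used here are assumptions chosen only for illustration; they do not describe any particular processor.

<syntaxhighlight lang="c">
/* Back-of-the-envelope "roofline" estimate of the effect described above:
 * a kernel's achievable rate is the lesser of the peak floating point rate
 * and (memory bandwidth x arithmetic intensity).  The peak and bandwidth
 * figures below are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double peak_gflops = 10.0;   /* assumed per-core peak, GFLOP/s */
    double bw_gbytes   = 5.0;    /* assumed per-core share of RAM bandwidth, GB/s */

    /* STREAM-triad-like kernel a[i] = b[i] + s*c[i]:
     * 2 flops per iteration, 24 bytes moved (read b, read c, write a). */
    double intensity  = 2.0 / 24.0;            /* flops per byte */
    double mem_bound  = bw_gbytes * intensity; /* GFLOP/s if bandwidth-limited */
    double achievable = mem_bound < peak_gflops ? mem_bound : peak_gflops;

    printf("triad bound: %.2f of %.2f GFLOP/s (%.0f%% of peak)\n",
           achievable, peak_gflops, 100.0 * achievable / peak_gflops);
    return 0;
}
</syntaxhighlight>

For a low-arithmetic-intensity kernel such as the triad loop modeled here, the estimate is bandwidth-bound and lands well below peak, which is the gap described above.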
Weak Scaling vs. Strong Scaling
Under weak scaling the problem size per node is held constant as the number of nodes scales up. For example, a simulation might use more processors to model a larger volume while holding the spatial resolution constant, or simulate the same volume with increased spatial and temporal resolution (for example see [1]). In this situation the demands on computation, memory access and network communication for a node remain relatively constant. Conversely, under strong scaling the number of computing nodes is increased while the problem size is held constant. This reduces the amount of computation per processor. An example would be simulation of molecular dynamics, where the researcher hopes to simulate the same molecule in less wall clock time[2]. It is generally agreed that it is more difficult to achieve strong scaling than weak scaling.
Since the communication requirements for a processor typically scale linearly with the surface area of the sub-volume it simulates while the computational requirements scale linearly with the volume, the ratio of communication to computation increases with node count under strong scaling. The application ceases to scale when there is no longer enough computation per node to amortize the time spent on communication. Thus the interconnect bandwidth and latency of a system tend to have a strong impact on applications under strong scaling. The memory requirements per node decrease under strong scaling, however. As the data size per node decreases, more of the data fits in the processor's cache, bandwidth to RAM becomes less of a bottleneck and the per-processor performance improves. This effect is sometimes called cache condensation. The implication of these trends is that performance under strong scaling is limited by the ratio of communication to computation, while under weak scaling performance tends to be limited by global effects like the time to do collective operations or perform Fourier transforms.
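The surface-to-volume argument can be made explicit for an idealized three-dimensional domain decomposition. Writing <math>N</math> for the total problem size and <math>p</math> for the processor count (symbols introduced here only for illustration), each processor holds <math>N/p</math> cells and communicates across a surface proportional to <math>(N/p)^{2/3}</math>, so that

<math>
T_{\mathrm{comp}} \propto \frac{N}{p}, \qquad
T_{\mathrm{comm}} \propto \left(\frac{N}{p}\right)^{2/3}, \qquad
\frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \left(\frac{p}{N}\right)^{1/3}.
</math>

Under strong scaling <math>N</math> is fixed and the ratio grows as <math>p^{1/3}</math>; under weak scaling <math>N/p</math> is held constant and the ratio does not grow with processor count.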
Reliability of Very Large Systems
Some limits to scaling arise from the hardware layer and are largely independent of the specific program being executed. A single computing node may be extremely reliable, but given enough nodes the frequency of random failures can become catastrophic. If a single node has a mean time between failures (MTBF) of one year, a system consisting of 40,000 such nodes will have an MTBF of roughly 13 minutes. Any program running on such a system must somehow survive these failures.
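Assuming that node failures are independent and occur at a constant rate, the arithmetic behind this estimate is simply

<math>
\mathrm{MTBF}_{\mathrm{system}} \approx \frac{\mathrm{MTBF}_{\mathrm{node}}}{n}
= \frac{525{,}600\ \mathrm{minutes}}{40{,}000} \approx 13\ \mathrm{minutes}.
</math>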
The usual survival strategy is checkpointing, in which a copy of the application's working memory is saved to disk at frequent intervals. This allows the application to be restarted after the failure so that progress can continue. Obviously the time to save and restore the checkpoint data must not exceed the time between failures; increasing checkpoint time reduces the fraction of system uptime available to make progress on the application. Improved IO bandwidth reduces checkpoint time and increases the fraction of wall clock time available for actual calculations. The demand for bandwidth for checkpointing drives the configuration of IO systems for large machines. Using a simplified model, John W. Young produced an expression for the optimal checkpointing interval in 1974[3], specifically:
<math>\tau_{\mathrm{opt}} = \sqrt{2 \delta M}</math>
where <math>\tau_{\mathrm{opt}}</math> is the time interval between checkpoints, <math>\delta</math> is the time to save a checkpoint and <math>M</math> is the MTBF for the system. More complete models have since been developed[4].
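To give a sense of the magnitudes involved, the sketch below applies Young's approximation to an assumed checkpoint write time and system MTBF and estimates the fraction of wall clock time left for useful computation. The input figures and the simple loss model are illustrative assumptions only, not data for any real machine.

<syntaxhighlight lang="c">
/* Rough estimate of checkpoint overhead using Young's first-order
 * approximation tau_opt = sqrt(2 * delta * M).  The checkpoint time and
 * MTBF below are illustrative assumptions; the loss model is deliberately
 * crude (on average half an interval plus one restore is lost per failure). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double delta = 300.0;        /* time to write one checkpoint, seconds (assumed) */
    double mtbf  = 4.0 * 3600.0; /* system mean time between failures, seconds (assumed) */

    double tau = sqrt(2.0 * delta * mtbf); /* optimal interval between checkpoints */

    /* Fraction of time spent writing checkpoints, and expected rework after failures. */
    double checkpoint_frac = delta / (tau + delta);
    double rework_frac     = (tau / 2.0 + delta) / mtbf;
    double useful_frac     = (1.0 - checkpoint_frac) * (1.0 - rework_frac);

    printf("optimal checkpoint interval: %.0f s\n", tau);
    printf("estimated useful fraction of wall clock time: %.1f%%\n",
           100.0 * useful_frac);
    return 0;
}
</syntaxhighlight>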
Fault-tolerant runtime environments can provide an alternative to checkpointing. The notion is that the runtime can detect the failures of nodes and rebuild the lost environment from automatic checkpoints and/or journaled transactions, with minimal intervention from the user's program[5] [6].
Questions of disk reliability must also be addressed, typically through a combination of SANs, RAID and file replication. The mechanisms for fault tolerance and fail-over in the Lustre filesystem provide an example[7].
As the total memory size of a supercomputer increases, so does the likelihood of soft errors. The designers of the machine must take steps to minimize the impact of these errors. Supercomputer memory systems utilize error correcting codes at the hardware level. See for example [8] for a description of these mechanisms in the case of FBDIMM memory.
Algorithmic Efficiency
Does this matter? It doesn't limit strong scaling.
Specific example of cubebench.
The Parallel Fast Fourier Transform
Why it's so common
Standard algorithm; scaling failure occurs in the MPI_Alltoall step. (I have a paper copy of a relevant paper to cite.)
PGAS version
Scaling Limits of Specific Applications
Note that all of these applications scale unusually well.
Notes
- Say something about Amdahl's law, and how that limit is never approached?
- Reference some mean-time-to-failure data?
References
- ^ Some Quake reference?
- ^ Kumar, S., Huang, C., Zheng, G., Bohm, E., Bhatele, A., Phillips, J. C., Yu, H., and Kalé, L. V. 2008. Scalable molecular dynamics with NAMD on the IBM Blue Gene/L system. IBM J. Res. Dev. 52, 1/2 (Jan. 2008), 177-188.
- ^ John W. Young, A First Order Approximation to the Optimum Checkpoint Interval, CACM 17.9 (1974), pp. 530-531
- ^ Wang et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN'05) (IEEE 2005)
- ^ Dewolfs, D., Broeckhove, J., Sunderam, V., Fagg, G. FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study. Lecture Notes in Computer Science, Vol. 4192, Springer Berlin / Heidelberg, 2006, pp. 133-140.
- ^ Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fedak, G., Germain, C., Herault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Neri, V., and Selikhov, A. 2002. MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (Baltimore, Maryland). Conference on High Performance Networking and Computing. IEEE Computer Society Press, Los Alamitos, CA, 1-18.
- ^ http://manual.lustre.org/manual/LustreManual16_HTML/index.html, accessed 2008/12/17
- ^ Intel 6400/6402 Advanced Memory Buffer datasheet, October 2006
[[Category:Distributed computing]] [[Category:Parallel computing|*]] [[Category:Concurrent computing]] [[Category:Supercomputers]]