Scalable Techniques for Fault Tolerant High Performance Computing

Scalable Techniques for Fault Tolerant High Performance Computing was released in 2006 and has 174 pages.
Total Pages : 174
Release : 2006
OCLC : 70064571


Book excerpt: As the number of processors in today's parallel systems continues to grow, the mean time to failure of these systems is becoming significantly shorter than the execution time of many parallel applications. It is increasingly important for large parallel applications to be able to continue executing despite the failure of some components in the system. Today's long-running scientific applications typically tolerate failures through checkpoint/restart, in which all process states of an application are periodically saved to stable storage. However, as the number of processors in a system increases, the amount of data that must be saved to stable storage increases linearly. The classical checkpoint/restart approach therefore has a potential scalability problem on large parallel systems.

In this research, we explore scalable techniques to tolerate a small number of process failures in large-scale parallel computing. The goal is to develop scalable fault tolerance techniques that help make future high performance computing applications self-adaptive and fault survivable. The fundamental challenge is scalability. To approach this challenge, this research (1) extended existing diskless checkpointing techniques so that they scale better in large-scale high performance computing systems; (2) designed checkpoint-free fault tolerance techniques for linear algebra computations to survive process failures without checkpoint or rollback recovery; and (3) developed coding approaches and novel erasure correcting codes to help applications survive multiple simultaneous process failures. The fault tolerance schemes we introduce in this dissertation are scalable in the sense that the overhead to tolerate the failure of a fixed number of processes does not increase as the total number of processes in a parallel system increases.

Two prototype examples have been developed to demonstrate the effectiveness of our techniques. In the first example, we developed a fault survivable conjugate gradient solver that is able to survive multiple simultaneous process failures with negligible overhead. In the second example, we incorporated our checkpoint-free fault tolerance technique into the ScaLAPACK/PBLAS matrix-matrix multiplication code to evaluate the overhead, survivability, and scalability. Theoretical analysis indicates that, to survive a fixed number of process failures, the fault tolerance overhead (without recovery) for matrix-matrix multiplication decreases to zero as the total number of processes (assuming a fixed amount of data per process) increases to infinity. Experimental results demonstrate that the checkpoint-free fault tolerance technique introduces surprisingly low overhead even when the total number of processes used in the application is small.
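
The checksum idea behind checkpoint-free recovery for matrix-matrix multiplication can be illustrated with a small sketch. This is not the dissertation's ScaLAPACK/PBLAS code: it is a minimal single-failure illustration in plain Python, assuming that each row of A stands in for the data held by one process. The helper names (matmul, add_checksum_row, recover_row) are hypothetical. A row of column sums is appended to A, so the product carries a matching checksum row from which one lost row can be rebuilt without any checkpoint.

```python
# Illustrative sketch only: one row of A stands in for one process's data,
# and a single extra "checksum process" holds the column-wise sum of all rows.

def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_checksum_row(a):
    """Return a copy of `a` with an extra row holding its column sums."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def recover_row(c_full, lost):
    """Rebuild row `lost` of the product from the checksum row and survivors.

    `c_full` is the product of the checksum-extended A with B, so its last
    row equals the sum of all the other rows.
    """
    checksum = c_full[-1]
    survivors = [row for i, row in enumerate(c_full[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

# The checksum relationship survives the multiplication: (sum of rows of A) * B
# equals the sum of the rows of A * B.
C_full = matmul(add_checksum_row(A), B)
expected = matmul(A, B)

# Simulate losing the data of "process" 1 after the multiplication.
lost = 1
damaged = [row[:] for row in C_full]
damaged[lost] = [None] * len(B[0])

recovered = recover_row(damaged, lost)
print(recovered == expected[lost])
```

Integers are used so that the subtraction is exact; with floating-point data the recovered values are subject to rounding error. A single sum checksum can rebuild only one lost row. Surviving several simultaneous failures needs additional, independent checksums (for example weighted ones), which this sketch does not attempt.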


Scalable Techniques for Fault Tolerant High Performance Computing Related Books

Fault-Tolerance Techniques for High-Performance Computing
Language: en
Pages: 325
Authors: Thomas Herault
Categories: Computers
Type: BOOK - Published: 2015-07-01 - Publisher: Springer

This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introducti
A Scalable Unified Fault Tolerance for High Performance Computing Environments
Language: en
Pages: 132
Authors: Kulathep Charoenpornwattana
Categories: Electronic data processing
Type: BOOK - Published: 2008 - Publisher:

Fault-Tolerant Parallel and Distributed Systems
Language: en
Pages: 396
Authors: Dimiter R. Avresky
Categories: Computers
Type: BOOK - Published: 2012-12-06 - Publisher: Springer Science & Business Media

The most important use of computing in the future will be in the context of the global "digital convergence" where everything becomes digital and every thing is
Methods, Models and Tools for Fault Tolerance
Language: en
Pages: 350
Authors: Michael Butler
Categories: Computers
Type: BOOK - Published: 2009-03-03 - Publisher: Springer

The growing complexity of modern software systems increases the difficulty of ensuring the overall dependability of software-intensive systems. Complexity of envi