The following article is Open access

Understanding failures in petascale computers

Bianca Schroeder and Garth A Gibson

Published under licence by IOP Publishing Ltd
Citation: Bianca Schroeder and Garth A Gibson 2007 J. Phys.: Conf. Ser. 78 012022. DOI: 10.1088/1742-6596/78/1/012022

Abstract

With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and that the largest applications may surrender large fractions of the computer's resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing. The lack of a public repository for detailed failure and interruption records is particularly concerning, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.
