Ghiasvand, Siavash and Ciorba, Florina M. and Tschüter, Ronny and Nagel, Wolfgang E. (2016) Lessons learned from spatial and temporal correlation of node failures in high performance computers. In: Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2016). p. 5.
PDF - Accepted Version (1485Kb)
Official URL: http://edoc.unibas.ch/40836/
Abstract
In this paper, we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month period. We outline possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. The results of this study are aimed at helping system administrators minimize (or even prevent) the destructive effects of correlated node failures.
Faculties and Departments: | 05 Faculty of Science > Departement Mathematik und Informatik > Informatik > High Performance Computing (Ciorba) |
---|---|
UniBasel Contributors: | Ciorba, Florina M. |
Item Type: | Conference or Workshop Item, refereed |
Conference or workshop item Subtype: | Conference Paper |
Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
ISBN: | 9781467387774 |
Series Name: | Proceedings of the IEEE |
Note: | Publication type according to Uni Basel Research Database: Conference paper -- The final publication is available at Institute of Electrical and Electronics Engineers (IEEE) |
Language: | English |
Last Modified: | 29 Jan 2018 04:18 |
Deposited On: | 20 Sep 2017 11:55 |