Autonomic failure prediction based on manifold learning for large-scale distributed systems

doi:10.1016/S1005-8885(09)60497-0

The Journal of China Universities of Posts and Telecommunications (English edition), 2010, Vol. 17, Issue 4: 116-124. doi: 10.1016/S1005-8885(09)60497-0

LU Xu, WANG Hui-qiang, ZHOU Ren-jie, GE Bao-yu

College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

Received: 2009-10-20 Revised: 2010-01-23 Online: 2010-08-30 Published: 2010-08-31
Corresponding author: LU Xu, E-mail: luxu_hrbeu@yahoo.cn; luxu@hrbeu.edu.cn
Supported by:
This work was supported by the Hi-Tech Research and Development Program of China (2007AA01Z401) and the National Natural Science Foundation of China (90718003, 60973027).

Abstract:

This article investigates autonomic failure prediction in large-scale distributed systems, using nonlinear dimensionality reduction to extract failure features automatically. Most existing failure prediction methods focus on building prediction models or heuristic rules by discovering failure patterns, but the feature extraction that precedes failure pattern recognition is rarely considered, owing to the increasing complexity of modern distributed systems. In this work, a novel performance-centric approach to automating failure prediction is proposed based on manifold learning (ML). The ML algorithm named supervised locally linear embedding (SLLE) is applied to perform feature extraction. To generalize the dimensionality reduction mapping, a nonlinear mapping approximation and optimization solution is also proposed. In the experiments, a file transfer test bed with fault injection is developed that gathers multilevel performance metrics transparently. Based on runtime monitoring of these metrics, the SLLE method can automatically predict more than 50% of the central processing unit (CPU) and memory failures, and around 70% of the network failures.

Key words:

failure prediction, manifold learning, locally linear embedding, autonomic computing
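
The abstract names supervised locally linear embedding (SLLE) as the feature-extraction step but does not spell it out. As a rough illustration only, the sketch below shows the one step that commonly distinguishes SLLE from plain LLE in the formulation of De Ridder and Duin: distances between samples of different classes are enlarged by alpha * max(distance) before neighbours are chosen. It is an assumption that the article uses this formulation; the function names, the sample values, and the two-metric layout are hypothetical, and the later LLE steps (reconstruction weights and the eigen-decomposition) are omitted.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length numeric vectors."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def supervised_distances(dist, labels, alpha):
    """Add alpha * max(dist) to every pair whose class labels differ.

    alpha = 0 leaves the distances unchanged (ordinary LLE);
    alpha = 1 pushes other-class samples beyond every same-class sample.
    """
    d_max = max(max(row) for row in dist)
    n = len(dist)
    return [
        [dist[i][j] + (alpha * d_max if labels[i] != labels[j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def nearest_neighbors(dist, k):
    """Indices of the k nearest neighbours of each point, excluding itself."""
    n = len(dist)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]


# Hypothetical samples: (cpu_utilisation, memory_utilisation) with a class label.
# The numbers are invented for illustration and are not taken from the article.
samples = [(0.20, 0.30), (0.25, 0.35), (0.50, 0.55), (0.52, 0.58), (0.90, 0.95)]
labels = ["normal", "normal", "normal", "failure", "failure"]

plain = pairwise_distances(samples)
supervised = supervised_distances(plain, labels, alpha=1.0)

print("unsupervised neighbours:", nearest_neighbors(plain, k=1))
print("supervised neighbours:  ", nearest_neighbors(supervised, k=1))
```

In this toy data, sample 2 (normal) and sample 3 (failure) lie very close together, so plain LLE would pair them as neighbours; with alpha set to 1.0 each of them is matched with a sample of its own class instead, which is the effect the supervised variant relies on to separate failure states in the embedding.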

LU Xu, WANG Hui-qiang, ZHOU Ren-jie, GE Bao-yu. Autonomic failure prediction based on manifold learning for large-scale distributed systems[J]. The Journal of China Universities of Posts and Telecommunications, 2010, 17(4): 116-124.

