Acta Metallurgica Sinica (English letters) ›› 2010, Vol. 17 ›› Issue (4): 116-124. doi: 10.1016/S1005-8885(09)60497-0

• Wireless

Autonomic failure prediction based on manifold learning for large-scale distributed systems

LU Xu, WANG Hui-qiang, ZHOU Ren-jie, GE Bao-yu

  1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
  • Received: 2009-10-20 Revised: 2010-01-23 Online: 2010-08-30 Published: 2010-08-31
  • Supported by:

    This work was supported by the Hi-Tech Research and Development Program of China (2007AA01Z401) and the National Natural Science Foundation of China (90718003, 60973027).

Abstract:

This article investigates autonomic failure prediction in large-scale distributed systems, using nonlinear dimensionality reduction to extract failure features automatically. Most existing failure prediction methods focus on building prediction models or heuristic rules by discovering failure patterns, but the feature extraction step that precedes failure pattern recognition is rarely considered, owing to the increasing complexity of modern distributed systems. In this work, a novel performance-centric approach to automated failure prediction is proposed based on manifold learning (ML). The ML algorithm named supervised locally linear embedding (SLLE) is applied to perform feature extraction. To generalize the dimensionality reduction mapping, a nonlinear mapping approximation and optimization solution is also proposed. For the experiments, a file transfer test bed with fault injection is developed, which gathers multilevel performance metrics transparently. Based on runtime monitoring of these metrics, the SLLE method can automatically predict more than 50% of the central processing unit (CPU) and memory failures, and around 70% of the network failures.
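
The feature extraction step can be illustrated with a minimal sketch of supervised locally linear embedding. The abstract does not give the exact formulation, so this sketch assumes the commonly used variant in which distances between samples of different classes are inflated by `alpha * max(distance)` before neighbors are selected; the function name `slle` and the parameters `n_neighbors`, `alpha`, and `reg` are illustrative choices, not values from the article. It embeds labeled training samples only and does not cover the nonlinear mapping approximation used to generalize to new samples.

```python
import numpy as np


def slle(X, y, n_neighbors=5, n_components=2, alpha=0.5, reg=1e-3):
    """Sketch of supervised LLE (assumed variant, not the article's exact method).

    X: (n, d) array of performance metrics; y: (n,) class labels.
    Returns an (n, n_components) embedding of the training samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Supervised step: enlarge distances between samples of different classes.
    different = (y[:, None] != y[None, :]).astype(float)
    dist_sup = dist + alpha * dist.max() * different
    np.fill_diagonal(dist_sup, np.inf)  # a sample is never its own neighbor

    # Reconstruction weights: express each sample from its nearest neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist_sup[i])[:n_neighbors]
        Z = X[nbrs] - X[i]                      # neighbors centered on x_i
        G = Z @ Z.T                             # local Gram matrix
        G += reg * (np.trace(G) + 1e-12) * np.eye(n_neighbors)  # regularize
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()                # weights sum to one

    # Embedding: bottom eigenvectors of (I - W)^T (I - W), skipping the first.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, eigvecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    return eigvecs[:, 1:n_components + 1]


if __name__ == "__main__":
    # Synthetic stand-in for monitored metrics: 20 "normal" and 20 "failure" samples.
    rng = np.random.default_rng(0)
    X_demo = np.vstack([rng.normal(0.0, 1.0, (20, 6)), rng.normal(2.0, 1.0, (20, 6))])
    y_demo = np.repeat([0, 1], 20)
    print(slle(X_demo, y_demo).shape)
```

With `alpha=0` the sketch reduces to ordinary, unsupervised LLE; larger values restrict each sample's neighborhood increasingly to its own class.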

Key words:

failure prediction, manifold learning, locally linear embedding, autonomic computing