Resiliency is, and will remain, a critical factor in determining scientific productivity on current supercomputers and beyond. Applications that are oblivious to transient soft and hard errors, and incapable of handling them, could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring.