...
Tatiana Martsinkevich: On distributed recovery for SPMD deterministic HPC applications
Hybrid fault tolerance protocols for HPC applications are the most likely protocols to find their niche in the upcoming exascale era. HydEE, a hierarchical rollback-recovery protocol for message-passing applications, is an example of such a protocol. The current scheme in HydEE requires a dedicated process to orchestrate the recovery. This centralized solution can slow down recovery, which could otherwise be fast because all the messages that a process needs in order to recover to the point of failure are logged and can be accessed immediately. If each process could control its own recovery, we could fully benefit from message logging. This talk is about finding an approach to implement such distributed recovery.
Laxmikant (Sanjay) Kale: The recovery and rise of checkpoint/restart (http://charm.cs.uiuc.edu)
...