Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Tatiana Martsinkevich: On distributed recovery for SPMD deterministic HPC applications

Hybrid fault tolerance protocols for HPC applications are the most likely protocols to find their niche in the umcoming exascale era. HydEE, a hierarchical rollback-recovery protocol for message passing applications, is an example of such protocol. Current scheme in HydEE requires a dedicated process to orchestrate the recovery. This centralized solution can slow down the recovery which otherwise could be sped up by the fact that all the messages that process needs to recover to the point of failure are logged and could be accessed immediately. If we find a way for each process to control its recovery by itself, we could fully benefit from message logging. This talk is about finding an approach to implement such distributed recovery.

Laxmikant (Sanjay) Kale: The recovery  and rise of checkpoint/restart (http://charm.cs.uiuc.edu)

...