...
Main Topics | Schedule | Speaker | Affiliation | Type of presentation | Title (tentative) | Download |
|
Dinner Before the Workshop | 7:30 PM | Only people registered for the dinner |
|
Workshop Day 1 | Wednesday June 12th |
|
| TITLES ARE TEMPORARY (except if in bold font) |
|
Registration | 08:00 |
|
Welcome and Introduction | 08:30 | Marc Snir + Franck Cappello | INRIA&UIUC&ANL | Background | Welcome, Workshop objectives and organization |
|
| 08:45 | Bill Kramer | UIUC | Background | NCSA updates and vision of the collaboration |
|
| 09:00 | Marc Snir | ANL | Background | ANL updates and vision of the collaboration |
|
| 09:15 | Frederic Desprez | Inria | Background | INRIA updates and vision of the collaboration |
|
Big systems | 09:30 | Bill Kramer | UIUC | Background | Update on Blue Waters |
|
| 10:00 | Break |
|
| 10:30 | Mitsuhisa Sato | U. Tsukuba & AICS | Background | AICS and the K computer |
|
CANCELED | 11:00 | Paul Gibbon | Juelich | Background | Meeting the Exascale Challenge at the Juelich Supercomputing Centre. |
|
Resilience&fault tolerance and simulation | 11:00 | Marc Snir | ANL&UIUC | Report | ICIS report on Resilience |
|
| 11:30 | Vincent Baudoui | Total & ANL | Joint-Results | Round-off error and silent soft error propagation in exascale applications |
|
| 12:00 | Lunch |
|
Numerical Algorithms | 13:30 | Bill Gropp | UIUC | Background | Topics for Collaboration in Numerical Libraries |
|
| 14:00 | Paul Hovland | ANL | Background | Argonne strategic plan in applied math |
|
| 14:30 | Frederic Nataf | INRIA&P6 | Background | Toward black-box adaptive domain decomposition methods |
|
| 15:00 | Luke Olson | UIUC | Background | Opportunities in developing a more robust and scalable multigrid solver |
|
| 15:30 | Break |
|
| 16:00 | Marc Baboulin | INRIA | Background | Using condition numbers to assess numerical quality in high-performance computing applications |
|
Resilience&fault tolerance and simulation Chair: Franck Cappello | 16:30 | Bogdan Nicolae | IBM | Joint Result | AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing |
|
| 17:00 | Martin Quinson | INRIA | Result | Improving Simulations of MPI Applications Using A Hybrid Network Model with Topology and Contention Support |
|
| 17:30 | Adjourn |
|
| 19:00 | Dinner |
|
Workshop Day 2 | Thursday June 13th |
|
Programming Models (cont.) | 08:30 | Jean-François Mehaut | INRIA | Result | Progress in the European FP7 Mont-Blanc 1 project and objectives of its follow-up: Mont-Blanc 2 |
|
| 09:00 | Rajeev Thakur | ANL | Background | TBA |
|
| 09:30 | Andra Ecaterina Hugo | INRIA | Results | TBA |
|
| 10:00 | Celso Mendes | UIUC | Background | Dynamic Load Balancing for Weather Models via AMPI |
|
| 10:30 | Break |
|
Big Data, I/O, Visualization | 11:00 | Dries Kimpe | ANL | Results | TBA |
|
| 11:30 | Gilles Fedak | INRIA | Result | Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures |
|
| 12:00 | Matthieu Dorier | INRIA | Joint Result | Data Analysis of Ensemble Simulations: an In Situ Approach using Damaris |
|
| 12:30 | Ian Foster | ANL | Background | TBA |
|
| 13:00 | Lunch |
|
Mini Workshop1 |
|
Resilience | 14:00 | Ana Gainaru | UIUC | Results | Challenges in predicting failures on the Blue Waters system. |
|
| 14:30 | Xiang Ni | UIUC | Results | ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection. |
|
| 15:00 | Tatiana | INRIA & ANL | Result | TBA |
|
| 15:30 | Mohamed Slim Bouguerra | INRIA & ANL | Result | Investigating the probability distribution of false negative failure alerts in HPC systems |
|
| 16:00 | Break |
|
| 16:30 | Amina Guermouche | UVSQ | Result | Multi-criteria Checkpointing Strategies: Response-time versus Resource Utilization |
|
| 17:00 | Thomas Ropars | EPFL | Result | Towards efficient replication of HPC applications to deal with crash failures |
|
| 17:30 | Mehdi Diouri | INRIA | Result | ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications |
|
| 18:00 | Adjourn |
|
Mini Workshop2 |
|
Numerical Algorithms and Libraries | 14:00 | Laura Grigori | INRIA | Result | TBA |
|
| 14:30 | Stefan Wild | ANL | Result | TBA |
|
| 15:00 | Frederic Hecht | INRIA/P6 | Result | TBA |
|
| 15:30 | Jed Brown | ANL | Result | Vectorization, communication aggregation, and reuse in stochastic and temporal dimensions |
|
| 16:00 | Break |
|
| 16:30 | Yushan Wang | INRIA P11 | Result | TBA |
|
| 17:00 | Jean Utke | ANL | Result | Designing and implementing a tool-independent, adjoinable MPI wrapper library |
|
| 17:30 | Laurent Hascoet | INRIA | Result | The adjoint of MPI one-sided communications |
|
| 18:00 | Adjourn |
|
| 19:00 | Banquet |
|
| Lyon |
|
Workshop Day 3 | Friday June 14th |
|
Mini Workshop1 (cont.) |
|
Resilience | 08:30 | Di Sheng | INRIA | Result | TBA |
|
| 09:00 | Guillaume Aupy | INRIA | Result | TBA |
|
| 09:30 | Discussion |
|
| 10:00 | Break |
|
Mini Workshop3 | 10:30 | Guillaume Mercier | INRIA | Result | TBA |
|
Programming and Scheduling | 11:00 | Vincent Lanore | INRIA | Result | TBA |
|
| 11:30 | Anne Benoit | INRIA | Result | Energy-efficient scheduling |
|
| 12:00 | François Tessier | INRIA | Result | TBA |
|
| 12:30 | Discussions |
|
| 13:00 | Closing and Lunch |
|
Mini Workshop2 (cont.) |
|
Numerical Algorithms and Libraries | 08:30 | François Pellegrini | INRIA | Result | Shared memory parallel algorithms in Scotch 6 |
|
| 09:00 | Luc Giraud | INRIA | Result | TBA |
|
| 09:30 | Discussions |
|
| 10:00 | Break |
|
Mini Workshop4 | 10:30 | Kate Keahey | ANL | Result | TBA |
|
Clouds | 11:00 | Gabriel Antoniu | INRIA | Result | TBA |
|
| 11:30 | Christian Perez | INRIA | Result | TBA |
|
| 12:00 | Eddy Caron | INRIA | Result | TBA |
|
| 12:30 | Discussions |
|
| 13:00 | Closing and Lunch |
|
...
As the size of supercomputers increases, so does the probability that a single component fails within a given time frame. With the growing operating cost of extreme-scale supercomputers like Blue Waters, predicting failures to prevent the loss of computation hours becomes cumbersome and presents challenges not encountered on smaller systems. The talk will focus on online failure prediction and on an analysis of the Blue Waters system. We show to what extent online failure prediction is possible at petascale and what challenges remain in achieving an effective fault prevention mechanism for Blue Waters.
Mohamed Slim Bouguerra
Investigating the probability distribution of false negative failure alerts in HPC systems
As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Several key results have demonstrated that recent advances in event log analysis can provide precise failure prediction. State-of-the-art failure prediction achieves a ratio of correctly identified failures to all predicted failures of over 90% and is able to discover around 50% of all failures in a system. However, a large part of the failures is not predicted; these are considered false negative alerts. Developing efficient fault tolerance strategies therefore requires a good understanding of the properties and characteristics of failure prediction. To study the properties and characteristics of false negative alerts, we conduct in this paper a statistical analysis to discover the probability distribution of such alerts and their impact on fault tolerance techniques. To this end, we study failure logs from different HPC production systems. We show that: (i) surprisingly, the false negative distribution has the same nature as the failure distribution; (ii) after adding failure prediction, we were able to infer statistical models that describe the inter-arrival time between false negative alerts, so current fault tolerance techniques can be applied on these systems; (iii) the current failure traces contain a high amount of correlation between failure inter-arrival times, which can be used to improve the failure prediction mechanism. Another important result is that checkpoint intervals can still be computed from existing first-order formulas when the failure distribution is purely random.
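The quantities in this abstract can be illustrated with a small sketch. The abstract does not say which first-order formula it refers to; the code below assumes Young's approximation, interval = sqrt(2 * C * M), where C is the checkpoint cost and M the mean time between failures. The function names, the 10-minute checkpoint cost, and the 24-hour MTBF are hypothetical values chosen for illustration only, and the rescaling of the MTBF by the recall is a simple rate argument, not a result taken from the talk.

```python
import math


def precision(true_pos: int, false_pos: int) -> float:
    """Correctly predicted failures divided by all predictions issued."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Correctly predicted failures divided by all failures that occurred."""
    return true_pos / (true_pos + false_neg)


def unpredicted_mtbf(mtbf_s: float, recall_value: float) -> float:
    """Mean time between unpredicted failures (false negatives).

    Assumption: if failures arrive at rate 1/M and a fraction r of them
    is predicted, the unpredicted ones arrive at rate (1 - r)/M.
    """
    if not 0.0 <= recall_value < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return mtbf_s / (1.0 - recall_value)


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval, sqrt(2 * C * M).

    Assumed here as the 'first-order formula'; the source does not name it.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical counts matching the figures quoted in the abstract.
    p = precision(true_pos=90, false_pos=10)   # over 90% in the abstract
    r = recall(true_pos=50, false_neg=50)      # around 50% in the abstract
    mtbf = 24 * 3600.0                         # hypothetical 24 h system MTBF
    cost = 600.0                               # hypothetical 10 min checkpoint
    print(f"precision={p:.2f} recall={r:.2f}")
    print(f"interval without prediction: {young_interval(cost, mtbf):.0f} s")
    print(f"interval for unpredicted failures: "
          f"{young_interval(cost, unpredicted_mtbf(mtbf, r)):.0f} s")
```

Under these assumptions, a predictor that catches half of the failures doubles the mean time between unpredicted failures, so the checkpoint interval grows by a factor of sqrt(2).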