...
Main Topics | Schedule | Speaker | Affiliation | Type of presentation | Title (tentative) | Download |
|
Dinner Before the Workshop | 7:30 PM | Only people registered for the dinner |
|
Workshop Day 1 | Wednesday June 12th |
|
| TITLES ARE TEMPORARY (except if in bold font) |
|
Registration | 08:00 |
|
Welcome and Introduction Amphitheatre | 08:30 | Marc Snir + Franck Cappello | INRIA&UIUC&ANL | Background | Welcome, Workshop objectives and organization |
|
| 08:45 | Bill Kramer | UIUC | Background | NCSA updates and vision of the collaboration |
|
| 09:00 | Marc Snir | ANL | Background | ANL updates and vision of the collaboration |
|
| 09:15 | Frederic Desprez | Inria | Background | INRIA updates and vision of the collaboration | |
Big systems | 09:30 | Bill Kramer | UIUC | Background | Update on Blue Waters |
|
| 10:00 | Break |
|
| 10:30 | Mitsuhisa Sato | U. Tsukuba & AICS | Background | AICS and the K computer | |
CANCELED | 11:00 | Paul Gibbon | Juelich | Background | Meeting the Exascale Challenge at the Juelich Supercomputing Centre. |
|
Resilience&fault tolerance and simulation | 11:00 | Marc Snir | ANL&UIUC | Report | ICIS report on Resilience |
|
| 11:30 | Vincent Baudoui | Total & ANL | Joint Result | Round-off error and silent soft error propagation in exascale applications |
| 12:00 | Lunch |
|
Numerical Algorithms | 13:30 | Bill Gropp | UIUC | Background | Topics for Collaboration in Numerical Libraries |
|
| 14:00 | Paul Hovland | ANL | Background | Argonne strategic plan in applied math |
| |
| 14:30 | Marc Baboulin | INRIA | Background | Using condition numbers to assess numerical quality in high-performance computing applications |
|
| 15:00 | Luke Olson | UIUC | Background | Opportunities in developing a more robust and scalable multigrid solver |
|
| 15:30 | Break |
| 16:00 | Frederic Nataf | INRIA&P6 | Background | Toward black-box adaptive domain decomposition methods |
|
Resilience&fault tolerance and simulation Chair: Franck Cappello | 16:30 | Bogdan Nicolae | IBM | Joint Result | AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing | |
| 17:00 | Martin Quinson | INRIA | Result | Improving Simulations of MPI Applications Using A Hybrid Network Model with Topology and Contention Support |
| 17:30 | Adjourn |
|
| 18:45 | Bus for Dinner |
|
Workshop Day 2 | Thursday June 13th |
|
Programming Models (cont.) | 08:30 | Jean-François Mehaut | INRIA | Result | Progress in the European FP7 Mont-Blanc 1 project and objectives of its follow-up: Mont-Blanc 2 |
|
| 09:00 | Rajeev Thakur | ANL | Background | Update on MPI and OS/R Activities at Argonne |
|
| 09:30 | Andra Ecaterina Hugo | INRIA | Results | Composing multiple StarPU applications over heterogeneous machines: a supervised approach |
|
| 10:00 | Celso Mendes | UIUC | Background | Dynamic Load Balancing for Weather Models via AMPI |
|
| 10:30 | Break |
|
Big Data, I/O, Visualization | 11:00 | Dries Kimpe | ANL | Results | Triton: Exascale Storage |
|
| 11:30 | Gilles Fedak | INRIA | Result | Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures |
|
| 12:00 | Matthieu Dorier | INRIA | Joint Result | Data Analysis of Ensemble Simulations: an In Situ Approach using Damaris |
|
| 12:30 | Ian Foster | ANL | Background | TBA |
|
| 13:00 | Lunch |
|
Mini Workshop1 |
|
Resilience | 14:00 | Ana Gainaru | UIUC | Results | Challenges in predicting failures on the Blue Waters system. |
|
| 14:30 | Xiang Ni | UIUC | Results | ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection. |
|
| 15:00 | Tatiana | INRIA & ANL | Result | TBA |
|
| 15:30 | Mohamed Slim Bouguerra | INRIA & ANL | Result | Investigating the probability distribution of false negative failure alerts in HPC systems |
|
| 16:00 | Break |
|
| 16:30 | Amina Guermouche | UVSQ | Result | Multi-criteria Checkpointing Strategies: Response-time versus Resource Utilization |
|
| 17:00 | Thomas Ropars | EPFL | Result | Towards efficient replication of HPC applications to deal with crash failures |
|
| 17:30 | Mehdi Diouri | INRIA | Result | ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications |
|
| 18:00 | Adjourn |
|
Mini Workshop2 |
|
Numerical Algorithms and Libraries | 14:00 | Jean Utke | ANL | Result | Designing and implementing a tool-independent, adjoinable MPI wrapper library |
|
| 14:30 | Laurent Hascoet | INRIA | Result | The adjoint of MPI one-sided communications |
|
| 15:00 | Stefan Wild | ANL | Result | Loud computations? Noise in iterative solvers |
|
| 15:30 | Jed Brown | ANL | Result | Vectorization, communication aggregation, and reuse in stochastic and temporal dimensions |
|
| 16:00 | Break |
|
| 16:30 | Yushan Wang | INRIA P11 | Result | TBA |
|
| 17:00 | Frederic Hecht | INRIA/P6 | Result | TBA |
|
| 18:00 | Adjourn |
|
| 18:45 | Bus for dinner |
|
| Lyon |
|
Workshop Day 3 | Friday June 14th |
|
Mini Workshop1 (cont.) |
|
Resilience | 08:30 | Di Sheng | INRIA | Result | TBA |
|
| 09:00 | Guillaume Aupy | INRIA | Result | On the Combination of Silent Error Detection and Checkpointing |
|
| 09:30 | Discussion |
|
| 10:00 | Break |
|
Mini Workshop3 | 10:30 | Guillaume Mercier | INRIA | Result | TBA |
|
Programming and Scheduling | 11:00 | Vincent Lanore | INRIA | Result | Static 2D FFT adaptation through a component model based on Charm++ |
|
| 11:30 | Anne Benoit | INRIA | Result | Energy-efficient scheduling |
|
| 12:00 | François Tessier | INRIA | Result | TBA |
|
| 12:30 | Discussions |
|
| 13:00 | Closing and Lunch |
|
Mini Workshop2 (cont.) |
|
Numerical Algorithms and Libraries | 08:30 | François Pellegrini | INRIA | Result | Shared memory parallel algorithms in Scotch 6 |
|
| 09:00 | Luc Giraud | INRIA | Result | TBA |
|
| 09:30 | Discussions |
|
| 10:00 | Break |
|
Mini Workshop4 | 10:30 | Kate Keahey | ANL | Result | TBA |
|
Clouds | 11:00 | Gabriel Antoniu | INRIA | Result | TBA |
|
| 11:30 | Christian Perez | INRIA | Result | TBA |
|
| 12:00 | Eddy Caron | INRIA | Result | TBA |
|
| 12:30 | Discussions |
|
| 13:00 | Closing and Lunch |
|
...
In this talk, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, so a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal checkpointing period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
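As a rough illustration of what "computing the optimal period to minimize the waste" means, the sketch below uses the classical first-order approximation for fail-stop failures (Young's formula), not the silent-error models of the talk, whose formulas are not given in this abstract. The function names and the numeric values for checkpoint cost and MTBF are hypothetical, chosen only for the example.

```python
import math


def waste(period, checkpoint_cost, mtbf):
    """First-order waste for a fail-stop model (an assumption, not the talk's model).

    checkpoint_cost / period : fraction of time spent writing checkpoints
    period / (2 * mtbf)      : expected fraction of time lost to re-execution,
                               assuming a failure strikes mid-period on average
    Downtime and recovery costs are ignored in this sketch.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def optimal_period(checkpoint_cost, mtbf):
    """Period minimizing the first-order waste above: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    C = 600.0      # hypothetical checkpoint cost, in seconds
    MU = 86400.0   # hypothetical platform MTBF, in seconds
    T = optimal_period(C, MU)
    print(f"optimal period: {T:.0f} s, waste: {waste(T, C, MU):.3f}")
```

Setting the derivative of the waste to zero gives the period returned by `optimal_period`; any shorter period spends more time on checkpoints, and any longer one loses more work per failure.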
Dries Kimpe
Triton: Exascale Storage
In this talk, I will present a status update of our work on Triton, a newly designed exascale-era storage system. In addition to Triton-specific information, the presentation will also include a brief discussion of the tools and techniques that help us design and implement Triton. One such tool is discrete event simulation, which we use to quickly evaluate algorithms at scale before implementing them in Triton.