Main Topics	Schedule	Speaker	Affiliation	Type of presentation	Title (tentative)	Download

Dinner Before the Workshop	7:30 PM	Only people registered for the dinner			Valpré hotel

Workshop Day 1	Wednesday June 12th
					TITLES ARE TEMPORARY (except if in bold font)
Registration	08:00
Welcome and Introduction Amphitheatre	08:30	Marc Snir + Franck Cappello	INRIA&UIUC&ANL	Background	Welcome, Workshop objectives and organization
	08:45	Bill Kramer	UIUC	Background	NCSA updates and vision of the collaboration
	09:00	Marc Snir	ANL	Background	ANL updates and vision of the collaboration
	09:15	Frederic Desprez	Inria	Background	INRIA updates and vision of the collaboration
Big systems Chair: Christian Perez	09:30	Bill Kramer	UIUC	Background	Update on Blue Waters
	10:00	Break
	10:30	Mitsuhisa Sato	U. Tsukuba & AICS	Background	AICS and the K computer
CANCELED	11:00	Paul Gibbon	Juelich	Background	Meeting the Exascale Challenge at the Juelich Supercomputing Centre.
Resilience&fault tolerance and simulation Chair: Franck Cappello	11:00	Marc Snir	ANL&UIUC	Report	ICIS report on Resilience
	11:30	Vincent Baudoui	Total & ANL	Joint-Results	Round-off error and silent soft error propagation in exascale applications
	12:00	Lunch
Numerical Algorithms Chair: Frederic Desprez	13:30	Bill Gropp	UIUC	Background	Topics for Collaboration in Numerical Libraries
	14:00	Paul Hovland	ANL	Background	Argonne strategic plan in applied math
	14:30	Marc Baboulin	INRIA	Background	Using condition numbers to assess numerical quality in high-performance computing applications
	15:00	Luke Olson	UIUC	Background	Opportunities in developing a more robust and scalable multigrid solver
	15:30	Break
	16:00	Frederic Nataf	INRIA&P6	Background	Toward black-box adaptive domain decomposition methods
Resilience&fault tolerance and simulation Chair: Franck Cappello	16:30	Bogdan Nicolae	IBM	Joint Result	AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing
	17:00	Martin Quinson	INRIA	Result	Improving Simulations of MPI Applications Using A Hybrid Network Model with Topology and Contention Support
	17:30	Adjourn
	18:45	Bus for Dinner

Workshop Day 2	Thursday June 13th

Programming Models (cont.) Chair: Frederic Desprez	08:30	Jean-François Mehaut	INRIA	Result	Progress in the European FP7 Mont-Blanc 1 project and objectives of its follow-up: Mont-Blanc 2
	09:00	Rajeev Thakur	ANL	Background	Update on MPI and OS/R Activities at Argonne
	09:30	Andra Ecaterina Hugo	INRIA	Results TBA	Composing multiple StarPU applications over heterogeneous machines: a supervised approach
	10:00	Celso Mendes	UIUC	Background	Dynamic Load Balancing for Weather Models via AMPI
	10:30	Break
Big Data, I/O, Visualization Chair: Kate Keahey	11:00	Dries Kimpe	ANL	Results	TBA
	11:30	Gilles Fedak	INRIA	Result	Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures
	12:00	Matthieu Dorier	INRIA	Joint Result	Data Analysis of Ensemble Simulations: an In Situ Approach using Damaris
	12:30	Ian Foster	ANL	Background	TBA
	13:00	Lunch

Mini Workshop1
Resilience Chair: Marc Snir	14:00	Ana Gainaru	UIUC	Results	Challenges in predicting failures on the Blue Waters system.
	14:30	Xiang Ni	UIUC	Results	ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection.
	15:00	Tatiana	INRIA & ANL	Result	TBA
	15:30	Mohamed Slim Bouguerra	INRIA & ANL	Result	Investigating the probability distribution of false negative failure alerts in HPC systems
	16:00	Break
	16:30	Amina Guermouche	UVSQ	Result	Multi-criteria Checkpointing Strategies: Response-time versus Resource Utilization
	17:00	Thomas Ropars	EPFL	Result	Towards efficient replication of HPC applications to deal with crash failures
	17:30	Mehdi Diouri	INRIA	Result	ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications
	18:00	Adjourn

Mini Workshop2
Numerical Algorithms and Libraries Chair: Bill Gropp	14:00	Jean Utke	ANL	Result	Designing and implementing a tool-independent, adjoinable MPI wrapper library
	14:30	Laurent Hascoet	INRIA	Result	TBA
	15:00	Stefan Wild	ANL	Result	The adjoint of MPI one-sided communications
	15:30	Jed Brown	ANL	Result	Vectorization, communication aggregation, and reuse in stochastic and temporal dimensions
	16:00	Break
	16:30	Yushan Wang	INRIA P11	Result	TBA
	17:00	Frederic Hecht	INRIA/P6	Result	TBA
	18:00	Adjourn

	18:45	Bus for dinner			Lyon

Workshop Day 3	Friday June 14th

Mini Workshop1 (cont.)
Resilience Chair: Franck Cappello	08:30	Di Sheng	INRIA	Result	TBA
	09:00	Guillaume Aupy	INRIA	Result	TBA
	09:30	Discussion
	10:00	Break
Mini Workshop3	10:30	Guillaume Mercier	INRIA	Result	TBA
Programming and Scheduling Chair: Rajeev Thakur	11:00	Vincent Lanore	INRIA	Result	TBA
	11:30	Anne Benoit	INRIA	Result	Energy-efficient scheduling
	12:00	François Tessier	INRIA	Result	TBA
	12:30	Discussions
	13:00	Closing and Lunch

Mini Workshop2 (cont.)
Numerical Algorithms and Libraries Chair: Paul Hovland	08:30	François Pellegrini	INRIA	Result	Shared memory parallel algorithms in Scotch 6
	09:00	Luc Giraud	INRIA	Result	TBA
	09:30	Discussions
	10:00	Break
Mini Workshop4	10:30	Kate Keahey	ANL	Result	TBA
Clouds Chair: Frederic Desprez	11:00	Gabriel Antoniu	INRIA	Result	TBA
	11:30	Christian Perez	INRIA	Result	TBA
	12:00	Eddy Caron	INRIA	Result	TBA
	12:30	Discussions
	13:00	Closing and Lunch

...

Jed Brown

Vectorization, communication aggregation, and reuse in stochastic and temporal dimensions

Transformative computing in science and engineering involves problems posed in more than just the spatial domain: temporal, stochastic, and parameter spaces also play a role. Current methods for solving such problems are predominantly based on the concept that the fundamental building block is the solution of a deterministic PDE model, or perhaps one time step of a transient model. This is practical: it permits comfortable partitioning of mathematical analysis and relatively unintrusive software interfaces, but it eagerly chooses which dimensions are treated sequentially, which are distributed in parallel, etc. These imposed choices leave developers of the PDE models banging their heads against the familiar challenges of efficiently utilizing increasingly precious memory bandwidth, hiding and reducing synchronization costs, and obtaining vectorization. Meanwhile, the stochastic and temporal dimensions provide structure that is ideally suited to extreme-scale architectures, if only they could be promoted to first-class citizens, alongside the spatial dimensions, in algorithmic analysis and in software. Exploiting this structure in "full-space" methods will require crosscutting development: improved convergence theory, efficient hardware-adapted algorithms, high-quality software libraries, and programming tools and run-time systems to facilitate the development of libraries and applications. In this talk, I present several examples and propose a guideline for reasoning about efficient mappings of full-space analysis onto parallel computers.

Celso Mendes

Dynamic Load Balancing for Weather Models via AMPI

Load imbalances can severely limit the scalability of a parallel application. Typically, the solution adopted to overcome this problem is to change the application code in an attempt to distribute the load more uniformly across the available processors. This solution, however, requires deep knowledge of the application, and needs to be redone as new sources of imbalance arise. In this presentation, we show how an intelligent, adaptive runtime system can help in addressing this problem. Using Adaptive MPI (AMPI), an implementation of the MPI standard based on the Charm++ runtime system, we demonstrate how to achieve a better balance without requiring major changes or much knowledge about the application. As a case study, we show an application of this approach with weather forecasting models, which can suffer from severe imbalances due to several sources, including dynamic variations in the atmosphere. Besides presenting recent results, we also point to some remaining challenges, which present opportunities for further work in this area.
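
To illustrate the kind of decision a runtime-level load balancer makes, here is a minimal sketch of a generic greedy strategy: work units are placed heaviest first, each onto the currently least-loaded processor. This is an assumption made for illustration, not the strategy used in the talk, and the names `greedy_rebalance` and `imbalance` are hypothetical and are not part of AMPI or Charm++.

```python
import heapq


def greedy_rebalance(loads, num_procs):
    """Map each work unit to a processor, heaviest unit first,
    always choosing the processor with the smallest load so far.
    Illustrative only: not the AMPI/Charm++ implementation."""
    heap = [(0.0, proc) for proc in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    # Visit units in order of decreasing measured load.
    for unit, load in sorted(enumerate(loads), key=lambda item: -item[1]):
        total, proc = heapq.heappop(heap)
        assignment[unit] = proc
        heapq.heappush(heap, (total + load, proc))
    return assignment


def imbalance(loads, assignment, num_procs):
    """Ratio of the maximum processor load to the average load
    (1.0 means perfectly balanced)."""
    totals = [0.0] * num_procs
    for unit, proc in assignment.items():
        totals[proc] += loads[unit]
    return max(totals) / (sum(totals) / num_procs)


if __name__ == "__main__":
    measured = [8, 1, 1, 1, 1, 2, 2]  # hypothetical per-unit loads
    mapping = greedy_rebalance(measured, num_procs=2)
    print(mapping, imbalance(measured, mapping, 2))
```

The point of the sketch is that the balancer only needs measured loads per work unit, not knowledge of what the application computes, which is what lets such a policy live in the runtime rather than in application code.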

Xiang Ni

ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection.

As the scale of machines increases, the HPC community has seen a steady decrease in reliability of the systems, and hence an increase in the down time. Moreover, soft errors such as bit flips do not prevent execution but generate incorrect results. Checkpoint/restart is by far the most commonly used fault tolerance method for hard errors, and its efficiency and scalability have been improved by recent research. In this talk, we will discuss a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
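
The abstract does not say how the checkpoint period is adapted, so the following is only a sketch of one plausible scheme, under two assumptions: the mean time between failures (MTBF) is tracked with an exponentially weighted average of observed gaps between failures, and the period is then set with Young's first-order approximation, period = sqrt(2 × checkpoint cost × MTBF). The class name `AdaptiveCheckpointPeriod` and its parameters are hypothetical and are not taken from ACR.

```python
import math


class AdaptiveCheckpointPeriod:
    """Recompute the checkpoint period as failures are observed.
    Illustrative sketch; not the policy implemented by ACR."""

    def __init__(self, checkpoint_cost, initial_mtbf):
        self.checkpoint_cost = checkpoint_cost  # seconds per checkpoint
        self.mtbf = initial_mtbf                # current MTBF estimate, seconds
        self.last_failure = None

    def record_failure(self, timestamp, weight=0.5):
        """Blend the newest gap between failures into the MTBF estimate."""
        if self.last_failure is not None:
            gap = timestamp - self.last_failure
            self.mtbf = weight * gap + (1 - weight) * self.mtbf
        self.last_failure = timestamp

    def period(self):
        """Young's approximation (an assumption of this sketch)."""
        return math.sqrt(2 * self.checkpoint_cost * self.mtbf)


if __name__ == "__main__":
    policy = AdaptiveCheckpointPeriod(checkpoint_cost=2.0, initial_mtbf=10000.0)
    print(policy.period())
    policy.record_failure(0.0)
    policy.record_failure(100.0)  # failures arriving faster than expected
    print(policy.period())        # shorter period after the update
```

When failures arrive closer together than the current estimate, the estimated MTBF drops and the period shrinks, so checkpoints are taken more often; the opposite happens during quiet stretches.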

Thomas Ropars

Towards efficient replication of HPC applications to deal with crash failures

Ana Gainaru

Challenges in predicting failures on the Blue Waters system.

As the size of supercomputers increases, so does the probability of a single component failure within a given time frame. With the growing operating cost of extreme-scale supercomputers like Blue Waters, the act of predicting failures to prevent the loss of computation hours becomes cumbersome and presents a couple of challenges not encountered for smaller systems. The talk will focus on presenting online failure prediction and analyzing the Blue Waters system. We show to what extent online failure prediction is a possibility at petascale and what the challenges are in achieving an effective fault prevention mechanism for Blue Waters.

Mohamed Slim Bouguerra

Investigating the probability distribution of false negative failure alerts in HPC systems

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Several key results have demonstrated that recent advances in event log analysis can provide precise failure prediction. State-of-the-art failure prediction achieves a ratio of correctly identified failures to all predicted failures of over 90% and is able to discover around 50% of all failures in a system. However, a large part of the failures are not predicted and are counted as false negative alerts. Therefore, developing efficient fault tolerance strategies requires a good understanding of the properties and characteristics of failure prediction. To study the properties and characteristics of the false negative alerts, in this paper we conduct a statistical analysis to discover the probability distribution of such alerts and their impact on fault tolerance techniques. To this end, we study failure logs from different HPC production systems. We show that: (i) surprisingly, the false negative distribution has the same nature as the failure distribution; (ii) after adding failure prediction, we were able to infer statistical models that describe the inter-arrival time between false negative alerts, so current fault tolerance techniques can be applied on these systems; (iii) the current failure traces contain a high amount of correlation between failure inter-arrival times, which can be used to improve the failure prediction mechanism. Another important result is that checkpoint intervals can still be computed from existing first-order formulas when the failure distribution is purely random.
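
The two ratios quoted in the abstract correspond to what is usually called precision (correct predictions over all predictions) and recall (correct predictions over all failures). The sketch below computes both from counts and extracts the inter-arrival times of the failures the predictor missed, which is the quantity whose distribution the abstract studies. The function names and the sample numbers are hypothetical and are not data from the study.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: correct predictions / all predictions.
    Recall: correct predictions / all failures that occurred."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def false_negative_gaps(failures):
    """Inter-arrival times between failures that were NOT predicted.
    `failures` is a list of (timestamp, was_predicted) pairs."""
    missed = sorted(t for t, was_predicted in failures if not was_predicted)
    return [later - earlier for earlier, later in zip(missed, missed[1:])]


if __name__ == "__main__":
    # Made-up counts chosen to reproduce the 90% / 50% figures.
    print(precision_recall(true_pos=45, false_pos=5, false_neg=45))
    # Made-up trace: (time in hours, predicted?).
    trace = [(10, True), (25, False), (40, True), (70, False), (130, False)]
    print(false_negative_gaps(trace))
```

Fitting a distribution to the gaps returned by `false_negative_gaps` is the step that would tell whether a checkpointing model built for unpredicted failures still applies.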

Rajeev Thakur

Update on MPI and OS/R Activities at Argonne

This talk will give an update on MPI and OS/R activities at Argonne, including a big new project that is about to start in the area of exascale operating systems and runtime.

Andra Hugo

Composing multiple StarPU applications over heterogeneous machines: a supervised approach

Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a single runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes or memory bus contention. We present an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such parallel codes run within scheduling contexts that provide confined execution environments which can be used to partition computing resources. Scheduling contexts can be dynamically resized to optimize the allocation of computing resources among concurrently running libraries. We introduce a hypervisor that automatically expands or shrinks contexts using feedback from the runtime system (e.g. resource utilization). We demonstrate the relevance of our approach using benchmarks invoking multiple high performance linear algebra kernels simultaneously on top of heterogeneous multicore machines. We show that our mechanism can dramatically improve the overall application run time (-34%), most notably by reducing the average cache miss ratio (-50%).
