joint-lab workshop Jun. 12-14 2013

UNDER construction: The agenda below is not the final one

This event is supported by INRIA, UIUC, NCSA, ANL

Main Topics	Schedule	Speaker	Affiliation	Type of presentation	Title (tentative)	Download

Dinner Before the Workshop	7:30 PM	Only people registered for the dinner			Valpré hotel

Workshop Day 1	Wednesday June 12th
					TITLES ARE TEMPORARY (except if in bold font)
Registration	08:00
Welcome and Introduction	08:30	Marc Snir + Franck Cappello	INRIA&UIUC&ANL	Background	Welcome, Workshop objectives and organization
	08:45	Thom Dunning	UIUC	Background	NCSA updates and vision of the collaboration
	09:00	Marc Snir	ANL	Background	ANL updates and vision of the collaboration
	09:15	Frederic Desprez	Inria	Background	INRIA updates and vision of the collaboration
Big systems Chair: Christian Perez	09:30	Bill Kramer	UIUC	Background	Update on BlueWaters
	10:00	Break
	10:30	Mitsuhisa Sato	U. Tsukuba & AICS	Background	AICS and the K computer
	11:00	Paul Gibbon	Juelich	Background	TBA
Resilience&fault tolerance and simulation Chair: Franck Cappello	11:30	Marc Snir	ANL&UIUC	Report	ICIS report on Resilience
	12:00	Lunch
Resilience&fault tolerance and simulation	13:30	Vincent Baudoui	Total & ANL	Joint Result	TBA
	14:00	Bogdan Nicolae	IBM	Joint Result	ACM HPDC 2013 paper
	14:30	Martin Quinson	INRIA	Result	Improving Simulations of MPI Applications Using A Hybrid Network Model with Topology and Contention Support
Numerical Algorithms Chair: Laura Grigori	15:00	Bill Gropp	UIUC	Background	TBA
	15:30	Break
	16:00	Paul Hovland	ANL	Background	TBA
	16:30	Frederic Nataf	INRIA&P6	Background	TBA
	17:00	Luke Olson	UIUC	Background	TBA
	17:30	Marc Baboulin	INRIA	Background	Using condition numbers to assess numerical quality in high-performance computing applications
	18:00	Adjourn

	19:00	Dinner

Workshop Day 2	Thursday June 13th

Programming Models (cont.) Chair: Frederic Desprez	08:30	Jean-François Mehaut	INRIA	Result	Progresses in the European FP7 Mont-Blanc 1 project and objectives of its follow up: Mont-Blanc 2
	09:00	Rajeev Thakur	ANL	Background	TBA
	09:30	Andra Ecaterina Hugo	INRIA	Results	TBA
	10:00	Celso Mendes	UIUC	Background	TBA
	10:30	Break
Big Data, I/O, Visualization Chair: Gabriel Antoniu	11:00	Dries Kimpe	ANL	Results	TBA
	11:30	Gilles Fedak	INRIA	Result	Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures
	12:00	Matthieu Dorier	INRIA	Joint Result	Data Analysis of Ensemble Simulations: an In Situ Approach using Damaris
	12:30	Ian Foster	ANL	Background	TBA
	13:00	Lunch

Mini Workshop1
Resilience Chair: Marc Snir	14:00	Ana Gainaru	UIUC	Results	Failure prediction on Blue Waters
	14:30	Xiang Ni	UIUC	Results	TBA
	15:00	Tatiana	INRIA & ANL	Result	TBA
	15:30	Mohamed Slim Bouguerra	INRIA & ANL	Result	TBA
	16:00	Break
	16:30	Amina Guermouche	UVSQ	Result	Multi-criteria Checkpointing Strategies: Response-time versus Resource Utilization
	17:00	Thomas Ropars	EPFL	Result	TBA
	17:30	Mehdi Diouri	INRIA	Result	ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications
	18:00	Adjourn

Mini Workshop2
Numerical Algorithms and Libraries Chair: Bill Gropp	14:00	Laura Grigori	INRIA	Result	TBA
	14:30	Stefan Wild	ANL	Result	TBA
	15:00	Frederic Hecht	INRIA/P6	Result	TBA
	15:30	Jed Brown	ANL	Result	TBA
	16:00	Break
	16:30	Yushan Wang	INRIA P11	Result	TBA
	17:00	Jean Utke	ANL	Result	Designing and implementing a tool-independent, adjoinable MPI wrapper library
	17:30	Laurent Hascoet	INRIA	Result	The adjoint of MPI one-sided communications
	18:00	Adjourn

	19:00	Banquet			Lyon

Workshop Day 3	Friday June 14th

Mini Workshop1 (cont.)
Resilience Chair: Franck Cappello	08:30	Di Sheng	INRIA	Result	TBA
	09:00	Guillaume Aupy	INRIA	Result	TBA
	09:30	Discussion
	10:00	Break
Mini Workshop3	10:30	Guillaume Mercier	INRIA	Result	TBA
Programming and Scheduling Chair: Rajeev Thakur	11:00	Vincent Lanore	INRIA	Result	TBA
	11:30	Anne Benoit	INRIA	Result	Energy-efficient scheduling
	12:00	François Tessier	INRIA	Result	TBA
	12:30	Discussions
	13:00	Closing and Lunch

Mini Workshop2 (cont.)
Numerical Algorithms and Libraries Chair: Paul Hovland	08:30	François Pellegrini	INRIA	Result	TBA
	09:00	Luc Giraud	INRIA	Result	TBA
	09:30	Discussions
	10:00	Break
Mini Workshop4	10:30	Kate Keahey	ANL	Result	TBA
Clouds Chair: Frederic Desprez	11:00	Gabriel Antoniu	INRIA	Result	TBA
	11:30	Christian Perez	INRIA	Result	TBA
	12:00	Eddy Caron	INRIA	Result	TBA
	12:30	Discussions
	13:00	Closing and Lunch

Abstracts

Martin Quinson

Improving Simulations of MPI Applications Using A Hybrid Network Model with Topology and Contention Support

Proper modeling of collective communications is essential for understanding the behavior of medium- to large-scale parallel applications, and even minor deviations in implementation can adversely affect the prediction of real-world performance. We propose a hybrid network model extending LogP-based approaches to account for topology and contention in high-speed TCP networks. This model is validated within SMPI, an MPI implementation provided by the SimGrid simulation toolkit. With SMPI, standard MPI applications can be compiled and run in a simulated network environment, and traces can be captured without incurring errors from tracing overheads or poor clock synchronization as in physical experiments. SMPI provides features for simulating applications that require large amounts of time or resources, including selective execution, RAM folding, and off-line replay of execution traces. We validate our model by comparing traces produced by SMPI with those from other simulation platforms, as well as real-world environments.

Frederic Nataf

Toward black-box adaptive domain decomposition methods

Domain decomposition methods are a natural and powerful fit for modern parallel architectures. In order to be scalable, these methods involve coarse spaces. These coarse spaces are specifically designed for the two-level methods to be scalable and robust with respect to the coefficients in the equation and the choice of the decomposition. We achieve this in an automatic way by solving generalized eigenvalue problems on the interfaces between subdomains to identify the modes which slow down convergence. This construction allows for a black-box implementation. Theoretical bounds for the condition numbers of the preconditioned operators, which depend only on a chosen threshold and the maximal number of neighbours of a subdomain, are presented and proved. Scalable implementations on HPC platforms make it possible to solve problems with several billions of unknowns in three dimensions using the FreeFem++ DSL for finite element simulations.

Marc Baboulin

Using condition numbers to assess numerical quality in high-performance computing applications

We explain how condition numbers of problems can be used to assess the quality of a computed solution. We illustrate our approach by considering the example of overdetermined linear least squares (linear systems being a special case of the latter). Our method is based on deriving exact values or estimates for the condition number of these problems. We describe algorithms and software to compute these quantities using standard parallel libraries. We present numerical experiments in a physical application and we report performance results using new routines on top of the multicore-GPU library MAGMA.
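
As a rough illustration of the idea (not the speaker's algorithms or the MAGMA routines), the sketch below computes the 2-norm condition number of the matrix in two overdetermined least-squares problems with NumPy. It assumes that kappa_2(A) is an acceptable stand-in; the condition number of the least-squares problem itself also depends on the residual, which this sketch ignores. The function name `solve_with_condition` and the test data are hypothetical.

```python
import numpy as np


def solve_with_condition(A, b):
    """Return the least-squares solution of min ||Ax - b||_2 and kappa_2(A)."""
    # Singular values come back sorted in descending order.
    singular_values = np.linalg.svd(A, compute_uv=False)
    kappa = singular_values[0] / singular_values[-1]
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x, kappa


# Hypothetical data: fit sin(t) on 50 sample points in [0, 1].
t = np.linspace(0.0, 1.0, 50)
b = np.sin(t)

# A straight-line fit (well conditioned) versus a degree-11 monomial fit
# (ill conditioned, since high powers of t are nearly linearly dependent).
A_line = np.column_stack([np.ones_like(t), t])
A_poly = np.vander(t, 12, increasing=True)

x_line, kappa_line = solve_with_condition(A_line, b)
x_poly, kappa_poly = solve_with_condition(A_poly, b)

print(f"kappa_2 (line fit)      = {kappa_line:.2e}")
print(f"kappa_2 (degree-11 fit) = {kappa_poly:.2e}")
```

A large kappa_2 signals that small perturbations in the data may be strongly amplified in the computed solution, so fewer digits of the result can be trusted.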

Jean-François Mehaut

Progresses in the European FP7 Mont-Blanc 1 project and objectives of its follow up: Mont-Blanc 2

Amina Guermouche

Multi-criteria Checkpointing Strategies: Response-time versus Resource Utilization

Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been given to per-application completion time, rather than to platform efficiency. In this talk, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time under uncoordinated checkpointing is unchanged, while the platform efficiency is near-perfect.

Anne Benoit

Energy-efficient scheduling

Jean Utke

Designing and implementing a tool-independent, adjoinable MPI wrapper library

The efficient computation of gradients by the "adjoint-mode" of algorithmic differentiation (AD) entails the inversion of MPI communication graphs. The logic to be implemented for adjoining non-blocking communication patterns is sufficiently complex to warrant a design of components that is independent of the algorithmic differentiation tool that provides the context in which the adjoint communication is to take place. We discuss (i) how we account for the different data models implied by the AD tool as well as the target language, (ii) the implementation choices among the possible adjoint communications, and (iii) the currently known limitations of our approach. We hope for feedback from the community regarding this design particularly with respect to performance and current developments in the MPI standard.

Laurent Hascoet

The adjoint of MPI one-sided communications
Computing gradients of numerical models by the adjoint mode of algorithmic differentiation is a crucial ingredient for model optimization, sensitivity analysis, and uncertainty quantification of many large-scale science and engineering applications. The adjoint mode implies a reversal of the data dependencies and consequently a reversal of communications in parallelized models. Building on previous studies regarding the adjoining of MPI two-sided communications, we investigate the construction of adjoints for certain one-sided MPI communications.

Mehdi Diouri

ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications

Energy consumption and fault tolerance are two interrelated issues to address in designing future exascale systems. Fault tolerance protocols used for checkpointing have different energy consumption depending on parameters like application features, number of processes in the execution and platform characteristics. Currently, the only way to select a protocol for a given execution is to run the application and monitor the energy consumption of different fault tolerance protocols. This is needed for any variation of the execution setting. To avoid this time- and energy-consuming process, we propose an energy estimation framework. It relies on an energy calibration of the considered platform and a user description of the execution setting. We evaluate the accuracy of our estimations with real applications running on a real platform with energy consumption monitoring. Results show that our estimations are highly accurate and allow selecting the best fault tolerance protocol without pre-executing the application.

Matthieu Dorier

Data Analysis of Ensemble Simulations: an In Situ Approach using Damaris
As we approach exascale, simulations running on ever more cores on supercomputers produce ever larger data that has to be stored for subsequent analysis. Because storage performance does not keep pace with computation performance, in situ analysis has been proposed as a way to run analysis tasks along with the running simulation. While this reduces the need to store massive amounts of raw data and gives scientists direct insight into their simulation, it does not allow comparing multiple runs of the same simulation (ensemble simulations), as these runs are not performed at the same moment. Thus in situ approaches remain limited and ensemble simulations still require storing raw data. We present a complete framework for comparing data produced by different runs of the same simulation. This framework uses the Damaris I/O middleware to re-load data from previous experiments inside a running instance of the simulation, allowing a direct in situ comparison of data between older and current runs.

Gilles Fedak

Active Data: A Programming Model to Manage Data Life Cycle Across Heterogeneous Systems and Infrastructures
The Big Data challenge consists in managing, storing, analyzing and visualizing huge and ever-growing data sets to extract meaning and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key point is to handle the complexity of the data life cycle, i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. To alleviate the complexity of the data life cycle, we propose Active Data, a programming model to automate and improve the expressiveness of data management applications. We first introduce the concept of data life cycle and define a formal model that allows exposing the data life cycle across heterogeneous systems and infrastructures. The Active Data programming model allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happen to any data. We implement and evaluate the model with four use cases: a storage cache to Amazon-S3, a cooperative sensor network, an incremental implementation of the MapReduce programming model and automated data provenance tracking across heterogeneous systems. Altogether, these scenarios illustrate the adequacy of the model for programming applications that manage distributed and dynamic data sets. We also show that applications that do not leverage the data life cycle can benefit from Active Data to improve their performance.
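
The event-driven idea in this abstract can be pictured with a minimal Python sketch. This is not the Active Data API: the class `LifeCycleBus`, its methods `on` and `publish`, and the identifier `dataset-42` are all hypothetical, and only the four event names are taken from the abstract.

```python
from collections import defaultdict

# Life-cycle events listed in the abstract.
EVENTS = ("creation", "replication", "transfer", "deletion")


class LifeCycleBus:
    """Toy dispatcher: runs programmer-supplied routines when an event occurs."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def _check(self, event):
        if event not in EVENTS:
            raise ValueError(f"unknown life-cycle event: {event}")

    def on(self, event, handler):
        """Register a routine to run whenever `event` happens to any data item."""
        self._check(event)
        self._handlers[event].append(handler)

    def publish(self, event, data_id):
        """Signal that `event` happened to `data_id` and run the registered routines."""
        self._check(event)
        for handler in self._handlers[event]:
            handler(data_id)


log = []
bus = LifeCycleBus()
bus.on("creation", lambda d: log.append(f"index {d}"))
bus.on("transfer", lambda d: log.append(f"record provenance of {d}"))
bus.on("deletion", lambda d: log.append(f"evict {d} from cache"))

for event in ("creation", "transfer", "deletion"):
    bus.publish(event, "dataset-42")

print(log)
```

In this toy version the events are published by hand; in a real deployment they would presumably be emitted by the underlying storage and transfer systems.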
