
Under construction: the agenda below is not final.

This event is supported by INRIA, UIUC, NCSA, ANL and the French Ministry of Foreign Affairs.

Main Topics

Schedule

            Speaker

Affiliation

Type of presentation

Title (tentative)

Download

 

 

 

 

 

 

 

Dinner Before the Workshop

7:00 PM

Only people registered for the dinner

 

 

 

 

 

 

 

 

 

 

 

Workshop Day 1

Monday Nov. 25th

 

 

 

 

 

 

 

 

 

 

TITLES ARE TEMPORARY (except if in bold font)

 

Registration

08:00

 

 

 

 

 

Welcome and Introduction

Amphitheatre

Chair: Franck

08:30

Marc Snir + Franck Cappello

INRIA&UIUC&ANL

Background

Welcome, Workshop objectives and organization

 

 

08:45

Peter Schiffer

UIUC

Background

Welcome from UIUC Vice Chancellor for Research

 

 

09:00

Ed Seidel

UIUC

Background

NCSA update and vision of the collaboration

 

 

09:15

Michel Cosnard

Inria

Background

INRIA updates and vision of the collaboration

 


09:30

Marc Snir

ANL

Background

Argonne updates and vision of the collaboration

 

 

09:45

Franck Cappello

ANL

Background

Joint-Lab, New Joint-Lab, PUF articulation

 

 

10:15

Break

 

 

 

 

Extreme Scale Systems and infrastructures

Amphitheatre

Chair: Marc Snir

10:45

Pete Beckman

ANL

 

Extreme Scale Computing & Co-design Challenges

 

 

11:15

John Towns

UIUC

 

Plenary talk

 
11:45

Gabriel Antoniu

INRIA

Plenary talk

 

12:15

Lunch

 

 

Plenary talk

 


13:45

Bill Kramer

UIUC

Blue Waters

BW Observations and new challenges

 


14:15

Marc Snir

UIUC

 

G8 ECS and international collaboration toward extreme scale climate simulation

 

 

14:45

Rob Ross

ANL

 

Thinking Past POSIX: Persistent Storage in Extreme Scale Systems

 
15:15

François Pellegrini

INRIA

Plenary talk

15:45

Break

 

16:15

Yves Robert

INRIA

 

Assessing the impact of ABFT & Checkpoint composite strategies

 
16:15

Pavan Balaji

ANL

Conflict

16:45

Wen-mei Hwu

UIUC

Plenary talk

 
17:15

Adjourn

 

18:45

Bus for dinner

 

 

 

 

 

 

 

 

 

 

 

Workshop Day 2


Tuesday Nov. 26

 

 

 

 

 

Applications, I/O, Visualization, Big data

Amphitheatre

Chair: Rob Ross

08:30

Greg Bauer

UIUC

Applications and their challenges on Blue Waters

 

 

09:00

Matthieu Dorier

INRIA

Joint-result, submitted

CALCioM: Mitigating I/O Interferences in HPC Systems through Cross-Application Coordination

 
 

09:30

Dries Kimpe

ANL

 

Plenary talk

 

 

10:00

Venkat Vishwanath

ANL

 

Plenary talk

 

 

10:30

Break

 

 

 

 

 

11:00

Babak Behzad

UIUC

ACM/IEEE SC13

Taming Parallel I/O Complexity with Auto-Tuning

 

 

11:30

Kenton Guadron McHenry

UIUC

 

NSF CIF21 DIBBs: Brown Dog

 

 

12:00

Lunch

 

 


 

 

 

 

 

 

 

 

Mini Workshop1

Resilience

Room 1030

Chair: Yves Robert

 

 

 

 

 

 

 

13:30

Leonardo

ANL

Joint-result


 

 

14:00

Tatiana

INRIA

Joint-result


 

 

14:30

Mohamed Slim Bouguerra

INRIA

Joint-result, submitted


 

 

15:00

Ana Gainaru

UIUC

Joint-result, submitted


 

 

15:30

Break

 

 

 

 

 

16:00

Sheng Di

INRIA

Joint-result, submitted


 

 

16:30

Frederic Vivien

INRIA

 


 

 

17:00

Wesley Bland

ANL

 

Fault Tolerant Runtime Research at ANL

 

 

17:30

Adjourn

 

 

 

 

 

19:00

Bus for dinner

 

 

 

 

       

Mini Workshop2

Numerical Algorithms

Room 1040

Chair: Bill Gropp

 

 

 

 

 

 

 

13:30

Luke Olson

UIUC

 

  
14:00

Prasanna Balaprakash

ANL

Active-Learning-based Surrogate Models for Empirical Performance Tuning

 

14:30

Yushan Wang

INRIA

 

Solving 3D incompressible Navier-Stokes equations on hybrid CPU/GPU systems.

 

 

15:00

Jed Brown

ANL

 

 

 

 

15:30

Break

 

 

 

 

 

16:00

Pierre Jolivet

INRIA

Best Paper nominee, ACM/IEEE SC13


 
16:30

Vincent Baudoui

Total&ANL

17:00

TBD

TBD

 

17:30

Adjourn

 

 

 

 

       

 

19:00

Bus for dinner

 

 

 

 

 

 

 

 

 

 

 

Workshop Day 3


Wednesday Nov. 27

 

 

 

 

 

 

 

 

 

 

 

 

Mini Workshop3


 

 

 

 

 

 

 Programming models, compilation and runtime.

Room 1030

Chair: Marc Snir

08:30

Grigori Fursin

INRIA

 

 

 

 

09:00

Maria Garzaran

UIUC

 


 


09:30

Jean-François Mehaut

INRIA

 


 
10:00

Break

 

10:30

Pavan Balaji

ANL

 

Can only talk on Monday

 

 

11:00

Rafael Tesser

INRIA

Joint result PDP 2013


 

 

11:30

Emmanuel Jeannot

INRIA

Joint-result, IEEE Cluster2013

Communication and Topology-aware Load Balancing in Charm++ with TreeMatch

 

 

12:00

Closing

 

 

 

 

 

12:30

Lunch

 

 

 

 

       

 

18:00

Bus for dinner

 

 

 

 

Mini Workshop4

Large scale systems and their simulators

Room 1040

Chair: Bill Kramer

 

 

 

 

 

 


08:30

Sanjay Kale

 

 


 

 

09:00

Arnaud Legrand

 

 

SMPI: Toward Better Simulation of MPI Applications

 


09:30

Kate Keahey

 

 


 

 

10:00

Break

 

 

 

 


10:30

Gilles Fedak

 

 


 

 

11:00

Jeremy Enos

 

 


 

 

11:30

TBD

 

 


 

 

12:00

Closing

 

 

 

 

 

12:30

Lunch

 

 

 

 

       
18:00

Bus for dinner

Abstracts

Kenton McHenry

NSF CIF21 DIBBs: Brown Dog

The objective of this project is to construct a service that will allow for past and present un-curated data to be utilized by science while simultaneously demonstrating the novel science that can be conducted from such data. The proposed effort will focus on the large distributed and heterogeneous bodies of past and present un-curated data, what is often referred to in the scientific community as long-tail data, data that would have great value to science if its contents were readily accessible. The proposed framework will be made up of two re-purposable cyberinfrastructure building blocks referred to as a Data Access Proxy (DAP) and Data Tilling Service (DTS). These building blocks will be developed and tested in the context of three use cases that will advance science in geoscience, biology, engineering, and social science. The DAP will aim to enable a new era of applications that are agnostic to file formats through the use of a tool called a Software Server which itself will serve as a workflow tool to access functionality within 3rd party applications. By chaining together open/save operations within arbitrary software the DAP will provide a consistent means of gaining access to content stored across the large numbers of file formats that plague long tail data. The DTS will utilize the DAP to access data contents and will serve to index unstructured data sources (i.e. instrument data or data without text metadata). Building off of the Versus content based comparison framework and the Medici extraction services for auto-curation the DTS will assign content specific identifiers to untagged data allowing one to search collections of such data. 
The intellectual merit of this work lies in the proposed solution which does not attempt to construct a single piece of software that magically understands all data, but instead aims at utilizing every possible source of automatable help already in existence in a robust and provenance preserving manner to create a service that can deal with as much of this data as possible. This proverbial “super mutt” of software, or Brown Dog, will serve as a low level data infrastructure to interface with digital data contents and through its capabilities enable a new era of science and applications at large. The broader impact of this work is in its potential to serve not just the scientific community but the general public, as a DNS for data, moving civilization towards an era where a user’s access to data is not limited by a file’s format or un-curated collections.


Emmanuel Jeannot, Esteban Meneses-Rojas, Guillaume Mercier, François Tessier and Gengbin Zheng

Communication and Topology-aware Load Balancing in Charm++ with TreeMatch

Programming multicore or manycore architectures is a hard challenge, particularly if one wants to take full advantage of their computing power. Moreover, a hierarchical topology implies that communication performance is heterogeneous, and this characteristic should also be exploited. We developed two load balancers for Charm++ that take both aspects into account, depending on whether the application is compute-bound or communication-bound. This work is based on our TreeMatch library, which computes a process placement that reduces an application's communication cost based on the hardware topology. We show that the proposed load-balancing scheme improves execution times for both classes of parallel applications.


Matthieu Dorier

CALCioM: Mitigating I/O Interferences in HPC Systems through Cross-Application Coordination

Unmatched computation and storage performance in new HPC systems have led to a plethora of I/O optimizations ranging from application-side collective I/O to network and disk-level request scheduling on the file system side. As we deal with ever larger machines, the interference produced by multiple applications accessing a shared parallel file system concurrently becomes a major problem. Interference often breaks single-application I/O optimizations, dramatically degrading application I/O performance and, as a result, lowering machine-wide efficiency.
This talk will focus on CALCioM, a framework that aims to mitigate I/O interference through the dynamic selection of appropriate scheduling policies. CALCioM allows several applications running on a supercomputer to communicate and coordinate their I/O strategy in order to avoid interfering with one another. In this work, we examine four I/O strategies that can be accommodated in this framework: serializing, interrupting, interfering and coordinating. Experiments on Argonne’s BG/P Surveyor machine and on several clusters of the French Grid’5000 show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine-wide efficiency.


Babak Behzad

Taming Parallel I/O Complexity with Auto-Tuning

We present an auto-tuning system for optimizing I/O performance of HDF5 applications and demonstrate its value across platforms, applications, and at scale. The system uses genetic algorithms to search a large space of tunable parameters and to identify effective settings at all layers of the parallel I/O stack. The parameter settings are applied transparently by the auto-tuning system via dynamically intercepted HDF5 calls. To validate our auto-tuning system, we applied it to three I/O benchmarks (VPIC, VORPAL, and GCRM) that replicate the I/O activity of their respective applications. We tested the system with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC platforms (Cray XE6, IBM BG/P, and Dell Cluster). In all cases, the auto-tuning framework identified tunable parameters that substantially improved write performance over default system settings. We consistently demonstrate I/O write speedups between 2x and 100x for test configurations.
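
The search idea in this abstract can be illustrated with a toy genetic algorithm. This is only a sketch under stated assumptions: the parameter names and values in `SPACE` are hypothetical, and `synthetic_write_time` is a made-up stand-in for a real benchmark run. It is not the authors' auto-tuning system and does not intercept HDF5 calls.

```python
import random

# Hypothetical search space: names and values are illustrative only.
SPACE = {
    "stripe_count": [4, 8, 16, 32, 64],
    "stripe_size_mb": [1, 4, 16, 64],
    "aggregators": [1, 2, 4, 8, 16],
}


def synthetic_write_time(cfg):
    """Made-up cost function standing in for a measured write time (lower is better)."""
    return (abs(cfg["stripe_count"] - 32) / 32
            + abs(cfg["stripe_size_mb"] - 16) / 16
            + abs(cfg["aggregators"] - 8) / 8)


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}


def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from one of the two parents."""
    return {name: rng.choice((a[name], b[name])) for name in SPACE}


def mutate(cfg, rng, rate=0.2):
    """Resample each parameter with probability `rate`."""
    return {name: (rng.choice(SPACE[name]) if rng.random() < rate else value)
            for name, value in cfg.items()}


def genetic_search(cost, generations=20, pop_size=12, elite=2, seed=0):
    """Return the best configuration found and the best cost per generation."""
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)
        history.append(cost(pop[0]))
        parents = pop[: pop_size // 2]  # the better half may reproduce
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        pop = pop[:elite] + children  # elitism keeps the best configurations
    return min(pop, key=cost), history


if __name__ == "__main__":
    best, history = genetic_search(synthetic_write_time)
    print(best, history[-1])
```

Because the best configurations are carried over unchanged (elitism), the best cost never gets worse from one generation to the next. In a real setting each cost evaluation would be an expensive benchmark run, so the population size and number of generations would be chosen with that budget in mind.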


Yves Robert, ENS Lyon, INRIA & Univ. Tenn. Knoxville

Assessing the impact of ABFT & Checkpoint composite strategies
 

Algorithm-specific fault-tolerant approaches promise unparalleled scalability and performance in failure-prone environments. With advances in the theoretical and practical understanding of the algorithmic traits that enable such approaches, a growing number of frequently used algorithms (including all widely used factorization kernels) have been shown to support them. These algorithms provide a temporal section of the execution during which the data is protected by its own intrinsic properties and can be algorithmically recomputed without the need for checkpoints. However, while typical scientific applications spend a significant fraction of their execution time in library calls that can be ABFT-protected, they interleave sections that are difficult or even impossible to protect with ABFT. As a consequence, the only fault-tolerance approach currently used for these applications is checkpoint/restart. In this talk, we propose a model and a simulator to investigate the behavior of a composite protocol that alternates between ABFT and checkpoint/restart to protect each phase of an iterative application composed of ABFT-aware and ABFT-unaware sections. We show that this approach drastically increases the performance delivered by the system, especially at scale, by making checkpoints less frequent while simultaneously decreasing the volume of data that must be saved in each checkpoint.


Prasanna Balaprakash

Active-Learning-based Surrogate Models for Empirical Performance Tuning

Performance models have a profound impact on hardware-software co-design, architectural explorations, and performance tuning of scientific applications. Developing algebraic performance models is becoming an increasingly challenging task. In such situations, a statistical surrogate-based performance model, fitted to a small number of input-output points obtained from empirical evaluation on the target machine, provides a range of benefits. Accurate surrogates can emulate the output of the expensive empirical evaluation at new inputs and therefore can be used to test and/or aid search, compiler, and autotuning algorithms. We present an iterative parallel algorithm that builds surrogate performance models for scientific kernels and workloads on single-core, multicore, and multinode architectures. We tailor to our unique parallel environment an active learning heuristic popular in the literature on the sequential design of computer experiments in order to identify the code variants whose evaluations have the best potential to improve the surrogate. We use the proposed approach in a number of case studies to illustrate its effectiveness.


Greg Bauer

Applications and their challenges on Blue Waters

The leadership class Blue Waters system is providing petascale level computational and I/O capabilities to its partners. To date there are approximately 32 teams using Blue Waters to pursue their science and engineering on 22,640 Cray XE CPU compute nodes and 4,224 Cray XK GPU nodes with a 26 PB, 1 TB/s filesystem. The challenges encountered by the teams are as varied as the applications running on Blue Waters. This talk will provide an overview of the Blue Waters system, its recent upgrade in GPU computing capability and network dimension, and a discussion of the applications and their challenges computing at scale on Blue Waters.


Yushan Wang
Solving 3D incompressible Navier-Stokes equations on hybrid CPU/GPU systems.

The Navier-Stokes equations are the fundamental basis of many computational fluid dynamics problems. In this presentation, we will talk about a hybrid multicore/GPU solver for the incompressible Navier-Stokes equations with constant coefficients, discretized by the finite difference method. We use the prediction-projection method, which transforms the Navier-Stokes problem into Helmholtz-like and Poisson problems. Efficient solvers for the two subproblems will be presented, with implementations that take advantage of GPU accelerators. We will also provide numerical experiments on a current hybrid machine.

Arnaud Legrand
SMPI: Toward Better Simulation of MPI Applications

We will present our latest results on the SMPI/SimGrid framework. SMPI now implements all the collective algorithms and selection logic of both OpenMPI and MPICH, and even a few other collective algorithms from Star MPI. Together with a flexible network model and topology description mechanism, this allowed us to obtain almost perfect predictions of NASPB and BigDFT on Ethernet/TCP-based clusters. We are currently working on extending this work to other kinds of networks as well as on mixing the emulation capability of SMPI with the trace replay mechanism. We are also working on improving the replay mechanism so that it seamlessly handles classical trace formats.

Wesley Bland
Fault Tolerant Runtime Research at ANL

Fault tolerance has been presented as an emerging problem for decades, with researchers often claiming that the next generation of hardware will introduce new levels of failure rates that will destroy productivity and cause applications to become unusable. While it is true that as machines have scaled, resilience has become more and more of a concern, there are issues already affecting applications at current scales. Process failure remains a concern, though primarily for applications that can run at the largest scales or on very unstable hardware. For smaller applications however, there are other concerns, such as soft errors, performance loss, etc. This talk will cover some of the research being performed in the Programming Models and Runtime Systems group at Argonne National Laboratory to study these phenomena.


Jed Brown and Debojyoti Ghosh  

Fast solvers for implicit Runge-Kutta systems

Implicit Runge-Kutta methods offer very high order accuracy, excellent stability properties, and optional symplecticity at the expense of needing to solve a coupled system of equations. In the past, this has been seen as a drawback, and implicit RK methods have received little attention in the large-scale computing world, apart from recent interest in Spectral Deferred Correction (SDC) methods, which are a particular iterative method for solving implicit RK systems. However, the work in SDC scales quadratically in the number of stages, and SDC is rarely more efficient than conventional sequential time stepping. Implicit RK systems have tensor product structure $$ S \otimes I + I \otimes J $$ where $S = (h A)^{-1}$ comes from the $s\times s$ Butcher table $A$, and $J$ is the (typically sparse) Jacobian of the spatial discretization. Diagonalization of $S$ was proposed independently by Butcher (1976) and Bickart (1977) as a solution method, leading to $s$ decoupled sparse systems, each with a different (complex-valued) diagonal shift, and quickly became the standard approach in the ODE community. Instead of distributing the stages, we permute the multivector and solve all stages at once using preconditioned iterative methods that achieve much higher machine utilization due to a computational structure similar to solving a single linear system with multiple right hand sides.
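
A short worked step shows why diagonalizing $S$ decouples the stages. It assumes $S$ is diagonalizable as $S = V \Lambda V^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_s)$; the symbols $V$, $\Lambda$, $\lambda_i$, $y$ and $b$ are introduced here for illustration and do not appear in the abstract. Using the mixed-product property of the Kronecker product,

$$ S \otimes I + I \otimes J = (V \otimes I)\,(\Lambda \otimes I + I \otimes J)\,(V^{-1} \otimes I) $$

The middle factor is block diagonal, so solving $(S \otimes I + I \otimes J)\, y = b$ reduces to $s$ independent shifted systems

$$ (\lambda_i I + J)\, \tilde{y}_i = \tilde{b}_i, \qquad i = 1, \dots, s, $$

where $\tilde{y} = (V^{-1} \otimes I)\, y$ and $\tilde{b} = (V^{-1} \otimes I)\, b$. The eigenvalues $\lambda_i$ are complex in general, which is where the complex-valued diagonal shifts come from.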


Mohamed Slim Bouguerra

Failure prediction: what to do with unpredicted failures?

 

 

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Several key results have demonstrated that recent advances in event log analysis can provide precise failure prediction. State-of-the-art failure prediction achieves a precision (the ratio of correctly identified failures to all predicted failures) of over 90% and is able to discover around 50% of all failures in a system. However, a large share of failures is not predicted; these are considered false negative alerts. Developing efficient fault tolerance strategies therefore requires a good understanding of the characteristics of failure prediction. To understand the properties of false negative alerts, we conducted a statistical analysis of the probability distribution of such alerts and their impact on fault tolerance techniques. Specifically, we studied failure logs from different HPC production systems. We show that (i) the false negative distribution has the same nature as the failure distribution, and (ii) after adding failure prediction, we were able to infer statistical models that describe the inter-arrival time between false negative alerts, so that current fault tolerance techniques can be applied to these systems. Moreover, we show that current failure traces exhibit a high correlation between failure inter-arrival times, which can be used to improve the failure prediction mechanism. Another important result is that checkpoint intervals for unpredicted failures can be computed from Daly's existing higher-order formula. We show how the proposed statistical model can be applied to combine proactive migration and preventive checkpoints. Trace-based simulations show that the proposed combination improves the useful work of an execution by more than 13% with a recall of only 45%.
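
The quantities named in this abstract can be made concrete with a small sketch. It assumes the commonly stated form of Daly's higher-order checkpoint interval and, as a simplification that is not taken from the abstract, that unpredicted failures arrive with mean time between failures `mtbf / (1 - recall)`. The function names and the numbers in the example are illustrative only.

```python
import math


def precision_recall(true_pos, false_pos, false_neg):
    """Precision: correct predictions / all predictions.
    Recall: correct predictions / all failures that actually occurred."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def daly_interval(checkpoint_cost, mtbf):
    """Daly's higher-order estimate of the optimal compute time between checkpoints.

    checkpoint_cost and mtbf must use the same time unit.
    """
    if checkpoint_cost >= 2 * mtbf:
        return mtbf
    x = checkpoint_cost / (2 * mtbf)
    return (math.sqrt(2 * checkpoint_cost * mtbf)
            * (1 + math.sqrt(x) / 3 + x / 9)
            - checkpoint_cost)


def unpredicted_mtbf(mtbf, recall):
    """Assumption: only the unpredicted fraction (1 - recall) of failures
    still has to be covered by periodic checkpoints."""
    return mtbf / (1 - recall)


if __name__ == "__main__":
    cost, mtbf = 60.0, 86400.0  # illustrative: 1-minute checkpoint, 1-day MTBF
    print(daly_interval(cost, mtbf))
    print(daly_interval(cost, unpredicted_mtbf(mtbf, recall=0.45)))
```

Under this assumption, a higher recall stretches the interval between periodic checkpoints, because fewer failures arrive without warning.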


 

 

 

