Main Topics |
Schedule |
Speakers |
Types of presentation |
Titles (tentative) |
|
|
|
|
|
Workshop Day 1 (Auditorium) |
Monday Nov. 22nd |
|
|
|
Welcome and Introduction |
08:30 |
Franck Cappello, INRIA & UIUC, France, and Thom Dunning, NCSA, USA |
Background |
Workshop details |
Post PetaScale and Exascale Systems |
08:45 |
Mitsuhisa Sato, U. Tsukuba, Japan |
Trends in HPC |
Next Gen and Exascale initiative in Japan |
|
09:15 |
Marc Snir, UIUC, USA |
Trends in HPC |
Toward Exascale |
|
09:45 |
Wen-Mei Hwu, UIUC, USA |
Trends in HPC |
Exascale and Accelerators |
|
10:15 |
Arun Rodrigues, Sandia, USA |
Trends in HPC |
X-Caliber (DARPA UHPC) |
|
10:45 |
Break |
|
|
Post Petascale Applications and System Software |
11:15 |
Pete Beckman, ANL, USA |
Trends in HPC |
Exascale Software Center |
|
11:45 |
Michael Norman, SDSC, USA |
Trends in HPC |
ENZO |
|
12:15 |
Eric Bohm, UIUC, USA |
Trends in HPC |
NAMD |
|
12:30 |
Lunch |
|
|
|
|
|
|
|
|
|
|
|
|
BLUE WATERS |
14:00 |
Bill Kramer, NCSA, USA |
Overview |
Update on Blue Waters |
Collaborations on System Software |
14:30 |
Ana Gainaru, NCSA, USA |
Early Results |
A Framework for System Event Analysis |
|
15:00 |
Thomas Ropars, INRIA, France |
Results |
Uncoordinated checkpointing without domino effect for send-deterministic applications |
|
15:30 |
Esteban Meneses, UIUC, USA |
Early Results |
Clustering Message Passing Applications to Enhance Fault Tolerance Protocols |
|
16:00 |
Break |
|
|
Collaborations on System Software |
16:30 |
Leonardo Bautista, Titech, Japan |
Results/International collaboration with Japan |
Transparent low-overhead checkpoint for GPU-accelerated clusters |
|
17:00 |
Gabriel Antoniu, INRIA/IRISA, France |
Results |
Concurrency-optimized I/O for visualizing HPC simulations: An Approach Using Dedicated I/O cores |
|
17:30 |
Mathias Jacquelin, INRIA/ENS Lyon, France |
Results |
Vertical vs Horizontal parity for tape archives |
|
18:00 |
Olivier Richard, INRIA/U. Grenoble, France |
Early Results |
I/O aware Resource Management Software |
|
18:30 |
Torsten Hoefler, NCSA, USA |
Potential collaboration |
TBA |
|
|
|
|
|
Workshop Day 2 (Auditorium) |
Tuesday Nov. 23rd |
|
|
|
|
|
|
|
|
Collaborations on System Software |
08:30 |
Frédéric Vivien, INRIA/ENS Lyon, France |
Potential collaboration |
On scheduling the checkpoints of exascale applications |
Collaborations on Programming models |
09:00 |
Thierry Gautier, INRIA, France |
Early Results |
TBA |
|
09:30 |
Jean-François Méhaut, INRIA/U. Grenoble, France |
Early Results |
TBA |
|
10:00 |
Emmanuel Jeannot, INRIA/U. Bordeaux, France |
Early Results |
TBA |
|
10:30 |
Break |
|
|
|
11:00 |
Raymond Namyst, INRIA/U. Bordeaux, France |
Early Results |
TBA |
|
11:30 |
Brian Amedro, INRIA/U. Nice, France |
Potential collaboration |
TBA |
|
12:00 |
Christian Perez, INRIA/ENS Lyon, France |
Early Results |
High Performance Component with Charm++ and OpenAtom |
|
12:30 |
Lunch |
|
|
Collaborations on Numerical Algorithms and Libraries |
14:00 |
Bill Gropp, UIUC, USA |
Early Results |
TBA |
|
14:30 |
Simplice Donfack, INRIA/U. Paris Sud, France |
Early Results |
TBA |
|
15:00 |
Désiré Nuentsa Wakam, INRIA/IRISA, France |
Early Results |
Parallel Implementation of deflated GMRES in the PETSc package |
|
15:30 |
Sébastien Fourestier, INRIA/U. Bordeaux, France |
Early Results |
TBA |
|
16:00 |
Break |
|
|
|
16:30 |
Marc Baboulin, INRIA/U. Paris Sud, France |
Early Results |
Accelerating linear algebra computations with hybrid GPU-multicore systems |
|
17:00 |
Daisuke Takahashi, U. Tsukuba, Japan |
Results/International collaboration with Japan |
Optimization of a Parallel 3-D FFT with 2-D Decomposition |
|
17:30 |
Alex Yee, UIUC, USA |
Early Results |
A Single-Transpose implementation of the Distributed out-of-order 3D-FFT |
|
17:50 |
Jeongnim Kim, NCSA, USA |
Early Results |
Toward petaflop 3D FFT on clusters of SMPs |
|
|
|
|
|
|
|
|
|
|
Workshop Day 3 (Auditorium) |
Wednesday Nov. 24th |
|
|
|
|
|
|
|
|
Break out sessions introduction |
08:30 |
Cappello, Snir |
Overview |
Objectives of Break-out, expected results |
Topics |
|
Participants |
Other NCSA participants |
|
Break out session 1 |
9:00-10:30 |
|
|
|
Routing, topology mapping, scheduling, perf. modeling |
|
Snir, Hoefler, Vivien, Jeannot, Kale |
|
Room |
3D-FFT |
|
Cappello, Takahashi, Yee, Kim |
|
Room |
Libraries |
|
Gropp, Baboulin, Nuentsa, Donfack, Fourestier |
|
Room |
|
|
|
|
|
|
10:15 |
Break |
|
|
Break out session 2 |
10:30-12:00 |
|
|
|
Resilience |
|
Kramer, Cappello, Gainaru, Ropars, Meneses, Bautista |
|
Room |
Programming models / GPU |
|
Kale, Méhaut, Namyst, Hwu, Amedro, Perez, Hoefler, Jeannot |
|
Room |
I/O |
|
Snir, Vivien, Jacquelin, Antoniu, Richard |
|
Room |
Break out session report |
12:00 |
Speakers: Snir, Cappello, Gropp, Kramer, Kale |
|
Auditorium |
Closing |
12:30 |
Cappello, Snir |
|
Auditorium |
|
13:00 |
Lunch |
|
|
Abstracts
Marc Snir, UIUC
Toward Exascale
The talk will position exascale research in the context of the pending slow-down of the exponential increase in chip densities, and will discuss some fundamental research problems that need to be addressed in order to reach exascale performance at reasonable expense.
Esteban Meneses, UIUC
Clustering Message Passing Applications to Enhance Fault Tolerance Protocols
This talk describes the effort of an ongoing collaboration to find meaningful clusters in a parallel computing application using its communication behavior. We start by showing the communication patterns of various MPI benchmarks and how standard graph partitioning techniques can group the ranks into subsets. For Charm++ applications, we describe the changes to the runtime system needed to find the clusters dynamically, even in the presence of object migration. The cluster information is used to improve two major message logging protocols for fault tolerance: in one case we reduce the memory overhead, while in the other we limit the number of processes that must roll back during recovery.
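As a minimal sketch of the partitioning step described above, the program below groups ranks by partitioning a communication graph. The communication-volume matrix, the cluster count, and the choice of METIS as partitioner are illustrative assumptions, not the talk's actual implementation:

```c
/* Sketch: group MPI ranks into clusters by partitioning their
 * communication graph (METIS chosen here for illustration).
 * Compile with: cc cluster.c -lmetis */
#include <stdio.h>
#include <metis.h>

#define NRANKS 6

int main(void) {
    /* Hypothetical symmetric communication-volume matrix (bytes exchanged). */
    idx_t comm[NRANKS][NRANKS] = {
        {0, 9, 8, 0, 0, 1},
        {9, 0, 7, 0, 1, 0},
        {8, 7, 0, 1, 0, 0},
        {0, 0, 1, 0, 9, 8},
        {0, 1, 0, 9, 0, 7},
        {1, 0, 0, 8, 7, 0},
    };

    /* Build the CSR adjacency structure METIS expects. */
    idx_t xadj[NRANKS + 1], adjncy[NRANKS * NRANKS], adjwgt[NRANKS * NRANKS];
    idx_t nedges = 0;
    for (idx_t i = 0; i < NRANKS; i++) {
        xadj[i] = nedges;
        for (idx_t j = 0; j < NRANKS; j++)
            if (i != j && comm[i][j] > 0) {
                adjncy[nedges] = j;
                adjwgt[nedges] = comm[i][j];
                nedges++;
            }
    }
    xadj[NRANKS] = nedges;

    idx_t nvtxs = NRANKS, ncon = 1, nparts = 2, edgecut;
    idx_t part[NRANKS];
    /* Minimize the communication volume cut between clusters. */
    METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy, NULL, NULL, adjwgt,
                        &nparts, NULL, NULL, NULL, &edgecut, part);

    for (idx_t i = 0; i < NRANKS; i++)
        printf("rank %d -> cluster %d\n", (int)i, (int)part[i]);
    return 0;
}
```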
Leonardo Bautista, Titech
Transparent low-overhead checkpoint for GPU-accelerated clusters
Fast checkpointing will be a necessary feature of future large-scale systems. In particular, large GPU-accelerated systems lack an efficient checkpoint-restart mechanism able to checkpoint CUDA applications transparently (without code modification); most current fault tolerance techniques either do not support CUDA applications or have severe limitations. We propose a transparent, low-overhead checkpointing technique for GPU-accelerated clusters that avoids the I/O bottleneck by using erasure codes and SSDs on the compute nodes. We achieve this by combining mature production tools, such as BLCR and Open MPI, with our previous work and some newly developed components.
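To illustrate the erasure-coding idea only (the actual mechanism combines BLCR, Open MPI, and SSD-local storage), here is a minimal single-parity sketch: each group of checkpoint chunks gets an XOR parity block, so any one lost chunk can be rebuilt from the survivors without touching the parallel file system. Group and chunk sizes are toy values:

```c
/* Sketch: single-failure erasure coding of checkpoint chunks via XOR parity. */
#include <stdio.h>
#include <string.h>

#define GROUP 4        /* checkpoint chunks per node group */
#define CHUNK 8        /* bytes per chunk (tiny, for demo) */

/* parity[i] = chunk0[i] ^ chunk1[i] ^ ... */
static void xor_parity(unsigned char chunks[GROUP][CHUNK],
                       unsigned char parity[CHUNK]) {
    memset(parity, 0, CHUNK);
    for (int c = 0; c < GROUP; c++)
        for (int i = 0; i < CHUNK; i++)
            parity[i] ^= chunks[c][i];
}

/* Rebuild one lost chunk from the survivors plus the parity block. */
static void rebuild(unsigned char chunks[GROUP][CHUNK],
                    unsigned char parity[CHUNK], int lost) {
    memcpy(chunks[lost], parity, CHUNK);
    for (int c = 0; c < GROUP; c++)
        if (c != lost)
            for (int i = 0; i < CHUNK; i++)
                chunks[lost][i] ^= chunks[c][i];
}

int main(void) {
    unsigned char chunks[GROUP][CHUNK], parity[CHUNK], saved[CHUNK];
    for (int c = 0; c < GROUP; c++)            /* fake checkpoint data */
        for (int i = 0; i < CHUNK; i++)
            chunks[c][i] = (unsigned char)(17 * c + i);

    xor_parity(chunks, parity);                /* encode before a failure */
    memcpy(saved, chunks[2], CHUNK);
    memset(chunks[2], 0, CHUNK);               /* chunk 2 is "lost"       */
    rebuild(chunks, parity, 2);                /* recover from local data */
    printf("chunk 2 %s\n",
           memcmp(saved, chunks[2], CHUNK) == 0 ? "recovered" : "corrupt");
    return 0;
}
```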
Gabriel Antoniu, INRIA/IRISA
Concurrency-optimized I/O for visualizing HPC simulations: An Approach Using Dedicated I/O cores
Research at the Joint INRIA-UIUC Lab for Petascale Computing is currently in progress in several directions, with the global goal of efficiently exploiting this machine, which will serve to run heavy data-intensive or computation-intensive simulations. Such simulations usually need to be coupled with visualization tools. On supercomputers, previous studies have already shown the need to adapt the I/O path from data generation to visualization. We focus on a particular tornado simulation that is intended to run on Blue Waters. This simulation currently generates large amounts of data in many files, in a way that is not well suited to subsequent visualization. We describe an approach to this problem based on the use of dedicated I/O cores. As a further step, we intend to explore the use of BlobSeer, a large-scale data management service, as an intermediate layer between the simulation, the file system, and the visualization tools. We propose to go further in this approach by enabling BlobSeer to run on dedicated cores and to schedule the I/O operations coming from the simulation.
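A minimal sketch of the dedicated-I/O-core pattern, assuming a fixed number of cores per node and plain MPI collectives; the real data path (and the BlobSeer layer) is more elaborate than this:

```c
/* Sketch: reserve one core per node for I/O and route simulation output
 * through it. The node size and message flow here are illustrative.
 * Run with: mpicc io_cores.c && mpirun -n 8 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define CORES_PER_NODE 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int node  = rank / CORES_PER_NODE;
    int is_io = (rank % CORES_PER_NODE == 0);     /* core 0 of each node */

    /* One sub-communicator per node: its I/O core gathers local output. */
    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node, rank, &node_comm);

    double field = 1.0 * rank;                    /* stand-in for sim data */
    double buf[CORES_PER_NODE];
    MPI_Gather(&field, 1, MPI_DOUBLE, buf, 1, MPI_DOUBLE, 0, node_comm);

    if (is_io) {
        /* Only the I/O core touches the file system (or BlobSeer), while
         * compute cores return immediately to the simulation. */
        int local;
        MPI_Comm_size(node_comm, &local);
        printf("node %d: I/O core writing %d values\n", node, local);
    }
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```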
Frédéric Vivien, INRIA/ENS Lyon
On scheduling the checkpoints of exascale applications
Checkpointing is one of the tools used to provide resilience for applications run on failure-prone platforms. It is usually claimed that checkpoints should occur periodically, as such a policy is optimal. However, most of the existing proofs rely on approximations. One such assumption is that the probability that a fault occurs during the execution of an application is very small, an assumption that is no longer valid in the context of exascale platforms. We have begun studying this problem in a fully general context. We have established that, when failures follow a Poisson law, the periodic checkpointing policy is optimal. We have also shown an unexpected result: in some cases, when the platform is sufficiently large, the checkpointing cost sufficiently high, or the failures frequent enough, one should limit the application's parallelism and duplicate tasks rather than fully parallelize the application across the whole platform.
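For context, the classical first-order result that this work generalizes is the Young/Daly approximation: the optimal checkpoint period is roughly sqrt(2*C*M) for checkpoint cost C and platform MTBF M. The sketch below uses illustrative numbers; the talk's fully general analysis does not rely on this approximation:

```c
/* Sketch: first-order (Young/Daly) optimal checkpoint period and the
 * resulting waste fraction. Values of C and M are illustrative. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double C = 600.0;        /* checkpoint cost: 10 min (illustrative) */
    double M = 86400.0;      /* platform MTBF: 1 day (illustrative)    */

    double tau = sqrt(2.0 * C * M);          /* optimal period          */
    /* waste ~ C/tau (checkpoint overhead) + tau/(2M) (lost work)       */
    double waste = C / tau + tau / (2.0 * M);

    printf("checkpoint every %.0f s, expected waste %.1f%%\n",
           tau, 100.0 * waste);
    return 0;
}
```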
Christian Perez, INRIA/ENS Lyon
High Performance Component with Charm++ and OpenAtom
Software component models appear as a solution for handling the complexity and evolution of applications. They turn out to be a powerful abstraction mechanism for dealing with parallel and heterogeneous machines, as they enable the structure of an application to be manipulated and hence specialized. HLCM is a hierarchical component model with support for genericity and connectors that enables an application to be adapted to the resources as well as to the input parameters. HLCM is an abstract model, as it does not depend on a particular primitive component implementation. This talk will present our ongoing work on defining and implementing HLCM/Charm++, a specialization of HLCM whose primitive components are expressed in Charm++. It will also report on a study of the benefits HLCM/Charm++ can bring to OpenAtom.
Marc Baboulin, INRIA/Univ. Paris Sud
Accelerating linear algebra computations with hybrid GPU-multicore systems
We describe how hybrid multicore+GPU systems can be used to enhance the performance of linear algebra libraries in high-performance computing.
We illustrate this approach with the solution of general linear systems based on a hybrid LU factorization, where we split the computation between a multicore and a graphics processor and use statistical techniques to reduce the amount of pivoting and of communication between the hybrid components. We also show how mixed precision algorithms can be used to accelerate performance.
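A CPU-only sketch of the mixed-precision idea, using LAPACKE/CBLAS on an illustrative 3x3 system: factor once in fast (single) precision, then recover double-precision accuracy through cheap refinement steps. In the actual work the factorization is offloaded to the GPU; this sketch only shows the numerical pattern:

```c
/* Sketch: mixed-precision iterative refinement.
 * Compile with: cc refine.c -llapacke -lcblas -lm */
#include <stdio.h>
#include <lapacke.h>
#include <cblas.h>

#define N 3

int main(void) {
    double A[N*N] = {4, 1, 0,
                     1, 4, 1,
                     0, 1, 4};              /* row-major, illustrative */
    double b[N]   = {5, 6, 5};              /* exact solution: (1,1,1) */

    /* Factor a single-precision copy (the "fast" part). */
    float As[N*N], xs[N];
    for (int i = 0; i < N*N; i++) As[i] = (float)A[i];
    lapack_int ipiv[N];
    LAPACKE_sgetrf(LAPACK_ROW_MAJOR, N, N, As, N, ipiv);

    for (int i = 0; i < N; i++) xs[i] = (float)b[i];
    LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', N, 1, As, N, ipiv, xs, 1);

    double x[N];
    for (int i = 0; i < N; i++) x[i] = xs[i];

    /* Refine in double precision: r = b - A*x, then correct x by
     * re-solving with the already-computed single-precision factors. */
    for (int it = 0; it < 3; it++) {
        double r[N];
        for (int i = 0; i < N; i++) r[i] = b[i];
        cblas_dgemv(CblasRowMajor, CblasNoTrans, N, N,
                    -1.0, A, N, x, 1, 1.0, r, 1);
        float rs[N];
        for (int i = 0; i < N; i++) rs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_ROW_MAJOR, 'N', N, 1, As, N, ipiv, rs, 1);
        for (int i = 0; i < N; i++) x[i] += rs[i];
    }
    printf("x = (%.15f, %.15f, %.15f)\n", x[0], x[1], x[2]);
    return 0;
}
```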
Désiré Nuentsa Wakam, INRIA/IRISA
Parallel Implementation of deflated GMRES in the PETSc package
The deflation process is effective at preventing stagnation in the GMRES iterative method. However, it induces extra operations, as the spectral information must be computed at each restart. In this work, we develop an adaptive strategy that switches to the deflated version when stagnation is detected in the iterative process. We then provide a parallel implementation as a new KSP type in the PETSc package. Several tests are performed to show the usefulness of this approach on real applications.
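Deflated GMRES is available in PETSc as the DGMRES KSP type. Below is a minimal usage sketch on a toy 1-D Laplacian; the matrix, the option hint, and the KSPSetOperators() calling convention of PETSc 3.5 and later are assumptions of the sketch (error checking omitted for brevity):

```c
/* Sketch: selecting the deflated GMRES KSP (DGMRES) in PETSc.
 * Run with: mpirun -n 2 ./dgmres_demo -ksp_monitor */
#include <petscksp.h>

int main(int argc, char **argv) {
    PetscInitialize(&argc, &argv, NULL, NULL);

    PetscInt n = 100, Istart, Iend;
    Mat A; Vec x, b; KSP ksp;

    /* Assemble a distributed tridiagonal (1-D Laplacian) test matrix. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (PetscInt i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPDGMRES);   /* deflated GMRES */
    KSPSetFromOptions(ksp);       /* e.g. -ksp_dgmres_eigen 4 */
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```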
Daisuke Takahashi, U. Tsukuba
Optimization of a Parallel 3-D FFT with 2-D Decomposition
In this talk, an optimization method for the parallel 3-D fast Fourier transform (FFT) with 2-D decomposition is presented. The 2-D decomposition effectively improves performance by reducing the communication time for larger numbers of MPI processes. Another way to reduce the communication overhead is to overlap communication and computation; an overlapping method for the parallel 3-D FFT is also presented. Performance results of parallel 3-D FFTs on clusters of multi-core processors are reported.
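A schematic of the communication/computation overlap using MPI-3's nonblocking all-to-all: the per-block compute is a stub standing in for the real 1-D FFTs, and the blocking scheme is illustrative rather than the talk's actual 2-D decomposition:

```c
/* Sketch: pipelining the 3-D FFT transpose. Post the nonblocking
 * all-to-all for block k, then compute on block k-1 while it progresses. */
#include <mpi.h>
#include <stdlib.h>

#define NBLK 4           /* pipeline depth             */
#define LEN  4096        /* doubles per rank per block */

static void fft_block(double *data, int n) {
    for (int i = 0; i < n; i++) data[i] *= 2.0;   /* stand-in for 1-D FFTs */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *send[NBLK], *recv[NBLK];
    for (int k = 0; k < NBLK; k++) {
        send[k] = malloc(sizeof(double) * LEN * nprocs);
        recv[k] = malloc(sizeof(double) * LEN * nprocs);
        for (int i = 0; i < LEN * nprocs; i++) send[k][i] = k + i;
    }

    MPI_Request req[NBLK];
    /* Prime the pipeline with the first exchange. */
    MPI_Ialltoall(send[0], LEN, MPI_DOUBLE, recv[0], LEN, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req[0]);
    for (int k = 1; k < NBLK; k++) {
        /* Post the next transpose before touching the previous block... */
        MPI_Ialltoall(send[k], LEN, MPI_DOUBLE, recv[k], LEN, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req[k]);
        /* ...then overlap: compute FFTs on block k-1 while k is in flight. */
        MPI_Wait(&req[k - 1], MPI_STATUS_IGNORE);
        fft_block(recv[k - 1], LEN * nprocs);
    }
    MPI_Wait(&req[NBLK - 1], MPI_STATUS_IGNORE);
    fft_block(recv[NBLK - 1], LEN * nprocs);

    for (int k = 0; k < NBLK; k++) { free(send[k]); free(recv[k]); }
    MPI_Finalize();
    return 0;
}
```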
Alex Yee, UIUC
A Single-Transpose implementation of the Distributed out-of-order 3D-FFT
The classic approach to computing the distributed in-order 3D-FFT requires up to 3 expensive all-to-all communication transpose steps. Given the memory-bound nature of the FFT, these transposes are dominant factors in the total run-time. Here we present a new approach that reduces the number of transposes to 2 for the in-order transform, and 1 for the out-of-order transform.
Jeongnim Kim, NCSA, UIUC
Toward petaflop 3D FFT on clusters of SMPs
A wide range of scientific applications employ 3D FFTs. Sustained petaflop 3D FFT performance is necessary to meet the NSF Direct Numerical Simulation (DNS) turbulence benchmark on Blue Waters, which represents the current generation of HPC platforms: clusters of multi/many-core SMPs. I present an analysis of 3D FFT implementations and optimization strategies on Blue Waters. Also discussed is the design of a parallel 3D FFT library that can meet the diverse requirements of applications using 3D FFTs.
A wide range of scientific applications employs 3D FFT. Sustained petaflop performance of 3D FFT is necessary to meet the NSF Direct Numerical Simulation (DNS) turbulence benchmark on the Blue Waters which represents the current generation of HPC platforms, clusters of multi/many-core SMPs. I present the analysis of 3D FFT implementations and the optimization strategies on the BW. Also discussed is the design of parallel 3D FFT library that can meet the diverse requirements of applications using 3D FFT.