| Main Topics | Schedule | Speakers | Types of presentation | Titles (tentative) | Download |
|---|---|---|---|---|---|
| Dinner | Sunday Nov. 21st | Radio Maria | | | |
Workshop Day 1 (Auditorium) | Monday Nov. 22nd

| Main Topics | Schedule | Speakers | Types of presentation | Titles (tentative) | Download |
|---|---|---|---|---|---|
| Welcome and Introduction | 08:30 | Franck Cappello, INRIA & UIUC, France, and Thom Dunning, NCSA, USA | Background | Workshop details | |
| Post-Petascale and Exascale Systems, chair: Franck Cappello | 08:45 | Mitsuhisa Sato, U. Tsukuba, Japan | Trends in HPC | | |
| | 09:15 | Marc Snir, UIUC, USA | Trends in HPC | | |
| | 09:45 | Wen-Mei Hwu, UIUC, USA | Trends in HPC | | |
| | 10:15 | Arun Rodrigues, Sandia, USA | Trends in HPC | | |
| | 10:45 | Break | | | |
| Post-Petascale Applications and System Software, chair: Marc Snir | 11:15 | Pete Beckman, ANL, USA | Trends in HPC | Exascale Software Center | |
| | 11:45 | Michael Norman, SDSC, USA | Trends in HPC | | |
| | 12:15 | Eric Bohm, UIUC, USA | Trends in HPC | Scaling NAMD into the Petascale and Beyond | |
| | 12:45 | Lunch | | | |
| BLUE WATERS, chair: Bill Gropp | 14:00 | Bill Kramer, NCSA, USA | Overview | Update on Blue Waters | |
| Collaborations on System Software | 14:30 | Ana Gainaru, NCSA, USA | Early Results | Framework for Event Log Analysis in HPC | |
| | 15:00 | Thomas Ropars, INRIA, France | Results | Latest Progresses on Rollback-Recovery Protocols for Send-Deterministic Applications | |
| | 15:30 | Esteban Meneses, UIUC, USA | Early Results | Clustering Message Passing Applications to Enhance Fault Tolerance Protocols | |
| | 16:00 | Break | | | |
| Collaborations on System Software, chair: Bill Kramer | 16:30 | Leonardo Bautista, Titech, Japan | Results / International collaboration with Japan | Transparent low-overhead checkpoint for GPU-accelerated clusters | |
| | 17:00 | Gabriel Antoniu, INRIA/IRISA, France | Results | Concurrency-optimized I/O for visualizing HPC simulations: An Approach Using Dedicated I/O cores | |
| | 17:30 | Mathias Jacquelin, INRIA/ENS Lyon, France | Results | Comparing archival policies for BlueWaters | |
| | 18:00 | Olivier Richard, Joseph Emeras, INRIA/U. Grenoble, France | Early Results | Studying the RJMS, applications and File System triptych: a first step toward experimental approach | |
| Dinner | 19:30 | Gould's | | | |
Workshop Day 2 (Auditorium) | Tuesday Nov. 23rd

| Main Topics | Schedule | Speakers | Types of presentation | Titles (tentative) | Download |
|---|---|---|---|---|---|
| Collaborations on System Software, chair: Raymond Namyst | 08:30 | Torsten Hoefler, NCSA, USA | Potential collaboration | Application Performance Modeling on Petascale and Beyond | |
| | 09:00 | Frédéric Vivien, INRIA/ENS Lyon, France | Potential collaboration | On scheduling the checkpoints of exascale applications | |
| Collaborations on Programming models | 09:30 | Thierry Gautier, INRIA, France | Early Results / Potential collaboration | On the cost of managing data flow dependencies for parallel programming | |
| | 10:00 | Jean-François Méhaut, Laercio Pilla, INRIA/U. Grenoble, France | Early Results | Charm++ on NUMA Platforms: the impact of SMP Optimizations and a NUMA-aware Load Balancing | |
| | 10:30 | Break | | | |
| chair: Sanjay Kale | 11:00 | Raymond Namyst, INRIA/U. Bordeaux, France | Early Results / Potential collaboration | Bridging the gap between runtime systems and programming languages on heterogeneous GPU clusters | |
| | 11:30 | Brian Amedro, INRIA/U. Nice, France | Potential collaboration | | |
| | 12:00 | Christian Perez, INRIA/ENS Lyon, France | Early Results | | |
| | 12:30 | Lunch | | | |
| Collaborations on Numerical Algorithms and Libraries, chair: Mitsuhisa Sato | 14:00 | Luke Olson, Bill Gropp, UIUC, USA | Early Results | | |
| | 14:30 | Simplice Donfack, INRIA/U. Paris Sud, France | Early Results | Improving data locality in communication avoiding LU and QR factorizations | |
| | 15:00 | Desiré Nuentsa, INRIA/IRISA, France | Early Results | Parallel Implementation of deflated GMRES in the PETSc package | |
| | 15:30 | Sebastien Fourestier, INRIA/U. Bordeaux, France | Early Results | | |
| | 16:00 | Break | | | |
| chair: Luke Olson | 16:15 | Marc Baboulin, INRIA/U. Paris Sud, France | Early Results | Accelerating linear algebra computations with hybrid GPU-multicore systems | |
| | 16:45 | Daisuke Takahashi, U. Tsukuba, Japan | Results / International collaboration with Japan | | |
| | 17:15 | Alex Yee, UIUC, USA | Early Results | A Single-Transpose implementation of the Distributed out-of-order 3D-FFT | |
| | 17:35 | Jeongnim Kim, NCSA, USA | Early Results | | |
| Dinner | 19:30 | Escobar's | | | |
Workshop Day 3 (Auditorium) | Wednesday Nov. 24th

| Main Topics | Schedule | Speakers | Types of presentation | Titles (tentative) | Room |
|---|---|---|---|---|---|
| Break-out sessions introduction | 08:30 | Cappello, Snir | Overview | Objectives of the break-out sessions, expected results | Auditorium |
| Break-out session 1 | 9:00-10:15 | (topics and rooms below) | | | |
| Break | 10:15 | | | | |
| Break-out session 2 | 10:30-11:45 | (topics and rooms below) | | | |
| Break-out session report | 12:00 | Speakers: Snir, Cappello, Gropp, Kramer, Kale, Olson | | | Auditorium |
| Closing | 12:30 | Cappello, Snir | | | Auditorium |
| Lunch | 13:00 | | | | |
| Dinner | 19:00 | Buttitta's | | | |

Break-out session 1 (9:00-10:15)

| Topics | Participants | Other NCSA participants | Room |
|---|---|---|---|
| Routing, topology mapping, scheduling, perf. modeling | Snir, Hoefler, Vivien, Gautier, Jeannot, Kale, Namyst, Méhaut, Bohm, Pilla, Amedro, Perez, Baboulin | | Room 1030 |
| Resilience / 3D-FFT | Kramer, Cappello, Takahashi, Yee, Jeongnim, Gainaru, Ropars, Meneses, Bautista, Antoniu, Richard, Fourestier, Jacquelin | | Room 1040 |
| Libraries | Gropp, Baboulin, Olson, Désiré, Simplice, Sébastien Fourestier | | Room 1104 |

Break-out session 2 (10:30-11:45)

| Topics | Participants | Other NCSA participants | Room |
|---|---|---|---|
| Resilience | Kramer, Cappello, Gainaru, Ropars, Meneses, Bautista | | |
| Programming models / GPU | Kale, Méhaut, Namyst, Hwu, Amedro, Perez, Hoefler, Jeannot, Bohm, Pilla, Baboulin, Fourestier, Gautier | | Room 1030 |
| I/O | Snir, Vivien, Jacquelin, Antoniu, Richard, Kramer, Gainaru, Ropars | | Room 1040 |
| 3D-FFT | Cappello, Takahashi, Yee, Jeongnim, Hoefler | | Room 1104 |
Abstracts
Michael Norman, SDSC
Cosmological simulations present well-known difficulties scaling to large core counts because of the large spatial inhomogeneities and vast range of length scales induced by gravitational instability. These difficulties are compounded when baryonic physics is included, which introduces its own multiscale challenges. In this talk I review efforts to scale the Enzo adaptive mesh refinement hydrodynamic cosmology code to O(100,000) cores, and I also discuss Cello, an extremely scalable AMR infrastructure under development at UCSD for the next generation of computer architectures, which will underpin petascale Enzo.
Eric Bohm, UIUC
Scaling NAMD into the Petascale and Beyond
Many challenges arise when employing ever larger supercomputers for the simulation of biological molecules in the context of a mature molecular dynamics code. Issues stemming from the scaling up of problem size, such as input and output, require both parallelization and revisions to legacy file formats. Order-of-magnitude increases in the number of processor cores evoke problems with O(P) structures, load balancing, and performance analysis. New architectures present code optimization opportunities (VSX SIMD) which must be carefully applied to provide the desired performance improvements without dire costs in implementation time and code quality. Looking beyond these imminent concerns for sustained petaflop performance on Blue Waters, we will also consider scalability concerns for future exascale machines.
Bill Kramer, NCSA
Blue Waters: A Super-System to Explore the Expanse and Depth of 21st Century Science
While many people think that Blue Waters means a single Power7 IH supercomputer, in reality the Blue Waters Project is deploying an entire system architecture that includes an eco-system surrounding the Power7 IH system to make it highly effective for ultra-scale science and engineering. This is what we term the Blue Waters "Super-System", which we will describe in detail in this talk along with its corresponding service architecture.
Ana Gainaru, UIUC/NCSA
Framework for Event Log Analysis in HPC
In this talk, we present a fault analysis framework that combines different event analysis modules. We present the clustering module that extracts message patterns from log files. We also describe how treating event repetitions as signals can help system administrators mine information about failure causes post-mortem, or even help the system take proactive measures. The modules work in a pipeline: the first one feeds event templates to the second, which decides whether the event signals are periodic, partially periodic, or noise. We also analyse whether a change in the characteristics of one event induces changes in other signals.
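To make the pipeline concrete, here is a minimal, self-contained sketch (not the actual framework) of the two stages described above: collapsing log lines into message templates, then classifying each template's occurrences as periodic or noise from the regularity of their inter-arrival times. The masking rules, the toy log, and the threshold are assumptions made for the example.

```python
# Hypothetical sketch: group log lines into message templates, then test
# whether a template's occurrences form a (roughly) periodic signal.
import re
from collections import defaultdict
from statistics import mean, pstdev

def template_of(line: str) -> str:
    """Mask numbers and hex IDs so lines with the same structure collapse
    into one template (a rough stand-in for the clustering module)."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

def classify(timestamps, cv_threshold=0.2):
    """Label a template's event signal as 'periodic' or 'noise' from the
    regularity of its inter-arrival times (coefficient of variation)."""
    if len(timestamps) < 3:
        return "noise"
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    cv = pstdev(gaps) / mean(gaps) if mean(gaps) > 0 else float("inf")
    return "periodic" if cv < cv_threshold else "noise"

log = [(10, "node12: ECC error at 0x1f3a"), (70, "node12: ECC error at 0x2b90"),
       (130, "node12: ECC error at 0x0c11"), (95, "node7: link retrain")]

events = defaultdict(list)
for t, line in sorted(log):
    events[template_of(line)].append(t)

for tpl, ts in events.items():
    print(classify(ts), "|", tpl)
```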
Thomas Ropars, INRIA
Latest Results in Rollback-Recovery Protocols for Send-Deterministic Applications.
In very large scale HPC systems, rollback-recovery techniques are mandatory to ensure the correct termination of applications despite failures. Nowadays, coordinated checkpointing is almost always used, mainly because it is simple to implement and use. However, coordinated checkpointing has several drawbacks: i) at checkpoint time, it stresses the file system, since the images of all application processes have to be saved "at the same time"; ii) recovery is energy consuming, because a single failure makes all application processes roll back to their last checkpoint. Alternatives based on message logging have never been widely adopted, mainly because of the additional cost induced by logging all messages during failure-free execution. It has been shown that most MPI HPC applications are send-deterministic, i.e., the sequence of message sends in an execution is deterministic. We are trying to take advantage of this property to design new rollback-recovery protocols that overcome the limits of existing approaches. In this talk, we first present an uncoordinated checkpointing protocol that does not suffer from the domino effect while logging only a small subset of the application messages. We present experimental results showing its good performance in failure-free execution over a high-performance network. We also show how applying process clustering to this protocol can limit the number of processes to roll back after a single failure to 50% on average. Then we introduce a new protocol combining cluster-based coordinated checkpointing and inter-cluster message logging to further reduce the amount of rolled-back computation after a failure.
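The following toy sketch (plain Python, not an MPI implementation) illustrates the combination of process clustering and selective message logging described at the end of the talk: only inter-cluster messages are logged, and a single failure rolls back only the failed process's cluster. The cluster assignment and message list are invented for the example.

```python
# Toy sketch (not the actual protocol): with processes grouped into clusters,
# log only inter-cluster messages; a failure then rolls back only the failed
# process's cluster, replaying the logged messages that cross into it.
clusters = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}          # hypothetical clustering
cluster_of = {p: c for c, ps in clusters.items() for p in ps}

messages = [(0, 1), (1, 2), (3, 4), (4, 5), (6, 2)]     # (sender, receiver) pairs

logged = [m for m in messages if cluster_of[m[0]] != cluster_of[m[1]]]
print("logged (inter-cluster) messages:", logged)        # only 2 of 5 are logged

failed = 5
rollback = clusters[cluster_of[failed]]
replay = [m for m in logged if m[1] in rollback]
print("processes to roll back:", rollback)
print("logged messages to replay into the cluster:", replay)
```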
Esteban Meneses, UIUC
Clustering Message Passing Applications to Enhance Fault Tolerance Protocols
This talk describes the effort of an ongoing collaboration to find meaningful clusters in a parallel computing application using its communication behavior. We start by showing the communication pattern of various MPI benchmarks and how we can use standard graph partitioning techniques to group the ranks into subsets. For Charm++ applications, we describe the changes on the runtime system to dynamically find the clusters even in the presence of object migration. The information about clusters is used to improve two major message logging protocols for fault tolerance. In one case, we manage to reduce its memory overhead, while in the other we are able to limit the number of processes to roll back during recovery.
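As an illustration of the graph-partitioning step, the sketch below builds a weighted communication graph from hypothetical per-pair message counts and bisects it with Kernighan-Lin from networkx; the actual collaboration may use different tools and objective functions.

```python
# Illustrative sketch: build a weighted communication graph from per-pair
# message counts and split the ranks into two clusters with Kernighan-Lin.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Hypothetical message counts between MPI ranks: (src, dst, messages).
traffic = [(0, 1, 500), (1, 2, 480), (2, 3, 510), (0, 3, 490),
           (4, 5, 505), (5, 6, 470), (6, 7, 515), (4, 7, 495),
           (3, 4, 20)]  # the two groups talk to each other only rarely

G = nx.Graph()
for src, dst, count in traffic:
    G.add_edge(src, dst, weight=count)

part_a, part_b = kernighan_lin_bisection(G, weight="weight")
print("cluster A:", sorted(part_a))
print("cluster B:", sorted(part_b))
```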
Leonardo Bautista, Titech
Transparent low-overhead checkpoint for GPU-accelerated clusters
Fast checkpointing will be a necessary feature for future large-scale systems. In particular, large GPU-accelerated systems lack an efficient checkpoint-restart mechanism able to checkpoint CUDA applications in a transparent fashion (without code modification). Most current fault tolerance techniques do not support CUDA applications or have severe limitations. We propose a transparent low-overhead checkpointing technique for GPU-accelerated clusters that avoids the I/O bottleneck by using erasure codes and SSDs on the compute nodes. We achieve this by combining mature production tools, such as BLCR and Open MPI, with our previous work and some newly developed components.
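A minimal sketch of the underlying idea, assuming a simple single-parity (XOR) code: each node keeps its checkpoint on a local SSD and the group stores one parity block, so a single lost checkpoint can be rebuilt from the survivors without going through the parallel file system. The proposed technique presumably uses stronger codes; this only covers the one-failure case.

```python
# Minimal sketch of checkpointing with an erasure code: XOR parity across a
# group of nodes lets any single lost checkpoint be reconstructed locally.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Hypothetical checkpoints of a 4-node group (padded to equal length).
ckpts = [b"node0-state!", b"node1-state!", b"node2-state!", b"node3-state!"]
parity = xor_blocks(ckpts)

# Node 2 fails: rebuild its checkpoint from the survivors plus the parity.
survivors = [c for i, c in enumerate(ckpts) if i != 2]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == ckpts[2]
print("recovered:", rebuilt)
```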
Mathias Jacquelin, INRIA/ENS Lyon
Comparing archival policies for BlueWaters
In this work, we introduce two archival policies tailored for the tape storage system that will be available on BlueWaters. We also show how to adapt the well-known RAIT strategy (the counterpart of the RAID policy for tapes) to BlueWaters. We provide an analytical model of the BlueWaters tape storage platform, and use it to assess and analyze the performance of the three policies through simulations. We use random workloads whose characteristics model various realistic scenarios. The throughput of the system, as well as the average (weighted) response time for each user, are the main objectives.
Gabriel Antoniu, INRIA/IRISA
Concurrency-optimized I/O for visualizing HPC simulations: An Approach Using Dedicated I/O cores
Research at the Joint INRIA-UIUC Lab for Petascale Computing is currently in progress in several directions, with the global goal of efficiently exploiting this machine, which will serve to run heavy, data-intensive or computation-intensive simulations. Such simulations usually need to be coupled with visualization tools. On supercomputers, previous studies have already shown the need to adapt the I/O path from data generation to visualization. We focus on a particular tornado simulation that is intended to run on BlueWaters. This simulation currently generates large amounts of data in many files, in a way that is not well suited to subsequent visualization. We describe an approach to this problem based on the use of dedicated I/O cores. As a further step, we intend to explore the use of BlobSeer, a large-scale data management service, as an intermediate layer between the simulation, the file system and the visualization tools. We propose to go further in this approach by enabling BlobSeer to run on dedicated cores and to schedule I/O operations coming from the simulation.
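For readers unfamiliar with the dedicated I/O core pattern, here is a hedged mpi4py sketch of the general idea (it is not the code used in this work): one core per node is split off into a separate communicator and receives the output of the node's compute cores, decoupling writing from the simulation. The node width, tags, and data layout are assumptions made for the example.

```python
# Sketch of the "dedicated I/O cores" pattern with mpi4py (illustrative only).
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()
CORES_PER_NODE = 4                      # hypothetical node width
is_io = (rank % CORES_PER_NODE == 0)    # dedicate the first core of each node

# Compute ranks and I/O ranks each get their own communicator for collectives.
local = world.Split(color=1 if is_io else 0, key=rank)

if is_io:
    # I/O core: receive blocks from the compute cores of its node and write
    # them out asynchronously with respect to the simulation.
    for _ in range(CORES_PER_NODE - 1):
        data = world.recv(source=MPI.ANY_SOURCE, tag=7)
        # ... write `data` to the file system / visualization pipeline ...
else:
    my_io_core = (rank // CORES_PER_NODE) * CORES_PER_NODE
    world.send({"rank": rank, "field": [0.0] * 8}, dest=my_io_core, tag=7)
```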
Olivier Richard, Joseph Emeras, INRIA/U. Grenoble
Studying the RJMS, applications and File System triptych: a first step toward experimental approach
In a High Performance Computing infrastructure, it is particularly difficult to master the architecture as a whole. With the physical infrastructure, the platform management software and the users' applications, understanding the global behavior and diagnosing problems is quite challenging. This is even more true in a petascale context, with thousands of compute nodes to manage and a high occupation rate of the resources. A global study of the platform must thus consider the Resource and Job Management System (RJMS), the File System and the Applications triptych as a whole. Studying their behavior is complicated, because it requires some knowledge of the applications' requirements in terms of physical resources and access to the File System. In this presentation, we propose a first step toward an experimental approach that mixes job workload patterns and File System access patterns which, once combined, yield a full set of job behaviors. These synthetic jobs will then be used to test and benchmark the infrastructure, considering both the RJMS and the File System.
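The sketch below shows one possible way to compose such synthetic jobs by crossing a job workload pattern with a file system access pattern; all pattern names and numbers are invented for the illustration and are not taken from the talk.

```python
# Illustrative sketch of composing synthetic jobs from a job-workload pattern
# (size, walltime) and a file-system access pattern. All values are made up.
import itertools, random

random.seed(0)

workload_patterns = [
    {"name": "small-burst", "nodes": 4,   "walltime_s": 600},
    {"name": "large-long",  "nodes": 512, "walltime_s": 14400},
]
fs_patterns = [
    {"name": "checkpoint-heavy", "write_gb_per_node": 2.0, "read_gb_per_node": 0.1},
    {"name": "input-heavy",      "write_gb_per_node": 0.1, "read_gb_per_node": 4.0},
]

def make_job(job_id, wl, fs, arrival_s):
    return {
        "id": job_id, "arrival_s": arrival_s,
        "nodes": wl["nodes"], "walltime_s": wl["walltime_s"],
        "write_gb": wl["nodes"] * fs["write_gb_per_node"],
        "read_gb":  wl["nodes"] * fs["read_gb_per_node"],
        "profile": f'{wl["name"]}+{fs["name"]}',
    }

arrival = 0.0
jobs = []
for i, (wl, fs) in enumerate(itertools.product(workload_patterns, fs_patterns)):
    arrival += random.expovariate(1 / 300)   # Poisson-like job arrivals
    jobs.append(make_job(i, wl, fs, round(arrival)))

for job in jobs:
    print(job)
```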
Torsten Hoefler, NCSA
Application Performance Modeling on Petascale and Beyond
Performance modeling of parallel applications is gaining importance. It can not only help to predict scalability and find performance bottlenecks, but it can also help to understand trade-offs in the design space of computing systems and drive hardware-software co-design of future computing systems. We will discuss established performance modeling techniques and propose a mixed approach to analytic application performance modeling. We then discuss open problems and possible future research directions.
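As a small example of analytic performance modeling, the sketch below fits a simple three-term model T(p) = a + b·N/p + c·log2(p) (serial part, parallel work, communication) to a few made-up measurements and extrapolates to larger core counts. Both the model form and the numbers are assumptions for the example, not material from the talk.

```python
# Fit a toy analytic model T(p) = a + b*(N/p) + c*log2(p) and extrapolate.
import numpy as np

N = 1_000_000
procs    = np.array([64, 128, 256, 512, 1024])
runtimes = np.array([160.0, 82.5, 44.0, 24.8, 15.4])   # made-up measurements

# Least-squares fit of the three coefficients.
A = np.column_stack([np.ones_like(procs, dtype=float), N / procs, np.log2(procs)])
(a, b, c), *_ = np.linalg.lstsq(A, runtimes, rcond=None)

for p in (2048, 4096, 8192):
    t = a + b * N / p + c * np.log2(p)
    print(f"predicted runtime on {p:5d} cores: {t:6.1f} s")
```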
Frédéric Vivien, INRIA/ENS Lyon
On scheduling the checkpoints of exascale applications
Checkpointing is one of the tools used to provide resilience to applications running on failure-prone platforms. It is usually claimed that checkpoints should occur periodically, as such a policy is optimal. However, most of the existing proofs rely on approximations. One such assumption is that the probability that a fault occurs during the execution of an application is very small, an assumption that is no longer valid in the context of exascale platforms. We have begun studying this problem in a fully general context. We have established that, when failures follow a Poisson law, the periodic checkpointing policy is optimal. We have also shown an unexpected result: in some cases, when the platform is sufficiently large, the checkpointing cost sufficiently high, or the failures frequent enough, one should limit the application parallelism and duplicate tasks rather than fully parallelize the application on the whole platform.
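For context, the sketch below evaluates the classical first-order approximation of the checkpoint period (Young's formula, W ≈ √(2·C·M), with C the checkpoint cost and M the platform MTBF) for a few platform sizes. This is precisely the kind of approximation the talk revisits, so treat it as the baseline under discussion rather than the talk's result; the per-node MTBF and checkpoint cost are assumed values.

```python
# Young's first-order approximation of the checkpoint period at several scales.
from math import sqrt

node_mtbf_hours = 25 * 365 * 24        # assumed 25-year MTBF per node
checkpoint_cost_s = 600                # assumed 10-minute checkpoint

for nodes in (10_000, 100_000, 1_000_000):
    platform_mtbf_s = node_mtbf_hours * 3600 / nodes
    period_s = sqrt(2 * checkpoint_cost_s * platform_mtbf_s)
    overhead = checkpoint_cost_s / period_s
    print(f"{nodes:9d} nodes: MTBF {platform_mtbf_s/3600:7.1f} h, "
          f"period {period_s/60:6.1f} min, checkpoint overhead ~{overhead:.0%}")
```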
Jean-François Méhaut, INRIA/U. Grenoble
Charm++ on NUMA Platforms: the impact of SMP Optimizations and a NUMA-aware Load Balancing
Cache-coherent Non-Uniform Memory Access (ccNUMA) platforms based on multi-core chips are now a common resource in High Performance Computing. To overcome scalability issues in such platforms, the shared memory is physically distributed among several memory banks, and memory access costs may vary depending on the distance between processing units and data. The main challenge on a ccNUMA platform is to efficiently manage threads, data distribution and communication across all the machine's nodes. Charm++ is a parallel programming system that provides a portable programming model for platforms based on shared and distributed memory. In this work, we revisit some of the implementation decisions currently featured in Charm++ in the context of ccNUMA platforms. First, we study the impact of the new, shared-memory based, inter-object communication scheme used by Charm++, and show how this shared-memory approach can affect the performance of Charm++ on ccNUMA machines. Second, we conduct a performance evaluation of the CPU and memory affinity mechanisms provided by Charm++ on ccNUMA platforms. Results show that SMP optimizations and affinity support can improve the overall performance of our benchmarks by up to 75%. Finally, in light of these studies, we have designed and implemented a NUMA-aware load balancing algorithm that addresses the issues found. The performance evaluation of our prototype shows results as good as the ones obtained by GreedyLB and significant improvements when compared to GreedyCommLB.
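To illustrate what "NUMA-aware" can mean in a load balancer, here is a toy greedy placement sketch that scores each core by its current load plus a penalty proportional to the NUMA distance to the object's main communication partner. The distance matrix, object loads, and weighting factor are invented; this is not the algorithm evaluated in the talk.

```python
# Toy NUMA-aware greedy load balancer: heaviest objects placed first, each on
# the core minimizing current load + ALPHA * NUMA distance to its partner.
# Hypothetical 2-socket NUMA distance matrix between 4 cores (0,1 | 2,3).
numa_dist = [[0, 1, 2, 2],
             [1, 0, 2, 2],
             [2, 2, 0, 1],
             [2, 2, 1, 0]]

# (object id, load, core where its main communication partner currently sits)
objects = [("A", 9.0, 0), ("B", 8.0, 0), ("C", 7.0, 2), ("D", 3.0, 2), ("E", 2.0, 1)]
ALPHA = 1.5                       # weight of the NUMA penalty vs. raw load

load = [0.0, 0.0, 0.0, 0.0]
placement = {}
for name, work, partner_core in sorted(objects, key=lambda o: -o[1]):
    core = min(range(4), key=lambda c: load[c] + ALPHA * numa_dist[c][partner_core])
    placement[name] = core
    load[core] += work

print("placement:", placement)
print("per-core load:", load)
```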
Thierry Gautier, INRIA
On the cost of managing data flow dependencies for parallel programming
Several parallel programming languages and libraries (TBB, Cilk+, OpenMP) allow spawning independent tasks at runtime. In this talk, I will give an overview of our work on the Kaapi runtime system and its management of dependencies between tasks scheduled by a work-stealing algorithm. I will show that, at a cost lower than that of TBB or Cilk+, it is possible to program with data flow dependencies.
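The sketch below shows the classic bookkeeping such data-flow runtimes perform when a task is spawned with declared read/write accesses: the last writer of a datum precedes its readers, and readers precede the next writer. It is a generic illustration of the idea, not Kaapi's implementation.

```python
# Derive task dependencies from declared data accesses ("last writer -> readers").
from collections import defaultdict

last_writer = {}                   # datum -> task that last wrote it
readers_since = defaultdict(list)  # datum -> tasks that read it since that write
edges = []                         # (predecessor task, successor task)

def spawn(task, reads=(), writes=()):
    for d in reads:
        if d in last_writer:                      # true (flow) dependency
            edges.append((last_writer[d], task))
        readers_since[d].append(task)
    for d in writes:
        for r in readers_since[d]:                # anti-dependency
            edges.append((r, task))
        if d in last_writer:                      # output dependency
            edges.append((last_writer[d], task))
        last_writer[d] = task
        readers_since[d] = []

spawn("t1", writes=["x"])
spawn("t2", reads=["x"], writes=["y"])
spawn("t3", reads=["x", "y"])
spawn("t4", writes=["x"])
print(edges)   # t1->t2, t1->t3, t2->t3, t2->t4, t3->t4, t1->t4
```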
Raymond Namyst, INRIA/Univ. Bordeaux
Bridging the gap between runtime systems and programming languages on heterogeneous GPU clusters
In this talk, I will give an overview of our recent work on the StarPU runtime system. I will also present a number of extensions that leverage StarPU and bridge the gap with programming environments such as OpenCL or StarSuperscalar, and which provide better integration potential with programming standards such as MPI, OpenMP, etc.
Christian Perez, INRIA/ENS Lyon
...