...
Main Topics | Schedule | Speaker | Affiliation | Type of presentation | Title (tentative)
Dinner | Sunday Nov. 24th | 7:00 PM (departure from Hampton Inn at 6:45 PM) with mini buses | Only people registered for the dinner
Workshop Day 1 | Monday Nov. 25th
TITLES ARE TEMPORARY (except if in bold font)
Registration | 08:00
Welcome and Introduction (Auditorium 1122; Chair: Franck Cappello) | 08:30 | Marc Snir + Franck Cappello, co-directors of the joint-lab | | Background | Welcome, workshop objectives and organization
| 08:45 | Ed Seidel, incoming NCSA director | UIUC | Background | NCSA update and vision of the collaboration (this address was swapped with the next one due to schedule constraints)
| 09:00 | Peter Schiffer, UIUC Vice Chancellor for Research | UIUC | Background | Welcome from the UIUC Vice Chancellor for Research
| 09:15 | Michel Cosnard, Inria CEO and President | Inria | Background | Inria updates and vision of the collaboration
| 09:30 | Marc Snir, director of Argonne/MCS and co-director of the joint-lab | ANL | Background | Argonne updates and vision of the collaboration
| 09:45 | Marc Daumas, Attaché for Science and Technology | Embassy of France | Background | France-USA collaboration program updates
| 09:55 | Franck Cappello, co-director of the joint-lab | ANL | Background | Joint-lab, PUF, new joint-lab, organization
| 10:15 | Break
Extreme Scale Systems and Infrastructures (Auditorium 1122; Chair: Yves Robert) | 10:45 | Pete Beckman | ANL | | Extreme Scale Computing & Co-design Challenges
| 11:15 | John Towns | UIUC | | Applications Challenges in the XSEDE Environment
| 11:45 | Gabriel Antoniu | Inria | | A-Brain and Z-CloudFlow: Scalable Data Processing on Azure Clouds - Lessons Learned in Three Years and Future Directions
| 12:15 | Lunch
| 13:45 | Bill Kramer | UIUC | Blue Waters | Is Petascale Completely Done? What Should We Do Now?
| 14:15 | Torsten Hoefler | ETH | IEEE/ACM SC13 Best Paper | Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
| 14:45 | Rob Ross | ANL | | Thinking Past POSIX: Persistent Storage in Extreme Scale Systems
| 15:15 | François Pellegrini | Inria | | Parallel repartitioning and remeshing: results and prospects
| 15:45 | Break
| 16:15 | Pavan Balaji | ANL | | Message Passing in Massively Multithreaded Environments
| 16:45 | Wen-Mei Hwu | UIUC | | A New, Portable Algorithm Framework for Parallel Linear Recurrence Problems
| 17:15 | Adjourn
| 18:45 | Bus for dinner
Workshop Day 2 | Tuesday Nov. 26th
Applications, I/O, Visualization, Big Data (Auditorium 1122; Chair: Rob Ross) | 08:30 | Greg Bauer | UIUC | | Applications and their challenges on Blue Waters
| 09:00 | Matthieu Dorier | Inria | Joint result, submitted | CALCioM: Mitigating I/O Interferences in HPC Systems through Cross-Application Coordination
| 09:30 | Dries Kimpe | ANL | | Mercury: Enabling Remote Procedure Call for High-Performance Computing
| 10:00 | Venkatram Vishwanath | ANL | | Addressing I/O Bottlenecks and Simulation-Time Analytics at Extreme Scales
| 10:30 | Break
| 11:00 | Babak Behzad | UIUC | ACM/IEEE SC13 | Taming Parallel I/O Complexity with Auto-Tuning
| 11:30 | Kenton Guadron McHenry | UIUC | | NSF CIF21 DIBBs: Brown Dog
| 12:00 | Lunch
Mini-Workshop 1: Resilience (Room 1030; Chair: Yves Robert)
| 13:30 | Leonardo Bautista-Gomez | ANL | Joint result | Detecting Silent Data Corruption through Data Dynamic Monitoring for Scientific Applications
| 14:00 | Tatiana Martsinkevich | Inria | Joint result | On the feasibility of message logging in hybrid hierarchical FT protocols
| 14:30 | Mohamed Slim Bouguerra | Inria | Joint result, submitted | Failure prediction: what to do with unpredicted failures?
| 15:00 | Ana Gainaru | UIUC | Joint result, submitted | Topology- and behavior-aware failure prediction for Blue Waters
| 15:30 | Break
| 16:00 | Sheng Di | Inria | Joint result, submitted | Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
| 16:30 | Yves Robert | Inria | Joint result | Assessing the impact of ABFT & checkpoint composite strategies
| 17:00 | Wesley Bland | ANL | | Fault Tolerant Runtime Research at ANL
| 17:30 | Adjourn
| 19:00 | Bus for dinner
Mini-Workshop 2: Numerical Algorithms (Room 1040; Chair: Bill Gropp)
| 13:30 | Luke Olson | UIUC | | Toward a more robust sparse solver with some ideas on resilience and scalability
| 14:00 | Prasanna Balaprakash | ANL | | Active-Learning-based Surrogate Models for Empirical Performance Tuning
| 14:30 | Yushan Wang | Inria | | Solving 3D incompressible Navier-Stokes equations on hybrid CPU/GPU systems
| 15:00 | Jed Brown | ANL | | Fast solvers for implicit Runge-Kutta systems
| 15:30 | Break
| 16:00 | Pierre Jolivet | Inria | Best Paper finalist, IEEE/ACM SC13 | Scalable Domain Decomposition Preconditioners For Heterogeneous Elliptic Problems
| 16:30 | Vincent Baudoui | Total & ANL | Joint result | Round-off error propagation and non-determinism in parallel applications
| 17:00 | Torsten Hoefler | ETH | | Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
| 17:30 | Adjourn
| 19:00 | Bus for dinner
Workshop Day 3 | Wednesday Nov. 27th
Mini-Workshop 3: Programming Models, Compilation and Runtime (Room 1030; Chair: Marc Snir)
| 08:30 | Grigori Fursin | Inria | | Collective Mind: making auto-tuning practical using crowdsourcing and predictive modeling
| 09:00 | Maria Garzaran | UIUC | | Optimization by Run-time Specialization for Sparse Matrix-Vector Multiplication
| 09:30 | Jean-François Mehaut | Inria | | From Multicores to Manycores Processors: Challenging Programming Issues with the MPPA/KALRAY
| 10:00 | Break
| 10:30 | Frédéric Vivien | Inria | | Scheduling tree-shaped task graphs to minimize memory and makespan
| 11:00 | Rafael Tesser | Inria | Joint result, PDP 2013 | Using AMPI to improve the performance of the Ondes3D seismic wave simulator through dynamic load balancing
| 11:30 | Emmanuel Jeannot | Inria | Joint result, IEEE Cluster 2013 | Communication and Topology-aware Load Balancing in Charm++ with TreeMatch
| 12:00 | Closing
| 12:30 | Lunch
| 18:00 | Bus for dinner
Mini-Workshop 4: Large Scale Systems and their Simulators (Room 1040; Chair: Bill Kramer)
| 08:30 | Eric Bohm | UIUC | | A Multi-resolution Emulation + Simulation Methodology for Exascale
| 09:00 | Arnaud Legrand | Inria | | SMPI: Toward Better Simulation of MPI Applications
| 09:30 | Marc Snir | ANL | | G8 ECS and international collaboration toward extreme scale climate simulation
| 10:00 | Break
| 10:30 | Kate Keahey | ANL | | Evaluating Streaming Strategies for Event Processing across Infrastructure Clouds
| 11:00 | Jeremy Enos | UIUC | | Application Runtime Consistency and Performance Challenges on a shared 3D torus
| 11:30 | TBD
Auditorium 1122 | 12:00 | Closing
| 12:30 | Lunch
| 18:00 | Bus for dinner
...
Software and hardware optimization and co-design of computer systems has become intolerably complex, ad hoc, time consuming and error prone due to the enormous number of available design and optimization choices, complex interactions between all software and hardware components, and ever-changing tools and applications. We present our novel long-term, holistic and practical solution to these problems based on the new plugin-based Collective Mind infrastructure and repository. For the first time, it can preserve the whole experimental setup and all associated artifacts to distribute program analysis and multi-objective optimization among many participants, utilizing any available smart phone, tablet, laptop, cluster or data center while continuously observing, classifying and modeling their realistic behavior. Any unexpected behavior is analyzed using shared data-mining and predictive-modeling plugins, or exposed to the community at the public portal cTuning.org and the repository c-mind.org/repo for collaborative explanation. Gradually increasing public optimization knowledge helps to continuously improve the optimization heuristics of any compiler, predict optimizations for new programs, and suggest efficient run-time adaptation strategies depending on end-user requirements. We have successfully validated this approach and framework in several academic and industrial projects while releasing hundreds of codelets, numerical applications, data sets, models, universal experimental pipelines and unified tools, in order to start community-driven, systematic and reproducible R&D to build adaptive, self-tuning computer systems, and to initiate a new publication model in which experiments and techniques are continuously validated and improved by the community.
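The crowdsourced-tuning loop described above can be sketched in a few lines. This is an illustrative toy, not Collective Mind's actual API: the flag list, the synthetic `measure` cost model, and the dictionary "repository" are stand-ins for real compiler options, timed runs of real programs, and the shared c-mind.org repository.

```python
import random

# Hypothetical optimization choices; in Collective Mind these would be real
# compiler flag combinations explored across many machines.
FLAGS = ["-O2", "-O3", "-O3 -funroll-loops", "-O3 -flto"]

def measure(program_features, flags):
    # Stand-in for compiling and timing a program: a synthetic cost model
    # with a little noise, so the example is self-contained and runnable.
    base = sum(program_features)
    speedup = {"-O2": 1.0, "-O3": 0.8, "-O3 -funroll-loops": 0.7, "-O3 -flto": 0.75}
    return base * speedup[flags] * random.uniform(0.98, 1.02)

def crowdsource_tuning(programs, repository, trials=100):
    # Each "participant" measures random (program, flags) pairs and shares
    # the results in a common repository, growing public tuning knowledge.
    for _ in range(trials):
        prog = random.choice(list(programs))
        flags = random.choice(FLAGS)
        t = measure(programs[prog], flags)
        repository.setdefault(prog, {}).setdefault(flags, []).append(t)

def predict_best_flags(repository, prog):
    # Reuse the accumulated knowledge: pick the flags with lowest mean time.
    runs = repository[prog]
    return min(runs, key=lambda f: sum(runs[f]) / len(runs[f]))
```

In the real system the repository also stores program features, so that a predictive model can suggest flags for a *new* program by similarity rather than by exhaustive re-measurement.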
Wen-Mei Hwu
A New, Portable Algorithm Framework for Parallel Linear Recurrence Problems
Linear recurrence solvers are common constructs in a class of important scientific applications. Many parallel algorithms have been proposed to achieve high performance for different problems that are linear recurrences in nature. Through a detailed investigation of the existing parallel implementations, we identify a general, hierarchical parallel linear recurrence algorithm that has the potential to fully utilize a wide variety of hardware. However, this algorithm is complex and requires enormous programming effort to achieve high performance across different architectures. To achieve single-source performance portability, we create an auto-tuning code generator for optimizing high-performance, parallel, linear recurrence solvers that are retargetable to specific platforms. The framework is composed of two major components. The first component is an auto-tuned tiling procedure which generates tilings by searching a unified tiling space (UTS); the UTS combines on-chip memory resources to simplify the complexity of tiling decisions. Based on the tiling decision, the second component selects the best communication implementation to minimize communication overhead. By heuristically reducing the search space, our auto-tuning technique generates optimized programs in a reasonable time. We evaluate our framework on GPU architectures using several benchmarks, including prefix sum, IIR filter, bidiagonal solver and tridiagonal solver. The resulting linear recurrence solvers significantly outperform the previous state-of-the-art, specialized GPU implementations.
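The core of the hierarchical algorithm can be sketched independently of any GPU code: each tile summarizes its effect on the recurrence as a linear function of its incoming value, the per-tile summaries are combined in a short sequential pass, and each tile then replays its elements from its carried-in value. The sketch below is ours (function names and the tile size are invented); the real framework generates tuned, parallel GPU code for this pattern.

```python
def solve_recurrence_tiled(a, b, y0, tile=4):
    """Solve y[i] = a[i]*y[i-1] + b[i] (with y[-1] = y0) in three phases.
    Phases 1 and 3 are independent per tile, so they could run on separate
    cores or GPU thread blocks; only phase 2 is sequential across tiles."""
    n = len(a)
    tiles = [(s, min(s + tile, n)) for s in range(0, n, tile)]

    # Phase 1 (parallel over tiles): summarize each tile as y_out = A*y_in + B,
    # by composing the per-element affine maps x -> a[i]*x + b[i].
    summaries = []
    for s, e in tiles:
        A, B = 1.0, 0.0
        for i in range(s, e):
            A, B = a[i] * A, a[i] * B + b[i]
        summaries.append((A, B))

    # Phase 2 (sequential, one value per tile): carry the incoming y across tiles.
    carries = [y0]
    for A, B in summaries[:-1]:
        carries.append(A * carries[-1] + B)

    # Phase 3 (parallel over tiles): replay each tile from its carried-in value.
    y = [0.0] * n
    for (s, e), c in zip(tiles, carries):
        for i in range(s, e):
            c = a[i] * c + b[i]
            y[i] = c
    return y
```

The tiling decisions the abstract describes correspond to choosing the tile size and where the summaries live in on-chip memory.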
François Pellegrini
Parallel repartitioning and remeshing : results and prospects
The purpose of this talk is to present the current state and the prospects of research and implementation for two software tools that we develop for HPC: PT-Scotch and PaMPA. PT-Scotch is a parallel partitioning and mapping tool that has recently been extended to provide dynamic remapping features. While its algorithms have been developed with scalability in mind, several algorithmic bottlenecks have appeared, which force us to rethink the way we perform repartitioning. PaMPA is a library for parallel (re)meshing of distributed, unstructured meshes that delegates (re)partitioning to PT-Scotch. After developing basic mesh handling features, we focused on parallel remeshing itself, allowing us to produce distributed tetrahedral meshes comprising several hundred million elements.
Venkatram Vishwanath
Addressing I/O Bottlenecks and Simulation-Time Analytics at Extreme Scales
We will first present our work on GLEAN, a flexible and extensible framework that takes application, analysis, and system characteristics into account to facilitate simulation-time data analysis and I/O acceleration. The GLEAN infrastructure hides significant details from the end user while providing a flexible interface to the fastest path for their data and analysis needs and, in the end, scientific insight. We describe the efficacy of our approaches in scaling to 768K cores of the Mira BG/Q system, and on the Cray supercomputer. If time permits, we will present our work on Concerted Flows, a parallel data movement infrastructure that combines analytical and empirical models of an end-to-end system infrastructure with mathematical optimization to improve the achievable performance of parallel data flows at various system scales.
Luke Olson
Toward a more robust sparse solver with some ideas on resilience and scalability
In this talk we look at some recent attempts to improve robustness in algebraic multigrid solvers for a wider range of problems. In particular, we examine optimality throughout the solver by refining interpolation and the notion of strength in the method. We also comment on some current directions for improving scalability by thinning the hierarchy, and on some possibilities for strengthening resilience.
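For readers unfamiliar with the multigrid structure being made more robust here, a minimal two-grid cycle for a 1D Poisson model problem shows the ingredients (smoothing, restriction, coarse-grid correction, interpolation) that algebraic multigrid constructs purely from the matrix rather than from a mesh. This geometric sketch is ours, not the speaker's code.

```python
import numpy as np

def two_grid_poisson_1d(n_fine, n_sweeps=2, n_cycles=10):
    """Illustrative two-grid cycle for the 1D Poisson matrix: weighted-Jacobi
    smoothing, linear-interpolation prolongation, full-weighting restriction,
    and a Galerkin coarse operator. AMG builds this hierarchy from A alone."""
    assert n_fine % 2 == 1  # odd number of interior points -> nested coarse grid
    A = 2 * np.eye(n_fine) - np.eye(n_fine, k=1) - np.eye(n_fine, k=-1)
    f = np.ones(n_fine)

    nc = (n_fine - 1) // 2
    P = np.zeros((n_fine, nc))          # prolongation: coarse -> fine
    for j in range(nc):                 # coarse point j sits at fine index 2j+1
        P[2 * j, j] += 0.5
        P[2 * j + 1, j] += 1.0
        P[2 * j + 2, j] += 0.5
    R = 0.5 * P.T                       # full-weighting restriction
    Ac = R @ A @ P                      # Galerkin coarse-grid operator

    def jacobi(u, rhs, sweeps, w=2 / 3):
        D = np.diag(A)
        for _ in range(sweeps):
            u = u + w * (rhs - A @ u) / D
        return u

    u = np.zeros(n_fine)
    for _ in range(n_cycles):
        u = jacobi(u, f, n_sweeps)              # pre-smooth high frequencies
        r = f - A @ u
        e = np.linalg.solve(Ac, R @ r)          # solve coarse error equation
        u = u + P @ e                           # coarse-grid correction
        u = jacobi(u, f, n_sweeps)              # post-smooth
    return u, np.linalg.norm(f - A @ u)
```

The robustness work discussed in the talk concerns exactly the pieces hard-coded above: how interpolation (`P`) is formed and how "strong" connections are identified when no grid geometry is available.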
Torsten Hoefler
Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes
Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this talk, we show how both the coverage and the speed of this scalability analysis can be substantially improved. By generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
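The underlying idea (fit a simple growth model to measured timings of each code region, then flag regions whose fitted growth with the core count exceeds what the developer expects) can be illustrated with a toy power-law fit. The actual work uses a much richer class of candidate performance models; the function names and the threshold below are invented for illustration.

```python
import math

def fit_power_law(ps, ts):
    """Least-squares fit of t ~ c * p**k in log-log space; returns (c, k)."""
    xs = [math.log(p) for p in ps]
    ys = [math.log(t) for t in ts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - k * mx)
    return c, k

def find_scalability_bugs(kernels, ps, expected_exponent=0.1):
    """Flag code regions whose measured time grows with p faster than the
    expected exponent; under weak scaling a healthy region has k near 0."""
    bugs = []
    for name, timer in kernels.items():
        ts = [timer(p) for p in ps]   # timings at small, affordable scales
        _, k = fit_power_law(ps, ts)
        if k > expected_exponent:
            bugs.append((name, k))
    return bugs
```

Fitting on small, affordable runs and extrapolating is what lets the analysis cover every region of the program instead of a few hand-picked kernels.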
...