
UNDER CONSTRUCTION: the agenda below is not final.

This event is supported by INRIA, UIUC, NCSA, ANL, BSC, and PUF NEXTGEN.

Schedule

Sunday June 8th

Dinner Before the Workshop
7:30 PM, Mercure Hotel; only people registered for the dinner (dinner included)

Workshop Day 1: Monday June 9th

TITLES ARE TEMPORARY (except if in bold font)

Registration
08:00, at Inria Sophia Antipolis

Welcome and Introduction (Amphitheatre)

| Time | Speaker | Affiliation | Type | Title (tentative) |
| 08:30 | Franck Cappello, Marc Snir, Yves Robert, Bill Kramer, Jesus Labarta | INRIA, UIUC, ANL, BSC | Background | Welcome, workshop objectives and organization |

Plenary (Amphitheatre)
Chair: Franck Cappello

| Time | Speaker | Affiliation | Type | Title (tentative) |
| 09:00 | Jesus Labarta | BSC | Background | Presentation of BSC activities |

Mini Workshop: Math app. (Room 1)
Chair: Paul Hovland

| Time | Speaker | Affiliation | Title (tentative) |
| 09:30 | Bill Gropp | UIUC | |
| 10:00 | Jed Brown | ANL | |
| 10:30 | Break | | |
| 11:00 | Ian Masliah | Inria | Automatic generation of dense linear system solvers on CPU/GPU architectures |
| 11:30 | Luke Olson | UIUC | |
| 12:00 | Lunch | | |

Mini Workshop: Math app. (Room 1, continued)
Chair: Bill Gropp

| Time | Speaker | Affiliation | Title (tentative) |
| 13:30 | Vincent Baudoui | Inria | |
| 14:00 | Paul Hovland | ANL | |
| 14:30 | Stephane Lanteri | Inria | C2S@Exa: a multi-disciplinary initiative for high performance computing in computational sciences |

Mini Workshop: I/O and Big Data (Room 1)
Chair: Rob Ross

| Time | Speaker | Affiliation | Title (tentative) |
| 15:00 | Wolfgang Frings | JSC | |
| 15:30 | Break | | |
| 16:00 | Jonathan Jenkins | ANL | Towards Simulating Extreme-scale Distributed Systems |
| 16:30 | Matthieu Dorier | Inria | Omnisc'IO: A Grammar-Based Approach to Spatial and Temporal I/O Patterns Prediction |
| 17:00 | Kenton Guadron McHenry | NCSA | The NCSA Image and Spatial Data Analysis Division |
| 17:30 | Adjourn | | |
| 18:30 | Bus for dinner (dinner included) | | |

Mini Workshop: Runtime (Room 2)
Chair: Jesus Labarta

| Time | Speaker | Affiliation | Title (tentative) |
| 9:30 | Pavan Balaji | ANL | |
| 10:00 | Augustin Degomme | Inria | Status Report on the Simulation of MPI Applications with SMPI/SimGrid |
| 10:30 | Break | | |
| 11:00 | Ronak Buch | UIUC | |
| 11:30 | Victor Lopez | BSC | |
| 12:00 | Lunch | | |

Chair: Rajeev Thakur

| Time | Speaker | Affiliation | Title (tentative) |
| 13:30 | Xin Zhao | ANL | |
| 14:00 | Brice Videau | Inria | |
| 14:30 | Pieter Bellens | BSC | |
| 15:00 | Martin Quinson | Inria | |
| 15:30 | Break | | |

Chair: Sanjay Kale

| Time | Speaker | Affiliation | Title (tentative) |
| 16:00 | François Tessier | Inria | |
| 16:30 | Jean-François Méhaut | Inria | |
| 17:00 | Lucas Nussbaum | Inria | Evaluating exascale HPC runtimes through emulation with Distem |
| 17:30 | Adjourn | | |
| 18:30 | Bus for dinner (dinner included) | | |

Workshop Day 2: Tuesday June 10th

Formal Opening (Amphitheatre)
Chair: Bill Kramer

| Time | Speaker | Affiliation | Type | Title (tentative) | Download |
| 08:30 | Marc Snir, Franck Cappello | INRIA, UIUC, ANL | Background | | |
| 08:40 | Claude Kirchner | Inria | Background | Inria updates and vision of the collaboration | TBD |
| 08:50 | Marc Snir | ANL | Background | ANL updates and vision of the collaboration | TBD |

Plenary (Amphitheatre)

| Time | Speaker | Affiliation | Type | Title (tentative) | Download |
| 09:00 | Wolfgang Frings | JSC | Background | JSC activities in HPC | TBD |

Mini Workshop: I/O and Big Data (Room 1)
Chair: Gabriel Antoniu

| Time | Speaker | Affiliation | Title (tentative) |
| 09:30 | Rob Ross | ANL | Understanding and Reproducing I/O Workloads |
| 10:00 | Guillaume Aupy | Inria | Scheduling the I/O of HPC applications under congestion |
| 10:30 | Break | | |
| 11:00 | Lokman Rahmani | Inria | |
| 11:30 | Anthony Simonet | Inria | Using Active Data to Provide Smart Data Surveillance to E-Science Users |
| 12:00 | Lunch | | |

Mini Workshop: Runtime (Room 2)
Chair: Jean-François Méhaut

| Time | Speaker | Affiliation | Title (tentative) |
| 09:30 | Sanjay Kale | UIUC | Temperature, Power and Energy: How an Adaptive Runtime can optimize them |
| 10:00 | Florentino Sainz | BSC | DEEP Collective offload |
| 10:30 | Break | | |
| 11:00 | Arnaud Legrand | Inria | Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures |
| 11:30 | Grigori Fursin | Inria | |
| 12:00 | Lunch | | |

Formal Encouragements (Amphitheatre)
Chair: Franck Cappello

| Time | Speaker | Affiliation | Type | Title (tentative) |
| 13:45 | Ed Seidel | UIUC | Background | NCSA updates and vision of the collaboration |

Plenary (Amphitheatre)
Chair: Wolfgang Frings

| Time | Speaker | Affiliation | Title (tentative) |
| 14:00 | Yves Robert | Inria | |
| 14:30 | Marc Snir | ANL | |
| 15:00 | Break | | |

Mini Workshop: Resilience (Room 1)
Chair: Franck Cappello

| Time | Speaker | Affiliation | Title (tentative) |
| 15:30 | Luc Jaulmes | BSC | Checkpointless exact recovery techniques for Krylov-based iterative methods |
| 16:00 | Ana Gainaru | UIUC | |
| 16:30 | Tatiana Martsinkevich | Inria | |
| 17:00 | Adjourn | | |

Mini Workshop: Cloud & Cyber-infrastructure (Room 2)
Chair: Kate Keahey

| Time | Speaker | Affiliation | Title (tentative) |
| 15:30 | Justin Wozniak | ANL | |
| 16:00 | Shaowen Wang | UIUC | CyberGIS @ Scale |
| 16:30 | Christine Morin | Inria | |
| 17:00 | Adjourn | | |

 

18:30, Bus for dinner (dinner included)

Workshop Day 3: Wednesday June 11th

Plenary (Amphitheatre)
Chair: Jesus Labarta

| Time | Speaker | Affiliation | Title (tentative) |
| 8:30 | Bill Kramer | NCSA | Blue Waters - A year of results and insights |

Mini Workshop: Resilience (Room 1)
Chair: Yves Robert

| Time | Speaker | Affiliation | Title (tentative) |
| 9:00 | Leonardo Bautista Gomez | ANL | |
| 9:30 | Slim Bouguerra | Inria | |
| 10:00 | Break | | |
| 10:30 | Sheng Di | ANL | Round-off error propagation in large-scale applications |
| 11:00 | Vincent Baudoui | ANL | Five open questions on Resilience for the Exascale era |

Mini Workshop: Cloud & Cyber-infrastructure (Room 2)
Chair: Christine Morin

| Time | Speaker | Affiliation | Title (tentative) |
| 09:00 | Kate Keahey | ANL | |
| 09:30 | Radu Tudoran | Inria | JetStream: Enabling High Performance Event Streaming across Cloud Data-Centers |
| 10:00 | Break | | |
| 10:30 | Sri Hari Krishna Narayanan | ANL | |
| 11:00 | Timothy Armstrong | ANL | |

Plenary (Amphitheatre)

| Time | Event |
| 11:30 | Closing |
| 12:00 | Lunch (included) |

Abstracts

 

Matthieu Dorier

Title: Omnisc'IO: A Grammar-Based Approach to Spatial and Temporal I/O Patterns Prediction

The increasing gap between the computation performance of post-petascale machines and the performance of their I/O subsystem has motivated many I/O optimizations including prefetching, caching, and scheduling techniques. To further improve these techniques, modeling and predicting spatial and temporal I/O patterns of HPC applications as they run have become crucial.

This presentation introduces Omnisc'IO, an original approach that aims to make a step forward toward an intelligent I/O management of HPC applications in next-generation post-petascale supercomputers. It builds a grammar-based model of the I/O behavior of any HPC application and uses that model to predict when future I/O operations will occur, as well as where and how much data will be accessed. Omnisc'IO is transparently integrated into the POSIX and MPI I/O stacks and does not require any modification to application sources or to high level I/O libraries. It works without prior knowledge of the application, and converges to accurate predictions within a couple of iterations only. Its implementation is efficient both in computation time and in memory footprint. Omnisc'IO was evaluated with four real HPC applications -- CM1, Nek5000, GTC, and LAMMPS -- using a variety of I/O backends ranging from simple POSIX to Parallel HDF5 on top of MPI I/O. Our experiments show that Omnisc'IO achieves from 79.5% to 100% accuracy in spatial prediction and an average precision of temporal predictions ranging from 0.2 seconds to less than a millisecond.
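Omnisc'IO itself learns a Sequitur-like grammar over the stream of I/O operations; the sketch below only illustrates the same general idea (learning repetition online and predicting the next operation) with a plain context table instead of a grammar. All symbols and parameters are illustrative, not Omnisc'IO's.

```python
# Simplified illustration of online I/O-pattern prediction (NOT the
# actual Omnisc'IO algorithm, which builds a Sequitur-like grammar).
# Each I/O operation is abstracted as a symbol, e.g. (kind, size).
from collections import defaultdict, Counter

class ContextPredictor:
    """Predict the next I/O symbol from the last k observed symbols."""
    def __init__(self, k=3):
        self.k = k
        self.table = defaultdict(Counter)  # context tuple -> next-symbol counts
        self.history = []

    def observe(self, symbol):
        # Update counts for all context lengths 1..k, then extend history.
        for n in range(1, self.k + 1):
            if len(self.history) >= n:
                ctx = tuple(self.history[-n:])
                self.table[ctx][symbol] += 1
        self.history.append(symbol)

    def predict(self):
        # Longest matching context wins (back off to shorter contexts).
        for n in range(self.k, 0, -1):
            ctx = tuple(self.history[-n:])
            if ctx in self.table:
                return self.table[ctx].most_common(1)[0][0]
        return None

# A periodic pattern like one produced by an iterative HPC code:
trace = [("seek", 0), ("write", 4096), ("write", 4096), ("flush", 0)] * 10
p = ContextPredictor()
hits = 0
for op in trace:
    if p.predict() == op:
        hits += 1
    p.observe(op)
print(f"prediction accuracy: {hits}/{len(trace)}")
```

After the first iteration of the pattern, the predictor converges, mirroring the abstract's observation that accurate predictions are reached within a couple of iterations.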

 

Sheng Di

Optimization of Multi-level Checkpoint Model with Uncertain Execution Scales

Future extreme-scale systems may be struck by different types of failures at different scales, from transient uncorrectable memory errors in individual processes to massive system outages. In this work, a multi-level checkpoint model is proposed that takes into account uncertain execution scales (different numbers of processes/cores). The contribution is three-fold. (1) We provide an in-depth analysis of why it is very hard to derive the optimal checkpoint intervals for the different checkpoint levels while simultaneously optimizing the number of cores. (2) We devise a novel method that quickly obtains an optimized solution; this is the first successful attempt for the multi-level checkpoint model with uncertain scales. (3) We perform both large-scale real experiments and extreme-scale numerical simulation to validate the effectiveness of our design. Experiments confirm that our optimized solution outperforms other state-of-the-art solutions by 4.3-88% in wall-clock time.
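The talk's model is multi-level with uncertain execution scales; as a point of reference, here is a hedged sketch of the classical single-level checkpoint-interval trade-off (the Young/Daly approximation) that such models generalize. All costs (C, R, MTBF) are invented for illustration.

```python
# Toy single-level checkpoint-interval optimization (a baseline for the
# multi-level model discussed in the talk; all parameters are invented).
import math

C = 60.0       # checkpoint cost (s)
R = 30.0       # restart cost (s)
mtbf = 3600.0  # system mean time between failures (s)

# Young/Daly first-order optimum: T_opt ~ sqrt(2 * C * MTBF)
t_young = math.sqrt(2 * C * mtbf)

def waste(T):
    """Expected fraction of time lost to checkpoints and failures."""
    # checkpoint overhead + (recovery + half an interval lost) per failure
    return C / T + (R + T / 2) / mtbf

# Brute-force check that the analytic optimum is close to the numeric one.
t_best = min(range(60, 7200, 10), key=waste)
print(f"Young/Daly: {t_young:.0f}s  brute force: {t_best}s "
      f"(waste {waste(t_best):.3f})")
```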

 

Augustin Degomme/Arnaud Legrand

Status Report on the Simulation of MPI Applications with SMPI/SimGrid

- Virtualization: the automatic approaches we previously used for application emulation relied on an alternative compilation chain (e.g., using GNU TLS), which was problematic: it could dramatically change code performance and was not sufficiently generic. We have investigated alternatives and recently designed a new approach, based on the OS-like organization of SimGrid, that identifies the heaps and stacks of the virtual MPI processes and mmaps them at every context switch. This new approach makes it possible to *emulate unmodified MPI applications*, regardless of the language they are written in and of the compilation toolchain. Although this has not been evaluated yet, the approach should also allow classical profilers to be used at small scale to identify which variables should be aliased and which kernels should be modeled rather than actually executed in simulation.

- Trace replay and interoperability: we have an ongoing effort toward SMPI interoperability. Each simulation tool (BigSim, LogGOPSim, Dimemas, SimGrid, SST/Macro, ...) has its own strengths and weaknesses but is often strongly tied to a given trace format. Working toward interoperability would allow researchers to seamlessly move to another simulator whenever it is more appropriate, rather than trying to fix the one tied to their tracing tool or their application. Replaying BigSim and ScalaTrace traces is now possible in SMPI/SimGrid, but the validation remains to be done. We plan to perform similar work with Dimemas and SST/Macro so as to ease the use of SimGrid's fluid models (a toy trace-replay sketch follows below).

- Status report and current effort on network modeling (InfiniBand, fat-tree and torus-like topologies).
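As a minimal, hedged illustration of what trace-driven simulation of an MPI application means (timed compute bursts plus a communication cost model), here is a sketch with an invented latency/bandwidth model, far simpler than SimGrid's, and ignoring send/recv synchronization:

```python
# Minimal sketch of offline trace replay (the general idea behind
# simulators such as SMPI or Dimemas; invented cost model, not SimGrid's).
LATENCY = 1e-5      # per-message latency (s), assumed
BANDWIDTH = 1e9     # link bandwidth (B/s), assumed

def comm_cost(nbytes):
    return LATENCY + nbytes / BANDWIDTH

def replay(trace):
    """trace: per-rank lists of ('compute', seconds) / ('send'/'recv', bytes).
    Predicted makespan under a trivial no-contention model.
    (Real replayers match sends to recvs and synchronize ranks; omitted.)"""
    clocks = [0.0] * len(trace)
    for rank, events in enumerate(trace):
        for kind, amount in events:
            clocks[rank] += amount if kind == "compute" else comm_cost(amount)
    return max(clocks)

# Two ranks exchanging 1 MiB after a compute phase:
trace = [
    [("compute", 2.0), ("send", 1 << 20), ("recv", 1 << 20)],
    [("compute", 1.5), ("send", 1 << 20), ("recv", 1 << 20)],
]
print(f"predicted makespan: {replay(trace):.4f}s")
```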

 

Luka Stanisic/Arnaud Legrand

Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures

[Joint work between Luka Stanisic, Samuel Thibault, Arnaud Legrand, Brice Videau and Jean-François Méhaut, accepted for publication at Europar'14]

Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far rely on task-based runtimes that abstract the machine and use opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity and task graph structure, and to optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge: getting accurate measurements requires reserving the target system for the whole duration of experiments, and observations are limited to the few systems at hand and may be difficult to generalize. In this work, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. This approach yields performance predictions accurate to within a few percent on classical dense linear algebra kernels in a matter of seconds, which allows both runtime and application designers to quickly decide which optimizations to enable or whether it is worth investing in higher-end GPUs.
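A hedged sketch of the kind of simulation involved: predicted per-resource task durations fed to a greedy earliest-finish-time scheduler. The durations, resources, and scheduler below are invented stand-ins, not StarPU's schedulers or SimGrid's models.

```python
# Toy simulation of a StarPU-like heterogeneous scheduler: each task has a
# predicted duration per resource type, and a greedy scheduler picks the
# resource with the earliest finish time (numbers invented for illustration).
def simulate(tasks, resources):
    ready = {name: 0.0 for name in resources}   # next-free time per resource
    for duration_by_type in tasks:
        # earliest-finish-time choice across all resources
        best = min(ready,
                   key=lambda r: ready[r] + duration_by_type[resources[r]])
        ready[best] += duration_by_type[resources[best]]
    return max(ready.values())

# 20 identical GEMM-like tasks: fast on GPU, slow on CPU.
tasks = [{"cpu": 1.0, "gpu": 0.1}] * 20
resources = {"cpu0": "cpu", "cpu1": "cpu", "gpu0": "gpu"}
print(f"simulated makespan: {simulate(tasks, resources):.2f}s")
```

Running variants of such a model for different granularities or resource mixes is exactly the kind of what-if question the simulation approach answers in seconds rather than machine-hours.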

 

Guillaume Aupy

Scheduling the I/O of HPC applications under congestion

A significant percentage of the computing capacity of large-scale platforms is wasted because of interference caused by multiple applications accessing a shared parallel file system concurrently. One solution for handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. As a consequence, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this paper, we analyze the effects of interference on application I/O bandwidth and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase the overall system throughput by up to 56%. We also show that it outperforms current Mira I/O schedulers.
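A back-of-envelope model, with invented numbers, of why serializing I/O phases can beat fair sharing of file-system bandwidth (the paper's scheduler is of course far more sophisticated):

```python
# Back-of-envelope model of I/O congestion (invented numbers, not the
# paper's scheduler): two identical apps alternate compute and I/O.
N = 10          # iterations per app
C = 100.0       # compute time per iteration (s)
V = 50e9        # bytes written per iteration
B = 1e9         # shared file-system bandwidth (B/s)
io = V / B      # exclusive-access I/O time per phase (50 s)

# Fair sharing: both apps always write together at B/2 each.
fair_makespan = N * (C + 2 * io)

# Global scheduling: I/O phases are serialized; since C >= io, once the
# apps are offset by one I/O phase they never contend again.
sched_makespan = N * (C + io) + io

print(f"fair sharing : {fair_makespan:.0f}s")   # 2000s
print(f"scheduled    : {sched_makespan:.0f}s")  # 1550s
```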

 

Florentino Sainz

DEEP Collective offload

We present a new extension of the OmpSs programming model that allows users to dynamically offload C/C++ or Fortran code from one or many nodes to a group of remote nodes. Communication between remote nodes executing offloaded code is possible through MPI. It aims to improve the programmability of exascale and present-day supercomputers, which combine different types of processors and interconnection networks that have to work together to obtain the best performance. A good example of such architectures is the DEEP project, which has two separate clusters (CPUs and Xeon Phis). With our technology, which works on any architecture that fully supports MPI, users can easily offload work from the CPU cluster to the accelerator cluster without the constraint of falling back to the CPU cluster to perform MPI communications.

 

Radu Tudoran

JetStream: Enabling High Performance Event Streaming across Cloud Data-Centers

The easily accessible computation power offered by cloud infrastructures, coupled with the Big Data revolution, is expanding the scale and speed at which data analysis is performed. In their quest for finding the Value in the 3 Vs of Big Data, applications process larger data sets, within and across clouds. Enabling fast data transfers across geographically distributed sites becomes particularly important for applications that manage continuous streams of events in real time. In this paper, we propose a set of strategies for efficient transfers of events between cloud data-centers. Our approach, called JetStream, is able to self-adapt to the streaming conditions by modeling and monitoring a set of context parameters. It further aggregates the available bandwidth by enabling multi-route streaming across cloud sites. The prototype was validated on tens of nodes from US and Europe data-centers of the Microsoft Azure cloud, using synthetic benchmarks and application code from the context of the ALICE experiment at CERN. The results show an increase in transfer rate of 250 times over individual event streaming. Moreover, introducing an adaptive transfer strategy brings an additional 25% gain. Finally, the transfer rate can be tripled further thanks to multi-route streaming.
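A toy model, with invented parameters, of why aggregating events into batches raises the transfer rate between data-centers (JetStream additionally adapts the batch size online and streams across multiple routes):

```python
# Why batching events helps (illustrative numbers only; JetStream's
# adaptation and multi-route streaming are not modeled here).
OVERHEAD = 0.05   # per-transfer fixed cost between data-centers (s), assumed
BW = 100e6        # available bandwidth (B/s), assumed
EVENT = 1000      # event size (B), assumed

def rate(batch):
    """Events per second when sending `batch` events per transfer."""
    return batch / (OVERHEAD + batch * EVENT / BW)

def transfer_time(batch):
    """Completion time of one batched transfer."""
    return OVERHEAD + batch * EVENT / BW

for batch in (1, 10, 100, 1000, 10000):
    print(f"batch={batch:>5}  rate={rate(batch):>9.0f} ev/s  "
          f"transfer={transfer_time(batch)*1e3:6.1f} ms")
```

The fixed per-transfer overhead dominates at batch size 1 and is amortized as batches grow, which is the intuition behind the large speedups over individual event streaming reported above.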


Anthony Simonet

Using Active Data to Provide Smart Data Surveillance to E-Science Users

Modern scientific experiments often involve multiple storage and computing platforms, software tools, and analysis scripts. The resulting heterogeneous environments make data management operations challenging: the significant number of events and the absence of data integration make it difficult to track data provenance, manage sophisticated analysis processes, and recover from unexpected situations. Current approaches often require costly human intervention and are inherently error-prone. The difficulties inherent in managing and manipulating such large and highly distributed datasets also limit automated sharing and collaboration.
We study a real-world e-Science application involving terabytes of data, using three different analysis and storage platforms, and a number of applications and analysis processes. We demonstrate that using a specialized data life cycle and programming model, Active Data, we can easily implement global progress monitoring and sharing, recover from unexpected events, and automate a range of tasks.
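A hedged sketch of the life-cycle-with-handlers idea described above; the API below is hypothetical and is not Active Data's actual interface.

```python
# Minimal sketch of a life-cycle-based programming model in the spirit of
# Active Data (hypothetical API, not the actual Active Data interface):
# clients subscribe handlers to data life-cycle transitions.
from collections import defaultdict

class LifeCycle:
    TRANSITIONS = {("created", "transferred"), ("transferred", "analyzed"),
                   ("analyzed", "published"), ("transferred", "lost")}

    def __init__(self):
        self.state = {}                      # data item -> current state
        self.handlers = defaultdict(list)    # transition -> callbacks

    def on(self, src, dst, handler):
        self.handlers[(src, dst)].append(handler)

    def advance(self, item, dst):
        src = self.state.get(item, "created")
        if (src, dst) not in self.TRANSITIONS:
            raise ValueError(f"illegal transition {src} -> {dst}")
        self.state[item] = dst
        for h in self.handlers[(src, dst)]:
            h(item)  # e.g. progress monitoring, recovery, automation

lc = LifeCycle()
lc.on("created", "transferred", lambda f: print(f"{f}: start analysis"))
lc.on("transferred", "lost", lambda f: print(f"{f}: re-transfer!"))
lc.advance("sample-042.dat", "transferred")
lc.advance("sample-042.dat", "lost")
```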

 

Ian Masliah

Automatic generation of dense linear system solvers on CPU/GPU architectures
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries like LAPACK and MAGMA and combining them with architectural features, we developed a generic approach to solve dense linear systems on hybrid architectures. We report performance results that correspond to what state-of-the-art codes achieve while maintaining a generic code that can run either on CPU or GPU.
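As a loose illustration of the kind of domain knowledge such a generative approach encodes (choosing a factorization from matrix properties), here is a sketch with NumPy/SciPy standing in for LAPACK/MAGMA back-ends; this is not NT2's actual mechanism.

```python
# Picking a dense solver from matrix properties (illustrative only;
# NumPy/SciPy stand in for the LAPACK/MAGMA back-ends mentioned above).
import numpy as np
from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

def solve_dense(A, b):
    if np.allclose(A, A.T):                 # symmetric?
        try:                                # positive definite -> Cholesky
            return cho_solve(cho_factor(A), b)
        except np.linalg.LinAlgError:
            pass                            # not PD: fall through to LU
    return lu_solve(lu_factor(A), b)        # general case -> LU

rng = np.random.default_rng(0)
M = rng.standard_normal((100, 100))
spd = M @ M.T + 100 * np.eye(100)           # symmetric positive definite
b = rng.standard_normal(100)
x = solve_dense(spd, b)
print("residual:", np.linalg.norm(spd @ x - b))
```

A generic library can make this dispatch at compile time from declared matrix properties, and route to CPU or GPU kernels with the same user-facing code; that is the gap the talk addresses.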

Kenton Guadron McHenry
The NCSA Image and Spatial Data Analysis Division
 
The Image and Spatial Data Analysis division conducts research and development in general-purpose data cyberinfrastructure, addressing specifically the growing need to make use of large collections of non-universally accessible, or individually managed, data and software (i.e., executable data). We attempt to address these needs through the development of a common suite of internally and externally created open source tools/platforms that provide means of automatic and assisted curation for data/software collections. To acquire some of the needed high-level metadata not provided with un-curated data, we make heavy use of techniques founded in artificial intelligence, machine learning, computer vision, and natural language processing.

To close the gap between the state of the art of these fields and current needs, while also providing the sense of oversight many of our domain users desire, we keep the human in the loop wherever possible by incorporating elements of social curation, crowdsourcing, and error analysis. Given the ever-growing urgency to gain benefit from the deluge of un-curated data, we push for the adoption of solutions derived from these relatively young fields, highlighting the value of having tools to deal with this data where there would otherwise be nothing.

Attempting to follow in the footsteps of the great software cyberinfrastructure successes of NCSA (i.e., Mosaic, httpd, and telnet), we attempt to address these scientific and industrial needs in a manner that is also applicable to the general public. By catering toward broad appeal rather than focusing on a niche of the possible user base, we aim to stimulate uptake and provide a life for our software solutions beyond funded project deliverables. We will briefly go over a handful of our current projects spanning data integration and visualization, data mining, and the creation of general-purpose software tools.

Bill Kramer
Blue Waters - A year of results and insights
This talk will discuss the first year of full service for Blue Waters, including highlights of science results as well as insights into the use of the system. The talk will also point to lessons that might be important as we move into the extreme-scale era.

 
Vincent Baudoui
 
Round-off error propagation in large-scale applications
 
Round-off errors, which stem from the finite precision of numerical calculations, can lead to catastrophic losses of significant digits as they accumulate. They will become more and more of a concern in the future as problem sizes increase with the refinement of numerical simulations. Existing analytical bounds for round-off errors are known to scale poorly and become quite useless for large problems. That is why the propagation of round-off errors throughout a computation needs to be better understood in order to guarantee the accuracy of large-scale application results. We study here a round-off error estimation method based on first-order derivatives computed using algorithmic differentiation techniques. It can help track the error propagation through a computational graph and identify the sensitive sections of a code. It has been tested on well-known LU decomposition algorithms that are widely used to solve linear systems. We will present some examples, as well as challenges that need to be tackled as part of future research work, in order to set up a strategy for analyzing round-off error propagation in large-scale problems.
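A hedged sketch of the underlying technique: propagate a first-order error bound alongside each value, in the style of forward-mode algorithmic differentiation. This is simplified to scalar operations with invented inputs, not the talk's actual estimator.

```python
# First-order round-off estimation via AD-style error propagation
# (simplified forward mode on scalars; inputs are invented).
EPS = 2.0 ** -53  # unit roundoff, IEEE-754 double precision

class Dual:
    """Carries a value and an accumulated first-order error bound."""
    def __init__(self, val, err=0.0):
        self.val, self.err = val, err

    def __add__(self, o):
        v = self.val + o.val
        # incoming errors propagate with derivative 1; the operation
        # itself contributes at most |v| * EPS of fresh round-off
        return Dual(v, self.err + o.err + abs(v) * EPS)

    def __mul__(self, o):
        v = self.val * o.val
        # d(xy) = y dx + x dy, plus fresh round-off of the multiply
        e = abs(o.val) * self.err + abs(self.val) * o.err + abs(v) * EPS
        return Dual(v, e)

# error estimate for a small dot product
xs = [Dual(1.0 / (i + 1)) for i in range(100)]
ys = [Dual(float(i)) for i in range(100)]
acc = Dual(0.0)
for x, y in zip(xs, ys):
    acc = acc + x * y
print(f"value={acc.val:.6f}  first-order error bound={acc.err:.2e}")
```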

 

Luc Jaulmes

Checkpointless exact recovery techniques for Krylov-based iterative methods

By exploiting the inherent redundancy in iterative solvers, especially Krylov-subspace methods, we can recover from non-silent errors in data without resorting to techniques like checkpointing. We implemented this recovery scheme for the Conjugate Gradient method (CG) and its preconditioned variant (PCG), and we show near-zero overhead in the absence of faults, and fast recoveries that preserve all convergence properties of the solver. Using the asynchronous task-based programming model OmpSs, these overheads are reduced even further.
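One known recovery idea exploiting this redundancy, sketched here with dense NumPy and a converged iterate for brevity (the talk's scheme and its integration into CG/PCG iterations are richer): if the entries x_I of the iterate are lost, they satisfy A[I,I] x_I = b_I - A[I,~I] x_~I, so a small subsystem solve reconstructs them exactly.

```python
# Hedged sketch of erasure recovery via a local subsystem solve
# (dense NumPy for brevity; a stand-in for the actual CG-integrated scheme).
import numpy as np

rng = np.random.default_rng(1)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD system, as CG requires
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)            # stand-in for a converged CG iterate

lost = np.arange(10, 20)             # indices wiped by a fault
kept = np.setdiff1d(np.arange(n), lost)
x_damaged = x.copy()
x_damaged[lost] = 0.0                # the erasure

# A[lost,lost] x_lost = b_lost - A[lost,kept] x_kept
rhs = b[lost] - A[np.ix_(lost, kept)] @ x_damaged[kept]
x_damaged[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
print("recovery error:", np.linalg.norm(x_damaged - x))
```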

 

Lokman Rahmani

Smart In Situ Visualization for Climate Simulations

The increasing gap between computational power and I/O performance in new supercomputers has started to drive a shift from an offline approach to data analysis to an inline approach, termed in situ visualization (ISV). While most visualization software now provides ISV, it typically renders large dumps of unstructured data at the highest possible resolution. This often negatively impacts the performance of simulations that support ISV, in particular when ISV is performed interactively, as it requires synchronization with the simulation. In this work, we advocate a smarter method of performing ISV. Our approach is data-driven: it aims to detect potentially interesting regions in the generated dataset in order to feed ISV frameworks with only the interesting subset of the data produced by the simulation. While this method mitigates the load on ISV frameworks by making them more efficient and more interactive, it also helps scientists focus on the relevant part of their data. We investigate smart ISV in the context of a climate simulation, with a set of generic filters derived from information theory, statistics, and image processing, and show the trade-off between performance and quality of visualization.
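A toy stand-in for such a data-driven filter, with an invented variance criterion: flag only the blocks of a field that look "active" so the visualization pipeline can prioritize them. The talk's filters (information-theoretic, statistical, image-based) are more elaborate.

```python
# Toy data-driven ISV filter (illustrative only): keep only blocks of a
# 2-D field whose variance exceeds a threshold.
import numpy as np

def interesting_blocks(field, block=16, threshold=0.01):
    h, w = field.shape
    keep = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = field[i:i + block, j:j + block]
            if tile.var() > threshold:       # "activity" criterion, assumed
                keep.append((i, j))
    return keep

# synthetic field: smooth background gradient + one active region
y, x = np.mgrid[0:128, 0:128]
field = 0.001 * x + np.where((x - 90) ** 2 + (y - 40) ** 2 < 200, 1.0, 0.0)
blocks = interesting_blocks(field)
print(f"{len(blocks)} of {(128 // 16) ** 2} blocks flagged as interesting")
```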

 

Lucas Nussbaum

Evaluating exascale HPC runtimes through emulation with Distem

The exascale era will require the HPC software stack to face important challenges, such as platform heterogeneity, evolution during execution, and reliability issues. We propose a framework to evaluate key aspects of a central part of this software stack: the HPC runtimes. Starting from Distem, a versatile emulator for studying distributed systems, we designed an emulator suitable for the evaluation of HPC runtimes, enabling specifically: (1) emulation of a very large scale platform on top of a regular cluster; (2) introduction of heterogeneity and dynamic imbalance among the computing resources; (3) introduction of failures. These features provide runtime designers with the ability to test their prototypes under a large range of conditions, to discover performance gaps, understand future bottlenecks, and evaluate fault tolerance and load balancing mechanisms. We validate the usefulness of this approach with experiments on two HPC runtimes: Charm++ and Open MPI.
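A conceptual sketch (not Distem's API) of the kind of question such an emulator answers: how much does injected node heterogeneity cost a bulk-synchronous workload that lacks load balancing?

```python
# What injected heterogeneity costs a bulk-synchronous workload
# (conceptual model only; Distem degrades real nodes, not a Python model).
import random

random.seed(0)
NODES = 64
slowdown = [1.0] * NODES
for victim in random.sample(range(NODES), 8):
    slowdown[victim] = 3.0           # emulate 8 degraded nodes

WORK = 10.0                          # seconds of compute per step per node
STEPS = 100
# Without load balancing, every step waits for the slowest node:
makespan = STEPS * WORK * max(slowdown)
ideal = STEPS * WORK
print(f"ideal: {ideal:.0f}s  with heterogeneity: {makespan:.0f}s "
      f"({makespan / ideal:.1f}x)")
```

An adaptive runtime that migrates work away from the degraded nodes is exactly what the emulated imbalance and failures let designers evaluate.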


Sanjay Kale

Temperature, Power and Energy: How an Adaptive Runtime can optimize them.


 

Jonathan Jenkins

Towards Simulating Extreme-scale Distributed Systems

Simulating future extreme-scale parallel/distributed systems can be an important component in understanding these systems at scales that prototyping cannot feasibly reach. For HPC, big-data/cloud, or other computing/analysis platforms, the design decisions for developing systems that scale beyond current-generation systems are multi-dimensional in nature. For example, these decisions encompass distributed storage software/hardware solutions, network topologies within and between computing centers, algorithms for data analysis and compute services in heterogeneous software/hardware environments, etc., each of which can be a rich target for exploration via a simulation-based approach. This talk will examine our ongoing work in developing a simulation model framework that uses parallel discrete event simulation to examine various design aspects of extreme-scale distributed systems. As an exemplar, the simulation of protocols used in distributed storage systems will be examined in detail.
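A minimal discrete-event simulation loop in the spirit of such work (real PDES frameworks are parallel, optimistic or conservative, and far richer; the scenario and timings below are invented):

```python
# Minimal discrete-event simulation loop: a client exchanging requests
# with a storage server over a 5 ms link (invented toy scenario).
import heapq

events = []     # priority queue of (time, seq, callback)
seq = 0         # tie-breaker so the heap never compares callbacks

def schedule(t, fn):
    global seq
    heapq.heappush(events, (t, seq, fn))
    seq += 1

def run(until=float("inf")):
    while events and events[0][0] <= until:
        t, _, fn = heapq.heappop(events)
        fn(t)   # advance virtual time to t and fire the event

def client_send(t, n=0):
    if n < 3:
        schedule(t + 0.005, lambda now: server_recv(now, n))

def server_recv(t, n):
    print(f"{t * 1e3:5.1f} ms: server got request {n}")
    schedule(t + 0.005, lambda now: client_send(now, n + 1))

schedule(0.0, lambda now: client_send(now, 0))
run()
```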





 
