DataSpaces: DataSpaces is a data-sharing framework that enables dynamic and asynchronous interactions between applications. It provides the abstraction of a virtual, semantically specialized shared space that can be associatively and asynchronously accessed using simple yet powerful and flexible operators (e.g., put() and get()) with appropriate data selectors or filters. These operators are agnostic of location (i.e., source and destination) as well as of the data distribution and decomposition of the interacting application components. DataSpaces also provides a runtime system for "in-the-space" data manipulation and/or reduction using predefined or user-defined functions, which can be dynamically downloaded and executed at runtime while the data is in transit through the space. DataSpaces has an extensible architecture and can support new data services, e.g., data subscription and notification.
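
The sketch below illustrates the put() side of this model, assuming the C API of the DataSpaces 1.x releases (dspaces_init, dspaces_put, dspaces_put_sync, the lock calls, and dspaces_finalize); exact signatures vary between versions, and the variable name, lock name, and domain sizes are illustrative only:

    /* Producer: stages one timestep of a 2-D variable into the space.
     * A minimal sketch assuming the DataSpaces 1.x C API; signatures
     * differ in later releases. */
    #include <stdint.h>
    #include <mpi.h>
    #include "dataspaces.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm comm = MPI_COMM_WORLD;

        /* one peer process, application id 1 */
        dspaces_init(1, 1, &comm, NULL);

        double temperature[64][64];
        for (int i = 0; i < 64; i++)
            for (int j = 0; j < 64; j++)
                temperature[i][j] = (double)(i + j);

        /* Global bounding box held by this rank; the consumer's
         * decomposition can be entirely different. */
        uint64_t lb[2] = {0, 0};
        uint64_t ub[2] = {63, 63};

        dspaces_lock_on_write("temp_lock", &comm);
        dspaces_put("temperature", 0 /* version/timestep */,
                    sizeof(double), 2 /* ndim */, lb, ub, temperature);
        dspaces_put_sync();   /* wait until the buffer is safe to reuse */
        dspaces_unlock_on_write("temp_lock", &comm);

        dspaces_finalize();
        MPI_Finalize();
        return 0;
    }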

The DataSpaces framework provides flexible, decoupled, and asynchronous data-sharing semantics that enable interactions between multiple distributed application services. It integrates easily into the data pipeline of workflow engines to complement or replace more traditional file-based approaches, and it alleviates the performance penalties associated with those approaches (e.g., latency and variability) by providing transparent, memory-to-memory data sharing.
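
On the consumer side of the same exchange, the data can be retrieved by global coordinates alone, as sketched below under the same 1.x API assumptions; the reader asks for a 16x16 corner of the domain without any knowledge of how, or by whom, the data was decomposed and written:

    /* Consumer: reads back a subregion of the staged variable,
     * memory-to-memory, with no intermediate files. Same DataSpaces
     * 1.x API assumptions as the producer sketch above. */
    #include <stdint.h>
    #include <mpi.h>
    #include "dataspaces.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm comm = MPI_COMM_WORLD;

        /* one peer process, application id 2 (a different application) */
        dspaces_init(1, 2, &comm, NULL);

        /* Request only a 16x16 corner of the global domain; the space
         * locates the data regardless of which rank wrote it. */
        double patch[16][16];
        uint64_t lb[2] = {0, 0};
        uint64_t ub[2] = {15, 15};

        dspaces_lock_on_read("temp_lock", &comm);
        dspaces_get("temperature", 0, sizeof(double), 2, lb, ub, patch);
        dspaces_unlock_on_read("temp_lock", &comm);

        dspaces_finalize();
        MPI_Finalize();
        return 0;
    }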

ActiveSpaces: Data-intensive application workflows typically transform, and often reduce, the data before it can be processed by consumer applications or services. For example, application coupling may require only subsets of the data, sorted and processed to match the data representation at the consumer. Similarly, visualization and monitoring applications may require only discrete values, such as the maximum, minimum, or average value of a variable over a region of interest. Processing the data before transporting it can be advantageous in these scenarios. For example, previous efforts have explored this approach by embedding predefined data-transformation operations in the staging area, so that CPU resources at the staging area can be used to transform the data before it is shipped to the consumer. This approach, however, requires a priori knowledge of the processing, as well as of the data structures and data representation. The dynamic nature of the overall workflow, especially in the amount of data and the processing requirements of the monitoring, analytics, and visualization consumers, warrants a more general approach in which application developers can programmatically define data-processing routines that are dynamically deployed and executed in the staging area at runtime, as sketched below.
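
To make this concrete, the sketch below shows the kind of routine such an approach would let a developer define and ship to the staging area: a small reduction kernel that collapses a region of doubles into min/max/mean summary values, so only a few bytes travel to the monitoring consumer instead of the full region. The kernel, its region_stats_t type, and the idea of deploying it by name are hypothetical illustrations of the concept, not the actual ActiveSpaces interface:

    /* Hypothetical data kernel, illustrating the kind of user-defined
     * routine that could be dynamically deployed into the staging area.
     * The function and types below are illustrative only and are not
     * the actual ActiveSpaces API. */
    #include <stddef.h>
    #include <float.h>

    typedef struct {
        double min, max, mean;
    } region_stats_t;

    /* Runs inside the staging area, on the staged copy of the data:
     * reduces a region of doubles to three summary values before
     * anything is shipped to the consumer. */
    int reduce_region_stats(const double *data, size_t n, region_stats_t *out)
    {
        if (n == 0)
            return -1;
        double lo = DBL_MAX, hi = -DBL_MAX, sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (data[i] < lo) lo = data[i];
            if (data[i] > hi) hi = data[i];
            sum += data[i];
        }
        out->min  = lo;
        out->max  = hi;
        out->mean = sum / (double)n;
        return 0;
    }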
