The Pegasus project encompasses a set of technologies that help workflow-based applications execute on distributed resources. Scientific workflows allow users to easily express multi-step computations, for example, retrieving data from a database, reformatting the data, and running an analysis or simulation. Once an application is formalized as a workflow, the Pegasus Workflow Management System can map it onto available compute resources, perform optimizations, and reliably execute the steps in the appropriate order.
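As a rough illustration of what "executing the steps in the appropriate order" means, the retrieve, reformat, and analyze example can be written as a dependency graph and ordered with Python's standard library. This is a minimal sketch, not the Pegasus API: the `workflow` dictionary, the step names, and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-step workflow: each step maps to the set of steps
# that must finish before it can start.
workflow = {
    "retrieve": set(),
    "reformat": {"retrieve"},
    "analyze": {"reformat"},
}


def execution_order(dependencies):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)
print(order)  # ['retrieve', 'reformat', 'analyze']
```

A real workflow system would also decide where each step runs and how data moves between steps; the sketch covers only the ordering.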

Pegasus automatically maps abstract workflows to the underlying infrastructure. This makes workflows portable: the user defines the workflow once and can run it anywhere. Pegasus supports several environments, including desktops, campus clusters, grids, and clouds.

Pegasus can scale both the size of the workflow and the resources over which the workflow is distributed. Pegasus runs workflows ranging from just a few computational tasks up to 1 million tasks. The number of resources involved in executing a workflow can be scaled as needed without impeding performance. In addition, the Pegasus mapper can reorder, group, and prioritize tasks to increase the performance of the workflow.

When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow-level checkpointing, by re-mapping portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done.
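Two of the recovery strategies above, retrying tasks and producing a rescue workflow, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not how Pegasus implements recovery: `run_with_retries`, `run_workflow`, the task names, and the choice to stop at the first task that exhausts its retries are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_with_retries(task, retries):
    """Call task() up to retries + 1 times; return True on the first success."""
    for _ in range(retries + 1):
        try:
            task()
            return True
        except Exception:
            continue
    return False


def run_workflow(dependencies, tasks, retries=2, done=None):
    """Run tasks in dependency order, skipping those already done.

    Returns (done, rescue). `rescue` describes only the work that remains,
    with dependencies on finished tasks removed.
    Assumption: execution stops at the first task that exhausts its retries.
    """
    done = set(done or ())
    for name in TopologicalSorter(dependencies).static_order():
        if name in done:
            continue
        if not run_with_retries(tasks[name], retries):
            break
        done.add(name)
    rescue = {n: deps - done for n, deps in dependencies.items() if n not in done}
    return done, rescue


# Hypothetical tasks: one fails once and then succeeds, one always fails.
attempts = {"reformat": 0}


def flaky_reformat():
    attempts["reformat"] += 1
    if attempts["reformat"] < 2:
        raise RuntimeError("transient failure")


def broken_analyze():
    raise RuntimeError("persistent failure")


dependencies = {"retrieve": set(), "reformat": {"retrieve"}, "analyze": {"reformat"}}
tasks = {"retrieve": lambda: None, "reformat": flaky_reformat, "analyze": broken_analyze}

done, rescue = run_workflow(dependencies, tasks)
print(sorted(done))  # ['reformat', 'retrieve']
print(rescue)        # {'analyze': set()}
```

Passing the returned `done` set back in on a later call resumes from the rescue point instead of repeating finished work.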

Pegasus handles all of the data management tasks required to execute workflows on distributed resources, including replica selection, data transfers, and data registration. In addition, Pegasus cleans up files as the workflow is executed so that data-intensive workflows have enough space to execute on storage-constrained resources. Pegasus also records provenance: what has been done, including the locations of data used and produced, and which software was used with which parameters.
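The idea behind cleaning up files during execution can be illustrated by computing, for each file, the last task that reads it; once that task finishes, the file is no longer needed. The sketch below is an assumption-laden illustration and not the Pegasus cleanup algorithm: `cleanup_points`, the task names, and the file names are hypothetical.

```python
def cleanup_points(order, inputs, keep=()):
    """Map each task to the files that can be deleted once it finishes.

    order:  task names in execution order
    inputs: task name -> list of files the task reads
    keep:   files that must never be deleted (for example, final results)
    """
    last_use = {}
    for task in order:
        for name in inputs.get(task, ()):
            last_use[name] = task  # later tasks overwrite earlier ones
    plan = {task: [] for task in order}
    for name, task in last_use.items():
        if name not in keep:
            plan[task].append(name)
    return plan


# Hypothetical workflow: raw.dat is read only by "reformat",
# clean.dat only by "analyze".
order = ["retrieve", "reformat", "analyze"]
inputs = {"reformat": ["raw.dat"], "analyze": ["clean.dat"]}

plan = cleanup_points(order, inputs)
print(plan)  # {'retrieve': [], 'reformat': ['raw.dat'], 'analyze': ['clean.dat']}
```

Files that no task reads, such as final outputs, never appear in the plan, so they are left in place.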

The Pegasus project started in 2001 and has users in astronomy, earthquake science, botany, chemistry, climate modeling, computer vision, genomics, helioseismology, limnology, neuroscience, ocean science, and physics.

http://pegasus.isi.edu
