Table of Contents
Overview
Summary
- Work on cloud computing in astronomy focuses on data processing, such as processing telescope images (Berriman et al. 2010a; Jackson et al. 2010; Berriman et al. 2010b; Juve et al. 2009), or on data sharing (Juve et al. 2010).
- The common approach is to implement an existing pipeline on a public cloud platform (Berriman et al. 2010a; Jackson et al. 2010; Berriman et al. 2010b; Hoffa et al. 2008).
Workflow
- Eucalyptus is used to allocate resources and start virtual machines (VMs) (Vockler et al. 2011).
- Hadoop MapReduce is a useful tool for parallel computing applications (Wiley et al. 2011).
Data
- Data throughput for astronomy applications is typically very large. For example, astronomical sky surveys generate tens of terabytes of images and detect hundreds of millions of sources every night (Wiley et al. 2011).
- Cloud computing can reduce data processing time. For example, roughly 20 TB of data can be processed in about 7 hours on an 80-core Amazon EC2 deployment (Jackson et al. 2010).
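As a rough sanity check on that figure, the implied throughput can be computed directly. This is a back-of-the-envelope estimate that assumes the full 20 TB is streamed exactly once over the 7-hour run:

```python
# Back-of-the-envelope throughput implied by the Jackson et al. (2010)
# figure: ~20 TB processed in ~7 hours on an 80-core EC2 deployment.
data_bytes = 20e12        # 20 TB (decimal terabytes)
runtime_s = 7 * 3600      # ~7 hours in seconds
cores = 80

throughput_mb_s = data_bytes / runtime_s / 1e6   # aggregate MB/s
per_core_mb_s = throughput_mb_s / cores          # MB/s per core

print(f"aggregate: ~{throughput_mb_s:.0f} MB/s")  # ~794 MB/s
print(f"per core:  ~{per_core_mb_s:.1f} MB/s")    # ~9.9 MB/s
```

At roughly 10 MB/s per core, the workload is plausibly I/O-bound rather than CPU-bound, which is consistent with the data-transfer concerns raised in the Issues/Gaps section.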
Cloud platform
- Amazon EC2 is the most popular platform (Berriman et al. 2010a; Jackson et al. 2010; Berriman et al. 2010b; Juve et al. 2009; Juve et al. 2010; Vockler et al. 2011), since the IaaS model of AWS makes it convenient to port existing techniques to the Amazon cloud.
- Community clouds are also used because they are more cost-effective than commercial clouds. Examples include Nimbus (Hoffa et al. 2008), FutureGrid (Vockler et al. 2011), and Magellan (Vockler et al. 2011).
- Since astronomy research is usually conducted by national organizations, some build their own cloud platforms, such as CANFAR (Gaudet et al. 2010).
Issues/Gaps
- Data transfer (Vockler et al. 2011, Berriman et al. 2010)
- Cost of transferring and storing very large input/output datasets on commercial cloud services (Berriman et al. 2010).
- S3 is at a cost disadvantage for workflows with many files, since Amazon charges a fee per S3 transaction (Juve et al. 2010).
- The HPC cluster environment must be replicated in the cloud, or the application must be modified (Jackson et al. 2010).
- Clouds perform poorly on workflows with a large number of small files (Juve et al. 2010).
A study of cost and performance of the application of cloud computing to Astronomy 1
...
Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance 3
Summary
Content similar to the first paper.
Workflow
Data
Cloud platform
...
Scientific workflow applications on Amazon EC2 4
Summary
Content similar to the first paper.
Workflow
Data
Cloud platform
...
Data Sharing Options for Scientific Workflows on Amazon EC2 5
Summary
- Choice of storage system has a significant impact on workflow runtime
- Investigated data management options in the cloud for workflow applications
Workflow
- Montage: high I/O, low Memory, low CPU
- Broadband: medium I/O, high memory, medium CPU
- Epigenome: low I/O, medium memory, high CPU
Data
Cloud platform
Comparison:
- Amazon EC2/S3
- NFS
- GlusterFS
- PVFS
Cloud performance
- S3 produces good performance for one application due to the use of caching in the implementation of the S3 client
- S3 performs poorly on workflows with a large number of small files
- Cost of S3 is at a disadvantage for workflows with many files, because Amazon charges a fee per S3 transaction
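The per-transaction fee effect can be illustrated with a toy cost model. The fee values below are hypothetical placeholders chosen for illustration, not actual AWS prices:

```python
# Toy model of why per-transaction fees penalize many-small-file workflows.
# Fee values are HYPOTHETICAL placeholders, not current AWS S3 prices.
PUT_FEE = 0.00001   # assumed $ per PUT request
GET_FEE = 0.000001  # assumed $ per GET request

def s3_request_cost(n_files, reads_per_file=1):
    """Request cost of staging n_files into S3, each read reads_per_file times."""
    return n_files * PUT_FEE + n_files * reads_per_file * GET_FEE

# The same 10 GB of data, split two different ways:
few_large = s3_request_cost(100)         # 100 files of ~100 MB each
many_small = s3_request_cost(1_000_000)  # 1M files of ~10 KB each

print(f"few large files:  ${few_large:.4f}")
print(f"many small files: ${many_small:.2f}")
```

Because the request fee is independent of object size, splitting the same volume of data into 10,000x more files multiplies the request cost by 10,000x, which matches the observation that S3 costs are dominated by transaction counts for file-heavy workflows.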
Issues/Gaps
Using MapReduce for Image Coaddition 6
Summary
- The paper presents an implementation and evaluation of image coaddition within the MapReduce data-processing framework, using Hadoop.
Workflow
Data
- Processed dataset containing 100,000 individual FITS files
Cloud platform
- Hadoop on cluster
Cloud performance
- Processed 100,000 files (300 million pixels) in three minutes on a 400-node cluster
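Coaddition maps naturally onto the MapReduce model. The sketch below is a minimal pure-Python illustration of that mapping, not the Hadoop/FITS pipeline of Wiley et al. (2011): images are represented as dicts from sky coordinate to flux, the map phase emits per-pixel contributions, and the reduce phase averages all contributions that land on the same coordinate:

```python
from collections import defaultdict

# Minimal sketch of image coaddition expressed as MapReduce.
# Illustrates the programming model only; images here are toy dicts of
# (x, y) -> flux, not calibrated FITS data.

def map_phase(image):
    # Map: emit each pixel value keyed by its sky coordinate.
    for coord, flux in image.items():
        yield coord, flux

def reduce_phase(pairs):
    # Reduce: average all flux values sharing the same coordinate.
    groups = defaultdict(list)
    for coord, flux in pairs:
        groups[coord].append(flux)
    return {c: sum(v) / len(v) for c, v in groups.items()}

# Two overlapping toy "exposures":
img_a = {(0, 0): 1.0, (0, 1): 2.0}
img_b = {(0, 1): 4.0, (1, 1): 6.0}

pairs = [kv for img in (img_a, img_b) for kv in map_phase(img)]
coadd = reduce_phase(pairs)
print(coadd)  # {(0, 0): 1.0, (0, 1): 3.0, (1, 1): 6.0}
```

In Hadoop the shuffle stage performs the grouping-by-coordinate automatically between the map and reduce phases, which is what lets the coaddition scale across hundreds of nodes.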
Issues/Gaps
CANFAR: Canadian Advanced Network for Astronomical Research 7
Summary
- The Canadian Advanced Network For Astronomical Research (CANFAR) is a project delivering a network-enabled platform for the access, processing, storage, analysis, and distribution of very large astronomical datasets
Workflow
Data
Cloud platform
Comparison of processing models
| | Grid | Cloud | CANFAR |
|---|---|---|---|
| Ample CPU cycles | | | |
| Job scheduling | | | |
| User-customized environment | | | |
| Resource sharing | | | |
| Portability of environment | | | |
Cloud performance
Issues/Gaps
A Multi-Dimensional Classification Model for Scientific Workflow Characteristics 8
Summary
- A multi-dimensional classification model is presented with workflow examples.
Workflow
- Astronomy workflow:
- The Pan-STARRS (Panoramic Survey Telescope And Rapid Response System) project is a continuous survey of the entire sky
- The PSLoad workflow stages incoming data files from the telescope pipeline and loads them into individual relational databases each night
- The PSMerge workflow updates the production databases that astronomers query each week with the data staged during that week
Data
Cloud platform
Cloud performance
...
On the use of cloud computing for scientific workflows 10
Summary
- Montage is a widely used astronomy application with short job runtimes.
- The virtual environment can provide good compute-time performance, but it can suffer from resource scheduling delays and slow wide-area communications.
Workflow
- Montage
Data
Cloud platform
- University of Chicago's 16-node TeraPort cluster with Nimbus science cloud
- Globus
Cloud performance
Issues/Gaps
- Large overheads from jobs waiting in the Condor and resource queues
- Task clustering techniques could reduce the scheduling overheads
Experiences using cloud computing for a scientific workflow application 11
Summary
- An application for processing astronomy data released by NASA's Kepler project, which searches for Earth-like planets orbiting other stars.
Workflow
- The workflow is deployed across multiple clouds using the Pegasus Workflow Management System
- Allocate 6 nodes with 8 cores each in all cases
Data
Cloud platform
Comparison:
- FutureGrid with Eucalyptus
- Magellan with Eucalyptus
- Amazon EC2
Cloud performance
- Runtime is longer on EC2 due to (1) a lower CPU speed and (2) poor WAN performance.
Issues/Gaps
- Better utilization of remote resources
- Different clustering strategies: explore the benefits of different task cluster sizes
- Submit host management
- Alternative data staging mechanisms: explore different protocols and storage solutions
References
- Berriman, G.B. et al. (2010a). Sixth IEEE International Conference on e-Science, 1-7.
- Jackson, K.R. et al. (2010). Proc. ACM International Symposium on High Performance Distributed Computing (HPDC), 421-429.
- Berriman, G.B. et al. (2010b). SPIE Conference 7740: Software and Cyberinfrastructure for Astronomy.
- Juve, G. et al. (2009). Cloud Computing Workshop in Conjunction with e-Science, Oxford, UK: IEEE.
- Juve, G. et al. (2010). SC '10.
- Wiley, K. et al. (2011). Publications of the Astronomical Society of the Pacific 123, 366-380.
- Gaudet, S. et al. (2010). Proc. SPIE.
- Ramakrishnan, L. et al. (2010). WANDS '10.
- Simmhan, Y. et al. (2009). ADVCOMP '09.
- Hoffa, C. et al. (2008). eScience '08.
- Vockler, J. et al. (2011). ScienceCloud '11.
...