Scientific Cloud Computing Survey White Paper

Part 1. Introduction of Cloud Computing Technology

In this survey, we use the NIST definition and categorization of cloud computing. Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. A cloud can be offered under several service models, namely Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), and can be deployed as one of the following types:

  • Private Cloud
  • Community Cloud
  • Public Cloud
  • Hybrid Cloud

Further details of the definition can be found in the wiki under the section Cloud Definition: https://wiki.ncsa.illinois.edu/display/CLOUD/Cloud+Definition

Part 2. Science Stories and Requirements for the Cloud

Scientific applications have been implemented on cloud computing resources in many fields, including Biology/Bioinformatics (Stein 2010, Schatz et al. 2010), Geospatial Information Systems (GIS) (Yang et al. 2011), Astronomy, and Environmental Science.

Because requirements differ from one scientific field to another, cloud computing applications in each field have a different focus. In Biology/Bioinformatics, many applications such as DNA sequencing require high-throughput processing of large data sets (Schatz et al. 2010, Langmead et al. 2009). Many open-source projects can be readily deployed in the cloud, such as Myrna (Langmead et al. 2010), CloudBLAST (Matsunaga et al. 2008), and Galaxy (Afgan et al. 2010). Cloud computing workflows in the Geospatial sciences mainly involve data storage and processing (Cui et al. 2010, Huang et al. 2010, Park et al. 2011, Yang et al. 2010, Bunzel et al. 2010), as well as simulation and modeling. A major IT challenge in the Geospatial sciences is handling access by massive numbers of concurrent users (Huang et al. 2010, Bernstein et al. 2010, Wang et al. 2010, Janakiraman et al. 2010, Blower et al. 2010). In Astronomy, cloud computing is used mainly for data processing, such as processing telescope images (Berriman et al. 2010, Jackson et al. 2010, Berriman et al. 2010(2), Hoffa et al. 2008), and for data sharing (Juve et al. 2010). In the Environmental sciences, cloud computing is used mainly for modeling, such as ocean climate modeling (Evangelinos et al. 2008) and groundwater modeling (Hunt et al. 2010). It is also used for data analysis, such as the parallel execution of sequential data analysis tasks (Hasenkamp et al. 2010).

The most commonly used cloud service model in scientific applications is IaaS. Amazon's cloud services are the most popular cloud platform in almost all scientific fields because existing techniques are convenient to implement on them. For example, in Biology/Bioinformatics, most applications use Linux-based systems and technologies, which can be easily deployed on Amazon EC2 (Gunarathne et al. 2010, Qiu et al., Langmead et al. 2010, Vecchiola et al. 2009, Nguyen et al. 2011, Afgan et al. 2010). Amazon's cloud services are also popular in Astronomy (Berriman et al. 2010, Jackson et al. 2010, Juve et al. 2009, Vockler et al. 2011), GIS (Huang et al. 2010, Janakiraman et al. 2010, Bunzel et al. 2010), and the Environmental sciences (Evangelinos et al. 2008, He et al. 2010). Other community IaaS cloud platforms are also used because they are cost-effective compared with commercial clouds. For example, FutureGrid (Qiu et al. 2010) and Magellan (Taylor et al. 2010) are used in Bioinformatics applications; Nimbus (Hoffa et al. 2008), FutureGrid (Vockler et al. 2011), and Magellan (Vockler et al. 2011) are used in Astronomy; GoGrid (Hunt et al. 2010, He et al. 2010) has also been used; OpenNebula (Park et al. 2011) is used in a GIS application; and FutureGrid with Nimbus and Eucalyptus (Fox et al. 2011) and Magellan with Eucalyptus (Hasenkamp et al. 2010) are used in Environmental sciences applications.

PaaS is also used in scientific fields. For example, Microsoft Azure is used in Biology/Bioinformatics applications (Qiu et al. 2009, Qiu et al. 2010, Lu et al. 2010), groundwater risk analysis (Liu et al. 2010, 2011), and Astronomy (Eye on Earth project). Google App Engine is used in GIS (Blower et al. 2010). Researchers choose a PaaS platform when a technology they need is built on that specific platform; for example, the MapReduce implementation Dryad is based on the Microsoft platform (Qiu et al. 2009).

The table below lists the cloud platforms used in scientific applications.

Many scientific computing applications involve parallel algorithms. MapReduce, a framework that supports distributed computing over large data sets on clusters, is popular in scientific applications such as sequencing analysis in Biology/Bioinformatics (Langmead et al. 2009, Gunarathne et al. 2010). Hadoop is a free, open-source implementation of MapReduce and is commonly used in fields such as Biology/Bioinformatics and Astronomy (Wiley et al. 2011). Other MapReduce implementations and extensions are also used, such as Microsoft Dryad (Qiu et al. 2009, Lu et al. 2010) and Twister (Qiu et al. 2010).
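To make the MapReduce model concrete, the following is a minimal single-process sketch in Python that counts k-mers in a few short reads. It is illustrative only: the function names (`map_read`, `shuffle`, `reduce_counts`, `run_mapreduce`), the example reads, and the value of k are assumptions made for this example and are not taken from any of the cited projects. A real framework such as Hadoop would distribute the map and reduce phases across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_read(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


def run_mapreduce(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    emitted = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(emitted)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    example_reads = ["ACGTAC", "CGTACG"]  # hypothetical reads
    print(run_mapreduce(example_reads, k=3))
```

Because each call to `map_read` depends only on its own read, and each call to `reduce_counts` depends only on its own key, both phases can run in parallel on separate machines, which is the property that makes the model attractive for large sequencing data sets.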

Practical experience and experiments with scientific computing applications in the cloud demonstrate advantages such as shorter data processing times and reduced cost. For instance, in the Crossbow project, a human sample comprising 2.7 billion reads can be genotyped by Crossbow on the Amazon cloud in about 4 hours, including data upload time, at a cost of about $85 (Langmead et al. 2009). In the work of Schadt et al., 1 PB of data can be traversed on a 1,000-node instance on Amazon EC2 in about 350 minutes at a cost of about $2,040 (Schadt et al. 2011). Data volumes in Astronomy applications are usually very large. For example, astronomical sky surveys generate tens of terabytes of image data and detect hundreds of millions of sources every night (Wiley et al. 2011). Cloud computing can reduce the data processing time: in the experiment of Jackson et al., 20 TB of data can be processed in about 7 hours with an 80-core Amazon EC2 instance (Jackson et al. 2010).
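The figures quoted above can be turned into rough per-worker throughput and per-terabyte cost with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a benchmark. It assumes decimal units (1 PB = 1,000 TB, 1 TB = 10^6 MB), assumes the work is spread evenly over the stated 1,000 nodes or 80 cores, and uses helper names (`throughput_mb_per_s`, `cost_per_tb`) that are hypothetical.

```python
def throughput_mb_per_s(data_tb: float, minutes: float, workers: int = 1) -> float:
    """Average throughput per worker in MB/s, assuming 1 TB = 10**6 MB."""
    total_mb = data_tb * 1_000_000
    seconds = minutes * 60
    return total_mb / seconds / workers


def cost_per_tb(total_cost_usd: float, data_tb: float) -> float:
    """Average cost in US dollars per terabyte processed."""
    return total_cost_usd / data_tb


if __name__ == "__main__":
    # Schadt et al. figures: 1 PB, 1,000 nodes, about 350 minutes, about $2,040
    print(f"{throughput_mb_per_s(1000, 350, workers=1000):.1f} MB/s per node")
    print(f"${cost_per_tb(2040, 1000):.2f} per TB")
    # Jackson et al. figures: 20 TB, 80 cores, about 7 hours
    print(f"{throughput_mb_per_s(20, 7 * 60, workers=80):.1f} MB/s per core")
```

Under these assumptions, the Schadt et al. figures work out to roughly 48 MB/s per node and about $2 per terabyte, and the Jackson et al. figures to roughly 10 MB/s per core.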

Part 3. Cloud Computing Platforms and Tools

We survey several popular Cloud computing platforms and tools. These include the following:

1. Cloud Services

1.1 Community Clouds
FutureGrid
Magellan
Science Clouds

1.2 Public Clouds
Amazon
AT&T Synaptic
GoGrid
Google App Engine
Microsoft Azure
Rackspace

Joyent: http://www.joyent.com/

2. Cloud Software

2.1 Cloud Applications
Hadoop Distributed File System
Hadoop MapReduce
Sector & Sphere

2.2 Cloud Platforms
Eucalyptus
Nimbus
OpenNebula
OpenStack
WSO2 Stratos

2.3 Multi-Cloud API
Apache Deltacloud
Apache LibCloud
JClouds
Jets3t
SMEStorage
Typica

Part 4. Gap Analysis and Known Issues

Several issues have arisen in the practice of scientific cloud computing. The first is that the input data, which are usually large, must be deposited in a cloud resource before a cloud program can run over the data set. The rate at which data are generated must therefore be assessed against the achievable transfer speeds (Schatz et al. 2010). Currently, one option is to use a high-speed Internet connection; another is to ship physical hard drives to the cloud vendor.
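The trade-off between network transfer and shipping drives can be estimated with a short calculation. The sketch below is only an illustration: the function names, the example bandwidths, the data volumes, and the shipping time are all hypothetical values chosen for this example, and it assumes decimal units (1 TB = 10^12 bytes) and a link whose usable fraction is given by `efficiency`.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link of bandwidth_mbps megabits/s."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86_400


def link_keeps_up(generated_tb_per_day: float, bandwidth_mbps: float,
                  efficiency: float = 1.0) -> bool:
    """True if one day's worth of generated data can be uploaded within a day."""
    return transfer_days(generated_tb_per_day, bandwidth_mbps, efficiency) <= 1.0


def faster_option(data_tb: float, bandwidth_mbps: float, shipping_days: float,
                  efficiency: float = 1.0) -> str:
    """Return 'network' or 'shipping', whichever delivers the data sooner."""
    network_days = transfer_days(data_tb, bandwidth_mbps, efficiency)
    return "network" if network_days <= shipping_days else "shipping"


if __name__ == "__main__":
    # Hypothetical scenario: 10 TB of input, a 100 Mbit/s link, 3-day courier
    print(f"{transfer_days(10, 100):.1f} days over the network")
    print(faster_option(10, 100, shipping_days=3))
```

In this hypothetical scenario the upload takes a little over nine days, so shipping drives would deliver the data sooner; with a 1 Gbit/s link the same function favors the network.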

Another issue is cost. Although cloud computing avoids the costs associated with local equipment maintenance and staffing, the cost models of current cloud service providers are complex, which makes it difficult for scientific cloud computing users to determine the actual full cost (Truong et al. 2010). For instance, Juve et al. found that Amazon S3 is at a cost disadvantage for workflows with many files because Amazon charges a fee per S3 transaction (Juve et al. 2010).
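The effect of a per-transaction fee on workflows with many files can be illustrated with a simple cost model. This is a sketch under stated assumptions, not a reproduction of any provider's billing: the prices are placeholder values rather than actual Amazon S3 prices, and the function and parameter names are hypothetical.

```python
def request_fee(n_requests: int, price_per_1000_requests: float) -> float:
    """Fee charged for n_requests storage transactions."""
    return n_requests / 1000 * price_per_1000_requests


def workflow_storage_cost(total_gb: float, n_files: int, requests_per_file: int,
                          price_per_gb_month: float, price_per_1000_requests: float,
                          months: float = 1.0) -> float:
    """Capacity charge plus per-transaction charge for one workflow."""
    capacity = total_gb * price_per_gb_month * months
    requests = request_fee(n_files * requests_per_file, price_per_1000_requests)
    return capacity + requests


if __name__ == "__main__":
    # Placeholder prices, NOT real provider prices
    prices = dict(price_per_gb_month=0.10, price_per_1000_requests=0.01)
    # Same 100 GB of data, stored as few large files or many small files
    few = workflow_storage_cost(100, n_files=1_000, requests_per_file=2, **prices)
    many = workflow_storage_cost(100, n_files=10_000_000, requests_per_file=2, **prices)
    print(f"few large files: ${few:.2f}, many small files: ${many:.2f}")
```

With the same data volume, the capacity charge is identical in both cases, so the entire difference comes from the number of transactions, which is why workflows that touch many small files are the ones penalized by this kind of pricing.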

Security and privacy are another concern for scientific cloud computing users. In GIS applications, many location-based services involve users' location and identity information, so location and identity privacy need to be considered (Wang et al. 2010). In addition, a country's geospatial data are usually so sensitive that storing them in a cloud provided by a foreign organization raises concern (Yang et al. 2010).
