h1. Survey White Paper Outline


h2. Part 1. Introduction of Cloud Computing Technology


h2. Part 2. Science Stories and Requirements for the Cloud

Scientific applications have been deployed on cloud computing resources in many fields, including Biology/Bioinformatics ([Stein 2010|http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2898083/?report=abstract], [Schatz _et al._ 2010|http://www.nature.com/nbt/journal/v28/n7/full/nbt0710-691.html]), Geographic Information Systems (GIS) ([Yang _et al._ 2011|http://cisc.gmu.edu/scc/readings/spatial_cloud_computing.pdf]), Astronomy, and Environmental sciences.

Because requirements differ between science areas, cloud computing applications focus on different things in each. In the Biology/Bioinformatics area, many applications such as DNA sequencing require high-throughput processing of large data sets (Schatz _et al._ 2010, Langmead _et al._ 2009). Many open-source projects, such as Myrna (Langmead _et al._ 2010), CloudBLAST (Matsunaga _et al._ 2008), and Galaxy (Afgan _et al._ 2010), can be easily deployed in the cloud. The cloud computing workflow in Geospatial sciences mainly involves data storage and processing ([Cui _et al._ 2010|http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5532992&tag=1], [Huang _et al._ 2010|http://portal.acm.org/citation.cfm?doid=1869692.1869699], [Park _et al._ 2011|http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5746010], [Yang _et al._ 2010|http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5602628&tag=1], [Bunzel _et al._ 2010|http://portal.acm.org/citation.cfm?doid=1823854.1823894]) as well as simulation and modeling. A further major IT challenge in Geospatial sciences is handling access by a massive number of concurrent users (Huang _et al._ 2010, [Bernstein _et al._ 2010|http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5635224], [Wang _et al._ 2010|http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5489727], [Janakiraman _et al._ 2010|http://portal.acm.org/citation.cfm?doid=1869790.1869813], [Blower _et al._ 2010|http://portal.acm.org/citation.cfm?doid=1823854.1823893]). In Astronomy, cloud computing is used mainly for data processing, such as processing telescope images (Berriman _et al._ 2010, Jackson _et al._ 2010, Berriman _et al._ 2010(2), Hoffa _et al._ 2008), and for data sharing (Juve _et al._ 2010).
In Environmental sciences, cloud computing is used mainly for modeling, such as ocean climate modeling (Evangelinos _et al._ 2008) and groundwater modeling (Hunt _et al._ 2010). It is also used for data analysis, such as running sequential data analysis tasks in parallel (Hasenkamp _et al._ 2010).

IaaS is the most commonly used cloud service model in scientific applications. [Amazon Cloud services|CLOUD:Amazon] is the most popular cloud platform in almost all scientific areas because existing techniques are convenient to deploy on it. For example, in the Biology/Bioinformatics area, most applications use Linux-based systems and technologies that can be easily deployed on Amazon EC2 (Gunarathne _et al._ 2010, Qiu _et al._, Langmead _et al._ 2010, Vecchiola _et al._ 2009, Nguyen _et al._ 2011, Afgan _et al._ 2010). Amazon cloud services are also popular in Astronomy (Berriman _et al._ 2010, Jackson _et al._ 2010, Juve _et al._ 2009, Vockler _et al._ 2011), GIS (Huang _et al._ 2010, Janakiraman _et al._ 2010, Bunzel _et al._ 2010), and Environmental sciences (Evangelinos _et al._ 2008, He _et al._ 2010). Other community IaaS cloud platforms are also used because they are more cost-effective than commercial clouds. For example, [FutureGrid|CLOUD:FutureGrid] (Qiu _et al._ 2010) and [Magellan|CLOUD:Magellan] (Taylor _et al._ 2010) are used in Bioinformatics applications; [Nimbus|CLOUD:Nimbus] (Hoffa _et al._ 2008), FutureGrid (Vockler _et al._ 2011), and Magellan (Vockler _et al._ 2011) are used in Astronomy; [OpenNebula|CLOUD:OpenNebula] (Park _et al._ 2011) is used in GIS applications; and [GoGrid|CLOUD:GoGrid] (Hunt _et al._ 2010, He _et al._ 2010), FutureGrid with Nimbus and [Eucalyptus|CLOUD:Eucalyptus] (Fox _et al._ 2011), and Magellan with Eucalyptus (Hasenkamp _et al._ 2010) are used in Environmental sciences applications.

PaaS is also used in scientific areas. For example, [Microsoft Azure|CLOUD:Microsoft Azure] is used in Biology/Bioinformatics applications (Qiu _et al._ 2009, Qiu _et al._ 2010, Lu _et al._ 2010) and in Environmental sciences ([Eye on Earth project|http://www.eyeonearth.eu/]). Google App Engine is used in the GIS area (Blower _et al._ 2010). Researchers choose a PaaS platform when a technology they need is built on that specific platform; for example, Dryad, a MapReduce implementation, is based on the Microsoft platform (Qiu _et al._ 2009).

Table 1 lists the cloud platforms used in the surveyed scientific applications.
{table-plus:title=Table 1: Statistics of cloud platforms used in scientific applications (Sample: Astro: 11 papers, Bio: 15 papers, Env: 10 papers, GIS: 11 papers)}
|| || Astronomy || Biology || Environmental || GIS ||
| Amazon | 6 | 9 | 2 | 3 |
| Azure | | 4 | 1 | |
| Google App Engine | | | | 1 |
| FutureGrid | 1 | 1 | 1 | |
| Magellan | 1 | | 1 | |
| GoGrid | | | 2 | |
| Eucalyptus | 1 | | 2 | |
| Nimbus | | | 1 | |
| OpenNebula | | | | 1 |
| IBM Grid | | | 1 | |
{table-plus}

Many scientific computing applications involve parallel algorithms. MapReduce, a framework for distributed computing over large data sets on clusters, is popular in scientific applications such as sequencing analysis in Biology/Bioinformatics (Langmead _et al._ 2009, Gunarathne _et al._ 2010). [Hadoop|CLOUD:Hadoop MapReduce] is a free, open-source implementation of MapReduce that is commonly used in science areas such as Biology/Bioinformatics and Astronomy (Wiley _et al._ 2011). Other MapReduce implementations and extensions, such as Microsoft Dryad (Qiu _et al._ 2009, Lu _et al._ 2010) and Twister (Qiu _et al._ 2010), are also used.
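
To make the MapReduce pattern concrete, the following is a minimal single-process sketch in Python that counts k-mers in a handful of sequence reads. It is not how Hadoop, Dryad, or Twister are implemented and is not taken from any of the cited projects; the function names, the toy reads, and the choice of k-mer counting as the example task are all assumptions made for illustration. A real framework would run the map and reduce steps on many nodes and perform the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every length-k substring of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def kmer_counts(reads, k=3):
    """Run map, shuffle, and reduce in-process over a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two hypothetical reads; real inputs would be far larger.
    print(kmer_counts(["ACGTAC", "CGTACG"]))
```

Because each read is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is what makes the pattern attractive for sequencing workloads.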

Practical experience and experiments with scientific computing in the cloud demonstrate advantages such as shorter data processing times and lower cost. For instance, in the Crossbow project, a human sample comprising 2.7 billion reads can be genotyped by Crossbow on the Amazon cloud in about 4 hours, including data upload time, at a cost of about $85 (Langmead _et al._ 2009). In the work of Schadt _et al._, 1 PB of data can be traversed on a 1,000-node cluster on Amazon EC2 within \~350 minutes at a cost of about $2,040 (Schadt _et al._ 2011). Data volumes in Astronomy applications are usually very large. For example, astronomical surveys of the sky generate tens of terabytes of image data and detect hundreds of millions of sources every night (Wiley _et al._ 2011). Cloud computing can reduce the processing time for such data. In the experiment of Jackson _et al._, 20 TB of data can be processed in \~7 hours using 80 cores on Amazon EC2 (Jackson _et al._ 2010).
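
The figures quoted above can be turned into rough throughput and unit-cost numbers, which makes the reported results easier to compare. The short Python sketch below does that arithmetic. It assumes decimal units (1 PB taken as 1,000 TB), treats the quoted times and costs as exact even though the sources give them as approximations, and uses helper names that are purely illustrative.

```python
def throughput_mb_per_s(data_tb, minutes):
    """Aggregate throughput in MB/s, using decimal units (1 TB = 10**6 MB)."""
    return data_tb * 1_000_000 / (minutes * 60)


def cost_per_tb(total_cost_usd, data_tb):
    """Average cost in US dollars per terabyte processed."""
    return total_cost_usd / data_tb


# Approximate figures as quoted in the text; 1 PB is taken as 1,000 TB.
SCHADT = {"data_tb": 1000, "minutes": 350, "cost_usd": 2040, "nodes": 1000}
JACKSON = {"data_tb": 20, "minutes": 7 * 60}

if __name__ == "__main__":
    aggregate = throughput_mb_per_s(SCHADT["data_tb"], SCHADT["minutes"])
    per_node = aggregate / SCHADT["nodes"]
    print(f"Schadt et al.: {aggregate / 1000:.1f} GB/s aggregate, "
          f"{per_node:.1f} MB/s per node, "
          f"${cost_per_tb(SCHADT['cost_usd'], SCHADT['data_tb']):.2f} per TB")
    print(f"Jackson et al.: "
          f"{throughput_mb_per_s(JACKSON['data_tb'], JACKSON['minutes']):.0f} MB/s")
```

Under these assumptions the petabyte traversal works out to roughly 48 MB/s per node and about $2 per terabyte, and the 20 TB run to a little under 800 MB/s in aggregate.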

h2. Part 3. Cloud Computing Platforms and Tools


h2. Part 4. Gap Analysis and Known Issues

Several issues arise in the practice of scientific cloud computing. The first is that the input data, which are usually large, must be deposited in a cloud resource before a cloud program can run over the data set. The rate at which data are generated must therefore be assessed against the transfer speeds that can be achieved (Schatz _et al._ 2010). Currently, one option is to use a high-speed Internet connection. Another option is to ship physical hard drives to the cloud vendor.
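
A back-of-the-envelope calculation helps when weighing a network upload against shipping drives. The Python sketch below estimates upload time for a data set and the daily volume a link can sustain. The 80% link efficiency, the 48-hour shipping turnaround, and the function names are assumptions chosen for illustration, not measured values or figures from the cited papers.

```python
def network_transfer_hours(data_tb, link_gbps, efficiency=0.8):
    """Hours needed to upload data_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    Decimal units are used: 1 TB = 8e12 bits, 1 Gbps = 1e9 bits/s.
    """
    bits = data_tb * 8e12
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


def sustained_tb_per_day(link_gbps, efficiency=0.8):
    """Terabytes per day that the link can carry at the assumed efficiency."""
    return link_gbps * 1e9 * efficiency * 86400 / 8e12


def faster_option(data_tb, link_gbps, shipping_hours, efficiency=0.8):
    """Return whichever option delivers the data sooner under these assumptions."""
    upload = network_transfer_hours(data_tb, link_gbps, efficiency)
    return "network" if upload <= shipping_hours else "ship drives"


if __name__ == "__main__":
    # Hypothetical case: 20 TB, a 1 Gbps link, 48 hours door-to-door shipping.
    print(f"{network_transfer_hours(20, 1.0):.1f} hours by network")
    print(f"{sustained_tb_per_day(1.0):.2f} TB/day sustained")
    print(faster_option(20, 1.0, shipping_hours=48))
```

If an instrument generates more data per day than the sustained figure, uploads can never catch up, which is the compatibility problem described above.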

Another issue is cost. Although cloud computing avoids the costs associated with local equipment maintenance and staffing, the cost models of current cloud service providers are complex, which makes it difficult for scientific cloud computing users to determine the actual full cost (Truong _et al._ 2010). For instance, Juve _et al._ found that Amazon S3 is at a cost disadvantage for workflows with many files because Amazon charges a fee per S3 transaction (Juve _et al._ 2010).
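
The effect of per-transaction fees can be illustrated with a simple itemized cost model. The Python sketch below is an assumption-laden illustration: the rates in the price sheet are placeholders and not actual Amazon prices, the class and function names are hypothetical, and real provider pricing has more components than the four shown here.

```python
from dataclasses import dataclass


@dataclass
class PriceSheet:
    """Placeholder rates in US dollars; NOT actual provider prices."""
    compute_per_hour: float
    storage_per_gb_month: float
    per_1000_requests: float
    transfer_per_gb: float


def workflow_cost(prices, instance_hours, stored_gb, months, requests, transferred_gb):
    """Return an itemized cost estimate for one workflow run."""
    items = {
        "compute": instance_hours * prices.compute_per_hour,
        "storage": stored_gb * months * prices.storage_per_gb_month,
        "requests": requests / 1000 * prices.per_1000_requests,
        "transfer": transferred_gb * prices.transfer_per_gb,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    prices = PriceSheet(compute_per_hour=0.5, storage_per_gb_month=0.1,
                        per_1000_requests=0.01, transfer_per_gb=0.1)
    # Same data volume and compute time, but very different file counts.
    few_files = workflow_cost(prices, 10, 100, 1, requests=1_000, transferred_gb=100)
    many_files = workflow_cost(prices, 10, 100, 1, requests=1_000_000, transferred_gb=100)
    print(few_files)
    print(many_files)
```

With identical data volume and compute time, only the request line differs between the two runs, which shows how a workflow with many small files can cost noticeably more on storage that charges per transaction.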

Security and privacy are another concern for scientific cloud computing users. In GIS applications, many location-based services involve users' location and identity information, so both location privacy and identity privacy need to be considered (Wang _et al._ 2010). Also, a country's geospatial data are usually so sensitive that storing them in a cloud provided by a foreign organization raises concern (Yang _et al._ 2010).

h2. Part 5. Recommendations

What should NCSA do as a next step? Should we invest some time in setting up a private cloud, or some cloud tools?