Overview

Bioinformatics applications such as DNA sequencing analysis must process very large volumes of data. One option for closing the gap between data set size and available computing speed is to use the large pool of computational resources available 'in the cloud'.

Tools and methods

To make use of cloud computing, one needs to develop a software pipeline that combines newly developed or existing tools (e.g., the Hadoop implementation of MapReduce) and runs in the cloud (e.g., on Amazon EC2 and S3).

Typical methods and platforms

Cloud services
Commercial public cloud such as Amazon Elastic Compute Cloud ^a (Amazon EC2)
Academic cloud such as US Department of Energy Magellan Cloud (http://magellan.alcf.anl.gov)

Bioinformatics cloud services
http://dnanexus.com
http://www.spiralgenetics.com

Methods
A typical method for bioinformatics cloud applications is to run the computation as parallel steps (e.g., Map and Reduce) in a program built on a MapReduce framework such as Hadoop.
Example: the Crossbow project for SNP discovery uses a map-shuffle-scan framework. See Story 1 for details.

Common related tools

Bowtie: Ultrafast short read alignment
Hadoop: Open Source MapReduce
Contrail: Cloud-based de novo assembly
Cloudburst: Sensitive MapReduce alignment
Myrna: Cloud, differential gene expression
Tophat: RNA-Seq splice junction mapper
Cufflinks: Isoform assembly, quantitation
SoapSNP: Accurate SNP/consensus calling

Advantages

Cloud performance
Because of the large pool of resources available in the cloud, performance is usually significantly enhanced.
For example, in the benchmark test on the Amazon cloud in Story 1, a human sample comprising 2.7 billion reads was genotyped by Crossbow in about 4 hours, including the time for uploading the raw data.

Reusability and reproducibility
The use of both standard software (Hadoop) and standard hardware (utility computing) affords reusability and reproducibility.

Cost
Cost is usually competitive compared with building a local facility, which consumes space, power, cooling and staff support.
Cloud computing resources provided by third-party providers are 'pay-as-you-go'.
In the Crossbow example above, the total cost was only about $85.

Issues

Data transfer
The input data, which are usually large, must be deposited in a cloud resource before a cloud program can run over them, so the rate at which data are generated must be assessed against the transfer speeds achievable.
Option 1: High-speed Internet, such as Internet2 and JANET.
Option 2: Ship physical hard drives to the cloud vendor (http://aws.amazon.com/importexport).
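
One rough way to make this assessment is to compare the volume of data produced per day with what a given network link can move in a day. The sketch below is illustrative only: the function names, the decimal-GB convention, the `efficiency` factor and the example figures are assumptions, not values taken from the sources.

```python
def transfer_hours(size_gb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_gb gigabytes over a link of bandwidth_mbps megabits/s.

    efficiency is the (assumed) fraction of nominal bandwidth actually achieved.
    """
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    size_megabits = size_gb * 8 * 1000  # decimal GB -> megabits
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def keeps_up(daily_output_gb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> bool:
    """True if one day's output can be uploaded in 24 hours or less."""
    return transfer_hours(daily_output_gb, bandwidth_mbps, efficiency) <= 24.0


if __name__ == "__main__":
    # Hypothetical lab: 100 GB of reads per day over a 100 Mbit/s link.
    print(f"{transfer_hours(100, 100):.2f} h per day of data")
    print("link keeps up:", keeps_up(100, 100))
```

When the link cannot keep up, shipping physical drives (Option 2) becomes the more practical choice.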

Data security and privacy
Policies on the security of data storage and processing in the cloud are still under development. Users need to determine whether cloud computing is compatible with any privacy or security requirements associated with their institutions.

Applications redesigning

Usability

Bioinformatics Cloud Resources

Applications

CloudBLAST	Scalable BLAST in the cloud (http://www.acis.ufl.edu/~ammatsun/mediawiki-1.4.5/index.php/CloudBLAST_Project)
CloudBurst	Highly sensitive short-read mapping (http://cloudburst-bio.sf.net/)
Cloud RSD	Reciprocal smallest distance ortholog detection (http://roundup.hms.harvard.edu)
Contrail	De novo assembly of large genomes (http://contrail-bio.sf.net)
Crossbow	Alignment and SNP genotyping (http://bowtie-bio.sf.net/crossbow)
Myrna (B.L., K. Hansen and J. Leek, unpublished data)	Differential expression analysis of mRNA-seq (http://bowtie-bio.sf.net/myrna)
Quake (D.R. Kelley, M.C.S. and S.L.S., unpublished data)	Quality guided correction of short reads (https://github.com/davek44/error_correction/)

Analysis environments and data sets

AWS Public Data	Cloud copies of Ensembl, GenBank, 1000 Genomes and other data (http://aws.amazon.com/publicdatasets/)
CloVR	Genome and metagenome annotation and analysis (http://clover.igs.umaryland.edu/)
Cloud BioLinux	Genome assembly and alignment (http://www.cloudbiolinux.com)
Galaxy	Platform for interactive large-scale genome analysis (http://galaxy.psu.edu)

Story 1 Crossbow project: SNPs searching with cloud computing ²

Summary

Crossbow is a Hadoop-based software tool that combines the speed of the short-read aligner Bowtie with the accuracy of the SNP caller SOAPsnp to perform alignment and SNP detection for multiple whole-human datasets per day.
In this work, Crossbow analyzes data comprising 38-fold coverage of a Han Chinese male genome in 4 hours 30 minutes (including transfer time) using a 320-core cluster rented from Amazon EC2.

Tools and methods

Cloud platform: Amazon EC2/S3
Tools
- Bowtie: 'Map', an aligner that finds the best alignment for each read
- Hadoop: 'Shuffle', groups and sorts alignments by region
- SOAPsnp: 'Reduce', scans alignments for divergent columns, accounting for sequencing error and known SNPs
- Scripts that allow Crossbow to run on Amazon EC2
Methods
- Basic steps for running the Crossbow computation (see the figure below)
  1. Step 1 (red): short reads are copied to the permanent store.
  2. Step 2 (green): the cluster is allocated and the scripts driving the computation are uploaded to the master node.
  3. Step 3 (blue): the computation is run. It downloads reads from the permanent store, operates on them, and stores the results in the Hadoop distributed filesystem.
  4. Step 4 (orange): the results are copied to the client machine and the job completes.

- Workflow
  Crossbow uses a 'Map-shuffle-scan' framework (see figure below).
  1. 'Map': Users upload sequencing reads into the cloud storage. Hadoop, running on a cluster of virtual machines in the cloud, maps the unaligned reads to the reference genome using many parallel instances of Bowtie.
  2. 'Shuffle': Hadoop then automatically shuffles the alignments into sorted bins determined by chromosome region.
  3. 'Scan': Many parallel instances of SOAPsnp scan the sorted alignments in each bin. Then the final output is a stream of SNP calls stored within the cloud that can be downloaded.
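
The map-shuffle-scan pattern above can be sketched on a single machine with toy data. This is not Crossbow's code: the reference string, bin size, mismatch limit and majority-vote caller are hypothetical stand-ins for Bowtie, Hadoop's shuffle and SOAPsnp, chosen only to show how the three stages hand data to each other.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTTGCAAGGCTTACCGATAGCT"  # hypothetical toy reference
BIN_SIZE = 8                            # reference positions per bin (hypothetical)


def map_read(read, max_mismatches=1):
    """'Map' (stand-in for Bowtie): find the best ungapped placement of a read
    and emit one (bin, position, base) record per aligned base."""
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[pos:pos + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    if best_pos is None:
        return []  # read did not align within the mismatch limit
    return [((best_pos + i) // BIN_SIZE, best_pos + i, base)
            for i, base in enumerate(read)]


def shuffle(records):
    """'Shuffle' (stand-in for Hadoop): group records by bin, sort by position."""
    bins = defaultdict(list)
    for bin_id, pos, base in records:
        bins[bin_id].append((pos, base))
    return {bin_id: sorted(pileup) for bin_id, pileup in sorted(bins.items())}


def scan_bin(pileup, min_depth=2):
    """'Scan' (stand-in for SOAPsnp): report columns whose majority base
    differs from the reference, given at least min_depth supporting reads."""
    columns = defaultdict(Counter)
    for pos, base in pileup:
        columns[pos][base] += 1
    calls = []
    for pos in sorted(columns):
        base, count = columns[pos].most_common(1)[0]
        if count >= min_depth and base != REFERENCE[pos]:
            calls.append((pos, REFERENCE[pos], base))
    return calls


def call_snps(reads):
    """Run the three stages in sequence; each bin could be scanned in parallel."""
    records = [rec for read in reads for rec in map_read(read)]
    calls = []
    for pileup in shuffle(records).values():
        calls.extend(scan_bin(pileup))
    return calls


if __name__ == "__main__":
    reads = ["ACGTTGCA", "GCAAGTCT", "AAGTCTTA", "GTCTTACC"]
    print(call_snps(reads))  # (position, reference base, observed base)
```

Because every record carries its bin key, the map and scan stages share no state, which is what lets a framework such as Hadoop run many instances of each in parallel.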

Cloud performance

Computation was performed both on a local cluster and on an Amazon EC2 cluster. Data: 2.66 billion reads (~85 Gb)

Local cluster
- Hadoop 0.20 cluster with 10 worker nodes
- Each node: 4-core 3.2 GHz Intel Xeon (40-core total)
- 64-bit Redhat Enterprise Linux Server 5.3
- Each node: 4 GB memory and 366 GB local storage available for the HDFS
- Connection: gigabit ethernet
- Performance: Requires about 1 day of wall clock time to run

Amazon EC2
- Amazon EC2 service on clusters of 40 nodes
- Each node: EC2 Extra Large High CPU Instance (High-CPU XL)
- Each node: a virtualized 64-bit computer with 7 GB of memory and the equivalent of 8 processor cores clocked at approximately 2.5 to 2.8 GHz
- Cost: $0.68 per node per hour (2009 price, unchanged as of 2011; see the current Amazon price list)
- Overall performance on Amazon EC2: discovered 3.7M SNPs in one human genome for ~$100 in an afternoon. Accuracy validated at >99% (see the table in the figure below)
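
The cluster figures above relate to each other through simple arithmetic, sketched below. The node count, cores per node and per-node price come from the list above; the function names and the whole-hour billing rule are assumptions made for illustration. The sketch does not attempt to reproduce the reported totals, which may include charges not modeled here.

```python
import math

NODES = 40                   # EC2 nodes in the cluster
CORES_PER_NODE = 8           # equivalent cores per High-CPU XL node
PRICE_PER_NODE_HOUR = 0.68   # USD, 2009 price quoted above


def cluster_cores(nodes=NODES, cores_per_node=CORES_PER_NODE):
    """Total cores rented."""
    return nodes * cores_per_node


def hourly_rate(nodes=NODES, price=PRICE_PER_NODE_HOUR):
    """Cost of keeping the whole cluster up for one hour, in USD."""
    return nodes * price


def rental_cost(wall_hours, nodes=NODES, price=PRICE_PER_NODE_HOUR):
    """Compute-only cost, assuming partial hours are billed as whole hours."""
    billed_hours = math.ceil(wall_hours)
    return billed_hours * hourly_rate(nodes, price)


if __name__ == "__main__":
    print(cluster_cores(), "cores")
    print(f"${hourly_rate():.2f} per cluster-hour")
    print(f"${rental_cost(2.5):.2f} for a hypothetical 2.5-hour run")
```

The 320 cores computed here match the 320-core cluster described in the story summary.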

Summary

The Crossbow project uses cloud computing (with MapReduce and Hadoop) to efficiently parallelize existing sequence alignment and genotyping algorithms.
By taking advantage of commodity processors available via cloud computing services, Crossbow condenses over 1,000 hours of computation into a few hours without requiring the user to own or operate a computer cluster.
Also, running on standard software (Hadoop) and hardware (EC2 instances) makes it easier to reproduce results and to customize analyses with Crossbow.

Story 2 AzureBlast: A case study of developing science applications on the cloud ⁴

Tools and Methods

Cloud platform: Windows Azure

References

Notes and other links:

a. Learn more about Amazon EC2/S3
b. Learn more about Microsoft Azure

Story 1 A study of cost and performance of the application of cloud computing to Astronomy ¹

The performance of three workflow applications with different I/O, memory and CPU requirements is investigated on Amazon EC2, and the performance of the cloud is compared with that of a typical HPC system (Abe at NCSA).
The goal is to determine which types of scientific workflow applications run cheaply and efficiently on the Amazon EC2 cloud.
The application of cloud computing to the generation of an atlas of periodograms for 210,000 light curves is also described.

Part I - Performance of three workflow applications

Tools and methods

Cloud platform: Amazon EC2 (http://aws.amazon.com/ec2/)
Summary of the processing resources on Amazon EC2 and the Abe high-performance cluster

Type	Architecture	CPU	Cores	Memory	Network	Storage	Price
Amazon EC2
m1.small	32-bit	2.0-2.6 GHz Opteron	1-2	1.7 GB	1 Gbps Ethernet	Local	$0.10/hr
m1.large	64-bit	2.0-2.6 GHz Opteron	2	7.5 GB	1 Gbps Ethernet	Local	$0.40/hr
m1.xlarge	64-bit	2.0-2.6 GHz Opteron	4	15 GB	1 Gbps Ethernet	Local	$0.80/hr
c1.medium	32-bit	2.33-2.66 GHz Xeon	2	1.7 GB	1 Gbps Ethernet	Local	$0.20/hr
c1.xlarge	64-bit	2.0-2.66 GHz Xeon	8	7.5 GB	1 Gbps Ethernet	Local	$0.80/hr
Abe Cluster
abe.local	64-bit	2.33 GHz Xeon	8	8 GB	10 Gbps InfiniBand	Local	N/A
abe.lustre	64-bit	2.33 GHz Xeon	8	8 GB	10 Gbps InfiniBand	Lustre™	N/A

Workflow ^a applications
Three different workflow applications are chosen.
- Montage (http://montage.ipac.caltech.edu) from astronomy: a toolkit for aggregating astronomical images in Flexible Image Transport System (FITS) format into mosaics
  The workflow contained 10,429 tasks, read 4.2 GB of input data, and produced 7.9 GB of output data.
  Montage is considered I/O-bound because it spends more than 95% of its time waiting on I/O operations.
- Broadband (http://scec.usc.edu/research/cme) from seismology: generates and compares intensity measures of seismograms from several high- and low-frequency earthquake simulation codes
  The workflow contained 320 tasks, read 6 GB of input data, and produced 160 MB of output data.
  Broadband is considered memory-limited because more than 75% of its runtime is consumed by tasks requiring more than 1 GB of physical memory.
- Epigenome (http://epigenome.usc.edu) from biochemistry: maps short DNA segments collected using high-throughput gene sequencing machines to a previously constructed reference genome
  The workflow contained 81 tasks, read 1.8 GB of input data, and produced 300 MB of output data.
  Epigenome is considered CPU-bound because it spends 99% of its runtime in the CPU and only 1% on I/O and other activities.

Summary of resource use by the workflow applications

Application	I/O	Memory	CPU
Montage	High	Low	Low
Broadband	Medium	High	Medium
Epigenome	Low	Medium	High

Methods
The experiments were all run on single nodes to provide an unbiased comparison of the performance of workflows on Amazon EC2 and Abe.
For experiments on EC2:
- Executables were pre-installed in a Virtual Machine image which is deployed on the node.
- Input data was stored in the Amazon EBS.
- Output, intermediate files and the application executables were stored on local disks.
- All jobs were managed and executed through a job submission host at the Information Sciences Institute (ISI) using the Pegasus Workflow Management System (Pegasus WMS) including Pegasus and Condor.

Cloud performance

Montage (I/O-bound)
Processing on abe.lustre is nearly three times faster than on the fastest EC2 machines ^b.
Broadband (Memory-bound)
The processing advantage of the parallel file system largely disappears, and abe.local's performance is only 1% better than that of c1.xlarge.
For memory-intensive applications, Amazon EC2 can achieve nearly the same performance as Abe.
Epigenome (CPU-bound)
The parallel file system in Abe provides no processing advantage for Epigenome. The machines with the most cores gave the best performance for this CPU-bound application.

Figure below shows the processing time for the three workflows.

Cost

The cost of Amazon EC2 includes:

Resource cost: the figure below shows processing cost of three workflows in EC2.

Storage Cost: Cost to store VM images in S3 and cost of storing input data in EBS.
The table summarizes the monthly storage cost

Application	Input Volume	Monthly Storage Cost
Montage	4.3 GB	$0.66
Broadband	4.1 GB	$0.66
Epigenome	1.8 GB	$0.26

Transfer cost: Amazon EC2 charges $0.10 per GB for transfer into the cloud and $0.17 per GB for transfer out of the cloud.
The data sizes and transfer costs are summarized in the tables below.
Data transfer size per workflow on Amazon EC2

Application	Input	Output	Logs
Montage	4,291 MB	7,970 MB	40 MB
Broadband	4,109 MB	159 MB	5.5 MB
Epigenome	1,843 MB	299 MB	3.3 MB

Costs of transferring data into and out of the EC2 cloud

Application	Input	Output	Logs	Total
Montage	$0.42	$1.32	$<0.01	$1.75
Broadband	$0.40	$0.03	$<0.01	$0.43
Epigenome	$0.18	$0.05	$<0.01	$0.23
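
The transfer costs tabulated above follow from the data sizes and the per-GB rates. The sketch below recomputes them; it assumes 1 GB = 1,024 MB and that logs are billed at the outbound rate, two assumptions that are not stated in the source but that reproduce the tabulated values after rounding to cents.

```python
IN_RATE = 0.10    # USD per GB transferred into the cloud
OUT_RATE = 0.17   # USD per GB transferred out of the cloud
MB_PER_GB = 1024  # assumed conversion factor

# (input MB, output MB, logs MB) per workflow, from the data size table
SIZES_MB = {
    "Montage": (4291, 7970, 40),
    "Broadband": (4109, 159, 5.5),
    "Epigenome": (1843, 299, 3.3),
}


def transfer_costs(input_mb, output_mb, logs_mb):
    """Return (input, output, logs, total) transfer cost in USD, unrounded."""
    input_cost = input_mb / MB_PER_GB * IN_RATE
    output_cost = output_mb / MB_PER_GB * OUT_RATE
    logs_cost = logs_mb / MB_PER_GB * OUT_RATE  # assumed: logs leave the cloud
    return input_cost, output_cost, logs_cost, input_cost + output_cost + logs_cost


if __name__ == "__main__":
    for app, sizes in SIZES_MB.items():
        inp, out, logs, total = transfer_costs(*sizes)
        print(f"{app:10s} in ${inp:.2f}  out ${out:.2f}  logs ${logs:.4f}  total ${total:.2f}")
```

Montage dominates because it is the only workflow whose output is larger than its input, and outbound transfer is the more expensive direction.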

Cost effectiveness study
Cost calculations are based on processing requests for 36,000 mosaics of 2MASS images (total size 10 TB), each 4 sq deg in size, over a period of three years (a typical workload for an image mosaic service).
Results show that Amazon EC2 is much less attractive than a local service for I/O-bound applications because of the high cost of data storage in Amazon EC2.
The tables below show the cost of both the local and the Amazon EC2 service.
Cost per mosaic of a locally hosted image mosaic service

Item	Cost ($)
12 TB RAID 5 disk farm and enclosure (3 yr support)	12,000
Dell 2650 Xeon quad-core processor, 1 TB staging area	5,000
Power, cooling and administration	6,000
Total 3-year Cost	23,000
Cost per mosaic	0.64

Cost per mosaic of a mosaic service hosted in the Amazon EC2 cloud

Item	Cost ($)
Network Transfer In	1,000
Data Storage on Elastic Block Storage	36,000
Processor Cost (c1.medium)	4,500
I/O operations	7,000
Network Transfer Out	4,200
Total 3-year Cost	52,700
Cost per mosaic	1.46
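
The per-mosaic figures in both tables are the three-year totals divided by the 36,000 requested mosaics. A minimal sketch of that arithmetic follows; the dictionary keys are shortened labels and the function names are illustrative, while the dollar amounts are taken from the tables above.

```python
MOSAICS = 36_000  # mosaics requested over three years

LOCAL_COSTS = {
    "disk farm and enclosure": 12_000,
    "server and staging area": 5_000,
    "power, cooling, administration": 6_000,
}

EC2_COSTS = {
    "network transfer in": 1_000,
    "EBS data storage": 36_000,
    "processor": 4_500,
    "I/O operations": 7_000,
    "network transfer out": 4_200,
}


def cost_per_mosaic(items, mosaics=MOSAICS):
    """Return (3-year total, cost per mosaic) in USD."""
    total = sum(items.values())
    return total, total / mosaics


def share(items, key):
    """Fraction of the total contributed by one line item."""
    return items[key] / sum(items.values())


if __name__ == "__main__":
    for name, items in (("local", LOCAL_COSTS), ("EC2", EC2_COSTS)):
        total, each = cost_per_mosaic(items)
        print(f"{name:5s} total ${total:,}  per mosaic ${each:.2f}")
    print(f"EBS storage share of EC2 total: {share(EC2_COSTS, 'EBS data storage'):.1%}")
```

Storage alone accounts for roughly two thirds of the EC2 total, which is why the cloud option is less attractive for this I/O-bound workload.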

Summary

For CPU-bound applications, virtualization overhead on Amazon EC2 is generally small.
The resources offered by EC2 are generally less powerful than those available in HPC, particularly for I/O-bound applications.
Amazon EC2 offers no cost benefit over locally hosted storage, but does eliminate local maintenance and energy costs, and does offer high-quality, reliable storage.
As a result, commercial clouds may not be best suited for large-scale computations ^c.

Part II - Application to calculation of periodograms

Generation of a science product: an atlas of periodograms for the 210,000 light curves released by the NASA Kepler Mission.

Category	Metric	Result
Runtimes	Tasks	631,992
Runtimes	Mean Task Runtime	6.34 sec
Runtimes	Jobs	25,401
Runtimes	Mean Job Runtime	2.62 min
Runtimes	Total CPU Time	1,113 hr
Runtimes	Total Wall Time	26.8 hr
Inputs	Input Files	210,664
Inputs	Mean Input Size	0.084 MB
Inputs	Total Input Size	17.3 GB
Outputs	Output Files	1,263,984
Outputs	Mean Output Size	0.124 MB
Outputs	Total Output Size	76.52 GB
Cost	Compute Cost	$291.58
Cost	Transfer Cost	$11.48
Cost	Total Cost	$303.06
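
Several rows of the table above can be cross-checked against each other. The sketch below recomputes total CPU time from the task count and mean task runtime, and total cost from its two components. The ratio of CPU time to wall time is a derived quantity that does not appear in the source; it is shown only as a rough indication of how many tasks ran concurrently on average.

```python
TASKS = 631_992
MEAN_TASK_SEC = 6.34
WALL_HOURS = 26.8
COMPUTE_COST = 291.58   # USD
TRANSFER_COST = 11.48   # USD


def cpu_hours(tasks=TASKS, mean_task_sec=MEAN_TASK_SEC):
    """Total CPU time in hours: number of tasks times mean task runtime."""
    return tasks * mean_task_sec / 3600


def average_concurrency(wall_hours=WALL_HOURS):
    """CPU time divided by wall time (derived here, not reported in the source)."""
    return cpu_hours() / wall_hours


def total_cost(compute=COMPUTE_COST, transfer=TRANSFER_COST):
    """Total cost in USD."""
    return compute + transfer


if __name__ == "__main__":
    print(f"CPU time: {cpu_hours():.0f} hr")
    print(f"average concurrency: {average_concurrency():.1f}")
    print(f"total cost: ${total_cost():.2f}")
```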
