- What is it?
- Implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
- Clone of Google's MapReduce
- Supports Parallel Processing of Large Datasets
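The paradigm described above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps on a word-count problem, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical and everything runs sequentially in one process.

```python
from collections import defaultdict


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(fragments):
    # Each fragment is independent, so on a cluster the map calls could run
    # (or be re-run after a failure) on any node. Here they run in a loop.
    emitted = []
    for fragment in fragments:
        emitted.extend(map_phase(fragment))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

Because no map call depends on another, the fragments are the "small fragments of work" that a cluster is free to place on any node.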
- Architecture [1]
- Components
- The JobTracker manages cluster resources and jobs.
- A TaskTracker manages the tasks on its node. There is one TaskTracker per node.
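As a rough illustration of how these two components divide the work, the toy model below has one tracker object per node and a job tracker that hands out tasks round-robin and re-queues any task whose node fails. The class and method names are hypothetical stand-ins, not Hadoop classes, and the sketch ignores details such as running maps before reduces.

```python
from collections import deque


class TaskTracker:
    """Runs tasks on a single node (one tracker per node)."""

    def __init__(self, node, healthy=True):
        self.node = node
        self.healthy = healthy

    def run(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.node} is down")
        return f"{task} done on {self.node}"


class JobTracker:
    """Assigns tasks to trackers and re-executes tasks from failed nodes."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, tasks):
        pending = deque(tasks)
        blacklisted = set()
        results = {}
        turn = 0
        while pending:
            usable = [t for t in self.trackers if t.node not in blacklisted]
            if not usable:
                raise RuntimeError("no usable task trackers left")
            tracker = usable[turn % len(usable)]  # simple round-robin
            turn += 1
            task = pending.popleft()
            try:
                results[task] = tracker.run(task)
            except RuntimeError:
                # The node failed: stop using it and retry the task elsewhere.
                blacklisted.add(tracker.node)
                pending.append(task)
        return results


trackers = [TaskTracker("node-a"), TaskTracker("node-b", healthy=False),
            TaskTracker("node-c")]
results = JobTracker(trackers).run_job(["map-0", "map-1", "map-2", "reduce-0"])
```

In this run the task first sent to the failed `node-b` ends up completing on another node, which is the re-execution property mentioned in the definition of MapReduce.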
- Scalability
- Demonstrated to work on ~4000 nodes at Yahoo! [2]
- Performance
- Benchmark Study [2]
- Configuration
- ~3800 nodes, each with:
- 2 quad-core Xeons @ 2.5 GHz
- 4 SATA disks
- 8 GB RAM (16 GB for Petabyte Sort)
- 1 Gbps Ethernet Link
- 40 nodes per rack with 8 Gbps uplinks from each rack to the core
- RHEL 5.1 w/kernel 2.6.18
- Sun Java JDK 1.6.0_13-b03 (32/64 bit)
- Jim Gray's Sort Benchmark
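| Bytes | Nodes | Maps | Reduces | Replication | Time |
|--------|-------|--------|---------|-------------|-------------|
| 0.5 TB | 1406 | 8000 | 2600 | 1 | 59 seconds |
| 1.0 TB | 1460 | 8000 | 2700 | 1 | 62 seconds |
| 100 TB | 3452 | 10,000 | 10,000 | 2 | 173 minutes |
| 1 PB | 3658 | 80,000 | 20,000 | 2 | 975 minutes |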
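To put the benchmark figures in perspective, the sketch below converts each run into an aggregate and a per-node sort rate, and computes the rack oversubscription implied by the configuration (40 nodes at 1 Gbps sharing an 8 Gbps uplink). It assumes decimal units (1 TB = 10^12 bytes); the helper names are hypothetical, and the rates count only the bytes sorted, not the total disk or network traffic a run generates.

```python
# Rows copied from the benchmark results: (label, bytes, nodes, seconds).
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 PB = 1000 TB.
TB = 10**12
RUNS = [
    ("0.5 TB", 0.5 * TB, 1406, 59),
    ("1.0 TB", 1.0 * TB, 1460, 62),
    ("100 TB", 100 * TB, 3452, 173 * 60),
    ("1 PB", 1000 * TB, 3658, 975 * 60),
]


def throughput(total_bytes, nodes, seconds):
    """Return (aggregate GB/s, per-node MB/s) for one sort run."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e9, bytes_per_second / nodes / 1e6


def oversubscription(nodes_per_rack, link_gbps, uplink_gbps):
    """Ratio of the bandwidth a rack's nodes can offer to its uplink capacity."""
    return nodes_per_rack * link_gbps / uplink_gbps


for label, total_bytes, nodes, seconds in RUNS:
    agg, per_node = throughput(total_bytes, nodes, seconds)
    print(f"{label:>7}: {agg:6.2f} GB/s aggregate, {per_node:5.2f} MB/s per node")

print("rack oversubscription:", oversubscription(40, 1, 8))
```

Under these assumptions the petabyte run works out to roughly 17 GB/s across the cluster, and each rack uplink is oversubscribed 5:1.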
- Reliability
- The MapReduce server is a single point of failure [3]
- Failure kills all queued jobs
- Jobs need to be resubmitted by user
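The consequence of that single point of failure can be illustrated with a toy model: a server that keeps its queue only in memory, and a client that remembers what it submitted so it can resubmit after a restart. `ToyJobTracker` and `Client` are hypothetical classes written for this sketch and do not correspond to Hadoop's API.

```python
class ToyJobTracker:
    """Keeps the job queue only in memory, so a crash loses every queued job."""

    def __init__(self):
        self.queue = []

    def submit(self, job):
        self.queue.append(job)

    def crash_and_restart(self):
        # Nothing was persisted, so the restarted server starts empty.
        self.queue = []


class Client:
    """Tracks its own submissions so lost jobs can be resubmitted by the user."""

    def __init__(self, tracker):
        self.tracker = tracker
        self.submitted = []

    def submit(self, job):
        self.submitted.append(job)
        self.tracker.submit(job)

    def resubmit_lost(self):
        lost = [job for job in self.submitted if job not in self.tracker.queue]
        for job in lost:
            self.tracker.submit(job)
        return lost


tracker = ToyJobTracker()
client = Client(tracker)
for job in ("sort-logs", "index-pages"):
    client.submit(job)

tracker.crash_and_restart()
queued_after_crash = list(tracker.queue)  # empty: the queue did not survive
resubmitted = client.resubmit_lost()      # the user has to push the jobs again
```

The recovery logic lives entirely on the client side here, which mirrors the point above that the user, not the server, is responsible for resubmission.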
- Interoperability [1]
- Multiple Language Bindings
- Java API (Native)
- C/C++ API (Wrapper)
- Python API (Wrapper)
- Lacks support for alternate paradigms, such as the iterative processing used by K-means or PageRank [3]
- Lack of wire-compatible protocols [3]
- Client and cluster must run the same version, so migrating workflows to a different cluster is difficult.
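One way to see why an iterative algorithm such as K-means fits MapReduce awkwardly is to write a single iteration as a map step and a reduce step, then note that the driver must launch a complete job per iteration. The sketch below does this for one-dimensional points in plain Python; the function names are hypothetical, and on a real cluster each job would typically also pay job start-up cost and re-read its input.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    return sum(points) / len(points)


def one_mapreduce_job(points, centroids):
    groups = defaultdict(list)
    for point in points:
        index, value = kmeans_map(point, centroids)
        groups[index].append(value)
    # A centroid that attracted no points keeps its old position.
    return [kmeans_reduce(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]


def kmeans(points, centroids, max_jobs=20):
    """Driver: run one full MapReduce job per iteration until nothing moves."""
    jobs = 0
    while jobs < max_jobs:
        updated = one_mapreduce_job(points, centroids)
        jobs += 1
        if updated == centroids:
            break
        centroids = updated
    return centroids, jobs


centroids, jobs = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(centroids, jobs)
```

Even this six-point example needs several consecutive jobs to converge, and the loop that chains them sits outside the MapReduce model itself.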
- Scientific Applications
- An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
- Description - This paper outlines the current usage of Hadoop within the bioinformatics community.
- Summary - Hadoop and the MapReduce programming paradigm have a substantial base in the bioinformatics community, especially in sequencing analysis. Such use is increasing due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters as well as via cloud vendors (like Amazon) who have implemented Hadoop; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many relevant algorithms.
Source(s):
[1] http://wiki.apache.org/hadoop/FAQ#What_is_Hadoop.3F
[2] http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
[3] apachehadoopmapreducenextgen-110630154552-phpapp01.pptx