Hadoop MapReduce

What is it?
- Implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
- An open-source implementation of the MapReduce model published by Google
- Supports parallel processing of large datasets
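
To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word counting as the example job. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(fragments):
    # Each fragment is mapped independently of the others, which is what
    # lets a real cluster run (or re-run) a fragment on any node.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: three small fragments of one larger text.
fragments = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_job(fragments)
print(counts)
```

Because `map_phase` reads only its own fragment and `reduce_phase` reads only the values for its own key, both steps can be distributed across nodes; only the shuffle needs to move data between them.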

Architecture ¹
- Components
  - The JobTracker manages cluster resources and jobs. There is a single JobTracker for the cluster.
  - The TaskTracker manages tasks. There is one TaskTracker per node.
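
The division of labor between the two components can be sketched as a toy model: one coordinator hands tasks to per-node workers and re-queues a task when its node fails. This is an illustrative simplification, not Hadoop's actual scheduler; the class names mirror the component names, but every method, the round-robin policy, and the `NodeFailure` exception are assumptions.

```python
from collections import deque


class NodeFailure(Exception):
    """Raised by a tracker whose node is down (hypothetical)."""


class TaskTracker:
    """One per node: runs whatever task it is handed."""

    def __init__(self, node, healthy=True):
        self.node = node
        self.healthy = healthy

    def run(self, task):
        if not self.healthy:
            raise NodeFailure(self.node)
        return task()


class JobTracker:
    """Single coordinator: assigns tasks and re-queues those that fail."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, tasks):
        pending = deque(enumerate(tasks))
        live = list(self.trackers)
        results, placement = {}, {}
        turn = 0
        while pending:
            if not live:
                raise RuntimeError("no live task trackers left")
            task_id, task = pending.popleft()
            tracker = live[turn % len(live)]  # simple round-robin (assumed policy)
            turn += 1
            try:
                results[task_id] = tracker.run(task)
                placement[task_id] = tracker.node
            except NodeFailure:
                live.remove(tracker)              # stop using the failed node
                pending.append((task_id, task))   # re-execute on another node
        return [results[i] for i in range(len(tasks))], placement


# Hypothetical three-node cluster in which node-b is down.
trackers = [TaskTracker("node-a"), TaskTracker("node-b", healthy=False), TaskTracker("node-c")]
tasks = [lambda n=n: n * n for n in range(4)]
results, placement = JobTracker(trackers).run_job(tasks)
print(results, placement)
```

The task first sent to the failed node is completed on a healthy one, so the job still finishes. The coordinator itself has no such backup in this model.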

Performance

Benchmark Study ²

Configuration

~3800 nodes, each with:
- 2 quad-core Xeons @ 2.5 GHz
- 4 SATA disks
- 8 GB RAM (16 GB for Petabyte Sort)
- 1 Gbps Ethernet Link

Cluster-wide:
- 40 nodes per rack, with 8 Gbps uplinks from each rack to the core
- RHEL 5.1 with kernel 2.6.18
- Sun Java JDK 1.6.0_13-b03 (32/64 bit)
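
The per-node figures above imply some aggregate numbers that help put the benchmark in context. The short calculation below derives them. It treats the approximate node count as exactly 3800, uses the base 8 GB RAM figure, and assumes every node's 1 Gbps link shares its rack's single 8 Gbps uplink, so the results are rough estimates rather than reported values.

```python
import math

NODES = 3800             # "~3800" in the configuration, treated as exact here
CORES_PER_NODE = 2 * 4   # 2 quad-core Xeons
DISKS_PER_NODE = 4
RAM_GB_PER_NODE = 8      # base configuration (16 GB was used for Petabyte Sort)
NODES_PER_RACK = 40
NODE_LINK_GBPS = 1
RACK_UPLINK_GBPS = 8

total_cores = NODES * CORES_PER_NODE
total_disks = NODES * DISKS_PER_NODE
total_ram_tb = NODES * RAM_GB_PER_NODE / 1000   # decimal terabytes
racks = math.ceil(NODES / NODES_PER_RACK)

# Oversubscription: total node bandwidth in a rack divided by its uplink.
oversubscription = NODES_PER_RACK * NODE_LINK_GBPS / RACK_UPLINK_GBPS

print(f"cores: {total_cores}, disks: {total_disks}, RAM: {total_ram_tb} TB")
print(f"racks: {racks}, rack oversubscription: {oversubscription}:1")
```

Under these assumptions the cluster has about 30,400 cores, 15,200 disks, and 30.4 TB of RAM across 95 racks, and traffic leaving a rack is oversubscribed 5:1. That ratio is one reason keeping computation close to the data matters at this scale.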

Jim Gray's Sort Benchmark

Reliability
- The MapReduce master server (the JobTracker) is a single point of failure ³
  - Failure kills all queued jobs
  - Jobs need to be resubmitted by user

Scientific Applications
- An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
  - Description - This paper outlines the current usage of Hadoop within the bioinformatics community.
  - Summary - Hadoop and the MapReduce programming paradigm have a substantial base in the bioinformatics community, especially in sequencing analysis. Such use is increasing due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters as well as via cloud vendors (like Amazon) who have implemented Hadoop; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many relevant algorithms.

References:

Astronomy:

Biology/Bioinformatics:

Environmental Sci/Engin:

GIS: