• What is it?
    • Implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
    • Clone of Google's MapReduce
    • Supports Parallel Processing of Large Datasets
  • Architecture 1
    • Components
      • Job Tracker manages cluster resources and jobs.
      • Task Tracker manages tasks. There is one task tracker per node.
  • Scalability
    • Demonstrated to work on ~4000 nodes (at Yahoo!) 2
  • Performance
    • Benchmark Study 2
      • Configuration
        • ~3800 nodes each with,
          • 2 quad code Xeons @ 2.5 GHz
          • 4 SATA disks
          • 8 GB RAM (16 GB for Petabyte Sort)
          • 1 Gbps Ethernet Link
        • 40 nodes per rack with 8 Gbps uplinks from each rack to the core
        • RHEL 5.1 w/kernel 2.6.18
        • Sun Java JDK 1.6.0_13-b03 (32/64 bit)
        • Jim Gray's Sort Benchmark

          Bytes

          Nodes

          Maps

          Reduces

          Replication

          Time

          0.5 TB

          1406

          8000

          2600

          1

          59 seconds

          1.0 TB

          1460

          8000

          2700

          1

          62 seconds

          100 TB

          3452

          10,000

          10,000

          2

          173 minutes

          1 PB

          3658

          80,000

          20,000

          2

          975 minutes

  • Reliability
    • The MapReduce server is a single point of failure 3
      • Failure kills all queued jobs
      • Jobs need to be resubmitted by user
  • Interoperability 1
    • Multiple Language Bindings
      • Java API (Native)
      • C/C++ API (Wrapper)
      • Python API (Wrapper)
    • Lacks support for alternate paradigms like K-means or PageRank 3
    • Lack of wire-compatible protocols 3
      • Client/Cluster must be same version. Hence migration of workflows to different clusters is difficult.
  • Scientific Applications
    • An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
      • Description - This paper outlines the current usage of Hadoop within the bioinformatics community.
      • Summary - Hadoop and the MapReduce programming paradigm have a substantial base in the bioinformatics community, especially in sequencing analysis. Such use is increasing due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters as well as via cloud vendors (like Amazon) who have implemented Hadoop; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many relevant algorithms.

Source(s):
1 http://wiki.apache.org/hadoop/FAQ#What_is_Hadoop.3F
2 http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
3 apachehadoopmapreducenextgen-110630154552-phpapp01.pptx

References:

Astronomy:

Biology/Bioinformatics:

Environmental Sci/Engin:

GIS:

  • No labels