• What is it?
    • A Distributed File System like the Google File System
    • Typically used in conjunction with a Parallel Data Processing Framework like MapReduce
  • Architecture 2
    • Master/Slave architecture
      • The namespace forms the file system meta-data, which is maintained by a dedicated server called the Name-node
        • The name-node uses a transaction log called the EditLog to record every change that occurs to the file-system meta-data
        • The entire file-system namespace, including the mapping of blocks to files, is stored in a file called the FsImage
        • The EditLog and FsImage are stored on the local file-system of the name-node
        • The estimated size of the meta-data per namespace object (file or block) is about 200 bytes 1
        • Responsible for replication
          • Periodically receives a Heartbeat from each data-node, which indicates that the data-node is alive
          • Periodically receives a Block Report from each data-node. The Block Report contains the list of all blocks on a particular data-node.
      • The data resides on servers called Data-nodes
    • Supports a traditional hierarchical file system
      • Each file is stored as a sequence of blocks
      • Default Block Size is 128 MB
        • Can be configured on a per-file basis (see the Java sketch at the end of this section)
    • Allows streaming access to data
      • Supports write-once-read-many with reads at streaming speeds
    • All communication protocols are TCP/IP based
      • Client and Name-node communicate using "ClientProtocol"
      • Data-node and Name-node communicate using "DatanodeProtocol"
      • Both of the above protocols are wrapped in an RPC abstraction
      • The name-node never initiates an RPC; it only responds to requests from clients and data-nodes
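    • Example (Java API): a minimal sketch of the per-file block size and the write-once/streaming-read access pattern described above. The name-node URI, path, and sizes are placeholders, and error handling is omitted.

      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsBlockSizeExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // "hdfs://namenode:8020" is a placeholder for the cluster's name-node URI.
              FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

              Path file = new Path("/tmp/example.dat");
              // create(path, overwrite, bufferSize, replication, blockSize):
              // block size (here 128 MB) and replication factor are chosen per file.
              FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
              out.writeUTF("written once");
              out.close();

              // Reads are streamed back through an FSDataInputStream.
              FSDataInputStream in = fs.open(file);
              System.out.println(in.readUTF());
              in.close();
          }
      }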
  • Scalability 1
    • The entire namespace is memory-resident. This limits the number of namespace objects a single namespace server can manage.
    • Small files problem - if the average number of blocks per file is close to 1 (i.e., most files are small), the memory footprint of a single namespace server grows in proportion to the number of files rather than to the physical storage used. This constitutes a scalability bottleneck when working with many small files.
      • This problem can be somewhat overcome by packing small files into Hadoop archives, SequenceFiles, or HBase (see the sketch below)
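    • Example (Java API): a minimal sketch of the SequenceFile mitigation mentioned above - many small local files are packed into one container file, so the name-node tracks a handful of ~200-byte objects instead of roughly two objects (file + block) per small file. The input directory and HDFS path are placeholders.

      import java.io.File;
      import java.nio.file.Files;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class PackSmallFiles {
          public static void main(String[] args) throws Exception {
              // Assumes the cluster configuration (core-site.xml) is on the classpath.
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // One large container file on HDFS instead of many tiny files.
              Path container = new Path("/data/small-files.seq");
              SequenceFile.Writer writer =
                  SequenceFile.createWriter(fs, conf, container, Text.class, BytesWritable.class);

              // "/local/input" is a placeholder directory of small files.
              for (File f : new File("/local/input").listFiles()) {
                  byte[] bytes = Files.readAllBytes(f.toPath());
                  // Key = original file name, value = file contents.
                  writer.append(new Text(f.getName()), new BytesWritable(bytes));
              }
              writer.close();
          }
      }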
  • Performance 1
    • High Internal Load
      • During normal operation, each data-node periodically sends a heartbeat to the name-node to indicate that it is alive. The default heartbeat interval is 3 seconds. If the name-node does not receive a heartbeat from a data-node for 10 minutes, it pronounces the data-node dead and schedules its blocks for replication on other nodes. The block reports and heartbeats constitute the internal load of the cluster (the heartbeat setting is sketched at the end of this section).
      • The internal load is proportional to the number of nodes in the cluster and the average number of blocks on a node.
      • The internal load for block report and heartbeat processing on a 10,000-node HDFS cluster with a total storage capacity of 60 PB will consume 30% of the total name-node processing capacity.
    • Read/Write Throughput (DFSIO benchmark)
      • Average Read Throughput: 66 MB/s
      • Average Write Throughput: 40 MB/s
    • Open/Create Throughput (NNThroughputBenchmark)
      • Open: 126,119 ops/s
      • Create: 5,600 ops/s
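    • Example (Java API): the heartbeat traffic described under "High Internal Load" is governed by configuration. A minimal sketch that reads the data-node heartbeat interval; the key dfs.heartbeat.interval and its 3-second default come from hdfs-default.xml, and other related keys differ between Hadoop versions.

      import org.apache.hadoop.conf.Configuration;

      public class HeartbeatSettings {
          public static void main(String[] args) {
              Configuration conf = new Configuration();

              // Data-nodes send a heartbeat to the name-node every
              // dfs.heartbeat.interval seconds (default: 3, per hdfs-default.xml).
              long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3L);
              System.out.println("Heartbeat interval: " + heartbeatSeconds + " s");

              // Heartbeats and block reports scale with the number of data-nodes,
              // so a shorter interval raises the internal load on the name-node.
              // The value can be overridden in hdfs-site.xml.
          }
      }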

  • Reliability 2
    • Replication
      • Block-based replication
      • Each block is replicated to 3 data-nodes by default; the replication factor can be set per file (see the sketch at the end of this section)
    • File System Snapshots can be made at any time. This allows for rollback and recovery in case of failures.
    • Data Integrity
      • HDFS stores block-level checksums for all files
      • When a block becomes corrupted, HDFS can recover it from an uncorrupted replica
    • The name-node is a single point of failure for an HDFS cluster.
      • There are two data structures central to the functionality of HDFS, namely the FsImage and the EditLog. Corruption of either can render HDFS non-functional. The name-node can be configured to maintain multiple synchronously-updated copies of these structures; this degrades the rate of meta-data operations, but the degradation is usually acceptable because HDFS applications are not meta-data intensive.
      • The name-node software currently lacks automatic failover capability
    • Undelete - A deleted file can be recovered (from trash) within a configured window of time.
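    • Example (Java API): a minimal sketch of the per-file replication factor, client-side checksum verification, and the trash window mentioned above. The path and the replication value of 5 are illustrative.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ReliabilityKnobs {
          public static void main(String[] args) throws Exception {
              // Assumes the cluster configuration is on the classpath.
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // Raise the replication factor of an existing file from the default
              // of 3 to 5; the name-node schedules the additional copies.
              Path important = new Path("/data/important.dat");  // placeholder path
              fs.setReplication(important, (short) 5);

              // Client-side checksum verification on reads (enabled by default).
              fs.setVerifyChecksum(true);

              // Deleted files remain recoverable in the trash for this many minutes
              // (0 disables the trash); fs.trash.interval comes from core-default.xml.
              long trashMinutes = conf.getLong("fs.trash.interval", 0L);
              System.out.println("Trash window: " + trashMinutes + " minutes");
          }
      }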
  • Interoperability 2
    • Multiple Language Bindings
      • Java API (Native)
      • C/C++ API (Wrapper)
      • Python API (Wrapper)
  • Ease-of-use 2
    • Command line interface
      • "FSshell" for file-system operations. The syntax is similar to bash and csh.
      • "DFSAdmin" for configuring HDFS.
    • Web-Browser interface for namespace browsing.
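    • Example (Java API): the FSshell operations can also be invoked programmatically through org.apache.hadoop.fs.FsShell and ToolRunner; a minimal sketch that lists the root directory (equivalent to "hadoop fs -ls /" on the command line).

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FsShell;
      import org.apache.hadoop.util.ToolRunner;

      public class ShellFromJava {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Runs the same code path as "hadoop fs -ls /" on the command line.
              int exitCode = ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/"});
              System.exit(exitCode);
          }
      }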

Source(s):
1 http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
2 http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
