Hadoop Distributed File System

Created by Neel Patel, last modified on Jul 22, 2011

What is it?
- A Distributed File System like the Google File System
- Typically used in conjunction with a Parallel Data Processing Framework like MapReduce

Architecture ²
- Master/Slave architecture
  - The namespace forms the file system meta-data, which is maintained by a dedicated server called the Name-node
    - The name-node uses a transaction log called EditLog to record every change that occurs to the file-system meta data
    - The entire file-system namespace, including the mapping of files to blocks, is stored in FsImage
    - The EditLog and FsImage are stored on the local file-system of the name-node
    - Estimated size of meta-data per object is about 200 bytes ¹
    - Responsible for replication
      - Periodically receives a Heart Beat from each data-node, which indicates that the data-node is alive
      - Periodically receives a Block Report from each data-node. The Block Report contains the list of all blocks on a particular data-node.
  - The data resides on servers called Data-nodes
- Supports a traditional hierarchical file system
  - File-system objects are stored as a sequence of blocks
  - Default Block Size is 128 MB
    - Can be configured on a per file basis
- Allows streaming access to data
  - Supports write-once-read-many with reads at streaming speeds
- All communication protocols are TCP/IP based
  - Client and Name-node communicate using "ClientProtocol"
  - Data-node and Name-node communicate using "DatanodeProtocol"
  - Both the above protocols are wrapped by RPC abstraction
  - Name-node never initiates a request (it is just a server)
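
The relationship between FsImage and EditLog described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HDFS code: the class `ToyNameNode`, its methods, and the operation format are hypothetical names made up for illustration, and the checkpoint step (folding the log into a new image) is an assumption of the sketch rather than something these notes describe.

```python
import json


class ToyNameNode:
    """Toy namespace with an image (like FsImage) and a journal (like EditLog)."""

    def __init__(self):
        self.namespace = {}   # path -> list of block ids, held in memory
        self.edit_log = []    # changes recorded since the last image was written

    def _apply(self, op):
        if op["type"] == "create":
            self.namespace[op["path"]] = list(op["blocks"])
        elif op["type"] == "delete":
            self.namespace.pop(op["path"], None)

    def record(self, op):
        # Record the change in the log, then apply it to the in-memory namespace.
        self.edit_log.append(op)
        self._apply(op)

    def checkpoint(self):
        # Assumption: write the whole namespace as a new image and clear the log.
        image = json.dumps(self.namespace, sort_keys=True)
        self.edit_log = []
        return image

    @classmethod
    def recover(cls, image, edit_log):
        # Rebuild the namespace by loading the image and replaying the log.
        node = cls()
        node.namespace = json.loads(image)
        for op in edit_log:
            node._apply(op)
        node.edit_log = list(edit_log)
        return node


nn = ToyNameNode()
nn.record({"type": "create", "path": "/a", "blocks": [1, 2]})
image = nn.checkpoint()
nn.record({"type": "create", "path": "/b", "blocks": [3]})
nn.record({"type": "delete", "path": "/a"})

restored = ToyNameNode.recover(image, nn.edit_log)
print(restored.namespace)
```

The sketch shows why losing either structure is a problem: the image alone misses recent changes, and the log alone misses everything recorded before the last image was written.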

Scalability ¹
- The entire namespace is memory-resident. This limits the number of namespace objects a single namespace server can manage.
- Small files problem - If the block-to-file ratio (number of blocks per file) is close to 1, the memory footprint of a single namespace server grows faster than the physical storage it manages. This constitutes a scalability bottleneck when working with small files.
  - This problem can be somewhat overcome using archives, sequence files or HBase
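
A rough calculation makes the small files problem concrete. The sketch below uses the figures quoted in these notes (about 200 bytes of meta-data per object, 128 MB blocks) and assumes, for illustration only, that every file and every block counts as one namespace object and that directories can be ignored. The function names and the two example workloads are hypothetical.

```python
import math

BYTES_PER_OBJECT = 200          # estimate quoted in the notes above
BLOCK_SIZE = 128 * 1024 ** 2    # default block size quoted in the notes above


def blocks_for(file_size):
    # Number of blocks needed to hold a file; the last block may be partly filled.
    return math.ceil(file_size / BLOCK_SIZE)


def namespace_bytes(file_sizes):
    # Assumption: one meta-data object per file plus one per block.
    objects = sum(1 + blocks_for(size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


GIB = 1024 ** 3
MIB = 1024 ** 2

large = [GIB] * 1024            # 1 TiB stored as 1,024 files of 1 GiB each
small = [MIB] * (1024 * 1024)   # 1 TiB stored as 1,048,576 files of 1 MiB each

large_bytes = namespace_bytes(large)
small_bytes = namespace_bytes(small)
print(large_bytes, small_bytes, round(small_bytes / large_bytes, 1))
```

Under these assumptions the same volume of data needs roughly 1.8 MB of name-node memory as large files and roughly 419 MB as small files, which is why packing small files into archives or sequence files helps.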

Performance ¹
- High Internal Load
  - During normal operation, data-nodes periodically send heartbeats to the name-node to indicate that they are alive. The default heartbeat interval is 3 seconds. If the name-node does not receive a heartbeat from a data-node within 10 minutes, it declares the data-node dead and schedules its blocks for replication on other nodes. The block reports and heartbeats form the internal load of the cluster.
  - The internal load is proportional to the number of nodes in the cluster and the average number of blocks on a node.
  - The internal load for block reports and heartbeat processing on a 10,000 node HDFS cluster with the total storage capacity of 60 PB will consume 30% of the total name-node processing capacity.
- Read/Write Throughput (DFSIO benchmark)
  - Average Read Throughput: 66 MB/s
  - Average Write Throughput: 40 MB/s
- Open/Create Throughput (NNThroughputBenchmark)
  - Open: 126,119 ops/s
  - Create: 5,600 ops/s
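
The heartbeat figures above translate into a simple rate and a simple liveness rule. The sketch below is an illustration only: the function names and the sample timestamps are hypothetical, and the real name-node logic is more involved than a single timeout comparison.

```python
HEARTBEAT_INTERVAL_S = 3      # default heartbeat interval quoted above
DEAD_TIMEOUT_S = 10 * 60      # 10 minutes without a heartbeat, as quoted above


def heartbeats_per_second(num_nodes, interval_s=HEARTBEAT_INTERVAL_S):
    # Each node sends one heartbeat per interval, so the load grows with node count.
    return num_nodes / interval_s


def dead_nodes(last_heartbeat, now, timeout_s=DEAD_TIMEOUT_S):
    # last_heartbeat maps a node name to the time (in seconds) it was last heard from.
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout_s)


rate = heartbeats_per_second(10_000)
print(round(rate))  # heartbeats per second arriving at the name-node

# Hypothetical timestamps: dn2 was last heard from 700 seconds ago.
last_seen = {"dn1": 995.0, "dn2": 300.0, "dn3": 1000.0}
print(dead_nodes(last_seen, now=1000.0))
```

For the 10,000 node cluster mentioned above, this gives roughly 3,333 heartbeats per second before any block reports are counted.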

Reliability ²
- Replication
  - Block based replication
  - Replicates to 3 data-nodes by default
- File System Snapshots can be made at any time. This allows for rollback and recovery in case of failures.
- Data Integrity
  - HDFS stores block level checksums of all files
  - When a block/file becomes corrupted, HDFS can recover the block/file from a replica.
- The name-node is a single point of failure for an HDFS cluster.
  - There are two data structures central to the functionality of HDFS, namely FsImage and EditLog. Corruption of either can render HDFS non-functional. The name-node can be configured to maintain multiple copies of these structures, updated synchronously. However, this configuration degrades the performance of meta-data operations. Note that meta-data is not data-intensive.
  - Name-node software currently lacks automatic failover capability
- Undelete - A deleted file can be recovered (from trash) within a configured window of time.
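
The interplay of block-level checksums and replication can be sketched as follows. This is a minimal illustration, assuming CRC32 as the checksum and an in-memory list standing in for the replicas of one block; `read_block` and the node names are hypothetical and do not correspond to an HDFS API.

```python
import zlib


def checksum(data):
    # Assumption: CRC32 stands in for the block-level checksum.
    return zlib.crc32(data)


def read_block(replicas, expected):
    # Try each replica in turn and return the first one whose checksum matches.
    for node, data in replicas:
        if checksum(data) == expected:
            return node, data
    raise IOError("all replicas of the block are corrupt")


block = b"hello hdfs"
expected = checksum(block)   # stored when the block was written

# Three replicas (the default replication factor); the copy on dn1 is corrupted.
replicas = [("dn1", b"hellX hdfs"), ("dn2", block), ("dn3", block)]

node, data = read_block(replicas, expected)
print(node, data == block)
```

With three replicas, a read succeeds as long as at least one copy still matches the stored checksum.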

Interoperability ²
- Multiple Language Bindings
  - Java API (Native)
  - C/C++ API (Wrapper)
  - Python API (Wrapper)

Ease-of-use ²
- Command line interface
  - "FSshell" for file-system operations. The syntax is similar to bash and csh.
  - "DFSAdmin" for configuring HDFS.
- Web-Browser interface for namespace browsing.

Source(s):
¹ http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
² http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
