- What is it?
- A Distributed File System like the Google File System
- Typically used in conjunction with a Parallel Data Processing Framework like MapReduce
- Architecture 2
- Master/Slave architecture
- The namespace forms the file system meta-data, which is maintained by a dedicated server called the Name-node
- The name-node uses a transaction log called EditLog to record every change that occurs to the file-system meta data
- The entire file-system namespace including block-to-file mappings are stored in FsImage
- The EditLog and FsImage are stored on the local file-system of the name-node
- Estimated size of meta-data per object is about 200 bytes 1
- Responsible for replication
- Periodically receives a Heart Beat from each data-node, which indicate that a particular data-node is alive
- Periodically receives a Block Report from each data-node. The Block Report contains the list of all blocks on a particular data-node.
- The data resides on servers called Data-nodes
- The namespace forms the file system meta-data, which is maintained by a dedicated server called the Name-node
- Supports a traditional hierarchical file system
- File-system objects are stored as a sequence of blocks
- Default Block Size is 128 MB
- Can be configured on a per file basis
- Allows streaming access to data
- Supports write-once-read-many with reads at streaming speeds
- All communication protocols are TCP/IP based
- Client and Name-node communicate using "ClientProtocol"
- Data-node and Name-node communicate using "DatanodeProtocol"
- Both the above protocols are wrapped by RPC abstraction
- Name-node never initiates a request (it is just a server)
- Master/Slave architecture
- Scalability 1
- The entire namespace is memory-resident. This limits the number of namespace objects a single namespace server can manage.
- Small files problem - If the file to block ratio (number of blocks per file) is close to 1, the memory footprint of a single namespace server grows faster than the physical storage. This constitutes a scalability bottleneck when working with small files.
- This problem can be somewhat overcome using archives, sequence files or HBase
- Performance 1
- High Internal Load
- During normal operation, data-nodes periodically send heartbeats to the name-node to indicate that the data-node is alive. The default heartbeat interval is 3 seconds. If the name-node does not receive a heartbeat from a data-node in 10 minutes, it pronounces the data-node dead and schedules its blocks for replication on other nodes. The block reports and heartbeats form the internal load of the cluster.
- The internal load is proportional to the number of nodes in the cluster and the average number of blocks on a node.
- The internal load for block reports and heartbeat processing on a 10,000 node HDFS cluster with the total storage capacity of 60 PB will consume 30% of the total name-node processing capacity.
- Read/Write Throughput (DFSIO benchmark)
Average Read Throughput
66 MB/s
Average Read Throughput
40 MB/s
- Open/Create Throughput (NNThroughputBenchmark)
Open
126,119 ops/s
Create
5,600 ops/s
- High Internal Load
- Reliability 2
- Replication
- Block based replication
- Replicates to 3 data-nodes by default
- File System Snapshots can be made at any time. This allows for rollback and recovery in case of failures.
- Data Integrity
- HDFS stores block level checksums of all files
- When a block/file becomes corrupted, HDFS can recover the block/file from a replica.
- The name-node is a single point of failure for an HDFS cluster.
- There a two data structures central to the functionality of HDFS, namely, FsImage and EditLog. A corruption of either can cause HDFS to be non-functional. It is possible to configure the name-node to maintain multiple (synchronous) synchronized copies of these structures. However this configuration leads to degraded performance of meta-data operations. Note that meta-data is not data-intensive.
- Name-node software currently lacks automatic failover capability
- Undelete - A deleted file can be recovered (from trash) within a configured window of time.
- Replication
- Interoperability 2
- Multiple Language Bindings
- Java API (Native)
- C/C++ API (Wrapper)
- Python API (Wrapper)
- Multiple Language Bindings
- Ease-of-use 2
- Command line interface
- "FSshell" for file-system operations. The syntax is similar to bash and csh.
- "DFSAdmin" for configuring HDFS.
- Web-Browser interface for namespace browsing.
- Command line interface
Source(s):
1 http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
2 http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html