Child pages
  • Hadoop User Guide
Skip to end of metadata
Go to start of metadata

This page contains macros or features from a plugin which requires a valid license.

You will need to contact your administrator.

Introduction To Hadoop

The Hadoop framework in ROGER is configured to run MapReduce programs using Java and Apache Pig. It also offers Hadoop Streaming API interfaces to create and run MapReduce programs with any executable or script as the mapper and/or the reducer, e.g., Python, R or bash scripts. In addition, the Hadoop environment is configured to allow Apache Spark to interact with the Hadoop Distributed File System (HDFS), and managing Spark jobs with YARN.

The Hadoop cluster deployed in ROGER consists of 11 nodes:  

  • to

  • the NAMENODE and LOGIN-NODE is cg-hm08.

The Hadoop clusters provides multiple web interfaces to allow users keep track of nodes usage and jobs.The two most useful web interfaces are:

  1. HDFS monitoring web interface: that includes availability of data storage and current files configuration on the HDFS. (

  2. Job monitoring web interface: that includes progress status of each job in addition to statistics regarding resource and time usage of jobs. (

Logging on to the Hadoop Cluster Nodes

To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:




after entering password, we can login to the Hadoop login-node (i.e., 


ssh cg-hm08

Data access on Hadoop

After login to the Hadoop cluster, a user can interact with the HDFS with a set of commands.

To check the version of the Hadoop software:


hadoop version 

or the existing nodes in the cluster:


yarn node -list 


Before writing specific MapReduce code, the following steps are necessary to make sure the settings are properly configured. The first thing to do is check existing files in HDFS:


hdfs dfs -ls PATH (e.g., /user/account_name/folder_name)
hdfs dfs -ls /user (to check existing users in the Hadoop cluster)

if it is the first time for a user to use the Hadoop cluster, creating a user folder with the same name as the cluster account name is necessary. To make a directory:


hdfs dfs -mkdir /user/account_name (for first time usage)
hdfs dfs -mkdir folder_name (it will create a folder /user/account_name/folder_name)

To copy your data from local disks to HDFS:


hdfs dfs -copyFromLocal  PATH_TO_LOCAL_FILE /user/account_name/folder_name


which is equivalent to:


hdfs dfs -copyFromLocal  PATH_TO_LOCAL_FILE folder_name


Contrary, to collect data from HDFS to local storage:




To remove a file (or directory) from HDFS we can use:


hdfs dfs -rm(r) PATH_TO_HDFS_FILE


Furthermore, options such as replication factor and block size can be specified using -D directive.


hdfs dfs -Ddfs.replication=1 -copyFromLocal PATH_TO_LOCAL_FILE PATH_TO_HDFS_FILE


Click here for a detailed HDFS commands guide» 

Writing MapReduce Programs

Interact with Hadoop using Pig script

  1. To run a Pig script, directly type in the Linux console: pig pig_script.pig

  2. or directly enter pig console: pig 

Note that the Pig script is submitted via YARN, which will be introduced later.

Writing MapReduce program using Java

  1. A Java program

  2. It is recommended to compile the Java source code using “Ant”, where all the dependency and libraries are configured in the build.xml (see Appendix for details)

  3. A MapReduce program usually consists of three main classes: one Mapper class, one reducer class and one runner class which runs the Hadoop job. (combiner class is optional)

MapReduce with Hadoop Streaming API

Hadoop streaming API can be utilized by any executable or script as the mapper and/or the reducer respectively, in this case, Python.

  1. Create a folder named “app”

  2. Create a python script to handle mapper, e.g.,

  3. Create a python script to handle reducer, e.g.,

  4. If third-party scripts are imported, put the source code in the same folder.

Writing Spark program

  1. Currently, Spark supports Scala (native), Python and R (in this case, we will use Python as an example)

  2. Spark can be launched directly in the local node by typing: pyspark

  3. It can also execute python script by:

    1. pyspark (in the local node)

    2. spark-submit --master yarn-client (as a job running in the cluster)

Submitting Jobs via YARN

Hadoop jobs are managed by YARN.

YARN handles:

a. Queuing jobs

b. Assigning submitted jobs to computing resources

Submit Java programs


hadoop jar dist/your_compiled_program.jar [input parameters]

Input parameters will then be handled by the main runner class of the program. They usually include input and output address on HDFS, as well as some configuration of the job.

Submit MapReduce programs with Hadoop Streaming API (Python)

If your source code only contains and


yarn jar /usr/hdp/ \
-D mapred.reduce.tasks=5 \
-mapper "" \
-reducer "" \


if your source code utilizes third-party libraries


yarn jar /usr/hdp/ \
-D mapred.reduce.tasks=5 \
-files app \
-mapper "app/" \
-reducer "app/" \

The “-file” function will upload the specific python file to the entire cluster so that every node is aware of the file and will execute this script without problem.

On This Page

Browse Content



  • No labels