This documentation is outdated. Please see ROGER: The CyberGIS Supercomputer for updated documentation.
The Hadoop framework in ROGER is configured to run MapReduce programs using Java and Apache Pig. It also offers the Hadoop Streaming API to create and run MapReduce programs with any executable or script as the mapper and/or the reducer, e.g., Python, R or bash scripts. In addition, the Hadoop environment is configured to allow Apache Spark to interact with the Hadoop Distributed File System (HDFS) and to manage Spark jobs with YARN.
The Hadoop cluster deployed in ROGER consists of 11 nodes:
cg-hm08.ncsa.illinois.edu to cg-hm18.ncsa.illinois.edu
In particular, cg-hm08 serves as both the namenode and the login node. The Hadoop cluster provides multiple web interfaces that allow users to keep track of node usage and jobs.
The two most useful web interfaces are:
HDFS monitoring web interface: shows the available data storage and the current file configuration on HDFS (http://cg-hm08.ncsa.illinois.edu:50070)
Job monitoring web interface: shows the progress of each job, along with statistics on the resource and time usage of jobs (http://cg-hm09.ncsa.illinois.edu:8088/cluster)
ROGER-Hadoop Cluster login
To log in to the Hadoop cluster in ROGER, a user needs to log in to ROGER first, for example:
After entering the password, we can log in to the Hadoop login node (i.e., cg-hm08.ncsa.illinois.edu)
Data access on Hadoop
After logging in to the Hadoop cluster, a user can interact with HDFS through a set of commands.
To check the version of the Hadoop software:
or to list the existing nodes in the cluster:
Before starting to write MapReduce code, the following steps are necessary to make sure the settings are properly configured. The first thing to do is to check the existing files in HDFS:
If it is the first time a user uses the Hadoop cluster, it is necessary to create a user folder with the same name as the cluster account name. To make a directory:
To copy your data from local disks to HDFS:
which is equivalent to:
Conversely, to copy data from HDFS to local storage:
To remove a file (or directory) from HDFS we can use:
Furthermore, options such as the replication factor and block size can be specified using the -D option.
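The HDFS operations described above can be sketched as a small Python helper that builds the corresponding `hdfs dfs` command lines. This is only an illustration: the account name `alice`, the `/user/alice` home folder, the file names and the replication value are hypothetical placeholders, and `-copyFromLocal` is assumed to be the equivalent form of `-put` that the text refers to. The helper only builds and prints the commands; on the login node each list could be passed to `subprocess.run` to execute it.

```python
import shlex


def hdfs_dfs(*args, options=None):
    """Build an `hdfs dfs` command as an argument list.

    `options` holds generic -D settings, which are placed before the
    file-system command itself.
    """
    cmd = ["hdfs", "dfs"]
    for key, value in (options or {}).items():
        cmd += ["-D", f"{key}={value}"]
    cmd += list(args)
    return cmd


USER = "alice"            # hypothetical cluster account name
HOME = f"/user/{USER}"    # assumed location of the user folder on HDFS

commands = [
    hdfs_dfs("-ls", "/user"),                               # check existing files
    hdfs_dfs("-mkdir", "-p", HOME),                         # create the user folder
    hdfs_dfs("-put", "data.csv", HOME),                     # local disk -> HDFS
    hdfs_dfs("-copyFromLocal", "data.csv", HOME),           # alternative to -put
    hdfs_dfs("-get", f"{HOME}/data.csv", "data_copy.csv"),  # HDFS -> local disk
    hdfs_dfs("-rm", "-r", f"{HOME}/old_output"),            # remove a directory
    # Same upload, but with an explicit replication factor set through -D.
    hdfs_dfs("-put", "data.csv", HOME, options={"dfs.replication": 2}),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell-quoting problems when a path contains spaces.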
A detailed HDFS command guide can be found at the following link:
Writing MapReduce Programs
Interact with Hadoop using Pig script:
To run a Pig script, type the following in the Linux console: pig pig_script.pig
or enter the Pig console directly: pig
Note that the Pig script is submitted via YARN, which will be introduced later.
Writing MapReduce program using Java:
A Java program
It is recommended to compile the Java source code using “Ant”, where all the dependencies and libraries are configured in the build.xml (see Appendix for details).
A MapReduce program usually consists of three main classes: one Mapper class, one Reducer class and one runner class that runs the Hadoop job. A Combiner class is optional.
MapReduce with Hadoop Streaming API:
The Hadoop Streaming API allows any executable or script to serve as the mapper and/or the reducer; this example uses Python.
Create a folder named “app”
Create a Python script to act as the mapper, e.g., mapper.py
Create a Python script to act as the reducer, e.g., reducer.py
If third-party scripts are imported, put the source code in the same folder.
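As an illustration of what the two scripts might contain, the following is a minimal word-count sketch, not the scripts used on ROGER. The function names `map_lines` and `reduce_lines` are hypothetical. The sketch relies on the Hadoop Streaming convention that records are tab-separated key/value lines on standard input and output, and that the reducer receives its input sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer logic: sum the counts of each word.

    Hadoop Streaming sorts mapper output by key before the reduce phase,
    so records with the same word arrive next to each other.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# app/mapper.py would import sys and end with:
#     for record in map_lines(sys.stdin):
#         print(record)
# app/reducer.py would import sys and end with:
#     for record in reduce_lines(sys.stdin):
#         print(record)
```

Before submitting a job, the pair can be checked locally on a small sample by piping the mapper output through `sort` and into the reducer.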
Writing Spark Programs
Currently, Spark supports Scala (native), Java, Python and R (in this case, we will use Python as an example).
Spark can be launched directly on the local node by typing: pyspark
It can also execute a Python script:
pyspark script.py (on the local node)
spark-submit --master yarn-client script.py (as a job running in the cluster; options must come before the script name, because anything after it is passed to the script as arguments)
Submitting Jobs (via YARN)
Hadoop jobs are managed by YARN, which queues jobs and assigns computing resources to them.
Submit Java programs:
Input parameters are then handled by the main runner class of the program. They usually include the input and output paths on HDFS, as well as some configuration settings for the job.
Submit MapReduce jobs with Hadoop Streaming API (Python)
If your source code only contains mapper.py and reducer.py:
If your source code utilizes third-party libraries:
Note: the “-file” option uploads the specified Python file to the entire cluster so that every node has a copy and can execute the script, whereas “-files” packages all the data and code in the folder and uploads it to the cluster.
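The shape of a streaming submission can be sketched by building the command line in Python. This is a hedged illustration rather than the exact command used on ROGER: the streaming jar location is installation-specific and is shown as a placeholder, and the `/user/alice` input and output paths are hypothetical. The sketch passes the scripts with the generic `-files` option as a comma-separated list and invokes them through `python`, so they do not need to be executable.

```python
import shlex


def streaming_command(streaming_jar, input_path, output_path, shipped_files):
    """Build a `hadoop jar` command line for a Python streaming job."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files must come before streaming options.
        "-files", ",".join(shipped_files),
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
        "-input", input_path,
        "-output", output_path,
    ]


cmd = streaming_command(
    streaming_jar="/path/to/hadoop-streaming.jar",  # placeholder, not a real path
    input_path="/user/alice/input",                 # hypothetical HDFS input folder
    output_path="/user/alice/output",               # hypothetical; must not exist yet
    shipped_files=["app/mapper.py", "app/reducer.py"],
)
print(shlex.join(cmd))
```

Third-party modules placed in the same folder can be appended to `shipped_files` so that they are distributed together with the mapper and reducer.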
Submit Spark jobs
Spark provides a suite of interfaces for reading and writing files in different storage systems, such as the local disk or HDFS. To take advantage of HDFS for large files, it is recommended to submit Spark jobs via YARN. Several parameters can be configured to tune performance:
In particular, “--master yarn-client” requires that the computing environment on every node be configured identically; otherwise errors will occur. For example, if a program needs the numpy package, which is available on the namenode via: module load anaconda, the “--master” option cannot be set to “yarn-client”, because anaconda is not available on the other nodes, so the command will be:
In addition, to use third-party scripts or files, each file must be explicitly added in the program so that it is uploaded to the cluster and distributed to each node, for example:
Also, the program needs to find the path to the files by using the SparkFiles.get() function, e.g.:
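A minimal sketch of this pattern is shown below, assuming a hypothetical auxiliary file `stopwords.txt` next to the script. The file is registered with `SparkContext.addFile()` and then located on each node with `SparkFiles.get()`. The helper `load_stopwords` and the sample data are illustrative only.

```python
def load_stopwords(path):
    """Read one stop word per line from a plain-text file."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def main():
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="sparkfiles-sketch")

    # Register the file; Spark ships a copy to every node that runs tasks.
    sc.addFile("stopwords.txt")  # hypothetical file name

    def drop_stopwords(words):
        # SparkFiles.get() returns the local path of the shipped copy
        # on whichever node executes this partition.
        stopwords = load_stopwords(SparkFiles.get("stopwords.txt"))
        return (word for word in words if word not in stopwords)

    words = sc.parallelize(["the", "hadoop", "cluster", "of", "roger"])
    print(words.mapPartitions(drop_stopwords).collect())
    sc.stop()


# Call main() at the end of the script when submitting it with spark-submit.
```

Loading the file inside `mapPartitions` reads it once per partition rather than once per record.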
An example build.xml
Spark parameters directly set in programs
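As a hedged sketch of setting parameters inside a program, the configuration can be collected in a dictionary and applied through `SparkConf`. The property names are standard Spark settings, but the values and the application name below are illustrative assumptions, not tuned recommendations for ROGER.

```python
# Illustrative values only; suitable numbers depend on the job and the cluster.
SPARK_SETTINGS = {
    "spark.executor.memory": "2g",
    "spark.executor.cores": "2",
    "spark.executor.instances": "4",
}


def build_context(app_name="roger-sketch", settings=SPARK_SETTINGS):
    """Create a SparkContext with the given settings applied."""
    # Imported here so the settings can be inspected without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setAll(list(settings.items()))
    return SparkContext(conf=conf)
```

Settings applied this way are fixed when the context is created, so they need to be in place before any RDD is built.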