Running Hadoop on CentOS 6 (Multi-Node Cluster)
This demonstration has been tested with the following software versions and network settings:
Hostname | IP Address | Roles |
---|---|---|
hdp01 | 192.168.1.39 | NameNode, JobTracker, DataNode, TaskTracker |
hdp02 | 192.168.1.40 | DataNode, TaskTracker |
hdp03 | 192.168.1.41 | DataNode, TaskTracker |
Candidate CentOS VMs
After installing CentOS in VirtualBox, bring the system up to date by applying all updates and install the wget tool for later use. Note that the following instructions are executed on hdp01.
[root@hdp01 ~]# yum -y update
[root@hdp01 ~]# yum -y install wget
Here we disable the iptables firewall on Red Hat/CentOS Linux to simplify the demonstration.
[root@hdp01 ~]# service iptables stop && chkconfig iptables off
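A quick check confirms that the firewall is now stopped and will stay off across reboots:
[root@hdp01 ~]# service iptables status
[root@hdp01 ~]# chkconfig --list iptables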
Install the Oracle Java SDK and configure the system environment variables. Note that Oracle has recently disallowed direct downloads of Java from its servers; [2] provides a feasible workaround:
wget -c --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F" "<DOWNLOAD URL>" --output-document="<DOWNLOAD FILE NAME>"
For example, we can use the following command to obtain the JDK 1.7.0_07 RPM:
[root@hdp01 ~]# wget -c --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F" "http://download.oracle.com/otn-pub/java/jdk/7u7-b10/jdk-7u7-linux-x64.rpm" --output-document="jdk-7u7-linux-x64.rpm"
[root@hdp01 ~]# rpm -ivh jdk-7u7-linux-x64.rpm
[root@hdp01 ~]# vi /etc/profile
Add the following variables to /etc/profile:
JAVA_HOME=/usr/java/jdk1.7.0_07
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar
export PATH JAVA_HOME CLASSPATH
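Reload the profile and confirm that the JDK is picked up; java -version should report the installed 1.7.0_07 release:
[root@hdp01 ~]# source /etc/profile
[root@hdp01 ~]# echo $JAVA_HOME
[root@hdp01 ~]# java -version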
Finally, we create a dedicated user hadoop to perform the Hadoop operations and add it to the root group:
[root@hdp01 ~]# adduser hadoop
[root@hdp01 ~]# passwd hadoop
[root@hdp01 ~]# gpasswd -a hadoop root
[root@hdp01 ~]# grep "^root" /etc/group
OpenSSH Packages
The minimal installation of CentOS omits some packages, e.g., the scp command shipped in openssh-clients.
[root@hdp01 ~]# yum -y install openssh-server openssh-clients
[root@hdp01 ~]# chkconfig sshd on
[root@hdp01 ~]# service sshd start
[root@hdp01 ~]# yum -y install rsync
FQDN Mapping
Add the following entries (on all machines) to /etc/hosts to make sure each host is reachable by name.
127.0.0.1 localhost
192.168.1.39 hdp01
192.168.1.40 hdp02
192.168.1.41 hdp03
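Once all three VMs are up (see the cloning step below), a quick ping verifies that each hostname resolves and is reachable:
[root@hdp01 ~]# ping -c 1 hdp02
[root@hdp01 ~]# ping -c 1 hdp03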
VBoxManage Duplication
Clone the hdp01 disk image twice to create the hdp02 and hdp03 VMs:
VBoxManage clonehd <source_file> <output_file>
After clonehd, the cloned VM comes up with a network problem: because the clone's NIC has a new MAC address, udev records it as eth1. Delete the stale eth0 line, rename the eth1 entry to eth0, and also update the HOSTNAME value.
[root@hdp01 ~]# vi /etc/udev/rules.d/70-persistent-net.rules
[root@hdp01 ~]# vi /etc/sysconfig/network
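For example, on the clone that becomes hdp02 (hdp03 is analogous), /etc/sysconfig/network ends up roughly as follows; a reboot, or restarting the network service, makes the new settings take effect:
# /etc/sysconfig/network on the clone that becomes hdp02
NETWORKING=yes
HOSTNAME=hdp02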
SSH Access
[hadoop@hdp01 ~]$ ssh-keygen -t rsa -P ''
[hadoop@hdp01 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hdp01 ~]$ chmod 600 ~/.ssh/authorized_keys
[hadoop@hdp01 ~]$ scp -r ~/.ssh hdp02:~/
[hadoop@hdp01 ~]$ scp -r ~/.ssh hdp03:~/
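Passwordless login can then be verified from the master; each command should print the remote hostname without asking for a password:
[hadoop@hdp01 ~]$ ssh hdp02 hostname
[hadoop@hdp01 ~]$ ssh hdp03 hostname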
Hadoop Configuration
[hadoop@hdp01 ~]$ wget http://ftp.tc.edu.tw/pub/Apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
[hadoop@hdp01 ~]$ tar zxvf hadoop-1.0.3.tar.gz
Setup the Hadoop environment with the following files:
hadoop-1.0.3/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hdp01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/hadoop-${user.name}</value>
</property>
</configuration>
hadoop-1.0.3/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
hadoop-1.0.3/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdp01:9001</value>
</property>
</configuration>
hadoop-1.0.3/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_07
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/conf
export HADOOP_HOME_WARN_SUPPRESS=1
Deployment
[hadoop@hdp01 ~]$ scp -r hadoop-1.0.3 hdp02:~/
[hadoop@hdp01 ~]$ scp -r hadoop-1.0.3 hdp03:~/
# directory setup, repeated on each machine (hdp*)
[root@hdp01 ~]# cp -r /home/hadoop/hadoop-1.0.3 /opt/hadoop
[root@hdp01 ~]# cd /opt
[root@hdp01 opt]# mkdir /var/hadoop
[root@hdp01 opt]# chown -R hadoop:hadoop hadoop
[root@hdp01 opt]# chown -R hadoop:hadoop /var/hadoop
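As user hadoop, a quick smoke test from /opt/hadoop confirms that the permissions and hadoop-env.sh settings are in order; the version banner should print without complaints about JAVA_HOME:
[hadoop@hdp01 ~]$ cd /opt/hadoop
[hadoop@hdp01 hadoop]$ bin/hadoop version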
Configuration (Master Only)
hadoop-1.0.3/conf/masters
hdp01
hadoop-1.0.3/conf/slaves
hdp01
hdp02
hdp03
Formatting the HDFS filesystem via the NameNode (master only)
[hadoop@hdp01 hadoop]$ bin/hadoop namenode -format
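With hadoop.tmp.dir set to /var/hadoop/hadoop-${user.name} and the default dfs.name.dir, the formatted NameNode metadata should land under /var/hadoop/hadoop-hadoop/dfs/name; listing that directory is a quick way to confirm the format succeeded:
[hadoop@hdp01 hadoop]$ ls /var/hadoop/hadoop-hadoop/dfs/name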
Now we can start the multi-node cluster with the bin/start-all.sh command and stop it with bin/stop-all.sh. Finally, we can put some data into HDFS and run the wordcount example on the master machine to verify the whole process, as sketched below.
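A minimal end-to-end check might look like the following, using the stock hadoop-examples-1.0.3.jar shipped with the release and the conf/*.xml files as sample input:
[hadoop@hdp01 hadoop]$ bin/start-all.sh
[hadoop@hdp01 hadoop]$ bin/hadoop dfsadmin -report    # wait until HDFS leaves safe mode and all DataNodes report in
[hadoop@hdp01 hadoop]$ bin/hadoop fs -put conf input
[hadoop@hdp01 hadoop]$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input output
[hadoop@hdp01 hadoop]$ bin/hadoop fs -cat output/part-r-00000
[hadoop@hdp01 hadoop]$ bin/stop-all.sh
The dfsadmin report should list all three DataNodes as live, and the NameNode and JobTracker web interfaces at http://hdp01:50070/ and http://hdp01:50030/ provide another way to watch the cluster.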