Running Hadoop on CentOS 6 (Multi-Node Cluster)
This demonstration has been tested with the following software versions and network settings:
Hostname | IP Address | Roles |
---|---|---|
hdp01 | 192.168.1.39 | NameNode, JobTracker, DataNode, TaskTracker |
hdp02 | 192.168.1.40 | DataNode, TaskTracker |
hdp03 | 192.168.1.41 | DataNode, TaskTracker |
Candidate CentOS VMs
After installing CentOS in VirtualBox, bring the system up to date by applying all updates and install the wget tool for later use. Note that the following instructions are executed on hdp01.
[root@hdp01 ~]# yum -y update
[root@hdp01 ~]# yum -y install wget
Here we disable the iptables firewall on Red Hat/CentOS Linux to simplify the demonstration.
[root@hdp01 ~]# service iptables stop && chkconfig iptables off
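A quick check confirms that the firewall is now stopped and will stay off across reboots:
[root@hdp01 ~]# service iptables status
[root@hdp01 ~]# chkconfig --list iptables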
Install the Oracle Java SDK and configure the system environment variables. Note that Oracle has recently disallowed direct downloads of Java from its servers; [2] provides a feasible workaround:
wget -c --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F" "<DOWNLOAD URL>" --output-document="<DOWNLOAD FILE NAME>"
For example, we can use the following command to obtain the JDK 1.7.0_07 RPM:
[root@hdp01 ~]# wget -c --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F" "http://download.oracle.com/otn-pub/java/jdk/7u7-b10/jdk-7u7-linux-x64.rpm" --output-document="jdk-7u7-linux-x64.rpm"
[root@hdp01 ~]# rpm -ivh jdk-7u7-linux-x64.rpm
[root@hdp01 ~]# vi /etc/profile
Add the following variables to /etc/profile:
JAVA_HOME=/usr/java/jdk1.7.0_07
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar
export PATH JAVA_HOME CLASSPATH
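Reload the profile and confirm that the JDK is picked up; java -version should report the installed 1.7.0_07 release:
[root@hdp01 ~]# source /etc/profile
[root@hdp01 ~]# echo $JAVA_HOME
[root@hdp01 ~]# java -version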
Finally, we create a dedicated user hadoop to perform the Hadoop operations and add it to the root group:
[root@hdp01 ~]# adduser hadoop
[root@hdp01 ~]# passwd hadoop
[root@hdp01 ~]# gpasswd -a hadoop root
[root@hdp01 ~]# grep "^root" /etc/group
OpenSSH Packages
The minimal installation of CentOS omits some packages, e.g., the scp command shipped in openssh-clients.
[root@hdp01 ~]# yum -y install openssh-server openssh-clients
[root@hdp01 ~]# chkconfig sshd on
[root@hdp01 ~]# service sshd start
[root@hdp01 ~]# yum -y install rsync
FQDN Mapping
Add the following entries (on all machines) to /etc/hosts to make sure each host is reachable by name.
127.0.0.1 localhost
192.168.1.39 hdp01
192.168.1.40 hdp02
192.168.1.41 hdp03
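Once all three VMs are up (see the cloning step below), a quick ping verifies that each hostname resolves and is reachable:
[root@hdp01 ~]# ping -c 1 hdp02
[root@hdp01 ~]# ping -c 1 hdp03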
VBoxManage Duplication
Clone the hdp01 disk image twice to create the hdp02 and hdp03 VMs:
VBoxManage clonehd <source_file> <output_file>
After clonehd, the cloned VM comes up with a network problem: because the clone's NIC has a new MAC address, udev records it as eth1. Delete the stale eth0 line, rename the eth1 entry to eth0, and also update the HOSTNAME value.
[root@hdp01 ~]# vi /etc/udev/rules.d/70-persistent-net.rules
[root@hdp01 ~]# vi /etc/sysconfig/network
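For example, on the clone that becomes hdp02 (hdp03 is analogous), /etc/sysconfig/network ends up roughly as follows; a reboot, or restarting the network service, makes the new settings take effect:
# /etc/sysconfig/network on the clone that becomes hdp02
NETWORKING=yes
HOSTNAME=hdp02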
SSH Access
[hadoop@hdp01 ~]$ ssh-keygen -t rsa -P ''
[hadoop@hdp01 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hdp01 ~]$ chmod 600 ~/.ssh/authorized_keys
[hadoop@hdp01 ~]$ scp -r ~/.ssh hdp02:~/
[hadoop@hdp01 ~]$ scp -r ~/.ssh hdp03:~/
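Passwordless login can then be verified from the master; each command should print the remote hostname without asking for a password:
[hadoop@hdp01 ~]$ ssh hdp02 hostname
[hadoop@hdp01 ~]$ ssh hdp03 hostname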
Hadoop Configuration
[hadoop@hdp01 ~]$ wget http://ftp.tc.edu.tw/pub/Apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
[hadoop@hdp01 ~]$ tar zxvf hadoop-1.0.3.tar.gz
Setup the Hadoop environment with the following files:
hadoop-1.0.3/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hdp01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/hadoop-${user.name}</value>
</property>
</configuration>
hadoop-1.0.3/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
hadoop-1.0.3/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdp01:9001</value>
</property>
</configuration>
hadoop-1.0.3/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_07
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/conf
export HADOOP_HOME_WARN_SUPPRESS=1
Deployment
[hadoop@hdp01 ~]$ scp -r hadoop-1.0.3 hdp02:~/
[hadoop@hdp01 ~]$ scp -r hadoop-1.0.3 hdp03:~/
# directory setup, repeated on each machine (hdp*)
[root@hdp01 ~]# cp -r /home/hadoop/hadoop-1.0.3 /opt/hadoop
[root@hdp01 ~]# cd /opt
[root@hdp01 opt]# mkdir /var/hadoop
[root@hdp01 opt]# chown -R hadoop:hadoop hadoop
[root@hdp01 opt]# chown -R hadoop:hadoop /var/hadoop
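As user hadoop, a quick smoke test from /opt/hadoop confirms that the permissions and hadoop-env.sh settings are in order; the version banner should print without complaints about JAVA_HOME:
[hadoop@hdp01 ~]$ cd /opt/hadoop
[hadoop@hdp01 hadoop]$ bin/hadoop version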
Configuration (Master Only)
hadoop-1.0.3/conf/masters
hdp01
hadoop-1.0.3/conf/slaves
hdp01
hdp02
hdp03
Formatting the HDFS filesystem via the NameNode (master only)
[hadoop@hdp01 hadoop]$ bin/hadoop namenode -format
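With hadoop.tmp.dir set to /var/hadoop/hadoop-${user.name} and the default dfs.name.dir, the formatted NameNode metadata should land under /var/hadoop/hadoop-hadoop/dfs/name; listing that directory is a quick way to confirm the format succeeded:
[hadoop@hdp01 hadoop]$ ls /var/hadoop/hadoop-hadoop/dfs/name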
Now we can start the multi-node cluster with the bin/start-all.sh command and stop it with bin/stop-all.sh. Finally, we can put some data into HDFS and run the wordcount example on the master machine to verify the whole process, as sketched below.
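A minimal end-to-end check might look like the following, using the stock hadoop-examples-1.0.3.jar shipped with the release and the conf/*.xml files as sample input:
[hadoop@hdp01 hadoop]$ bin/start-all.sh
[hadoop@hdp01 hadoop]$ bin/hadoop dfsadmin -report    # wait until HDFS leaves safe mode and all DataNodes report in
[hadoop@hdp01 hadoop]$ bin/hadoop fs -put conf input
[hadoop@hdp01 hadoop]$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input output
[hadoop@hdp01 hadoop]$ bin/hadoop fs -cat output/part-r-00000
[hadoop@hdp01 hadoop]$ bin/stop-all.sh
The dfsadmin report should list all three DataNodes as live, and the NameNode and JobTracker web interfaces at http://hdp01:50070/ and http://hdp01:50030/ provide another way to watch the cluster.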