Deploy Hadoop CDH5.4 on CentOS 6: Install HDFS, YARN, and WebHDFS
This guide walks through preparing three CentOS 6.9 nodes, configuring hostnames, time sync, password‑less SSH, disabling IPv6, installing JDK, downloading CDH 5.4, setting up core‑site and hdfs‑site XML files, formatting the NameNode, starting HDFS services, configuring YARN and MapReduce, and verifying the installations via the Web UI.
Overview
This guide details a step‑by‑step deployment of a Cloudera CDH 5.4 Hadoop cluster on three CentOS 6.9 nodes (cdh1, cdh2, cdh3). It covers host preparation, package installation, core Hadoop configuration (HDFS, HttpFS), YARN and MapReduce setup, and verification through the web UIs.
Prerequisites
Operating system: CentOS 6.9 (kernel 2.6.32‑696.el6.x86_64).
Three machines with static IPs: 192.168.199.132 (cdh1), 192.168.199.133 (cdh2), 192.168.199.134 (cdh3).
Each node: 2 CPU, 1 GB RAM, 20 GB disk.
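Planned roles per node (this guide installs only the HDFS, YARN, and MapReduce roles; HBase, Hive, Impala, and Sentry are listed to show the overall layout):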
cdh1 – NameNode, ResourceManager, HBase, Hive Metastore, Impala Catalog, Impala Statestore, Sentry.
cdh2 – DataNode, SecondaryNameNode, NodeManager, HBase, HiveServer2, Impala Server.
cdh3 – DataNode, NodeManager, HBase, HiveServer2, Impala Server.
1. Preparation
Set hostnames to cdh1, cdh2, cdh3.
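On CentOS 6 the hostname is changed at runtime with the hostname command and persisted through the HOSTNAME= line in /etc/sysconfig/network. The following is a minimal sketch of that step; the persist_hostname helper is a hypothetical name introduced here for illustration, and its optional file argument exists only so the function can be tried on a copy of the file first.

```shell
#!/bin/sh
# persist_hostname NAME [FILE]
# Hypothetical helper: rewrites (or appends) the HOSTNAME= line in FILE.
# FILE defaults to /etc/sysconfig/network, which CentOS 6 reads at boot.
persist_hostname() {
    name=$1
    file=${2:-/etc/sysconfig/network}
    if grep -q '^HOSTNAME=' "$file" 2>/dev/null; then
        # Replace the existing entry in place (GNU sed, as shipped with CentOS).
        sed -i "s/^HOSTNAME=.*/HOSTNAME=$name/" "$file"
    else
        # No entry yet: append one.
        echo "HOSTNAME=$name" >> "$file"
    fi
}

# Example, run as root on the first node:
#   hostname cdh1            # takes effect immediately
#   persist_hostname cdh1    # survives a reboot
```

Repeat with cdh2 and cdh3 on the other two nodes.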
Configure DNS resolution by editing /etc/hosts on every node:
192.168.199.132 cdh1
192.168.199.133 cdh2
192.168.199.134 cdh3
Synchronize time using NTP:
# on cdh1
ntpdate cn.pool.ntp.org
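# on cdh2 and cdh3 (requires an NTP server such as ntpd to be running on cdh1)
ntpdate cdh1
Enable password‑less SSH between the nodes: generate an RSA key pair with ssh-keygen -t rsa, then copy the public key to every node with ssh-copy-id.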
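The key generation and distribution can be sketched as a small loop. This is an illustration, not the original author's script: the run and setup_passwordless_ssh helpers and the DRY_RUN variable are hypothetical names, and the sketch assumes the root account is used on every node, as in the scp commands later in this guide.

```shell
#!/bin/sh
# Hypothetical helper: prints the command when DRY_RUN=1, otherwise runs it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# setup_passwordless_ssh HOST...
# Generates an RSA key pair if none exists, then copies the public key to each host.
setup_passwordless_ssh() {
    key="$HOME/.ssh/id_rsa"
    # -N '' sets an empty passphrase so later logins need no prompt.
    [ -f "$key" ] || run ssh-keygen -t rsa -N '' -f "$key"
    for host in "$@"; do
        # ssh-copy-id asks for the remote password once per host.
        run ssh-copy-id -i "$key.pub" "root@$host"
    done
}

# Example: preview the commands first, then run them for real.
#   DRY_RUN=1 setup_passwordless_ssh cdh1 cdh2 cdh3
#   setup_passwordless_ssh cdh1 cdh2 cdh3
```

Running the function on each of the three nodes lets every node reach every other node without a password.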
Disable IPv6 (required by CDH) by adding to /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
Apply the settings with sysctl -p, then verify that cat /proc/sys/net/ipv6/conf/all/disable_ipv6 returns 1.
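The check can be extended to all three entries set in /etc/sysctl.conf. The sketch below assumes the standard /proc layout; the ipv6_disabled helper is a hypothetical name, and its optional base-directory argument exists only so the function can be tried against a test tree.

```shell
#!/bin/sh
# ipv6_disabled [BASE_DIR]
# Hypothetical helper: succeeds only if disable_ipv6 is 1 for all, default and lo.
# A missing file is treated as a failure as well.
ipv6_disabled() {
    base=${1:-/proc/sys/net/ipv6/conf}
    for iface in all default lo; do
        value=$(cat "$base/$iface/disable_ipv6" 2>/dev/null)
        if [ "$value" != "1" ]; then
            echo "IPv6 not confirmed disabled for: $iface" >&2
            return 1
        fi
    done
    return 0
}

# Example: ipv6_disabled && echo "IPv6 is disabled"
```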
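Flush firewall rules with iptables -F. The flush lasts only until the iptables service is restarted; to keep the firewall off across reboots, also run service iptables stop and chkconfig iptables off.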
Install JDK 1.7 on all nodes:
rpm -ivh jdk-7u55-linux-x64.rpm
java -version # verify
2. Hadoop (CDH 5.4) Installation
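CDH 5.4 can be installed from the tarball or from the native RPM packages; this guide uses the RPM packages.
Configure the Cloudera YUM repositories. The hadoop packages come from the main CDH 5 repository, which must also be configured. The GPL Extras repository, which supplies the LZO packages used later, can be defined in /etc/yum.repos.d/cloudera-gplextras5.repo with the following content: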
[cloudera-gplextras5]
name=Cloudera GPL Extras 5
baseurl=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/5/
gpgkey=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/RPM-GPG-KEY-cloudera
gpgcheck=1
Install Hadoop packages per role:
# cdh1 (NameNode, ResourceManager, HBase, Hive Metastore, Impala Catalog, Impala Statestore, Sentry)
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-namenode -y
# cdh2 (DataNode, SecondaryNameNode, NodeManager, HBase, HiveServer2, Impala Server)
yum install hadoop-hdfs-secondarynamenode -y
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y
# cdh3 (same as cdh2, without the SecondaryNameNode)
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y
Configure the core Hadoop files on the NameNode (cdh1); they are copied to the other nodes in step 3.
/etc/hadoop/conf/core-site.xml – set fs.defaultFS to hdfs://cdh1:8020.
/etc/hadoop/conf/hdfs-site.xml – define the NameNode and DataNode directories and the superuser group:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/dfs/dn</value>
</property>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
Create the directories on each host and set ownership:
mkdir -p /data/dfs/nn /data/dfs/dn
chown -R hdfs:hdfs /data/dfs/nn /data/dfs/dn
chmod 700 /data/dfs/nn
Enable HttpFS, which exposes a WebHDFS‑compatible REST API, on the NameNode: run yum install hadoop-httpfs -y and add proxy‑user settings to core-site.xml:
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
Install the LZO compression libraries (optional, but needed if any data is LZO‑compressed): run yum install 'hadoop-lzo*' impala-lzo -y and add the codecs to core-site.xml:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
3. Start HDFS
Synchronize the configuration directory to the other nodes:
scp -r /etc/hadoop/conf root@cdh2:/etc/hadoop/
scp -r /etc/hadoop/conf root@cdh3:/etc/hadoop/
Format the NameNode on cdh1 (only once, before the first start, because formatting erases existing HDFS metadata):
sudo -u hdfs hdfs namenode -format
Start all HDFS daemons on each host:
for svc in $(ls /etc/init.d/ | grep hadoop-hdfs); do service $svc start; done
Create the /tmp directory that Hadoop requires in HDFS and make it world‑writable with the sticky bit set:
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
Start the HttpFS service on cdh1:
service hadoop-httpfs start
Verify the NameNode UI at http://192.168.199.132:50070/.
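Besides the web UI, HDFS and HttpFS can be checked from the command line. The sketch below assumes that HttpFS listens on its default port 14000 and uses pseudo authentication through the user.name parameter; the httpfs_url helper is a hypothetical name introduced for illustration.

```shell
#!/bin/sh
# httpfs_url HOST PATH OP [USER]
# Hypothetical helper: builds an HttpFS REST URL (default port 14000 assumed).
httpfs_url() {
    host=$1
    path=$2
    op=$3
    user=${4:-hdfs}
    echo "http://$host:14000/webhdfs/v1$path?op=$op&user.name=$user"
}

# Example, once the daemons are up:
#   sudo -u hdfs hdfs dfsadmin -report                # expected to list cdh2 and cdh3 as live DataNodes
#   curl -s "$(httpfs_url cdh1 /tmp GETFILESTATUS)"   # expected to return a JSON FileStatus for /tmp
```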
4. YARN and MapReduce Configuration
Install YARN packages:
# cdh1 (ResourceManager, HistoryServer, ProxyServer)
yum install hadoop-yarn hadoop-yarn-resourcemanager hadoop-mapreduce-historyserver hadoop-yarn-proxyserver -y
# cdh2 and cdh3 (NodeManager, MapReduce)
yum install hadoop-yarn hadoop-yarn-nodemanager hadoop-mapreduce -y
Configure yarn-site.xml on cdh1 (ResourceManager addresses, web UI, auxiliary services, log aggregation, and application classpath):
<property>
<name>yarn.resourcemanager.address</name>
<value>cdh1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>cdh1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>cdh1:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
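</property>
Create local and log directories for NodeManager on each NodeManager host. These paths take effect only if yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in yarn-site.xml point to them: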
mkdir -p /data/yarn/{local,logs}
chown -R yarn:yarn /data/yarn
Then create the remote log directory in HDFS (the path that yarn.nodemanager.remote-app-log-dir should point to):
sudo -u hdfs hadoop fs -mkdir -p /yarn/apps
sudo -u hdfs hadoop fs -chown yarn:mapred /yarn/apps
sudo -u hdfs hadoop fs -chmod 1777 /yarn/apps
Configure the MapReduce JobHistory Server in mapred-site.xml on cdh1:
<property>
<name>mapreduce.jobhistory.address</name>
<value>cdh1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>cdh1:19888</value>
</property>
Add proxy‑user entries for mapred and yarn in core-site.xml (hosts and groups set to *).
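The proxy‑user entries follow the same hadoop.proxyuser.USER.hosts and hadoop.proxyuser.USER.groups pattern as the httpfs entries earlier. As a convenience, the sketch below prints the four property elements; the proxyuser_props helper is a hypothetical name, and the output still has to be pasted by hand inside the configuration element of core-site.xml.

```shell
#!/bin/sh
# proxyuser_props USER...
# Hypothetical helper: prints core-site.xml proxy-user properties for each user,
# with hosts and groups both set to *.
proxyuser_props() {
    for user in "$@"; do
        for kind in hosts groups; do
            printf '<property>\n<name>hadoop.proxyuser.%s.%s</name>\n<value>*</value>\n</property>\n' "$user" "$kind"
        done
    done
}

# Example: proxyuser_props mapred yarn
```

A wildcard is the most permissive setting; on a shared cluster the hosts and groups values would normally be narrowed.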
Create HDFS user directories required by MapReduce:
sudo -u hdfs hadoop fs -mkdir -p /user
sudo -u hdfs hadoop fs -chmod 777 /user
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
Synchronize the updated configuration files to cdh2 and cdh3 (same scp commands as in step 3).
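Because the configuration is pushed out more than once, the copy can be wrapped in a loop. This is a sketch under the same assumptions as the scp commands in step 3 (root login, configuration under /etc/hadoop/conf); the sync_hadoop_conf helper and the DRY_RUN variable are hypothetical names.

```shell
#!/bin/sh
# sync_hadoop_conf HOST...
# Hypothetical helper: copies /etc/hadoop/conf to each host.
# With DRY_RUN=1 the scp commands are only printed.
sync_hadoop_conf() {
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scp -r /etc/hadoop/conf root@$host:/etc/hadoop/"
        else
            # Stop at the first failure so a partially synced cluster is noticed.
            scp -r /etc/hadoop/conf "root@$host:/etc/hadoop/" || return 1
        fi
    done
}

# Example: sync_hadoop_conf cdh2 cdh3
```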
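Start the YARN services on every node:
for svc in $(ls /etc/init.d/ | grep hadoop-yarn); do service $svc start; done
Start the JobHistory Server on cdh1, because its init script is not matched by the hadoop-yarn pattern:
service hadoop-mapreduce-historyserver start
5. Validation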
After all daemons are running, access the following web interfaces to confirm successful deployment:
NameNode UI: http://192.168.199.132:50070/
YARN ResourceManager UI: http://192.168.199.132:8088/
MapReduce JobHistory UI: http://192.168.199.132:19888/jobhistory
Each page should display the corresponding service dashboard, indicating that HDFS, YARN, and MapReduce are operational.
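The three pages can also be probed from a shell, which is convenient on a headless host. The sketch below only checks the HTTP status code of each URL and assumes curl is installed; the check_web_uis helper is a hypothetical name introduced for illustration.

```shell
#!/bin/sh
# check_web_uis HOST
# Hypothetical helper: prints the HTTP status code of each web UI and
# fails if any of them is not a 2xx or 3xx response.
check_web_uis() {
    host=$1
    rc=0
    for url in "http://$host:50070/" "http://$host:8088/" "http://$host:19888/jobhistory"; do
        # -w '%{http_code}' prints only the status code; 000 means no connection.
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
        echo "$code $url"
        case $code in
            2??|3??) ;;
            *) rc=1 ;;
        esac
    done
    return $rc
}

# Example: check_web_uis 192.168.199.132
```

A successful status code shows that the daemon answers HTTP requests; it does not replace looking at the dashboards for live DataNodes and NodeManagers.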
