Deploy Hadoop CDH5.4 on CentOS 6: Install HDFS, YARN, and WebHDFS

This guide walks through preparing three CentOS 6.9 nodes, configuring hostnames, time sync, password‑less SSH, disabling IPv6, installing JDK, downloading CDH 5.4, setting up core‑site and hdfs‑site XML files, formatting the NameNode, starting HDFS services, configuring YARN and MapReduce, and verifying the installations via the Web UI.

Full-Stack DevOps & Kubernetes

Overview

This guide details a step‑by‑step deployment of a Cloudera CDH 5.4 Hadoop cluster on three CentOS 6.9 nodes (cdh1, cdh2, cdh3). It covers host preparation, package installation, core Hadoop configuration (HDFS, HttpFS), YARN and MapReduce setup, and verification through the web UIs.

Prerequisites

Operating system: CentOS 6.9 (kernel 2.6.32‑696.el6.x86_64).

Three machines with static IPs: 192.168.199.132 (cdh1), 192.168.199.133 (cdh2), 192.168.199.134 (cdh3).

Each node: 2 CPU, 1 GB RAM, 20 GB disk.

Required services per node:

cdh1 – NameNode, ResourceManager, HBase, Hive Metastore, Impala Catalog, Impala Statestore, Sentry.

cdh2 – DataNode, SecondaryNameNode, NodeManager, HBase, HiveServer2, Impala Server.

cdh3 – DataNode, NodeManager, HBase, HiveServer2, Impala Server.

1. Preparation

Set hostnames to cdh1, cdh2, cdh3.

Configure hostname resolution by adding these entries to /etc/hosts on every node:

192.168.199.132 cdh1
192.168.199.133 cdh2
192.168.199.134 cdh3
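The entries can also be appended by a small script, which is convenient when the same step has to be repeated on all three nodes. The sketch below is one possible approach and not part of the original procedure. The function names hosts_entries and add_missing_hosts are made up for illustration, and the duplicate check relies on grep -w, which GNU grep on CentOS provides.

```shell
# Print the /etc/hosts entries for the three nodes (addresses from Prerequisites).
hosts_entries() {
    printf '%s\n' \
        '192.168.199.132 cdh1' \
        '192.168.199.133 cdh2' \
        '192.168.199.134 cdh3'
}

# Append each entry to the file in $1 unless the hostname is already listed there.
add_missing_hosts() {
    file=$1
    hosts_entries | while read -r ip name; do
        grep -qw -- "$name" "$file" 2>/dev/null || echo "$ip $name" >> "$file"
    done
}
```

Run as root on each node with add_missing_hosts /etc/hosts. A second run adds nothing, and getent hosts cdh2 then confirms that the name resolves.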

Synchronize time using NTP:

# on cdh1: sync against a public pool
ntpdate cn.pool.ntp.org
# on cdh2 and cdh3: sync against cdh1 (requires ntpd to be running on cdh1)
ntpdate cdh1

Enable password‑less SSH (generate RSA keys with ssh-keygen -t rsa and copy them using ssh-copy-id).
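As a rough sketch of the key distribution, assuming the root account is used on every node (as in the scp commands later in this guide), the two tools can be combined as follows. The RUN variable and the setup_ssh_keys function are illustrative names, not something the original article defines. By default the script only prints the commands it would run; setting RUN to an empty value executes them.

```shell
# Dry run by default: commands are printed, not executed.
# Export RUN= (empty) before running this script to execute them for real.
RUN=${RUN-echo}

setup_ssh_keys() {
    key="$HOME/.ssh/id_rsa"
    # Generate an RSA key pair without a passphrase only if none exists yet.
    [ -f "$key" ] || $RUN ssh-keygen -t rsa -N '' -f "$key"
    # Install the public key on every node, including the local one.
    for host in cdh1 cdh2 cdh3; do
        $RUN ssh-copy-id -i "$key.pub" "root@$host"
    done
}

setup_ssh_keys
```

Afterwards, ssh root@cdh2 hostname should print cdh2 without asking for a password.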

Disable IPv6 (CDH 5 supports IPv4 only) by adding the following to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

Apply the settings with sysctl -p, then verify that cat /proc/sys/net/ipv6/conf/all/disable_ipv6 prints 1.
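When the check has to be repeated on several nodes, it can be wrapped in a small function. This is only a sketch: the name ipv6_disabled and the optional file argument (handy for trying the function against a test file) are assumptions made for the example.

```shell
# Succeed if the kernel flag file contains 1, i.e. IPv6 is disabled.
# $1 optionally overrides the flag file path.
ipv6_disabled() {
    flag=${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}
    [ "$(cat "$flag" 2>/dev/null)" = "1" ]
}

if ipv6_disabled; then
    echo "IPv6 is disabled"
else
    echo "IPv6 is still enabled (or the flag file is missing)"
fi
```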

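Flush firewall rules with iptables -F. The flush does not survive a reboot unless the iptables service is also disabled (service iptables stop; chkconfig iptables off).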

Install JDK 1.7 on all nodes:

rpm -ivh jdk-7u55-linux-x64.rpm
java -version   # verify

2. Hadoop (CDH 5.4) Installation

CDH 5.4 can be installed from the tarball or from the native RPM packages; this guide uses the RPMs.

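Configure the Cloudera YUM repositories. The core Hadoop packages installed below come from the main CDH 5 repository, which must be configured as well. The GPL Extras repository, which provides the LZO packages used later, is defined in /etc/yum.repos.d/cloudera-gplextras5.repo with the following content: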

[cloudera-gplextras5]
name=Cloudera GPL Extras 5
baseurl=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/5/
gpgkey=http://archive.cloudera.com/gplextras5/redhat/6/x86_64/gplextras/RPM-GPG-KEY-cloudera
gpgcheck=1

Install Hadoop packages per role:

# cdh1 (NameNode, ResourceManager, HBase, Hive Metastore, Impala Catalog, Impala Statestore, Sentry)
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-namenode -y

# cdh2 (DataNode, SecondaryNameNode, NodeManager, HBase, HiveServer2, Impala Server)
yum install hadoop-hdfs-secondarynamenode -y
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y

# cdh3 (DataNode, NodeManager, HBase, HiveServer2, Impala Server)
yum install hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo hadoop-hdfs-datanode -y
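The per-role package lists above can be captured in one helper so that the same script runs unchanged on every node. The following is a sketch under the assumption that each machine's short hostname is cdh1, cdh2, or cdh3; the function name hdfs_packages_for is hypothetical.

```shell
# Print the HDFS-related packages for the given host, following the role layout above.
hdfs_packages_for() {
    common="hadoop hadoop-hdfs hadoop-client hadoop-doc hadoop-debuginfo"
    case $1 in
        cdh1) echo "$common hadoop-hdfs-namenode" ;;
        cdh2) echo "$common hadoop-hdfs-datanode hadoop-hdfs-secondarynamenode" ;;
        cdh3) echo "$common hadoop-hdfs-datanode" ;;
        *)    echo "unknown host: $1" >&2; return 1 ;;
    esac
}
```

On each node this could be used as yum install -y $(hdfs_packages_for "$(hostname -s)"), leaving the command substitution unquoted so that the list is split into separate package names.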

Configure the core Hadoop files on the NameNode (cdh1); they are copied to the other nodes in step 3.

/etc/hadoop/conf/core-site.xml – set fs.defaultFS to hdfs://cdh1:8020.

/etc/hadoop/conf/hdfs-site.xml – define the NameNode and DataNode directories and the superuser group:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/dfs/nn</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/dfs/dn</value>
</property>
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>hadoop</value>
</property>
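After editing, it helps to read a value back from the file and confirm it is what was intended. Hadoop ships its own tool for this (hdfs getconf -confKey fs.defaultFS). The sketch below is a dependency-free alternative that works on the raw XML; the function name get_prop is made up, and the parsing assumes each name and value tag sits on its own line, as in the snippets in this guide.

```shell
# Print the value of property $2 from the Hadoop XML file $1.
# Assumes <name> and <value> are on separate lines, value directly after name.
get_prop() {
    awk -v key="$2" '
        index($0, "<name>" key "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For example, get_prop /etc/hadoop/conf/hdfs-site.xml dfs.namenode.name.dir should print file:///data/dfs/nn once the properties above are in place.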

Create the directories on each host and set ownership:

mkdir -p /data/dfs/nn /data/dfs/dn
chown -R hdfs:hdfs /data/dfs/nn /data/dfs/dn
chmod 700 /data/dfs/nn

Install HttpFS, the HTTP gateway that exposes the WebHDFS REST API, on the NameNode (cdh1):

yum install hadoop-httpfs -y

Then add proxy-user settings for the httpfs user to core-site.xml:

<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>

Install the LZO compression libraries (optional but often required):

yum install hadoop-lzo* impala-lzo -y

Then add the codecs to core-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

3. Start HDFS

Synchronize the configuration directory to the other nodes:

scp -r /etc/hadoop/conf root@cdh2:/etc/hadoop/
scp -r /etc/hadoop/conf root@cdh3:/etc/hadoop/
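Because the same copy is needed again after every configuration change, it may be worth wrapping it in a loop. This is a sketch only: sync_conf and the RUN dry-run switch are illustrative names, and the script prints the scp commands unless RUN is set to an empty value.

```shell
# Dry run by default: print the commands. Export RUN= (empty) to execute them.
RUN=${RUN-echo}

# Copy the local Hadoop configuration directory to each host given as an argument.
sync_conf() {
    for host in "$@"; do
        $RUN scp -r /etc/hadoop/conf "root@$host:/etc/hadoop/"
    done
}

sync_conf cdh2 cdh3
```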

Format the NameNode on cdh1:

sudo -u hdfs hadoop namenode -format

Start all HDFS daemons on each host:

for svc in $(ls /etc/init.d/ | grep hadoop-hdfs); do service $svc start; done
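A variant of this loop that globs the init scripts directly, instead of parsing the output of ls, might look like the sketch below. The names start_services and RUN are assumptions for the example, as is the optional second argument that points the function at a different init directory. The script prints the service commands by default and runs them only when RUN is set to an empty value.

```shell
# Dry run by default: print the commands. Export RUN= (empty) to execute them.
RUN=${RUN-echo}

# Start every init script whose name begins with the prefix in $1.
# $2 optionally overrides the init script directory.
start_services() {
    prefix=$1
    initdir=${2:-/etc/init.d}
    for script in "$initdir/$prefix"*; do
        # Skip non-executables; this also covers the case where nothing matched.
        [ -x "$script" ] || continue
        $RUN service "$(basename "$script")" start
    done
}

start_services hadoop-hdfs
```

The same function works for the YARN daemons later on by passing hadoop-yarn as the prefix.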

Create the /tmp directory in HDFS and make it world-writable with the sticky bit set:

sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Start the HttpFS service:

service hadoop-httpfs start

Verify the NameNode UI at http://192.168.199.132:50070/.
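HttpFS itself can be checked with a REST call. It listens on port 14000 by default and speaks the WebHDFS URL scheme, so a request URL can be assembled as in this sketch. The helper name httpfs_url is hypothetical, and the host cdh1 is assumed because HttpFS was installed on the NameNode.

```shell
# Build an HttpFS request URL.
# $1 = absolute HDFS path, $2 = WebHDFS operation, $3 = user to act as
httpfs_url() {
    printf '%s\n' "http://cdh1:14000/webhdfs/v1$1?op=$2&user.name=$3"
}
```

For example, curl -s "$(httpfs_url /tmp LISTSTATUS hdfs)" should return a JSON listing of /tmp if the gateway and the proxy-user settings are working.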

4. YARN and MapReduce Configuration

Install the YARN packages:

# cdh1 (ResourceManager, HistoryServer, ProxyServer)
yum install hadoop-yarn hadoop-yarn-resourcemanager hadoop-mapreduce-historyserver hadoop-yarn-proxyserver -y

# cdh2 and cdh3 (NodeManager, MapReduce)
yum install hadoop-yarn hadoop-yarn-nodemanager hadoop-mapreduce -y

Configure yarn-site.xml on cdh1 (ResourceManager addresses, web UI, auxiliary services):

<property>
  <name>yarn.resourcemanager.address</name>
  <value>cdh1:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>cdh1:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>cdh1:8088</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>

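Create the local and log directories for the NodeManager on each NodeManager host (cdh2 and cdh3). They are used only if yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in yarn-site.xml point to them: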

mkdir -p /data/yarn/{local,logs}
chown -R yarn:yarn /data/yarn

Create the remote log directory in HDFS, which holds the aggregated application logs:

sudo -u hdfs hadoop fs -mkdir -p /yarn/apps
sudo -u hdfs hadoop fs -chown yarn:mapred /yarn/apps
sudo -u hdfs hadoop fs -chmod 1777 /yarn/apps

Configure MapReduce History Server in mapred-site.xml on cdh1:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>cdh1:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>cdh1:19888</value>
</property>

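Add proxy-user entries for mapred and yarn to core-site.xml, following the httpfs pattern above: hadoop.proxyuser.mapred.hosts, hadoop.proxyuser.mapred.groups, hadoop.proxyuser.yarn.hosts, and hadoop.proxyuser.yarn.groups, each set to *.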

Create HDFS user directories required by MapReduce:

sudo -u hdfs hadoop fs -mkdir -p /user
sudo -u hdfs hadoop fs -chmod 777 /user
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history

Synchronize the updated configuration files to cdh2 and cdh3 (same scp commands as in step 3).

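Start the YARN services on every node, and the JobHistory server on cdh1:

for svc in $(ls /etc/init.d/ | grep hadoop-yarn); do service $svc start; done

# on cdh1 only (the loop above does not match this service name)
service hadoop-mapreduce-historyserver start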

5. Validation

After all daemons are running, access the following web interfaces to confirm successful deployment:

NameNode UI: http://192.168.199.132:50070/

YARN ResourceManager UI: http://192.168.199.132:8088/

MapReduce JobHistory UI: http://192.168.199.132:19888/jobhistory

Each page should display the corresponding service dashboard, indicating that HDFS, YARN, and MapReduce are operational.
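The three pages can also be probed from the command line, for instance from a monitoring script. The sketch below assumes curl is installed; ui_urls and check_uis are illustrative names, and the default address is the cdh1 IP used throughout this guide.

```shell
# Print the three web UI URLs; $1 optionally overrides the host address.
ui_urls() {
    host=${1:-192.168.199.132}
    printf '%s\n' \
        "http://$host:50070/" \
        "http://$host:8088/" \
        "http://$host:19888/jobhistory"
}

# Print the HTTP status code returned by each UI (000 means no connection).
check_uis() {
    ui_urls "$@" | while read -r url; do
        code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
        echo "$code $url"
    done
}
```

Calling check_uis prints one line per UI. A 2xx or 3xx code suggests the daemon is answering, while 000 usually means the service is down or a firewall rule is blocking the port.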

