Build a Hadoop Cluster with Docker: Step‑by‑Step Guide
Learn how to set up a multi-node Hadoop cluster on a single machine using Docker containers. The guide covers image preparation, SSH configuration, fixed-IP assignment with pipework, and building a custom Hadoop image, yielding a lightweight, cost-effective big-data environment for development and testing.
After writing an article on Hadoop cluster setup, a friend suggested using Docker for deployment. Docker simplifies creating a learning environment on a personal computer without needing multiple virtual machines.
Setting up a cluster traditionally requires several servers, which is a barrier for individuals. Using Docker, you can download a CentOS image, run multiple containers that act like lightweight virtual machines, and assign each an IP address for SSH access.
1. Install Docker.
2. Obtain a CentOS image.
3. Install SSH.
4. Configure container IP addresses.
5. Install Java and Hadoop.
6. Configure Hadoop.
The first step is straightforward: download Docker from the official site. Steps 5 and 6 are the same as on a physical server, so the guide focuses on steps 2‑4.
Get the centos7 image
$ docker pull centos

The image is about 70 MB; using a Docker registry mirror such as Alibaba Cloud's speeds up the download. List images with:
$ docker images

Install SSH
Create a Dockerfile based on the centos7 image to add SSH support:
FROM centos
MAINTAINER dys
RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
RUN yum install -y openssh-clients
RUN echo "root:111111" | chpasswd
RUN echo "root ALL=(ALL) ALL" >> /etc/sudoers
# -N '' supplies an empty passphrase so the build does not hang on a prompt
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -N ''
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

This Dockerfile installs the SSH server and client, sets the root password to 111111, generates host keys, and starts the SSH daemon in the foreground.
Build the image and name it centos7-ssh:

$ docker build -t="centos7-ssh" .

Verify the new image appears in the list:
$ docker images

Set a Fixed IP
Use pipework to assign IP addresses to containers.
$ git clone https://github.com/jpetazzo/pipework.git
$ cp pipework/pipework /usr/local/bin/

Install bridge-utils:

$ yum -y install bridge-utils

Create a bridge network:
$ brctl addbr br1
$ ip link set dev br1 up
$ ip addr add 192.168.3.1/24 dev br1

Run a container from the centos7-ssh image:

$ docker run -d --name=centos7.ssh centos7-ssh

Assign it an IP:

$ pipework br1 centos7.ssh 192.168.3.20/24

Verify connectivity with ping and ssh:
$ ping 192.168.3.20
$ ssh 192.168.3.20

Repeat to create two more containers, giving them the IPs 192.168.3.22 and 192.168.3.23. The result is three SSH-accessible containers that behave like three servers.
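The repetition can be scripted. A minimal sketch, assuming the `centos7-ssh` image and `br1` bridge created above; the container names `centos7.ssh2` and `centos7.ssh3` are hypothetical, chosen here to match the first container's naming:

```shell
# Sketch: create two more SSH containers and pin their IPs with pipework.
# Assumes the centos7-ssh image and the br1 bridge already exist.
for i in 2 3; do
    name="centos7.ssh${i}"      # hypothetical names for the extra containers
    ip="192.168.3.2${i}"        # yields 192.168.3.22 and 192.168.3.23
    docker run -d --name="${name}" centos7-ssh
    pipework br1 "${name}" "${ip}/24"
done
```

After the loop, `ping` and `ssh` against each new IP should work just as they did for the first container.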
Build a Hadoop Image
Create another Dockerfile based on centos7-ssh to add Java and Hadoop:
FROM centos7-ssh
ADD jdk-8u101-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_101 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH
ADD hadoop-2.7.3.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.3 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH
RUN yum install -y which sudo

Place the JDK and Hadoop tarballs in the same directory as the Dockerfile, then build the image and name it hadoop:

$ docker build -t="hadoop" .

Run three containers from this image, naming them hadoop0, hadoop1, and hadoop2. hadoop0 is the master, so publish ports 50070 and 8088 for the HDFS and YARN web UIs:
$ docker run --name hadoop0 --hostname hadoop0 -d -P -p 50070:50070 -p 8088:8088 hadoop
$ docker run --name hadoop1 --hostname hadoop1 -d -P hadoop
$ docker run --name hadoop2 --hostname hadoop2 -d -P hadoop

Assign fixed IPs to the Hadoop containers:
$ pipework br1 hadoop0 192.168.3.30/24
$ pipework br1 hadoop1 192.168.3.31/24
$ pipework br1 hadoop2 192.168.3.32/24

Configure the Hadoop Cluster
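Before touching any configuration files, it is worth confirming that the JDK and Hadoop resolve inside each container. A quick sanity check, sketched with the container names created above:

```shell
# Sketch: confirm Java and Hadoop are on the PATH in every container.
for c in hadoop0 hadoop1 hadoop2; do
    echo "== ${c} =="
    docker exec "${c}" java -version     # prints the JDK version (to stderr)
    docker exec "${c}" hadoop version    # prints the Hadoop version
done
```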
Open three terminal windows and attach to each container:
$ docker exec -it hadoop0 /bin/bash
$ docker exec -it hadoop1 /bin/bash
$ docker exec -it hadoop2 /bin/bash

In each container, edit /etc/hosts to add:
192.168.3.30 master
192.168.3.31 slave1
192.168.3.32 slave2

Proceed with password-less SSH setup and the Hadoop configuration files as described in the original Hadoop cluster tutorial.
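The original tutorial covers these steps in full; as a reminder, the password-less SSH setup on the master looks roughly like this. A sketch, run inside hadoop0, using the root password 111111 set in the Dockerfile:

```shell
# Sketch: generate a key on the master and push it to every node,
# including the master itself, so the Hadoop start scripts can log in
# over SSH without a password.
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for host in master slave1 slave2; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub root@"${host}"   # prompts for 111111 once per node
done
```

Once this succeeds, `ssh slave1` from the master should open a shell without asking for a password.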
With that, the Hadoop cluster is up and running inside Docker containers.