Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This guide walks through deploying Hadoop 3.x on a Kubernetes cluster with Helm: adding the chart repository, building a Docker image, configuring the Helm chart, adjusting the service definitions, installing, verifying, and cleanly uninstalling, with code snippets for each step.

Open Source Linux

Nov 11, 2022

Overview

Hadoop is an open-source distributed computing platform from the Apache Software Foundation. Its core components are HDFS for storage, MapReduce for batch processing, and YARN for resource management (introduced in Hadoop 2.x). It offers fault tolerance, scalability, and efficiency on inexpensive hardware. The current stable release line is 3.x.

Deployment Steps

1) Add Helm Repository

helm repo add apache-hadoop-helm https://pfisterer.github.io/apache-hadoop-helm/
helm pull apache-hadoop-helm/hadoop --version 1.2.0
tar -xf hadoop-1.2.0.tgz
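
Before editing the chart, it can help to confirm that the archive unpacked into the layout the later steps expect. The following is a minimal sketch rather than part of the original procedure: the function name `check_chart_dir` is hypothetical, and it only looks for the standard Helm chart entries (`Chart.yaml`, `values.yaml`, `templates/`).

```shell
# Hypothetical helper: verify that a directory looks like an unpacked Helm chart.
check_chart_dir() {
  dir="$1"
  for item in Chart.yaml values.yaml templates; do
    if [ ! -e "$dir/$item" ]; then
      echo "missing: $dir/$item"
      return 1
    fi
  done
  echo "ok: $dir"
}

# Usage (after `tar -xf hadoop-1.2.0.tgz`):
#   check_chart_dir hadoop
```

A non-zero return value signals that the pull or extraction should be repeated before moving on.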

2) Build Docker Image

FROM myharbor.com/bigdata/centos:7.9.2009

RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone

# ENV persists into later layers and the running container; RUN export does not
ENV LANG=zh_CN.UTF-8

# create user and group
RUN groupadd --system --gid=9999 admin && useradd --system --home-dir /home/admin --uid=9999 --gid=admin admin

# install sudo
RUN yum -y install sudo && chmod 640 /etc/sudoers

# give admin sudo permission
RUN echo "admin ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

RUN yum -y install net-tools telnet wget

RUN mkdir /opt/apache/
ADD jdk-8u212-linux-x64.tar.gz /opt/apache/
ENV JAVA_HOME=/opt/apache/jdk1.8.0_212
ENV PATH=$JAVA_HOME/bin:$PATH

ENV HADOOP_VERSION=3.3.2
ENV HADOOP_HOME=/opt/apache/hadoop

ENV HADOOP_COMMON_HOME=${HADOOP_HOME} \
    HADOOP_HDFS_HOME=${HADOOP_HOME} \
    HADOOP_MAPRED_HOME=${HADOOP_HOME} \
    HADOOP_YARN_HOME=${HADOOP_HOME} \
    HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop \
    PATH=${PATH}:${HADOOP_HOME}/bin

ADD hadoop-${HADOOP_VERSION}.tar.gz /opt/apache
RUN ln -s /opt/apache/hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

RUN chown -R admin:admin /opt/apache

WORKDIR ${HADOOP_HOME}

# HDFS ports (Hadoop 3.x defaults; the 500xx ports belong to Hadoop 2.x)
EXPOSE 9864 9866 9867 9868 9870 8020 9000

# MapReduce ports
EXPOSE 19888

# YARN ports
EXPOSE 8030 8031 8032 8033 8040 8042 8088

# Other ports
EXPOSE 49707 2122
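
The Dockerfile's `ADD` instructions fail at build time if the JDK or Hadoop tarball is absent from the build context. A small pre-build check, sketched below, lists whatever is missing. The function name `missing_build_files` is hypothetical; the file names are taken from the Dockerfile above.

```shell
# Hypothetical helper: print each file the Dockerfile needs that is absent
# from the build context directory. Prints nothing when all are present.
missing_build_files() {
  context="$1"
  hadoop_version="${2:-3.3.2}"
  for f in Dockerfile jdk-8u212-linux-x64.tar.gz "hadoop-${hadoop_version}.tar.gz"; do
    [ -f "$context/$f" ] || echo "$f"
  done
}

# Usage from the directory that holds the Dockerfile:
#   missing_build_files .
```

Empty output means the context is complete for the versions named above.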

3) Build and Push Image

docker build -t myharbor.com/bigdata/hadoop:3.3.2 . --no-cache
# -t: image name
# . : Dockerfile directory
# --no-cache: do not use cache

docker push myharbor.com/bigdata/hadoop:3.3.2

4) Adjust Directory Structure

mkdir hadoop/templates/hdfs hadoop/templates/yarn
mv hadoop/templates/hdfs-* hadoop/templates/hdfs/
mv hadoop/templates/yarn-* hadoop/templates/yarn/

5) Modify Configuration (values.yaml excerpt)

image:
  repository: myharbor.com/bigdata/hadoop
  tag: 3.3.2
  pullPolicy: IfNotPresent

persistence:
  nameNode:
    enabled: true
    storageClass: "hadoop-nn-local-storage"
    accessMode: ReadWriteOnce
    size: 10Gi
    local:
      - name: hadoop-nn-0
        host: "local-168-182-110"
        path: "/opt/bigdata/servers/hadoop/nn/data/data1"
  dataNode:
    enabled: true
    storageClass: "hadoop-dn-local-storage"
    accessMode: ReadWriteOnce
    size: 20Gi
    local:
      - name: hadoop-dn-0
        host: "local-168-182-110"
        path: "/opt/bigdata/servers/hadoop/dn/data/data1"
      - name: hadoop-dn-1
        host: "local-168-182-110"
        path: "/opt/bigdata/servers/hadoop/dn/data/data2"
      ...

service:
  nameNode:
    type: NodePort
    ports:
      dfs: 9000
      webhdfs: 9870
    nodePorts:
      dfs: 30900
      webhdfs: 30870
  dataNode:
    type: NodePort
    ports:
      dfs: 9000
      webhdfs: 9864
    nodePorts:
      dfs: 30901
      webhdfs: 30864
  resourceManager:
    type: NodePort
    ports:
      web: 8088
    nodePorts:
      web: 30088
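
Kubernetes rejects a NodePort service whose `nodePort` falls outside the API server's configured range, which defaults to 30000-32767. The sketch below checks the values from the excerpt against that default range. The function name `in_nodeport_range` is hypothetical, and a cluster started with a custom `--service-node-port-range` would need different bounds.

```shell
# Succeeds when the port lies in Kubernetes' default NodePort range.
in_nodeport_range() {
  port="$1"
  [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]
}

# NodePort values from the values.yaml excerpt.
for p in 30900 30870 30901 30864 30088; do
  if in_nodeport_range "$p"; then
    echo "$p ok"
  else
    echo "$p outside default range"
  fi
done
```

All five ports in the excerpt pass this check; each must also be unused by other services in the cluster.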

6) Service Definitions (YAML snippets)

# hdfs-nn service (NodePort; a headless service cannot carry nodePort fields)
apiVersion: v1
kind: Service
metadata:
  name: {{ include "hadoop.fullname" . }}-hdfs-nn
  labels:
    app.kubernetes.io/name: {{ include "hadoop.name" . }}
    helm.sh/chart: {{ include "hadoop.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/component: hdfs-nn
spec:
  ports:
    - name: dfs
      port: {{ .Values.service.nameNode.ports.dfs }}
      protocol: TCP
      nodePort: {{ .Values.service.nameNode.nodePorts.dfs }}
    - name: webhdfs
      port: {{ .Values.service.nameNode.ports.webhdfs }}
      nodePort: {{ .Values.service.nameNode.nodePorts.webhdfs }}
  type: {{ .Values.service.nameNode.type }}
  selector:
    app.kubernetes.io/name: {{ include "hadoop.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/component: hdfs-nn

# yarn‑rm service
apiVersion: v1
kind: Service
metadata:
  name: {{ include "hadoop.fullname" . }}-yarn-rm
  labels:
    app.kubernetes.io/name: {{ include "hadoop.name" . }}
    helm.sh/chart: {{ include "hadoop.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/component: yarn-rm
spec:
  ports:
    - port: {{ .Values.service.resourceManager.ports.web }}
      name: web
      nodePort: {{ .Values.service.resourceManager.nodePorts.web }}
  type: {{ .Values.service.resourceManager.type }}
  selector:
    app.kubernetes.io/name: {{ include "hadoop.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/component: yarn-rm

7) Security Context Addition

containers:
  ...
  securityContext:
    runAsUser: {{ .Values.securityContext.runAsUser }}
    privileged: {{ .Values.securityContext.privileged }}
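
The template now reads `securityContext.runAsUser` and `securityContext.privileged` from the chart values, so both keys need a value at render time. One way to check the result without touching the cluster is to render the chart locally with `helm template`. This is a sketch under assumptions: `render_chart` is a hypothetical wrapper, 9999 is the `admin` UID created in the Dockerfile, and `privileged=false` is a placeholder rather than the author's setting.

```shell
# HELM can be overridden (for example HELM=echo) to print the command
# instead of running it.
HELM="${HELM:-helm}"

render_chart() {
  # runAsUser=9999 matches the admin UID from the Dockerfile;
  # privileged=false is an assumed placeholder value.
  "$HELM" template hadoop ./hadoop -n hadoop \
    --set securityContext.runAsUser=9999 \
    --set securityContext.privileged=false
}

# Usage:
#   render_chart | grep -n -A 2 'securityContext:'
```

If the `grep` shows the two fields under each container, the template edit has taken effect.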

8) Install Chart

# create storage directories on the node named in persistence.*.local[].host
mkdir -p /opt/bigdata/servers/hadoop/{nn,dn}/data/data{1..3}

# install via Helm
helm install hadoop ./hadoop -n hadoop --create-namespace
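
The `{nn,dn}` and `{1..3}` brace expansions in the `mkdir` command are a bash feature and are not expanded by plain POSIX shells such as dash. A portable equivalent might look like the sketch below. The function name `make_hadoop_dirs` is hypothetical, and the `chown` in the usage comment assumes the pods run as the `admin` user (UID 9999) from the Dockerfile.

```shell
# Hypothetical helper: create the NameNode and DataNode data directories
# under a base path without relying on bash brace expansion.
make_hadoop_dirs() {
  base="$1"
  for role in nn dn; do
    for n in 1 2 3; do
      mkdir -p "$base/$role/data/data$n"
    done
  done
}

# Usage on the node that hosts the local volumes:
#   make_hadoop_dirs /opt/bigdata/servers/hadoop
#   chown -R 9999:9999 /opt/bigdata/servers/hadoop   # assumption: pods run as UID 9999
```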

9) Post‑Installation Notes

The commands below use /opt/apache/hadoop, the HADOOP_HOME set in the Dockerfile from step 2.

1. Check HDFS status:
   kubectl exec -n hadoop -it hadoop-hadoop-hdfs-nn-0 -- /opt/apache/hadoop/bin/hdfs dfsadmin -report

2. List YARN nodes:
   kubectl exec -n hadoop -it hadoop-hadoop-yarn-rm-0 -- /opt/apache/hadoop/bin/yarn node -list

3. Port-forward the YARN ResourceManager UI, then open http://localhost:8088:
   kubectl port-forward -n hadoop hadoop-hadoop-yarn-rm-0 8088:8088

4. Run Hadoop tests, e.g., a TestDFSIO write:
   kubectl exec -n hadoop -it hadoop-hadoop-yarn-nm-0 -- /opt/apache/hadoop/bin/hadoop jar /opt/apache/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.2-tests.jar TestDFSIO -write -nrFiles 5 -fileSize 128MB -resFile /tmp/TestDFSIOwrite.txt
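
These checks only succeed once the pods are running, which can take a few minutes after `helm install`. The sketch below blocks until every pod in the namespace reports Ready, using `kubectl wait`. The wrapper name `wait_for_hadoop` and the 300-second default timeout are assumptions.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to print the command
# instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

wait_for_hadoop() {
  ns="${1:-hadoop}"
  timeout="${2:-300s}"
  # Returns non-zero if any pod is still not Ready when the timeout expires.
  "$KUBECTL" wait --for=condition=Ready pod --all -n "$ns" --timeout="$timeout"
}

# Usage:
#   wait_for_hadoop            # namespace hadoop, 300s timeout
#   wait_for_hadoop hadoop 600s
```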

10) HDFS Verification

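kubectl exec -it hadoop-hadoop-hdfs-nn-0 -n hadoop -- bash
# inside the NameNode pod: create a sample file, upload it, and read it back
echo "hello hdfs" > test.txt
hdfs dfs -mkdir -p /tmp
hdfs dfs -put test.txt /tmp/
hdfs dfs -cat /tmp/test.txt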

11) Uninstall

helm uninstall hadoop -n hadoop
kubectl delete pod -n hadoop $(kubectl get pod -n hadoop | awk 'NR>1{print $1}') --force
kubectl patch ns hadoop -p '{"metadata":{"finalizers":null}}'
kubectl delete ns hadoop --force
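
PersistentVolumes are cluster-scoped, so deleting the namespace does not by itself guarantee they are gone, and the data directories on the node are never removed by Helm. The sketch below filters `kubectl get pv` output for leftover volumes, in the same awk style as the pod cleanup above. The function name `hadoop_pvs` is hypothetical, and the `hadoop-` name prefix is an assumption based on the `local[].name` entries in values.yaml.

```shell
# Hypothetical filter: read `kubectl get pv` output on stdin and print the
# names of volumes that start with "hadoop-" (skipping the header row).
hadoop_pvs() {
  awk 'NR>1 && $1 ~ /^hadoop-/ {print $1}'
}

# Usage:
#   kubectl get pv | hadoop_pvs
```

Any names it prints can be reviewed and removed with `kubectl delete pv <name>` once the data is no longer needed.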

Git repository: https://gitee.com/hadoop-bigdata/hadoop-on-k8s
