
Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This tutorial walks you through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository setup, Docker image creation, Helm chart customization, service configuration, installation, verification, and clean‑up, with all necessary commands and YAML snippets.

MaGe Linux Operations

Overview

Hadoop is an Apache open‑source distributed computing platform built around HDFS (Hadoop Distributed File System) and MapReduce. Hadoop 2.0 introduced YARN, a cluster resource manager and scheduler that can also run other frameworks such as Spark. Hadoop is designed to tolerate node failures and to scale out on inexpensive hardware; the current stable release line is 3.x.

Start Deployment

1) Add Helm Repository

helm repo add apache-hadoop-helm https://pfisterer.github.io/apache-hadoop-helm/
helm pull apache-hadoop-helm/hadoop --version 1.2.0
tar -xf hadoop-1.2.0.tgz
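
Before editing the chart, it can help to confirm that the unpacked directory really is version 1.2.0. The following is a minimal sketch, assuming the archive unpacks to ./hadoop with a standard Chart.yaml at its root; chart_field is a hypothetical helper, not part of Helm.

```shell
# Hypothetical helper: print a top-level scalar field from a Chart.yaml file.
# Usage: chart_field <path-to-Chart.yaml> <key>
chart_field() {
  awk -v key="$2" -F': *' '$1 == key { print $2; exit }' "$1"
}

# Example call from the directory where the chart was unpacked:
#   chart_field hadoop/Chart.yaml version
```

Running helm show chart ./hadoop prints the same metadata if you prefer a Helm‑native check.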

2) Create Dockerfile

FROM myharbor.com/bigdata/centos:7.9.2009

RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
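# ENV persists into later layers and the running container; "RUN export" only affects that single RUN step
ENV LANG=zh_CN.UTF-8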

# Create user and group for securityContext.runAsUser: 9999
RUN groupadd --system --gid=9999 admin && useradd --system --home-dir /home/admin --uid=9999 --gid=admin admin

# Install sudo and grant permissions
RUN yum -y install sudo && chmod 640 /etc/sudoers
RUN echo "admin ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

RUN yum -y install net-tools telnet wget
RUN mkdir /opt/apache/
ADD jdk-8u212-linux-x64.tar.gz /opt/apache/
ENV JAVA_HOME=/opt/apache/jdk1.8.0_212
ENV PATH=$JAVA_HOME/bin:$PATH

ENV HADOOP_VERSION=3.3.2
ENV HADOOP_HOME=/opt/apache/hadoop
ENV HADOOP_COMMON_HOME=${HADOOP_HOME} \
    HADOOP_HDFS_HOME=${HADOOP_HOME} \
    HADOOP_MAPRED_HOME=${HADOOP_HOME} \
    HADOOP_YARN_HOME=${HADOOP_HOME} \
    HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop \
    PATH=${PATH}:${HADOOP_HOME}/bin

ADD hadoop-${HADOOP_VERSION}.tar.gz /opt/apache
RUN ln -s /opt/apache/hadoop-${HADOOP_VERSION} ${HADOOP_HOME}
RUN chown -R admin:admin /opt/apache
WORKDIR $HADOOP_HOME

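# Expose ports (EXPOSE is documentation only; it does not publish anything).
# 50010/50020/50070/50075/50090 are the Hadoop 2.x defaults; Hadoop 3.x uses
# 9866/9867/9870/9864/9868 for the same DataNode, NameNode and SecondaryNameNode endpoints.
EXPOSE 50010 50020 50070 50075 50090 8020 9000
EXPOSE 9864 9866 9867 9868 9870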
EXPOSE 19888
EXPOSE 8030 8031 8032 8033 8040 8042 8088
EXPOSE 49707 2122
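
The two ADD instructions expect the JDK and Hadoop archives to sit in the build context next to the Dockerfile, and the build fails if either is missing. A minimal sketch of a pre‑build check follows; missing_build_files is a hypothetical helper, and the file names are taken from the Dockerfile above.

```shell
# Hypothetical helper: list the archives the Dockerfile ADDs that are
# missing from a build-context directory. Prints nothing if both exist.
# Usage: missing_build_files <context-dir> <hadoop-version>
missing_build_files() {
  local dir="$1" hadoop_version="$2" f
  for f in "jdk-8u212-linux-x64.tar.gz" "hadoop-${hadoop_version}.tar.gz"; do
    [ -f "${dir}/${f}" ] || echo "$f"
  done
}

# Example call from the directory that holds the Dockerfile:
#   missing_build_files . 3.3.2
```

An empty result means the context is complete; any printed name is an archive to download before running the build.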

3) Build Image

docker build -t myharbor.com/bigdata/hadoop:3.3.2 . --no-cache
# -t: image name, .: Dockerfile directory, --no-cache: do not use cache

4) Push Image

docker push myharbor.com/bigdata/hadoop:3.3.2
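
A quick smoke test of the image can catch a broken JAVA_HOME or a missing user before Kubernetes is involved. This is a sketch only, assuming Docker is available on the build host; smoke_test_image is a hypothetical helper name.

```shell
# Run two one-off containers from the image:
#  1. print the Hadoop version (exercises JAVA_HOME, HADOOP_HOME and PATH)
#  2. print the identity of UID 9999 (checks that the admin user exists)
smoke_test_image() {
  local image="${1:-myharbor.com/bigdata/hadoop:3.3.2}"
  docker run --rm "$image" hadoop version &&
  docker run --rm --user 9999 "$image" id
}

# Example call:
#   smoke_test_image
```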

5) Adjust Directory Structure

# Create one template directory per component
mkdir -p hadoop/templates/hdfs hadoop/templates/yarn
# Move the chart's HDFS and YARN templates into their directories
mv hadoop/templates/hdfs-* hadoop/templates/hdfs/
mv hadoop/templates/yarn-* hadoop/templates/yarn/

6) Modify Configuration

hadoop/values.yaml: set the image repository, tag, and pullPolicy; persistence for the NameNode and DataNode; service ports; and securityContext (runAsUser, privileged).

hadoop/templates/hdfs/hdfs-nn-pv.yaml: PersistentVolume definition for the NameNode.

hadoop/templates/hdfs/hdfs-dn-pv.yaml: PersistentVolume definition for the DataNode.

hadoop/templates/hdfs/hdfs-nn-svc.yaml: Headless Service for the NameNode.

hadoop/templates/hdfs/hdfs-dn-svc.yaml: Headless Service for the DataNode.

hadoop/templates/yarn/yarn-rm-svc.yaml: Service for the YARN ResourceManager UI.
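
The article does not reproduce the PersistentVolume files, so the following is only a sketch of what a static, node‑local NameNode volume could look like. The volume name, size, and storage class are assumptions, the path is one of the storage directories used at install time, and write_nn_pv is a hypothetical helper; the author's actual template may be parameterized from values.yaml instead.

```shell
# Hypothetical helper: write a local PersistentVolume manifest for the NameNode.
# $1 = output file, $2 = hostname of the node that holds the directory.
write_nn_pv() {
  cat > "$1" <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hadoop-nn-pv-1              # assumed name
spec:
  capacity:
    storage: 10Gi                   # assumed size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: hadoop-nn-local # assumed; must match the claim's storage class
  local:
    path: /opt/bigdata/servers/hadoop/nn/data/data1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - $2
EOF
}

# Example call (replace the hostname with a real node name):
#   write_nn_pv hadoop/templates/hdfs/hdfs-nn-pv.yaml my-node-1
```

A local volume needs the nodeAffinity block so the scheduler places the pod on the node where the directory actually exists.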

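Update the workload controllers (the chart's StatefulSet templates) to include securityContext.runAsUser and securityContext.privileged. The runAsUser value should match the UID 9999 created for the admin user in the Dockerfile.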

Adjust hadoop/templates/hadoop-configmap.yaml: replace /root with /opt/apache (the image installs Hadoop under /opt/apache and runs as the non‑root admin user), and set TMP_URL for the YARN UI.
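
After editing the templates, the chart can be rendered locally to check that the changes appear in the generated manifests before anything is installed. A minimal sketch, assuming the chart directory is ./hadoop; show_security_contexts is a hypothetical helper name.

```shell
# Render the chart locally (nothing is installed) and print each
# securityContext block together with the two lines that follow it.
show_security_contexts() {
  helm template hadoop ./hadoop --namespace hadoop | grep -A2 'securityContext:'
}

# Example call from the directory that contains ./hadoop:
#   show_security_contexts
```

The same pipeline with grep '/root' in place of the securityContext pattern is one way to spot paths that were missed in the ConfigMap.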

Installation

# Create storage directories on the node(s) that will back the PersistentVolumes
mkdir -p /opt/bigdata/servers/hadoop/{nn,dn}/data/data{1..3}

# Install chart
helm install hadoop ./hadoop -n hadoop --create-namespace
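
The pods take a while to start after helm install returns. A sketch of one way to block until they are ready, assuming the hadoop namespace used above; wait_for_hadoop is a hypothetical helper name and the 300s default timeout is arbitrary.

```shell
# Wait until every pod in the namespace reports Ready, then list them.
# $1 = namespace (default: hadoop), $2 = timeout (default: 300s)
wait_for_hadoop() {
  local ns="${1:-hadoop}" timeout="${2:-300s}"
  kubectl wait --namespace "$ns" --for=condition=Ready pod --all --timeout="$timeout" &&
  kubectl get pods --namespace "$ns" -o wide
}

# Example call:
#   wait_for_hadoop hadoop 600s
```

kubectl wait may report an error if it runs before any pod exists, so a short pause or a retry after the install can be needed.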

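Post‑Installation Notes

After a successful install, Helm prints the chart's notes, reproduced below. The commands come from the upstream chart and assume Hadoop under /opt/hadoop; if you use the image built above, Hadoop lives under /opt/apache/hadoop, so adjust the paths accordingly. Likewise, the stable/zeppelin and stable/hadoop chart references come from the upstream notes; for this deployment the scaling command should point at the local ./hadoop chart.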

NAME: hadoop
LAST DEPLOYED: Sat Sep 24 17:00:55 2022
NAMESPACE: hadoop
STATUS: deployed

# Check HDFS status
kubectl exec -n hadoop -it hadoop-hadoop-hdfs-nn-0 -- /opt/hadoop/bin/hdfs dfsadmin -report

# List YARN nodes
kubectl exec -n hadoop -it hadoop-hadoop-yarn-rm-0 -- /opt/hadoop/bin/yarn node -list

# Port‑forward YARN ResourceManager UI
kubectl port-forward -n hadoop hadoop-hadoop-yarn-rm-0 8088:8088
# Then open http://localhost:8088 in a browser

# Run Hadoop test (TestDFSIO)
kubectl exec -n hadoop -it hadoop-hadoop-yarn-nm-0 -- /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.2-tests.jar TestDFSIO -write -nrFiles 5 -fileSize 128MB -resFile /tmp/TestDFSIOwrite.txt

# List MapReduce jobs
kubectl exec -n hadoop -it hadoop-hadoop-yarn-rm-0 -- /opt/hadoop/bin/mapred job -list

# Use with Zeppelin chart
helm install --namespace hadoop --set hadoop.useConfigMap=true,hadoop.configMapName=hadoop-hadoop stable/zeppelin

# Scale Yarn NodeManagers
helm upgrade hadoop --set yarn.nodeManager.replicas=4 stable/hadoop
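
The scaling command in the notes points at stable/hadoop, a chart reference carried over from the upstream notes. For a release installed from the local ./hadoop directory, a sketch of the equivalent command could look like this; scale_node_managers is a hypothetical helper name.

```shell
# Scale the YARN NodeManagers of the existing release.
# --reuse-values keeps the values from the previous install;
# only the replica count is overridden.
# $1 = desired number of NodeManager replicas
scale_node_managers() {
  helm upgrade hadoop ./hadoop --namespace hadoop --reuse-values \
    --set "yarn.nodeManager.replicas=$1"
}

# Example call:
#   scale_node_managers 4
```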

Access Web UIs

HDFS web UI: http://192.168.182.110:30870/

YARN web UI: http://192.168.182.110:30088/
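
The address above is a node IP from the author's environment, and the ports look like NodePort values, so both will differ on another cluster. A minimal sketch for checking a UI from the command line, assuming curl is installed; ui_status is a hypothetical helper name.

```shell
# Print the HTTP status code returned by a URL (curl prints 000 if no
# response arrives within the 5-second limit).
ui_status() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# Example calls, using the addresses from this article:
#   ui_status http://192.168.182.110:30870/
#   ui_status http://192.168.182.110:30088/
```

A 2xx or 3xx code suggests the UI is reachable. Running kubectl get svc -n hadoop shows which ports your own cluster assigned.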

HDFS Test Verification

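# Open a shell in the NameNode pod
kubectl exec -it hadoop-hadoop-hdfs-nn-0 -n hadoop -- bash
# Inside the pod: create a sample file, upload it, and read it back
echo "hello hadoop" > /tmp/test.txt
hdfs dfs -mkdir -p /tmp
hdfs dfs -ls /
hdfs dfs -put /tmp/test.txt /tmp/
hdfs dfs -cat /tmp/test.txt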
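
The same check can be run without an interactive shell, which is handy in scripts. This is a sketch under two assumptions: the NameNode pod name matches the one above, and the hdfs command is on the container's PATH (as it is in the image built earlier). hdfs_roundtrip is a hypothetical helper name.

```shell
# Write a small file inside the NameNode pod, upload it to HDFS
# (overwriting any previous copy), and print it back from HDFS.
hdfs_roundtrip() {
  kubectl exec -n hadoop hadoop-hadoop-hdfs-nn-0 -- bash -c '
    echo "hello hadoop" > /tmp/roundtrip.txt &&
    hdfs dfs -mkdir -p /tmp &&
    hdfs dfs -put -f /tmp/roundtrip.txt /tmp/ &&
    hdfs dfs -cat /tmp/roundtrip.txt
  '
}

# Example call:
#   hdfs_roundtrip
```

On a healthy cluster the final command should echo the line that was written; an error from -put usually means no DataNode is available yet.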

Uninstall

helm uninstall hadoop -n hadoop
kubectl delete pod -n hadoop $(kubectl get pod -n hadoop | awk 'NR>1{print $1}') --force
kubectl patch ns hadoop -p '{"metadata":{"finalizers":null}}'
kubectl delete ns hadoop --force

The Helm chart source code is available at https://gitee.com/hadoop-bigdata/hadoop-on-k8s . This single‑node deployment is intended for testing; a future article will cover high‑availability Hadoop on Kubernetes.
