Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide
This guide walks through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository addition, Docker image creation, Helm chart configuration, service adjustments, installation, verification commands, and clean uninstallation, complete with code snippets and screenshots.
Overview
Hadoop is an Apache open‑source distributed computing platform that provides HDFS and MapReduce (YARN in Hadoop 2.0) as core components, offering high fault‑tolerance, scalability and efficiency on inexpensive hardware. The latest stable version is 3.x.
Deployment Steps
1) Add Helm Repository
helm repo add apache-hadoop-helm https://pfisterer.github.io/apache-hadoop-helm/
helm pull apache-hadoop-helm/hadoop --version 1.2.0
tar -xf hadoop-1.2.0.tgz2) Build Docker Image
FROM myharbor.com/bigdata/centos:7.9.2009
RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
RUN export LANG=zh_CN.UTF-8
# create user and group
RUN groupadd --system --gid=9999 admin && useradd --system --home-dir /home/admin --uid=9999 --gid=admin admin
# install sudo
RUN yum -y install sudo ; chmod 640 /etc/sudoers
# give admin sudo permission
RUN echo "admin ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
RUN yum -y install net-tools telnet wget
RUN mkdir /opt/apache/
ADD jdk-8u212-linux-x64.tar.gz /opt/apache/
ENV JAVA_HOME=/opt/apache/jdk1.8.0_212
ENV PATH=$JAVA_HOME/bin:$PATH
ENV HADOOP_VERSION 3.3.2
ENV HADOOP_HOME=/opt/apache/hadoop
ENV HADOOP_COMMON_HOME=${HADOOP_HOME} \
HADOOP_HDFS_HOME=${HADOOP_HOME} \
HADOOP_MAPRED_HOME=${HADOOP_HOME} \
HADOOP_YARN_HOME=${HADOOP_HOME} \
HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop \
PATH=${PATH}:${HADOOP_HOME}/bin
ADD hadoop-${HADOOP_VERSION}.tar.gz /opt/apache
RUN ln -s /opt/apache/hadoop-${HADOOP_VERSION} ${HADOOP_HOME}
RUN chown -R admin:admin /opt/apache
WORKDIR ${HADOOP_HOME}
# HDFS ports
EXPOSE 50010 50020 50070 50075 50090 8020 9000
# MapReduce ports
EXPOSE 19888
# YARN ports
EXPOSE 8030 8031 8032 8033 8040 8042 8088
# Other ports
EXPOSE 49707 21223) Build and Push Image
docker build -t myharbor.com/bigdata/hadoop:3.3.2 . --no-cache
# -t: image name
# . : Dockerfile directory
# --no-cache: do not use cache docker push myharbor.com/bigdata/hadoop:3.3.24) Adjust Directory Structure
mkdir hadoop/templates/hdfs hadoop/templates/yarn
mv hadoop/templates/hdfs-* hadoop/templates/hdfs/
mv hadoop/templates/yarn-* hadoop/templates/yarn/5) Modify Configuration (values.yaml excerpt)
image:
repository: myharbor.com/bigdata/hadoop
tag: 3.3.2
pullPolicy: IfNotPresent
persistence:
nameNode:
enabled: true
storageClass: "hadoop-nn-local-storage"
accessMode: ReadWriteOnce
size: 10Gi
local:
- name: hadoop-nn-0
host: "local-168-182-110"
path: "/opt/bigdata/servers/hadoop/nn/data/data1"
dataNode:
enabled: true
storageClass: "hadoop-dn-local-storage"
accessMode: ReadWriteOnce
size: 20Gi
local:
- name: hadoop-dn-0
host: "local-168-182-110"
path: "/opt/bigdata/servers/hadoop/dn/data/data1"
- name: hadoop-dn-1
host: "local-168-182-110"
path: "/opt/bigdata/servers/hadoop/dn/data/data2"
...
service:
nameNode:
type: NodePort
ports:
dfs: 9000
webhdfs: 9870
nodePorts:
dfs: 30900
webhdfs: 30870
dataNode:
type: NodePort
ports:
dfs: 9000
webhdfs: 9864
nodePorts:
dfs: 30901
webhdfs: 30864
resourceManager:
type: NodePort
ports:
web: 8088
nodePorts:
web: 300886) Service Definitions (YAML snippets)
# hdfs‑nn headless service
apiVersion: v1
kind: Service
metadata:
name: {{ include "hadoop.fullname" . }}-hdfs-nn
labels:
app.kubernetes.io/name: {{ include "hadoop.name" . }}
helm.sh/chart: {{ include "hadoop.chart" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/component: hdfs-nn
spec:
ports:
- name: dfs
port: {{ .Values.service.nameNode.ports.dfs }}
protocol: TCP
nodePort: {{ .Values.service.nameNode.nodePorts.dfs }}
- name: webhdfs
port: {{ .Values.service.nameNode.ports.webhdfs }}
nodePort: {{ .Values.service.nameNode.nodePorts.webhdfs }}
type: {{ .Values.service.nameNode.type }}
selector:
app.kubernetes.io/name: {{ include "hadoop.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/component: hdfs-nn # yarn‑rm service
apiVersion: v1
kind: Service
metadata:
name: {{ include "hadoop.fullname" . }}-yarn-rm
labels:
app.kubernetes.io/name: {{ include "hadoop.name" . }}
helm.sh/chart: {{ include "hadoop.chart" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/component: yarn-rm
spec:
ports:
- port: {{ .Values.service.resourceManager.ports.web }}
name: web
nodePort: {{ .Values.service.resourceManager.nodePorts.web }}
type: {{ .Values.service.resourceManager.type }}
selector:
app.kubernetes.io/name: {{ include "hadoop.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/component: yarn-rm7) Security Context Addition
containers:
...
securityContext:
runAsUser: {{ .Values.securityContext.runAsUser }}
privileged: {{ .Values.securityContext.privileged }}8) Install Chart
# create storage directories
mkdir -p /opt/bigdata/servers/hadoop/{nn,dn}/data/data{1..3}
# install via Helm
helm install hadoop ./hadoop -n hadoop --create-namespace9) Post‑Installation Notes
1. Check HDFS status: kubectl exec -n hadoop -it hadoop-hadoop-hdfs-nn-0 -- /opt/hadoop/bin/hdfs dfsadmin -report 2. List YARN nodes: kubectl exec -n hadoop -it hadoop-hadoop-yarn-rm-0 -- /opt/hadoop/bin/yarn node -list 3. Port‑forward YARN ResourceManager UI: kubectl port-forward -n hadoop hadoop-hadoop-yarn-rm-0 8088:8088 then open http://localhost:8088 4. Run Hadoop tests, e.g., TestDFSIO write: kubectl exec -n hadoop -it hadoop-hadoop-yarn-nm-0 -- /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.2-tests.jar TestDFSIO -write -nrFiles 5 -fileSize 128MB -resFile /tmp/TestDFSIOwrite.txt
10) HDFS Verification
kubectl exec -it hadoop-hadoop-hdfs-nn-0 -n hadoop -- bash
hdfs dfs -mkdir /tmp
hdfs dfs -put test.txt /tmp/
hdfs dfs -cat /tmp/test.txt11) Uninstall
helm uninstall hadoop -n hadoop
kubectl delete pod -n hadoop $(kubectl get pod -n hadoop | awk 'NR>1{print $1}') --force
kubectl patch ns hadoop -p '{"metadata":{"finalizers":null}}'
kubectl delete ns hadoop --forceGit repository: https://gitee.com/hadoop-bigdata/hadoop-on-k8s
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
