Deploy a Complete Big Data Cluster: Hadoop, Spark, Hive, Zookeeper & Kafka
This guide walks you through installing, configuring, and tuning a comprehensive big data environment—including Hadoop, Zookeeper, Spark, Hive, DolphinScheduler, Doris, and Kafka—covering cluster planning, component version selection, environment variables, scripts for deployment, and performance optimizations for production use.
Big Data Framework Overview
This document provides a step‑by‑step guide for building a production‑grade big data platform covering Hadoop, Zookeeper, Spark, Hive, DolphinScheduler, Doris, and Kafka. It includes hardware planning, component selection, installation, configuration, scripting for deployment, and tuning recommendations.
1. Hadoop Deployment
1.1 Install Hadoop
tar -zxvf hadoop-3.1.3.tar.gz -C /home/module/
1.2 Configure core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/module/hadoop-3.1.3/data/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-07.test.com:2181,hadoop-08.test.com:2181,hadoop-09.test.com:2181</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoop</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.users</name>
<value>*</value>
</property>
</configuration>
1.3 Configure hdfs-site.xml
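The dfs.namenode.shared.edits.dir value in this file is a qjournal:// URI that lists every JournalNode. The sketch below assembles that URI and shows how many JournalNode failures a given count can survive. It assumes port 8485, as used in this guide; qjournal_uri and jn_failures_tolerated are hypothetical helper names, not Hadoop commands.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of Hadoop.

# Build the qjournal URI for dfs.namenode.shared.edits.dir.
qjournal_uri() {
    # $1 = nameservice ID; remaining arguments = JournalNode hosts
    ns=$1; shift
    list=""
    for h in "$@"; do
        # Join host:port entries with ';' (no leading separator on the first one).
        list="${list:+$list;}$h:8485"
    done
    echo "qjournal://$list/$ns"
}

# JournalNode failures a quorum of N can survive: (N - 1) / 2, rounded down.
jn_failures_tolerated() {
    echo $(( ($1 - 1) / 2 ))
}

qjournal_uri mycluster hadoop-01.test.com hadoop-02.test.com
echo "failures tolerated with 2 JournalNodes: $(jn_failures_tolerated 2)"
echo "failures tolerated with 3 JournalNodes: $(jn_failures_tolerated 3)"
```

With the two JournalNodes configured here, edits can be written only while both are up. The HDFS HA documentation recommends an odd number of at least three.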
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>128m</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>36</value>
</property>
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>65535</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/module/hadoop-3.1.3/data/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/module/hadoop-3.1.3/data/namenode</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop-01.test.com:9020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop-02.test.com:9020</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop-01.test.com:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop-02.test.com:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-01.test.com:8485;hadoop-02.test.com:8485/mycluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/module/hadoop-3.1.3/data/journaldata</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<property>
<name>dfs.ha.nn.not-become-active-in-safemode</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
</configuration>
1.4 Configure yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/home/module/hadoop-3.1.3/etc/hadoop:/home/module/hadoop-3.1.3/share/hadoop/common/lib/*:/home/module/hadoop-3.1.3/share/hadoop/common/*:/home/module/hadoop-3.1.3/share/hadoop/hdfs:/home/module/hadoop-3.1.3/share/hadoop/hdfs/lib/*:/home/module/hadoop-3.1.3/share/hadoop/hdfs/*:/home/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/*:/home/module/hadoop-3.1.3/share/hadoop/mapreduce/*:/home/module/hadoop-3.1.3/share/hadoop/yarn:/home/module/hadoop-3.1.3/share/hadoop/yarn/lib/*:/home/module/hadoop-3.1.3/share/hadoop/yarn/*</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarncluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop-03.test.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop-04.test.com</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>hadoop-03.test.com:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop-04.test.com:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>81920</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
</configuration>
1.5 Configure mapred-site.xml
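The container sizes and JVM heaps set in this file have to fit inside the YARN limits from yarn-site.xml (section 1.4). As a rough sanity check, the sketch below computes each heap as a share of its container and confirms that the container request falls between the scheduler minimum and maximum. The helper names are hypothetical. Keeping the heap at roughly 70 to 80 percent of the container is a common rule of thumb, not something this guide prescribes.

```shell
#!/bin/sh
# Hypothetical helpers; the numbers come from yarn-site.xml and mapred-site.xml in this guide.

# Print the JVM heap as an integer percentage of the container size (rounded down).
heap_ratio_pct() {
    # $1 = heap in MB (-Xmx), $2 = container size in MB
    echo $(( $1 * 100 / $2 ))
}

# Succeed if a container request lies between the scheduler minimum and maximum.
fits_scheduler() {
    # $1 = requested MB, $2 = minimum-allocation-mb, $3 = maximum-allocation-mb
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

echo "map heap ratio:    $(heap_ratio_pct 1433 2048)%"
echo "reduce heap ratio: $(heap_ratio_pct 2866 4096)%"
if fits_scheduler 2048 1024 8192; then echo "map container fits"; fi
if fits_scheduler 4096 1024 8192; then echo "reduce container fits"; fi
```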
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-01.test.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop-01.test.com:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/job/history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/job/history/done_intermediate</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1433m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2866m</value>
</property>
</configuration>
1.6 Workers File
hadoop-04.test.com
hadoop-05.test.com
hadoop-06.test.com
hadoop-07.test.com
hadoop-08.test.com
hadoop-09.test.com
1.7 Cluster Startup Script (hadoop-start.sh)
#!/bin/bash
# Start or stop HDFS, YARN, and the MapReduce JobHistory server over SSH.
if [ $# -lt 1 ]; then
    echo "Usage: $0 {start|stop}"
    exit 1
fi
case $1 in
start)
    echo "=================== Start Hadoop Cluster ==================="
    echo "--- Start HDFS ---"
    ssh hadoop-01.test.com "/home/module/hadoop-3.1.3/sbin/start-dfs.sh"
    echo "--- Start YARN ---"
    ssh hadoop-03.test.com "/home/module/hadoop-3.1.3/sbin/start-yarn.sh"
    echo "--- Start HistoryServer ---"
    ssh hadoop-01.test.com "/home/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
    ;;
stop)
    # Stop in reverse order of startup.
    echo "=================== Stop Hadoop Cluster ==================="
    echo "--- Stop HistoryServer ---"
    ssh hadoop-01.test.com "/home/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    echo "--- Stop YARN ---"
    ssh hadoop-03.test.com "/home/module/hadoop-3.1.3/sbin/stop-yarn.sh"
    echo "--- Stop HDFS ---"
    ssh hadoop-01.test.com "/home/module/hadoop-3.1.3/sbin/stop-dfs.sh"
    ;;
*)
    echo "Invalid argument: $1 (expected start or stop)"
    exit 1
    ;;
esac
2. Zookeeper Deployment
2.1 Install Zookeeper
tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /home/module/
mv /home/module/apache-zookeeper-3.5.7-bin /home/module/zookeeper-3.5.7
2.2 Configure myid and zoo.cfg
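Each node's myid file must contain the N from that node's own server.N line in zoo.cfg. The sketch below looks the value up from the server list. It assumes the host name passed in matches the zoo.cfg entry exactly; myid_for_host is a hypothetical helper, not part of ZooKeeper.

```shell
#!/bin/sh
# Hypothetical helper: look up the myid for a host from zoo.cfg-style server.N lines.
myid_for_host() {
    # $1 = host name; "server.N=host:peerPort:electionPort" lines are read from stdin
    sed -n "s/^server\.\([0-9][0-9]*\)=$1:.*/\1/p"
}

servers='server.2=hadoop-07.test.com:2888:3888
server.3=hadoop-08.test.com:2888:3888
server.4=hadoop-09.test.com:2888:3888'

echo "$servers" | myid_for_host hadoop-08.test.com
```

On each node, the printed value would be written to zkData/myid.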
# /home/module/zookeeper-3.5.7/zkData/myid (one per node)
# Example content for hadoop-07.test.com: 2
# zoo.cfg (common for all nodes)
clientPort=2181
dataDir=/home/module/zookeeper-3.5.7/zkData
server.2=hadoop-07.test.com:2888:3888
server.3=hadoop-08.test.com:2888:3888
server.4=hadoop-09.test.com:2888:3888
2.3 Start / Stop Script (zk.sh)
#!/bin/bash
# Start, stop, or query Zookeeper on every ensemble node over SSH.
ZK_HOSTS="hadoop-07.test.com hadoop-08.test.com hadoop-09.test.com"
case $1 in
start)
    for host in $ZK_HOSTS; do
        echo "Starting Zookeeper on $host"
        ssh "$host" "/home/module/zookeeper-3.5.7/bin/zkServer.sh start"
    done
    ;;
stop)
    for host in $ZK_HOSTS; do
        echo "Stopping Zookeeper on $host"
        ssh "$host" "/home/module/zookeeper-3.5.7/bin/zkServer.sh stop"
    done
    ;;
status)
    for host in $ZK_HOSTS; do
        echo "Status of Zookeeper on $host"
        ssh "$host" "/home/module/zookeeper-3.5.7/bin/zkServer.sh status"
    done
    ;;
*)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac
3. Spark Deployment
3.1 Install Spark
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /home/module/
mv /home/module/spark-3.0.0-bin-hadoop3.2 /home/module/spark-3.0.0
3.2 Set SPARK_HOME
# In /etc/profile.d/custom_env.sh
export SPARK_HOME=/home/module/spark-3.0.0
export PATH=$PATH:$SPARK_HOME/bin
3.3 Spark Defaults (spark-defaults.conf)
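The executor and driver memory settings in this file turn into YARN container requests that are larger than the heap alone. The sketch below gives a rough estimate under three assumptions:

- Spark's default executor overhead is max(384 MB, 10% of executor memory).
- The scheduler rounds requests up to a multiple of yarn.scheduler.minimum-allocation-mb (1024 MB in this guide).
- The job runs in cluster deploy mode, so the driver runs in a YARN container.

The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for estimating YARN container sizes for Spark.

# Default executor overhead in MB: max(384, 10% of executor memory).
default_overhead_mb() {
    ovh=$(( $1 / 10 ))
    if [ "$ovh" -lt 384 ]; then
        ovh=384
    fi
    echo "$ovh"
}

# Round a request up to the next multiple of the scheduler minimum.
round_up_mb() {
    # $1 = requested MB, $2 = yarn.scheduler.minimum-allocation-mb
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

exec_mem=2048                                            # spark.executor.memory 2g
exec_req=$(( exec_mem + $(default_overhead_mb "$exec_mem") ))
driver_req=$(( 4096 + 2048 ))                            # driver.memory 4g + driver.memoryOverhead 2g

echo "executor container: $(round_up_mb "$exec_req" 1024) MB"
echo "driver container:   $(round_up_mb "$driver_req" 1024) MB"
```

Both estimates stay under the 8192 MB yarn.scheduler.maximum-allocation-mb configured in section 1.4.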
spark.eventLog.enabled true
spark.eventLog.dir hdfs://mycluster/spark-logs
spark.history.fs.logDirectory hdfs://mycluster/spark-logs
spark.yarn.historyServer.address hadoop-02.test.com:18080
spark.history.ui.port 18080
spark.master yarn
spark.executor.instances 4
spark.executor.memory 2g
spark.driver.memory 4g
spark.driver.memoryOverhead 2g
spark.default.parallelism 10
spark.sql.shuffle.partitions 50
spark.serializer org.apache.spark.serializer.KryoSerializer
4. Hive Deployment
4.1 Install Hive
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /home/module/
mv /home/module/apache-hive-3.1.2-bin /home/module/hive-3.1.2
4.2 Set HIVE_HOME
# In /etc/profile.d/custom_env.sh
export HIVE_HOME=/home/module/hive-3.1.2
export PATH=$PATH:$HIVE_HOME/bin
4.3 Metastore Configuration (hive-site.xml)
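The JDBC URL in this file joins several query parameters with &, and inside an XML value each & has to be written as &amp; or the file will not parse. The sketch below escapes a URL before it is pasted into hive-site.xml; xml_escape is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: escape characters that are special inside an XML element value.
xml_escape() {
    # Escape & first so the entities produced for < and > are not escaped a second time.
    printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

url='jdbc:mysql://hadoop-03.test.com:3306/hive_metastore?useSSL=false&createDatabaseIfNotExist=true'
xml_escape "$url"
```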
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop-03.test.com:3306/hive_metastore?useSSL=false&amp;createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>test</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop-02.test.com</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop-02.test.com:9083</value>
</property>
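</configuration>
4.4 Enable Vectorized Engine
-- enable_vectorized_engine and batch_size are Doris session variables, not Hive settings;
-- run these statements from a MySQL client connected to the Doris FE.
set enable_vectorized_engine = true;
set global batch_size = 4096;
5. DolphinScheduler Deployment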
Copy the Hadoop configuration files (core‑site.xml and hdfs‑site.xml) into each DolphinScheduler component’s conf directory (api‑server, worker‑server, master‑server, alert‑server) so that the scheduler can access HDFS.
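The copy step can be scripted as below. This is a minimal sketch: copy_hadoop_conf is a hypothetical helper, and it assumes each component keeps its configuration under <install dir>/<component>/conf. The DolphinScheduler install path in the example call is a placeholder, since this guide does not specify one.

```shell
#!/bin/sh
# Hypothetical helper: copy the HDFS client configs into each DolphinScheduler component.
# $1 = Hadoop conf dir, $2 = DolphinScheduler install dir (the layout is an assumption).
copy_hadoop_conf() {
    for comp in api-server worker-server master-server alert-server; do
        dest="$2/$comp/conf"
        if [ ! -d "$dest" ]; then
            # Report and skip components that are not installed on this node.
            echo "skip: $dest not found" >&2
            continue
        fi
        cp "$1/core-site.xml" "$1/hdfs-site.xml" "$dest/" && echo "copied to $dest"
    done
}

# Example call; the second path is a placeholder, not taken from this guide.
# copy_hadoop_conf /home/module/hadoop-3.1.3/etc/hadoop /home/module/dolphinscheduler
```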
6. Doris Deployment
6.1 Directory Layout
/home/module/doris-1.1.1 – root directory
/home/module/doris-1.1.1/fe – Frontend nodes
/home/module/doris-1.1.1/be – Backend nodes
/home/module/doris-1.1.1/apache_hdfs_broker – Broker nodes
/home/module/doris-1.1.1/doris-meta – FE metadata storage
/home/module/doris-1.1.1/doris-storage – BE data storage
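Before starting any Doris process, it can help to confirm that the layout above exists on the node. The sketch below is a pre-flight check that prints whichever of these directories are missing; missing_doris_dirs is a hypothetical helper, not a Doris tool.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print the layout directories missing under a Doris root.
missing_doris_dirs() {
    # $1 = Doris root directory
    for d in fe be apache_hdfs_broker doris-meta doris-storage; do
        [ -d "$1/$d" ] || echo "$1/$d"
    done
}

missing_doris_dirs /home/module/doris-1.1.1
```

The metadata and storage directories in particular are expected to exist before the FE and BE are started for the first time, so any path printed here is worth creating with mkdir -p.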
6.2 FE Configuration (fe/conf/fe.conf)
meta_dir = /home/module/doris-1.1.1/doris-meta
http_port = 38030
rpc_port = 39020
query_port = 39030
edit_log_port = 39010
mysql_service_nio_enabled = true
priority_networks = 192.168.9.8/24
6.3 Start FE
bin/start_fe.sh --daemon
6.4 Add Followers
ALTER SYSTEM ADD FOLLOWER "hadoop-02:39010";
ALTER SYSTEM ADD FOLLOWER "hadoop-03:39010";
6.5 BE Configuration (be/conf/be.conf)
be_port = 39060
webserver_port = 38040
heartbeat_service_port = 39050
brpc_port = 38060
priority_networks = 192.168.9.8/24
storage_root_path = /home/module/doris-1.1.1/doris-storage
6.6 Start BE
bin/start_be.sh --daemon
6.7 Add Backends
ALTER SYSTEM ADD BACKEND "hadoop-04:39050";
ALTER SYSTEM ADD BACKEND "hadoop-05:39050";
ALTER SYSTEM ADD BACKEND "hadoop-06:39050";
ALTER SYSTEM ADD BACKEND "hadoop-07:39050";
ALTER SYSTEM ADD BACKEND "hadoop-08:39050";
ALTER SYSTEM ADD BACKEND "hadoop-09:39050";
6.8 Broker Configuration (apache_hdfs_broker/conf/apache_hdfs_broker.conf)
broker_ipc_port = 38000
6.9 Start Broker
bin/start_broker.sh --daemon
6.10 Add Brokers
ALTER SYSTEM ADD BROKER broker_name "hadoop-01:38000";
ALTER SYSTEM ADD BROKER broker_name "hadoop-02:38000";
... (repeat for all nodes) ...
7. Kafka Deployment
7.1 Install Kafka
tar -zxvf kafka_2.12-3.0.0.tgz -C /home/module/
mv /home/module/kafka_2.12-3.0.0 /home/module/kafka
7.2 Configure server.properties
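The server.properties in this section is shown for a single broker; broker.id has to be different on every node. One way to keep the IDs unique is to derive each one from the host's position in the broker list. The sketch below does that, assuming the same three broker hosts as the control script in section 7.5; broker_id_for is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: derive a unique broker.id from a host's position in the broker list.
broker_id_for() {
    # $1 = host name; remaining arguments = ordered broker list
    host=$1; shift
    id=0
    for h in "$@"; do
        if [ "$h" = "$host" ]; then
            echo "$id"
            return 0
        fi
        id=$(( id + 1 ))
    done
    echo "unknown broker host: $host" >&2
    return 1
}

brokers="hadoop-07.test.com hadoop-08.test.com hadoop-09.test.com"
for b in $brokers; do
    # $brokers is left unquoted on purpose so it splits into separate host arguments.
    echo "$b -> broker.id=$(broker_id_for "$b" $brokers)"
done
```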
broker.id=0
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/home/module/kafka/datas
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=hadoop-07.test.com:2181,hadoop-08.test.com:2181,hadoop-09.test.com:2181/kafka
7.3 Set KAFKA_HOME
# In /etc/profile.d/custom_env.sh
export KAFKA_HOME=/home/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
7.4 Start Zookeeper (already covered) and Kafka
# On each broker node
bin/kafka-server-start.sh -daemon config/server.properties
7.5 Kafka Control Script (kf.sh)
#!/bin/bash
# Start or stop the Kafka broker on every broker node over SSH.
KAFKA_HOSTS="hadoop-07.test.com hadoop-08.test.com hadoop-09.test.com"
case $1 in
start)
    for i in $KAFKA_HOSTS; do
        echo "Starting Kafka on $i"
        ssh "$i" "/home/module/kafka/bin/kafka-server-start.sh -daemon /home/module/kafka/config/server.properties"
    done
    ;;
stop)
    for i in $KAFKA_HOSTS; do
        echo "Stopping Kafka on $i"
        ssh "$i" "/home/module/kafka/bin/kafka-server-stop.sh"
    done
    ;;
*)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac
8. Tuning and Best Practices
Enable the vectorized execution engine in Doris for better query performance (set enable_vectorized_engine = true and increase batch_size). These are Doris variables, not Hive settings.
Adjust YARN capacity‑scheduler parameter yarn.scheduler.capacity.maximum-am-resource-percent from the default 0.1 to a higher value (e.g., 0.3) when cluster resources are limited.
Configure Spark event logging to HDFS and enable the history server for job inspection.
Set the time zone in Doris (the statement is Doris syntax, run against the FE, not Hive): SET global time_zone = 'Asia/Shanghai';
For Kafka ingestion into Doris with Routine Load, tune job properties such as max_batch_size and max_batch_interval to improve ingestion throughput.
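The effect of raising yarn.scheduler.capacity.maximum-am-resource-percent can be estimated with simple arithmetic. The sketch below assumes a single queue spanning the whole cluster and uses the 6 worker nodes and 81920 MB per NodeManager configured earlier in this guide; am_budget_mb is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: memory (MB) that ApplicationMasters may use across the cluster.
am_budget_mb() {
    # $1 = NodeManager count, $2 = yarn.nodemanager.resource.memory-mb,
    # $3 = maximum-am-resource-percent expressed as an integer percentage
    echo $(( $1 * $2 * $3 / 100 ))
}

echo "AM budget at 0.1: $(am_budget_mb 6 81920 10) MB"
echo "AM budget at 0.3: $(am_budget_mb 6 81920 30) MB"
```

With an assumed 2048 MB per ApplicationMaster, that is the difference between roughly 24 and 72 concurrently running applications.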
9. Verification Commands
# Verify Zookeeper status
zk.sh status
# Verify Hadoop daemons
jps
# Verify YARN UI at http://hadoop-03.test.com:8088 (rm1; rm2 is hadoop-04.test.com)
# Verify Spark history server at http://hadoop-02.test.com:18080
# Verify Hive connection
hive -e "show databases;"
# Verify Doris FE and BE
# Connect on the FE query_port (39030), not the http_port (38030)
mysql -h hadoop-01.test.com -P 39030 -e "show frontends;"
mysql -h hadoop-01.test.com -P 39030 -e "show backends;"
# Verify Kafka topics
kafka-topics.sh --list --bootstrap-server hadoop-07.test.com:9092
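jps only lists what is running, so a missing daemon is easier to spot by comparing its output against an expected list. The sketch below does that comparison. The expected names for hadoop-01 (NameNode, JournalNode, DFSZKFailoverController, JobHistoryServer) are an assumption inferred from the configuration in this guide. missing_daemons is a hypothetical helper, and the sample process list is illustrative, not captured output.

```shell
#!/bin/sh
# Hypothetical helper: report expected daemons that are absent from `jps` output.
missing_daemons() {
    # Arguments = expected process names; `jps` output ("<pid> <name>") is read from stdin.
    running=$(awk '{print $2}')
    for name in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        echo "$running" | grep -qxF "$name" || echo "$name"
    done
}

# Sample input for illustration only; on a real node, pipe `jps` into the function.
sample='2101 NameNode
2390 JournalNode
2466 JobHistoryServer
2577 Jps'
echo "$sample" | missing_daemons NameNode JournalNode DFSZKFailoverController JobHistoryServer
```

On a node, the call would look like jps | missing_daemons NameNode JournalNode, and empty output means every expected daemon is running.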
Tuanzi Tech Team
