Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ZCY Technology Team (政采云技术)

Aug 23, 2023

This article provides a comprehensive guide to deploying a Hadoop-based big data cluster on ARM architecture, addressing the growing demand for open-source, power-efficient computing solutions. It begins by comparing the x86 and ARM architectures, highlighting ARM's advantages in power consumption and open-source flexibility, then outlines three primary deployment strategies before focusing on open-source component integration.

The tutorial starts with the cluster prerequisites: passwordless SSH access and NTP time synchronization across a minimum of three nodes. It then details the step-by-step installation of Java 8, followed by Zookeeper for distributed coordination. Configuring Zookeeper means creating zoo.cfg from the sample file, setting the data directory and the server list, and generating a unique myid file on each node:

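cd /opt/zookeeper/conf
cp zoo_sample.cfg zoo.cfg
vim zoo.cfg

# In zoo.cfg: set the data directory and give every node its own server id.
# The hostnames node2 and node3 are assumed here; substitute your own.
dataDir=/opt/zookeeper/data
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888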
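
The myid step mentioned above is not shown in the commands. As a minimal sketch, assuming zoo.cfg holds one server.N=host:2888:3888 line per node and that the data directory is /opt/zookeeper/data, the helper below looks up the id assigned to a host and writes it to the myid file. The function names zk_id_for_host and write_myid are hypothetical and not part of Zookeeper.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: derive a node's Zookeeper id from zoo.cfg and write myid.

# Print the id N from the line "server.N=<host>:..." in the given zoo.cfg.
# Note: the hostname is used as a regex, so a dot matches any character.
zk_id_for_host() {
  local cfg="$1" host="$2"
  sed -n "s/^server\.\([0-9][0-9]*\)=${host}:.*/\1/p" "$cfg" | head -n 1
}

# Write that id into <data_dir>/myid; fail if the host is not listed.
write_myid() {
  local cfg="$1" host="$2" data_dir="$3"
  local id
  id="$(zk_id_for_host "$cfg" "$host")"
  if [ -z "$id" ]; then
    echo "no server.N entry for $host in $cfg" >&2
    return 1
  fi
  mkdir -p "$data_dir"
  echo "$id" > "$data_dir/myid"
}

# Example, run once on each node (paths assumed from the configuration above):
# write_myid /opt/zookeeper/conf/zoo.cfg "$(hostname)" /opt/zookeeper/data
```

Deriving the id from zoo.cfg rather than typing it by hand keeps the myid file and the server list from drifting apart, which is a common cause of a quorum that fails to form.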

Next, the core Hadoop ecosystem is deployed. The guide walks through extracting Hadoop packages, configuring environment variables, and modifying critical configuration files. The core-site.xml configuration defines global cluster parameters:

<configuration>
  <property>
    <!-- namenode address -->
    <name>fs.defaultFS</name>
    <value>hdfs://node1:8020</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>simple</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>false</value>
  </property>
  <property>
    <name>hadoop.rpc.protection</name>
    <value>authentication</value>
  </property>
  <property>
    <name>hadoop.security.auth_to_local</name>
    <value>DEFAULT</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.flume.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.flume.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.HTTP.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.HTTP.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hdfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.yarn.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.yarn.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.security.group.mapping</name>
    <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
  </property>
  <property>
    <name>hadoop.security.instrumentation.requires.admin</name>
    <value>false</value>
  </property>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>
  <property>
    <name>hadoop.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hadoop.ssl.require.client.cert</name>
    <value>false</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.ssl.keystores.factory.class</name>
    <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.ssl.server.conf</name>
    <value>ssl-server.xml</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.ssl.client.conf</name>
    <value>ssl-client.xml</value>
    <final>true</final>
  </property>
</configuration>
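
With this many properties in one file, it can help to read a value back before distributing the configuration. The sketch below is one way to do that with awk, assuming each <name> and <value> element sits on a single line as in the listing above. The function name get_site_prop is hypothetical. Once Hadoop is on the PATH, `hdfs getconf -confKey fs.defaultFS` reports the value Hadoop actually resolves and is the more reliable check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop *-site.xml file. This is a line-based scan, not an XML parser.
get_site_prop() {
  local file="$1" name="$2"
  awk -v want="$name" '
    /<name>/ {
      n = $0
      sub(/.*<name>/, "", n)
      sub(/<\/name>.*/, "", n)
      found = (n == want)
    }
    /<value>/ && found {
      v = $0
      sub(/.*<value>/, "", v)
      sub(/<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$file"
}

# Example (path assumes Hadoop is installed under /opt/hadoop):
# get_site_prop /opt/hadoop/etc/hadoop/core-site.xml fs.defaultFS
```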

The hdfs-site.xml, yarn-site.xml, and mapred-site.xml files are configured in the same way to define storage paths, replication factors, resource manager addresses, and job scheduling parameters. After the files are distributed to the worker nodes, the NameNode is formatted and the cluster is launched:

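# Format the NameNode once, on the NameNode host only (this erases existing HDFS metadata)
cd /opt/hadoop/bin
./hdfs namenode -format

# Start the HDFS and YARN daemons
cd /opt/hadoop/sbin
./start-all.sh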
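
After startup, a quick way to confirm the daemons are running is to inspect the output of jps on each node. The sketch below filters that output for the daemon names you expect. Which daemons belong on which node depends on your layout, so the lists in the example are assumptions, and missing_daemons is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output ("<pid> <MainClass>") on stdin and
# print every expected daemon name that does not appear in it.
missing_daemons() {
  local listing d
  listing="$(cat)"
  for d in "$@"; do
    printf '%s\n' "$listing" \
      | awk -v d="$d" '$2 == d { ok = 1 } END { exit !ok }' \
      || echo "$d"
  done
}

# Example on the master node (daemon placement is an assumption):
# jps | missing_daemons NameNode ResourceManager
# Example on a worker node:
# jps | missing_daemons DataNode NodeManager
```

An empty result means every listed daemon was found. `hdfs dfsadmin -report` gives a complementary view by listing the DataNodes that have registered with the NameNode.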

The tutorial continues with MySQL installation to serve as the Hive metastore, followed by Hive deployment. It covers extracting Hive packages, modifying hive-site.xml for database connectivity, initializing the schema, and verifying table creation. Finally, Spark is integrated as a high-performance in-memory computing engine, with configuration steps for spark-env.sh and spark-defaults.conf to enable YARN integration and event logging. The article concludes by confirming the successful startup of the Spark shell, marking the completion of a fully functional ARM-based big data cluster.
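
The Spark settings described above can be sketched as a small generator for spark-defaults.conf. The property names spark.master, spark.eventLog.enabled, and spark.eventLog.dir are standard Spark settings. The log directory /spark-logs, the install path /opt/spark, and the function name render_spark_defaults are assumptions for illustration and do not come from the original article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a minimal spark-defaults.conf that runs Spark
# on YARN and writes event logs to a directory in HDFS.
render_spark_defaults() {
  local fs_uri="$1" log_dir="$2"
  cat <<EOF
spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               ${fs_uri}${log_dir}
EOF
}

# Example (HDFS address taken from fs.defaultFS; other paths are assumed):
# hdfs dfs -mkdir -p /spark-logs
# render_spark_defaults hdfs://node1:8020 /spark-logs > /opt/spark/conf/spark-defaults.conf
```

The event log directory has to exist in HDFS before the first job runs, which is why the example creates it first.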

