Ambari Technical Practice for Managing Hadoop Big Data Platforms
The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrades of its Hadoop‑centric big‑data platform. Two obstacles, taking over an existing HA cluster and integrating a custom Hive 2.1 deployment, were addressed through a three‑phase rollout (test, gray‑scale, production), with the goal of improving management efficiency and reducing O&M costs.
Background: As the company’s business expands and data plays an ever larger role across society, data has become increasingly important for maintaining competitive advantage. Big data is the foundation of data‑driven operations. Although we started late with big‑data technology, we have caught up quickly, building a Hadoop‑centric big‑data platform that has grown rapidly in cluster scale, data resources, and business coverage.
However, because the platform supports many complex business scenarios and contains numerous technical components, we face several shortcomings in daily management, deployment, upgrade, and monitoring:
Installing, scaling, and upgrading resources is inconvenient, especially managing configuration dependencies between components.
Each service cluster lacks unified overall monitoring; troubleshooting and performance tuning rely on per‑host metrics, leading to low efficiency.
Component‑level monitoring does not fully exploit Metrics data for time‑series health observation, making stability analysis difficult.
Integrating existing monitoring tools and providing an intuitive UI for platform inspection and management is challenging.
After recognizing these issues, we investigated solutions and focused on Apache Ambari. Ambari is a top‑level Apache open‑source project for provisioning, managing, and monitoring Hadoop ecosystem components (including HBase, Hive, Spark, and Storm). It exposes a complete set of RESTful APIs, and its web UI builds on those APIs and on collected Metrics data for platform management.
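As a minimal sketch of what that REST interface looks like from a client, the snippet below builds (but does not send) a request that lists the services of one cluster. The host name, port 8080, cluster name, and credentials are placeholders rather than values from our environment; the `/api/v1/clusters/<name>/services` path follows Ambari's v1 REST API.

```python
import base64
import urllib.request


def build_services_request(base_url, cluster, user, password):
    """Build a GET request for the services of one Ambari-managed cluster."""
    url = f"{base_url.rstrip('/')}/api/v1/clusters/{cluster}/services"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    # Ambari expects X-Requested-By on modifying calls; sending it on reads is harmless.
    request.add_header("X-Requested-By", "ambari")
    return request


# Placeholder values for illustration only.
req = build_services_request("http://ambari.example.com:8080", "demo", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call against a live Ambari Server; building the request separately keeps the example runnable without one.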
Ambari Technical Practice
Our big‑data technology interest group began a comprehensive study of Ambari in July 2017, aiming to apply Ambari to our production Hadoop environment, improve resource utilization, enhance management efficiency, and reduce O&M costs. We first cataloged the production architecture, component versions, upper‑layer applications, and developer interfaces. Our clusters consist of a Hadoop main cluster for Hive offline processing and a Hadoop advertising cluster for Spark real‑time processing. To avoid impacting online services, we planned a two‑step approach: first deploy Ambari, then let Ambari take over the cluster as a single upgrade operation.
Key technical challenges identified include:
Difficulty integrating Ambari with an existing platform.
Complex compatibility of platform component architectures.
Our main Hadoop cluster runs Hadoop 2.7.3, HDFS in HA mode (NameNodes plus JournalNodes), YARN with the FairScheduler, Hive 2.1, Flume 1.7, and other components. Ambari manages Hortonworks Data Platform (HDP) stacks; HDP 2.6 aligns with Hadoop 2.7.3 and Hive 1.2.1/2.1, so it is backward compatible with our versions. However, HDP 2.6 installs only Hive 1.2.1 by default, so we needed to add Hive 2.1 support ourselves.
Ambari Practice Steps
We designed a three‑phase plan:
Build a test cluster, install Ambari, and achieve deployment, monitoring, and alerting for an HDP 2.6 cluster.
In a gray‑scale environment, simulate an upgrade, test component and application compatibility, and evaluate stability.
Apply Ambari to the production environment, covering both deployment and upgrade.
Ambari Architecture Overview
Ambari consists of five parts:
Ambari Web: user interface that communicates with Ambari Server via REST APIs.
Ambari Server: web server handling logic for agents and storing data in a database.
Ambari Agent: daemon on each node reporting status and receiving commands.
Host: physical machines running big‑data services, each with an Ambari Agent and Metrics Monitor.
Metrics Collector: stores metrics in HBase and provides query interfaces to the server.
Ambari is built with Java, JavaScript, and Python. It leverages open‑source tools such as Puppet (agent management), Ember.js (frontend MVC), and Spring and JAX‑RS (server), and it integrates Ganglia and Nagios for distributed monitoring.
HDP Introduction
Hortonworks Data Platform (HDP) is an open‑source, YARN‑centric enterprise Hadoop distribution. Ambari manages HDP stacks, services, and components, handling dependencies via Service Metainfo (e.g., YARN requires HDFS and MR2).
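To illustrate how declared dependencies translate into an install or start order, here is a small sketch using a topological sort. The dependency map is a hand‑written stand‑in that mirrors the YARN example above (the HIVE entry is an added assumption); it is not read from real Service Metainfo files, and it is not how Ambari itself resolves dependencies. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified map: service -> services it requires.
# Only the YARN entry comes from the text; HIVE -> YARN is an assumption.
REQUIRES = {
    "HDFS": set(),
    "MAPREDUCE2": set(),
    "YARN": {"HDFS", "MAPREDUCE2"},
    "HIVE": {"YARN"},
}


def install_order(requires):
    """Return the services ordered so that every dependency comes first."""
    # static_order() raises graphlib.CycleError if the map contains a cycle.
    return list(TopologicalSorter(requires).static_order())


order = install_order(REQUIRES)
print(order)
```

The relative order of independent services (here HDFS and MAPREDUCE2) is not significant; only the constraint that dependencies precede their dependents is.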
Managing Existing Clusters with Ambari
We focused on two representative problems in the gray‑scale test cluster:
Ambari taking over a high‑availability Hadoop cluster while maintaining compatibility.
Custom deployment to support Hive 2.1, which Ambari does not natively supply.
High‑Availability Hadoop Cluster Takeover
Our gray‑scale cluster mirrors the production environment (Hadoop 2.7.3, HA HDFS with active/standby NameNodes, JournalNodes, and YARN with the FairScheduler). We selected HDP 2.6 as the compatible stack. The takeover proceeds in three steps: deploy Ambari and the HDP components, stop the existing cluster, then start the HDP cluster via Ambari. The critical points are matching the original node role assignments, preserving data directories, and keeping configuration parameters (e.g., block size, resource policies) consistent. Ambari’s web UI simplifies configuration adjustments.
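Keeping parameters consistent is easier to check mechanically than by eye. The sketch below parses two Hadoop `*-site.xml` documents and reports every property whose value differs. The property names are real HDFS settings, but the values and the two inline documents are made up for illustration and do not come from our clusters.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(original, managed):
    """Return {name: (original_value, managed_value)} for every differing key."""
    keys = set(original) | set(managed)
    return {
        key: (original.get(key), managed.get(key))
        for key in sorted(keys)
        if original.get(key) != managed.get(key)
    }


# Illustrative documents only: "original" cluster vs. Ambari-managed config.
ORIGINAL = """<configuration>
  <property><name>dfs.blocksize</name><value>268435456</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>"""

MANAGED = """<configuration>
  <property><name>dfs.blocksize</name><value>134217728</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/hadoop/hdfs/namenode</value></property>
</configuration>"""

diff = diff_properties(load_properties(ORIGINAL), load_properties(MANAGED))
for key, (old, new) in diff.items():
    print(f"{key}: original={old!r} ambari={new!r}")
```

A value of `None` on one side means the property exists in only one of the two files, which is often as important to catch as a changed value.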
We first configured HDFS with a single NameNode to align the file structures with the existing cluster, then compared and synchronized configuration items across nodes. After stopping the original cluster, we started ZooKeeper, HDFS, MR2, and YARN in that order via Ambari, and confirmed normal operation through the NameNode UI (port 50070). Next, we enabled HDFS HA using Ambari’s integrated workflow, making sure the standby NameNode, JournalNode, and ZKFC components had the correct metadata paths. After completing the HA configuration, we verified that the block counts before and after the upgrade matched exactly.
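One way such a block‑count comparison could be automated is to capture `hdfs fsck /` output before and after the takeover and compare the summary line. This is a sketch under the assumption that the summary contains a line of the form `Total blocks (validated): N`, as in the Hadoop 2.x output we are familiar with; the text does not prescribe a particular method, and the sample strings below are invented.

```python
import re

# Matches the fsck summary line, e.g. "Total blocks (validated):  34 (avg. ...)".
BLOCKS_RE = re.compile(r"Total blocks \(validated\):\s*(\d+)")


def total_blocks(fsck_output):
    """Extract the validated block count from captured fsck output."""
    match = BLOCKS_RE.search(fsck_output)
    if match is None:
        raise ValueError("no 'Total blocks (validated)' line found")
    return int(match.group(1))


def blocks_match(before_output, after_output):
    """True when both fsck reports show the same validated block count."""
    return total_blocks(before_output) == total_blocks(after_output)


# Invented excerpts for illustration, not real output from our clusters.
BEFORE = " Total files:\t12\n Total blocks (validated):\t34 (avg. block size 1048576 B)\n"
AFTER = " Total files:\t12\n Total blocks (validated):\t34 (avg. block size 1048576 B)\n"

print(total_blocks(BEFORE), total_blocks(AFTER), blocks_match(BEFORE, AFTER))
```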
Custom Deployment of Hive 2.1
The HDP 2.6 stack in Ambari provides only Hive 1.2.1 out of the box, but our production workloads require Hive 2.1. To overcome this, we traced the command execution flow: HeartBeatHandler processes Agent heartbeats, queues commands, and eventually invokes the HiveServer logic in Hive.py. By modifying the configuration methods in Hive.py to point to the Hive 2.1 directories and adding custom key‑value pairs, we enabled Ambari to manage Hive 2.1 without breaking existing functionality.
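The idea of redirecting directories and adding custom key‑value pairs can be pictured as layering overrides on top of the default parameters. The sketch below is only an illustration of that pattern: the function, the key names, and the paths are hypothetical and are not taken from Hive.py or from Ambari's parameter names.

```python
def apply_hive2_overrides(params, overrides):
    """Return a copy of params with the Hive 2.1 settings layered on top."""
    merged = dict(params)      # copy so the defaults stay untouched
    merged.update(overrides)   # overrides win; new keys are simply added
    return merged


# Hypothetical keys and paths, for illustration only.
DEFAULT_PARAMS = {
    "hive_home": "/opt/hive-1.2.1",
    "hive_conf_dir": "/opt/hive-1.2.1/conf",
    "hive_server_port": "10000",
}

HIVE2_OVERRIDES = {
    "hive_home": "/opt/hive-2.1",
    "hive_conf_dir": "/opt/hive-2.1/conf",
    "hive_version_label": "2.1",  # an added custom key-value pair
}

effective = apply_hive2_overrides(DEFAULT_PARAMS, HIVE2_OVERRIDES)
for key in sorted(effective):
    print(f"{key} = {effective[key]}")
```

Keeping the defaults unmodified and applying overrides on a copy makes it straightforward to fall back to the stock Hive 1.2.1 settings if the custom deployment misbehaves.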
Future Outlook
Although we only covered two representative challenges, the experience builds confidence for full production deployment. Further work will explore additional Ambari capabilities and continue advancing intelligent, enterprise‑grade big‑data platform construction.