MaGe Linux Operations
Sep 8, 2025 · Big Data

Build Enterprise‑Grade HDFS HA and Optimize YARN Scheduling from Scratch

This guide walks you through building a fault‑tolerant HDFS high‑availability architecture: configuring dual NameNodes with ZooKeeper and JournalNode clusters, tuning YARN resource schedulers, and implementing monitoring, automated failover testing, and performance optimization, all backed by real‑world production experience and code examples.

Big Data Operations · YARN · HDFS
0 likes · 24 min read
Alibaba Cloud Big Data AI Platform
Aug 8, 2025 · Artificial Intelligence

Unlocking Big Data Ops with Large Models: Opportunities, Challenges, Design

This article summarizes a Cloud Summit talk in which Alibaba Cloud’s AI expert Zhang Yingying explains how large language models can enhance intelligent big‑data operations, covering opportunities, challenges, RAG‑based Q&A, multi‑agent diagnostics, and the engineering architecture needed for reliable, scalable deployment.

AI engineering · Big Data Operations · RAG
0 likes · 20 min read
Alibaba Cloud Big Data AI Platform
Nov 29, 2023 · Operations

How AIOps and DataOps Transform Big Data Operations: Lessons from ABM Platform

This article examines the challenges of big‑data operations, explains how DataOps and AIOps complement each other, and details the ABM intelligent operations architecture, platform components, and real‑world use cases such as Flink hotspot detection, ChatOps assistants, and dynamic MaxCompute resource optimization.

Big Data Operations · DataOps · AIOps
0 likes · 11 min read
dbaplus Community
Dec 15, 2021 · Big Data

How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

Balance Optimization · Big Data Operations · DistCp
0 likes · 14 min read
Efficient Ops
Dec 17, 2019 · Operations

How Alibaba Scales Flink: Lessons in Big Data Operations

This article details Alibaba's massive Flink deployment, covering its historical background, the operational challenges of managing tens of thousands of nodes, the design of a comprehensive Flink management platform, and the automated solutions for fault handling, resource allocation, and performance testing in a large‑scale big‑data environment.

Big Data Operations · Flink · Automation
0 likes · 20 min read
dbaplus Community
Aug 19, 2019 · Big Data

Automating Fault Recovery in 5,000‑Node Hadoop Clusters with Fabric & CM_API

This article explains how a large‑scale Hadoop environment can automatically detect common failures (swap usage, clock drift, agent crashes, role outages, and disk imbalance) and recover from them using Prometheus alerts, Fabric/Paramiko remote execution, and Cloudera Manager APIs, complete with code examples and step‑by‑step commands.

Big Data Operations · CM_API · Cluster Automation
0 likes · 12 min read