Tagged articles

Big Data Operations

7 articles

Sep 8, 2025 · Big Data

Build Enterprise‑Grade HDFS HA and Optimize YARN Scheduling from Scratch

This guide walks through building a fault‑tolerant HDFS high‑availability architecture: configuring dual NameNodes with ZooKeeper and JournalNode clusters, tuning YARN resource schedulers, setting up monitoring, testing automated failover, and optimizing performance, all backed by production experience and code examples.

Big Data Operations · HDFS · Resource Scheduling

24 min read

Alibaba Cloud Big Data AI Platform

Aug 8, 2025 · Artificial Intelligence

Unlocking Big Data Ops with Large Models: Opportunities, Challenges, Design

This article summarizes a Cloud Summit talk in which Alibaba Cloud’s AI expert Zhang Yingying explains how large language models can enhance intelligent big‑data operations, covering opportunities, challenges, RAG‑based Q&A, multi‑agent diagnostics, and the engineering architecture needed for reliable, scalable deployment.

AI engineering · Big Data Operations · RAG

20 min read

Alibaba Cloud Big Data AI Platform

Nov 29, 2023 · Operations

How AIOps and DataOps Transform Big Data Operations: Lessons from ABM Platform

This article examines the challenges of big‑data operations, explains how DataOps and AIOps complement each other, and details the ABM intelligent operations architecture, platform components, and real‑world use cases such as Flink hotspot detection, ChatOps assistants, and dynamic MaxCompute resource optimization.

AIOps · Big Data Operations · DataOps

11 min read

dbaplus Community

Dec 15, 2021 · Big Data

How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

Balance Optimization · Big Data Operations · Data Migration

14 min read

Efficient Ops

Dec 17, 2019 · Operations

How Alibaba Scales Flink: Lessons in Big Data Operations

This article details Alibaba's massive Flink deployment, covering its history, the operational challenges of managing tens of thousands of nodes, the design of a comprehensive Flink management platform, and the automated solutions for fault handling, resource allocation, and performance testing in a large‑scale big‑data environment.

Big Data Operations · Cluster Management · Flink

20 min read

dbaplus Community

Aug 19, 2019 · Big Data

Automating Fault Recovery in 5,000‑Node Hadoop Clusters with Fabric & CM_API

This article explains how a large‑scale Hadoop environment can automatically detect common failures, such as swap usage, clock drift, agent crashes, role outages, and disk imbalance, and recover from them using Prometheus alerts, Fabric/Paramiko remote execution, and Cloudera Manager APIs, complete with code examples and step‑by‑step commands.

Big Data Operations · CM_API · Cluster Automation

12 min read

StarRing Big Data Open Lab

Oct 17, 2016 · Operations

How Transwarp Manager Simplifies HDFS Monitoring and Boosts Operational Efficiency

This article explains how Transwarp Manager aggregates key HDFS metrics into a single dashboard, demonstrates a DataNode failure scenario on a three‑node test cluster, and shows how the visual alerts help operators quickly identify and resolve big‑data service issues.

Big Data Operations · HDFS monitoring · TDH

6 min read
