
From Manual Ops to AIOps: Real‑World DataOps Practices at Alibaba

This talk walks through the evolution from manual to automated operations, outlines Alibaba’s integrated operations platform architecture, showcases practical DataOps implementations, and explores AIOps applications such as self‑healing monitoring, hardware remediation, and resource optimization, offering concrete examples and insights for modern ops engineers.

Efficient Ops

Speaker: Fan Lenting, a senior operations expert at Alibaba Cloud Platform, has worked in operations since 2008 and currently focuses on the real‑time computing platform Stream‑Compute.

1. Operations Advancement

Operations evolve from manual processes to automation and then to AIOps. DataOps sits between automation and AIOps and is often considered the early phase of AIOps. It differs from AIOps in that its results serve as decision support rather than driving fully automated actions: a human reviews the output before acting on it.

2. Integrated Operations Platform

Alibaba Cloud Platform’s architecture consists of two major distributed computing engines that support numerous big‑data platforms such as MaxCompute, AnalyticDB, and Stream‑Compute, with a combined physical footprint of over 100,000 nodes.

To manage the complexity of multiple engines and platforms, a three‑layer operations solution is built:

Operations IaaS layer, relying on corporate infrastructure.

Operations PaaS layer, providing common services like user and role management.

Application layer, enabling customers to quickly build personalized services on top of the shared service layer.

The platform handles two main information flows: an upward flow (monitoring data that reports service health) and a downward flow (commands such as service start/stop). Automation UI views let operators work with both flows, and AI techniques can close the loop between them, which marks the AIOps stage.

3. DataOps Practice

DataOps relies on standardized data‑warehouse principles. Typical operational data includes dimension data (metadata), metric data (runtime indicators and events), and log data. These form the foundation for DataOps.
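The three data types above can be sketched as a small data model. This is only an illustration: the class names, field names, and the `mean_by_attribute` helper are hypothetical and not taken from Alibaba's platform. It shows how dimension data gives metric data its context, here by averaging a metric per cluster.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Dimension:
    """Metadata describing an entity (field names are hypothetical)."""
    entity_id: str
    entity_type: str                      # e.g. "machine" or "queue"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class MetricPoint:
    """One runtime indicator sample for an entity."""
    entity_id: str
    name: str
    timestamp: int
    value: float


@dataclass
class LogLine:
    """One raw log record emitted by an entity."""
    entity_id: str
    timestamp: int
    message: str


def mean_by_attribute(
    dims: Iterable[Dimension],
    points: Iterable[MetricPoint],
    metric: str,
    attribute: str,
) -> Dict[str, float]:
    """Average one metric, grouped by a dimension attribute."""
    attr_of = {d.entity_id: d.attributes.get(attribute, "unknown") for d in dims}
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for p in points:
        if p.name != metric:
            continue
        key = attr_of.get(p.entity_id, "unknown")
        sums[key] += p.value
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}
```

Keeping metadata separate from measurements is the usual data‑warehouse split between dimensions and facts, which is why a join like the one above stays cheap to express.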

The DataOps architecture abstracts data across multiple big‑data platforms, separating common data (e.g., machines, metrics, logs) from business‑specific data, and builds scenarios such as fault self‑healing and anomaly detection.

Examples include knowledge‑graph‑based operation search, where users can query entities like queues and receive related actions, and ChatOps for simple query‑driven interactions.

Job diagnostics combine end‑to‑end analysis of job lifecycles, machine health checks, I/O diagnostics, and network traffic analysis to answer performance questions.

Anomaly detection examples include clustering‑based detection for homogeneous hosts and log‑pattern‑based detection that converts log anomalies into metric‑like alerts.
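A minimal sketch of the log‑pattern idea, under stated assumptions: variable tokens (numbers, IPs, hex values) are masked so that similar lines collapse into one template, and the per‑template count in a time window becomes a metric that can be compared with a baseline. The masking regex, the `ratio` and `min_count` thresholds, and all function names are illustrative choices, not the method described in the talk.

```python
import re
from collections import Counter
from typing import Iterable, List

# Mask hex values and (dotted) numbers so similar lines share one template.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_template(line: str) -> str:
    """Replace variable tokens with a placeholder."""
    return _VARIABLE.sub("<*>", line)


def template_counts(lines: Iterable[str]) -> Counter:
    """Turn a window of raw log lines into a per-template count metric."""
    return Counter(to_template(line) for line in lines)


def surging_templates(
    baseline: Counter,
    current: Counter,
    ratio: float = 3.0,      # assumed threshold: 3x the baseline count
    min_count: int = 5,      # assumed floor to ignore rare noise
) -> List[str]:
    """Return templates whose count jumped well above the baseline window."""
    alerts = []
    for template, count in current.items():
        base = max(baseline.get(template, 0), 1)
        if count >= min_count and count > ratio * base:
            alerts.append(template)
    return sorted(alerts)
```

Once logs are reduced to counts per template, the same alerting machinery used for ordinary metrics can be applied to them.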

Another example is optimizing synchronization tasks across different business units by clustering job configurations and applying the best‑performing settings to similar jobs, achieving up to a 7‑fold speed increase.
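The idea can be sketched as follows. Note the simplification: instead of a real clustering algorithm, jobs are grouped by a coarse signature (source, sink, and order of magnitude of the data volume), and the settings of the fastest job in each group are recommended to the others. The `SyncJob` fields, the signature, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class SyncJob:
    name: str
    source: str                  # hypothetical, e.g. "mysql"
    sink: str                    # hypothetical, e.g. "maxcompute"
    rows: int
    settings: Dict[str, int]     # e.g. {"parallelism": 4}
    rows_per_sec: float


def signature(job: SyncJob) -> Tuple[str, str, int]:
    """Coarse similarity key; the digit count buckets jobs by data volume."""
    return (job.source, job.sink, len(str(max(job.rows, 1))))


def recommend(jobs: Iterable[SyncJob]) -> Dict[str, Dict[str, int]]:
    """Map each slower job to the settings of the fastest similar job."""
    jobs = list(jobs)
    best: Dict[Tuple[str, str, int], SyncJob] = {}
    for job in jobs:
        sig = signature(job)
        if sig not in best or job.rows_per_sec > best[sig].rows_per_sec:
            best[sig] = job
    return {
        job.name: best[signature(job)].settings
        for job in jobs
        if best[signature(job)].name != job.name
    }
```

In the spirit of DataOps as decision support, the output is a recommendation for a human to review rather than a change applied automatically.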

4. AIOps Applications

4.1 AIOps Monitoring Self‑Healing

Self‑healing monitoring consists of perception (detecting anomalies), decision (real‑time processing), and remediation (automated actions), with optional human approval for critical operations.
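The three stages can be sketched as a small loop. Everything here is an assumption made for illustration: the threshold‑based `perceive`, the `PLAYBOOK` table, the action names, and the `approve` callback that stands in for human approval of critical operations. Actions are only recorded, not executed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Anomaly:
    host: str
    kind: str
    value: float


def perceive(samples: Dict[str, float], threshold: float) -> List[Anomaly]:
    """Perception: flag hosts whose disk usage exceeds a fixed threshold."""
    return [
        Anomaly(host, "disk_usage_high", value)
        for host, value in sorted(samples.items())
        if value > threshold
    ]


# Decision table (hypothetical): anomaly kind -> (action, needs approval).
PLAYBOOK: Dict[str, Tuple[str, bool]] = {
    "disk_usage_high": ("clean_tmp", False),
    "kernel_hang": ("reboot", True),
}


def decide(anomaly: Anomaly) -> Optional[Tuple[str, bool]]:
    """Decision: look up the remediation plan for this kind of anomaly."""
    return PLAYBOOK.get(anomaly.kind)


def remediate(
    anomalies: List[Anomaly],
    approve: Callable[[Anomaly, str], bool],
) -> List[str]:
    """Remediation: record each action, gating critical ones on approval."""
    log = []
    for anomaly in anomalies:
        plan = decide(anomaly)
        if plan is None:
            log.append(f"{anomaly.host}: no playbook for {anomaly.kind}, escalate")
            continue
        action, critical = plan
        if critical and not approve(anomaly, action):
            log.append(f"{anomaly.host}: {action} rejected")
            continue
        log.append(f"{anomaly.host}: {action} executed")
    return log
```

Passing the approval step in as a callback keeps the loop identical for fully automated and human‑gated actions; only the callback changes.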

4.2 AIOps Hardware Self‑Healing

Hardware fault‑detection agents collect data on each host and send it through SLS pipelines to Stream‑Compute for analysis. The results are stored in an OLAP store and trigger actions such as taking a machine offline or rebooting it. The system handles over 200,000 self‑healing events annually while keeping server availability above 99%.

4.3 AIOps Resource Optimization

Resource quota allocation is optimized by building a satisfaction model (considering resource contention, wait time, and fulfillment rate) and using time‑series anomaly detection to adjust allocations based on predicted usage and historical satisfaction data.
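As a rough sketch of how such a model might look, the three factors can be combined into a score between 0 and 1 and used together with a usage forecast to adjust a quota. The weights, the `max_wait` cap, the `target` score, and the `headroom` factor are all assumptions, and the predicted usage is taken as an input rather than computed by a time‑series model.

```python
from typing import Tuple


def satisfaction(
    contention: float,           # share of time spent competing for resources, 0..1
    wait_seconds: float,         # average wait before resources are granted
    fulfillment: float,          # share of requested resources actually granted, 0..1
    max_wait: float = 600.0,     # assumed wait at which the penalty saturates
    weights: Tuple[float, float, float] = (0.3, 0.3, 0.4),  # assumed weights
) -> float:
    """Weighted score in [0, 1]; higher means a more satisfied tenant."""
    wait_penalty = min(wait_seconds / max_wait, 1.0)
    w_contention, w_wait, w_fulfillment = weights
    return (
        w_contention * (1.0 - contention)
        + w_wait * (1.0 - wait_penalty)
        + w_fulfillment * fulfillment
    )


def adjust_quota(
    current_quota: float,
    predicted_usage: float,
    score: float,
    target: float = 0.8,         # assumed satisfaction target
    headroom: float = 1.2,       # assumed safety margin over the forecast
) -> float:
    """Grow the quota for unsatisfied tenants, reclaim slack from satisfied ones."""
    wanted = predicted_usage * headroom
    if score < target:
        return max(current_quota, wanted)
    return min(current_quota, wanted)
```

For example, a satisfied tenant holding a quota of 100 with a forecast of 50 would be trimmed to about 60, while an unsatisfied tenant forecast at 120 would be raised to about 144.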

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
