Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices
At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.
Alipay Full‑Ecosystem Availability Monitoring – Background and Challenges
The 24th GOPS Global Operations Conference and Research‑Operation Intelligence Summit took place in Shanghai on October 18‑19, 2024. The two‑day event covered large models, DevOps, SRE, AIOps, BizDevOps, cloud‑native, and security. Its special tracks addressed large models applied to operations and testing, digital transformation in banking and securities, platform engineering, DevOps/AIOps best practices, and leading internet companies. Tang Liang, Head of Alipay Ecosystem Monitoring Assurance, delivered a talk titled “Technical System and Application of Alipay’s Full‑Ecosystem Availability Monitoring Assurance”. Unauthorized reproduction of the talk is prohibited.
Alipay Full‑Ecosystem Monitoring Assurance Architecture
The presentation described the overall technical architecture that enables end‑to‑end availability monitoring across Alipay’s entire ecosystem, integrating metrics collection, real‑time alerting, and automated remediation pipelines. The design emphasizes scalability, fault tolerance, and seamless integration with existing DevOps and SRE workflows.
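The talk summary does not disclose implementation details, so the following is only a minimal sketch of how a collect, alert, and remediate loop of this kind could be wired together. Every name here (`WindowStats`, `availability`, `evaluate`, the `remediate` callback) and the availability target are illustrative assumptions, not Alipay's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WindowStats:
    """Request counts for one service over one evaluation window (hypothetical shape)."""
    service: str
    total: int
    failed: int


def availability(stats: WindowStats) -> float:
    """Fraction of successful requests; an idle window is treated as fully available."""
    if stats.total == 0:
        return 1.0
    return (stats.total - stats.failed) / stats.total


def evaluate(
    stats: WindowStats,
    target: float,
    remediate: Optional[Callable[[str], None]] = None,
) -> Optional[str]:
    """Return an alert message if availability is below target, else None.

    When an alert fires and a remediation hook is supplied, the hook is
    called with the service name (an assumed, simplified interface).
    """
    value = availability(stats)
    if value >= target:
        return None
    if remediate is not None:
        remediate(stats.service)
    return f"{stats.service}: availability {value:.4f} below target {target:.4f}"
```

In this sketch the remediation hook is optional, so the same evaluation function can run in an alert-only mode before any automated action is trusted.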
Proactive Risk Assurance Practices in the Alipay Ecosystem
Key risk‑prevention measures were highlighted, including proactive health checks, synthetic transaction monitoring, and predictive anomaly detection powered by AIOps. These practices aim to identify potential service degradations before they impact users, thereby maintaining high availability standards.
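As an illustration of the general idea of flagging a degradation before users notice it, here is a small sketch that applies a rolling z-score to latencies reported by a synthetic probe. This is a simple statistical stand-in, not the AIOps model from the talk. The class name, window size, and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags values that deviate strongly from a rolling window of recent samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.samples = deque(maxlen=window)  # most recent observations only
        self.threshold = threshold           # z-score above which a value is anomalous
        self.min_samples = min_samples       # history required before judging

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to history, then record it."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # Perfectly flat history: any change counts as a deviation.
                anomalous = value != mu
        self.samples.append(value)
        return anomalous


detector = RollingAnomalyDetector(window=10)
latencies_ms = [100, 102, 98, 101, 99, 100, 250]  # made-up probe results
flags = [detector.observe(x) for x in latencies_ms]
```

Note that this sketch records anomalous values into the window as well, so a sustained degradation gradually becomes the new baseline. A production detector would likely handle that case differently.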
Monitoring System Construction and Practice
The talk concluded with concrete implementation details, such as the deployment of distributed tracing, centralized logging, and automated incident response playbooks. Real‑world case studies demonstrated how these components work together to achieve rapid detection and resolution of issues across the Alipay ecosystem.
For further details, the full PPT is available at: https://pan.baidu.com/s/1hpb2zy7qO-JNeDWjdLwa_Q?pwd=ih8b
Efficient Ops
This public account is run by Xiaotianguo and friends and regularly publishes widely read original technical articles. It focuses on the transformation of operations work and aims to accompany readers throughout their operations careers.