How AIA Built a Scalable Cloud‑Native Observability Platform for Insurance
This case study describes how AIA Insurance moved its legacy insurance systems to a cloud‑native, micro‑service architecture and built an observability platform on Kubernetes, Prometheus, Grafana, and custom data pipelines. The goals were better SLA compliance, better fault detection, and business‑level monitoring.
Business Context
AIA Insurance migrated legacy AS400 core applications to cloud‑native micro‑services and containerized both proprietary and third‑party apps on Kubernetes. The resulting hybrid cloud/on‑premise environment increased system complexity and introduced cross‑environment service calls.
Observability Challenges
Increased observation complexity : Business metrics such as policy issuance rate, daily active users, and underwriting success rate are scattered across many services.
Heterogeneous technology stack : Different languages, frameworks, and versions make unified tracing and metric collection difficult.
Fragmented logging : Development and operations teams store logs separately, preventing a single dashboard view.
Metric overload : Over 200 IaaS, PaaS and application metrics require continuous curation.
Costly commercial APM : Existing APM tools are expensive and provide incomplete trace data.
Observability Implementation Process
The project followed four stages: research and analysis, solution design, implementation of the transformation, and production validation. Five core requirements guided the design:
Service‑resource tracking (CPU, memory, network, disk, pod health).
Service‑level top‑view (call volume, latency, hotspot ranking).
End‑to‑end tracing with minimal intrusion.
Latency distribution analysis for upstream/downstream services.
Database‑operation correlation (SQL, Redis, Mongo slow queries).
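As a rough illustration of the service‑level top view and latency distribution requirements above, the sketch below ranks services by call volume and computes nearest‑rank latency percentiles from a few span records. The Span type, the service names, and the sample latencies are hypothetical; in the platform described here this data would come from the tracing backend, not from in‑memory lists.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One traced call (hypothetical shape, not the ARMS data model)."""
    service: str
    latency_ms: float


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def top_view(spans, top_n=3):
    """Return (service, call_count, p50, p95) rows, busiest service first."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span.service].append(span.latency_ms)
    rows = [
        (name, len(lat), percentile(lat, 50), percentile(lat, 95))
        for name, lat in by_service.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


# Made-up sample data, for illustration only.
SAMPLE_SPANS = (
    [Span("policy-issuance", ms) for ms in (120, 80, 95, 300, 110)]
    + [Span("underwriting", ms) for ms in (40, 55, 45)]
    + [Span("quote", ms) for ms in (20, 25)]
)

if __name__ == "__main__":
    for row in top_view(SAMPLE_SPANS):
        print(row)
```

Nearest‑rank percentiles are used here only because they are easy to verify by hand; a production system would more likely rely on histogram buckets computed by the metrics backend.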
Design Principles
Top‑down business‑driven design : Prioritize monitoring of user‑experience‑critical flows before low‑impact technical metrics.
Business‑centric tracing & performance monitoring : Translate API calls into business‑readable terms, and combine tracing with tools and techniques such as Arthas, JVM tuning, and log analysis to pinpoint whether an issue stems from code or from the underlying resources.
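One way to picture the translation of API calls into business‑readable terms is a small route table that maps an HTTP method and path to an operation name. This is only a sketch: the paths, the operation names, and the business_name helper are hypothetical and are not taken from AIA's services.

```python
import re

# Hypothetical route table: (HTTP method, path pattern, business-readable name).
BUSINESS_OPERATIONS = [
    ("POST", re.compile(r"^/api/v1/policies$"), "Issue policy"),
    ("GET", re.compile(r"^/api/v1/policies/[^/]+$"), "View policy"),
    ("POST", re.compile(r"^/api/v1/underwriting/[^/]+/decision$"), "Underwriting decision"),
]


def business_name(method, path):
    """Map a raw API call to a business term, falling back to the raw call."""
    for route_method, pattern, name in BUSINESS_OPERATIONS:
        if method.upper() == route_method and pattern.match(path):
            return name
    return f"{method.upper()} {path}"


if __name__ == "__main__":
    print(business_name("POST", "/api/v1/policies"))
    print(business_name("GET", "/health"))
```

Falling back to the raw method and path keeps unmapped calls visible on a dashboard instead of silently dropping them.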
Full‑Lifecycle Metric Design
CI/CD pipelines run on Jenkins. Build‑related metrics (build count, duration, success rate, deployment frequency) are written to MySQL and later visualized alongside runtime data.
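The build‑related metrics named above can be derived from per‑run records. The following is a minimal sketch under stated assumptions: the BuildRecord fields are hypothetical stand‑ins for whatever columns the MySQL table actually has, the sample rows are invented, and the source's own collector is written in Java, not Python.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BuildRecord:
    """One pipeline run (hypothetical columns, not the actual MySQL schema)."""
    app: str
    day: date
    duration_s: float
    succeeded: bool
    deployed: bool


def summarize(records):
    """Aggregate build count, duration, success rate, and deployment frequency."""
    if not records:
        raise ValueError("no build records")
    # Deployment frequency is averaged over days that had at least one build.
    days = {r.day for r in records}
    return {
        "build_count": len(records),
        "avg_duration_s": sum(r.duration_s for r in records) / len(records),
        "success_rate": sum(r.succeeded for r in records) / len(records),
        "deployments_per_day": sum(r.deployed for r in records) / len(days),
    }


# Invented sample rows, for illustration only.
SAMPLE_BUILDS = [
    BuildRecord("policy-service", date(2024, 1, 1), 300.0, True, True),
    BuildRecord("policy-service", date(2024, 1, 1), 200.0, False, False),
    BuildRecord("policy-service", date(2024, 1, 2), 250.0, True, True),
    BuildRecord("policy-service", date(2024, 1, 2), 250.0, True, False),
]

if __name__ == "__main__":
    print(summarize(SAMPLE_BUILDS))
```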
Runtime monitoring is divided into three layers:
Resource layer : Node, disk, network, and pod health collected by Prometheus.
Application layer : Health checks, HTTP status codes, JVM/GC metrics.
Business layer : Core insurance KPIs such as page views, unique visitors, policy count, premium amount, and signed contracts.
Observability Architecture
Collection layer :
A Java‑based CI/CD data collector records build and deployment information in MySQL, tagged so that it can be correlated with runtime data.
A log collector deployed as a DaemonSet gathers container logs.
Prometheus scrapes time‑series metrics.
The ARMS agent (proven under load during Alibaba's Double 11 shopping festival) provides low‑overhead tracing and replaces SkyWalking.
Storage layer :
Structured build metadata is stored in MySQL.
Time‑series metrics are stored in Alibaba Cloud Prometheus.
Log and trace data are persisted in Alibaba Cloud SLS (Log Service), which is serverless and billed pay‑as‑you‑go.
Unified presentation layer :
Grafana (self‑hosted, version 8+) aggregates data from MySQL, Prometheus, and SLS.
Business dashboards combine SLS SQL queries with Grafana plugins to visualize policy‑level statistics alongside infrastructure metrics.
Alerts for threshold breaches are delivered through DingTalk or SMS.
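The threshold‑based alerting step can be sketched as a check that produces a message, plus a helper that wraps the message for a chat webhook. The metric name and threshold are hypothetical, and the payload shape is assumed to match the plain text message accepted by DingTalk custom robot webhooks; the code only builds the JSON body and does not send anything.

```python
import json


def check_threshold(metric, value, threshold):
    """Return an alert message if value exceeds threshold, else None."""
    if value > threshold:
        return f"[ALERT] {metric} = {value} exceeds threshold {threshold}"
    return None


def dingtalk_text_payload(message):
    """Wrap a message as a text-type webhook body (assumed DingTalk robot format)."""
    return json.dumps({"msgtype": "text", "text": {"content": message}})


if __name__ == "__main__":
    # Hypothetical metric and threshold, for illustration only.
    alert = check_threshold("p95_latency_ms", 850, 500)
    if alert is not None:
        print(dingtalk_text_payload(alert))
```

Keeping the check separate from the payload builder makes it straightforward to add a second channel, such as SMS, without touching the threshold logic.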
Unified Monitoring Platform Screens
Large screen : Cluster resource utilization and service health for executive decision‑making.
Medium screen : CI/CD efficiency, application performance, and full‑stack trace views for developers.
Small screen : Historical trend comparison and alert thresholds for operations staff.
Alibaba Cloud Native
