
How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

JD Cloud Developers

Oct 21, 2024

Background Introduction

The quality team is building and improving application monitoring so that issues are detected and resolved promptly and online services stay stable. As observability gains popularity, monitoring faces new challenges and responsibilities. This article explores, from a tester’s perspective, how to approach quality assurance under observability, why the test side’s approach differs from the development side’s, and the value it brings to the business.

1. Understanding Observability

1.1 What is Observability

Wikipedia definition: In control theory, observability refers to the degree to which a system’s internal state can be inferred from its external outputs. The concept is mathematically dual to controllability and was introduced by Hungarian‑American engineer Rudolf Kalman for linear dynamic systems.

In software, observability means monitoring the internal state of a system using a white‑box approach, spanning the entire application lifecycle. By analyzing metrics, logs, and traces, a complete observation model is built to enable fault diagnosis, root‑cause analysis, and rapid recovery.

Gartner defines observability as a characteristic of software and systems that allows administrators to collect external and internal state data in order to answer questions about system behavior. Teams such as I&O, DevOps, SRE, and Support use this data for anomaly investigation, observability‑driven development, and improving performance and uptime. Gartner predicted that by 2024, 30% of cloud‑native companies would adopt observability technologies.

OpenTelemetry identifies three pillars of observability: metrics, logs, and traces.

Observability workflow: Observe → Judge → Optimize → Observe again.

1.2 Difference between Observability and Monitoring

Both aim to provide a timely and accurate understanding of system status, improving control and fault handling.

Monitoring: Collecting, analyzing, and using information to track how a system behaves over time, focusing on specific metrics.

Observability: Analyzing system‑generated data to infer internal state and provide data‑driven decision support.

[Figure: four performance‑based distinctions between monitoring and observability]

Monitoring is an operation to improve observability; observability is an inherent system property reflecting health.

1.3 Relationship between Observability and Monitoring

Monitoring detects that an error has occurred (an active check applied from outside the system), while observability explains why the problem occurred by linking runtime data.

2. Quality Assurance Goals

Objectives

Achieve comprehensive system and application monitoring to proactively detect health issues.

Rapidly locate and resolve anomalies, discovering problems before users notice and providing remediation decisions.

Provide real‑time and historical comparable data reflecting system status to support technical decisions.

Scope

All critical application services and infrastructure.

Includes applications, servers, networks, databases, and business‑level data.

3. Quality Assurance Approach

This article proposes building a monitoring foundation and then extending it with data observability to address the passive nature of traditional monitoring. Collection, aggregation, and tracing are combined to enable issue localization, risk prediction, and system decision‑making.

1. Monitoring Foundation

1.1 Monitoring Dimensions

Monitoring aims to improve observability and typically includes:

Resource‑level monitoring: Hardware, network bandwidth, usually led by operations.

Service stability: Service or interface availability (e.g., UMP), usually led by development.

Business‑function monitoring: Verifying that outward‑facing functionalities work correctly; a focus for testers.

Business‑data monitoring: Tracking data correctness and trends to infer system health.

Log‑clustering monitoring: Using statistical methods on aggregated logs to assess overall availability; alerts trigger when error rates exceed thresholds.

1.2 Monitoring Items Prioritized by Test Teams

1.2.1 Business‑Function Monitoring

Interface functionality: Monitor core interfaces.

Read interfaces: Can be validated directly in production because they do not generate dirty data.

Write interfaces: May produce dirty data, so testing them directly in production is prohibited. The proposed “test back‑feeding” approach instead runs the monitoring cases in pre‑release environments to validate production behavior. A failing case falls into one of two categories:

Expected failures: Functional changes impact interfaces; monitoring content must be updated.

Unexpected failures: New test content reveals bugs.

Another idea is “traffic‑driven monitoring” that validates functionality with real user requests while masking sensitive data.
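The masking step in traffic-driven monitoring could look like the sketch below. It is only an illustration: the field names in SENSITIVE_FIELDS and the rule of keeping the last two characters are assumptions, not details from the original system, and nested payloads are not handled.

```python
# Hypothetical list of sensitive field names; a real system would define its own.
SENSITIVE_FIELDS = {"id_number", "phone", "name"}


def mask_value(value: str, keep: int = 2) -> str:
    """Replace all but the last `keep` characters with asterisks."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def mask_request(payload: dict) -> dict:
    """Return a copy of a flat request payload with sensitive fields masked."""
    return {
        key: mask_value(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in payload.items()
    }


recorded = {"phone": "13812345678", "product_id": 42}
print(mask_request(recorded))
```

Masking before the request is stored or replayed keeps the functional check meaningful (the request shape is unchanged) without exposing user data to the monitoring pipeline.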

Traditional periodic monitoring cases rely heavily on tester knowledge and may miss scenarios as interfaces evolve. A black‑box approach—deriving monitoring cases from user‑visible elements—helps achieve broader coverage.
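A black-box monitoring case for a read interface can be sketched as a check that every user-visible field is present in the response. The function name, the field names, and the stand-in fetch function are hypothetical; in practice `fetch` would wrap a real call to the interface under test.

```python
from typing import Callable, Mapping, Sequence


def check_read_interface(
    fetch: Callable[[], Mapping[str, object]],
    required_fields: Sequence[str],
) -> list[str]:
    """Call a read-only interface and return a list of problems (empty means healthy)."""
    try:
        payload = fetch()
    except Exception as exc:  # any failure to respond counts as a problem
        return [f"call failed: {exc}"]
    return [
        f"missing or empty field: {name}"
        for name in required_fields
        if payload.get(name) in (None, "", [])
    ]


# Hypothetical stand-in for a real call to a product-detail read interface.
def fake_product_detail() -> dict[str, object]:
    return {"product_name": "Travel Insurance", "premium": "120.00", "coverage": []}


problems = check_read_interface(
    fake_product_detail, ["product_name", "premium", "coverage"]
)
print(problems)
```

Because the required fields are derived from what the user sees rather than from the tester's knowledge of the implementation, the list can be regenerated whenever the page or interface changes.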

1.2.2 Business‑Data Monitoring

Business data reflects product value; its correctness and health indicate system stability. Examples include:

Timeliness of core data volumes (e.g., order count, premium).

Data correctness checks (e.g., premium = tax + post‑tax fee).

Core data trend thresholds (e.g., daily cancellation ratio exceeding a limit).
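The three kinds of checks above can be sketched as small predicates. The function names, the rounding tolerance, and the sample limits are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from decimal import Decimal


def data_is_stale(last_record_at: datetime, now: datetime, max_gap: timedelta) -> bool:
    """Timeliness: no new core record has arrived within the allowed gap."""
    return now - last_record_at > max_gap


def premium_is_consistent(
    premium: Decimal,
    tax: Decimal,
    post_tax_fee: Decimal,
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding tolerance
) -> bool:
    """Correctness: premium = tax + post-tax fee, within a small tolerance."""
    return abs(premium - (tax + post_tax_fee)) <= tolerance


def cancellation_ratio_exceeded(cancelled: int, total: int, limit: float) -> bool:
    """Trend threshold: the daily cancellation ratio is above the limit."""
    if total == 0:
        return False  # no orders today, so there is no ratio to evaluate
    return cancelled / total > limit


print(premium_is_consistent(Decimal("106.00"), Decimal("6.00"), Decimal("100.00")))
print(cancellation_ratio_exceeded(cancelled=12, total=100, limit=0.10))
```

Using Decimal rather than float for monetary amounts avoids false alarms caused by binary rounding.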

1.2.3 Log‑Clustering Monitoring

Logs indirectly reflect system stability. Two approaches:

Short‑term

Does not depend on development changes; clusters error types from existing logs. Alerts trigger when error count exceeds a fixed threshold. Threshold tuning is required to avoid false positives/negatives.

Example alert: “Warning: >100 ‘insurance age error’ occurrences in 10 min, exceeding threshold 90.”
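A minimal sketch of the short-term approach, assuming the input is the list of error messages already collected for one time window. Collapsing digits into a placeholder is a deliberately simple clustering rule used here for illustration; the original does not say which clustering method is used.

```python
import re
from collections import Counter
from typing import Iterable

_VARIABLE_PARTS = re.compile(r"\d+")


def error_signature(message: str) -> str:
    """Collapse variable parts (here: runs of digits) so similar errors share a signature."""
    return _VARIABLE_PARTS.sub("<n>", message).strip().lower()


def clusters_over_threshold(messages: Iterable[str], threshold: int) -> dict[str, int]:
    """Return the error clusters whose count in this window exceeds the fixed threshold."""
    counts = Counter(error_signature(message) for message in messages)
    return {sig: count for sig, count in counts.items() if count > threshold}


window = (
    ["Insurance age error: age=17"] * 3
    + ["Insurance age error: age=82"] * 2
    + ["Payment timeout after 3000 ms"]
)
for signature, count in clusters_over_threshold(window, threshold=4).items():
    print(f"Warning: {count} '{signature}' occurrences in window, exceeding threshold 4.")
```

The threshold is the tuning knob mentioned above: set too low it produces false positives, set too high it hides real incidents.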

Long‑term

Relies on standardized log printing (a start log and an end log for each request). After full log ingestion and cleaning, application availability is computed as:

Application Availability = (Total Traffic – Error Traffic) / Total Traffic

This enables day‑level, hour‑level, and 10‑minute availability metrics and supports root‑cause tracing via log IDs.
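The availability formula can be applied per time bucket as sketched below. The input shape, a list of (request end time, error flag) pairs produced by the log cleaning step, is an assumption; the bucketing helper only makes sense for bucket sizes that divide 60.

```python
from collections import defaultdict
from datetime import datetime
from typing import Iterable


def bucket_start(ts: datetime, minutes: int = 10) -> datetime:
    """Round a timestamp down to the start of its N-minute bucket (N must divide 60)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def availability_by_bucket(
    requests: Iterable[tuple[datetime, bool]], minutes: int = 10
) -> dict[datetime, float]:
    """Availability = (total traffic - error traffic) / total traffic, per bucket."""
    totals: dict[datetime, int] = defaultdict(int)
    errors: dict[datetime, int] = defaultdict(int)
    for ended_at, is_error in requests:
        key = bucket_start(ended_at, minutes)
        totals[key] += 1
        if is_error:
            errors[key] += 1
    return {key: (totals[key] - errors[key]) / totals[key] for key in totals}


sample = [
    (datetime(2024, 10, 21, 9, 3), False),
    (datetime(2024, 10, 21, 9, 7), True),
    (datetime(2024, 10, 21, 9, 8), False),
    (datetime(2024, 10, 21, 9, 9), False),
    (datetime(2024, 10, 21, 9, 12), False),
]
for start, value in sorted(availability_by_bucket(sample).items()):
    print(start.isoformat(), f"{value:.2%}")
```

Hour-level and day-level figures follow from the same counts by summing totals and errors across buckets before dividing, rather than averaging the per-bucket ratios.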

2. Observability Dimensions

The group’s PFinder (Problem Finder) is a next‑generation APM system aligned with observability, gradually adopted by development teams.

Why should test teams build observability distinct from development? To avoid duplicating effort and to provide high sensitivity to functional availability and data correctness, tightly integrating with monitoring to deliver diagnosis, analysis, and localization capabilities.

2.1 Module‑Level Observability

Detects stability of individual modules, offering trend analysis and warning messages such as “Suspicious: Core metric X dropped continuously since service launch on YYYY‑MM‑DD.”
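One simple way to produce such a warning is to count consecutive drops in a metric sampled since the launch. The rule (at least three consecutive strict decreases) and the function names are assumptions for illustration; a real detector would likely also account for seasonality and noise.

```python
from typing import Optional, Sequence


def trailing_drops(values: Sequence[float]) -> int:
    """Count how many of the most recent consecutive steps were strict decreases."""
    drops = 0
    for i in range(len(values) - 1, 0, -1):
        if values[i] < values[i - 1]:
            drops += 1
        else:
            break
    return drops


def trend_warning(
    metric: str,
    values_since_launch: Sequence[float],
    launch_date: str,
    min_drops: int = 3,  # assumed sensitivity; tune per metric
) -> Optional[str]:
    """Return a warning message when the metric keeps dropping, otherwise None."""
    drops = trailing_drops(values_since_launch)
    if drops < min_drops:
        return None
    return (
        f"Suspicious: core metric {metric} has dropped for {drops} consecutive "
        f"samples since the service launch on {launch_date}."
    )


daily_orders = [1020, 980, 950, 900]  # first value is the day of the launch
print(trend_warning("order_count", daily_orders, "YYYY-MM-DD"))
```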

2.2 System‑Level Observability

Aggregates logs to map data flow across modules. When any module alerts, notifications include preliminary diagnosis and enable cross‑system data verification.

Linked Alerts

Observability’s global view enables linked alerts: a downstream module failure propagates upstream alerts, allowing rapid root‑cause identification. Example: when upstream service A calls downstream B and B fails, combined alerts pinpoint the issue.

Fault Localization

Linked alerts can provide detailed fault content, e.g., “Error: Service A function abnormal, downstream B shows suspicious logs {key info}, B’s last deployment on YYYY‑MM‑DD. Please investigate.”
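The linking and localization steps can be sketched with a call graph and the set of currently alerting services. The heuristic used here, treating an alerting service as a root-cause candidate when none of its own downstream dependencies is alerting, is an assumption for illustration and not necessarily how the original system decides; it also returns nothing for alerting services that form a call cycle.

```python
from typing import Mapping, Sequence


def root_cause_candidates(
    calls: Mapping[str, Sequence[str]], alerting: set[str]
) -> list[str]:
    """Alerting services whose downstream dependencies are all healthy."""
    return sorted(
        service
        for service in alerting
        if not any(dep in alerting for dep in calls.get(service, ()))
    )


def linked_alert_message(
    upstream: str, downstream: str, key_info: str, last_deployment: str
) -> str:
    """Build a fault-localization message in the style of the example above."""
    return (
        f"Error: Service {upstream} function abnormal, downstream {downstream} "
        f"shows suspicious logs {{{key_info}}}, {downstream}'s last deployment on "
        f"{last_deployment}. Please investigate."
    )


call_graph = {"A": ["B"], "B": ["C"]}  # A calls B, B calls C
alerting_now = {"A", "B"}
for suspect in root_cause_candidates(call_graph, alerting_now):
    print(linked_alert_message("A", suspect, "key info", "YYYY-MM-DD"))
```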

Data Analysis

Cross‑system business interactions manifest as data flows. With linked capabilities, key data can be reconciled across systems and conversion rates analyzed, ensuring consistency and enabling comparison of sensitive data.
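Reconciliation between two systems can be sketched as a comparison of keyed records. The record shape (order ID mapped to an amount) and the system names in the sample data are hypothetical.

```python
from decimal import Decimal
from typing import Mapping


def reconcile(
    upstream: Mapping[str, Decimal], downstream: Mapping[str, Decimal]
) -> dict[str, list[str]]:
    """Compare keyed amounts from two systems and list every discrepancy."""
    shared = upstream.keys() & downstream.keys()
    return {
        "missing_downstream": sorted(upstream.keys() - downstream.keys()),
        "unexpected_downstream": sorted(downstream.keys() - upstream.keys()),
        "amount_mismatch": sorted(k for k in shared if upstream[k] != downstream[k]),
    }


def conversion_rate(entered: int, completed: int) -> float:
    """Share of records that made it from one step of the flow to the next."""
    return completed / entered if entered else 0.0


# Hypothetical data: amounts by order ID in an order system and a payment system.
orders = {"o1": Decimal("100.00"), "o2": Decimal("50.00"), "o3": Decimal("75.00")}
payments = {"o1": Decimal("100.00"), "o2": Decimal("55.00"), "o4": Decimal("20.00")}
print(reconcile(orders, payments))
print(conversion_rate(entered=200, completed=50))
```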

2.3 Perception and Presentation

Both monitoring and observability require notification and visualization. Plans include integrating with existing business monitoring dashboards and offering a generic alert service supporting email, messaging, and voice channels.
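A generic alert service of this kind could be structured as a registry of channel senders, as in the sketch below. The class, its methods, and the channel names are hypothetical; real senders would call the email, messaging, or voice gateway instead of appending to a list.

```python
from typing import Callable

Channel = Callable[[str], None]


class AlertService:
    """Dispatch one alert message to any subset of registered channels."""

    def __init__(self) -> None:
        self._channels: dict[str, Channel] = {}

    def register(self, name: str, send: Channel) -> None:
        self._channels[name] = send

    def notify(self, message: str, channels: list[str]) -> dict[str, bool]:
        """Send to each requested channel and report success per channel."""
        results: dict[str, bool] = {}
        for name in channels:
            send = self._channels.get(name)
            if send is None:
                results[name] = False  # channel was never registered
                continue
            try:
                send(message)
                results[name] = True
            except Exception:
                results[name] = False  # one failing channel must not block the rest
        return results


outbox: list[str] = []
service = AlertService()
service.register("email", outbox.append)  # stand-in for a real email sender
print(service.notify("Service A function abnormal", ["email", "voice"]))
```

Keeping the channels behind one interface lets both monitoring alerts and observability diagnoses reuse the same delivery path.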


