Why Observability Is the Missing Piece for Day‑2 Success in Cloud‑Native and Serverless Systems
This article explains how observability, built on logs, metrics, and traces, makes the opaque and complex Day‑2 operations of micro‑service, Kubernetes, and serverless environments diagnosable. It covers OpenTelemetry, practical collection methods, and real‑world implementation challenges and benefits.
What Observability Should Do
Observability aims to make a system’s internal state transparent, much like medical imaging lets doctors diagnose patients, by providing fine‑grained data such as logs, metrics, and request traces that reveal topology, performance bottlenecks, and failures.
Day‑2 Focus: Observability in Cloud‑Native and Serverless
While developers enjoy creative Day‑0/Day‑1 work, Day‑2—deployment, monitoring, maintenance, and iteration—often receives less attention. The article argues that robust observability is essential in this phase, especially for micro‑service architectures that may involve dozens or hundreds of services.
Foundations of Observability
Observability relies on three telemetry types; the tracing model in particular was popularized by Google’s Dapper paper:
Logs : Carry complete contextual information but can be costly to transmit and store.
Metrics : Provide abstracted statistical data with relatively fixed overhead, suitable for monitoring and alerting.
Traces : Describe request‑level topology across services; per‑request collection can be expensive.
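To make the three telemetry types concrete, here is a minimal sketch that emits all of them for one pretend request, using only the Python standard library. The `Telemetry` and `Span` classes and the `db.query` child operation are hypothetical names invented for this illustration. They are not part of any real SDK.

```python
import json
import uuid
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One operation within a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


class Telemetry:
    """Hypothetical in-memory holder for the three telemetry types."""

    def __init__(self) -> None:
        self.request_count: Counter = Counter()  # metric: fixed-size aggregate
        self.log_lines: List[str] = []           # logs: full context per event
        self.spans: List[Span] = []              # traces: one record per operation

    def handle_request(self, path: str) -> str:
        """Serve a pretend request and emit a metric, a log line, and two spans."""
        trace_id = uuid.uuid4().hex
        root = Span(trace_id=trace_id, name=f"GET {path}")
        # A downstream call becomes a child span that points back at its parent.
        child = Span(trace_id=trace_id, name="db.query", parent_id=root.span_id)
        self.spans.extend([root, child])

        self.request_count[path] += 1
        # Putting the trace_id in the log line lets logs and traces be joined later.
        self.log_lines.append(
            json.dumps({"level": "info", "path": path, "trace_id": trace_id})
        )
        return trace_id
```

The cost difference described in the list shows up directly: the counter stays one entry per path no matter how many requests arrive, while the log and span lists grow with every request.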
OpenTelemetry Overview
OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.
Unlike backend solutions such as Jaeger or Prometheus, OpenTelemetry defines standard data formats and provides pluggable exporters, but does not include storage, query, or visualization components.
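The split between producing telemetry and storing it can be pictured with a small exporter pattern. The sketch that follows is not the OpenTelemetry SDK API. `Exporter`, `InMemoryExporter`, and `Pipeline` are hypothetical names used only to illustrate how one instrumentation layer can feed several interchangeable back-ends.

```python
from typing import Any, Dict, List, Protocol, Sequence


class Exporter(Protocol):
    """Anything that can receive a batch of telemetry records."""

    def export(self, records: List[Dict[str, Any]]) -> None: ...


class InMemoryExporter:
    """Stand-in for a real back-end; it just remembers what it was given."""

    def __init__(self) -> None:
        self.received: List[Dict[str, Any]] = []

    def export(self, records: List[Dict[str, Any]]) -> None:
        self.received.extend(records)


class Pipeline:
    """Buffers records and hands each batch to every configured exporter."""

    def __init__(self, exporters: Sequence[Exporter]) -> None:
        self._exporters = list(exporters)
        self._buffer: List[Dict[str, Any]] = []

    def record(self, record: Dict[str, Any]) -> None:
        self._buffer.append(record)

    def flush(self) -> int:
        """Send the buffered batch to all exporters; return how many records were sent."""
        batch, self._buffer = self._buffer, []
        if batch:
            for exporter in self._exporters:
                exporter.export(list(batch))
        return len(batch)
```

Swapping a back-end then means changing the exporter list, not the code that calls `record`.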
Observability in Kubernetes
Kubernetes components are distributed and declaratively managed, making observability more challenging than in VM environments. Effective observability must gather data from both the application layer and the control‑plane components.
Logs : Collectors such as Fluentd or Logstash can run as DaemonSets on each node and forward logs to back‑ends such as the Elastic Stack.
Metrics : Kubernetes exposes three metrics APIs: metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io. The Metrics Server implements the resource metrics API (metrics.k8s.io), while Prometheus Adapter can serve the custom and external APIs, enabling autoscaling based on these metrics.
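Traces : Service‑mesh solutions (e.g., Istio) can generate spans at the proxy layer without code‑level instrumentation, although applications still need to forward trace context headers so that spans join into one trace; for languages like Java, agents can emit OpenTelemetry‑compatible traces.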
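Spans produced by proxies or agents only join into a single trace if each service forwards the trace context it received. A minimal sketch of that forwarding step, assuming the W3C Trace Context `traceparent` header (version `00`) and lowercase header keys, might look as follows. The header set a given mesh actually uses depends on its configuration, and `parse_traceparent` and `outgoing_headers` are hypothetical helper names.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build the header for a downstream call, keeping the caller's trace_id."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        # No usable context: start a new trace (marking it sampled is an assumption).
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    # This service's own span id becomes the parent id seen by the next hop.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

The trace id stays constant across hops while the parent id changes at each one, which is what lets a back-end rebuild the request topology.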
Observability in Serverless
Serverless abstracts away infrastructure, which paradoxically reduces the visibility needed for troubleshooting. Nevertheless, observability still provides value by exposing topology, request context, performance bottlenecks, and optimization opportunities.
Collected telemetry can feed autoscaling (for example, the Kubernetes Horizontal Pod Autoscaler, HPA), predictive scaling, or AIOps use cases, which can reduce exposure to cold‑start latency and improve reliability.
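As one illustration of telemetry driving scaling, the Kubernetes documentation describes the Horizontal Pod Autoscaler as computing the desired replica count from the ratio of the current metric value to the target value, and skipping the change when the ratio is within a tolerance (0.1 by default). The sketch that follows reproduces that arithmetic. The function name and the `min_replicas`/`max_replicas` defaults are illustrative choices, not taken from TEM or from the controller's source.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """desired = ceil(current_replicas * current_value / target_value), clamped."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Close enough to the target: leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, four replicas averaging 200m CPU against a 100m target give a ratio of 2.0, so the sketch asks for eight replicas. Note that this calculation reacts to the metric already observed. A predictive scaler would replace `current_value` with a forecast.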
Practical Implementation: Tencent Cloud TEM
TEM (Tencent Cloud Elastic Microservice), Tencent Cloud’s serverless application platform, demonstrates concrete observability practices:
Image Build Observability
During container image construction, TEM records each step’s success and duration, enabling developers to pinpoint slow stages.
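Per-step durations in output like the build log that follows can be ranked mechanically. This is a minimal sketch, assuming each step reports a line of the form `#<n> DONE <seconds>s` and that the first other line for a step can serve as its label. `slowest_steps` is a hypothetical helper, not a TEM feature, and `SAMPLE` is an abbreviated excerpt of that log.

```python
import re
from typing import Dict, List, Tuple

_STEP = re.compile(r"^#(\d+) (.+)$")
_DONE = re.compile(r"^DONE (\d+(?:\.\d+)?)s$")

# Abbreviated excerpt of the build log shown in this section.
SAMPLE = """\
#15 importing cache manifest …
#15 DONE 0.8s
#19 DONE 0.0s
#16 exporting to image
#16 pushing layers 5.5s done
#16 DONE 7.1s
"""


def slowest_steps(log_text: str, top: int = 3) -> List[Tuple[str, float]]:
    """Return up to `top` (label, seconds) pairs, slowest step first."""
    names: Dict[str, str] = {}
    durations: Dict[str, float] = {}
    for line in log_text.splitlines():
        match = _STEP.match(line.strip())
        if match is None:
            continue
        step, rest = match.groups()
        done = _DONE.match(rest)
        if done:
            durations[step] = float(done.group(1))
        else:
            # The first non-DONE line of a step serves as its label.
            names.setdefault(step, rest)
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return [(names.get(step, f"#{step}"), secs) for step, secs in ranked[:top]]
```

Calling `slowest_steps(SAMPLE, top=1)` puts the image export step first, matching what a reader would conclude from scanning the `DONE` lines by eye.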
#5 [1/9] FROM ccr.ccs.tencentyun.com/tsf_build/tem-buildkit-war-open-base:8.5-jre8@sha256:…
#5 resolve ccr.ccs.tencentyun.com/tsf_build/tem-buildkit-war-open-base:8.5-jre8 done
#5 DONE 0.0s
#15 importing cache manifest …
#15 DONE 0.8s
…
#19 [auth] tem-100011913960-dsxh/svc-test-war-firstdeploy-kgqkyiqs:pull,push token for ccr.ccs.tencentyun.com
#19 DONE 0.0s
#16 exporting to image
#16 pushing layers 5.5s done
#16 pushing manifest … 1.4s done
#16 DONE 7.1s
Application Deployment Observability
TEM surfaces native Kubernetes logs and its own scheduling information, helping users diagnose issues such as missing images, quota limits, or invalid parameters.
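How TEM maps raw Kubernetes output to these diagnoses is not described here. As a rough sketch of the idea, common Kubernetes reason strings and quota messages can be grouped into the three categories named above. The mapping is illustrative and incomplete, and `classify` is a hypothetical helper.

```python
from typing import Dict

# Illustrative grouping of well-known Kubernetes container waiting reasons.
# Pull failures can also stem from registry authentication, not only a missing image.
_CATEGORIES: Dict[str, str] = {
    "ErrImagePull": "missing image",
    "ImagePullBackOff": "missing image",
    "InvalidImageName": "invalid parameter",
    "CreateContainerConfigError": "invalid parameter",
}


def classify(reason: str, message: str = "") -> str:
    """Map a Kubernetes reason and message to a coarse diagnosis category."""
    # Quota rejections typically surface in the event message rather than the reason.
    if "exceeded quota" in message.lower():
        return "quota limit"
    return _CATEGORIES.get(reason, "unknown")
```

A `FailedCreate` event whose message contains "exceeded quota" is reported as a quota limit, while anything the table does not recognize falls through to "unknown" so that it is not silently mislabeled.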
Canary Release : Small batch validation of a new version.
Batch Release : Rolling updates with optional manual or automatic triggers.
In‑Place Upgrade : Rolling updates that preserve instance IDs and IPs.
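A batch release with a manual or automatic trigger can be sketched as splitting instances into ordered batches and consulting a gate before each batch after the first. The names `plan_batches`, `rollout`, `upgrade`, and `approve` are hypothetical, and the sketch says nothing about how TEM implements its release strategies.

```python
from typing import Callable, List, Sequence


def plan_batches(instances: Sequence[str], batch_count: int) -> List[List[str]]:
    """Split instances into at most batch_count ordered, near-equal batches."""
    if batch_count < 1:
        raise ValueError("batch_count must be at least 1")
    batch_count = min(batch_count, len(instances)) or 1
    size, extra = divmod(len(instances), batch_count)
    batches: List[List[str]] = []
    start = 0
    for index in range(batch_count):
        # The first `extra` batches take one additional instance each.
        end = start + size + (1 if index < extra else 0)
        batches.append(list(instances[start:end]))
        start = end
    return batches


def rollout(
    instances: Sequence[str],
    batch_count: int,
    upgrade: Callable[[str], None],
    approve: Callable[[int, List[str]], bool],
) -> List[str]:
    """Upgrade batch by batch; stop at the first batch the gate rejects."""
    upgraded: List[str] = []
    for index, batch in enumerate(plan_batches(instances, batch_count)):
        # The gate models a manual click or an automatic health check.
        if index > 0 and not approve(index, batch):
            break
        for instance in batch:
            upgrade(instance)
            upgraded.append(instance)
    return upgraded
```

An automatic trigger corresponds to an `approve` function that checks health metrics from the previous batch, while a manual trigger corresponds to one that waits for an operator's decision.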
Integrated Cloud‑Product Observability
TEM connects with other Tencent Cloud services to provide a unified observability stack:
Logs : Tencent Cloud CLS offers a one‑stop log collection, storage, and analysis solution.
Metrics : Integrated Cloud Monitoring and APM deliver comprehensive metrics, including JVM and request‑level data.
Traces : Java‑agent‑based, non‑intrusive tracing presents full request lifecycles for root‑cause analysis.
Conclusion
Micro‑service, container, and cloud‑native technologies bring powerful capabilities but also increase system complexity. Focusing on Day‑2 observability—collecting, standardizing, and visualizing logs, metrics, and traces—enables reliable operation, faster debugging, and better resource utilization in both Kubernetes and serverless environments.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.