Tagged articles
969 articles
Baidu Geek Talk
Jun 30, 2021 · Operations

How Baidu Achieves 5‑9+ Availability: Inside Its Stability Engineering and Observability

This article dissects Baidu Search's ultra‑large micro‑service architecture, detailing the challenges of maintaining five‑nine‑plus availability, the diverse failure modes, and the step‑by‑step evolution of its observability stack—from early log‑only analysis to the kepler1.0/kepler2.0 tracing, full‑log indexing, custom span‑id generation, and compression techniques that together enable rapid root‑cause diagnosis at massive scale.

Baidu Search · Distributed Tracing · Metrics
21 min read
Alibaba Cloud Native
Jun 28, 2021 · Cloud Native

How Chanjet Scaled SaaS for 1.3M SMEs with Cloud‑Native Architecture

Chanjet transformed its monolithic SaaS platform for millions of small‑business customers by adopting a cloud‑native, container‑based micro‑service architecture, enabling elastic scaling, reduced operational costs, unified data services, automated DevOps pipelines, and comprehensive observability across front‑end, back‑end, and infrastructure layers.

DevOps · Microservices · Observability
27 min read
Tencent Cloud Developer
Jun 28, 2021 · Cloud Native

Effective Service Governance for Serverless: Challenges and Solutions

Effective serverless governance requires comprehensive observability, traffic management, and service registration built on Kubernetes, using either a mesh sidecar with Istio or an embedded SDK, to simplify complex operational tasks such as discovery, fault tolerance, gray releases, and metric correlation for large‑scale function deployments.

Cloud Native · Observability · Operations
17 min read
Full-Stack Internet Architecture
Jun 19, 2021 · Operations

Solving Monitoring Pain Points: Unified Framework, Alert Prioritization, and Classification

The article discusses common monitoring challenges such as fragmented tooling and noisy alerts, and proposes solutions including consolidating to a single monitoring framework, prioritizing runtime exceptions, and classifying business alerts with codes and trace information to improve incident response.

Alerting · Observability · best-practices
6 min read
IT Architects Alliance
Jun 19, 2021 · Operations

Reference Architecture for Digital Transformation Platforms

The article outlines a comprehensive reference architecture for digital transformation platforms, detailing typical organizational contexts, desired outcomes, and key components such as integration layers, API gateways, IAM, BPM, observability, multi‑region deployment, and development practices to enable seamless, secure, and scalable business services.

Digital Transformation · IAM · Integration
10 min read
58 Tech
Jun 11, 2021 · Frontend Development

Beidou Frontend Monitoring System: Architecture, Challenges, and Solutions

The article details the design, architecture, and operational challenges of the Beidou frontend monitoring platform at 58 Group, covering SDK management, behavior trace logging, front‑back link integration, performance optimizations, minute‑level alerting, and permission management.

Alerting · Architecture · Frontend
22 min read
Liulishuo Tech Team
Jun 2, 2021 · Backend Development

Understanding Distributed Tracing and Its Use at Liulishuo

This article explains what distributed tracing is, why it is needed alongside logging and metrics for observability, how it works with trace and span IDs, and describes Liulishuo's implementation using OpenTelemetry, W3C Trace Context, and tail‑based sampling to improve backend debugging.

Distributed Tracing · Microservices · Observability
9 min read
Baidu Geek Talk
May 31, 2021 · Cloud Native

Adoption of Service Mesh (Istio) at Baidu iFanFan: Challenges, Migration Strategy, and Benefits

Baidu iFanFan migrated all its Java‑based services to a native Kubernetes + Istio service mesh within three months, replacing fragmented, manual governance with automated rate‑limiting, canary releases, chaos testing and observability, which cut governance cycles from months to minutes, reduced CI time by ~20% and dramatically improved system stability and multi‑cloud readiness.

Cloud Native · Istio · Kubernetes
21 min read
Amap Tech
May 28, 2021 · Operations

System Observability Practices in Gaode Ride-Hailing: From Unified Logging to Fault Defense

Gaode Ride‑Hailing created a comprehensive 360° observability platform—standardized logging, distributed tracing, multi‑domain metrics, visual dashboards, and an incident workflow—that transforms raw data into actionable insights, accelerates root‑cause analysis, and enables automated fault defense for its large‑scale cloud‑native microservice system.

Distributed Systems · Observability · fault tolerance
22 min read
21CTO
May 27, 2021 · Cloud Native

Mastering Cloud‑Native Architecture: Practical Steps to Transform SaaS on Alibaba Cloud

This article explains what cloud‑native architecture is, why it is essential for modern SaaS businesses, and provides a step‑by‑step guide—including maturity models, serverless migration, namespace and application setup, load‑balancer binding, service/configuration centers, elasticity, observability, resilience, and automation—using Alibaba Cloud SAE and MSE services.

Alibaba Cloud · Cloud Native · Microservices
23 min read
New Oriental Technology
May 24, 2021 · Operations

Overview of SkyWalking UI: Dashboard, Topology, Tracing, Profiling, and Alerts

The article provides a comprehensive English overview of SkyWalking UI, detailing its dashboard metrics, topology visualization, trace analysis, performance profiling workflow, and alarm management, illustrating how the tool monitors microservice and cloud‑native environments with metrics such as throughput, latency, Apdex, and JVM statistics.

APM · Distributed Tracing · Observability
11 min read
DevOps
May 17, 2021 · Cloud Native

Challenges of Testing Cloud‑Native Applications and the Need for New Approaches

Amid accelerating Agile and DevOps adoption, the rapid delivery of cloud‑native microservices introduces cascading risks and makes traditional monolithic testing inadequate, prompting a move toward observability‑driven “shift‑right” testing, exploratory methods, and chaos engineering to embrace failure as the new normal.

DevOps · Microservices · Observability
8 min read
Open Source Linux
May 6, 2021 · Cloud Native

Why Loki Beats ELK for Cloud‑Native Log Management

This article explains how Loki, a lightweight, Prometheus‑compatible logging system, addresses the high resource cost, complexity, and operational overhead of traditional ELK/EFK stacks by using label‑based indexing, efficient compression, and scalable architecture for container‑cloud environments.

Cloud Native · ELK alternative · Log Management
7 min read
DevOps
May 6, 2021 · Cloud Native

Testing Strategies for Cloud‑Native Applications

The article explains how traditional testing falls short for cloud‑native, microservice‑based applications and outlines modern strategies—including unit, integration, contract, non‑functional, chaos engineering, and observability techniques—to ensure quality, resilience, and rapid delivery in dynamic cloud environments.

Microservices · Observability · chaos engineering
11 min read
Architects Research Society
Apr 30, 2021 · Operations

Health Management and Diagnostics in Microservices

The article explains how microservices can achieve resilience through health reporting, diagnostics, standardized logging, health‑check implementations, and orchestrator coordination to detect failures, restart services, handle upgrades, and recover from partial cloud‑based failures.

Observability · Orchestration · Resilience
9 min read
Java Architecture Diary
Apr 19, 2021 · Operations

Why Loki Is the Lightweight, Scalable Log Solution You Need Over EFK

This article introduces Loki, Grafana’s lightweight, horizontally scalable log aggregation system, compares it with the EFK stack, explains Promtail, LogQL query language, alerting, and how Loki integrates with Grafana and Prometheus for unified metrics and logs, highlighting its low‑resource, cloud‑native advantages.

Cloud Native · Loki · Observability
8 min read
Efficient Ops
Apr 18, 2021 · Operations

How to Build a Scalable Prometheus Monitoring System with Thanos on Kubernetes

This article explains why monitoring is essential for production stability, compares white‑box and black‑box approaches, details the advantages of Prometheus, walks through its architecture, metric types, query language, high‑availability strategies with Thanos, and provides practical Kubernetes deployment manifests and configuration tips.

DevOps · Kubernetes · Observability
21 min read
MaGe Linux Operations
Apr 3, 2021 · Operations

Designing a Scalable, High‑Availability Monitoring System with Prometheus & Thanos

This article explores the challenges of building a reliable monitoring platform, compares open‑source solutions such as Elasticsearch, Nagios, Zabbix and Prometheus, and details how to achieve high availability and horizontal scaling using Prometheus, Thanos, sharding, remote‑write, and Kubernetes orchestration.

Observability · Thanos · high availability
22 min read
21CTO
Mar 22, 2021 · Cloud Native

How to Implement Cloud‑Native Architecture with SAE: A Step‑by‑Step Guide

This article explains why modern enterprises need cloud‑native architecture, introduces the SESORA maturity model, and provides a detailed, practical walkthrough of deploying a cloud‑native application on Alibaba Cloud SAE, covering namespace creation, app configuration, SLB binding, service discovery, elasticity, observability, resilience, and automation.

Deployment · Microservices · Observability
23 min read
Big Data Technology & Architecture
Mar 13, 2021 · Operations

Comprehensive Guide to Monitoring: Objectives, Methods, Tools, and Best Practices

This article provides an in‑depth overview of monitoring, covering its purpose, key objectives, practical methods, core processes, a detailed comparison of popular monitoring tools such as Zabbix and Prometheus, and best‑practice recommendations for building scalable, reliable, and intelligent monitoring platforms.

Infrastructure · Observability · Operations
42 min read
Node Underground
Mar 12, 2021 · Cloud Native

How Alinode Boosts Node.js Observability and Scheduling in the Cloud‑Native Era

Alinode expands its Node.js performance diagnostics into a full‑stack observability and scheduling platform for serverless workloads, offering traffic monitoring, white‑screen logs, remote debugging, crash analysis, standardized metrics, and a cloud‑native runtime that balances cost and performance.

Cloud Native · Node.js · Observability
11 min read
Alibaba Terminal Technology
Mar 12, 2021 · Cloud Native

How Alinode Boosts Node.js Observability & Scheduling in Serverless Cloud Native Era

This article outlines how Alinode has evolved from a Node.js performance diagnostic tool into a comprehensive observability and scheduling platform for serverless environments, detailing its Insight monitoring features, remote debugging, crash analysis, standardization efforts, and runtime optimizations that improve cost and performance.

Alinode · Cloud Native · Node.js
12 min read
iQIYI Technical Product Team
Mar 12, 2021 · Operations

Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform

To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus and CAT, selected CAT, deployed separate mainland and overseas clusters, added configurable access, health‑check and integrated alert channels, enabling five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.

Alerting · CAT · DevOps
12 min read
dbaplus Community
Feb 25, 2021 · Operations

How Distributed Tracing Solves Microservice Performance Bottlenecks with SkyWalking

This article explains the principles of distributed tracing, the OpenTracing standard, SkyWalking's architecture and sampling strategies, and shares a company's practical customizations—including forced sampling, fine‑grained group sampling, log4j traceId injection, and self‑developed plugins—to help pinpoint performance issues in microservice environments.

Distributed Tracing · Observability · OpenTracing
17 min read
Didi Tech
Feb 25, 2021 · Industry Insights

Why DiDi’s Obsuite Is Redefining Hybrid‑Cloud Observability

Obsuite, DiDi’s open‑source observability suite, tackles hybrid‑cloud monitoring challenges by combining metrics, logs, and traces, while the article analyzes market trends, private‑cloud demand, and the product’s architecture, open‑source components, and the OCE certification program for enterprise users.

Log Management · Metrics · Observability
6 min read
Efficient Ops
Feb 22, 2021 · Operations

Why Does Prometheus Sometimes Fail to Trigger Alerts? Explained

Prometheus alerts may not fire even when metrics exceed thresholds due to the ‘for’ pending duration, sparse sampling, and Grafana’s range queries, and this article explains the underlying mechanisms, illustrates common pitfalls with diagrams, and offers practical strategies to diagnose and resolve missing or unexpected alerts.

Grafana · Observability · Prometheus
6 min read
DevOps Coach
Feb 9, 2021 · Operations

Master Elastic Observability: Build a Full‑Stack Monitoring Platform in Half a Day

This workshop guides participants from installing a single‑node Elastic Stack to deploying a cloud‑native observability platform for a multi‑tier pet‑store application, covering health checks, metrics, logs, APM tracing, SLO/SLI setup, and custom dashboards across local, AWS, and Tencent Cloud environments.

Cloud Native · Elastic Stack · Observability
7 min read
21CTO
Feb 3, 2021 · Operations

Bridging Product Development and SRE: How to Ensure Stability Across the Software Lifecycle

This article explains the role of Site Reliability Engineering (SRE) in bridging product and foundational technology development, outlines the software lifecycle, describes how SRE ensures system stability through controllability, observability, and protection, and provides practical best‑practice checklists and maturity levels for evaluating and improving reliability.

Observability · Operations · SRE
13 min read
JavaEdge
Feb 2, 2021 · Cloud Native

Why Istio Is the Go-To Service Mesh for Modern Microservices

Istio is a fully open‑source service‑mesh platform that adds a transparent control plane to existing distributed applications, enabling traffic routing, access policies, telemetry, security, and observability without code changes, and it offers features such as virtual services, destination rules, gateways, sidecar configuration, fault injection, retries, timeouts, metrics, logging and distributed tracing.

Istio · Kubernetes · Observability
14 min read
dbaplus Community
Feb 1, 2021 · Operations

How to Build a Low‑Cost Distributed Tracing System for Microservices

This article explains the evolution from a monolithic architecture to microservices, outlines the new pain points such as fault isolation, performance bottlenecks and scaling inefficiencies, and presents a practical, low‑cost distributed tracing solution with unified frameworks, components, configuration management, data collection, and visualization.

Configuration Management · Distributed Tracing · Observability
31 min read
DevOps Cloud Academy
Jan 25, 2021 · Cloud Native

Blackbox Monitoring with Prometheus Blackbox Exporter in Kubernetes

This guide explains how to complement Prometheus white‑box monitoring with black‑box probes by deploying the Blackbox Exporter in a Kubernetes cluster, configuring ConfigMaps, Deployments, Services, and Prometheus scrape jobs for HTTP, DNS, TCP, and ICMP checks, and using annotations for automatic service discovery.

Blackbox Exporter · Observability · Prometheus
10 min read
Efficient Ops
Jan 19, 2021 · Operations

How SRE Bridges Development and Operations to Boost System Reliability

This article explores the role of Site Reliability Engineering (SRE) as a bridge between product development and operations, detailing its responsibilities, core principles, lifecycle perspective, stability value, and practical frameworks for controllability, observability, and best‑practice implementation to enhance system reliability.

Observability · SRE · reliability engineering
13 min read
High Availability Architecture
Jan 19, 2021 · Cloud Native

Key Considerations for Building a Cloud‑Native Architecture

The article outlines the principles and practical considerations of cloud‑native architecture, covering platform‑agnostic design, container and Kubernetes foundations, microservice decomposition, CI/CD pipelines, monitoring, tracing, logging, and fault‑tolerant high‑availability strategies for building resilient distributed systems.

CI/CD · Microservices · Observability
13 min read
Architects' Tech Alliance
Jan 16, 2021 · Cloud Native

Understanding Cloud‑Native Architecture and Its Key Patterns

The article explains cloud‑native architecture as a set of principles and design patterns that offload non‑functional concerns to cloud services, and it details major patterns such as service‑oriented, mesh, serverless, storage‑compute separation, distributed transactions, observability, and event‑driven architectures.

Event-driven · Microservices · Observability
10 min read
Programmer DD
Jan 15, 2021 · Operations

Why Does Prometheus Sometimes Fail to Trigger Alerts?

This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the ‘for’ parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.

Alerting · Grafana · Observability
7 min read
Efficient Ops
Jan 11, 2021 · Operations

Unlocking Prometheus: How TSDB Powers Scalable Monitoring and Fast Queries

This article demystifies Prometheus by explaining its core concepts, daily monitoring queries, the role of its TSDB storage engine, how series, label, and time indexes enable fast time‑series queries, and how pre‑computed recording rules boost performance for dashboards and alerts.

Observability · Prometheus · TSDB
8 min read
Programmer DD
Jan 3, 2021 · Cloud Native

5 Must-Watch Open-Source Kubernetes Projects Shaping 2021

Discover five emerging open-source Kubernetes projects—including Quarkus, OpenTelemetry, Argo CD, Envoy/Contour, and OKD 4—that are driving cloud-native innovation in 2021 by enhancing Java workloads, observability, GitOps, traffic management, and developer tooling, and simplifying deployment pipelines.

Cloud Native · DevOps · GitOps
7 min read
Architect
Jan 2, 2021 · Operations

Layered Architecture of Microservice Monitoring and Key Practices

This article explains the layered architecture of microservice monitoring, detailing five monitoring levels—from infrastructure to end-user experience—along with essential monitoring points such as logs, metrics, tracing, alerts, and health checks, and presents a typical monitoring stack using agents, Kafka, ELK, and InfluxDB.

Metrics · Observability · Operations
6 min read
Cloud Native Technology Community
Dec 30, 2020 · Operations

Lessons Learned from Two Years of Running Kubernetes in Production

This article recounts a two‑year journey of migrating from Ansible‑managed EC2 deployments to Kubernetes, detailing the motivations, migration strategy, operational challenges, tooling choices, resource management, security, cost considerations, and the development of custom controllers and CRDs to run production workloads reliably.

CI/CD · DevOps · Infrastructure
18 min read
Architect
Dec 23, 2020 · Operations

Design and Evaluation of Log Collection Agents: Flume vs Filebeat

This article analyses the shortcomings of traditional log‑collection agents, compares Flume and Filebeat based on low‑cost, stability, efficiency and lightweight criteria, and presents practical solutions for file discovery, offset tracking, multi‑line handling and performance tuning in modern logging pipelines.

Agent Design · Flume · Observability
13 min read
JD Cloud Developers
Dec 17, 2020 · Backend Development

How Loki Cuts Log Storage Costs While Integrating Deeply with Prometheus

This article explains Loki's origins, data model, LogQL query language, low‑cost storage design, and the full read‑write architecture—including Distributor, Ingester, Querier, and QueryFrontend—showing how it solves the shortcomings of traditional Elasticsearch‑based logging solutions and integrates tightly with Prometheus monitoring.

LogQL · Loki · Observability
21 min read
Top Architect
Dec 14, 2020 · Cloud Native

Lessons Learned from Two Years of Production Kubernetes at Grofers

This article recounts Grofers' two‑year journey migrating from Ansible‑managed EC2 instances to Kubernetes, detailing the motivations, migration strategy, operational challenges, observability choices, CI/CD tooling, resource management, security practices, cost considerations, and the overall impact on development velocity and platform stability.

CI/CD · Cloud Native · DevOps
20 min read
21CTO
Dec 10, 2020 · Operations

How Netflix’s Telltale Transforms Application Monitoring and Incident Response

This article explains how Netflix built the Telltale monitoring system to consolidate data sources, provide multidimensional health assessments, deliver intelligent alerts, and streamline incident management for over 100 production applications, reducing on‑call fatigue and improving service reliability.

Netflix · Observability · incident response
14 min read
Yanxuan Tech Team
Dec 8, 2020 · Cloud Native

How Yanxuan Scaled to 1,000 Services with a Cloud‑Native Platform

Facing rapid growth in 2019, Yanxuan partnered with NetEase Qingzhou to co‑build a cloud‑native platform, detailing a multi‑stage migration that standardized services, reduced code changes, enhanced high‑availability, optimized performance, and improved observability, ultimately supporting over 300 cloud‑migrated services and boosting development efficiency by more than 200%.

Cloud Native · Containerization · DevOps
13 min read
Efficient Ops
Nov 25, 2020 · Operations

How to Build a Scalable, Highly‑Available Prometheus Monitoring Stack with Thanos

This article explains why standard Prometheus HA solutions fall short for large, multi‑region deployments, and walks through using Thanos—its components, configuration, and best‑practice tips—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive monitoring across 300+ clusters.

Kubernetes · Observability · Prometheus
24 min read
Programmer DD
Nov 21, 2020 · Operations

When to Use Monitoring, Tracing, or Logging? A Practical Guide

This article explains the distinct purposes and characteristics of monitoring, tracing, and logging in system design, compares their typical toolchains such as Prometheus, Jaeger, and ELK, and clarifies when each component is necessary for effective observability.

ELK · Observability · jaeger
7 min read
vivo Internet Technology
Nov 18, 2020 · Cloud Native

vivo Distributed Tracing System Agent Technology Principles and Practical Experience

Launched in 2017, the vivo distributed tracing system uses a JavaAgent‑based micro‑kernel architecture, with ByteBuddy for non‑intrusive bytecode instrumentation, a lock‑free Disruptor queue, and Kafka to capture Trace/Span data, including cross‑thread propagation, while employing sampling, degradation, and JVM metrics to ensure 94% adoption stability.

Disruptor · Distributed Tracing · JavaAgent
23 min read
Programmer DD
Nov 17, 2020 · Cloud Native

What Is Cloud Native? Core Concepts, Technologies, and Benefits Explained

This article defines cloud native as an optimal, low‑overhead approach to designing software that lives in the cloud, outlines its key technology domains—including containers, Kubernetes, service mesh, observability, and serverless—and explains why evolving infrastructure to the cloud brings consistency, scalability, and immutable deployment advantages.

Container Technology · Infrastructure as Code · Kubernetes
6 min read
DevOps
Nov 16, 2020 · Cloud Native

Key Principles and Trends in Cloud‑Native Software Architecture

This article explores cloud‑native software architecture, covering the 12‑factor app foundation, loose‑coupled design, API‑first and SOLID principles, event‑driven and service‑mesh patterns, observability, serverless runtimes, and emerging technologies such as Dapr, GraalVM and WebAssembly.

Dapr · Microservices · Observability
29 min read
Java Backend Technology
Nov 8, 2020 · Operations

How Distributed Tracing with SkyWalking Solves Microservice Performance Challenges

This article explains the principles, architecture, and practical adoption of distributed tracing—covering OpenTracing standards, SkyWalking's design, sampling strategies, plugin development, and real‑world company practices—to help engineers pinpoint bottlenecks and improve observability in microservice systems.

Distributed Tracing · Microservices · Observability
17 min read
System Architect Go
Nov 7, 2020 · Operations

Request Log Analysis System: Collected Fields, Derived Data, and Metrics

This article outlines a request log analysis system that records core request fields, adds proxy‑related data, derives IP‑based ASN and geographic information, parses user‑agent details, and provides comprehensive metrics such as PV/QPS, UV, traffic, latency, status monitoring, and business‑specific insights, all visualized via an ELK‑Kafka architecture.

Backend · ELK · Kafka
5 min read
Programmer DD
Nov 7, 2020 · Operations

Loki 2.0.0 Unveiled: Transforming Log Observability for Kubernetes

Loki 2.0.0 introduces major enhancements such as a revamped LogQL pipeline, native Prometheus‑style alerts, and simplified storage with boltdb‑shipper, delivering a more resource‑efficient, scalable log aggregation solution for Kubernetes environments.

Kubernetes · LogQL · Loki
3 min read
Efficient Ops
Nov 3, 2020 · Operations

How to Build a Scalable Prometheus Monitoring System with Thanos on Kubernetes

This article explains why monitoring is essential, compares white‑box and black‑box approaches, details Prometheus features, metric naming, query language, high‑availability challenges, and shows how to extend Prometheus with Thanos, Pushgateway, Alertmanager, and Kubernetes deployments for a robust observability stack.

Alertmanager · Kubernetes · Observability
20 min read
Alibaba Cloud Developer
Oct 11, 2020 · Operations

How Alibaba’s SLS Powers a Unified Observability Platform for Massive Data

Alibaba Cloud’s Log Service (SLS) has evolved into a unified observability middle‑platform that handles tens of petabytes daily, offering integrated storage, processing, and AI‑driven analysis for logs, metrics, and traces, while addressing challenges of data ingestion, performance, and scalability across diverse Ops scenarios.

Big Data · Log Analytics · Observability
0 likes · 16 min read
Full-Stack Internet Architecture
Sep 22, 2020 · Operations

Design and Implementation of a Distributed Call‑Chain Tracing System for Microservices

This article explains how to design a non‑intrusive distributed tracing system for microservices by assigning global TraceIDs, generating hierarchical SpanIDs, using lightweight agents to propagate identifiers via transport headers, and aggregating data in a collector to visualize complete call graphs and diagnose performance issues.

Distributed Tracing · Microservices · Observability
0 likes · 6 min read
Full-Stack Internet Architecture
Sep 17, 2020 · Operations

Understanding Distributed Tracing and SkyWalking: Principles, Architecture, and Practical Implementation

This article explains the fundamentals of distributed tracing, the OpenTracing standard, and how SkyWalking implements automatic span collection, cross‑process context propagation, unique traceId generation, sampling strategies, performance benchmarks, and real‑world adaptations within a micro‑service environment.

Distributed Tracing · Microservices · Observability
0 likes · 16 min read
Didi Tech
Aug 30, 2020 · Cloud Native

Didi's Seven‑Layer Access Platform: Service Governance, Stability Practices, and Cloud‑Native Exploration

Didi’s Seven‑Layer Access Platform, handling millions of QPS and hundreds of billions of daily requests across thousands of services, provides ultra‑stable, sub‑millisecond routing through Nginx‑based data and control planes, advanced service discovery, rate‑limiting, observability, zero‑risk change controls, and is now evolving toward a cloud‑native, mesh‑enabled sidecar architecture.

Cloud Native · Observability · high availability
0 likes · 16 min read
Efficient Ops
Aug 25, 2020 · Operations

How to Build an Enterprise‑Grade Observability System and Master Incident Response

This article explains how enterprises adopting SRE can design a comprehensive observability platform—covering metrics, logs, and tracing—while also detailing effective incident response, post‑mortem practices, testing, capacity planning, automation tool development, and user‑experience focus to improve overall operational reliability.

Observability · Operations · SRE
0 likes · 17 min read
Java Architecture Diary
Aug 24, 2020 · Backend Development

Why Is Spring Boot Admin’s HTTP Trace Missing? How to Restore It

This article explains why the HTTP trace feature disappears in Spring Boot Admin after version 2.2.x, walks through the investigation showing that InMemoryHttpTraceRepository is no longer enabled by default, and recommends third‑party monitoring tools such as Prometheus with Grafana instead.

Grafana · HTTP Trace · Observability
0 likes · 3 min read
DevOps
Aug 13, 2020 · Operations

ByteDance’s Chaos Engineering Journey: Practices, Architecture, and Future Directions

This article outlines ByteDance’s adoption of chaos engineering, describing its background, industry examples, the evolution of internal fault‑injection platforms across three generations, the fault model and center design, experiment principles, and future plans for infrastructure‑level chaos and automated diagnostics.

Distributed Systems · Fault Injection · Observability
0 likes · 21 min read
Programmer DD
Aug 6, 2020 · Operations

Why SkyWalking’s Architecture Makes Modern Observability Seamless

This article explains SkyWalking’s modular, protocol‑oriented and lightweight architecture, its core components, design principles, and advantages such as cross‑environment consistency, easy maintenance, high performance, and extensibility for both traditional and cloud‑native systems.

APM · Apache SkyWalking · Cloud Native
0 likes · 12 min read
dbaplus Community
Aug 3, 2020 · Operations

How iQIYI Built a Full‑Link Automated Monitoring Platform for Microservices

iQIYI’s tech product team designed a unified full‑link automated monitoring platform that integrates link, metric, and log collection with deep analysis, enhancing fault localization, performance insight, and scalability across microservices, while addressing limitations of existing tools like ELK, Prometheus, and Dapper.

Metrics · Observability · full‑link
0 likes · 15 min read
Aikesheng Open Source Community
Jul 29, 2020 · Operations

Understanding Prometheus Exporters: Operation Modes, Data Format, and a Go Implementation Example

This article explains the purpose and operation modes of Prometheus exporters, details the text-based metric exposition format including HELP, TYPE, and sample lines for counters, gauges, summaries, and histograms, and provides a complete Go example showing how to build, run, and expose a custom exporter with Prometheus client libraries.

Golang · Metrics · Observability
0 likes · 11 min read
Java Backend Technology
Jul 5, 2020 · Cloud Native

Why Loki Beats ELK for Cloud‑Native Log Management: Architecture and Benefits

This article explains the motivations behind choosing Loki over traditional ELK/EFK stacks for container‑cloud logging, outlines its cost‑effective design, describes its simple architecture and components such as Distributor, Ingester, and Querier, and highlights its scalability and seamless integration with Prometheus.

ELK alternative · Loki · Observability
0 likes · 8 min read
Efficient Ops
Jun 28, 2020 · Operations

How Observability Redefines Modern Monitoring: Metrics, Logs, Tracing, Events

Modern monitoring has evolved into comprehensive observability, encompassing metrics, logging, tracing, and events, and requires specialized storage solutions for each data type; this article explores the origins, key concepts, and design considerations for building effective observability systems in today's complex internet engineering landscape.

Events · Observability · tracing
0 likes · 9 min read
Programmer DD
Jun 15, 2020 · Cloud Native

Why Envoy Is the Go-To L7 Proxy for Modern Cloud‑Native Architectures

This article explains how Envoy, a lightweight high‑performance L7 proxy and communication bus, provides non‑intrusive sidecar architecture, multi‑layer networking, HTTP/2 support, dynamic configuration, gRPC and special protocol handling, and built‑in observability for cloud‑native systems.

Cloud Native · Envoy · L7 Proxy
0 likes · 5 min read
Cloud Native Technology Community
Jun 3, 2020 · Cloud Native

10 Common Istio Pitfalls and How to Resolve Them

This article outlines ten frequent Istio problems, ranging from service port naming constraints and flow‑control ordering to mTLS‑induced connection drops, and explains their root causes, diagnostic steps, and practical best‑practice solutions for reliable mesh deployments.

Istio · Kubernetes · Observability
0 likes · 17 min read
Alibaba Cloud Developer
Jun 3, 2020 · Cloud Native

Why Containers Are Revolutionizing Cloud‑Native Architecture

This article explains how container technology, inspired by shipping containers, transforms software delivery with modular, lightweight virtualization, and how Alibaba Cloud’s container services—ACK, ASK, ACR, and ASM—provide agile, elastic, portable, and secure cloud‑native solutions for hybrid and multi‑cloud environments.

Alibaba Cloud · Cloud Native · Containers
0 likes · 22 min read
Cloud Native Technology Community
May 25, 2020 · Cloud Native

Istio 1.6 Release Highlights: Simplified Installation, Enhanced Lifecycle Experience, Observability, VM Support, and Network Improvements

The Istio 1.6 release introduces a fully migrated Istiod architecture, streamlined installation and upgrade processes, expanded observability features, native support for virtual‑machine workloads via WorkloadEntry, and several network enhancements including improved secret handling and experimental Service API support.

Cloud Native · Istio · Kubernetes
0 likes · 5 min read
Yanxuan Tech Team
May 25, 2020 · Operations

How NetEase Cloud Music Built a Scalable Full‑Link Tracing System for Real‑Time Service Diagnosis

This article details the design, implementation, and evolution of NetEase Cloud Music's full‑link tracing platform, covering its motivations, architecture, low‑overhead data collection, multi‑dimensional analysis, service grooming, automated diagnosis, and future plans for AI‑driven anomaly detection and big‑data processing.

Distributed Systems · Observability · service monitoring
0 likes · 19 min read
Efficient Ops
May 17, 2020 · Operations

How EMonitor Outperforms CAT: Deep Dive into Modern Monitoring Architecture

EMonitor, Ele.me’s unified monitoring platform, extends CAT’s concepts with real‑time 10‑second aggregation, richer metric types, advanced dashboards, and seamless integration across IaaS, PaaS, and application layers, illustrating the evolution from log‑based monitoring to a comprehensive, proactive observability system.

CAT · EMonitor · Observability
0 likes · 15 min read
Efficient Ops
May 11, 2020 · Operations

How Nightingale Transforms Monitoring for Scalable Stability

This article introduces Didi's open‑source monitoring system Nightingale, detailing its design, architecture, key improvements over Open‑Falcon, and how its flexible alerting and data handling capabilities support the full lifecycle of stability engineering in large‑scale operations.

Alerting · DevOps · Observability
0 likes · 23 min read
DataFunTalk
Apr 27, 2020 · Operations

ByteDance’s Chaos Engineering Practice and Platform Evolution

This article describes ByteDance’s multi‑generation chaos engineering practice, covering industry background, fault‑injection models, the design of a declarative fault‑center, experiment selection principles, detailed experiment processes, metric classifications, red‑blue war‑game workflows, strong/weak dependency analysis, and future directions for infrastructure‑level chaos engineering.

Fault Injection · Observability · Reliability
0 likes · 21 min read
Tencent Cloud Middleware
Apr 16, 2020 · Cloud Native

How Tencent’s TSF Mesh Overcame Real‑World Service Mesh Challenges

This article examines the evolution of Tencent's TSF Mesh Service Mesh platform, detailing its architecture, the technical hurdles faced when supporting heterogeneous environments, multi‑tenant isolation, DNS and Spring Cloud interoperability, and the solutions implemented to achieve robust, cloud‑native service governance.

Cloud Native · Istio · Kubernetes
0 likes · 18 min read
Cloud Native Technology Community
Apr 8, 2020 · Operations

Decoding Thanos Architecture: From Query to Compact for Scalable Monitoring

This article provides a detailed analysis of Thanos' architecture, explaining each core component—Query, Sidecar, Store Gateway, Ruler, Compact, and the upcoming Receiver—how they enable global view, high availability, and long‑term storage for distributed Prometheus deployments, and discusses design trade‑offs and optimization strategies.

Cloud Native · Long‑term Storage · Observability
0 likes · 12 min read
360 Quality & Efficiency
Apr 3, 2020 · Operations

Prometheus Monitoring System: Concepts, Architecture, and Hands‑On Deployment with Node Exporter and Grafana

This article introduces the core concepts and architecture of the open‑source Prometheus monitoring system, explains its data model and metric types, and provides a step‑by‑step guide to install a Prometheus server, collect host metrics with Node Exporter, and visualize them using Grafana.

Grafana · Metrics · Observability
0 likes · 10 min read
Full-Stack DevOps & Kubernetes
Mar 30, 2020 · Cloud Native

Understanding Istio 1.5: Architecture, New Features, and Installation Guide

This article explains what Istio is, outlines the major updates in version 1.5—including the unified istiod control plane, WebAssembly extensibility, simplified installation, and improved observability—describes core control‑plane components, and provides step‑by‑step instructions for preparing a Kubernetes cluster and installing Istio.

Cloud Native · Istio · Kubernetes
0 likes · 10 min read