Tagged articles

Observability

1054 articles · Page 9 of 11

Apr 27, 2022 · Industry Insights

How JD Achieves Seamless Stability During Massive Sales Events

The article reviews the Global Information System Stability Summit and JD's technical architect Li Junliang's detailed case study on the engineering practices, observability, chaos engineering, and resource‑scheduling innovations that enable JD’s e‑commerce platform to handle sales‑peak traffic that spikes hundreds of times over normal load.

Industry Insights · Observability · Resource Scheduling

0 likes · 7 min read

Top Architect

Apr 27, 2022 · Backend Development

Comprehensive Guide to Backend Architecture: Microservices, Service Mesh, Observability, and Messaging

This article provides an in‑depth overview of modern backend architecture, covering microservice fundamentals, service mesh concepts, observability pillars, messaging queue choices, and practical design considerations such as service registration, configuration centers, and security mechanisms.

Observability · backend-architecture · messaging

0 likes · 28 min read

Volcano Engine Developer Services

Apr 26, 2022 · Operations

How Volcano Engine’s TLS Transforms Log Management for Kubernetes at Scale

This article explains the challenges of traditional open‑source log collection in cloud‑native environments, describes Volcano Engine’s unified TLS architecture with its centralized configuration and CRD‑based deployment, and presents real‑world case studies that demonstrate improved availability, efficiency, and scalability.

Kubernetes · Observability · cloud-native

0 likes · 15 min read

dbaplus Community

Apr 25, 2022 · Operations

From Monitoring to Observability: Expert Insights on Evolving Cloud‑Native Operations

In this interview series, three industry experts explain how monitoring differs from observability, the shifts required for ops, developers, and architects, the core methodologies and technologies behind metrics, traces, and logs, and practical guidance for selecting and integrating observability tools in cloud‑native environments.

Observability · Operations · cloud-native

0 likes · 16 min read

MaGe Linux Operations

Apr 22, 2022 · Backend Development

Essential Microservice Patterns: Decomposition, Integration & Observability

This article outlines the key microservice design patterns—including decomposition, integration, event‑driven, saga, and observability techniques—while explaining their goals, principles, and practical considerations such as database per service, CQRS, and cross‑cutting concerns like health checks and circuit breakers.

Design Patterns · Observability · Service Integration

0 likes · 19 min read

Ops Development Stories

Apr 21, 2022 · Cloud Native

Essential Kubernetes Production Checklist for Web Services

A comprehensive, step‑by‑step checklist guides teams through documentation, application design, security, CI/CD, Kubernetes configuration, monitoring, testing, and 24/7 support to reliably run web services with HTTP APIs in production on Kubernetes.

Kubernetes · Observability · Production

0 likes · 9 min read

政采云技术

Apr 19, 2022 · Cloud Native

A Practical Guide to Dapr Core Features: Pub/Sub, Resource Bindings, Actors, Observability, Secrets, and Configuration

This comprehensive technical tutorial demonstrates how to implement and configure core Dapr features, including publish/subscribe messaging, resource bindings, virtual actors, distributed tracing, secrets management, and dynamic configuration, using Java applications deployed on Kubernetes with practical code examples and command-line instructions.

Dapr · Java · Kubernetes

0 likes · 21 min read

YunZhu Net Technology Team

Apr 15, 2022 · Operations

Design and Architecture of a Cloud‑Native Monitoring Platform for Business Systems

The document outlines the background, vision, current status, technical research, value, product and technical architecture, and functional design of a cloud‑native monitoring platform that integrates SkyWalking and Prometheus to provide comprehensive APM, resource utilization, alerting, and rapid fault localization for business and technical middle‑platform services.

APM · Observability · Operations

0 likes · 10 min read

NetEase Smart Enterprise Tech+

Apr 14, 2022 · Operations

How to Build Precise Alerting with Prometheus to Eliminate Alert Storms

This article explains how to use Prometheus to create a precise, end‑to‑end alerting system that shortens detection and diagnosis time, integrates logs and metrics, routes alerts to the right owners, and prevents overwhelming alert storms in production environments.

Alerting · Observability · Prometheus

0 likes · 10 min read

Alibaba Cloud Native

Apr 13, 2022 · Cloud Native

From Dapper to OpenTelemetry: A Practical Guide to Distributed Tracing and Observability

This article explains the challenges of long request chains in micro‑service architectures, reviews Google’s Dapper tracing requirements, introduces OpenTracing and OpenCensus standards, compares their strengths, and details how OpenTelemetry unifies tracing, metrics and logs with practical integration steps and best‑practice guidance.

Distributed Tracing · Observability · OpenCensus

0 likes · 24 min read

DevOps

Apr 12, 2022 · Operations

Understanding Observability: Core Concepts, SRE Methodology, AIOps, and Business Architecture

The article explains the rising importance of observability in modern operations, defines its control‑theory roots, breaks it down into metrics, traces and logs, and argues that successful implementation requires three pillars—SRE practices, AIOps algorithms, and deep business‑architecture knowledge—together with well‑designed SLOs and critical‑path mapping.

AIOps · Business Architecture · Logging

0 likes · 10 min read

Alibaba Cloud Native

Apr 3, 2022 · Cloud Native

How to Achieve Full Observability for Performance Testing with Prometheus

This guide explains the essential observability concepts—metrics, logs, and traces—for performance testing, compares Zabbix and Prometheus, shows how to extend JMeter with a Prometheus exporter, and details step‑by‑step integration of Alibaba Cloud PTS and Grafana dashboards for comprehensive monitoring.

Observability · Prometheus · cloud-native

0 likes · 9 min read

SQB Blog

Apr 2, 2022 · Operations

Designing a Next‑Gen Observability Platform: From Zipkin to Hera

This article chronicles the evolution of a company's monitoring system from a Zipkin‑based tracing solution to a cloud‑native observability platform called Hera, detailing design goals, technology choices, challenges with MySQL storage, and the adoption of Prometheus‑compatible metrics, Jaeger tracing, and Kubernetes operators.

Distributed Tracing · Jaeger · Observability

0 likes · 22 min read

Laravel Tech Community

Mar 29, 2022 · Backend Development

Apache APISIX 2.13.0 LTS Release: New Features, Observability, Multi‑language Support, and Bug Fixes

The Apache APISIX community announced the 2.13.0 LTS release, enhancing stability, adding observability plugins, a new OpenTelemetry tracing plugin, multi‑language (Wasm, Python, Go) support, and a comprehensive list of bug fixes and improvements.

API Gateway · Apache APISIX · Backend Development

0 likes · 7 min read

Aikesheng Open Source Community

Mar 29, 2022 · Databases

Performance Tuning and Observation Techniques for dble Using BenchmarkSQL

This article shares practical configuration recommendations, system‑resource monitoring methods, and thread‑adjustment strategies for optimizing dble performance during BenchmarkSQL TPC‑C style load testing, highlighting how observable metrics guide effective tuning of the middleware and underlying MySQL nodes.

BenchmarkSQL · Observability · thread optimization

0 likes · 10 min read

StarRocks

Mar 28, 2022 · Backend Development

Scaling Microservice Tracing with Zipkin and StarRocks: A Practical Guide

This article explains how Sohu Smart Media built a high‑performance tracing system for microservices by integrating Zipkin for data collection with StarRocks for storage and analytics, covering architecture, data models, SQL queries, Flink processing, and real‑world results that boost observability and engineering efficiency.

Flink · Observability · SQL

0 likes · 31 min read

Architects Research Society

Mar 25, 2022 · Operations

Understanding Observability: Importance, Benefits, Challenges, and Best Practices

Observability measures a system’s current state using telemetry such as logs, metrics, and traces, enabling IT, DevOps, and SRE teams to detect, diagnose, and resolve issues in complex multi‑cloud environments while delivering better performance, reliability, and business outcomes.

AIOps · IT Operations · Observability

0 likes · 19 min read

Sohu Tech Products

Mar 23, 2022 · Big Data

Microservice Tracing with Zipkin and StarRocks: Architecture and Practice

This article describes how Sohu Intelligent Media built a microservice tracing system using Zipkin for data collection and StarRocks for storage and analysis, covering architecture, data model, ingestion pipeline, SQL analytics, performance monitoring, and future improvements.

Observability · StarRocks · Tracing

0 likes · 27 min read

Open Source Linux

Mar 18, 2022 · Operations

Evolution of Open‑Source Monitoring Tools: From Nagios to Prometheus

This article traces the development of open‑source monitoring solutions from early tools like Nagios and Cacti through modern platforms such as Prometheus and Nightingale, comparing their strengths, weaknesses, and typical use cases while also looking ahead to emerging observability trends in cloud‑native environments.

Observability · Operations · Prometheus

0 likes · 14 min read

Efficient Ops

Mar 16, 2022 · Operations

Why Traditional Monitoring Fails and Observability Is the Future for Ops Teams

Drawing from years of ops experience, the author recounts the decline of traditional monitoring, the rise of automated dashboards, the challenges of AIOps and observability, and proposes a shift toward data‑driven, business‑focused capability building to make alerts truly useful.

AIOps · Observability · SRE

0 likes · 13 min read

DataFunTalk

Mar 11, 2022 · Cloud Native

Operator‑Based Log Collection and the Evolution of Loggie in Cloud‑Native Environments

This article recounts NetEase's journey from early host‑based log collection to operator‑driven Kubernetes logging, discusses the challenges of large‑scale log ingestion, evaluates existing agents, and introduces the open‑source Loggie project with its architecture, features, performance gains, and roadmap.

Kubernetes · Loggie · Logging

0 likes · 12 min read

Open Source Linux

Mar 8, 2022 · Operations

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

This article breaks down Kubernetes troubleshooting into three essential steps—understanding the failure, managing the response, and preventing recurrence—while mapping key monitoring, observability, and incident‑response tools to each phase for reliable cloud‑native operations.

Incident Management · Kubernetes · Observability

0 likes · 8 min read

Ops Development Stories

Mar 3, 2022 · Operations

What Exactly Does an SRE Do? Unpacking Roles, Skills, and Practices

This article explains the SRE role originated by Google, outlines its core responsibilities such as automation, observability, incident response, testing, capacity planning, and SLI/SLO/SLA management, and highlights the skills and cultural practices needed for reliable service operations.

Observability · SLA · SLI

0 likes · 29 min read

Alibaba Cloud Native

Mar 1, 2022 · Cloud Native

How Alibaba’s KubeProbe Tackles Large‑Scale Kubernetes Stability Challenges

This article explains how Alibaba Cloud's self‑built KubeProbe combines universal link probing and targeted inspections to detect, diagnose, and remediate issues in massive multi‑cluster Kubernetes environments, improving reliability and reducing on‑call overhead.

ChatOps · Kubernetes · Large Scale

0 likes · 19 min read

政采云技术

Mar 1, 2022 · Cloud Native

Introduction to Dapr: Features, Architecture, and Installation Guide

This article introduces Dapr, a cloud‑native sidecar runtime for building resilient microservices, explains its core features such as service invocation, state management, pub/sub, bindings, actors, observability, and secrets, and provides step‑by‑step installation instructions for CLI, binaries, Kubernetes, and Helm.

Dapr · Installation · Observability

0 likes · 10 min read

Alibaba Cloud Native

Feb 28, 2022 · Cloud Native

How to Observe and Diagnose DNS Failures in Kubernetes Clusters

This article explains how DNS operates inside Kubernetes, enumerates common failure causes, describes CoreDNS's built‑in observability plugins, introduces BPF‑based client‑side diagnostics, and provides a step‑by‑step troubleshooting workflow to identify and resolve DNS issues in cloud‑native environments.

BPF · CoreDNS · DNS

0 likes · 18 min read

21CTO

Feb 24, 2022 · Backend Development

42 Hard‑Earned Lessons for Building Reliable Production Databases

This article translates Mahesh Balakrishnan’s 42‑point guide on building production databases, covering customer focus, project management, design principles, code review practices, strategy, observability, and research, offering concrete advice for engineers and teams creating robust backend systems.

Observability · Production Systems · Project Management

0 likes · 12 min read

Laravel Tech Community

Feb 20, 2022 · Backend Development

Highlights of .NET 7 Preview 1: Nullable Annotations, Observability, Code Generation, and New APIs

The article outlines the major features of .NET 7 Preview 1, including nullable annotations for Microsoft.Extensions libraries, enhancements to tracing APIs, code‑generation improvements, dynamic PGO and Arm64 support, p/invoke source generation, new System.Text.Json APIs, and expanded hot‑reload capabilities.

.NET · Nullable Annotations · Observability

0 likes · 5 min read

MaGe Linux Operations

Feb 19, 2022 · Cloud Native

Kubernetes Hits Mainstream: Key Insights from CNCF’s 2021 Cloud‑Native Survey

According to CNCF’s 2021 Cloud‑Native Survey, 96% of organizations are using or evaluating Kubernetes, marking its transition to mainstream, with rapid growth in developer adoption, container runtimes, and related projects, while highlighting emerging trends in edge, observability, and security for 2022.

CNCF Survey · Kubernetes · Observability

0 likes · 8 min read

Ctrip Technology

Feb 17, 2022 · Operations

Evolution and Architecture of the Hickwall Enterprise Monitoring Platform

The article details the background, challenges, multi‑year evolution, current architecture, and future roadmap of Hickwall, Ctrip's enterprise‑grade monitoring and observability platform, covering metrics, logs, traces, high‑cardinality handling, cloud‑native integration, alert governance, and storage engine migrations.

Alerting · Observability · Operations

0 likes · 15 min read

Alibaba Cloud Native

Feb 11, 2022 · Cloud Native

What New Features and ACK Enhancements Arrive with Kubernetes 1.22?

This FAQ outlines the new Kubernetes 1.22 capabilities, the components Alibaba Cloud ACK upgrades for this version, added observability, stability and performance improvements, and key upgrade considerations such as deprecated APIs and runtime changes.

ACK · Kubernetes · Observability

0 likes · 6 min read

Efficient Ops

Feb 7, 2022 · Operations

Mastering Application Monitoring with Prometheus: Practical Metrics and Grafana Tips

This article explains how to design effective Prometheus metrics for various application types, choose appropriate vectors, labels, and buckets, and offers Grafana tricks for visualizing dimensions and linking tooltips, providing a comprehensive guide for robust observability.

Best Practices · Grafana · Observability

0 likes · 10 min read

MaGe Linux Operations

Feb 2, 2022 · Operations

Master Prometheus Metrics: Best Practices for Effective Monitoring

This article outlines practical Prometheus monitoring techniques, covering how to choose metrics, define labels, select vectors and buckets, and use Grafana tips to build reliable observability for various application types.

Grafana · Observability · Prometheus

0 likes · 8 min read

Baidu Tech Salon

Jan 27, 2022 · Cloud Native

How China Unicom’s Service Mesh Evolved: From SDKs to Sidecars and Beyond

This article details China Unicom Software Research Institute's multi‑year journey of adopting Kubernetes‑based service mesh, outlining the evolution from SDK‑driven microservices to sidecar‑based architectures, migration strategies with Baidu, performance optimizations, observability enhancements, and future product roadmaps.

Istio · Kubernetes · Observability

0 likes · 13 min read

ITFLY8 Architecture Home

Jan 26, 2022 · Operations

Mastering Microservice Monitoring, Fault Tolerance, and Security: A Complete Guide

This article explains how to monitor microservice architectures, describes log, tracing, and metric monitoring, compares open‑source tracing tools, outlines fault‑tolerance strategies such as timeout, rate‑limiting, degradation, async buffering and circuit breaking, and details access‑security mechanisms including gateway authentication, service‑side auth, and OAuth 2.0 token flows, while also introducing container technology and its role in microservice deployment.

Containers · Observability · fault-tolerance

0 likes · 43 min read

MaGe Linux Operations

Jan 22, 2022 · Cloud Native

Boost Kubernetes Monitoring: Migrate from Prometheus to Thanos for Scalable Low‑Cost Metrics

This article examines the limitations of a standard Prometheus‑based monitoring stack on Kubernetes, explains how adopting Thanos improves metric retention and reduces infrastructure costs, and provides a detailed multi‑cluster deployment guide with Terraform, TLS configuration, and Grafana visualization.

Kubernetes · Observability · Prometheus

0 likes · 16 min read

Efficient Ops

Jan 20, 2022 · Operations

Mastering Prometheus Metrics: Best Practices for Effective Monitoring

This article outlines practical guidelines for designing Prometheus metrics, covering how to define monitoring targets, choose appropriate vectors and labels, name metrics and labels correctly, select histogram buckets, and leverage Grafana features to visualize and troubleshoot data effectively.

Grafana · Observability · Prometheus

0 likes · 11 min read

Baidu Geek Talk

Jan 12, 2022 · Backend Development

Serverless Architecture Evolution: Baidu Search Content Platform's FaaS and Intelligent Transformation

Baidu’s search content platform transitioned to a serverless, FaaS‑based architecture with intelligent scheduling and automated control, cutting resource waste by 87%, boosting automatic recovery to 96.7%, and delivering roughly tenfold productivity gains across development, deployment, and maintenance while simplifying scalability and high‑availability concerns.

FaaS · Intelligent Scheduling · Observability

0 likes · 27 min read

Java High-Performance Architecture

Jan 12, 2022 · Cloud Native

Mastering Service Mesh with Istio: A Hands‑On Guide to Traffic, Security, and Observability

This tutorial explains the fundamentals of service mesh, explores Istio’s architecture and core components, and provides step‑by‑step instructions for installing Istio on Kubernetes, deploying a sample microservice application, and leveraging traffic management, mutual TLS, observability, and advanced use cases such as routing, circuit breaking, and JWT‑based access control.

Istio · Kubernetes · Observability

0 likes · 22 min read

HaoDF Tech Team

Jan 11, 2022 · Big Data

Using ClickHouse for Real‑Time Log Analytics and Data Storage in Microservice Governance at Haodf

The article describes how Haodf's SRE team replaced Elasticsearch with ClickHouse to handle massive microservice logs, achieve low‑latency queries, reduce storage costs, and support real‑time monitoring, tracing, and metric analysis through columnar OLAP features, sharding, TTL, and materialized views.

Analytics · Big Data · ClickHouse

0 likes · 25 min read

Architecture Digest

Jan 9, 2022 · Cloud Native

Introduction to Service Mesh and Istio: Concepts, Architecture, and Practical Usage

This tutorial explains the fundamentals of service mesh, details Istio's architecture and core components, and provides step‑by‑step instructions for installing Istio on Kubernetes, deploying a sample microservice application, and leveraging traffic management, security, and observability features.

Istio · Kubernetes · Observability

0 likes · 18 min read

IT Architects Alliance

Jan 7, 2022 · Cloud Native

Introduction to Service Mesh and Istio: Concepts, Architecture, and Practical Deployment

This tutorial explains the fundamentals of service mesh, outlines Istio’s architecture and core components, demonstrates how to install and configure Istio on Kubernetes, and showcases common use cases such as traffic management, security, observability, and alternatives, providing a comprehensive guide for modern micro‑service deployments.

Istio · Observability · Service Mesh

0 likes · 18 min read

Architect

Jan 5, 2022 · Cloud Native

Introduction to Service Mesh and Istio: Concepts, Architecture, and Hands‑On Guide

This tutorial explains the fundamentals of service mesh, outlines Istio’s architecture and core components, demonstrates how to install Istio on Kubernetes, and walks through practical examples such as traffic routing, security policies, observability, and common use‑cases, while also comparing alternative solutions.

Istio · Kubernetes · Observability

0 likes · 20 min read

Tencent Cloud Developer

Dec 23, 2021 · Cloud Native

An Overview of OpenTelemetry: Origins, Architecture, and Instrumentation

OpenTelemetry unifies tracing, metrics, and logs by merging OpenTracing and OpenCensus into a cross‑language specification, collector, language SDKs, and instrumentation libraries, offering vendor‑agnostic, low‑maintenance telemetry collection that separates data gathering from business logic while requiring external back‑ends for storage and analysis.

Collector · Instrumentation · Observability

0 likes · 10 min read

Qingyun Technology Community

Dec 22, 2021 · Cloud Native

What’s New in KubeSphere 3.2.1? Key Features, Fixes, and Upgrade Guide

Version 3.2.1 of the open‑source KubeSphere platform introduces a series of enhancements—including container group status filtering, improved image builder dialogs, expanded quota visibility, numerous UI bug fixes, and updated DevOps pipelines—alongside detailed installation and upgrade instructions for Linux and Kubernetes environments.

KubeSphere · Kubernetes · Observability

0 likes · 8 min read

21CTO

Dec 20, 2021 · Cloud Native

Why Cloud‑Native Architecture Is the Future of SaaS and How to Implement It

This article explains what cloud‑native architecture is, why it is essential for modern SaaS businesses, and provides a step‑by‑step guide—including serverless migration, elasticity, observability, resilience, and automation—on how to adopt it using Alibaba Cloud SAE and related services.

Observability · SaaS · cloud-native

0 likes · 22 min read

Java Architecture Diary

Dec 13, 2021 · Backend Development

Essential Java & Cloud Native Resources: From JDK 17 to GraalVM, Spring & More

This curated collection gathers essential articles and tutorials covering Java 8‑17 updates, GraalVM performance tricks, Spring Native adoption, Spring Cloud and RSocket alternatives, GraphQL frameworks, observability stacks like Grafana, Prometheus and Loki, IDE enhancements, database fundamentals, and low‑code platform building, providing a comprehensive knowledge base for modern backend developers.

Databases · GraalVM · Java

0 likes · 4 min read

Tencent Cloud Middleware

Dec 9, 2021 · Cloud Native

Why Observability Is the Missing Piece for Day‑2 Success in Cloud‑Native and Serverless Systems

The article explains how observability—through logs, metrics, and traces—transforms the opaque, complex day‑2 operations of micro‑service, Kubernetes, and serverless environments into a deterministic, diagnosable system, highlighting OpenTelemetry, practical collection methods, and real‑world implementation challenges and benefits.

Observability · OpenTelemetry · Serverless

0 likes · 17 min read

Alibaba Cloud Native

Dec 7, 2021 · Cloud Native

Unlocking the Third Way of Distributed Tracing: Post‑Aggregation Link Analysis Explained

This article introduces the third, post‑aggregation approach to link tracing—link analysis—showing how real‑time aggregation of stored trace data can quickly pinpoint uneven traffic, single‑machine failures, slow interfaces, business‑level traffic shifts, and gray‑release anomalies while outlining its practical constraints.

APM · Link Analysis · Observability

0 likes · 11 min read

Laravel Tech Community

Dec 2, 2021 · Cloud Native

New Features in Apache APISIX 2.11.0: LDAP Authentication, Observability Plugins, Azure Functions, and WASM Support

Apache APISIX 2.11.0 adds an LDAP‑based authentication plugin, expands observability with Datadog and SkyWalking plugins, introduces Azure Functions integration, provides early WASM support, and enhances existing plugins, all illustrated with detailed configuration examples and code snippets.

API Gateway · Azure Functions · LDAP

0 likes · 8 min read

GrowingIO Tech Team

Dec 2, 2021 · Cloud Native

Mastering Chaos Mesh: A Hands‑On Guide to Cloud‑Native Chaos Engineering

Chaos Mesh is an open‑source cloud‑native chaos engineering platform that lets you experiment with fault injection across Kubernetes environments, offering visual dashboards, extensive fault types, and step‑by‑step installation and experiment creation guides to help teams uncover system weaknesses and improve resilience.

Chaos Mesh · Fault Injection · Kubernetes

0 likes · 12 min read

Efficient Ops

Nov 24, 2021 · Operations

Why Switch to Loki? Step‑by‑Step Installation and Grafana Visualization

This guide explains why Loki is a lightweight alternative to EFK/ELK, walks through installing Loki and Promtail binaries, configuring them with YAML files, and visualizing logs in Grafana using LogQL, providing a complete end‑to‑end log management solution.

Grafana · Observability · log management

0 likes · 6 min read

Baidu Geek Talk

Nov 24, 2021 · Operations

How Baidu’s Fengjing Uses Holographic Logs to Debug Massive Microservices

Baidu’s Fengjing monitoring platform tackles the daunting challenge of pinpointing failures in its massive Java‑based microservice ecosystem by employing a non‑intrusive probe that captures log metadata, stores it in a database, and reconstructs full request‑level logs with minimal storage overhead.

Distributed Tracing · Java · Observability

0 likes · 9 min read

Efficient Ops

Nov 16, 2021 · Operations

How to Build a Scalable Prometheus Monitoring System with Thanos on Kubernetes

This article explains why monitoring is essential for production stability, compares white‑box and black‑box approaches, and provides a step‑by‑step guide to deploying Prometheus, configuring scrape targets, using Pushgateway and Alertmanager, and scaling the solution with Thanos in a Kubernetes environment.

Alertmanager · Observability · Prometheus

0 likes · 21 min read

Code Ape Tech Column

Nov 15, 2021 · Operations

A Comprehensive Guide to Using Apache SkyWalking for Distributed Tracing, Logging, and Performance Analysis

This article introduces Apache SkyWalking as a powerful open‑source APM solution, compares it with Spring Cloud Sleuth + Zipkin, explains its architecture, walks through server and client setup, data persistence, log collection, performance profiling, and alert configuration, and provides practical code snippets and configuration examples.

Distributed Tracing · Java · Observability

0 likes · 14 min read

Open Source Linux

Oct 31, 2021 · Operations

Designing Effective Metrics: From Requirements to Labels and Buckets

This guide explains how to define, name, and organize monitoring metrics, covering Google’s four golden signals, system‑specific measurement objects, vector selection, label conventions, bucket design, and practical Grafana tips, for reliable observability of diverse services.

Observability · labeling · metrics

0 likes · 10 min read

Top Architect

Oct 17, 2021 · Cloud Native

How Redis Simplifies Microservice Design Patterns, Distributed Transactions, and Observability

This article explains how Redis can be used to implement and simplify a wide range of microservice design patterns—including bounded contexts, asynchronous messaging, orchestrated sagas, transaction inboxes, telemetry, event sourcing, CQRS, and shared data—while improving performance, scalability, and observability in cloud‑native architectures.

CQRS · Observability · Redis

0 likes · 16 min read

Alibaba Cloud Native

Oct 10, 2021 · Cloud Native

How to Detect Service and Workload Anomalies in Kubernetes with Advanced Monitoring

This article explains the common pain points of locating anomalies in Kubernetes environments and presents a multi‑layer monitoring framework—trace, metrics, events, and alerts—along with best‑practice scenarios such as network performance, DNS issues, full‑link stress testing, external MySQL access, and multi‑tenant architectures.

DNS · Kubernetes · Network Performance

0 likes · 20 min read

21CTO

Sep 27, 2021 · Cloud Native

Why Loki Beats ELK for Kubernetes Logging: Architecture, Deployment, and Query Guide

This article explains the motivation behind choosing Loki over heavyweight ELK/EFK stacks for container‑cloud logging, outlines Loki's lightweight architecture and components, provides step‑by‑step deployment instructions on OpenShift/Kubernetes, and demonstrates how to query logs using the LogQL language and HTTP API.

Kubernetes · LogQL · Observability

0 likes · 17 min read

21CTO

Sep 26, 2021 · Backend Development

How Baidu’s Hulk Framework Accelerates Go Service Development

The Hulk framework, built on GDP2, provides a business‑oriented Go web development platform with out‑of‑the‑box components, standardized architecture, rich observability, and tooling that together improve code quality, development speed, and SRE efficiency for large‑scale short‑video services.

Observability · backend · framework

0 likes · 18 min read

Top Architect

Sep 24, 2021 · Cloud Native

Loki Log System Overview, Architecture, and Deployment Guide

This article introduces Loki, a lightweight log aggregation system for Kubernetes, explains its background and motivations, details its simple architecture and core components (Distributor, Ingester, Querier), discusses scalability and storage options, and provides step‑by‑step deployment instructions with example YAML and shell commands.

Deployment · Kubernetes · Logging

0 likes · 16 min read

IT Architects Alliance

Sep 20, 2021 · Operations

Why Loki Beats ELK for Kubernetes Logging: Architecture and Deployment Guide

This article explains the motivations behind choosing Loki over ELK for container‑cloud logging, details Loki's lightweight architecture—including Distributor, Ingester, and Querier components—covers deployment steps on OpenShift/Kubernetes with YAML manifests, and demonstrates LogQL query syntax for efficient log retrieval.

Kubernetes · LogQL · Logging

0 likes · 18 min read

Alibaba Cloud Native

Sep 16, 2021 · Cloud Native

How to Use Kubernetes Monitoring for End-to-End Application Architecture Exploration

This session explains why Kubernetes monitoring is essential for end-to-end observability, describes the five data sources and layers it covers, and walks through discovering and locating architecture, performance, resource, scheduling, and network issues using topology, anomaly detection, and correlation techniques.

Kubernetes · Observability · Performance

0 likes · 13 min read

IT Architects Alliance

Sep 15, 2021 · Backend Development

Comprehensive Guide to Backend Architecture: Microservices, Service Mesh, Messaging, and Observability

This article provides a detailed overview of modern backend architecture, covering microservice fundamentals, design principles such as Conway's Law and DDD, gateway patterns, communication protocols, service registration, configuration management, observability pillars, service mesh options, and a comparative analysis of popular message‑queue technologies.

Observability · backend-architecture · cloud-native

0 likes · 27 min read

HomeTech

Sep 15, 2021 · Backend Development

How ASF Simplifies gRPC‑to‑Go Migration and Boosts Service Governance

This article explains the AutoHome Service Framework (ASF), its architecture, how it enables seamless migration from gRPC to Go services, the added Dubbo‑go support, configuration optimizations, advanced load‑balancing strategies, observability enhancements, and future plans for adaptive balancing and zero‑downtime deployments.

Observability · Service Governance · configuration

0 likes · 18 min read

Dada Group Technology

Sep 10, 2021 · Operations

Design and Implementation of JD Daojia Log System Based on Loki

This document details the motivation, architecture, components, query language, and deployment of a Loki‑based log collection and analysis platform for JD Daojia, comparing it with ELK, describing ingestion, real‑time and historical log handling, technical challenges, configuration examples, and future scaling plans.

Cassandra · Grafana · Observability

0 likes · 15 min read

Baidu Intelligent Testing

Sep 9, 2021 · Cloud Native

Observability Practices in Baidu Search Platform: Real‑time Metrics, Tracing, Logging, and Topology at Hundred‑Billion Scale

This article explains how Baidu's search middle‑platform adopts cloud‑native observability—covering metrics, distributed tracing, log querying, and topology analysis—to ensure high availability, performance, and controllability for a system handling hundreds of billions of requests across millions of micro‑service instances.

Logging · Observability · Tracing

0 likes · 12 min read

Efficient Ops

Sep 5, 2021 · Operations

Why Prometheus’s TSDB Makes Monitoring Scalable: A Deep Dive

This article explains how Prometheus’s time‑series database handles massive monitoring data, illustrates practical query examples, and shows why its storage engine and pre‑computation features enable efficient, high‑performance observability for large‑scale services.

Observability · Prometheus · TSDB

0 likes · 8 min read

Alibaba Cloud Native

Sep 2, 2021 · Cloud Native

2021 GIAC Cloud Native Conference Highlights: Service Mesh, SkyWalking, Dubbogo 3.0

The article summarizes key insights from the 2021 GIAC Cloud Native conference, covering strategies to limit the blast radius of service failures, SkyWalking-based Kubernetes event monitoring, Kuaishou's Service Mesh implementation, and Dubbogo 3.0's innovations such as proxyless mesh and adaptive throttling.

Kubernetes · Observability · Service Mesh

0 likes · 13 min read

Alibaba Cloud Native

Sep 1, 2021 · Cloud Computing

Understanding Serverless: Architecture, Workflow, and Observability

This article explains the concept of Serverless computing, its components (FaaS and BaaS), typical development workflow, supporting tools, Serverless Workflow orchestration, and observability features such as metrics, logging, and tracing.

BaaS · Cloud Computing · Observability

0 likes · 12 min read

DevOps

Aug 31, 2021 · Backend Development

Designing an Uber‑Like Microservice System with DDD, OpenTelemetry Observability, and Reinforced Chaos Engineering

This article describes how to model a complex Uber‑style ride‑hailing system using Domain‑Driven Design, implement it with Java Spring Boot microservices, instrument it with OpenTelemetry for full observability, and validate the observability pipeline through a gamified chaos‑engineering approach that reduces MTTR.

DDD · Java · Observability

0 likes · 13 min read

Open Source Linux

Aug 26, 2021 · Cloud Native

Why Switch from Prometheus to Thanos? Boost Metric Retention & Cut Costs

This article explains the limitations of a traditional Prometheus‑based monitoring stack for Kubernetes, demonstrates how integrating Thanos improves metric retention, scalability, and storage cost, and provides a complete multi‑cluster deployment example with Terraform and Helm configurations.

Kubernetes · Observability · Prometheus

0 likes · 15 min read

21CTO

Aug 20, 2021 · Backend Development

How the Hulk Framework Boosts Go Service Development and Operations

This article explains the background, design, components, ecosystem, and real‑world benefits of the Hulk Go web framework developed by the short‑video R&D team, showing how it improves development efficiency, code quality, performance, observability, and incident response for large‑scale microservices.

Observability · Web Framework · go

0 likes · 19 min read

MaGe Linux Operations

Aug 14, 2021 · Operations

Boost System Reliability: 4 Proven Practices to Master Observability

This article explains why observability is essential for DevOps, outlines four key practices—including production‑environment monitoring, structured logging, a DevOps‑focused culture, and pre‑deployment observability with remote debugging—to help teams detect, diagnose, and prevent issues throughout the software lifecycle.

CI/CD · Culture · Logging

0 likes · 9 min read

Java Architecture Diary

Aug 11, 2021 · Operations

Unlock Loki v2.3.0: Custom Retention, Deletion & Recording Rules Explained

Version 2.3.0 of Loki introduces enhanced features such as per‑tenant custom retention policies, time‑range log deletion via the Compactor API, Prometheus‑style recording rules, a new pattern parser for LogQL, ingestion sharding for faster queries, and advanced IP‑matching syntax, all aimed at improving storage efficiency, compliance, and observability.

LogQL · Observability · Sharding

0 likes · 9 min read

DevOps

Aug 11, 2021 · Operations

Introduction to Chaos Engineering – Part 2: Four Steps for Disrupting Complex Systems

This article explains that chaos engineering is not a magic cure but a disciplined practice for testing distributed systems by designing and running controlled experiments, outlining four essential steps—observability, defining steady state, hypothesizing events, and executing experiments—to gain confidence in system resilience.

Observability · Operations · chaos engineering

0 likes · 11 min read

ITFLY8 Architecture Home

Aug 9, 2021 · Operations

How Liulishuo Scaled Its Unified Monitoring Platform for Billions of Users

This article examines the evolution of online education, introduces Liulishuo's massive English‑learning platform, and details the technical challenges, design choices, and architecture of its cloud‑native unified monitoring system that handles tens of terabytes of data daily.

Observability · Prometheus · SLS

0 likes · 13 min read

Code Ape Tech Column

Jul 27, 2021 · Cloud Native

Understanding Loki: Advantages, Architecture, Installation, and Query Practices

This article explains Loki's low index overhead, concurrent query handling, label‑based indexing, component roles, read/write paths, step‑by‑step installation of Promtail and Loki, label matching techniques, dynamic‑label handling, high‑cardinality concerns, and query optimization strategies for cloud‑native log aggregation.

Observability · cloud-native · log-aggregation

0 likes · 13 min read

Tencent Cloud Developer

Jul 22, 2021 · Operations

Observability in Serverless Environments: Monitoring, Logging, Distributed Tracing, and Best Practices

In this talk, Gal Bashan explains how serverless architectures complicate observability and why metrics, logs, and especially distributed tracing with tools like OpenTelemetry, Jaeger, or commercial platforms are essential for gaining end-to-end visibility, automating instrumentation, and maintaining reliable, business-focused services across cloud providers.

Distributed Tracing · Logging · Observability

0 likes · 12 min read

Alibaba Cloud Native

Jul 19, 2021 · Operations

Scaling Distributed Observability: A Case Study of ARMS Front‑End Monitoring at a Kids Coding Platform

This article details how a rapidly growing Chinese children's programming platform tackled the complexity of distributed system observability by adopting SkyWalking, Prometheus, and Alibaba Cloud ARMS front‑end monitoring, achieving faster fault detection, reduced operational workload, and improved user experience.

ARMS · Observability · distributed systems

0 likes · 12 min read

High Availability Architecture

Jul 15, 2021 · Operations

Baidu Game Microservice Monitoring Practice and System Design

This article describes Baidu's comprehensive approach to monitoring game microservices, covering the background, initial monitoring tools, evolution of the monitoring system, systematic design for risk control, intelligent detection, alarm optimization, efficient fault localization, and future outlook for high‑availability architecture.

Baidu · Game Development · Observability

0 likes · 13 min read

Top Architect

Jul 7, 2021 · Backend Development

Design and Implementation of a High‑Concurrency API Gateway

This article details the architecture and implementation of a high‑concurrency API gateway built on RxNetty, covering request routing, conditional routing, API management, rate limiting, circuit breaking, security policies, monitoring, tracing, and future enhancements within a microservices environment.

API Gateway · Backend Development · Observability

0 likes · 11 min read

Baidu Geek Talk

Jul 5, 2021 · Operations

Automated and Intelligent Analysis of Baidu Search Stability Issues

The team automated Baidu Search fault diagnosis by building a side‑index for instant log lookup, streaming incremental analysis, exhaustive rule templates, feature‑engineering pipelines, query‑scene reconstruction, entropy‑based ranking, per‑second timeline views, and chaos‑engineered fault injection, achieving near‑99% accuracy and second‑level, module‑granular stability tracing.

Observability · Search Stability · chaos engineering

0 likes · 15 min read

Cloud Native Technology Community

Jul 2, 2021 · Cloud Native

Unpacking the CNCF Cloud Native Landscape: A Layer‑by‑Layer Guide

This comprehensive guide breaks down the CNCF cloud native landscape into its four core layers—Provisioning, Runtime, Orchestration & Management, and Application Definition—explaining the problems each layer solves, the key technologies involved, and how they interoperate to enable modern, scalable applications.

CNCF · Observability · Orchestration

0 likes · 60 min read

Programmer DD

Jul 1, 2021 · Operations

Why Loki Beats Elasticsearch: Low Index Overhead, Fast Queries, and Easy Setup

This article explains Loki's advantages over Elasticsearch, including low indexing overhead, concurrent query processing with caching, seamless integration with Prometheus and Grafana, detailed architecture components, installation steps, label handling, high‑cardinality challenges, and best practices for efficient log management.

Elasticsearch · Grafana · Observability

0 likes · 15 min read

Baidu Geek Talk

Jun 30, 2021 · Operations

How Baidu Achieves 5‑9+ Availability: Inside Its Stability Engineering and Observability

This article dissects Baidu Search's ultra‑large micro‑service architecture, detailing the challenges of maintaining five‑nines‑plus availability, the diverse failure modes, and the step‑by‑step evolution of its observability stack, from early log‑only analysis to the kepler1.0/kepler2.0 tracing, full‑log indexing, custom span‑id generation, and compression techniques that together enable rapid root‑cause diagnosis at massive scale.

Baidu Search · Distributed Tracing · Observability

0 likes · 21 min read

Alibaba Cloud Native

Jun 28, 2021 · Cloud Native

How Chanjet Scaled SaaS for 1.3M SMEs with Cloud‑Native Architecture

Chanjet transformed its monolithic SaaS platform for millions of small‑business customers by adopting a cloud‑native, container‑based micro‑service architecture, enabling elastic scaling, reduced operational costs, unified data services, automated DevOps pipelines, and comprehensive observability across front‑end, back‑end, and infrastructure layers.

Observability · SaaS · cloud-native

0 likes · 27 min read

Tencent Cloud Developer

Jun 28, 2021 · Cloud Native

Effective Service Governance for Serverless: Challenges and Solutions

Effective serverless governance requires comprehensive observability, traffic management, and service registration built on Kubernetes, using either a mesh sidecar with Istio or an embedded SDK, to simplify complex operational tasks such as discovery, fault tolerance, gray releases, and metric correlation for large‑scale function deployments.

Observability · Operations · Serverless

0 likes · 17 min read

Java High-Performance Architecture

Jun 24, 2021 · Operations

How Netflix’s Telltale Transforms Application Monitoring and Smart Alerting

Netflix’s in‑house Telltale system consolidates diverse monitoring data, reduces alert noise, provides multidimensional health assessments, and delivers intelligent, context‑rich notifications, enabling engineers to quickly diagnose and resolve issues across more than 100 production services.

Alerting · Netflix · Observability

0 likes · 11 min read

Full-Stack Internet Architecture

Jun 19, 2021 · Operations

Solving Monitoring Pain Points: Unified Framework, Alert Prioritization, and Classification

The article discusses common monitoring challenges such as fragmented tooling and noisy alerts, and proposes solutions including consolidating to a single monitoring framework, prioritizing runtime exceptions, and classifying business alerts with codes and trace information to improve incident response.

Alerting · Incident Management · Observability

0 likes · 6 min read

IT Architects Alliance

Jun 19, 2021 · Operations

Reference Architecture for Digital Transformation Platforms

The article outlines a comprehensive reference architecture for digital transformation platforms, detailing typical organizational contexts, desired outcomes, and key components such as integration layers, API gateways, IAM, BPM, observability, multi‑region deployment, and development practices to enable seamless, secure, and scalable business services.

API Gateway · IAM · Observability

0 likes · 10 min read

Tencent Cloud Developer

Jun 17, 2021 · Industry Insights

Serverless 2024: Key Trends, Best Practices, and Future Challenges Revealed

The second Techo TVP Developer Summit showcased Serverless as a leading cloud‑computing trend, delivering market forecasts, technical deep‑dives on microVMM, observability, AI inference, and real‑world use cases while highlighting governance challenges and future directions for the ecosystem.

Cloud Computing · Industry Trends · MicroVMM

0 likes · 18 min read

58 Tech

Jun 11, 2021 · Frontend Development

Beidou Frontend Monitoring System: Architecture, Challenges, and Solutions

The article details the design, architecture, and operational challenges of the Beidou frontend monitoring platform at 58 Group, covering SDK management, behavior trace logging, front‑back link integration, performance optimizations, minute‑level alerting, and permission management.

Alerting · Frontend · Observability

0 likes · 22 min read

Liulishuo Tech Team

Jun 2, 2021 · Backend Development

Understanding Distributed Tracing and Its Use at Liulishuo

This article explains what distributed tracing is, why it is needed alongside logging and metrics for observability, how it works with trace and span IDs, and describes Liulishuo's implementation using OpenTelemetry, W3C Trace Context, and tail‑based sampling to improve backend debugging.

Distributed Tracing · Observability · OpenTelemetry

0 likes · 9 min read

Ops Development Stories

Jun 1, 2021 · Cloud Native

Designing Multi‑Tenant Loki Logging on Kubernetes: Centralized vs Partitioned

This article explores how to implement Loki’s multi‑tenant logging on Kubernetes, comparing centralized storage (Scheme A) and partitioned storage (Scheme B), detailing required configuration flags, runtime limits, client setups with Logging Operator, Fluentd/FluentBit, and gateway routing strategies.

Kubernetes · Multi‑tenant · Observability

0 likes · 11 min read

Baidu Geek Talk

May 31, 2021 · Cloud Native

Adoption of Service Mesh (Istio) at Baidu iFanFan: Challenges, Migration Strategy, and Benefits

Baidu iFanFan migrated all its Java‑based services to a native Kubernetes + Istio service mesh within three months, replacing fragmented, manual governance with automated rate‑limiting, canary releases, chaos testing, and observability, which cut governance cycles from months to minutes, reduced CI time by ~20%, and dramatically improved system stability and multi‑cloud readiness.

Istio · Kubernetes · Observability

0 likes · 21 min read

Amap Tech

May 28, 2021 · Operations

System Observability Practices in Gaode Ride-Hailing: From Unified Logging to Fault Defense

Gaode Ride‑Hailing created a comprehensive 360° observability platform—standardized logging, distributed tracing, multi‑domain metrics, visual dashboards, and an incident workflow—that transforms raw data into actionable insights, accelerates root‑cause analysis, and enables automated fault defense for its large‑scale cloud‑native microservice system.

Logging · Observability · Tracing

0 likes · 22 min read

21CTO

May 27, 2021 · Cloud Native

Mastering Cloud‑Native Architecture: Practical Steps to Transform SaaS on Alibaba Cloud

This article explains what cloud‑native architecture is, why it is essential for modern SaaS businesses, and provides a step‑by‑step guide—including maturity models, serverless migration, namespace and application setup, load‑balancer binding, service/configuration centers, elasticity, observability, resilience, and automation—using Alibaba Cloud SAE and MSE services.

Alibaba Cloud · Observability · SaaS

0 likes · 23 min read

New Oriental Technology

May 24, 2021 · Operations

Overview of SkyWalking UI: Dashboard, Topology, Tracing, Profiling, and Alerts

The article provides a comprehensive English overview of SkyWalking UI, detailing its dashboard metrics, topology visualization, trace analysis, performance profiling workflow, and alarm management, illustrating how the tool monitors microservice and cloud‑native environments with metrics such as throughput, latency, Apdex, and JVM statistics.

APM · Distributed Tracing · Observability

0 likes · 11 min read

DevOps

May 17, 2021 · Cloud Native

Challenges of Testing Cloud‑Native Applications and the Need for New Approaches

Amid accelerating Agile and DevOps adoption, the rapid delivery of cloud‑native microservices introduces cascading risks and makes traditional monolithic testing inadequate, prompting a shift toward observability‑driven “shift‑right” testing, exploratory methods, and chaos engineering to embrace failure as the new normal.

Observability · Testing · chaos engineering

0 likes · 8 min read
