Tagged articles
969 articles
Didi Tech
Sep 21, 2023 · Cloud Native

OBC: A Cloud-Native Real-Time Computing Engine for Metrics at Didi

To replace costly, duplicated Flink jobs, Didi built Observe‑Compute (OBC), a cloud‑native, PromQL‑driven real‑time metric engine with centralized policy management, scalable containerized workers, and zero‑downtime scaling, achieving million‑RMB annual savings while handling 10 M points per second.

Flink alternative · OBC · Observability
0 likes · 17 min read
Alibaba Cloud Native
Sep 21, 2023 · Cloud Native

How Alibaba Cloud’s SAE Achieves High Stability with Diagnostic Engines and Probes

This article explains how Alibaba Cloud's Serverless Application Engine (SAE) builds end‑to‑end stability by dividing fault handling into prevention, detection, localization and recovery, using a Kubernetes‑based diagnostic engine, runtime availability probes, a unified alert center, and a plug‑in architecture for root‑cause analysis.

Cloud Native · Kubernetes · Observability
0 likes · 28 min read
HomeTech
Sep 19, 2023 · Operations

Implementing Observability and Alerting with Grafana Unified Alerting in a Cloud‑Native Service Mesh

This article explains how the automotive platform accelerated its cloud‑native service‑mesh transformation by integrating OpenTelemetry, Prometheus, and Grafana, then details the configuration and practical use of Grafana's unified alerting module—including installation, data source setup, alert rule definition, contact points, message templates, and silencing—to achieve comprehensive observability and automated incident response.

Alerting · Grafana · Observability
0 likes · 14 min read
Zhuanzhuan Tech
Sep 19, 2023 · Operations

Design and Implementation of an Integrated Monitoring System at Zhuanzhuan Using Prometheus, Grafana, and M3DB

This article describes how Zhuanzhuan unified dozens of legacy monitoring tools into a single, all‑in‑one observability platform by adopting Prometheus + Grafana, extending the Prometheus client to push metrics to M3DB, automating Grafana dashboard creation, and building a custom alerting service to reduce operational complexity and improve visibility across business, middleware, and infrastructure services.

Alerting · Architecture · Grafana
0 likes · 21 min read
Huolala Tech
Sep 14, 2023 · Operations

Designing an Effective UI for Monitoring Alerts: Insights from Huolala

This article shares Huolala's experience designing a unified monitoring platform UI, covering the evolution from open‑source dashboards to a fully self‑developed solution, simplification of PromQL, computed metrics, log and trace integration, and the challenges of alert configuration and visualization.

Alerting · Observability · Operations
0 likes · 16 min read
MaGe Linux Operations
Sep 13, 2023 · Operations

Top 9 Log Management Solutions Compared: Features, Pricing, Pros & Cons

This article provides a side‑by‑side comparison of nine popular log management tools—Filebeat, Graylog, LogDNA, ELK, Grafana Loki, Datadog, Logstash, Fluentd and Splunk—detailing each product's core features, pricing models, advantages and disadvantages to help you choose the right solution for your observability needs.

Datadog · ELK · Filebeat
0 likes · 16 min read
Efficient Ops
Sep 12, 2023 · Operations

Understanding Prometheus Metric Types: Counters, Gauges, Histograms & Summaries

This article explains how metrics are used to monitor software performance, introduces basic metric components and dimensional metrics, compares Prometheus, OpenMetrics and OpenTelemetry standards, and provides detailed guidance on Prometheus metric types—Counter, Gauge, Histogram, and Summary—with code examples and query patterns.

Metrics · Observability · Prometheus
0 likes · 18 min read
Architect
Sep 7, 2023 · Cloud Native

How Vivo Scaled Container Monitoring with Prometheus, Kafka, and VictoriaMetrics

This article details how Vivo's container platform faced exploding metric volumes, component overload, data gaps, and storage spikes, and explains the step‑by‑step architectural redesign, metric governance, performance tuning, cAdvisor redeployment, and VictoriaMetrics upgrade that restored high‑availability, low‑latency monitoring across a large Kubernetes fleet.

Cloud Native · Kubernetes · Observability
0 likes · 18 min read
Baidu Geek Talk
Sep 6, 2023 · Cloud Native

DeeTune: Baidu’s eBPF‑Based Cloud‑Native Network Framework for Service Topology, Traffic Recording, and Non‑Intrusive Monitoring

DeeTune is Baidu’s eBPF‑based cloud‑native network framework that automatically builds complete service topologies, records configurable inter‑service traffic, and provides non‑intrusive metric monitoring with minimal CPU and memory overhead, enabling efficient fault localization and performance analysis across heterogeneous PaaS and container environments.

Baidu · Network Framework · Observability
0 likes · 15 min read
Didi Tech
Sep 5, 2023 · Operations

Observability and Stability Engineering in Didi Ride‑Hailing Platform

At Didi, observability and stability engineering combine automated, AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article emphasizes systematic investment and the evolution toward AIOps, and closes with a recruitment call for backend and test engineers.

Didi · Distributed Systems · Observability
0 likes · 16 min read
Aikesheng Open Source Community
Sep 4, 2023 · Databases

Observability of MySQL 8 Replication Using Performance Schema and Sys Schema Views

The article explains how MySQL 8 enhances replication observability by exposing detailed metrics through Performance Schema tables and sys schema views, providing DBAs with richer information such as per‑channel lag, worker thread states, and full replication status beyond the traditional SHOW REPLICA STATUS output.

InnoDB Cluster · MySQL · Observability
0 likes · 14 min read
FunTester
Sep 1, 2023 · Operations

Observability in the Cloud‑Native Era: Data Collection Strategies and Sampling Techniques

The article explains how cloud‑native observability systems gather massive telemetry from infrastructure, containers, middleware and services, compares direct push and file‑based collection approaches, and details head, tail and local sampling methods to optimize data completeness and performance.

Distributed Tracing · Observability · Performance Optimization
0 likes · 10 min read
dbaplus Community
Aug 22, 2023 · Operations

Designing a Multi‑Cloud Intelligent Monitoring Platform at Huolala: Architecture, Practices, and Future Directions

This article details Huolala's one‑stop monitoring platform called Monitor, covering its multi‑cloud architecture, data collection pipelines, real‑time business monitoring, unified alarm handling, and future AI‑driven enhancements, while sharing concrete metrics, incident case studies, and practical implementation steps for large‑scale observability.

GPT · Observability · Operations
0 likes · 19 min read
21CTO
Aug 18, 2023 · Backend Development

Pick the Best Microservices Framework 2023: Top 10 & Key Practices

This article explains what microservices are, compares them with monolithic architecture, outlines their benefits and challenges, highlights the importance of observability, and reviews the top ten microservice frameworks and best‑practice guidelines for 2023.

Backend Architecture · Microservices · Observability
0 likes · 15 min read
Huolala Tech
Aug 18, 2023 · Operations

Beyond System Metrics: Building Effective Business Monitoring for Pricing Services

Starting from the unpredictability of software behavior, the article explains why traditional system‑level monitoring often misses critical business issues, especially in complex pricing services, and presents a comprehensive approach that combines result (black‑box) and process (white‑box) monitoring, practical metrics, and actionable recommendations to improve observability and reduce operational risk.

Observability · Operations · business metrics
0 likes · 14 min read
Tech Architecture Stories
Aug 15, 2023 · Cloud Native

Unlocking Microservice Success: The Interplay of Metrics, Governance, and Validation

This article explains how measurement (SLI/SLO), governance (architecture refactoring, MTTx), and validation (chaos engineering, disaster drills) interrelate in microservice systems, illustrating how observability drives governance actions, governance improves metrics, and validation reinforces both through continuous testing.

Microservices · Observability · SLI
0 likes · 4 min read
MaGe Linux Operations
Aug 11, 2023 · Operations

How eBPF Transformed Linux: From BPF Roots to Modern Observability

This article traces the evolution of eBPF from its BPF predecessor, explains its kernel requirements, security model, probe mechanisms, performance impact, tracing capabilities, and potential event‑loss risks, and looks ahead to its expanding role in networking and system observability.

Linux kernel · Observability · Performance
0 likes · 11 min read
Alibaba Cloud Native
Aug 4, 2023 · Backend Development

Unlocking Dubbo3’s Cloud‑Native Observability: A Complete Guide

This article explains how Dubbo3’s new observability starter provides visual cluster metrics, full‑link tracing, multi‑dimensional monitoring, Prometheus/Grafana integration, and log management, offering practical steps and configurations for building a robust cloud‑native microservice observability platform.

Backend · Cloud Native · Metrics
0 likes · 10 min read
Didi Tech
Aug 3, 2023 · Cloud Native

eBPF-Based Cross-Language Non-Intrusive Traffic Recording for Cloud-Native Services

The article describes an eBPF‑based, language‑agnostic traffic recording framework that hooks low‑level socket operations and thread identifiers to capture complete request‑response flows across Java, PHP, and Go services without modifying application code, dramatically lowering implementation and maintenance costs for cloud‑native traffic replay.

Cloud Native · Go · Observability
0 likes · 15 min read
MaGe Linux Operations
Aug 1, 2023 · Cloud Native

Why Service Mesh Is Essential for Modern Cloud‑Native Microservices

This article explains how service mesh complements Kubernetes by providing advanced traffic management, observability, and security for microservices, discusses common distributed‑system fallacies and service‑governance challenges, compares Istio with FloMesh, and explores future trends such as Wasm sidecars, ambient mesh, and eBPF.

Cloud Native · Microservices · Observability
0 likes · 15 min read
Open Source Linux
Jul 28, 2023 · Operations

Master Linux Performance: Essential Monitoring Tools Explained

This article introduces a comprehensive set of Linux performance and observability tools—such as vmstat, iostat, dstat, iotop, pidstat, top/htop, mpstat, netstat, ps, strace, uptime, lsof, perf, and sar—explaining their purpose, typical usage, and how they fit into basic and advanced performance analysis workflows.

Linux · Observability · System Tools
0 likes · 14 min read
DevOps
Jul 28, 2023 · Operations

The Temporary End of Moore’s Law and the Revival of “Systems Performance”

The article discusses the renewed relevance of performance engineering amid the slowdown of Moore’s Law, highlighting the Chinese edition of "Systems Performance: Enterprise and the Cloud," modern observability tools like eBPF, the "golden 60‑second" analysis, and the push toward continuous performance monitoring and expert systems.

Observability · Performance · Systems
0 likes · 7 min read
dbaplus Community
Jul 27, 2023 · Operations

How to Build Scalable Observability for Cloud‑Native Environments: Lessons from SRE

This article summarizes a technical talk on the challenges of cloud‑native transformation, the design of an application‑centric observability platform using CMDB, Prometheus, Thanos and VictoriaMetrics, practical solutions for high‑cardinality metrics and alerting, and future directions such as eBPF and AI‑driven fault detection.

CMDB · Metrics · Observability
0 likes · 14 min read
DaTaobao Tech
Jul 24, 2023 · Cloud Native

Tengine-Ingress: High‑Performance Cloud‑Native Ingress Gateway for Alibaba Group

Tengine‑Ingress is Alibaba’s cloud‑native Ingress gateway built on the high‑performance Tengine‑Proxy, replacing the legacy Unified Access with dynamic, loss‑less configuration, per‑domain gray‑rollout, dual‑certificate TLS, real‑time observability, and checksum validation, achieving up to 20 % lower latency, CPU and memory usage while scaling to thousands of pods, and paving the way for a universal API gateway supporting TCP, UDP, gRPC, QUIC/HTTP3 and advanced TLS.

Cloud Native · Dynamic Configuration · Ingress
0 likes · 18 min read
Tech Architecture Stories
Jul 23, 2023 · Backend Development

Beyond Scale: Rethinking Architecture Boundaries for Massive Services

This article reflects on years of designing large‑scale backend systems at Tencent, discussing how to define clear architecture boundaries, ensure high availability, integrate diverse technologies, and use observability and monitoring to continuously evolve and improve massive service architectures.

Architecture · Distributed Systems · Observability
0 likes · 25 min read
Volcano Engine Developer Services
Jul 19, 2023 · Cloud Native

How Kelemetry Transforms Kubernetes Observability with Object‑Centric Tracing

Kelemetry, an open‑source tracing system from ByteDance, visualizes Kubernetes control‑plane events by treating each object as a span, linking audit logs, events, and component interactions to provide a unified, searchable view that simplifies debugging, performance analysis, and multi‑cluster observability.

Kubernetes · Observability · debugging
0 likes · 14 min read
Programmer DD
Jul 18, 2023 · Backend Development

Explore the Best Spring I/O 2023 Talks: Must‑Watch Videos for Modern Java Developers

This article curates the most valuable Spring I/O 2023 video sessions—covering the latest Java version adaptations, Spring Framework and Boot innovations, cloud‑native deployments, security, observability, and architectural best practices—providing concise Chinese summaries so developers can quickly identify which talks merit deeper viewing.

Cloud Native · Microservices · Observability
0 likes · 24 min read
dbaplus Community
Jul 17, 2023 · Big Data

How Bilibili Built Billions 3.0: A Low‑Cost, Scalable Log Platform with ClickHouse, Iceberg, and Trino

This article details Bilibili's evolution from the ClickHouse‑based Billions 2.0 log system to the Billions 3.0 architecture, explaining how they reduced storage costs, improved troubleshooting, adopted a lake‑house design with Iceberg on HDFS, leveraged ClickHouse for acceleration, and integrated Trino as the unified query engine.

ClickHouse · Iceberg · Observability
0 likes · 37 min read
Qunar Tech Salon
Jul 12, 2023 · Operations

Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's comprehensive root cause analysis platform, detailing its background, data-driven fault categorization, and architecture—including trace, runtime, middleware, and event analysis modules—and demonstrates its high accuracy and practical impact on reducing incident resolution times across microservices.

DevOps · Microservices · Observability
0 likes · 20 min read
Top Architect
Jul 11, 2023 · Operations

Introducing MyPerf4J: A High‑Performance Java Monitoring and Statistics Tool

MyPerf4J is a Java‑agent based, low‑overhead performance monitoring library that provides real‑time method, memory, GC and class metrics for high‑concurrency, low‑latency applications, offering quick start, configurable properties, and detailed statistical reports for both development and production environments.

JavaAgent · Metrics · Observability
0 likes · 7 min read
DataFunSummit
Jul 11, 2023 · Big Data

Tencent's Autonomous Big Data Platform: Data‑Driven Governance and AI‑Powered Optimization

Tencent’s big data platform introduces a data‑plus‑algorithm driven autonomous solution that automates self‑diagnosis, self‑optimization, and self‑management for trillion‑scale analytics, addressing challenges of massive task governance, resource efficiency, and stability through observable data foundations, pluggable decision engines, and generalized AI decision intelligence.

AI decision · Autonomous Platform · Big Data
0 likes · 17 min read
AntTech
Jul 11, 2023 · Operations

Achieving Full-Stack Observability for Cloud and On-Premise Applications with Ant Group's BOS Platform

This article examines the challenges of maintaining stability across cloud and on‑premise environments, explains how Ant Group's Business‑Intelligent Observability Service (BOS) addresses these issues through unified metadata, seamless application integration, data standardization, and extensive case studies, and demonstrates the resulting improvements in reliability and operational efficiency.

Cloud Computing · Full-stack Tracing · Observability
0 likes · 16 min read
dbaplus Community
Jul 10, 2023 · Operations

Why Most Logging and Metrics Strategies Fail – and How to Fix Them

The author reflects on the shortcomings of current logging, metrics, and tracing practices, explains why they become costly and unscalable, and offers concrete recommendations—including log level discipline, structured logging, metric aggregation, and the use of tools like Prometheus, Cortex, and Thanos—to build a more efficient observability stack.

Metrics · Observability · Prometheus
0 likes · 18 min read
DataFunTalk
Jul 9, 2023 · Operations

Building High‑Performance Observability Data Pipelines with Vector and Honghu

This article explains the concepts and importance of observability, introduces the Vector data‑pipeline tool and its architecture, demonstrates how to configure sources, transforms and sinks, and shows how to integrate Vector with the Honghu platform to build a complete, real‑time monitoring solution for modern distributed systems.

Big Data · Honghu · Observability
0 likes · 33 min read
dbaplus Community
Jul 8, 2023 · Operations

How QQ Music Achieves High Availability: Architecture, Tools, and Observability

This article explains how QQ Music embraces inevitable faults by building a high‑availability architecture that combines redundant infrastructure, automated failover, stability strategies, a robust toolchain for chaos engineering and full‑link load testing, and comprehensive observability to ensure graceful fault handling at scale.

Microservices · Observability · chaos-engineering
0 likes · 27 min read
Architects Research Society
Jul 7, 2023 · Operations

Design Patterns and Principles for Building Large‑Scale Systems

This article outlines key design patterns and principles—such as scalability, idempotency, asynchronous processing, health checks, circuit breakers, feature flags, bulkheads, service discovery, retries, metrics, rate limiting, back‑pressure, and canary releases—that enable large‑scale, reliable, and resilient distributed systems.

Distributed Systems · Observability · Reliability
0 likes · 16 min read
Meituan Technology Team
Jul 6, 2023 · Databases

Meituan Database Attack‑Defense Practice: Kernel Observability, Full SQL, and Index Optimization

The article details how Meituan built a MySQL autonomous platform by constructing kernel observability to split OnCPU/OffCPU wait time, capturing full SQL directly from the kernel with compression, designing a safe exception‑handling workflow, and generating cost‑based index‑tuning suggestions—including what‑if analysis and workload‑driven recommendations—to enable comprehensive SQL governance.

Exception Handling · Full‑SQL · Index Tuning
0 likes · 34 min read
Qunar Tech Salon
Jul 5, 2023 · Mobile Development

Long‑Term Client Crash Governance Mechanism at Qunar: Architecture, Detection, and Resolution Strategies

This article describes Qunar's systematic client crash governance framework, covering background challenges, APM‑based fast problem discovery, multi‑level alerting, common‑issue remediation, code‑level fixes for URL and Bundle size crashes, detection tools, code checks, automated testing, and the measurable improvements achieved in Android and iOS stability.

APM · Android · Mobile
0 likes · 19 min read
Didi Tech
Jul 4, 2023 · Cloud Native

eBPF Technology and Its Application in Didi's Cloud-Native Observability: HuaTuo Platform Practice

eBPF, a safe, high‑performance Linux kernel extension evolving from the 1993 Berkeley Packet Filter to modern dynamic tracing, underpins Didi’s HuaTuo platform, which consolidates bytecode management, fast data processing, stability self‑healing, and container insight to solve traffic replay, topology, security, and root‑cause analysis challenges across cloud‑native services, with plans to broaden business use and community collaboration.

Container Security · Huatuo · Observability
0 likes · 12 min read
Alibaba Cloud Native
Jun 30, 2023 · Cloud Native

Simplify Hybrid Cloud Kubernetes Management with Alibaba ACK One

This article explains how Alibaba Cloud ACK One enables unified registration and management of Kubernetes clusters across public clouds, private data centers, and edge environments, detailing core features, architecture, security measures, and observability capabilities for seamless multi‑cluster operations.

ACK One · Cloud Native · Kubernetes
0 likes · 9 min read
Efficient Ops
Jun 25, 2023 · Operations

How to Build a Next‑Gen “Big Operations” System for Reliability and Observability

This article outlines the evolution from manual operations to DevOps and SRE‑driven “big operations,” detailing system reliability and continuity practices, observability concepts, and the development of AIOps maturity standards, offering a comprehensive guide for building stable, efficient, and secure operational frameworks.

DevOps · Observability · Operations
0 likes · 14 min read
dbaplus Community
Jun 24, 2023 · Operations

How Bilibili Scales Capacity: VPA, HPA, and Cost‑Saving Strategies

This article summarizes Zhang He’s Bilibili SRE talk on building a capacity‑management system that visualizes resource usage, reduces costs, improves stability, and leverages Kubernetes VPA, HPA, pooling, and quota management to support massive live‑stream events and rapid feature releases.

Cost Optimization · HPA · Kubernetes
0 likes · 21 min read
SQB Blog
Jun 16, 2023 · Operations

Boost Java Performance: Optimize JFR Analysis with Flame Graphs and Async‑Profiler

This article explores the evolution of continuous performance profiling, explains why traditional tracing falls short, and details a series of optimizations—including batch processing, object‑reference serialization, aggregation insertion, and multi‑chunk handling—to dramatically reduce memory usage and speed up Java Flight Recorder analysis using async‑profiler and flame graphs.

JFR · Observability · async-profiler
0 likes · 13 min read
Bitu Technology
Jun 14, 2023 · Operations

Getting Started with eBPF: Concepts, Examples, and Security Considerations

This article reviews the fundamentals of eBPF, explains its architecture and tracing mechanisms such as USDT, uprobes, and TC hooks, provides practical code examples, discusses security aspects, and lists notable open‑source projects that leverage eBPF for performance and observability.

Linux · Observability · Performance
0 likes · 9 min read
Laravel Tech Community
May 23, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Advantages and Disadvantages

This article provides a detailed comparative overview of nine popular log management solutions—including Filebeat, Graylog, LogDNA, ELK, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their core features, pricing models, strengths, and weaknesses to help readers choose the most suitable tool for their environment.

Datadog · ELK · Filebeat
0 likes · 14 min read
Efficient Ops
May 22, 2023 · Operations

What’s Driving China’s AIOps Evolution? Insights from the 2023 Survey

The 2023 China AIOps Status Survey, launched by CAICT and the Cloud Computing Open Source Industry Alliance, gathers input from over 60 enterprises to reveal current intelligent‑operations practices, observability adoption, generative AI prospects, and best‑practice case studies, while inviting participants to shape the upcoming report.

Industry Survey · Intelligent Operations · Observability
0 likes · 9 min read
Alibaba Cloud Developer
May 18, 2023 · Operations

Why Gray Releases Fail: A Real-World Bug and an MVP Gray Release Blueprint

This article examines a subtle gray‑release bug that caused message loss due to mismatched environment configurations, analyzes its root causes, and proposes a minimum‑viable‑product gray‑release design with practical strategies, observability tips, and configuration examples to ensure safe, incremental rollouts.

Deployment · Observability · configuration
0 likes · 21 min read
MaGe Linux Operations
May 11, 2023 · Cloud Native

Master Distributed Tracing in Go with OpenTelemetry – A Practical Guide

In modern cloud‑native applications, distributed tracing is essential for pinpointing errors across microservices, and OpenTelemetry provides a standardized framework for collecting and analyzing trace data, with a hands‑on Go implementation demonstrated in an upcoming expert-led workshop.

Cloud Native · Distributed Tracing · Go
0 likes · 5 min read
Tencent Cloud Developer
May 8, 2023 · Cloud Native

Modernizing Tencent Cloud Log Service (CLS): Cloud‑Native Architecture, Challenges, and Benefits

Tencent Cloud Log Service was modernized by migrating over 95 % of its components to a cloud‑native stack of containers, Kubernetes, and declarative APIs, addressing chaotic infrastructure, stateful‑to‑stateless conversion, configuration drift, upgrade risk, elastic scaling, traffic protection and observability, which cut costs by more than 20 million CNY, reduced scaling latency by 90 %, and achieved over 99.99 % availability with petabyte‑scale burst handling.

Architecture · Configuration Management · Log Service
0 likes · 15 min read
MaGe Linux Operations
May 7, 2023 · Operations

How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.

Meta · Observability · Reliability
0 likes · 13 min read
Zhengcaiyun Tech (政采云技术)
Apr 29, 2023 · Cloud Native

Understanding Observability: Challenges, Principles, and OpenTelemetry Architecture

The article explains how growing system complexity drives the need for observability, outlines the three pillars of logs, traces, and metrics, compares traditional stability stacks with modern observability, and details OpenTelemetry's design, advantages, and implementation considerations for cloud‑native environments.

Microservices · Observability · OpenTelemetry
0 likes · 16 min read
DataFunSummit
Apr 29, 2023 · Operations

Application Monitoring Principles and Non‑Intrusive Data Collection at Huya

This article explains the fundamentals of distributed application monitoring, describes Huya's non‑intrusive data‑collection techniques using SDKs and plugins, outlines the design and correlation of observable metrics, and demonstrates practical results and troubleshooting scenarios for backend services.

Distributed TracingMetrics DesignObservability
0 likes · 16 min read
Qunar Tech Salon
Apr 24, 2023 · Operations

Design and Evolution of Qunar's Watcher Enterprise Monitoring Platform

The article details the background, architecture, core features, alert governance, trace integration, and cloud‑native evolution of Watcher, Qunar's internally built, highly scalable monitoring platform that unifies application‑level metrics, alerting, and observability across thousands of services and containers.

AlertingDevOpsObservability
0 likes · 19 min read
ITPUB
Apr 23, 2023 · Cloud Native

How Kindling Leverages eBPF to Reach 1‑5‑10 Observability Targets

This article examines the difficulty of achieving the 1‑5‑10 observability goal, reviews current tracing, logging, and metrics tools, introduces the open‑source Kindling project’s eBPF‑based trace‑profiling approach, and walks through several real‑world use cases that demonstrate faster root‑cause analysis in cloud‑native environments.

KindlingObservabilityPerformance
0 likes · 16 min read
Qunar Tech Salon
Apr 19, 2023 · Operations

Heimdall Exception Statistics System: Architecture, Implementation, and Practice

This article describes the design, implementation, and evolution of Heimdall, an exception‑statistics platform built on Kafka, Flink, and HBase that provides minute‑level anomaly aggregation, stack trace querying, and integration with release and alerting workflows to improve service reliability across thousands of micro‑services.

Exception MonitoringKafkaObservability
0 likes · 14 min read
Efficient Ops
Apr 12, 2023 · Operations

Building Highly Available Prometheus Monitoring with Thanos: A Practical Guide

This article explains why native Prometheus HA solutions fall short for large, multi‑region clusters and shows how to use Thanos components—including sidecar, query, store gateway, and compactor—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive integration with existing Prometheus deployments.

KubernetesObservabilityPrometheus
0 likes · 22 min read
dbaplus Community
Apr 5, 2023 · Cloud Native

How Baidu’s Search Platform Achieves Billion‑Scale Observability in a Cloud‑Native Era

This article explains why observability is critical in cloud‑native architectures and describes how Baidu’s search middle‑platform handles hundred‑billion‑level traffic by implementing low‑cost real‑time metrics, distributed tracing, log querying and topology analysis, while tackling challenges of massive microservice scale, scenario‑level monitoring, and efficient resource usage.

MetricsObservabilitycloud-native
0 likes · 12 min read
System Architect Go
Apr 3, 2023 · Cloud Native

Why Cilium Beats Flannel: Real‑World Kubernetes Networking Insights

The article analyzes how Cilium’s eBPF‑based architecture, advanced network policies, cluster‑wide traffic control, and observability tools like Hubble solved performance, security, and scalability challenges that Flannel and kube‑proxy could not meet in production Kubernetes environments.

CNICiliumCloud Native
0 likes · 12 min read
MaGe Linux Operations
Mar 30, 2023 · Operations

Demystifying PromQL: How Nested Functional Queries Work in Prometheus

This article explores the structure and evaluation of PromQL queries, covering its nested functional language nature, expression types, time handling with instant and range queries, and practical examples using the PromLens visualizer, helping readers grasp how Prometheus processes and types queries.

ObservabilityPromQLTime Series
0 likes · 11 min read
ITPUB
Mar 29, 2023 · Databases

Beyond ACID: A Maslow‑Inspired Hierarchy of Database Needs

Drawing parallels with Maslow’s hierarchy, the article outlines an eight‑level model of database requirements, ranging from core kernel correctness and ACID to advanced observability, automation, and the vision of a truly autonomous database, and explains how the tiers map to functionality, security, reliability, ROI, insight, control, and transcendence.

ArchitectureObservabilityPerformance
0 likes · 12 min read
Efficient Ops
Mar 28, 2023 · Operations

Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

ObservabilityOperationsSRE
0 likes · 14 min read
Alibaba Cloud Native
Mar 28, 2023 · Cloud Native

How RocketMQ 5.0 Enables Distributed End‑to‑End Tracing with OpenTelemetry

This article explains how Apache RocketMQ 5.0 integrates standardized distributed tracing via OpenTelemetry, detailing the underlying span model, semantic conventions for messaging, automatic and manual instrumentation options, configuration steps, a complete example workflow, and how to export traces to Alibaba Cloud SLS and ARMS for observability.

Cloud NativeDistributed TracingMessaging
0 likes · 17 min read
ITPUB
Mar 24, 2023 · Cloud Native

Why Open‑Falcon Stalled and How Cloud‑Native Monitoring Is Evolving

This article reviews the evolution of monitoring in the cloud‑native era, analyzes Open‑Falcon’s architecture, strengths, and shortcomings, explains why its development hit a bottleneck, and outlines the design principles and features of the Nightingale monitoring system as a modern, open‑source alternative.

ArchitectureMicroservicesObservability
0 likes · 15 min read
Top Architect
Mar 22, 2023 · Operations

Log Management, Observability, and APM: Concepts, Practices, and Tools

This article explains what logs are, when to record them, their value in large-scale systems, and how to build effective log‑management and observability platforms using APM concepts, including metrics, tracing, ELK, Prometheus, and custom tooling for distributed architectures.

APMELKObservability
0 likes · 20 min read
Architect
Mar 21, 2023 · Operations

Log Management, Observability, and APM Practices in Distributed Systems

This article explains what logs are, when to record them, their value in large‑scale architectures, and how to build effective logging, metrics, and tracing platforms using tools such as ELK, Prometheus, and SkyWalking, while also presenting good and bad logging practices and sample batch‑log retrieval code.

APMDistributed SystemsELK
0 likes · 20 min read
New Oriental Technology
Mar 10, 2023 · Cloud Native

Middleware PaaS on Kubernetes: Architecture, Benefits, and IP Reservation Challenges

This article explains how the New Oriental architecture team migrated middleware services like Redis, Kafka, and RocketMQ to Kubernetes, detailing the benefits over traditional PaaS, the Capo IP reservation solution for network stability, and the resulting operational, observability, and resource utilization improvements.

Cloud NativeKubernetesObservability
0 likes · 18 min read
AntTech
Mar 7, 2023 · Cloud Native

Introduction to HoloInsight: A Cloud‑Native Lightweight Observability Platform

HoloInsight is an open‑source, cloud‑native observability platform derived from Ant Group's AntMonitor, offering integrated log‑based monitoring, business metric analysis, and AI‑driven AIOps capabilities while providing a lightweight, modular architecture and extensive extensibility for modern software stacks.

Observabilityaiopscloud-native
0 likes · 13 min read
DataFunSummit
Mar 4, 2023 · Operations

Full‑Chain Monitoring and Trace System at Huolala: Evolution, Architecture, and Visualization

This article details how Huolala built a comprehensive full‑chain monitoring and tracing platform, covering the historical evolution of observability tools, the company’s multi‑stage monitoring architecture, bytecode‑enhanced instrumentation, trace sampling strategies, and a "what‑you‑see‑is‑what‑you‑get" visualization approach.

MicroservicesObservabilityPrometheus
0 likes · 15 min read
ByteDance SYS Tech
Feb 28, 2023 · Cloud Native

How ByteDance’s ARES Boosts Cloud‑Native Resilience with Chaos Engineering

This article explains ByteDance’s end‑to‑end chaos engineering practice for cloud‑native environments, covering its background, principles, comparison with traditional testing, the evolution of its internal platforms, and a detailed look at the Application Resilience Enhancement Service (ARES) and its core features.

Fault InjectionKubernetesMicroservices
0 likes · 17 min read
Alibaba Cloud Native
Feb 27, 2023 · Cloud Native

What’s Next for Microservices? Highlights from the Beijing Cloud Native Meetup

The Beijing "Microservices x Container Open Source Developer Meetup" gathered over 100 developers and core maintainers of leading cloud‑native projects to discuss next‑generation microservice architectures, static compilation, service governance, multi‑cluster management, observability, and more, providing deep technical insights and real‑world examples.

Cloud NativeKubernetesObservability
0 likes · 11 min read
Baidu Geek Talk
Feb 20, 2023 · Operations

Deep Dive into Logging Operations and Observability in Distributed Systems

The article examines logging’s critical role in distributed systems, detailing its purpose, severity levels, and value for debugging, performance, security, and auditing, while highlighting challenges of inconsistent formats and traceability, and reviewing observability pillars, ELK and tracing tools, and practical implementation best practices.

APMELKObservability
0 likes · 19 min read
Alibaba Cloud Native
Feb 8, 2023 · Cloud Native

Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark

This article benchmarks Alibaba Cloud Prometheus against the open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns, revealing that while the open‑source version remains stable under light load, its CPU and memory usage grow non‑linearly with high cardinality, whereas Alibaba's managed service delivers higher compatibility, better query performance, and more predictable scaling.

Cloud NativeMetricsObservability
0 likes · 30 min read
Cloud Native Technology Community
Feb 8, 2023 · Operations

FinOps Core Principles and the Rationale for Left‑Shift in Cloud Cost Management

The article explains how DevOps teams can adopt FinOps principles and a left‑shift approach—combining static and dynamic logging, fostering cross‑team collaboration, and integrating cost awareness into the software development lifecycle—to reduce cloud expenses, improve MTTR, and drive sustainable engineering productivity.

Cloud CostDevOpsFinOps
0 likes · 10 min read
dbaplus Community
Feb 6, 2023 · Operations

How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services

This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.

ObservabilityOperationsSystem Architecture
0 likes · 18 min read
Tencent Cloud Developer
Feb 3, 2023 · Cloud Computing

Cloud Load Testing: Strategies, Scenarios, and Practice Cases for High‑Traffic Events

Tencent’s cloud load‑testing platform simulates massive Chinese‑New‑Year traffic by offering concurrency and RPS modes, multi‑language test authoring, realistic data generation, and unified OpenTelemetry reporting, enabling early bottleneck detection, proactive scaling, and successful high‑load drills such as Mobile QQ and video services.

JavaScriptLoad TestingMicroservices
0 likes · 23 min read
Open Source Linux
Feb 3, 2023 · Cloud Native

Why eBPF Is the Secret Weapon Behind Modern Cloud‑Native Platforms

This article explains how eBPF extends kernel functionality to enable secure, high‑performance networking, observability, and programmable workloads in cloud‑native environments. It covers eBPF's architecture, use cases, market adoption, and commercialization models, along with the challenges and advantages behind its comparison to "JavaScript for the kernel."

Cloud NativeLinuxNetworking
0 likes · 12 min read
Architects Research Society
Feb 2, 2023 · Backend Development

Medium’s Journey to Microservices: Principles, Strategies, and Lessons Learned

This article explains why Medium transitioned from a monolithic Node.js application to a microservice architecture, outlines the core design principles, shares practical strategies for building, deploying, and observing services, and warns about common pitfalls such as the microservice syndrome.

DeploymentObservabilityService Architecture
0 likes · 23 min read
ITPUB
Jan 31, 2023 · Databases

How Pigsty Turns PostgreSQL into a Cost‑Effective Open‑Source RDS Alternative

Pigsty is an open‑source platform that upgrades PostgreSQL across six dimensions—observability, reliability, availability, maintainability, extensibility, and interoperability—delivering enterprise‑grade features, built‑in monitoring, automatic failover, backup, and performance tuning while cutting cloud database costs dramatically.

Cost OptimizationObservabilityOpen-source
0 likes · 22 min read
dbaplus Community
Jan 26, 2023 · Operations

Unified Metrics, Tracing, and Logging: A Financial Firm’s Path to Microservice Observability

Facing the challenges of distributed microservice architectures, a financial services company implemented a unified observability platform that combines metrics, tracing, and logging via OpenTelemetry and custom agents, enabling real‑time visualization, anomaly detection, and performance analysis across seven core business middle‑platforms.

Distributed TracingMetricsMicroservices
0 likes · 17 min read
MaGe Linux Operations
Jan 23, 2023 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins in Modern Environments?

This article compares Prometheus and Zabbix, detailing their histories, architectures, performance, community support, and suitability for different environments, and concludes with guidance on choosing the right monitoring solution for physical servers, cloud-native deployments, and large‑scale container clusters.

Cloud NativeObservabilityZabbix
0 likes · 9 min read