Tagged articles

Observability

1054 articles · Page 6 of 11

May 23, 2024 · Operations

eBPF + LLM: Building the Infrastructure for Observability Agents

The article explains how zero‑intrusion eBPF provides full‑stack, high‑quality observability data that, when combined with large language models, enables AI‑driven agents to automate ticket handling, change impact analysis, and vulnerability triage, dramatically improving operational efficiency.

AI Agent · Distributed Tracing · LLM

0 likes · 17 min read

DataFunSummit

May 22, 2024 · Operations

Building an Observability System: Practices and Solutions from Yanhuang Data

This article explains how to build a robust observability system for cloud‑native microservice architectures, detailing the three core signals—metrics, traces, and logs—common challenges such as complexity and data silos, and presents Yanhuang Data’s integrated platform with unified data collection, storage, analysis, and visualization solutions.

Kubernetes · Observability · logs

0 likes · 23 min read

Tencent Cloud Developer

May 21, 2024 · Operations

Why Prometheus Metrics Aren’t 100% Accurate – The Hidden Trade‑offs Explained

The article analyzes why Prometheus sometimes returns inaccurate metric values, revealing the design trade‑offs that favor efficiency over precision, and walks through common pitfalls in rate/increase calculations, histogram P99 estimation, and practical recommendations for choosing scrape intervals and query windows.

Histogram · Observability · P99

0 likes · 20 min read

DevOps Operations Practice

May 9, 2024 · Cloud Native

Configuring Prometheus Alert Rules for Monitoring Kubernetes Pod Status

This article demonstrates how to set up Prometheus alerting rules to monitor Kubernetes Pod phases, explains the different Pod states, provides example alert expressions, and discusses practical solutions to avoid false alarms during deployments.

Kubernetes · Observability · Pod Monitoring

0 likes · 6 min read

ByteDance SYS Tech

May 9, 2024 · Operations

How Large‑Model Agents Transform AIOps: From Automation to Self‑Healing Operations

The presentation explains how large‑model agents empower AIOps by automating routine tasks, enhancing anomaly detection, fault diagnosis, and remediation, while outlining architectural components, multi‑agent collaboration, and future directions for building self‑healing, observability‑driven operations platforms.

AIOps · Agent · Observability

0 likes · 15 min read

MaGe Linux Operations

May 4, 2024 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins in Modern Cloud Environments?

This article compares Prometheus and Zabbix, covering their histories, architectural differences, storage models, configuration complexity, community activity, and container support, to help you decide which monitoring solution best fits physical servers or cloud-native environments.

Observability · Prometheus · Zabbix

0 likes · 8 min read

Mike Chen's Internet Architecture

May 3, 2024 · Cloud Native

What Makes Cloud‑Native Architecture Essential for Modern Apps?

This article explains cloud‑native architecture, covering its definition, core concepts such as microservices, containerization, automation, storage, networking, and the guiding principles of service orientation, elastic scaling, and observability that together enable highly available, scalable, and agile applications.

Kubernetes · Observability · containerization

0 likes · 5 min read

DataFunTalk

Apr 30, 2024 · Big Data

Vivo's Evolution of Large‑Scale Distributed Messaging Middleware Architecture and Practices

This technical presentation details Vivo's end‑to‑end big‑data architecture, the evolution from Kafka to Pulsar for massive message processing, deployment strategies, high‑availability mechanisms, observability practices, and future plans for cloud‑native, containerized messaging middleware.

Distributed Messaging · High Availability · Kafka

0 likes · 18 min read

Rare Earth Juejin Tech Community

Apr 29, 2024 · Artificial Intelligence

Building Enterprise‑Grade Retrieval‑Augmented Generation (RAG) Systems: Challenges, Fault Points, and Best Practices

This comprehensive guide explores the complexities of building enterprise‑level Retrieval‑Augmented Generation (RAG) systems, detailing common failure points, architectural components such as authentication, input guards, query rewriting, document ingestion, indexing, storage, retrieval, generation, observability, caching, and multi‑tenant considerations, and provides actionable best‑practice recommendations for developers and technical leaders.

Caching · Enterprise AI · LLM

0 likes · 32 min read

21CTO

Apr 22, 2024 · Operations

Discover Guider: A Python‑Powered Linux Observability Suite with 150+ Commands

Guider, a Python‑based Linux observability suite created by Hyundai engineer Peace Lee, offers over 150 command‑line tools for real‑time performance monitoring, resource tracing, automated reporting, and visualizations, enabling developers to diagnose slow startups, crashes, GPU stalls, and system resets with microsecond precision.

CLI · Linux · Observability

0 likes · 7 min read

dbaplus Community

Apr 21, 2024 · Cloud Native

What Cloud‑Native Tech Stack Should You Use in 2024? A Real‑World Guide

In 2024 the author reflects on a decade of backend evolution and shares a practical, experience‑driven guide to the cloud‑native stack—including Kubernetes, multi‑cloud strategies, DevOps tooling, service mesh, observability, and message‑queue choices—tailored to teams of different sizes.

Observability · Service Mesh · devops

0 likes · 12 min read

Cognitive Technology Team

Apr 17, 2024 · Operations

Using eBPF to Capture Complete Java Call Stacks from OpenJDK 8 JVM without Agent

The team successfully employed eBPF dynamic tracing to obtain full Java call‑stack traces from any point in an OpenJDK 8 JVM process with microsecond‑level overhead, without requiring any JVM agents, bytecode injection, or code modifications, making it suitable for production environments.

JVM tracing · Java · Observability

0 likes · 2 min read

Alibaba Cloud Native

Apr 16, 2024 · Operations

Unlocking Log Insights: How SPL Brings Interactive Pipe‑Style Queries to Cloud‑Native Observability

This article explains how the SLS Processing Language (SPL) enables interactive, pipeline‑based log analysis in cloud‑native environments, covering the challenges of unstructured log data, Unix‑inspired exploration, SPL syntax, key commands, and practical examples for efficient querying and transformation.

Observability · SPL · cloud-native

0 likes · 12 min read

Alibaba Cloud Observability

Apr 16, 2024 · Cloud Native

Mastering Interactive Log Exploration with SPL: Unix‑Inspired Pipelines in Cloud Native Environments

This article explains how the SLS Processing Language (SPL) brings Unix‑style pipelined, interactive log exploration to cloud‑native observability, detailing why logs are unstructured, how SPL’s unified syntax works, and which commands simplify field projection, enrichment, filtering, and semi‑structured data parsing.

Log Processing · Observability · SPL

0 likes · 12 min read

Alibaba Cloud Observability

Apr 12, 2024 · Cloud Computing

Why Alibaba Cloud SLS Beats Open‑Source ELK for Log Management

Alibaba Cloud Log Service (SLS) offers a serverless, high‑availability, low‑cost alternative to self‑built ELK stacks, providing comparable Elasticsearch and Kafka compatibility, superior storage, query, and alerting capabilities, and streamlined migration paths, making it a compelling choice for large‑scale observability workloads.

Cloud Service · ELK · Observability

0 likes · 13 min read

ByteDance Cloud Native

Mar 27, 2024 · Cloud Native

How ByteDance Optimized Its Metrics Agent for 70% CPU Savings

This article details how ByteDance's cloud‑native observability team tackled performance bottlenecks in their metricserver2 Agent—reducing memory copies, merging tiny packets, applying SIMD for tag parsing, and switching compression libraries—to cut CPU usage by over 10% and memory usage by nearly 20% while handling petabyte‑scale metric data.

C++ · Msgpack · Observability

0 likes · 15 min read

Tencent Cloud Developer

Mar 21, 2024 · Backend Development

Backend Refactoring and Architecture Design of Tencent Docs Collection Form Service

Tencent Docs transformed its high‑traffic Collection Form by refactoring a monolithic C++‑style service into 19 loosely‑coupled vertical services with light‑heavy separation, database isolation, async Kafka pipelines, and full observability via Tianji, achieving dramatically improved stability, millisecond‑level sync, reliable export, and faster incident resolution.

Observability · Performance · backend

0 likes · 21 min read

DevOps

Mar 20, 2024 · Cloud Computing

Platform Engineering: Beyond Infrastructure – Core Pillars and Human Collaboration

The article explains that platform engineering extends far beyond basic infrastructure, highlighting its core pillars such as automation, composability, agility, observability, and the essential role of collaboration and culture in creating value‑driven, cloud‑native software delivery.

Automation · Cloud Computing · Collaboration

0 likes · 6 min read

MaGe Linux Operations

Mar 15, 2024 · Cloud Native

How to Enable and Analyze Istio Distributed Tracing with Jaeger on Kubernetes

This guide explains why distributed tracing is needed, how Istio uses Jaeger as the tracing backend, the required request‑header propagation, step‑by‑step deployment of Jaeger and the Bookinfo demo on Kubernetes, and how to inspect and interpret the generated spans.

Distributed Tracing · Istio · Jaeger

0 likes · 16 min read

Practical DevOps Architecture

Mar 15, 2024 · Operations

Comprehensive Practical Guide to Prometheus Configuration, Optimization, and Source Code Development

This multi‑chapter guide provides in‑depth, hands‑on instruction for configuring and optimizing all Prometheus components, exploring Kubernetes monitoring, source‑code analysis, custom exporter development, high‑availability setups, service discovery, resource‑efficient scraping, and integrating Thanos for long‑term storage.

Kubernetes · Observability · Operations

0 likes · 4 min read

Mike Chen's Internet Architecture

Mar 12, 2024 · Backend Development

Master Distributed Tracing: Why It’s Critical for Microservices and How to Choose the Right Tool

This article explains the fundamentals of distributed tracing, why it’s essential for complex microservice architectures, the core concepts and mechanisms behind it, and compares popular tracing frameworks such as Zipkin, Spring Cloud Sleuth, Jaeger, and Pinpoint.

Distributed Tracing · Jaeger · Observability

0 likes · 6 min read

Alibaba Cloud Developer

Mar 11, 2024 · Operations

Why iLogtail Needed a Complete Architecture Overhaul and How It Was Done

This article explains the motivations behind iLogtail's architectural redesign, details the evolution from a single‑file C++ collector to a modular pipeline with Golang plugins, outlines the refactor goals and implementation practices, and reflects on the challenges and outcomes of the six‑month effort.

C++ · Golang · Observability

0 likes · 38 min read

DevOps Cloud Academy

Mar 10, 2024 · Operations

Top 10 Open‑Source Monitoring Tools for DevOps in 2024 – Features, Pros and Cons

This article reviews the ten most important open‑source monitoring and observability tools for modern DevOps teams in 2024, outlining each tool's key features, advantages, disadvantages, and how they compare for performance, scalability, cost and ease of use.

Observability · devops · monitoring

0 likes · 15 min read

Baidu Geek Talk

Mar 6, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Large‑Model Training with Real‑Time Observability and Fault Diagnosis

The article explains why collective communication is critical for distributed large‑model training, outlines the new requirements for system reliability, and introduces Baidu’s Collective Communication Library (BCCL), detailing its enhanced observability, fault‑diagnosis, stability, and performance optimizations that raise effective training time to 98 % and bandwidth utilization to 95 %.

AI Infrastructure · Fault diagnosis · Observability

0 likes · 11 min read

Tencent Cloud Developer

Mar 6, 2024 · Backend Development

Developing a Business Gateway: Protocol Conversion, Performance, and Observability

The article details how a Node‑based business gateway for QQ Channel services replaced legacy RPC with a lightweight tRPC solution, achieving ten‑fold latency reduction, higher QPS, improved security and availability, and added observability through logging, metrics, and a WebSocket push layer.

Node.js · Observability · Performance

0 likes · 23 min read

DevOps

Mar 4, 2024 · Frontend Development

Building QQ Front-end Unified Access Layer: Architecture, Technical Choices, and Performance Insights

This article shares a decade‑long journey of designing and scaling the QQ front‑end unified access layer, covering business background, overall architecture, solution comparisons, core challenges, observability, and performance optimizations while reflecting on practical lessons for large‑scale front‑end systems.

Case Study · Frontend · Observability

0 likes · 10 min read

Efficient Ops

Mar 3, 2024 · Operations

Mastering Prometheus: From Metrics Collection to Alerting and Visualization

This comprehensive guide explains Prometheus' architecture, metric collection models, storage format, query language (PromQL), alerting workflow, configuration reload methods, metric types, custom exporters, and how to visualise data with Grafana, providing a complete end‑to‑end monitoring solution.

Grafana · Observability · PromQL

0 likes · 21 min read

Yum! Tech Team

Mar 1, 2024 · Operations

Building an Observability System Traffic Distribution Diagram

This article explains how to design and implement a traffic distribution diagram for an observability system, covering current cloud‑native tooling, data standardization, transformation, traffic‑flow modeling, aggregation, storage with ClickHouse, and visualisation techniques such as Sankey diagrams.

Observability · cloud-native · data modeling

0 likes · 7 min read

Baidu Intelligent Cloud Tech Hub

Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Fault diagnosis · Observability

0 likes · 11 min read

MaGe Linux Operations

Feb 29, 2024 · Operations

Quickly Set Up OpenTelemetry on Kubernetes: Installation, Modes & Config

This guide walks you through deploying OpenTelemetry in Kubernetes, covering the purpose of otel‑collector, installation via manifests or Helm, the three deployment patterns (No‑Collector, Agent, Gateway), running the otel‑demo, and detailed configuration of receivers, processors, exporters, connectors, extensions, and service pipelines.

Collector · Kubernetes · Observability

0 likes · 11 min read

OPPO Kernel Craftsman

Feb 23, 2024 · Mobile Development

Understanding Perfetto Data Flow Architecture and Reducing Trace Data Loss

Perfetto’s tracing system links multiple producers to a single consumer via shared‑memory buffers, where careful sizing of pages, chunks, and central buffers, along with tuned protobuf encoding and scheduling priorities, mitigates CPU overhead and prevents data loss, enabling reliable observability on Android devices.

Android · Data Flow · Observability

0 likes · 26 min read

Alibaba Cloud Native

Feb 22, 2024 · Cloud Native

Achieving 50% Cost Cut with Cloud‑Native Architecture: A Flexible Workforce Platform Case

Facing poor observability, high resource waste, and unstable releases, QingTuan’s flexible‑workforce platform transformed its monolithic and SOA systems into a cloud‑native micro‑service architecture using Alibaba Cloud ACK, MSE, ARMS, and Prometheus, achieving higher availability, elastic scaling, and up to 50% infrastructure cost reduction.

Observability · architecture · cloud-native

0 likes · 22 min read

Alibaba Cloud Native

Feb 20, 2024 · Cloud Native

What’s New in iLogtail 2.0? A Deep Dive into the Updated Pipeline Architecture

iLogtail 2.0 replaces the monolithic, file‑oriented design of its predecessor with a modular pipeline configuration, new input/processor/output plugins, a refreshed API, SPL processing, finer‑grained parsing controls, nanosecond‑level timestamps, enhanced observability, and performance improvements, while providing compatibility guidance for both commercial and open‑source editions.

API · Observability · SPL

0 likes · 17 min read

Tencent Cloud Developer

Feb 20, 2024 · Frontend Development

From Frontend to Full‑Stack: Architecture, Challenges, and Practices of the QQ Frontend Unified Access Layer

The veteran front‑end engineer chronicles a decade of building QQ’s large‑scale products, detailing how the new Frontend Unified Access Layer replaced fragmented SDKs with a high‑performance, scalable, secure gateway built on an internal http2rpc framework, while tackling legacy protocol coexistence, observability, alert fatigue, and targeted performance optimizations.

Frontend · Full-Stack · Observability

0 likes · 10 min read

Efficient Ops

Feb 19, 2024 · Operations

Mastering Prometheus: Practical Tips for Effective Application Monitoring

This article explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, golden metrics, label conventions, naming rules, histogram bucket choices, and Grafana visualization tricks to help engineers build reliable observability pipelines.

Grafana · Observability · Operations

0 likes · 10 min read

DevOps Cloud Academy

Feb 17, 2024 · Operations

Implementing Reusable GitHub Actions Workflows for Scalable CI at McDonald's

McDonald's engineering team built a fast, reliable, and flexible continuous integration system by leveraging reusable GitHub Actions workflows, centralizing CI code, defining a golden‑path pipeline, balancing developer autonomy, and adding observability across multilingual microservices, improving productivity and maintainability.

Automation · CI/CD · GitHub Actions

0 likes · 7 min read

DevOps Operations Practice

Feb 2, 2024 · Operations

Zabbix vs Prometheus: A Detailed Comparison of Features, Architecture, and Use Cases

This article provides a comprehensive comparison between Zabbix and Prometheus, covering their functional architecture, metric collection methods, data storage, query capabilities, visualization options, and alerting mechanisms, helping readers decide which monitoring system best fits their enterprise needs.

Comparison · Observability · Prometheus

0 likes · 8 min read

DaTaobao Tech

Jan 29, 2024 · Cloud Native

Observability: Logging, Metrics, and Tracing in Distributed Systems

Observability in distributed systems combines event logging, aggregated metrics, and request tracing—each offering distinct trade‑offs in detail, storage, and overhead—and while the ELK stack dominates log and metric handling, tracing solutions such as EagleEye and SkyWalking differ by protocol and language, prompting many teams to adopt unified, cloud‑native platforms like Alibaba Cloud’s Log Service for lower cost, real‑time analysis and simplified management.

ELK · Logging · Observability

0 likes · 32 min read

Linux Code Review Hub

Jan 29, 2024 · Cloud Native

How Minsheng Bank Built eBPF‑Based Observability for Cloud‑Native Services

The article details Minsheng Bank's step‑by‑step journey from traditional network monitoring to a full‑stack, zero‑intrusion observability platform built with DeepFlow, vTap, distributed data collection, and eBPF, illustrating concrete case studies and future plans for expanding business‑level monitoring.

DeepFlow · Distributed Tracing · Network Monitoring

0 likes · 18 min read

MaGe Linux Operations

Jan 25, 2024 · Operations

Mastering Monitoring: From Concepts to Prometheus in Operations

This article explains monitoring fundamentals, distinguishes black‑box and white‑box approaches, outlines key metrics and their aggregation, and provides a comprehensive guide to Prometheus architecture, data model, query language, and practical examples for CPU, memory, and disk usage monitoring.

Observability · Prometheus · metrics

0 likes · 18 min read

Architect

Jan 24, 2024 · Operations

Mastering End-to-End Tracing in Go Microservices with OpenTracing and Zipkin

This article walks through the complete design and implementation of full‑stack distributed tracing for Go‑based microservices, explaining correlation IDs, OpenTracing concepts, component roles, client and server code, database and service call tracing, compatibility issues, and best‑practice design guidelines.

Distributed Tracing · Observability · OpenTracing

0 likes · 20 min read

Alibaba Cloud Native

Jan 23, 2024 · Cloud Native

How eBPF Enables High‑Performance, Language‑Agnostic Application Monitoring

This article explains how the eBPF‑based ARMS monitoring solution provides non‑intrusive, language‑independent observability for cloud‑native microservices by addressing the shortcomings of traditional protocol parsers with a low‑overhead, real‑time parsing architecture.

ARMS · Linux kernel · Observability

0 likes · 12 min read

Java Captain

Jan 15, 2024 · Operations

Java Distributed Tracing: Concepts, Principles, Implementation, and Application Scenarios

This article explains the concept of distributed tracing, outlines its underlying principles in Java, details step‑by‑step implementation using popular SDKs, and describes common application scenarios such as performance monitoring, fault diagnosis, complex event handling, traffic analysis, and system optimization.

Distributed Tracing · Fault diagnosis · Java

0 likes · 5 min read

NetEase Cloud Music Tech Team

Jan 10, 2024 · Operations

Building Cloud Music's APM Metric Monitoring System Based on VictoriaMetrics

Cloud Music’s middleware team built the Pylon APM monitoring system on VictoriaMetrics, combining exporters, vmagent, Nacos, Flink‑based pre‑aggregation recording rules and vminsert for collection with Grafana, a custom Proxy and vmselect for querying, achieving millisecond‑level latency, metric‑trace correlation, stability improvements, and cost‑effective storage for nearly 700 million active time series.

APM monitoring · Flink · Metric Pre-aggregation

0 likes · 12 min read

Tencent Cloud Developer

Jan 9, 2024 · Operations

Tencent Cloud APM Full-Link Tracing Implementation and Best Practices

The article explains how Tencent Cloud APM implements full‑link tracing using OpenTelemetry standards, addresses challenges such as protocol compatibility, massive trace storage, and bytecode overhead with solutions like conversion gateways, tail sampling and thread profiling, and showcases best‑practice scenarios for topology analysis, front‑end/back‑end integration, and log‑trace correlation within the broader TCOP observability suite.

APM · Cloud Monitoring · Full‑Link Tracing

0 likes · 11 min read

Sanyou's Java Diary

Jan 8, 2024 · Cloud Native

How Distributed Tracing Solves Microservice Performance Mysteries with SkyWalking

This article explains the principles and benefits of distributed tracing systems, introduces OpenTracing standards, details SkyWalking’s architecture and mechanisms for automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance impact, and shares practical implementation experiences and custom plugin development within a real‑world microservice environment.

Distributed Tracing · Observability · OpenTracing

0 likes · 20 min read

FunTester

Jan 7, 2024 · Operations

Integrating Monitoring and Observability for Effective Application Performance Management

The article explains how combining traditional monitoring with modern observability, supported by data quality practices and unified workflows, enables more reliable, scalable, and insightful application performance management in agile and cloud‑native environments.

APM · Data Quality · Observability

0 likes · 18 min read

dbaplus Community

Jan 2, 2024 · Operations

How Xiaohongshu Scaled Its Metrics System Tenfold with Cloud‑Native Architecture

Facing exploding metric volumes, high resource consumption, and fragile operations, Xiaohongshu's observability team completely rebuilt its metrics pipeline using VictoriaMetrics, achieving ten‑fold performance gains, minute‑level scaling, high‑availability, cost reduction, and robust multi‑cloud active‑active deployment while preserving data safety and query speed.

Observability · Prometheus · cloud-native

0 likes · 34 min read

Zuoyebang Tech Team

Dec 28, 2023 · Big Data

How We Scaled Our Data Platform by Migrating to Apache DolphinScheduler

Facing growing task volumes and diverse workload types, we upgraded our data development platform's scheduling engine to Apache DolphinScheduler, detailing the migration process, architectural enhancements, stability and observability improvements, multi‑tenant support, and the resulting performance gains and future roadmap.

Apache DolphinScheduler · Big Data · Data Platform

0 likes · 12 min read

Weimob Technology Center

Dec 26, 2023 · Operations

Rebuilding Our APM: Scalable Metrics & Alerts with VictoriaMetrics & VMAlert

This article details the complete redesign of our internal APM system, covering the motivations, architecture choices, metric collection pipeline, integration of VictoriaMetrics and VMAlert, metric and alert design principles, implementation steps, visualizations, performance gains, and future plans for scaling and SaaS‑ification.

APM · Alerting · Observability

0 likes · 17 min read

Efficient Ops

Dec 24, 2023 · Operations

Avoid These 6 Common Prometheus Mistakes When Getting Started

This guide translates and condenses six frequent errors new Prometheus users make—high‑cardinality labels, losing valuable tags during aggregation, using bare selectors, omitting the for field, choosing too‑short rate windows, and applying rate‑related functions to wrong metric types—offering practical fixes to improve monitoring reliability.

Observability · PromQL · Prometheus

0 likes · 12 min read

Architect

Dec 22, 2023 · Operations

How Tencent Search Built a Multi‑Layered Stability Architecture to Slash MTTD and MTTR

The article details Tencent Search’s end‑to‑end stability engineering practice, covering a ten‑step architecture that combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous drills, and shows how these measures collectively reduced mean‑time‑to‑detect and mean‑time‑to‑recover by an order of magnitude while keeping service availability high.

Incident Management · Observability · Resilience

0 likes · 32 min read

DevOps Cloud Academy

Dec 14, 2023 · Operations

CI/CD Observability via OpenTelemetry at Grafana Labs

The article explains the importance of CI/CD observability, outlines common pipeline problems, introduces Grafana's GraCIe plugin built on OpenTelemetry, and discusses how enhanced visibility can improve reliability, decision‑making, and future standardization across CI/CD platforms.

CI/CD · Grafana · Observability

0 likes · 13 min read

Architect

Dec 13, 2023 · Industry Insights

How Bilibili Engineered a 1.2 B‑Viewer Live Stream for the LoL World Championship

This article details Bilibili's end‑to‑end technical planning, traffic‑estimation models, and concrete optimizations—including hotspot caching, traffic dispersion, long‑connection isolation, and automated fault‑injection—that enabled the S13 League of Legends finals to serve over 1.2 billion viewers with stable, low‑latency streaming.

Incident Management · Live Streaming · Observability

0 likes · 22 min read

DataFunTalk

Dec 13, 2023 · Databases

SelectDB Boosts GuanceDB Observability: Architecture Upgrade, Cost Reduction, and Performance Gains

This article details how SelectDB’s inverted‑index, Variant data type, and sampling capabilities were integrated into GuanceDB to replace Elasticsearch, achieving up to 70% storage cost reduction, 2‑4× query speed improvement, and a ten‑fold overall cost‑performance boost for log analytics and observability workloads.

Log Analytics · Observability · Performance

0 likes · 20 min read

Qunar Tech Salon

Dec 12, 2023 · Backend Development

System Slimming at Qunar Travel: Reducing Code and Service Footprint by 50% Using Observability and Automation

This article presents Qunar Travel's "system slimming" project, describing how observability techniques, a two‑stage strategy, and automated tooling were used to identify and remove unused services and code, achieving a 50% reduction in code size, a 26% cut in services, and measurable improvements in reliability and release efficiency.

Java · Observability · backend optimization

0 likes · 20 min read

DevOps Cloud Academy

Dec 9, 2023 · Operations

How Prometheus Memory Usage Was Halved: Insights from Bryan Boreham’s KubeCon Talk

Grafana Labs engineer Bryan Boreham explained at KubeCon how a series of code changes, label optimizations, and Go runtime tuning reduced Prometheus memory consumption by roughly 50%, detailing the technical challenges, solutions, and measurable impact on modern monitoring deployments.

KubeCon · Memory optimization · Observability

0 likes · 9 min read

DevOps Coach

Dec 6, 2023 · Operations

How to Combine Azure OpenAI with Elastic Observability AI Assistant in 10 Minutes

This guide walks through setting up Azure OpenAI (GPT‑4) as a connector for Elastic Observability’s AI Assistant, covering prerequisites, Azure resource creation, connector configuration, URL formatting, and practical examples of log analysis and chat‑based troubleshooting.

AI assistant · Azure OpenAI · Observability

0 likes · 14 min read

37 Interactive Technology Team

Dec 4, 2023 · Backend Development

Root Cause Analysis of Missing Trace Data in Go Services Using Prometheus Metrics and GZIP Compression

The missing trace data in two Go services was caused by the GoFrame tracing middleware recording the gzip‑compressed /metrics response body as a UTF‑8 string, which the OpenTelemetry exporter rejected as invalid UTF‑8; disabling Prometheus compression or decompressing the body before logging resolves the issue.

Observability · OpenTelemetry · Prometheus

0 likes · 16 min read

Bilibili Tech

Dec 1, 2023 · Operations

Safe Production Practices: Change Management Platform Design and Implementation at Bilibili

After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework—including a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms—to dramatically cut emergency incidents and embed a reliability‑first culture.

Observability · Platform Engineering · Reliability

0 likes · 16 min read

Architecture and Beyond

Nov 25, 2023 · Operations

Designing and Implementing an Effective Log System for Internet Startups

The article explains why comprehensive logging is essential for internet startups, outlines the three stages of a log system, details log levels, required fields, best‑practice principles, collection architectures such as local files and ELK, and how collected logs support monitoring, debugging, and analytics.

ELK · Logging · Observability

0 likes · 12 min read

Programmer DD

Nov 24, 2023 · Backend Development

What’s New in Spring Boot 3.2? Explore Java 21 Features and Virtual Threads

Spring Boot 3.2, released shortly after Java 21, brings a host of enhancements such as virtual thread support, CRaC checkpoint restore, SSL bundle reloading, improved observability, new RestClient and JdbcClient, Jetty 12, Pulsar, Kafka and RabbitMQ SSL, redesigned nested JAR handling, Docker image build upgrades, and a comprehensive video walkthrough by Josh Long.

Backend Development · Docker · Java 21

0 likes · 7 min read

macrozheng

Nov 23, 2023 · Operations

How Distributed Tracing with SkyWalking Solves Microservice Performance Mysteries

This article explains the principles of distributed tracing, the OpenTracing standard, SkyWalking's architecture and sampling strategies, and shares practical company implementations and custom plugins that help locate performance bottlenecks in micro‑service systems.

Distributed Tracing · Observability · SkyWalking

0 likes · 18 min read

dbaplus Community

Nov 22, 2023 · Operations

How We Re‑engineered Our Log Platform: From ELK to ClickHouse with Vector and Log‑Pilot

Facing data growth, reliability demands, and high maintenance costs, a company redesigned its logging stack by replacing ELK with a Kubernetes‑native pipeline built on Log‑Pilot, Vector, and ClickHouse, achieving lower cost, higher performance, and seamless migration while preserving familiar query interfaces.

ClickHouse · ELK · Kubernetes

0 likes · 12 min read

Ops Development Stories

Nov 20, 2023 · Operations

How eBPF Powers Next‑Gen Observability and Fault Diagnosis in Kubernetes

At KubeCon China 2023, experts Liu Kai and Dong Shandong presented a three‑part deep dive into Kubernetes observability challenges, demonstrating how eBPF enables comprehensive data collection across all stack layers, seamless integration, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and fault‑tree methods.

Fault diagnosis · Kubernetes · Observability

0 likes · 20 min read

Alibaba Cloud Native

Nov 17, 2023 · Cloud Native

How Dubbo-go’s New Triple Protocol Transforms Cloud‑Native Microservices

The article introduces Dubbo‑go 3.2’s comprehensive upgrade, focusing on the Triple protocol’s gRPC and HTTP compatibility, simplified API, service‑governance features, code examples for server and client, configuration‑driven deployment, built‑in observability, traffic‑management capabilities, and the modular plugin architecture.

Observability · cloud-native · dubbo-go

0 likes · 14 min read

Huya Tech Engineering

Nov 10, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its SRE systems, enabling standardized data ingestion, correlation, and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑efficiency for large‑scale live streaming services.

Metadata · Observability · SRE

0 likes · 13 min read

AntTech

Nov 7, 2023 · Operations

ChaosMeta V0.6.0 Release: New Features, Lossless Injection, Automated Experiments, and Future Directions

ChaosMeta V0.6.0 introduces DNS and log injection capabilities, lossless fault injection concepts, automated experiment orchestration with atomic tasks, and a roadmap for multi‑cloud support and advanced metrics, aiming to solve the last‑mile challenge of continuous automated chaos experiments in production environments.

Fault Injection · Observability · automated experiments

0 likes · 9 min read

Efficient Ops

Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

ICBC · Observability · Operations

0 likes · 6 min read

Inke Technology

Oct 31, 2023 · Operations

How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse

This article details the redesign of a company’s logging infrastructure—from an ELK‑based solution to a ClickHouse‑powered architecture—highlighting the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.

Big Data · ClickHouse · Logging

0 likes · 13 min read

Architect

Oct 26, 2023 · Big Data

Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

This article details Bilibili's evolution of its log system from an Elastic Stack‑based solution to a ClickHouse‑backed architecture with OpenTelemetry, describing the challenges of cost, stability, and scalability, the new components such as Log‑Agent, Log‑Ingester, and a custom visualization platform, and the performance gains and future directions.

ClickHouse · Observability · OpenTelemetry

0 likes · 26 min read

Architect

Oct 25, 2023 · Operations

The Importance of Logging and Distributed Log Operations in Modern Architecture

This article explores why logs are essential in software development, outlines when to record them, discusses the value of logging in large-scale distributed systems, and examines the capabilities required of log‑operation tools such as APM, metrics, tracing, ELK, Prometheus, and custom batch querying solutions.

APM · ELK · Observability

0 likes · 21 min read

HomeTech

Oct 25, 2023 · Operations

How Metrics‑Driven Development Supercharges a Used‑Car Platform

This article examines how a metrics‑driven development approach, combined with observability tools like Prometheus, helped a large online used‑car marketplace improve system insight, accelerate business processes, and deliver measurable performance and efficiency gains across both customer‑facing and dealer‑facing operations.

Data-Driven Engineering · Metrics-Driven Development · Observability

0 likes · 16 min read

Efficient Ops

Oct 24, 2023 · Operations

How to Monitor Business Metrics with Prometheus in Kubernetes

This article explains how to use Prometheus to monitor business‑level metrics in a Kubernetes environment, covering observability fundamentals, metric definitions, metric types, exposing metrics via a /metrics endpoint, and practical Go code examples for defining, recording, and scraping custom metrics.

Kubernetes · Observability · Prometheus

0 likes · 11 min read

Efficient Ops

Oct 22, 2023 · Operations

Master Loki: Deploy, Configure, and Query Logs Efficiently

This guide explains Loki's core concepts, deployment steps for Promtail and Loki, Grafana integration, label‑based indexing, handling dynamic and high‑cardinality tags, and query optimization techniques, providing a complete roadmap for building a cost‑effective, scalable log aggregation system.

Grafana · Kubernetes · Logging

0 likes · 15 min read

Alibaba Cloud Native

Oct 21, 2023 · Operations

How to Reveal Tracing Blind Spots with Continuous Profiling and Code Hotspots

This article explains the evolution of observability, outlines a step‑by‑step diagnosis workflow using metrics, logs and tracing, highlights the blind spots of traditional tracing, and demonstrates how Alibaba Cloud ARMS continuous profiling and code‑hotspot features can pinpoint slow call‑chain issues in Java applications.

APM · Continuous Profiling · Java

0 likes · 14 min read

Selected Java Interview Questions

Oct 15, 2023 · Cloud Native

The Hidden Frictions of Kubernetes Adoption: From Speed Gains to Platform Engineering Challenges

The article examines how rapid Kubernetes adoption accelerates development velocity but also introduces hidden frictions such as standardization limits, DevOps disruption, monitoring difficulties, and team isolation, emphasizing the need for collaborative platform engineering and contextual observability.

Observability · Platform Engineering · cloud-native

0 likes · 13 min read

DataFunSummit

Oct 13, 2023 · Big Data

Practical Experience of Flink on Kubernetes at Kuaishou

This article presents Kuaishou's comprehensive journey of adopting Flink on Kubernetes, covering its background, evolution, architecture, production migration, observability, testing, and future plans, and demonstrates how large‑scale streaming workloads are transformed to a cloud‑native environment.

Big Data · Flink · Kubernetes

0 likes · 14 min read

Ops Development Stories

Oct 12, 2023 · Cloud Native

How to Monitor Kubernetes with OpenTelemetry Collector: Step‑by‑Step Helm Deployment

This guide walks through installing OpenTelemetry Collector on a Kubernetes cluster using Helm, configuring DaemonSet and Deployment collectors, integrating Prometheus for metrics, and customizing receivers, processors, and exporters to achieve comprehensive observability of nodes, pods, containers, and cluster resources.

Kubernetes · Observability · OpenTelemetry

0 likes · 26 min read

Bilibili Tech

Oct 10, 2023 · Backend Development

Design and Implementation of a Scalable Live‑Streaming Full‑Stream Data System

The article details a scalable live‑stream full‑stream data system that replaces a tightly‑coupled legacy architecture with a producer‑consumer model using a custom key‑value store, bucket sharding, gRPC server‑streaming, versioned caching, and comprehensive observability, achieving sub‑second queries, horizontal scalability, and reliable support for thousands of downstream services.

Live Streaming · Observability · data pipeline

0 likes · 18 min read

DevOps Cloud Academy

Oct 4, 2023 · Operations

Integrating OpenTelemetry Metrics into Apache Airflow with Prometheus and Grafana

This guide explains how to enable OpenTelemetry in Apache Airflow, configure an OTel collector, use Prometheus as a metrics backend, set up Grafana dashboards, and visualize sample DAG metrics, providing a complete observability stack for Airflow pipelines.

Apache Airflow · Grafana · Observability

0 likes · 12 min read

Architects Research Society

Oct 3, 2023 · Cloud Native

Chaos Engineering: Concepts, History, Benefits, Challenges, and Getting Started

Chaos engineering is a disciplined approach to testing distributed systems by intentionally injecting failures to verify resilience, covering its definition, origins at Netflix, operational workflow, benefits, challenges, and practical steps for organizations to adopt resilient cloud‑native applications.

Observability · Resilience · chaos engineering

0 likes · 18 min read

MaGe Linux Operations

Sep 30, 2023 · Cloud Native

How DeWu Built a Scalable Cloud‑Native Trace2.0 Observability Platform

This article details DeWu's evolution from a sneaker marketplace to a full‑stack e‑commerce platform and explains how its cloud‑native monitoring system, based on OpenTelemetry, ClickHouse, and object storage, was architected, optimized, and scaled to handle billions of spans daily.

Observability · OpenTelemetry · cloud-native

0 likes · 16 min read

Didi Tech

Sep 26, 2023 · Databases

Didi's Time Series Storage Evolution: From InfluxDB to VictoriaMetrics

Facing exponential growth of time‑series data from 2017 to 2023, Didi migrated from InfluxDB to RRDtool, then to an in‑memory cache layer, and finally adopted VictoriaMetrics because its low‑cost commodity‑hardware operation, high write throughput, strong compression, and easy horizontal scaling solved the earlier storage, OOM, and scalability problems.

Observability · TSDB · VictoriaMetrics

0 likes · 13 min read

Bilibili Tech

Sep 26, 2023 · Backend Development

Applying CQRS Architecture to Live Streaming Room Service: Design, Evolution, and Operational Practices

The live‑streaming room service was re‑architected using CQRS, dividing read‑heavy viewer functions from write‑intensive broadcaster operations, splitting the monolith into focused Go micro‑services, adding multi‑level caching, event‑driven sync, extensive observability, and automated incident‑response to achieve massive scalability and rapid fault recovery.

CQRS · Live Streaming · Observability

0 likes · 18 min read

Didi Tech

Sep 21, 2023 · Cloud Native

OBC: A Cloud-Native Real-Time Computing Engine for Metrics at Didi

To replace costly, duplicated Flink jobs, Didi built Observe‑Compute (OBC), a cloud‑native, PromQL‑driven real‑time metric engine with centralized policy management, scalable containerized workers, and zero‑downtime scaling, achieving million‑RMB annual savings while handling 10 M points per second.

Flink alternative · OBC · Observability

0 likes · 17 min read

Alibaba Cloud Native

Sep 21, 2023 · Cloud Native

How Alibaba Cloud’s SAE Achieves High Stability with Diagnostic Engines and Probes

This article explains how Alibaba Cloud's Serverless Application Engine (SAE) builds end‑to‑end stability by dividing fault handling into prevention, detection, localization and recovery, using a Kubernetes‑based diagnostic engine, runtime availability probes, a unified alert center, and a plug‑in architecture for root‑cause analysis.

Kubernetes · Observability · Serverless

0 likes · 28 min read

HomeTech

Sep 19, 2023 · Operations

Implementing Observability and Alerting with Grafana Unified Alerting in a Cloud‑Native Service Mesh

This article explains how the automotive platform accelerated its cloud‑native service‑mesh transformation by integrating OpenTelemetry, Prometheus, and Grafana, then details the configuration and practical use of Grafana's unified alerting module, including installation, data source setup, alert rule definition, contact points, message templates, and silencing, to achieve comprehensive observability and automated incident response.

Alerting · Grafana · Observability

0 likes · 14 min read

Zhuanzhuan Tech

Sep 19, 2023 · Operations

Design and Implementation of an Integrated Monitoring System at ZhaiZhai Using Prometheus, Grafana, and M3DB

This article describes how ZhaiZhai unified dozens of legacy monitoring tools into a single, all‑in‑one observability platform by adopting Prometheus + Grafana, extending the Prometheus client to push metrics to M3DB, automating Grafana dashboard creation, and building a custom alerting service to reduce operational complexity and improve visibility across business, middleware, and infrastructure services.

Alerting · Grafana · M3DB

0 likes · 21 min read

Didi Tech

Sep 14, 2023 · Operations

eBPF-based Service Interface Topology Observation and Validation in Didi's Observability Platform

Didi’s observability platform leverages non‑intrusive eBPF probes to automatically capture and validate service‑to‑service call tuples, supplement missing SDK data, achieve roughly 80 % core‑path coverage, and address verification challenges while planning future user‑space VM hooks and deeper MTL integration.

BPF · Golang · MTL

0 likes · 20 min read

Huolala Tech

Sep 14, 2023 · Operations

Designing an Effective UI for Monitoring Alerts: Insights from Huolala

This article shares Huolala's experience designing a unified monitoring platform UI, covering the evolution from open‑source dashboards to a fully self‑developed solution, simplification of PromQL, computed metrics, log and trace integration, and the challenges of alert configuration and visualization.

Alerting · Observability · Operations

0 likes · 16 min read

MaGe Linux Operations

Sep 13, 2023 · Operations

Understanding Prometheus Metric Types: Counters, Gauges, Histograms & Summaries

This article explains how metrics are used to monitor software performance, introduces basic metric components and dimensional metrics, compares Prometheus, OpenMetrics and OpenTelemetry standards, and provides detailed guidance on Prometheus metric types—Counter, Gauge, Histogram, and Summary—with code examples and query patterns.

Observability · Prometheus · Python

0 likes · 18 min read

Architect

Sep 7, 2023 · Cloud Native

How Vivo Scaled Container Monitoring with Prometheus, Kafka, and VictoriaMetrics

This article details how Vivo's container platform faced exploding metric volumes, component overload, data gaps, and storage spikes, and explains the step‑by‑step architectural redesign, metric governance, performance tuning, cAdvisor redeployment, and VictoriaMetrics upgrade that restored high‑availability, low‑latency monitoring across a large Kubernetes fleet.

Kubernetes · Observability · Prometheus

0 likes · 18 min read

Baidu Geek Talk

Sep 6, 2023 · Cloud Native

DeeTune: Baidu’s eBPF‑Based Cloud‑Native Network Framework for Service Topology, Traffic Recording, and Non‑Intrusive Monitoring

DeeTune is Baidu’s eBPF‑based cloud‑native network framework that automatically builds complete service topologies, records configurable inter‑service traffic, and provides non‑intrusive metric monitoring with minimal CPU and memory overhead, enabling efficient fault localization and performance analysis across heterogeneous PaaS and container environments.

Baidu · Network Framework · Observability

0 likes · 15 min read

Spring Full-Stack Practical Cases

Sep 6, 2023 · Operations

How to Integrate Prometheus and Grafana with Spring Boot for Real‑Time Monitoring

Learn step‑by‑step how to set up Prometheus and Grafana with a Spring Boot 2.4.12 application, configure dependencies, expose metrics via Actuator, customize meters, and monitor database connection pools, providing a complete observability solution for Java backend services.

Grafana · Observability · Prometheus

0 likes · 4 min read

Didi Tech

Sep 5, 2023 · Operations

Observability and Stability Engineering in Didi Ride‑Hailing Platform

At Didi, observability and stability engineering combine automated, AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations, emphasizing systematic investment and the evolution toward AIOps. The article closes with a recruitment call for backend and test engineers.

AIOps · Didi · Observability

0 likes · 16 min read

Aikesheng Open Source Community

Sep 4, 2023 · Databases

Observability of MySQL 8 Replication Using Performance Schema and Sys Schema Views

The article explains how MySQL 8 enhances replication observability by exposing detailed metrics through Performance Schema tables and sys schema views, providing DBAs with richer information such as per‑channel lag, worker thread states, and full replication status beyond the traditional SHOW REPLICA STATUS output.

InnoDB Cluster · MySQL · Observability

0 likes · 14 min read

FunTester

Sep 1, 2023 · Operations

Observability in the Cloud‑Native Era: Data Collection Strategies and Sampling Techniques

The article explains how cloud‑native observability systems gather massive telemetry from infrastructure, containers, middleware and services, compares direct push and file‑based collection approaches, and details head, tail and local sampling methods to optimize data completeness and performance.

Distributed Tracing · Observability · Performance Optimization

0 likes · 10 min read

dbaplus Community

Aug 31, 2023 · Operations

Which Open‑Source Log Management Tool Is Right for You? A Deep Dive into Six Solutions

This article compares six open‑source log management platforms—OpenObserve, Grafana Loki, SigNoz, Graylog, Syslog‑ng, and Highlight.io—detailing their features, deployment options, advantages, and drawbacks to help you choose the most suitable solution for effective observability and system performance.

Alerting · Observability · Operations

0 likes · 13 min read
