Tagged articles

Observability

1054 articles · Page 7 of 11

Aug 22, 2023 · Operations

Designing a Multi‑Cloud Intelligent Monitoring Platform at Huolala: Architecture, Practices, and Future Directions

This article details Huolala's one‑stop monitoring platform called Monitor, covering its multi‑cloud architecture, data collection pipelines, real‑time business monitoring, unified alarm handling, and future AI‑driven enhancements, while sharing concrete metrics, incident case studies, and practical implementation steps for large‑scale observability.

GPT · Multi-Cloud · Observability

0 likes · 19 min read


Efficient Ops

Aug 21, 2023 · Operations

Mastering Application Monitoring with Prometheus: Practical Tips and Best Practices

This guide explains how to design effective Prometheus metrics, choose appropriate monitoring objects, labels, and buckets, and leverage Grafana visualizations to gain deep insight into application performance across online services, offline processing, and batch jobs.

Grafana · Observability · Prometheus

0 likes · 10 min read


21CTO

Aug 18, 2023 · Backend Development

Pick the Best Microservices Framework 2023: Top 10 & Key Practices

This article explains what microservices are, compares them with monolithic architecture, outlines their benefits and challenges, highlights the importance of observability, and reviews the top ten microservice frameworks and best‑practice guidelines for 2023.

Best Practices · Observability · backend-architecture

0 likes · 15 min read


Huolala Tech

Aug 18, 2023 · Operations

Beyond System Metrics: Building Effective Business Monitoring for Pricing Services

The article explains why traditional system‑level monitoring often misses critical business issues when software behaves unpredictably, especially in complex pricing services, and presents a comprehensive approach that combines result (black‑box) and process (white‑box) monitoring, practical metrics, and actionable recommendations to improve observability and reduce operational risk.

Observability · Operations · business metrics

0 likes · 14 min read


Tech Architecture Stories

Aug 15, 2023 · Cloud Native

Unlocking Microservice Success: The Interplay of Metrics, Governance, and Validation

This article explains how measurement (SLI/SLO), governance (architecture refactoring, MTTx), and validation (chaos engineering, disaster drills) interrelate in microservice systems, illustrating how observability drives governance actions, governance improves metrics, and validation reinforces both through continuous testing.

Disaster Recovery · Observability · SLI

0 likes · 4 min read


Top Architect

Aug 14, 2023 · Operations

Setting Up Nginx Log Collection and Visualization with Promtail, Loki, and Grafana

This guide explains how to collect Nginx access logs, convert them to JSON, ship them with Promtail to Loki, and visualize the data in Grafana, including installation steps, configuration snippets, GeoIP setup, and dashboard customization for a complete observability solution.

Grafana · Log Monitoring · Nginx

0 likes · 9 min read


Tech Architecture Stories

Aug 14, 2023 · Operations

Why Governing Microservices Is Essential for Stability and Scalability

The article explains why microservice governance—through measurement, targeted remediation, and verification—is crucial for maintaining system stability, reducing complexity, and improving availability in large‑scale, rapidly evolving architectures.

Governance · Observability · SLO

0 likes · 9 min read


MaGe Linux Operations

Aug 11, 2023 · Operations

How eBPF Transformed Linux: From BPF Roots to Modern Observability

This article traces the evolution of eBPF from its BPF predecessor, explains its kernel requirements, security model, probe mechanisms, performance impact, tracing capabilities, and potential event‑loss risks, and looks ahead to its expanding role in networking and system observability.

Linux kernel · Observability · Performance

0 likes · 11 min read


Efficient Ops

Aug 6, 2023 · Cloud Native

Mastering Prometheus: Build a Cloud‑Native Monitoring System from Scratch

This article explains how to design a Prometheus‑based cloud‑native monitoring solution, covering target selection, metric collection, server configuration, Grafana visualization, and alert management with practical examples and code snippets.

Alerting · Cloud Native Monitoring · Grafana

0 likes · 8 min read


Alibaba Cloud Native

Aug 4, 2023 · Backend Development

Unlocking Dubbo3’s Cloud‑Native Observability: A Complete Guide

This article explains how Dubbo3’s new observability starter provides visual cluster metrics, full‑link tracing, multi‑dimensional monitoring, Prometheus/Grafana integration, and log management, offering practical steps and configurations for building a robust cloud‑native microservice observability platform.

Observability · Tracing · backend

0 likes · 10 min read


Didi Tech

Aug 3, 2023 · Cloud Native

eBPF-Based Cross-Language Non-Intrusive Traffic Recording for Cloud-Native Services

The article describes an eBPF‑based, language‑agnostic traffic recording framework that hooks low‑level socket operations and thread identifiers to capture complete request‑response flows across Java, PHP, and Go services without modifying application code, dramatically lowering implementation and maintenance costs for cloud‑native traffic replay.

Observability · Socket · cloud-native

0 likes · 15 min read


MaGe Linux Operations

Aug 1, 2023 · Cloud Native

Why Service Mesh Is Essential for Modern Cloud‑Native Microservices

This article explains how service mesh complements Kubernetes by providing advanced traffic management, observability, and security for microservices, discusses common distributed‑system fallacies and service‑governance challenges, compares Istio with FloMesh, and explores future trends such as Wasm sidecars, ambient mesh, and eBPF.

Observability · Service Mesh · cloud-native

0 likes · 15 min read


Open Source Linux

Jul 28, 2023 · Operations

Master Linux Performance: Essential Monitoring Tools Explained

This article introduces a comprehensive set of Linux performance and observability tools—such as vmstat, iostat, dstat, iotop, pidstat, top/htop, mpstat, netstat, ps, strace, uptime, lsof, perf, and sar—explaining their purpose, typical usage, and how they fit into basic and advanced performance analysis workflows.

Linux · Observability · command-line

0 likes · 14 min read


DevOps

Jul 28, 2023 · Operations

The Temporary End of Moore’s Law and the Revival of “Systems Performance”

The article discusses the renewed relevance of performance engineering amid the slowdown of Moore’s Law, highlighting the Chinese edition of "Systems Performance: Enterprise and the Cloud," modern observability tools like eBPF, the "golden 60‑second" analysis, and the push toward continuous performance monitoring and expert systems.

Observability · Optimization · Performance

0 likes · 7 min read


dbaplus Community

Jul 27, 2023 · Operations

How to Build Scalable Observability for Cloud‑Native Environments: Lessons from SRE

This article summarizes a technical talk on the challenges of cloud‑native transformation, the design of an application‑centric observability platform using CMDB, Prometheus, Thanos and VictoriaMetrics, practical solutions for high‑cardinality metrics and alerting, and future directions such as eBPF and AI‑driven fault detection.

CMDB · Observability · SLA

0 likes · 14 min read


DaTaobao Tech

Jul 24, 2023 · Cloud Native

Tengine-Ingress: High‑Performance Cloud‑Native Ingress Gateway for Alibaba Group

Tengine‑Ingress is Alibaba’s cloud‑native Ingress gateway built on the high‑performance Tengine‑Proxy, replacing the legacy Unified Access with dynamic, loss‑less configuration, per‑domain gray‑rollout, dual‑certificate TLS, real‑time observability, and checksum validation, achieving up to 20 % lower latency, CPU and memory usage while scaling to thousands of pods, and paving the way for a universal API gateway supporting TCP, UDP, gRPC, QUIC/HTTP3 and advanced TLS.

Ingress · Kubernetes · Observability

0 likes · 18 min read


Tech Architecture Stories

Jul 23, 2023 · Backend Development

Beyond Scale: Rethinking Architecture Boundaries for Massive Services

This article reflects on years of designing large‑scale backend systems at Tencent, discussing how to define clear architecture boundaries, ensure high availability, integrate diverse technologies, and use observability and monitoring to continuously evolve and improve massive service architectures.

High Availability · Observability · System Design

0 likes · 25 min read


Volcano Engine Developer Services

Jul 19, 2023 · Cloud Native

How Kelemetry Transforms Kubernetes Observability with Object‑Centric Tracing

Kelemetry, an open‑source tracing system from ByteDance, visualizes Kubernetes control‑plane events by treating each object as a span, linking audit logs, events, and component interactions to provide a unified, searchable view that simplifies debugging, performance analysis, and multi‑cluster observability.

Kubernetes · Observability · Tracing

0 likes · 14 min read


Programmer DD

Jul 18, 2023 · Backend Development

Explore the Best Spring I/O 2023 Talks: Must‑Watch Videos for Modern Java Developers

This article curates the most valuable Spring I/O 2023 video sessions—covering the latest Java version adaptations, Spring Framework and Boot innovations, cloud‑native deployments, security, observability, and architectural best practices—providing concise Chinese summaries so developers can quickly identify which talks merit deeper viewing.

Backend Development · Java · Observability

0 likes · 24 min read


dbaplus Community

Jul 17, 2023 · Big Data

How Bilibili Built Billions 3.0: A Low‑Cost, Scalable Log Platform with ClickHouse, Iceberg, and Trino

This article details Bilibili's evolution from the ClickHouse‑based Billions 2.0 log system to the Billions 3.0 architecture, explaining how they reduced storage costs, improved troubleshooting, adopted a lake‑house design with Iceberg on HDFS, leveraged ClickHouse for acceleration, and integrated Trino as the unified query engine.

ClickHouse · Iceberg · Observability

0 likes · 37 min read


Qunar Tech Salon

Jul 12, 2023 · Operations

Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's comprehensive root cause analysis platform, detailing its background, data-driven fault categorization, and architecture (including trace, runtime, middleware, and event analysis modules), and demonstrates its high accuracy and practical impact on reducing incident resolution times across microservices.

Observability · Operations · Root Cause Analysis

0 likes · 20 min read


Top Architect

Jul 11, 2023 · Operations

Introducing MyPerf4J: A High‑Performance Java Monitoring and Statistics Tool

MyPerf4J is a Java‑agent based, low‑overhead performance monitoring library that provides real‑time method, memory, GC and class metrics for high‑concurrency, low‑latency applications, offering quick start, configurable properties, and detailed statistical reports for both development and production environments.

Java · JavaAgent · Observability

0 likes · 7 min read


DataFunSummit

Jul 11, 2023 · Big Data

Tencent's Autonomous Big Data Platform: Data‑Driven Governance and AI‑Powered Optimization

Tencent’s big data platform introduces a data‑plus‑algorithm driven autonomous solution that automates self‑diagnosis, self‑optimization, and self‑management for trillion‑scale analytics, addressing challenges of massive task governance, resource efficiency, and stability through observable data foundations, pluggable decision engines, and generalized AI decision intelligence.

AI decision · Autonomous Platform · Big Data

0 likes · 17 min read


AntTech

Jul 11, 2023 · Operations

Achieving Full-Stack Observability for Cloud and On-Premise Applications with Ant Group's BOS Platform

This article examines the challenges of maintaining stability across cloud and on‑premise environments, explains how Ant Group's Business‑Intelligent Observability Service (BOS) addresses these issues through unified metadata, seamless application integration, data standardization, and extensive case studies, and demonstrates the resulting improvements in reliability and operational efficiency.

Cloud Computing · Full-stack Tracing · Observability

0 likes · 16 min read


dbaplus Community

Jul 10, 2023 · Operations

Why Most Logging and Metrics Strategies Fail – and How to Fix Them

The author reflects on the shortcomings of current logging, metrics, and tracing practices, explains why they become costly and unscalable, and offers concrete recommendations—including log level discipline, structured logging, metric aggregation, and the use of tools like Prometheus, Cortex, and Thanos—to build a more efficient observability stack.

Logging · Observability · Prometheus

0 likes · 18 min read


DataFunTalk

Jul 9, 2023 · Operations

Building High‑Performance Observability Data Pipelines with Vector and Honghu

This article explains the concepts and importance of observability, introduces the Vector data‑pipeline tool and its architecture, demonstrates how to configure sources, transforms and sinks, and shows how to integrate Vector with the Honghu platform to build a complete, real‑time monitoring solution for modern distributed systems.

Big Data · Honghu · Observability

0 likes · 33 min read


dbaplus Community

Jul 8, 2023 · Operations

How QQ Music Achieves High Availability: Architecture, Tools, and Observability

This article explains how QQ Music embraces inevitable faults by building a high‑availability architecture that combines redundant infrastructure, automated failover, stability strategies, a robust toolchain for chaos engineering and full‑link load testing, and comprehensive observability to ensure graceful fault handling at scale.

Observability · chaos-engineering · distributed-systems

0 likes · 27 min read


Architects Research Society

Jul 7, 2023 · Operations

Design Patterns and Principles for Building Large‑Scale Systems

This article outlines key design patterns and principles—such as scalability, idempotency, asynchronous processing, health checks, circuit breakers, feature flags, bulkheads, service discovery, retries, metrics, rate limiting, back‑pressure, and canary releases—that enable large‑scale, reliable, and resilient distributed systems.

Observability · Reliability · distributed systems

0 likes · 16 min read


Meituan Technology Team

Jul 6, 2023 · Databases

Meituan Database Attack‑Defense Practice: Kernel Observability, Full SQL, and Index Optimization

The article details how Meituan built a MySQL autonomous platform by constructing kernel observability to split OnCPU/OffCPU wait time, capturing full SQL directly from the kernel with compression, designing a safe exception‑handling workflow, and generating cost‑based index‑tuning suggestions—including what‑if analysis and workload‑driven recommendations—to enable comprehensive SQL governance.

Full‑SQL · Index Tuning · MySQL

0 likes · 34 min read


Qunar Tech Salon

Jul 5, 2023 · Mobile Development

Long‑Term Client Crash Governance Mechanism at Qunar: Architecture, Detection, and Resolution Strategies

This article describes Qunar's systematic client crash governance framework, covering background challenges, APM‑based fast problem discovery, multi‑level alerting, common‑issue remediation, code‑level fixes for URL and Bundle size crashes, detection tools, code checks, automated testing, and the measurable improvements achieved in Android and iOS stability.

APM · Android · Crash Monitoring

0 likes · 19 min read


Didi Tech

Jul 4, 2023 · Cloud Native

eBPF Technology and Its Application in Didi's Cloud-Native Observability: HuaTuo Platform Practice

eBPF, a safe, high‑performance Linux kernel extension evolving from the 1993 Berkeley Packet Filter to modern dynamic tracing, underpins Didi’s HuaTuo platform, which consolidates bytecode management, fast data processing, stability self‑healing, and container insight to solve traffic replay, topology, security, and root‑cause analysis challenges across cloud‑native services, with plans to broaden business use and community collaboration.

Huatuo · Observability · Root Cause Analysis

0 likes · 12 min read


Efficient Ops

Jul 3, 2023 · Operations

Mastering Application Monitoring with Prometheus: Practical Metrics and Best Practices

This article explains how to design effective Prometheus metrics for various application types, covering golden metrics, label selection, naming conventions, bucket choices, and Grafana visualization tips to help engineers build reliable observability solutions.

Best Practices · Grafana · Observability

0 likes · 9 min read


Alibaba Cloud Native

Jun 30, 2023 · Cloud Native

Simplify Hybrid Cloud Kubernetes Management with Alibaba ACK One

This article explains how Alibaba Cloud ACK One enables unified registration and management of Kubernetes clusters across public clouds, private data centers, and edge environments, detailing core features, architecture, security measures, and observability capabilities for seamless multi‑cluster operations.

ACK One · Hybrid Cloud · Kubernetes

0 likes · 9 min read


Architecture Digest

Jun 27, 2023 · Operations

MyPerf4J – High‑Performance Java Performance Monitoring and Statistics Tool

MyPerf4J is a Java‑agent based, low‑overhead monitoring solution that provides real‑time method, memory, GC and class metrics, enabling developers to quickly locate performance bottlenecks and assess service capacity in both development and production environments.

Java · JavaAgent · Observability

0 likes · 6 min read


Efficient Ops

Jun 25, 2023 · Cloud Native

Master Loki on Kubernetes: Complete Deployment, Configuration, and Troubleshooting Guide

This article explains why Loki is a lightweight log aggregation solution, outlines its key advantages, describes its architecture and deployment modes, provides step‑by‑step Kubernetes deployment instructions with full configuration examples, and offers practical troubleshooting tips for common issues.

Kubernetes · Logging · Observability

0 likes · 14 min read


Efficient Ops

Jun 25, 2023 · Operations

How to Build a Next‑Gen “Big Operations” System for Reliability and Observability

This article outlines the evolution from manual operations to DevOps and SRE‑driven “big operations,” detailing system reliability and continuity practices, observability concepts, and the development of AIOps maturity standards, offering a comprehensive guide for building stable, efficient, and secure operational frameworks.

AIOps · Observability · Operations

0 likes · 14 min read


dbaplus Community

Jun 24, 2023 · Operations

How Bilibili Scales Capacity: VPA, HPA, and Cost‑Saving Strategies

This article summarizes Zhang He’s Bilibili SRE talk on building a capacity‑management system that visualizes resource usage, reduces costs, improves stability, and leverages Kubernetes VPA, HPA, pooling, and quota management to support massive live‑stream events and rapid feature releases.

Cost Optimization · HPA · Kubernetes

0 likes · 21 min read


SQB Blog

Jun 16, 2023 · Operations

Boost Java Performance: Optimize JFR Analysis with Flame Graphs and Async‑Profiler

This article explores the evolution of continuous performance profiling, explains why traditional tracing falls short, and details a series of optimizations—including batch processing, object‑reference serialization, aggregation insertion, and multi‑chunk handling—to dramatically reduce memory usage and speed up Java Flight Recorder analysis using async‑profiler and flame graphs.

JFR · Java · Observability

0 likes · 13 min read


Bitu Technology

Jun 14, 2023 · Operations

Getting Started with eBPF: Concepts, Examples, and Security Considerations

This article reviews the fundamentals of eBPF, explains its architecture and tracing mechanisms such as USDT, uprobes, and TC hooks, provides practical code examples, discusses security aspects, and lists notable open‑source projects that leverage eBPF for performance and observability.

Linux · Observability · Performance

0 likes · 9 min read


Efficient Ops

Jun 1, 2023 · Operations

How Tencent’s On‑Call System Transforms Incident Management and Quality Ops

This article explores how Tencent builds and practices its SRE quality operation system, focusing on On‑Call incident management, standardized channels, alert handling, data quality models, and the resulting improvements in reliability, MTTR reduction, and data‑driven decision making.

Observability · On-Call · Operations

0 likes · 14 min read


MaGe Linux Operations

May 27, 2023 · Operations

Choosing the Right Log Collection Tool: Logstash vs Fluentd, Fluent Bit & Vector

This article compares four popular open‑source log collection tools—Logstash, Fluentd, Fluent Bit, and Vector—examining their key features, performance, resource usage, scalability, security, and ecosystem to help enterprises select the most suitable solution for their specific logging needs.

Fluent Bit · Fluentd · Logstash

0 likes · 6 min read


Laravel Tech Community

May 23, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Advantages and Disadvantages

This article provides a detailed comparative overview of nine popular log management solutions—including Filebeat, Graylog, LogDNA, ELK, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their core features, pricing models, strengths, and weaknesses to help readers choose the most suitable tool for their environment.

Datadog · ELK · Graylog

0 likes · 14 min read


Efficient Ops

May 22, 2023 · Operations

What’s Driving China’s AIOps Evolution? Insights from the 2023 Survey

The 2023 China AIOps Status Survey, launched by CAICT and the Cloud Computing Open Source Industry Alliance, gathers input from over 60 enterprises to reveal current intelligent‑operations practices, observability adoption, generative AI prospects, and best‑practice case studies, while inviting participants to shape the upcoming report.

AIOps · Generative AI · Industry Survey

0 likes · 9 min read


Alibaba Cloud Developer

May 18, 2023 · Operations

Why Gray Releases Fail: A Real-World Bug and an MVP Gray Release Blueprint

This article examines a subtle gray‑release bug that caused message loss due to mismatched environment configurations, analyzes its root causes, and proposes a minimum‑viable‑product gray‑release design with practical strategies, observability tips, and configuration examples to ensure safe, incremental rollouts.

Deployment · Observability · configuration

0 likes · 21 min read


Efficient Ops

May 17, 2023 · Operations

How JD Built a Scalable H5 Observability Platform to Boost Performance and Reduce Costs

This article details JD's end‑to‑end H5 observability solution, covering the challenges of hybrid app development, the design of a three‑stage UEM platform, deep active and passive monitoring, automated quality gates, and real‑world case studies that demonstrate cost savings and performance improvements.

Frontend · H5 · Hybrid App

0 likes · 15 min read


MaGe Linux Operations

May 11, 2023 · Cloud Native

Master Distributed Tracing in Go with OpenTelemetry – A Practical Guide

In modern cloud‑native applications, distributed tracing is essential for pinpointing errors across microservices, and OpenTelemetry provides a standardized framework for collecting and analyzing trace data, with a hands‑on Go implementation demonstrated in an upcoming expert-led workshop.

Distributed Tracing · Observability · OpenTelemetry

0 likes · 5 min read


Tencent Cloud Developer

May 8, 2023 · Cloud Native

Modernizing Tencent Cloud Log Service (CLS): Cloud‑Native Architecture, Challenges, and Benefits

Tencent Cloud Log Service was modernized by migrating over 95 % of its components to a cloud‑native stack of containers, Kubernetes, and declarative APIs, addressing chaotic infrastructure, stateful‑to‑stateless conversion, configuration drift, upgrade risk, elastic scaling, traffic protection and observability, which cut costs by more than 20 million CNY, reduced scaling latency by 90 %, and achieved over 99.99 % availability with petabyte‑scale burst handling.

Elastic Scaling · Log Service · Observability

0 likes · 15 min read


MaGe Linux Operations

May 7, 2023 · Operations

How Meta’s SLICK Transforms SLO Management for Reliable Services

This article explains how Meta built SLICK, a centralized SLO/SLI platform that improves service reliability through discoverability, long‑term insights, integrated workflows, and scalable architecture, and shares real‑world examples and lessons learned from its deployment across thousands of services.

Meta · Observability · Reliability

0 likes · 13 min read


Zhengcaiyun Tech

Apr 29, 2023 · Cloud Native

Understanding Observability: Challenges, Principles, and OpenTelemetry Architecture

The article explains how growing system complexity drives the need for observability, outlines the three pillars of logs, traces, and metrics, compares traditional stability stacks with modern observability, and details OpenTelemetry's design, advantages, and implementation considerations for cloud‑native environments.

Observability · OpenTelemetry · Stability

0 likes · 16 min read


DataFunSummit

Apr 29, 2023 · Operations

Application Monitoring Principles and Non‑Intrusive Data Collection at Huya

This article explains the fundamentals of distributed application monitoring, describes Huya's non‑intrusive data‑collection techniques using SDKs and plugins, outlines the design and correlation of observable metrics, and demonstrates practical results and troubleshooting scenarios for backend services.

Distributed Tracing · Metrics Design · Observability

0 likes · 16 min read


Qunar Tech Salon

Apr 24, 2023 · Operations

Design and Evolution of Qunar's Watcher Enterprise Monitoring Platform

The article details the background, architecture, core features, alert governance, trace integration, and cloud‑native evolution of Watcher, Qunar's internally built, highly scalable monitoring platform that unifies application‑level metrics, alerting, and observability across thousands of services and containers.

Alerting · Observability · Trace

0 likes · 19 min read


ITPUB

Apr 23, 2023 · Cloud Native

How Kindling Leverages eBPF to Reach 1‑5‑10 Observability Targets

This article examines the difficulty of achieving the 1‑5‑10 observability goal, reviews current tracing, logging, and metrics tools, introduces the open‑source Kindling project’s eBPF‑based trace‑profiling approach, and walks through several real‑world use cases that demonstrate faster root‑cause analysis in cloud‑native environments.

Kindling · Observability · Performance

0 likes · 16 min read


Alibaba Cloud Native

Apr 21, 2023 · Backend Development

What’s New in Dubbo 3.2? Deep Dive into Cloud‑Native Features and Performance Boosts

Dubbo 3.2 introduces native REST support, enhanced observability with Metrics and Tracing, GraalVM Native Image compatibility, JDK 17/21 readiness, significant RPC performance gains, and a smooth upgrade path, all aimed at strengthening its role as a cloud‑native RPC framework for microservices.

Java · Observability · Performance

0 likes · 14 min read


Qunar Tech Salon

Apr 19, 2023 · Operations

Heimdall Exception Statistics System: Architecture, Implementation, and Practice

This article describes the design, implementation, and evolution of Heimdall, an exception‑statistics platform built on Kafka, Flink, and HBase that provides minute‑level anomaly aggregation, stack trace querying, and integration with release and alerting workflows to improve service reliability across thousands of micro‑services.

Exception Monitoring · Kafka · Observability

0 likes · 14 min read

Efficient Ops

Apr 12, 2023 · Operations

Building Highly Available Prometheus Monitoring with Thanos: A Practical Guide

This article explains why native Prometheus HA solutions fall short for large, multi‑region clusters and shows how to use Thanos components—including sidecar, query, store gateway, and compactor—to achieve long‑term storage, unlimited scaling, a global view, and non‑intrusive integration with existing Prometheus deployments.

High Availability · Kubernetes · Observability

0 likes · 22 min read

dbaplus Community

Apr 5, 2023 · Cloud Native

How Baidu’s Search Platform Achieves Billion‑Scale Observability in a Cloud‑Native Era

This article explains why observability is critical in cloud‑native architectures and describes how Baidu’s search middle‑platform handles hundred‑billion‑level traffic by implementing low‑cost real‑time metrics, distributed tracing, log querying and topology analysis, while tackling challenges of massive microservice scale, scenario‑level monitoring, and efficient resource usage.

Observability · Tracing · cloud-native

0 likes · 12 min read

System Architect Go

Apr 3, 2023 · Cloud Native

Why Cilium Beats Flannel: Real‑World Kubernetes Networking Insights

The article analyzes how Cilium’s eBPF‑based architecture, advanced network policies, cluster‑wide traffic control, and observability tools like Hubble solved performance, security, and scalability challenges that Flannel and kube‑proxy could not meet in production Kubernetes environments.

CNI · Cilium · Kubernetes

0 likes · 12 min read

MaGe Linux Operations

Mar 30, 2023 · Operations

Demystifying PromQL: How Nested Functional Queries Work in Prometheus

This article explores the structure and evaluation of PromQL queries, covering its nested functional language nature, expression types, time handling with instant and range queries, and practical examples using the PromLens visualizer, helping readers grasp how Prometheus processes and types queries.

Observability · PromQL · query language

0 likes · 11 min read

ITPUB

Mar 29, 2023 · Databases

Beyond ACID: A Maslow‑Inspired Hierarchy of Database Needs

Drawing parallels with Maslow’s hierarchy, the article outlines an eight‑level model of database requirements—from core kernel correctness and ACID to advanced observability, automation, and the vision of a truly autonomous database—explaining how each tier maps to functionality, security, reliability, ROI, insight, control, and transcendence.

Databases · Observability · Performance

0 likes · 12 min read

Efficient Ops

Mar 28, 2023 · Operations

Why SRE Matters: Bridging Product Development and Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its responsibilities, how it complements product development, the software lifecycle perspective, and practical approaches to ensure system stability through controllability, observability, and best‑practice implementation.

Observability · Operations · Reliability Engineering

0 likes · 14 min read

Alibaba Cloud Native

Mar 28, 2023 · Cloud Native

How RocketMQ 5.0 Enables Distributed End‑to‑End Tracing with OpenTelemetry

This article explains how Apache RocketMQ 5.0 integrates standardized distributed tracing via OpenTelemetry, detailing the underlying span model, semantic conventions for messaging, automatic and manual instrumentation options, configuration steps, a complete example workflow, and how to export traces to Alibaba Cloud SLS and ARMS for observability.

Distributed Tracing · Observability · OpenTelemetry

0 likes · 17 min read

ITPUB

Mar 24, 2023 · Cloud Native

Why Open‑Falcon Stalled and How Cloud‑Native Monitoring Is Evolving

This article reviews the evolution of monitoring in the cloud‑native era, analyzes Open‑Falcon’s architecture, strengths, and shortcomings, explains why its development hit a bottleneck, and outlines the design principles and features of the Nightingale monitoring system as a modern, open‑source alternative.

Nightingale · Observability · Open-Falcon

0 likes · 15 min read

Top Architect

Mar 22, 2023 · Operations

Log Management, Observability, and APM: Concepts, Practices, and Tools

This article explains what logs are, when to record them, their value in large-scale systems, and how to build effective log‑management and observability platforms using APM concepts, including metrics, tracing, ELK, Prometheus, and custom tooling for distributed architectures.

APM · ELK · Logging

0 likes · 20 min read

Architect

Mar 21, 2023 · Operations

Log Management, Observability, and APM Practices in Distributed Systems

This article explains what logs are, when to record them, their value in large‑scale architectures, and how to build effective logging, metrics, and tracing platforms using tools such as ELK, Prometheus, and SkyWalking, while also presenting good and bad logging practices and sample batch‑log retrieval code.

APM · ELK · Logging

0 likes · 20 min read

DevOps Cloud Academy

Mar 21, 2023 · Cloud Native

Robusta: An Open‑Source Python Platform for Kubernetes Troubleshooting and Automated Incident Response

Robusta is a Python‑based open‑source platform that layers on top of monitoring stacks like Prometheus to automatically detect, diagnose, and remediate Kubernetes alerts through built‑in automations, optional web UI, and Helm‑based installation for cloud‑native environments.

Automation · Kubernetes · Observability

0 likes · 7 min read

New Oriental Technology

Mar 10, 2023 · Cloud Native

Middleware PaaS on Kubernetes: Architecture, Benefits, and IP Reservation Challenges

This article explains how the New Oriental architecture team migrated middleware services like Redis, Kafka, and RocketMQ to Kubernetes, detailing the benefits over traditional PaaS, the Capo IP reservation solution for network stability, and the resulting operational, observability, and resource utilization improvements.

Kubernetes · Middleware · Network

0 likes · 18 min read

dbaplus Community

Mar 8, 2023 · Operations

Why Logging Matters: Building Effective Distributed Log Operations and Observability

This article explains what logs are, when and why to record them, their value in large‑scale systems, the challenges of log management in micro‑service architectures, and how to design observability platforms using metrics, logging, tracing, and tools such as ELK, Prometheus, OpenTracing, and SkyWalking.

APM · Logging · Observability

0 likes · 21 min read

AntTech

Mar 7, 2023 · Cloud Native

Introduction to HoloInsight: A Cloud‑Native Lightweight Observability Platform

HoloInsight is an open‑source, cloud‑native observability platform derived from Ant Group's AntMonitor, offering integrated log‑based monitoring, business metric analysis, and AI‑driven AIOps capabilities while providing a lightweight, modular architecture and extensive extensibility for modern software stacks.

AIOps · Observability · cloud-native

0 likes · 13 min read

DataFunSummit

Mar 4, 2023 · Operations

Full‑Chain Monitoring and Trace System at Huolala: Evolution, Architecture, and Visualization

This article details how Huolala built a comprehensive full‑chain monitoring and tracing platform, covering the historical evolution of observability tools, the company’s multi‑stage monitoring architecture, bytecode‑enhanced instrumentation, trace sampling strategies, and a "what‑you‑see‑is‑what‑you‑get" visualization approach.

Observability · Prometheus · SkyWalking

0 likes · 15 min read

ByteDance SYS Tech

Feb 28, 2023 · Cloud Native

How ByteDance’s ARES Boosts Cloud‑Native Resilience with Chaos Engineering

This article explains ByteDance’s end‑to‑end chaos engineering practice for cloud‑native environments, covering its background, principles, comparison with traditional testing, the evolution of its internal platforms, and a detailed look at the Application Resilience Enhancement Service (ARES) and its core features.

Fault Injection · Kubernetes · Observability

0 likes · 17 min read

Alibaba Cloud Native

Feb 27, 2023 · Cloud Native

What’s Next for Microservices? Highlights from the Beijing Cloud Native Meetup

The Beijing "Microservices x Container Open Source Developer Meetup" gathered over 100 developers and core maintainers of leading cloud‑native projects to discuss next‑generation microservice architectures, static compilation, service governance, multi‑cluster management, observability, and more, providing deep technical insights and real‑world examples.

Kubernetes · Observability · cloud-native

0 likes · 11 min read

Architecture & Thinking

Feb 21, 2023 · Operations

Why Logging Matters: Building Distributed Log Operations & Observability

This article explores why logs are essential in software development, when to record them, their value for debugging, performance, security and business decisions, and how distributed architectures require robust log‑operation tools such as ELK, Prometheus, and tracing systems to achieve effective observability.

APM · ELK · Logging

0 likes · 23 min read

Baidu Geek Talk

Feb 20, 2023 · Operations

Deep Dive into Logging Operations and Observability in Distributed Systems

The article examines logging’s critical role in distributed systems, detailing its purpose, severity levels, and value for debugging, performance, security, and auditing, while highlighting challenges of inconsistent formats and traceability, and reviewing observability pillars, ELK and tracing tools, and practical implementation best practices.

APM · ELK · Logging

0 likes · 19 min read

MaGe Linux Operations

Feb 18, 2023 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins in Modern Environments?

This article compares Prometheus and Zabbix, covering their histories, architectures, strengths, and weaknesses, and provides guidance on choosing the right solution based on factors such as scalability, container support, data storage, community activity, and deployment complexity.

Observability · Zabbix · cloud-native

0 likes · 8 min read

Alibaba Cloud Native

Feb 8, 2023 · Cloud Native

Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark

This article benchmarks Alibaba Cloud Prometheus against the open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns, revealing that while the open‑source version remains stable under light load, its CPU and memory usage grow non‑linearly with high cardinality, whereas Alibaba's managed service delivers higher compatibility, better query performance, and more predictable scaling.

Observability · Performance Benchmark · Prometheus

0 likes · 30 min read

Ops Development Stories

Feb 8, 2023 · Cloud Native

How to Deploy and Use SigNoz for Full‑Stack Observability on Kubernetes

This guide walks you through installing the open‑source SigNoz APM platform on Kubernetes, configuring its components, exploring distributed tracing with a demo app, and setting up logging and alerting for comprehensive cloud‑native observability.

APM · Kubernetes · Observability

0 likes · 8 min read

Cloud Native Technology Community

Feb 8, 2023 · Operations

FinOps Core Principles and the Rationale for Left‑Shift in Cloud Cost Management

The article explains how DevOps teams can adopt FinOps principles and a left‑shift approach—combining static and dynamic logging, fostering cross‑team collaboration, and integrating cost awareness into the software development lifecycle—to reduce cloud expenses, improve MTTR, and drive sustainable engineering productivity.

Cloud Cost · FinOps · Left Shift

0 likes · 10 min read

dbaplus Community

Feb 6, 2023 · Operations

How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services

This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.

AIOps · Observability · Operations

0 likes · 18 min read

Tencent Cloud Developer

Feb 3, 2023 · Cloud Computing

Cloud Load Testing: Strategies, Scenarios, and Practice Cases for High‑Traffic Events

Tencent’s cloud load‑testing platform simulates massive Chinese‑New‑Year traffic by offering concurrency and RPS modes, multi‑language test authoring, realistic data generation, and unified OpenTelemetry reporting, enabling early bottleneck detection, proactive scaling, and successful high‑load drills such as Mobile QQ and video services.

JavaScript · Observability · cloud testing

0 likes · 23 min read

Alibaba Cloud Native

Feb 3, 2023 · Operations

How eBPF Enables Zero‑Intrusion Monitoring for Multi‑Language Serverless Apps

This article explains how eBPF technology provides a unified, zero‑intrusion monitoring solution for Serverless applications across any language, detailing its architecture, workflow, and the advantages it brings to cloud‑native operations such as low cost, high performance, and multi‑protocol support.

Observability · Performance · Serverless

0 likes · 9 min read

Open Source Linux

Feb 3, 2023 · Cloud Native

Why eBPF Is the Secret Weapon Behind Modern Cloud‑Native Platforms

This article explains how eBPF extends kernel functionality to enable secure, high‑performance networking, observability, and programmable workloads in cloud‑native environments, detailing its architecture, use cases, market adoption, commercialization models, and the challenges and advantages that make it comparable to JavaScript for the kernel.

Linux · Observability · cloud-native

0 likes · 12 min read

Architects Research Society

Feb 2, 2023 · Backend Development

Medium’s Journey to Microservices: Principles, Strategies, and Lessons Learned

This article explains why Medium transitioned from a monolithic Node.js application to a microservice architecture, outlines the core design principles, shares practical strategies for building, deploying, and observing services, and warns about common pitfalls such as the microservice syndrome.

Backend Development · Deployment · Observability

0 likes · 23 min read

ITPUB

Jan 31, 2023 · Databases

How Pigsty Turns PostgreSQL into a Cost‑Effective Open‑Source RDS Alternative

Pigsty is an open‑source platform that upgrades PostgreSQL across six dimensions—observability, reliability, availability, maintainability, extensibility, and interoperability—delivering enterprise‑grade features, built‑in monitoring, automatic failover, backup, and performance tuning while cutting cloud database costs dramatically.

Cost Optimization · High Availability · Observability

0 likes · 22 min read

dbaplus Community

Jan 26, 2023 · Operations

Unified Metrics, Tracing, and Logging: A Financial Firm’s Path to Microservice Observability

Facing the challenges of distributed microservice architectures, a financial services company implemented a unified observability platform that combines metrics, tracing, and logging via OpenTelemetry and custom agents, enabling real‑time visualization, anomaly detection, and performance analysis across seven core business middle‑platforms.

Distributed Tracing · Logging · Observability

0 likes · 17 min read

MaGe Linux Operations

Jan 23, 2023 · Operations

Prometheus vs Zabbix: Which Monitoring Tool Wins in Modern Environments?

This article compares Prometheus and Zabbix, detailing their histories, architectures, performance, community support, and suitability for different environments, and concludes with guidance on choosing the right monitoring solution for physical servers, cloud-native deployments, and large‑scale container clusters.

Observability · Zabbix · cloud-native

0 likes · 9 min read

NetEase Yanxuan Technology Product Team

Jan 16, 2023 · Backend Development

Design and Implementation of a Business‑Facing Message Center Management Platform

The platform centralizes message‑center management for e‑commerce by adding end‑to‑end tracing, real‑time metrics, and unified logging, enabling business users to query message links, view dashboards, automate retries and approvals, dramatically reducing manual monitoring, improving completion rates above 90%, and paving the way for cost‑optimized, data‑driven operations.

Logging · Observability · devops

0 likes · 15 min read

Code Ape Tech Column

Jan 14, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Pros and Cons

This article provides a detailed comparison of nine popular log management solutions—including Filebeat, Graylog, LogDNA, the ELK stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing models, advantages, and disadvantages to help readers choose the right tool for their needs.

ELK · Observability · log management

0 likes · 16 min read

Su San Talks Tech

Jan 13, 2023 · Operations

How Distributed Tracing with SkyWalking Solves Microservice Performance Mysteries

This article explains the principles, architecture, and practical implementation of distributed tracing—especially SkyWalking—in microservice environments, showing how it identifies call chains, isolates performance bottlenecks, and integrates with existing monitoring systems while maintaining low overhead and non‑intrusive instrumentation.

Distributed Tracing · JavaAgent · Observability

0 likes · 20 min read

Open Source Linux

Jan 13, 2023 · Cloud Native

Unlocking Kernel Power: How eBPF Transforms Cloud‑Native Networking and Security

This article explains what eBPF is, why it matters for cloud‑native architectures, its key components and use‑cases in networking, observability and security, and explores current market momentum and commercialization models for leveraging eBPF in modern infrastructure.

Linux · Observability · cloud-native

0 likes · 9 min read

Top Architect

Jan 6, 2023 · Operations

Understanding Distributed Tracing and SkyWalking: Principles, Architecture, and Performance

This article explains the concept of distributed tracing, its importance in micro‑service architectures, the OpenTracing standard, and how SkyWalking implements automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance optimizations to provide low‑overhead observability for backend systems.

Distributed Tracing · Observability · OpenTracing

0 likes · 12 min read

Tencent Cloud Developer

Jan 5, 2023 · Cloud Native

QQ Music High-Availability Architecture Overview

QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.

High Availability · Observability · chaos engineering

0 likes · 24 min read

Architecture & Thinking

Jan 5, 2023 · Operations

How Critical Path Tracing Cuts Latency in Large Distributed Systems

This article explains why latency analysis is crucial for user experience in large distributed services, reviews common methods such as RPC monitoring, CPU profiling, and distributed tracing, and then dives deep into the principles, implementation, aggregation, storage, and visualization of critical path analysis, showcasing its practical impact in Baidu's App recommendation platform.

Latency analysis · Observability · critical path tracing

0 likes · 15 min read

Alibaba Terminal Technology

Jan 5, 2023 · Mobile Development

Why Mobile Trace Is Hard and How OpenTelemetry Solves It

This article explores the challenges of end‑to‑end tracing on mobile apps, explains why issues are hard to reproduce, and presents a four‑step solution using a unified OpenTelemetry standard, automated data linking, performance optimizations, and machine‑learning‑driven root‑cause analysis.

Android · Observability · OpenTelemetry

0 likes · 20 min read

Architecture Digest

Dec 30, 2022 · Operations

Vivo Monitoring Platform: Architecture, Evolution, and Future Directions

The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AIOps‑enabled system that ensures service availability across massive infrastructure.

AIOps · Observability · Platform

0 likes · 16 min read

Efficient Ops

Dec 29, 2022 · Operations

How eBay Scales Its Event Platform with ClickHouse and Kubernetes

This article details eBay's event platform architecture, explaining why a dedicated event system is needed, how ClickHouse provides high‑performance storage, the use of Kubernetes CRDs for cross‑region high availability, data routing, read/write separation, and query optimizations with LogQL.

ClickHouse · Event Platform · High Availability

0 likes · 18 min read

Meituan Technology Team

Dec 29, 2022 · Artificial Intelligence

Top 20 Most Popular Meituan Tech Blog Articles of 2022

Meituan’s technology team highlights its twenty most‑read 2022 blog posts, spanning observability, system design, data governance, AI, cloud‑native engineering, and practical innovations such as visual log tracing, Kafka scaling, functional programming, Elasticsearch optimization, CI/CD pipelines, and advanced object‑detection frameworks.

2022 Highlights · Artificial Intelligence · Data Governance

0 likes · 13 min read

Tencent Cloud Developer

Dec 28, 2022 · Operations

Technical Architecture, Observability, and Operational Practices of Tencent Health Code System

The article details how Tencent’s health‑code platform leveraged a cloud‑native, serverless architecture, extensive observability (Prometheus, Grafana, RUM), rigorous capacity testing, chaos engineering, and ITIL‑based change management to sustain billions of page views, support massive concurrency, and ensure reliable, scalable epidemic‑control services.

Health Code · Observability · Operations

0 likes · 16 min read

IT Architects Alliance

Dec 24, 2022 · Operations

Unlocking Linux Observability: A Hands‑On Guide to eBPF with Real‑World Examples

This article introduces eBPF, explains its origins and how it extends BPF for kernel‑level observability, compares it with SystemTap and DTrace, outlines common use cases, details its loading‑compile‑execute workflow, and provides step‑by‑step Python/BCC examples with installation instructions and advanced latency measurement code.

BCC · Linux · Observability

0 likes · 21 min read

MaGe Linux Operations

Dec 23, 2022 · Operations

How to Build an Enterprise‑Grade Observability System for Reliable SRE

This article explains how enterprises can design and implement a comprehensive observability platform—covering metrics, logs, tracing, fault response, post‑mortems, testing, capacity planning, and automation—to improve system reliability and user experience.

Automation · Observability · SRE

0 likes · 16 min read

ITPUB

Dec 20, 2022 · Operations

How We Scaled SkyWalking to Billions of Segments: A Full‑Stack Monitoring Journey

This article recounts a year‑long, hands‑on experience of deploying and continuously optimizing Apache SkyWalking for full‑link monitoring in a large micro‑service environment, covering the motivations, architecture choices, pre‑research, POC integration, and a series of performance‑tuning steps that brought query latency over billions of stored segments down to the millisecond level.

APM · Full-Stack Monitoring · Observability

0 likes · 21 min read
