Tagged articles
969 articles
NetEase Yanxuan Technology Product Team
Jan 16, 2023 · Backend Development

Design and Implementation of a Business‑Facing Message Center Management Platform

The platform centralizes message‑center management for e‑commerce by adding end‑to‑end tracing, real‑time metrics, and unified logging. Business users can query message links, view dashboards, and automate retries and approvals, which sharply reduces manual monitoring, raises completion rates above 90%, and paves the way for cost‑optimized, data‑driven operations.

DevOps · Metrics · Observability
0 likes · 15 min read
Code Ape Tech Column
Jan 14, 2023 · Operations

Comparison of Common Log Management Tools: Features, Pricing, Pros and Cons

This article provides a detailed comparison of nine popular log management solutions—including Filebeat, Graylog, LogDNA, the ELK stack, Grafana Loki, Datadog, Logstash, Fluentd, and Splunk—covering their main features, pricing models, advantages, and disadvantages to help readers choose the right tool for their needs.

ELK · Log Management · Observability
0 likes · 16 min read
Su San Talks Tech
Jan 13, 2023 · Operations

How Distributed Tracing with SkyWalking Solves Microservice Performance Mysteries

This article explains the principles, architecture, and practical implementation of distributed tracing—especially SkyWalking—in microservice environments, showing how it identifies call chains, isolates performance bottlenecks, and integrates with existing monitoring systems while maintaining low overhead and non‑intrusive instrumentation.

Distributed Tracing · JavaAgent · Observability
0 likes · 20 min read
Top Architect
Jan 6, 2023 · Operations

Understanding Distributed Tracing and SkyWalking: Principles, Architecture, and Performance

This article explains the concept of distributed tracing, its importance in micro‑service architectures, the OpenTracing standard, and how SkyWalking implements automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance optimizations to provide low‑overhead observability for backend systems.

Distributed Tracing · Observability · OpenTracing
0 likes · 12 min read
Tencent Cloud Developer
Jan 5, 2023 · Cloud Native

QQ Music High-Availability Architecture Overview

QQ Music achieves high availability by layering redundant multi‑datacenter architecture, proactive chaos‑engineering toolchains, and comprehensive observability—including metrics, logging, tracing and profiling—while employing service grading, adaptive retry windows and EMA‑based dynamic timeouts to gracefully handle faults across its massive micro‑service ecosystem.

Distributed Systems · Microservices · Observability
0 likes · 24 min read
Architecture & Thinking
Jan 5, 2023 · Operations

How Critical Path Tracing Cuts Latency in Large Distributed Systems

This article explains why latency analysis is crucial for user experience in large distributed services, reviews common methods such as RPC monitoring, CPU profiling, and distributed tracing, and then dives deep into the principles, implementation, aggregation, storage, and visualization of critical path analysis, showcasing its practical impact in Baidu's App recommendation platform.

Observability · critical path tracing · latency analysis
0 likes · 15 min read
Alibaba Terminal Technology
Jan 5, 2023 · Mobile Development

Why Mobile Trace Is Hard and How OpenTelemetry Solves It

This article explores the challenges of end‑to‑end tracing on mobile apps, explains why issues are hard to reproduce, and presents a four‑step solution using a unified OpenTelemetry standard, automated data linking, performance optimizations, and machine‑learning‑driven root‑cause analysis.

Android · Observability · OpenTelemetry
0 likes · 20 min read
Architecture Digest
Dec 30, 2022 · Operations

Vivo Monitoring Platform: Architecture, Evolution, and Future Directions

The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AI‑ops enabled system that ensures service availability across massive infrastructure.

Observability · Reliability · aiops
0 likes · 16 min read
Efficient Ops
Dec 29, 2022 · Operations

How eBay Scales Its Event Platform with ClickHouse and Kubernetes

This article details eBay's event platform architecture, explaining why a dedicated event system is needed, how ClickHouse provides high‑performance storage, the use of Kubernetes CRDs for cross‑region high availability, data routing, read/write separation, and query optimizations with LogQL.

ClickHouse · Event Platform · Kubernetes
0 likes · 18 min read
Meituan Technology Team
Dec 29, 2022 · Artificial Intelligence

Top 20 Most Popular Meituan Tech Blog Articles of 2022

Meituan’s technology team highlights its twenty most‑read 2022 blog posts, spanning observability, system design, data governance, AI, cloud‑native engineering, and practical innovations such as visual log tracing, Kafka scaling, functional programming, Elasticsearch optimization, CI/CD pipelines, and advanced object‑detection frameworks.

2022 Highlights · Artificial Intelligence · Data Governance
0 likes · 13 min read
Tencent Cloud Developer
Dec 28, 2022 · Operations

Technical Architecture, Observability, and Operational Practices of Tencent Health Code System

The article details how Tencent’s health‑code platform leveraged a cloud‑native, serverless architecture, extensive observability (Prometheus, Grafana, RUM), rigorous capacity testing, chaos engineering, and ITIL‑based change management to sustain billions of page views, support massive concurrency, and ensure reliable, scalable epidemic‑control services.

Health Code · Observability · Operations
0 likes · 16 min read
IT Architects Alliance
Dec 24, 2022 · Operations

Unlocking Linux Observability: A Hands‑On Guide to eBPF with Real‑World Examples

This article introduces eBPF, explains its origins and how it extends BPF for kernel‑level observability, compares it with SystemTap and DTrace, outlines common use cases, details its loading‑compile‑execute workflow, and provides step‑by‑step Python/BCC examples with installation instructions and advanced latency measurement code.

BCC · Linux · Networking
0 likes · 21 min read
ITPUB
Dec 20, 2022 · Operations

How We Scaled SkyWalking to Billions of Segments: A Full‑Stack Monitoring Journey

This article recounts a year‑long, hands‑on experience of deploying and continuously optimizing Apache SkyWalking for full‑link monitoring in a large micro‑service environment, covering the motivations, architecture choices, pre‑research, POC integration, and a series of performance‑tuning steps that brought queries over billions of stored segments down to millisecond‑level latency.

APM · Full-Stack Monitoring · Observability
0 likes · 21 min read
Inke Technology
Dec 19, 2022 · Backend Development

How to Build a Highly Available, Stable, and Observable SMS Service

This article explains how to design a high‑availability SMS system by identifying stability bottlenecks, defining reliability goals, implementing failover strategies for Redis, MySQL and external services, establishing a comprehensive observability framework, and measuring key quality metrics to ensure 99.99% uptime.

Backend · Metrics · Observability
0 likes · 11 min read
Java Architecture Diary
Dec 8, 2022 · Operations

Why Grafana Tempo and TraceQL Are Game‑Changers for Lightweight Tracing

This article introduces Grafana Tempo, its integration with Grafana, Prometheus, Loki, and the new TraceQL query language, explaining how they provide a lightweight, scalable tracing solution for small‑to‑medium teams and enhance observability through powerful, data‑type‑aware queries.

Distributed Tracing · Grafana Tempo · Observability
0 likes · 6 min read
DeWu Technology
Dec 5, 2022 · Operations

Evolution of Application Monitoring at DeWu: From CAT to OpenTelemetry

After rebuilding its transaction system in 2020, DeWu progressed from the basic CAT monitoring tool to OpenTracing with Prometheus, and finally adopted OpenTelemetry to unify metrics, traces, and logs via a custom vmagent‑Kafka‑Flink pipeline, dynamic sampling, and extensible javaagents, positioning the platform for a performance‑analysis‑driven future.

CAT · Microservices · Observability
0 likes · 18 min read
ITPUB
Dec 4, 2022 · Databases

Can National Standards Accelerate the Growth of China's Domestic Databases?

The article examines whether establishing national standards for Chinese domestic databases can foster industry development, weighing the risks of over‑regulation against the benefits of standardized observability, data‑dictionary, cloud‑integration, and programming interfaces, while sharing real‑world migration experiences.

Chinese Databases · Database Standards · Observability
0 likes · 11 min read
Efficient Ops
Dec 1, 2022 · Operations

Why Choose Loki Over ELK? A Hands‑On Guide to Deploying and Using Grafana Loki

This article explains the motivations for selecting Grafana Loki instead of ELK/EFK, introduces its core concepts and features, provides step‑by‑step deployment instructions for Promtail and Loki, and demonstrates how to configure Grafana, query logs, and handle label indexing, dynamic tags, and high‑cardinality challenges.

Grafana · Kubernetes · Loki
0 likes · 15 min read
DataFunTalk
Nov 27, 2022 · Operations

Best Practices for Full‑Stack Operations Monitoring and Cost Reduction Using Alibaba Cloud Elasticsearch

This article presents a comprehensive, three‑part guide on the current state of full‑stack operations monitoring, common challenges and solutions, and a real‑world use case, illustrating how Alibaba Cloud Elasticsearch can improve observability, boost performance, and cut costs for complex distributed systems.

Cost Optimization · Elasticsearch · Observability
0 likes · 13 min read
Programmer DD
Nov 23, 2022 · Backend Development

Spring Boot 3.0.0: Key Updates and How to Get Ready

The article outlines the recent Spring 6.0 release, the cascade of updates across major Spring projects, and previews the upcoming Spring Boot 3.0.0, highlighting the first RC, the new aot.factories feature, and enhanced observability for Java developers.

Observability · Spring 6 · spring-boot
0 likes · 3 min read
ByteDance Terminal Technology
Nov 18, 2022 · Big Data

Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

Big Data · Distributed Tracing · Graph Database
0 likes · 21 min read
Alibaba Cloud Native
Nov 17, 2022 · Cloud Native

How RocketMQ Harnesses Prometheus for Full‑Stack Observability

This article explains how RocketMQ integrates with Prometheus and Grafana to provide comprehensive metrics, tracing, and logging, detailing the exporter architecture, deployment choices, span topology, dashboard examples, and ARMS‑based alerting for cloud‑native message‑queue observability.

ARMS · Cloud Native · Metrics
0 likes · 14 min read
21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
0 likes · 14 min read
DevOps Cloud Academy
Nov 13, 2022 · Cloud Native

Grafana Phlare: Open‑Source Continuous Profiling Database – Architecture, Features, and Kubernetes Deployment Guide

Grafana Phlare is an open‑source, horizontally scalable continuous profiling database that integrates with Grafana, offering easy installation, multi‑tenant support, and object‑storage‑backed long‑term storage, with detailed deployment instructions for both monolithic and micro‑service modes on Kubernetes using Helm.

Continuous Profiling · Grafana · Kubernetes
0 likes · 11 min read
Open Source Linux
Nov 7, 2022 · Cloud Native

Unlock Scalable Cloud‑Native Alerting with Grafana Mimir: Architecture & Setup

This article explains the current state of cloud‑native alerting, introduces Grafana Mimir as a horizontally scalable, multi‑tenant storage for Prometheus, details its architecture and components, and provides step‑by‑step guidance for installing, configuring, and operating Mimir in Kubernetes environments.

Alerting · Cloud Native · Kubernetes
0 likes · 24 min read
政采云技术
Nov 7, 2022 · Cloud Native

Deployment Architecture of a Government Procurement Cloud Platform Based on Dragonfly OS

The article details Zhengcaiyun's government procurement cloud platform, its large‑scale architecture, migration to the domestically‑adapted Dragonfly operating system, integrated cloud‑native operations, observability built on OpenTelemetry, and ongoing efforts to enhance security, performance, and ecosystem collaboration.

Dragonfly OS · Observability · cloud-native
0 likes · 7 min read
政采云技术
Nov 7, 2022 · Cloud Native

Deployment Architecture of a Government Procurement Cloud Platform Based on the Longxi Operating System

The article details Zhengcaiyun's government procurement cloud platform, its large‑scale deployment architecture built on the Longxi OS, covering cloud‑native design, domestic adaptation, observability, and operational practices that enable high‑performance, secure, and scalable public procurement services.

Observability · cloud-native · eBPF
0 likes · 6 min read
Alibaba Cloud Native
Nov 3, 2022 · Cloud Native

How to Leverage Alibaba Cloud Prometheus for Fine‑Grained Cloud Product Monitoring

This guide explains why native cloud monitoring falls short, how building custom Prometheus exporters adds overhead, and how Alibaba Cloud's fully managed Prometheus service—through enterprise cloud‑monitoring and self‑monitoring integration modes—provides ready‑to‑use exporters, agents, Grafana dashboards, and alert templates for dozens of cloud products.

Alibaba Cloud · Cloud Native · Grafana
0 likes · 12 min read
Mike Chen's Internet Architecture
Oct 24, 2022 · Backend Development

Understanding Zipkin: Principles, Architecture, Core Components, and Deployment for Distributed Tracing

This article explains why Zipkin is needed for microservice observability, describes its architecture, core components, trace and span model, workflow, and provides step‑by‑step Docker and JAR deployment instructions, helping developers quickly locate service bottlenecks and failures.

Distributed Tracing · Microservices · Observability
0 likes · 7 min read
Software Development Quality
Oct 23, 2022 · Operations

Top Observability Tools: Datadog, Grafana, Instana, New Relic, Prometheus

This article provides an overview of five leading observability solutions—Datadog, Grafana, Instana, New Relic, and Prometheus—detailing their core features, supported data sources, deployment models, and how they help teams monitor cloud‑native applications, infrastructure, and services to ensure reliability and performance.

DevOps · Observability · cloud-native
0 likes · 4 min read
Programmer DD
Oct 21, 2022 · Cloud Native

How Grafana Mimir Transforms Cloud‑Native Monitoring and Alerting

This article explains how Grafana Mimir provides a scalable, highly‑available, multi‑tenant long‑term storage for Prometheus, details its architecture and core components such as compactor, distributor, ingester, querier, query‑frontend and store‑gateway, and shows step‑by‑step installation, status checking, and Alertmanager configuration for cloud‑native environments.

Alertmanager · Cloud Native Monitoring · Grafana Mimir
0 likes · 22 min read
Alibaba Cloud Native
Oct 19, 2022 · Cloud Native

How to Monitor Non‑Kubernetes ECS Apps with Alibaba Cloud Managed Prometheus

This guide explains how to use Alibaba Cloud's fully managed Prometheus service to collect and visualize metrics from ECS‑based applications across pure VPC, hybrid VPC‑IDC, and multi‑cloud scenarios, detailing the pain points of self‑built solutions and providing step‑by‑step configuration instructions.

Alibaba Cloud · ECS · Observability
0 likes · 11 min read
Programmer DD
Oct 19, 2022 · Backend Development

Unlock Full Observability in Spring Boot 3: Micrometer Observation API Explained

This article walks through adding complete observability to Spring Boot 3 applications using Micrometer's Observation API, covering metrics, tracing, log correlation, configuration, code examples for both server and client, and even native image support for production-ready monitoring.

Metrics · Micrometer · Observability
0 likes · 23 min read
Top Architect
Oct 18, 2022 · Operations

Apache SkyWalking APM: Concepts, Docker Installation, and UI Guide

This article introduces Application Performance Management (APM), explains the features of Apache SkyWalking for micro‑service and cloud‑native monitoring, and provides step‑by‑step Docker‑compose installation, agent configuration, and a detailed walkthrough of the SkyWalking UI components.

APM · Docker · Microservices
0 likes · 13 min read
DeWu Technology
Oct 17, 2022 · Operations

High Availability: Principles and Practices for System Stability

High availability—measured in nines of uptime—requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, capacity planning, rapid scaling, traffic shaping, resource isolation, global protection, observability, and disciplined change management to achieve stable, resilient services.

Observability · capacity planning · change management
0 likes · 10 min read
Cloud Native Technology Community
Oct 17, 2022 · Cloud Native

A Three‑Step Approach to Understanding, Managing, and Preventing Kubernetes Failures

This article presents a practical three‑step methodology—understanding, managing, and preventing—to troubleshoot Kubernetes deployments, explains how to leverage monitoring, observability, and incident‑response tools, and offers guidance on fostering team collaboration and building resilient, self‑healing cloud‑native systems.

Cloud Native · Kubernetes · Observability
0 likes · 7 min read
Rare Earth Juejin Tech Community
Oct 8, 2022 · Operations

Complete Solution for Sentry Error and Performance Monitoring in Qiankun Micro‑Frontend Architecture

This article presents a complete solution for routing Sentry error and performance data to the correct micro‑frontend projects in a Qiankun architecture by intercepting transport, redistributing URLs, and distinguishing transaction types, with detailed code examples for both Sentry 6.x and 7.x versions.

JavaScript · Micro‑frontend · Observability
0 likes · 10 min read
Alibaba Cloud Native
Oct 4, 2022 · Cloud Native

How Service Mesh Redefines Cloud‑Native Networking, Security, and Observability

This article explains the fundamentals of service mesh as a cloud‑native infrastructure layer, covering its control‑plane and data‑plane architecture, sidecar and waypoint proxies, L4/L7 decoupling, eBPF acceleration, zero‑trust security, traffic management, observability, and real‑world deployment scenarios.

Cloud Native · Kubernetes · Observability
0 likes · 20 min read
DataFunSummit
Sep 28, 2022 · Big Data

Elasticsearch Time Series Engine: Practices, Challenges, and Alibaba Cloud TimeStream

This article presents a comprehensive overview of using Elasticsearch as a time series engine, covering its motivations, challenges, key features, Alibaba Cloud TimeStream optimizations such as columnar storage, LSM structures, downsampling, and integration with Prometheus and Grafana, while also discussing performance and cost considerations.

Big Data · Downsampling · Elasticsearch
0 likes · 15 min read
IT Architects Alliance
Sep 25, 2022 · Backend Development

12 Proven Strategies to Seamlessly Migrate Your Monolith to Microservices

This guide presents twelve practical steps—from understanding the trade‑offs and planning the transition to adopting monorepos, CI pipelines, API gateways, feature flags, and observability—that help teams safely decompose a large monolithic application into a robust microservices architecture.

CI/CD · Microservices · Observability
0 likes · 14 min read
Cloud Native Technology Community
Sep 23, 2022 · Cloud Native

What Cloud‑Native Networking Trends Kube‑OVN Reveals and How DeepFlow Enables Full‑Stack Observability

In this technical session, experts from Lingque Cloud and Yunshan Network discuss emerging cloud‑native networking trends through Kube‑OVN, demonstrate DeepFlow's full‑stack observability in Kube‑OVN environments, and answer a wide range of practical Q&A covering IP stability, underlay challenges, CNI support, and performance tuning.

CNI · Cloud Native Networking · DeepFlow
0 likes · 14 min read
IT Architects Alliance
Sep 23, 2022 · Operations

Which APM Tool Wins? A Deep Comparison of Zipkin, SkyWalking, and Pinpoint

This article analyzes full‑link monitoring in micro‑service architectures, outlines the goals and functional modules of tracing systems, explains core concepts such as Span, Trace, and Annotation, and then compares Zipkin, SkyWalking, and Pinpoint across performance impact, scalability, data analysis depth, developer transparency, and topology visualization.

APM · Comparison · Distributed Tracing
0 likes · 27 min read
Big Data Technology Architecture
Sep 17, 2022 · Databases

Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

This article describes how Bilibili redesigned its log service by replacing Elasticsearch with ClickHouse, introducing OpenTelemetry‑based logging, optimizing storage, query, and alerting components, and enhancing ClickHouse features such as configuration tuning, Map types, and implicit columns to achieve higher performance, lower cost, and better observability.

ClickHouse · Database Optimization · Observability
0 likes · 28 min read
Bilibili Tech
Sep 16, 2022 · Big Data

Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

Bilibili’s Log Service 2.0 replaces its Elastic‑Stack pipeline with an OpenTelemetry‑driven architecture that writes logs via high‑performance Go/Java SDKs to ClickHouse, delivering ten‑fold write throughput, two‑fold query speed, one‑third storage cost, a custom query gateway, visualization UI, and advanced alerting.

ClickHouse · Observability · OpenTelemetry
0 likes · 27 min read
Architect's Guide
Sep 14, 2022 · Backend Development

Architect’s Blueprint: Backend Architecture, Microservices, Message Queues, and Observability

This article presents a comprehensive backend architecture guide covering microservice fundamentals, domain‑driven design, gateway patterns, service registration, configuration centers, observability pillars, service mesh options, and a detailed comparison of major message‑queue technologies.

Backend Architecture · Observability · Service Mesh
0 likes · 27 min read
NetEase Yanxuan Technology Product Team
Sep 13, 2022 · Operations

How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices

This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.

Alerting · Full‑Link Tracing · Microservices
0 likes · 23 min read
Alibaba Cloud Developer
Sep 9, 2022 · Information Security

How to Build a Comprehensive Cloud‑Native Kubernetes Security Monitoring System

This article examines the evolving security risks of cloud‑native architectures, explains why traditional perimeter defenses are insufficient, introduces zero‑trust principles for Kubernetes, outlines common K8s threat vectors, and presents a complete data‑collection and monitoring solution based on the open‑source iLogtail agent.

Kubernetes · Observability · Zero Trust
0 likes · 30 min read
Efficient Ops
Sep 7, 2022 · Operations

How DeepFlow Automates Full‑Stack Observability for Cloud‑Native Environments

This article presents DeepFlow, an open‑source, highly automated observability platform that uses eBPF to provide zero‑code AutoMetrics and AutoTracing, integrates with Prometheus, OpenTelemetry and SkyWalking, and enables SRE, DevOps and NewOps teams to build full‑stack metrics and blind‑spot‑free tracing for cloud‑native applications.

DevOps · Metrics · Observability
0 likes · 20 min read
Tencent Cloud Developer
Sep 7, 2022 · Cloud Native

Why Build Probe Capabilities Based on OpenTelemetry for Cloud‑Native Observability

Building probe capabilities on OpenTelemetry gives cloud‑native teams a vendor‑neutral, standardized way to extend monitoring into full observability—supporting large‑scale, language‑specific instrumentation, plug‑and‑play plugins, and seamless integration with APM backends—so developers and operators can detect, debug, and predict faults across distributed containers.

APM · Cloud Native · Node.js
0 likes · 15 min read
Alibaba Cloud Native
Sep 6, 2022 · Cloud Native

What’s New in KubeVela 1.5? Deep Dive into Plugins, Observability, and Cloud Shell

Version 1.5 of the open‑source Cloud Native application delivery platform KubeVela introduces enhanced plugin specifications, built‑in observability with Prometheus‑Grafana, a browser‑based Cloud Shell, advanced Canary rollouts via OpenKruise, multi‑environment UI improvements, and performance optimizations, while moving toward CNCF incubation.

CI/CD · Cloud Native · KubeVela
0 likes · 16 min read
dbaplus Community
Sep 5, 2022 · Operations

How EyesTSDB Evolved into a Cloud‑Native, Second‑Level Monitoring Platform

This article details the evolution of NetEase's self‑built time‑series database EyesTSDB into a cloud‑native, second‑level monitoring solution, covering its architecture, core features, integration with VictoriaMetrics, custom plugin workflow, CMDB linkage, real‑world use cases, and future challenges.

CMDB integration · Metrics · Observability
0 likes · 21 min read
DeWu Technology
Sep 2, 2022 · Operations

Design and Implementation of Trace2.0 Distributed Tracing Platform

Trace2.0 is an OpenTelemetry‑based distributed tracing platform that collects billions of spans daily, routes data through a control plane, OTel Server, and Kafka to ClickHouse hot‑cold storage with tail sampling, achieving 66% cost reduction, 12× compression, sub‑second query latency, and plans to offload raw spans to object storage.

Backend Architecture · ClickHouse · Distributed Tracing
0 likes · 12 min read
Efficient Ops
Aug 31, 2022 · Operations

How Intelligent Operations and Observability Transform Cloud‑Native Environments

In this talk, Wu Yakun from Guance Cloud explains the shortcomings of traditional operations, introduces intelligent, data‑driven approaches for the cloud‑native era, and outlines how unified data collection, observability, and SLO‑based monitoring can dramatically improve fault detection and system reliability.

Intelligent Operations · Observability · SLO
0 likes · 16 min read
Architects Research Society
Aug 25, 2022 · Operations

Core Reliability Principles in the Google Cloud Architecture Framework

This article outlines the core reliability principles of the Google Cloud Architecture Framework, explaining key terms such as SLI, SLO, error budget, and SLA, and describing design and operational guidelines for defining reliability goals, building observability, ensuring high availability, creating robust processes, effective alerting, and collaborative incident management.

Cloud Computing · Error Budget · Observability
0 likes · 12 min read
Baidu Geek Talk
Aug 22, 2022 · Mobile Development

How Baidu Optimized Low‑End Device Startup Performance: A Deep Dive

This article explains how Baidu's performance team tackled the slowdown of mobile internet growth by defining low‑end devices, building observability, creating high‑efficiency tooling, redesigning key components such as KV storage and locks, and introducing a smart scheduling framework that together reduced Android cold‑start TTI by over 50% and iOS cold‑start TTI by more than 40%, while establishing a continuous anti‑degradation pipeline.

Mobile Development · Observability · Performance Optimization
0 likes · 20 min read
Architect's Guide
Aug 18, 2022 · Databases

42 Lessons Learned from Building a Production Database

This article translates and summarizes Mahesh Balakrishnan’s 42 practical insights on building a production database, covering customer focus, project management, design principles, code review, observability, research, and cultural practices for engineering teams.

Design · Infrastructure · Observability
0 likes · 11 min read
Efficient Ops
Aug 17, 2022 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to build a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines key system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, full‑link tracing, and ELK for observability and performance troubleshooting.

Full‑Link Tracing · Observability · Prometheus
0 likes · 13 min read
IT Architects Alliance
Aug 15, 2022 · R&D Management

Essential Practices for Effective Engineering Projects and R&D Management

This article outlines comprehensive guidelines for keeping customers happy, managing projects, designing robust APIs, conducting thorough code reviews, shaping strategic direction, ensuring observability, and fostering research, all aimed at building resilient and high‑performing engineering teams.

Code review · Observability · Project Management
0 likes · 12 min read
DevOps
Aug 12, 2022 · Operations

9 DevOps Best Practices and Common Anti‑Patterns

This article explains what DevOps is, why it matters, and presents nine practical best‑practice recommendations—including culture, CI/CD, testing, observability, automation, security, and IaC—while also highlighting common anti‑patterns to avoid for successful DevOps adoption.

Anti-Patterns · DevOps · Infrastructure as Code
0 likes · 13 min read
Huolala Tech
Aug 11, 2022 · Operations

How Huolala Built an AI‑Powered Intelligent Monitoring Platform at Scale

This article details Huolala's journey from basic monitoring to an AI‑driven intelligent observability platform, covering AIOps concepts, a comprehensive monitoring framework, practical implementations, automated alert analysis, lessons learned, and future directions for large‑scale operations.

DevOps · Huolala · Observability
0 likes · 18 min read
Java Architecture Diary
Aug 8, 2022 · Operations

How to Integrate Jaeger Tracing with Rainbond Using OpenTelemetry

This guide explains why distributed tracing is essential for micro‑service architectures, introduces Jaeger as an open‑source APM solution, and provides step‑by‑step instructions for deploying and configuring Jaeger on Rainbond with OpenTelemetry, including environment variables, service naming, and topology generation.

APM · Distributed Tracing · Observability
0 likes · 11 min read
Architecture Digest
Aug 2, 2022 · Cloud Native

Microservice Architecture and Design Patterns: Goals, Principles, and Decomposition Strategies

This article explains the four primary goals of microservice architecture, outlines essential design principles, and details a comprehensive set of decomposition and integration patterns—including business‑function, sub‑domain, transaction, Strangler, Bulkhead, Sidecar, API‑gateway, Aggregator, CQRS, Saga, observability, and deployment patterns—providing practical guidance for building resilient cloud‑native systems.

Architecture · Cloud Native · Microservices
0 likes · 18 min read
DevOps Cloud Academy
Jul 26, 2022 · Operations

9 DevOps Best Practices: What You Should Do and Not Do

This article outlines nine essential DevOps best practices, including fostering a collaborative, blameless culture and adopting CI/CD, automated testing, observability, and IaC, and highlights common anti‑patterns such as isolated DevOps teams, hero reliance, and unchecked tool sprawl.

CI/CD · DevOps · Observability
0 likes · 13 min read
dbaplus Community
Jul 24, 2022 · Fundamentals

Meta’s Secret to Near‑Zero Cache Inconsistency

Meta’s engineering team describes how they raised cache consistency from six‑nines to ten‑nines by defining precise invalidation semantics, building the Polaris observability service, and implementing systematic tracking of cache mutations, offering practical strategies that apply to any distributed cache such as Redis or TAO.

Consistency · Meta · Observability
0 likes · 17 min read
FunTester
Jul 24, 2022 · Operations

Boost Service Reliability with Chaos Engineering: Practical Steps & Evaluation

Chaos engineering, a discipline for experimenting on distributed systems, helps teams identify hidden weaknesses, improve high‑availability, and build confidence in production by defining stable states, injecting realistic failures, and measuring impact through observability metrics, with practical steps, tool choices, maturity stages, and evaluation methods.

Distributed Systems · Fault Injection · Observability
0 likes · 11 min read
dbaplus Community
Jul 21, 2022 · Operations

How Huolala Built an AI‑Powered End‑to‑End Monitoring Platform

This article details Huolala's journey from a fragmented monitoring stack to a unified, AI‑enhanced observability platform, covering AIOps concepts, the design of a comprehensive monitoring framework, concrete implementation of metrics, tracing, logging, alerting, and lessons learned for large‑scale operations.

DevOps · Observability · aiops
0 likes · 19 min read
Meituan Technology Team
Jul 21, 2022 · Backend Development

Visualized Full‑Chain Log Tracing for Complex Business Systems

The article analyzes the shortcomings of traditional ELK and distributed tracing for complex business systems, proposes a visualized full‑chain log tracing solution that organizes and dynamically links logs by business chain, and demonstrates its implementation and performance gains at Meituan’s content platform.

Backend · DSL · Distributed Systems
0 likes · 26 min read
Baidu Geek Talk
Jul 19, 2022 · Cloud Native

How OpenTelemetry and Jaeger Power Cloud‑Native Tracing

This article explains cloud‑native observability, defines its three pillars—metrics, tracing, and logging—details the OpenTelemetry tracing data model and Span structure, reviews industry implementations such as Jaeger and Alibaba Eagle Eye, and shares practical challenges and solutions from real‑world production use.

Alibaba Eagle Eye · Cloud Native · Distributed Systems
0 likes · 11 min read
IT Architects Alliance
Jul 18, 2022 · Operations

Comparison of Prometheus and Zabbix Monitoring Solutions

This article compares Prometheus and Zabbix, outlining their histories, architectures, storage models, configuration complexity, community activity, and suitability for different environments, and concludes with recommendations on when to choose each monitoring system.

Comparison · Observability · Operations
0 likes · 9 min read
Top Architect
Jul 8, 2022 · Cloud Native

Understanding Service Mesh and Istio: Architecture, Features, and Hands‑On Deployment

This tutorial explains the fundamentals of service mesh, outlines Istio's architecture and core components, walks through installing Istio on Kubernetes, demonstrates a sample microservice deployment with traffic‑management, security, and observability features, and discusses when to adopt a service mesh and its alternatives.

Cloud Native · Istio · Microservices
0 likes · 20 min read
Selected Java Interview Questions
Jul 6, 2022 · Operations

Grafana 9.0 New Features and Improvements Overview

Grafana 9.0 introduces a suite of usability enhancements—including a visual Prometheus query builder, a visual Loki LogQL generator, improved Explore‑to‑dashboard workflow, revamped heatmap panel, command palette, panel search, trace panel, navigation upgrades, and alerting refinements—aimed at simplifying observability, data visualization, and operational efficiency.

Alerting · Dashboard · Grafana
0 likes · 7 min read
Alibaba Cloud Native
Jul 5, 2022 · Cloud Native

Unlocking eBPF: How Kernel‑Level Observability Powers Modern Cloud‑Native Apps

This article explains what eBPF is, why it was created, its core characteristics, common use cases such as network optimization, fault diagnosis, security control and performance monitoring, and provides practical step‑by‑step guidance, tooling commands, program types, and ecosystem resources for leveraging eBPF in cloud‑native environments.

Cloud Native · Kubernetes · Linux
0 likes · 20 min read
Architect's Guide
Jul 5, 2022 · Backend Development

Architect’s Guide: Backend Architecture, Microservices, Service Mesh, and Message Queues

This comprehensive article reviews backend architectural concepts such as microservices design, service mesh, observability pillars, gateway patterns, service registration, configuration centers, and a detailed comparison of message‑queue technologies, providing practical guidance for architects and engineers.

Backend Architecture · Observability · Service Mesh
0 likes · 27 min read
dbaplus Community
Jul 4, 2022 · Operations

Why Most Monitoring Systems Fail: Lessons from a Veteran Ops Engineer

A seasoned operations professional shares personal experiences and hard‑earned insights on why traditional monitoring often becomes ineffective, how over‑automation and noisy dashboards hurt teams, and what a capability‑focused, user‑centric approach to observability should look like.

Observability · Operations · SRE
0 likes · 12 min read
AntTech
Jun 28, 2022 · Operations

AntMonitor: Evolution, Features, and Core Technologies of Ant Group’s Observability Platform

The article details Ant Group’s AntMonitor observability platform, covering its development timeline, holographic monitoring capabilities, integrated performance analysis, efficient data integration, built‑in AI‑driven analytics, Monitoring‑as‑a‑Service, and the underlying high‑performance time‑series database and cloud‑native architecture that support massive real‑time data processing.

CloudNative · Observability · TimeSeriesDatabase
0 likes · 17 min read
High Availability Architecture
Jun 24, 2022 · Backend Development

Improving Cache Invalidation and Consistency at Scale

Meta engineers describe the challenges of cache invalidation and consistency in large‑scale distributed systems, explain why stale caches are problematic, present their Polaris observability service and consistency‑tracking techniques, and detail how they raised TAO’s cache consistency from six‑nines to ten‑nines.

Consistency · Distributed Systems · Observability
0 likes · 17 min read
HaoDF Tech Team
Jun 21, 2022 · Operations

Evolution and High‑Availability Construction of the Haodafu Offline Message Push System

This article describes how the Haodafu offline push service grew from a simple PHP notification tool into a robust, highly‑available micro‑service platform by redesigning architecture, adopting vendor push channels, adding message‑queue reliability, implementing comprehensive monitoring, observability, and a fault‑diagnosis platform to ensure delivery rates and operational stability.

Mobile Backend · Observability · SRE
0 likes · 21 min read
Programmer DD
Jun 21, 2022 · Operations

Discover Grafana 9.0: Visual Query Builders, Heatmap Panel & More

Grafana 9.0 introduces a suite of usability enhancements—including visual Prometheus and Loki query builders, an Explore‑to‑dashboard workflow, a high‑performance heatmap panel, command‑palette navigation, and improved alerting—making data exploration, visualization, and monitoring more intuitive for developers and operators.

Dashboard · Grafana · Loki
0 likes · 8 min read
Architecture Digest
Jun 20, 2022 · Backend Development

Architectural Guide: Microservices, Service Mesh, Messaging, and Observability

This article presents a comprehensive architectural roadmap covering microservice fundamentals, design principles, service discovery, API protocols, gateway patterns, observability pillars, service mesh options, and a detailed comparison of modern message‑queue technologies, offering practical guidance for backend system design and selection.

Backend Architecture · Cloud Native · Microservices
0 likes · 28 min read
ITPUB
Jun 18, 2022 · Operations

How MDD and SRE Cut Mini‑Program Image‑Upload Failures from Days to Minutes

This article recounts a three‑day image‑upload outage in a mini‑program, analyzes the multi‑layer causes, and shows how combining Metrics‑Driven Development with SRE and a custom observability platform dramatically reduces diagnosis time and improves reliability.

Metrics-Driven Development · Mini Program · Observability
0 likes · 20 min read
Xingsheng Youxuan Technology Community
Jun 17, 2022 · Frontend Development

How Prism Transformed Front‑End Monitoring at Scale: Architecture, Challenges & Insights

This article details the design, challenges, and solutions behind Prism, a self‑built front‑end monitoring platform that collects multi‑device SDK data, processes it through Kafka, Flink and ClickHouse, visualizes metrics, integrates with A/B testing, and outlines future enhancements for broader enterprise adoption.

AB testing · Frontend · Observability
0 likes · 14 min read
Architecture Digest
Jun 17, 2022 · Cloud Native

Vivo Container Cluster Monitoring Architecture and Cloud‑Native Practices

This article describes Vivo's practical experience building a cloud‑native monitoring system for large‑scale container clusters, covering the shortcomings of traditional monitoring, the Prometheus‑centric ecosystem, high‑availability architecture, challenges faced, and future directions such as automation and AI‑driven operations.

Observability · Prometheus · VictoriaMetrics
0 likes · 13 min read
Meituan Technology Team
Jun 16, 2022 · Artificial Intelligence

Building a Quality Model for Meituan's Recommendation System

This article presents a request‑granularity quality model for Meituan's integrated recommendation system, linking data tables, algorithm models, services, and user requests, and details its metrics, defect taxonomy, calculation formulas, data‑lineage expansion, implementation, alert routing, and operational outcomes.

Data Lineage · Meituan · Observability
0 likes · 22 min read
vivo Internet Technology
Jun 15, 2022 · Cloud Native

Vivo Container Cluster Monitoring Architecture and Cloud‑Native Observability Practices

Vivo’s cloud‑native monitoring solution combines high‑availability Prometheus clusters, VictoriaMetrics storage, Grafana visualization, and a custom leader‑election adapter to deduplicate data while forwarding metrics to Kafka and OLAP systems, addressing large‑scale performance, scalability, and integration challenges and paving the way for AI‑driven AIOps.

Cloud Native Monitoring · Kubernetes · Observability
0 likes · 18 min read
dbaplus Community
Jun 13, 2022 · Operations

How We Built a Mini‑Program Observability Platform to Slash Incident Resolution Time

After a three‑day, ten‑person investigation into a mini‑program image‑upload failure, we designed and implemented an end‑to‑end observability platform using MDD and SRE principles, defining SLI/SLO, instrumenting client, network, gateway and backend layers, and visualizing metrics with Grafana, ClickHouse and Prometheus.

Grafana · MDD · Metrics
0 likes · 18 min read
Top Architect
Jun 12, 2022 · Backend Development

Comprehensive Guide to Backend Architecture: Microservices, Observability, Service Mesh, and Messaging

This article provides an in‑depth overview of modern backend architecture, covering microservice fundamentals, design principles, gateway patterns, service registration, configuration management, observability pillars, service mesh options, and a detailed comparison of popular message‑queue technologies.

Architecture · Backend · Messaging
0 likes · 29 min read