Tagged articles
969 articles
Meituan Technology Team
Jun 27, 2024 · Mobile Development

Meituan Mini‑Program Testability Practices and Implementation

Meituan’s Tech Salon Session 77 describes how the company built a generic JavaScript‑hooking SDK, packaged as an NPM module, that makes mini‑programs on platforms such as WeChat, Alipay and Kuaishou observable and controllable across the UI, storage, network and system layers. The SDK enables automated cache management, request mocking, state inspection and visual diff testing; the ticket team used it to reach over 30% test scenario coverage and 100% page coverage and to uncover hundreds of defects. The session also outlines plans to stabilize and expand the framework.

JavaScript · Mini Program · Observability
0 likes · 19 min read
dbaplus Community
Jun 24, 2024 · Operations

How Qunar Achieved Sub‑Second Monitoring to Slash Fault Detection Time to Under 1 Minute

Qunar’s Watcher monitoring platform was upgraded from minute‑level to second‑level precision, redesigning storage, data collection, and alerting pipelines, adopting VictoriaMetrics, enhancing client SDKs, and adding fine‑grained alarm rules, which reduced fault detection from four minutes to under one minute while improving reliability and scalability.

DevOps · Observability · Time Series Database
0 likes · 20 min read
Sanyou's Java Diary
Jun 24, 2024 · Operations

How Visualized Full‑Chain Log Tracing Transforms Complex Business Systems

This article explains a new visualized full‑chain log tracing solution that organizes business logs by logical flow, dynamically links them during execution, and provides a visual, searchable view of the entire business process, dramatically improving issue localization in large‑scale distributed systems.

Backend · Observability · Operations
0 likes · 26 min read
Cloud Native Technology Community
Jun 19, 2024 · Cloud Native

Lessons Learned from Migrating Applications to Kubernetes

This article recounts a two‑year journey of moving from Ansible‑based EC2 deployments to a Kubernetes‑centric platform, detailing motivations, migration strategies, operational challenges, tooling choices, cost considerations, and practical lessons for teams contemplating a similar cloud‑native transformation.

CI/CD · DevOps · Infrastructure Migration
0 likes · 19 min read
Bilibili Tech
Jun 18, 2024 · Frontend Development

Design and Implementation of a Front-End Observability System for Business Monitoring

The article describes a unified front‑end observability platform that standardizes data‑point collection via a common SDK, automatically generates health and business dashboards, integrates real‑time monitoring and heatmaps, and has been adopted on 140 pages, delivering faster first‑screen loads, lower error and bounce rates, and higher conversion.

Dashboard · Frontend · Observability
0 likes · 22 min read
DevOps
Jun 16, 2024 · Operations

Performance Engineering Challenges and Practices for Software‑Defined Vehicles

The article examines how the shift to Software‑Defined Vehicles introduces complex performance engineering challenges across software, hardware, and organizational domains, and proposes an engineering‑driven, continuous‑observability approach—including modeling, monitoring, iterative optimization, and specialized team structures—to sustainably improve automotive software performance.

Observability · Performance Optimization · SDV
0 likes · 17 min read
21CTO
Jun 7, 2024 · Artificial Intelligence

Why AI Gateways Are the Next Evolution of API Gateways

AI gateways have emerged as essential infrastructure for modern AI applications, offering specialized security, load balancing, cost management, and observability that go beyond traditional API gateways, and understanding their differences and deployment considerations is crucial for developers and ops teams.

AI Infrastructure · AI gateway · Cost Management
0 likes · 10 min read
Linux Code Review Hub
Jun 5, 2024 · Operations

Observing Virtio‑Net NIC Queues with eBPF: A Practical Guide

This article explains how to extend eBPF to make virtio‑net NIC queue metrics observable, walks through the front‑end send/receive flow, defines key queue indices, integrates the probes into the rtrace tool, and demonstrates fault detection with real‑time data.

Linux kernel · NIC queue · Observability
0 likes · 17 min read
Efficient Ops
Jun 4, 2024 · Operations

How Huya Unified Its Monitoring Platform with OpenTelemetry for Zero‑Cost Integration

This article details Huya's transition from fragmented, non‑standard monitoring solutions to a unified OpenTelemetry‑based platform, covering project background, pain points, design decisions, SDK architecture, data pipeline, storage, alerting, root‑cause analysis, and future plans, highlighting the benefits of standardization and zero‑cost service integration.

Huya · Metrics · Observability
0 likes · 13 min read
Alibaba Cloud Observability
May 29, 2024 · Cloud Native

Why iLogtail Needed a Complete Architecture Overhaul and How It Was Done

This article explains the evolution of iLogtail from a single‑file collector to a multi‑language, plugin‑based observability pipeline, outlines the motivations for refactoring, describes the new unified data model, plugin abstractions, pipeline design, configuration management, hot‑reload mechanisms, and the separation of enterprise and open‑source code, providing a comprehensive view of the architectural upgrade.

Configuration Management · Golang · Observability
0 likes · 43 min read
Alibaba Cloud Observability
May 29, 2024 · Operations

How Alibaba Cloud Prometheus Enables Ultra‑Fast Host Monitoring for Supercomputing

This article explains the unique business characteristics of large‑scale supercomputing workloads, outlines the observability challenges they pose, and details how Alibaba Cloud Prometheus host monitoring provides automated service discovery, rapid probe deployment, fine‑grained metrics, and ready‑to‑use Grafana dashboards to achieve second‑level monitoring at massive scale.

Observability · Supercomputing · host monitoring
0 likes · 15 min read
Alibaba Cloud Observability
May 29, 2024 · Cloud Native

How to Achieve End-to-End Cloud Native Tracing and Solve the 3 Major Challenges

This article explains why distributed tracing is essential for modern cloud‑native systems, outlines the three toughest problems—instrumentation, data collection, and context propagation—and shows how Alibaba Cloud ARMS and OpenTelemetry provide a comprehensive, multi‑language solution for end‑to‑end traceability.

ARMS · Alibaba Cloud · Distributed Tracing
0 likes · 14 min read
Cognitive Technology Team
May 23, 2024 · Operations

eBPF + LLM: Building the Infrastructure for Observability Agents

The article explains how zero‑intrusion eBPF provides full‑stack, high‑quality observability data that, when combined with large language models, enables AI‑driven agents to automate ticket handling, change impact analysis, and vulnerability triage, dramatically improving operational efficiency.

AI Agent · Distributed Tracing · LLM
0 likes · 17 min read
DataFunSummit
May 22, 2024 · Operations

Building an Observability System: Practices and Solutions from Yanhuang Data

This article explains how to build a robust observability system for cloud‑native microservice architectures, detailing the three core signals—metrics, traces, and logs—common challenges such as complexity and data silos, and presents Yanhuang Data’s integrated platform with unified data collection, storage, analysis, and visualization solutions.

Kubernetes · Metrics · Observability
0 likes · 23 min read
Tencent Cloud Developer
May 21, 2024 · Operations

Why Prometheus Metrics Aren’t 100% Accurate – The Hidden Trade‑offs Explained

The article analyzes why Prometheus sometimes returns inaccurate metric values, revealing the design trade‑offs that favor efficiency over precision, and walks through common pitfalls in rate/increase calculations, histogram P99 estimation, and practical recommendations for choosing scrape intervals and query windows.

Histogram · Metrics · Observability
0 likes · 20 min read
ByteDance SYS Tech
May 9, 2024 · Operations

How Large‑Model Agents Transform AIOps: From Automation to Self‑Healing Operations

The presentation explains how large‑model agents empower AIOps by automating routine tasks, enhancing anomaly detection, fault diagnosis, and remediation, while outlining architectural components, multi‑agent collaboration, and future directions for building self‑healing, observability‑driven operations platforms.

Agent · Observability · Operations Automation
0 likes · 15 min read
Mike Chen's Internet Architecture
May 3, 2024 · Cloud Native

What Makes Cloud‑Native Architecture Essential for Modern Apps?

This article explains cloud‑native architecture, covering its definition, core concepts such as microservices, containerization, automation, storage, networking, and the guiding principles of service orientation, elastic scaling, and observability that together enable highly available, scalable, and agile applications.

Containerization · Kubernetes · Microservices
0 likes · 5 min read
Rare Earth Juejin Tech Community
Apr 29, 2024 · Artificial Intelligence

Building Enterprise‑Grade Retrieval‑Augmented Generation (RAG) Systems: Challenges, Fault Points, and Best Practices

This comprehensive guide explores the complexities of building enterprise‑level Retrieval‑Augmented Generation (RAG) systems, detailing common failure points, architectural components such as authentication, input guards, query rewriting, document ingestion, indexing, storage, retrieval, generation, observability, caching, and multi‑tenant considerations, and provides actionable best‑practice recommendations for developers and technical leaders.

Enterprise AI · LLM · Observability
0 likes · 32 min read
21CTO
Apr 22, 2024 · Operations

Discover Guider: A Python‑Powered Linux Observability Suite with 150+ Commands

Guider, a Python‑based Linux observability suite created by Hyundai engineer Peace Lee, offers over 150 command‑line tools for real‑time performance monitoring, resource tracing, automated reporting, and visualizations, enabling developers to diagnose slow startups, crashes, GPU stalls, and system resets with microsecond precision.

CLI · Linux · Observability
0 likes · 7 min read
dbaplus Community
Apr 21, 2024 · Cloud Native

What Cloud‑Native Tech Stack Should You Use in 2024? A Real‑World Guide

In 2024 the author reflects on a decade of backend evolution and shares a practical, experience‑driven guide to the cloud‑native stack—including Kubernetes, multi‑cloud strategies, DevOps tooling, service mesh, observability, and message‑queue choices—tailored to teams of different sizes.

DevOps · Observability · Service Mesh
0 likes · 12 min read
Alibaba Cloud Native
Apr 16, 2024 · Operations

Unlocking Log Insights: How SPL Brings Interactive Pipe‑Style Queries to Cloud‑Native Observability

This article explains how the SLS Processing Language (SPL) enables interactive, pipeline‑based log analysis in cloud‑native environments, covering the challenges of unstructured log data, Unix‑inspired exploration, SPL syntax, key commands, and practical examples for efficient querying and transformation.

Cloud Native · Observability · Pipeline
0 likes · 12 min read
Alibaba Cloud Observability
Apr 16, 2024 · Cloud Native

Mastering Interactive Log Exploration with SPL: Unix‑Inspired Pipelines in Cloud Native Environments

This article explains how the SLS Processing Language (SPL) brings Unix‑style pipelined, interactive log exploration to cloud‑native observability, detailing why logs are unstructured, how SPL’s unified syntax works, and which commands simplify field projection, enrichment, filtering, and semi‑structured data parsing.

Log Processing · Observability · SPL
0 likes · 12 min read
Alibaba Cloud Observability
Apr 12, 2024 · Cloud Computing

Why Alibaba Cloud SLS Beats Open‑Source ELK for Log Management

Alibaba Cloud Log Service (SLS) offers a serverless, high‑availability, low‑cost alternative to self‑built ELK stacks, providing comparable Elasticsearch and Kafka compatibility, superior storage, query, and alerting capabilities, and streamlined migration paths, making it a compelling choice for large‑scale observability workloads.

Cloud Service · ELK · Observability
0 likes · 13 min read
ByteDance Cloud Native
Mar 27, 2024 · Cloud Native

How ByteDance Optimized Its Metrics Agent for 70% CPU Savings

This article details how ByteDance's cloud‑native observability team tackled performance bottlenecks in their metricserver2 Agent—reducing memory copies, merging tiny packets, applying SIMD for tag parsing, and switching compression libraries—to cut CPU usage by over 10% and memory usage by nearly 20% while handling petabyte‑scale metric data.

Msgpack · Observability · SIMD
0 likes · 15 min read
Tencent Cloud Developer
Mar 21, 2024 · Backend Development

Backend Refactoring and Architecture Design of Tencent Docs Collection Form Service

Tencent Docs transformed its high‑traffic Collection Form by refactoring a monolithic C++‑style service into 19 loosely‑coupled vertical services with light‑heavy separation, database isolation, async Kafka pipelines, and full observability via Tianji, achieving dramatically improved stability, millisecond‑level sync, reliable export, and faster incident resolution.

Backend · Microservices · Observability
0 likes · 21 min read
DevOps
Mar 20, 2024 · Cloud Computing

Platform Engineering: Beyond Infrastructure – Core Pillars and Human Collaboration

The article explains that platform engineering extends far beyond basic infrastructure, highlighting its core pillars such as automation, composability, agility, observability, and the essential role of collaboration and culture in creating value‑driven, cloud‑native software delivery.

Cloud Computing · Collaboration · Observability
0 likes · 6 min read
Practical DevOps Architecture
Mar 15, 2024 · Operations

Comprehensive Practical Guide to Prometheus Configuration, Optimization, and Source Code Development

This multi‑chapter guide provides in‑depth, hands‑on instruction for configuring and optimizing all Prometheus components, exploring Kubernetes monitoring, source‑code analysis, custom exporter development, high‑availability setups, service discovery, resource‑efficient scraping, and integrating Thanos for long‑term storage.

Kubernetes · Observability · Operations
0 likes · 4 min read
Alibaba Cloud Developer
Mar 11, 2024 · Operations

Why iLogtail Needed a Complete Architecture Overhaul and How It Was Done

This article explains the motivations behind iLogtail's architectural redesign, details the evolution from a single‑file C++ collector to a modular pipeline with Golang plugins, outlines the refactor goals and implementation practices, and reflects on the challenges and outcomes of the six‑month effort.

Architecture · Golang · Observability
0 likes · 38 min read
Baidu Geek Talk
Mar 6, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Large‑Model Training with Real‑Time Observability and Fault Diagnosis

The article explains why collective communication is critical for distributed large‑model training, outlines the new requirements for system reliability, and introduces Baidu’s Collective Communication Library (BCCL), detailing its enhanced observability, fault‑diagnosis, stability, and performance optimizations that raise effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis
0 likes · 11 min read
DevOps
Mar 4, 2024 · Frontend Development

Building QQ Front-end Unified Access Layer: Architecture, Technical Choices, and Performance Insights

This article shares a decade‑long journey of designing and scaling the QQ front‑end unified access layer, covering business background, overall architecture, solution comparisons, core challenges, observability, and performance optimizations while reflecting on practical lessons for large‑scale front‑end systems.

Architecture · Frontend · Observability
0 likes · 10 min read
Efficient Ops
Mar 3, 2024 · Operations

Mastering Prometheus: From Metrics Collection to Alerting and Visualization

This comprehensive guide explains Prometheus' architecture, metric collection models, storage format, query language (PromQL), alerting workflow, configuration reload methods, metric types, custom exporters, and how to visualise data with Grafana, providing a complete end‑to‑end monitoring solution.

Grafana · Metrics · Observability
0 likes · 21 min read
Yum! Tech Team
Mar 1, 2024 · Operations

Building an Observability System Traffic Distribution Diagram

This article explains how to design and implement a traffic distribution diagram for an observability system, covering current cloud‑native tooling, data standardization, transformation, traffic‑flow modeling, aggregation, storage with ClickHouse, and visualisation techniques such as Sankey diagrams.

Cloud Native · Observability · data modeling
0 likes · 7 min read
Baidu Intelligent Cloud Tech Hub
Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis
0 likes · 11 min read
MaGe Linux Operations
Feb 29, 2024 · Operations

Quickly Set Up OpenTelemetry on Kubernetes: Installation, Modes & Config

This guide walks you through deploying OpenTelemetry in Kubernetes, covering the purpose of otel‑collector, installation via manifests or Helm, the three deployment patterns (No‑Collector, Agent, Gateway), running the otel‑demo, and detailed configuration of receivers, processors, exporters, connectors, extensions, and service pipelines.

Collector · Kubernetes · Observability
0 likes · 11 min read
OPPO Kernel Craftsman
Feb 23, 2024 · Mobile Development

Understanding Perfetto Data Flow Architecture and Reducing Trace Data Loss

Perfetto’s tracing system links multiple producers to a single consumer via shared‑memory buffers, where careful sizing of pages, chunks, and central buffers, along with tuned protobuf encoding and scheduling priorities, mitigates CPU overhead and prevents data loss, enabling reliable observability on Android devices.

Android · Data Flow · Observability
0 likes · 26 min read
Alibaba Cloud Native
Feb 22, 2024 · Cloud Native

Achieving 50% Cost Cut with Cloud‑Native Architecture: A Flexible Workforce Platform Case

Facing poor observability, high resource waste, and unstable releases, QingTuan’s flexible‑workforce platform transformed its monolithic and SOA systems into a cloud‑native micro‑service architecture using Alibaba Cloud ACK, MSE, ARMS, and Prometheus, achieving higher availability, elastic scaling, and up to 50% infrastructure cost reduction.

Architecture · Observability · cloud-native
0 likes · 22 min read
Alibaba Cloud Native
Feb 20, 2024 · Cloud Native

What’s New in iLogtail 2.0? A Deep Dive into the Updated Pipeline Architecture

iLogtail 2.0 replaces the monolithic, file‑oriented design of its predecessor with a modular pipeline configuration, new input/processor/output plugins, a refreshed API, SPL processing, finer‑grained parsing controls, nanosecond‑level timestamps, enhanced observability, and performance improvements, while providing compatibility guidance for both commercial and open‑source editions.

API · Cloud Native · Observability
0 likes · 17 min read
Tencent Cloud Developer
Feb 20, 2024 · Frontend Development

From Frontend to Full‑Stack: Architecture, Challenges, and Practices of the QQ Frontend Unified Access Layer

The veteran front‑end engineer chronicles a decade of building QQ’s large‑scale products, detailing how the new Frontend Unified Access Layer replaced fragmented SDKs with a high‑performance, scalable, secure gateway built on an internal http2rpc framework, while tackling legacy protocol coexistence, observability, alert fatigue, and targeted performance optimizations.

Frontend · Observability · Performance
0 likes · 10 min read
Efficient Ops
Feb 19, 2024 · Operations

Mastering Prometheus: Practical Tips for Effective Application Monitoring

This article explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, golden metrics, label conventions, naming rules, histogram bucket choices, and Grafana visualization tricks to help engineers build reliable observability pipelines.

Grafana · Metrics · Observability
0 likes · 10 min read
DevOps Cloud Academy
Feb 17, 2024 · Operations

Implementing Reusable GitHub Actions Workflows for Scalable CI at McDonald's

McDonald's engineering team built a fast, reliable, and flexible continuous integration system by leveraging reusable GitHub Actions workflows, centralizing CI code, defining a golden‑path pipeline, balancing developer autonomy, and adding observability across multilingual microservices, improving productivity and maintainability.

CI/CD · DevOps · GitHub Actions
0 likes · 7 min read
DaTaobao Tech
Jan 29, 2024 · Cloud Native

Observability: Logging, Metrics, and Tracing in Distributed Systems

Observability in distributed systems combines event logging, aggregated metrics, and request tracing, each with distinct trade‑offs in detail, storage, and overhead. The ELK stack dominates log and metric handling, while tracing solutions such as EagleEye and SkyWalking differ by protocol and language, which prompts many teams to adopt unified, cloud‑native platforms like Alibaba Cloud’s Log Service for lower cost, real‑time analysis and simplified management.

ELK · Metrics · Observability
0 likes · 32 min read
Linux Code Review Hub
Jan 29, 2024 · Cloud Native

How Minsheng Bank Built eBPF‑Based Observability for Cloud‑Native Services

The article details Minsheng Bank's step‑by‑step journey from traditional network monitoring to a full‑stack, zero‑intrusion observability platform built with DeepFlow, vTap, distributed data collection, and eBPF, illustrating concrete case studies and future plans for expanding business‑level monitoring.

Cloud Native · DeepFlow · Distributed Tracing
0 likes · 18 min read
MaGe Linux Operations
Jan 25, 2024 · Operations

Mastering Monitoring: From Concepts to Prometheus in Operations

This article explains monitoring fundamentals, distinguishes black‑box and white‑box approaches, outlines key metrics and their aggregation, and provides a comprehensive guide to Prometheus architecture, data model, query language, and practical examples for CPU, memory, and disk usage monitoring.

Metrics · Observability · Prometheus
0 likes · 18 min read
Architect
Jan 24, 2024 · Operations

Mastering End-to-End Tracing in Go Microservices with OpenTracing and Zipkin

This article walks through the complete design and implementation of full‑stack distributed tracing for Go‑based microservices, explaining correlation IDs, OpenTracing concepts, component roles, client and server code, database and service call tracing, compatibility issues, and best‑practice design guidelines.

Distributed Tracing · Go · Microservices
0 likes · 20 min read
Java Captain
Jan 15, 2024 · Operations

Java Distributed Tracing: Concepts, Principles, Implementation, and Application Scenarios

This article explains the concept of distributed tracing, outlines its underlying principles in Java, details step‑by‑step implementation using popular SDKs, and describes common application scenarios such as performance monitoring, fault diagnosis, complex event handling, traffic analysis, and system optimization.

Distributed Tracing · Fault Diagnosis · Microservices
0 likes · 5 min read
NetEase Cloud Music Tech Team
Jan 10, 2024 · Operations

Building Cloud Music's APM Metric Monitoring System Based on VictoriaMetrics

Cloud Music’s middleware team built the Pylon APM monitoring system on VictoriaMetrics. Collection combines exporters, vmagent, Nacos, Flink‑based pre‑aggregation recording rules and vminsert; querying goes through Grafana, a custom Proxy and vmselect. The system achieves millisecond‑level latency, metric‑trace correlation, stability improvements, and cost‑effective storage for nearly 700 million active time series.

APM monitoring · Flink · Metric Pre-aggregation
0 likes · 12 min read
Tencent Cloud Developer
Jan 9, 2024 · Operations

Tencent Cloud APM Full-Link Tracing Implementation and Best Practices

The article explains how Tencent Cloud APM implements full‑link tracing using OpenTelemetry standards, addresses challenges such as protocol compatibility, massive trace storage, and bytecode overhead with solutions like conversion gateways, tail sampling and thread profiling, and showcases best‑practice scenarios for topology analysis, front‑end/back‑end integration, and log‑trace correlation within the broader TCOP observability suite.

APM · Full‑Link Tracing · Observability
0 likes · 11 min read
Sanyou's Java Diary
Jan 8, 2024 · Cloud Native

How Distributed Tracing Solves Microservice Performance Mysteries with SkyWalking

This article explains the principles and benefits of distributed tracing systems, introduces OpenTracing standards, details SkyWalking’s architecture and mechanisms for automatic span collection, context propagation, unique trace IDs, sampling strategies, and performance impact, and shares practical implementation experiences and custom plugin development within a real‑world microservice environment.

Distributed Tracing · Microservices · Observability
0 likes · 20 min read
dbaplus Community
Jan 2, 2024 · Operations

How Xiaohongshu Scaled Its Metrics System Tenfold with Cloud‑Native Architecture

Facing exploding metric volumes, high resource consumption, and fragile operations, Xiaohongshu's observability team completely rebuilt its metrics pipeline using VictoriaMetrics, achieving ten‑fold performance gains, minute‑level scaling, high availability, cost reduction, and robust multi‑cloud active‑active deployment while preserving data safety and query speed.

Metrics · Observability · Prometheus
0 likes · 34 min read
Zuoyebang Tech Team
Dec 28, 2023 · Big Data

How We Scaled Our Data Platform by Migrating to Apache DolphinScheduler

Facing growing task volumes and diverse workload types, we upgraded our data development platform's scheduling engine to Apache DolphinScheduler, detailing the migration process, architectural enhancements, stability and observability improvements, multi‑tenant support, and the resulting performance gains and future roadmap.

Apache DolphinScheduler · Big Data · Data Platform
0 likes · 12 min read
Weimob Technology Center
Dec 26, 2023 · Operations

Rebuilding Our APM: Scalable Metrics & Alerts with VictoriaMetrics & VMAlert

This article details the complete redesign of our internal APM system, covering the motivations, architecture choices, metric collection pipeline, integration of VictoriaMetrics and VMAlert, metric and alert design principles, implementation steps, visualizations, performance gains, and future plans for scaling and SaaS‑ification.

APM · Alerting · Metrics
0 likes · 17 min read
Efficient Ops
Dec 24, 2023 · Operations

Avoid These 6 Common Prometheus Mistakes When Getting Started

This guide translates and condenses six frequent errors new Prometheus users make (high‑cardinality labels, losing valuable labels during aggregation, using bare selectors, omitting the "for" field in alerting rules, choosing too‑short rate windows, and applying rate‑related functions to the wrong metric types) and offers practical fixes to improve monitoring reliability.

Observability · PromQL · Prometheus
0 likes · 12 min read
Architect
Dec 22, 2023 · Operations

How Tencent Search Built a Multi‑Layered Stability Architecture to Slash MTTD and MTTR

The article details Tencent Search’s end‑to‑end stability engineering practice, covering a ten‑step architecture that combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous drills, and shows how these measures collectively reduced mean‑time‑to‑detect and mean‑time‑to‑recover by an order of magnitude while keeping service availability high.

Architecture · Observability · Resilience
0 likes · 32 min read
DevOps Cloud Academy
Dec 14, 2023 · Operations

CI/CD Observability via OpenTelemetry at Grafana Labs

The article explains the importance of CI/CD observability, outlines common pipeline problems, introduces Grafana's GraCIe plugin built on OpenTelemetry, and discusses how enhanced visibility can improve reliability, decision‑making, and future standardization across CI/CD platforms.

CI/CD · DevOps · Grafana
0 likes · 13 min read
Architect
Dec 13, 2023 · Industry Insights

How Bilibili Engineered a 1.2 B‑Viewer Live Stream for the LoL World Championship

This article details Bilibili's end‑to‑end technical planning, traffic‑estimation models, and concrete optimizations—including hotspot caching, traffic dispersion, long‑connection isolation, and automated fault‑injection—that enabled the S13 League of Legends finals to serve over 1.2 billion viewers with stable, low‑latency streaming.

Observability · Performance Optimization · Traffic Engineering
0 likes · 22 min read
DataFunTalk
Dec 13, 2023 · Databases

SelectDB Boosts GuanceDB Observability: Architecture Upgrade, Cost Reduction, and Performance Gains

This article details how SelectDB’s inverted‑index, Variant data type, and sampling capabilities were integrated into GuanceDB to replace Elasticsearch, achieving up to 70% storage cost reduction, 2‑4× query speed improvement, and a ten‑fold overall cost‑performance boost for log analytics and observability workloads.

Cloud Native · Log Analytics · Observability
0 likes · 20 min read
Qunar Tech Salon
Dec 12, 2023 · Backend Development

System Slimming at Qunar Travel: Reducing Code and Service Footprint by 50% Using Observability and Automation

This article presents Qunar Travel's "system slimming" project, describing how observability techniques, a two‑stage strategy, and automated tooling were used to identify and remove unused services and code, achieving a 50% reduction in code size, a 26% cut in services, and measurable improvements in reliability and release efficiency.

Microservices · Observability · backend optimization
0 likes · 20 min read
37 Interactive Technology Team
Dec 4, 2023 · Backend Development

Root Cause Analysis of Missing Trace Data in Go Services Using Prometheus Metrics and GZIP Compression

The missing trace data in two Go services was caused by the GoFrame tracing middleware recording the gzip‑compressed /metrics response body as a UTF‑8 string, which the OpenTelemetry exporter rejected as invalid UTF‑8; disabling Prometheus compression or decompressing the body before logging resolves the issue.

Gzip · Observability · OpenTelemetry
0 likes · 16 min read
Bilibili Tech
Dec 1, 2023 · Operations

Safe Production Practices: Change Management Platform Design and Implementation at Bilibili

After a series of change‑induced outages in early 2023, Bilibili instituted a comprehensive change‑management framework—including a preventive change platform, a central control system, quality and monitoring tools, strict gray‑release policies, observability checks, and rapid rollback mechanisms—to dramatically cut emergency incidents and embed a reliability‑first culture.

Observability · Reliability · SRE
0 likes · 16 min read
Architecture and Beyond
Nov 25, 2023 · Operations

Designing and Implementing an Effective Log System for Internet Startups

The article explains why comprehensive logging is essential for internet startups, outlines the three stages of a log system, details log levels, required fields, best‑practice principles, collection architectures such as local files and ELK, and how collected logs support monitoring, debugging, and analytics.

ELK · Log Management · Observability
0 likes · 12 min read
Programmer DD
Nov 24, 2023 · Backend Development

What’s New in Spring Boot 3.2? Explore Java 21 Features and Virtual Threads

Spring Boot 3.2, released shortly after Java 21, brings a host of enhancements such as virtual thread support, CRaC checkpoint restore, SSL bundle reloading, improved observability, new RestClient and JdbcClient, Jetty 12, Pulsar, Kafka and RabbitMQ SSL, redesigned nested JAR handling, Docker image build upgrades, and a comprehensive video walkthrough by Josh Long.

Docker · Java 21 · Kafka
0 likes · 7 min read
Ops Development Stories
Nov 20, 2023 · Operations

How eBPF Powers Next‑Gen Observability and Fault Diagnosis in Kubernetes

At KubeCon China 2023, experts Liu Kai and Dong Shandong presented a three‑part deep dive into Kubernetes observability challenges, demonstrating how eBPF enables comprehensive data collection across all stack layers, seamless integration, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and fault‑tree methods.

Cloud Native · Fault Diagnosis · Kubernetes
0 likes · 20 min read
Alibaba Cloud Native
Nov 17, 2023 · Cloud Native

How Dubbo-go’s New Triple Protocol Transforms Cloud‑Native Microservices

The article introduces Dubbo‑go 3.2’s comprehensive upgrade, focusing on the Triple protocol’s gRPC and HTTP compatibility, simplified API, service‑governance features, code examples for server and client, configuration‑driven deployment, built‑in observability, traffic‑management capabilities, and the modular plugin architecture.

Cloud Native · Observability · dubbo-go
0 likes · 14 min read
Huya Tech Engineering
Nov 10, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its SRE systems, enabling standardized data ingestion, correlation, and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑efficiency for large‑scale live streaming services.

DevOps · Observability · SRE
0 likes · 13 min read
AntTech
Nov 7, 2023 · Operations

ChaosMeta V0.6.0 Release: New Features, Lossless Injection, Automated Experiments, and Future Directions

ChaosMeta V0.6.0 introduces DNS and log injection capabilities, lossless fault injection concepts, automated experiment orchestration with atomic tasks, and a roadmap for multi‑cloud support and advanced metrics, aiming to solve the last‑mile challenge of continuous automated chaos experiments in production environments.

Fault Injection · Observability · automated experiments
0 likes · 9 min read
Efficient Ops
Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

Digital Transformation · ICBC · Observability
0 likes · 6 min read
Inke Technology
Oct 31, 2023 · Operations

How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse

This article details the redesign of a company’s logging infrastructure—from an ELK‑based solution to a ClickHouse‑powered architecture—highlighting the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.

Big Data · ClickHouse · Observability
0 likes · 13 min read
Architect
Oct 26, 2023 · Big Data

Design and Optimization of Bilibili Log Service 2.0 Using ClickHouse and OpenTelemetry

This article details Bilibili's evolution of its log system from an Elastic Stack‑based solution to a ClickHouse‑backed architecture with OpenTelemetry, describing the challenges of cost, stability, and scalability, the new components such as Log‑Agent, Log‑Ingester, and a custom visualization platform, and the performance gains and future directions.

ClickHouse · Observability · OpenTelemetry
0 likes · 26 min read
Architect
Oct 25, 2023 · Operations

The Importance of Logging and Distributed Log Operations in Modern Architecture

This article explores why logs are essential in software development, outlines when to record them, discusses the value of logging in large-scale distributed systems, and examines the capabilities required of log‑operation tools such as APM, metrics, tracing, ELK, Prometheus, and custom batch querying solutions.

APM · Distributed Systems · ELK
0 likes · 21 min read
HomeTech
Oct 25, 2023 · Operations

How Metrics‑Driven Development Supercharges a Used‑Car Platform

This article examines how a metrics‑driven development approach, combined with observability tools like Prometheus, helped a large online used‑car marketplace improve system insight, accelerate business processes, and deliver measurable performance and efficiency gains across both customer‑facing and dealer‑facing operations.

Data-Driven Engineering · Metrics-Driven Development · Observability
0 likes · 16 min read
Efficient Ops
Oct 24, 2023 · Operations

How to Monitor Business Metrics with Prometheus in Kubernetes

This article explains how to use Prometheus to monitor business‑level metrics in a Kubernetes environment, covering observability fundamentals, metric definitions, metric types, exposing metrics via a /metrics endpoint, and practical Go code examples for defining, recording, and scraping custom metrics.

Go · Kubernetes · Metrics
0 likes · 11 min read
Efficient Ops
Oct 22, 2023 · Operations

Master Loki: Deploy, Configure, and Query Logs Efficiently

This guide explains Loki's core concepts, deployment steps for Promtail and Loki, Grafana integration, label‑based indexing, handling dynamic and high‑cardinality tags, and query optimization techniques, providing a complete roadmap for building a cost‑effective, scalable log aggregation system.

Grafana · Kubernetes · Loki
0 likes · 15 min read
Alibaba Cloud Native
Oct 21, 2023 · Operations

How to Reveal Tracing Blind Spots with Continuous Profiling and Code Hotspots

This article explains the evolution of observability, outlines a step‑by‑step diagnosis workflow using metrics, logs and tracing, highlights the blind spots of traditional tracing, and demonstrates how Alibaba Cloud ARMS continuous profiling and code‑hotspot features can pinpoint slow call‑chain issues in Java applications.

APM · Continuous Profiling · Observability
0 likes · 14 min read
Selected Java Interview Questions
Oct 15, 2023 · Cloud Native

The Hidden Frictions of Kubernetes Adoption: From Speed Gains to Platform Engineering Challenges

The article examines how rapid Kubernetes adoption accelerates development velocity but also introduces hidden frictions such as standardization limits, DevOps disruption, monitoring difficulties, and team isolation, emphasizing the need for collaborative platform engineering and contextual observability.

Cloud Native · DevOps · Observability
0 likes · 13 min read
DataFunSummit
Oct 13, 2023 · Big Data

Practical Experience of Flink on Kubernetes at Kuaishou

This article presents Kuaishou's comprehensive journey of adopting Flink on Kubernetes, covering its background, evolution, architecture, production migration, observability, testing, and future plans, and demonstrates how large‑scale streaming workloads are transformed to a cloud‑native environment.

Big Data · Flink · Kubernetes
0 likes · 14 min read
Ops Development Stories
Oct 12, 2023 · Cloud Native

How to Monitor Kubernetes with OpenTelemetry Collector: Step‑by‑Step Helm Deployment

This guide walks through installing OpenTelemetry Collector on a Kubernetes cluster using Helm, configuring DaemonSet and Deployment collectors, integrating Prometheus for metrics, and customizing receivers, processors, and exporters to achieve comprehensive observability of nodes, pods, containers, and cluster resources.

Kubernetes · Observability · OpenTelemetry
0 likes · 26 min read
Bilibili Tech
Oct 10, 2023 · Backend Development

Design and Implementation of a Scalable Live‑Streaming Full‑Stream Data System

The article details a scalable live‑stream full‑stream data system that replaces a tightly‑coupled legacy architecture with a producer‑consumer model using a custom key‑value store, bucket sharding, gRPC server‑streaming, versioned caching, and comprehensive observability, achieving sub‑second queries, horizontal scalability, and reliable support for thousands of downstream services.

Observability · data pipeline · gRPC
0 likes · 18 min read
Architects Research Society
Oct 3, 2023 · Cloud Native

Chaos Engineering: Concepts, History, Benefits, Challenges, and Getting Started

Chaos engineering is a disciplined approach to testing distributed systems by intentionally injecting failures to verify resilience, covering its definition, origins at Netflix, operational workflow, benefits, challenges, and practical steps for organizations to adopt resilient cloud‑native applications.

Observability · Resilience · chaos engineering
0 likes · 18 min read
MaGe Linux Operations
Sep 30, 2023 · Cloud Native

How DeWu Built a Scalable Cloud‑Native Trace2.0 Observability Platform

This article details DeWu's evolution from a sneaker marketplace to a full‑stack e‑commerce platform and explains how its cloud‑native monitoring system, based on OpenTelemetry, ClickHouse, and object storage, was architected, optimized, and scaled to handle billions of spans daily.

Observability · OpenTelemetry · cloud-native
0 likes · 16 min read
Didi Tech
Sep 26, 2023 · Databases

Didi's Time Series Storage Evolution: From InfluxDB to VictoriaMetrics

Facing exponential growth of time‑series data from 2017 to 2023, Didi migrated from InfluxDB to RRDtool, then to an in‑memory cache layer, and finally adopted VictoriaMetrics because its low‑cost commodity‑hardware operation, high write throughput, strong compression, and easy horizontal scaling solved the earlier storage, OOM, and scalability problems.

Observability · Performance Evaluation · TSDB
0 likes · 13 min read
Bilibili Tech
Sep 26, 2023 · Backend Development

Applying CQRS Architecture to Live Streaming Room Service: Design, Evolution, and Operational Practices

The live‑streaming room service was re‑architected using CQRS, dividing read‑heavy viewer functions from write‑intensive broadcaster operations, splitting the monolith into focused Go micro‑services, adding multi‑level caching, event‑driven sync, extensive observability, and automated incident‑response to achieve massive scalability and rapid fault recovery.

CQRS · Observability · live streaming
0 likes · 18 min read