Tagged articles
969 articles
Efficient Ops
Mar 24, 2020 · Operations

How NetEase Scales Game Monitoring to Billions: Architecture, Data, and AI

This article details NetEase's game monitoring system that supports billions of users worldwide, covering global monitoring challenges, a layered observability architecture, massive time‑series processing, visualisation and alerting mechanisms, and intelligent AI‑driven anomaly detection practices.

AI anomaly detection · Cloud Native · Observability
0 likes · 22 min read
Didi Tech
Mar 21, 2020 · Operations

Why Didi’s Nightingale Is Redefining Cloud‑Native Monitoring

Nightingale, Didi’s open‑source enterprise monitoring platform, builds on Open‑Falcon but adds a hierarchical object tree, in‑memory indexing, Gorilla‑compressed time‑series storage, a hybrid push‑pull alert engine, built‑in log monitoring, and a unified monapi module, delivering scalable, cloud‑native observability for both container and bare‑metal workloads.

Architecture · Cloud Native · Observability
0 likes · 10 min read
Efficient Ops
Mar 11, 2020 · Operations

How to Elevate Your Monitoring System: Proven Practices from Top DevOps Models

This article explains why modern services depend on highly available, scalable monitoring, outlines a systematic way to assess and improve monitoring capabilities using open‑source tools and the DevOps Capability Maturity Model, and details concrete improvement points across data collection, management, and application.

DevOps · Observability · Operations
0 likes · 9 min read
Qunar Tech Salon
Feb 20, 2020 · Operations

Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four‑layer monitoring standard (infrastructure, liveness, performance, business), breaks down functional modules and data flows, and showcases JD Cloud's practical design, alarm‑convergence project, and future AI‑driven observability directions.

JD Cloud · Observability · Operations
0 likes · 12 min read
58 Tech
Jan 13, 2020 · Backend Development

Building a PHP Extension for Automated Web API Monitoring at 58 Anjuke

This article describes the design, implementation, and deployment of a PHP extension that enables automated, low‑overhead monitoring of web API performance, detailing its flexible configuration, high resource efficiency, concurrency handling, and successful production rollout within the 58 rental business platform.

API monitoring · Extension · Observability
0 likes · 10 min read
Java High-Performance Architecture
Jan 13, 2020 · Backend Development

10 Proven Practices to Master Microservices Architecture

This article outlines ten essential microservices best practices—from domain‑driven design and independent databases to async communication, observability, and organizational alignment—providing a comprehensive guide for building scalable, maintainable service‑oriented systems.

Architecture · CI/CD · Domain-Driven Design
0 likes · 7 min read
Alibaba Cloud Native
Nov 30, 2019 · Cloud Native

How Alibaba Cloud Manages Over 10,000 Kubernetes Clusters at Double‑11 Scale

This article explains how Alibaba Cloud Container Service (ACK) designs a unit‑based, tiered management system, capacity planning model, global observability architecture, and pluggable components to reliably operate more than ten thousand diverse Kubernetes clusters during the massive Double‑11 shopping event.

ACK · Alibaba Cloud · Cluster Management
0 likes · 13 min read
Cloud Native Technology Community
Nov 21, 2019 · Cloud Native

Observability in Cloud‑Native Applications with Elastic Stack: A Four‑Step Approach

The talk explains how Elastic Stack can be used to achieve comprehensive observability for cloud‑native applications through a four‑step methodology—health checks, metrics, logging, and tracing—detailing the challenges, implementation details, and best practices for monitoring and debugging modern microservice systems.

APM · Cloud Native · Elastic Stack
0 likes · 10 min read
Alibaba Cloud Native
Nov 19, 2019 · Cloud Native

How to Build a Scalable, Reliable K8s Log Platform for Enterprise Needs

This article explains how to design and implement a flexible, high‑performance log system for Kubernetes environments, covering demand‑driven architecture, functional requirements, open‑source component choices, the reasons for a custom solution, and the operational challenges faced at massive scale.

Kubernetes · Observability · Open-source
0 likes · 12 min read
Efficient Ops
Oct 22, 2019 · Operations

How Modern IT Monitoring Systems Keep Your Services Running Smoothly

This article explains the purpose, core functions, classification, layered architecture, and popular implementations of IT monitoring systems, covering log‑based, trace‑based, and metric‑based approaches as well as a comparison of Zabbix and Prometheus.

IT monitoring · Observability · Prometheus
0 likes · 17 min read
Programmer DD
Oct 10, 2019 · Operations

What’s New in Grafana 6.4? Explore the Latest Features and Improvements

Grafana 6.4, released on October 2, 2019, introduces a suite of enhancements, including Explore navigation, real‑time log viewing, new log panels, Data Link upgrades, Series Override line rendering, shared query results, an Alpine‑based Docker image, deprecation of PhantomJS, and the Alpha release of grafana‑toolkit, plus numerous UI and performance improvements.

Dashboard · Grafana · Observability
0 likes · 7 min read
Alibaba Cloud Native
Sep 18, 2019 · Cloud Native

Mastering Kubernetes Logging: Overcoming Real‑World Challenges

This article shares Alibaba's extensive experience building a Kubernetes‑based logging system, detailing the evolution from single‑machine to containerized environments, the critical role of observability, and the specific technical challenges such as dynamic log sources, integration complexity, and massive scale handling.

Distributed Systems · Kubernetes · Observability
0 likes · 9 min read
dbaplus Community
Sep 16, 2019 · Operations

How to Build Effective Monitoring for Microservices: Logs, Tracing, and Metrics Explained

This article explains the three main monitoring approaches—log collection, distributed tracing, and metric gathering—in microservice architectures, outlines the layered monitoring model, lists key system, application, and user metrics, and reviews popular open‑source time‑series monitoring tools such as Prometheus, OpenTSDB, and InfluxDB.

Metrics · Microservices · Observability
0 likes · 10 min read
JD Tech Talk
Sep 12, 2019 · Databases

Reflections on ApacheCon 2019 in Las Vegas: ShardingSphere’s First Participation and Community Insights

The article recounts JD Digits architect Zhang Liang’s experience representing the Apache ShardingSphere community at ApacheCon 2019 in Las Vegas, describing the conference atmosphere, community interactions, ShardingSphere’s observability talk and Shark Tank showcase, and the growing Chinese contribution to the Apache ecosystem.

ApacheCon · Distributed Systems · Observability
0 likes · 5 min read
DevOps Cloud Academy
Sep 5, 2019 · Operations

An Overview of the Prometheus Monitoring System

Prometheus, an open‑source monitoring and alerting toolkit originally developed by SoundCloud and now a CNCF project, offers multidimensional data models, flexible queries, pull‑based data collection, various metric types (counter, gauge, summary, histogram), local and remote storage, service discovery, and integrates with Grafana for visualization.

Cloud Native · Metrics · Observability
0 likes · 8 min read
ITFLY8 Architecture Home
Aug 15, 2019 · Operations

Navigating the Open‑Source Distributed Tracing Landscape: Tools, Features, and How to Choose

This guide surveys the most popular open‑source distributed tracing projects, classifying them by instrumentation, tracer, and analysis capabilities, and explains how they fit into modern microservice observability, helping newcomers understand each tool’s strengths, integrations, and the broader tracing ecosystem.

APM · Cloud Native · Distributed Tracing
0 likes · 10 min read
Programmer DD
Aug 13, 2019 · Operations

Mastering Prometheus Histograms: How Cumulative Buckets Simplify Metrics

This article explains the fundamentals of Prometheus histogram metrics, illustrates why they are cumulative, shows how to drop unwanted buckets with relabeling, and demonstrates quantile calculations using the histogram_quantile function, providing practical examples and code snippets for effective monitoring.

Histogram · Metrics · Observability
0 likes · 7 min read
dbaplus Community
Jul 29, 2019 · Operations

How to Build a Cost‑Effective, Multi‑Layer Monitoring System for Distributed Applications

This article explains why comprehensive, multi‑layer monitoring is essential for distributed systems, outlines environment, program, and business metrics, recommends practical tools such as Zabbix, Open‑Falcon, Prometheus and Grafana, and provides a step‑by‑step evolution plan and alerting strategy.

Distributed Systems · Metrics · Observability
0 likes · 10 min read
Architecture Digest
Jul 29, 2019 · Backend Development

Microservice Architecture at Medium: Lessons, Principles, and Strategies

The article recounts Medium's transition from a monolithic Node.js application to a microservice architecture, explaining the motivations, core design principles, practical strategies, tooling choices, and lessons learned to avoid common pitfalls and improve development velocity and system reliability.

Backend Architecture · Observability · service design
0 likes · 18 min read
Efficient Ops
Jul 28, 2019 · Operations

How 58’s Intelligent Monitoring System Guarantees 24/7 Service Stability

This article details the design, architecture, and AI‑driven features of 58’s intelligent monitoring platform, explaining how multi‑dimensional data collection, predictive analytics, and smart alarm merging ensure continuous, automated observability across network, server, application, and business layers.

Observability · anomaly detection · cloud infrastructure
0 likes · 20 min read
Sohu Tech Products
Jul 3, 2019 · Cloud Native

Building a Cloud‑Native Distributed Tracing System with Jaeger

This article explains why Jaeger is a popular cloud‑native tracing solution, describes its architecture, sampling options, and deployment strategies on Kubernetes—including DaemonSet and Sidecar modes—followed by a step‑by‑step Django integration example and guidance on monitoring, alerting, and resource cleanup.

Cloud Native · Distributed Tracing · Django
0 likes · 13 min read
DevOps Cloud Academy
Jun 9, 2019 · Operations

Prometheus Metric Definitions, Types, and Data Samples

This article explains Prometheus metric naming conventions, label usage, metric types such as Counter, Gauge, Summary, and Histogram, and describes the structure of data samples, providing examples and best‑practice guidelines for defining and classifying metrics in monitoring systems.

Metrics · Observability · Operations
0 likes · 5 min read
Cloud Native Technology Community
Jun 4, 2019 · Cloud Native

Introduction to Istio Service Mesh and How It Addresses Common Microservice Challenges

This article introduces Istio as an open‑source service mesh, explains its data‑plane and control‑plane architecture, outlines its traffic management, security, and telemetry features, discusses performance considerations, and shows how Lingque Cloud ASM leverages Istio to solve typical microservice problems such as debugging, testing, release processes, and flexible network policies.

Cloud Native · Istio · Kubernetes
0 likes · 13 min read
Java Backend Technology
Apr 27, 2019 · Operations

Why Apache SkyWalking Became a Top‑Level Project and What It Offers for Modern APM

Apache SkyWalking, an open‑source observability platform that originated in 2015, has graduated to a top‑level Apache project, offering comprehensive APM features such as distributed tracing, metrics, service topology, root‑cause analysis, and flexible storage options for cloud‑native microservice environments.

APM · Apache SkyWalking · Cloud Native
0 likes · 7 min read
Ctrip Technology
Apr 18, 2019 · Operations

Application Monitoring Systems: Necessity, Components, Distributed Tracing, and Design for Developers, Testers, and Operations

The article explains why enterprise application monitoring systems are essential, outlines their core components such as Trace, Log, Metric, and Report, discusses distributed tracing techniques, and describes how these insights are designed to aid developers, testers, and operations engineers in performance tuning and fault diagnosis.

Distributed Tracing · Observability · application monitoring
0 likes · 12 min read
G7 EasyFlow Tech Circle
Apr 10, 2019 · Operations

Mastering Log Engineering: From Standards to ELK Visualization

This article explains why systematic logging is essential for production debugging, introduces a practical log classification and field schema, describes trace‑ID propagation and performance instrumentation, and walks through building an ELK‑based log collection, storage, and real‑time visualization platform for reliable observability.

ELK · Observability · logging
0 likes · 15 min read
Efficient Ops
Mar 31, 2019 · Operations

How to Design Actionable Alerts and Effective Monitoring Strategies

This article explains why most alerts are poorly designed, defines actionable alerts, outlines monitoring objectives, discusses metric selection, and presents simple yet powerful algorithms for anomaly detection to improve system reliability and operational efficiency.

Metrics · Observability · Operations
0 likes · 21 min read
Efficient Ops
Mar 14, 2019 · Operations

9 Essential Logging Best Practices to Boost System Performance

This article presents nine practical logging best‑practice recommendations—from understanding human and machine audiences and standardizing log formats to leveraging metrics, proper alerting, severity levels, contextual information, and advanced framework features—helping operations teams improve system performance and troubleshooting efficiency.

Metrics · Observability · Operations
0 likes · 11 min read
ITPUB
Jan 31, 2019 · Operations

Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators—error, latency, traffic, and saturation—to guide effective observability.

Metrics · Observability · Operations
0 likes · 10 min read
360 Tech Engineering
Jan 22, 2019 · Cloud Native

Microservice Design Patterns: Database, Observability, and Cross‑Cutting Concerns

This article introduces a series of microservice design patterns—including database isolation, observability, and cross‑cutting concerns—explaining the underlying problems each pattern solves and providing concrete solutions such as CQRS, Saga, log aggregation, health checks, and blue‑green deployments.

Backend Architecture · Cloud Native · Design Patterns
0 likes · 13 min read
Aikesheng Open Source Community
Dec 30, 2018 · Databases

MySQL Middleware Performance Testing I – Common Mistakes, Practical Methods, and Distributed Transactions

This presentation details how to correctly benchmark MySQL middleware performance, exposing common pitfalls, describing practical testing methodologies, emphasizing the need to observe both middleware and actual database pressure, and discussing distributed transaction considerations and metric selection for reliable results.

MySQL · Observability · Performance Testing
0 likes · 24 min read
Beike Product & Technology
Dec 20, 2018 · Backend Development

Guide to Developing SkyWalking Java Agent with Byte Buddy and Plugin Implementation

This tutorial explains how to use Byte Buddy to build a JavaAgent for SkyWalking, debug and continuously integrate the agent, and develop custom SkyWalking plugins such as the kob scheduling framework, providing step‑by‑step code examples and configuration details for observability in Java backend services.

APM · ByteBuddy · Instrumentation
0 likes · 12 min read
58 Tech
Nov 12, 2018 · Operations

Key Takeaways from the 58 Group Technical Salon on Monitoring Platforms

The article summarizes the 58 Group technical salon where experts from Momo and 58 shared practical experiences on monitoring platform architectures, coverage, alarm configurations, convergence techniques, custom dimensions, multi‑view dashboards, and future directions for intelligent and automated monitoring across the company.

Alerting · DevOps · Observability
0 likes · 9 min read
Java Captain
Oct 21, 2018 · Backend Development

Effective Logging Practices for Java Backend Services

The article discusses common challenges with missing logs in production, proposes practical solutions such as adding machine identifiers via Nginx headers and embedding user information with Log4j's MDC, and outlines concise logging guidelines to improve traceability and performance analysis for Java backend systems.

Backend · Observability · java
0 likes · 5 min read
Efficient Ops
Sep 19, 2018 · Cloud Native

Kubernetes Log Management: Challenges, Logtail Solution & Architecture

Amid the rise of serverless Kubernetes, growing pod volumes, and real-time log demands, this article examines emerging log-handling challenges, evaluates traditional collection methods, and presents a comprehensive “Logtail + Log Service + Ecosystem” architecture that delivers high-throughput, reliable, and scalable logging for cloud-native environments.

Cloud Native · Kubernetes · Log Management
0 likes · 22 min read
Efficient Ops
Sep 17, 2018 · Operations

How Alibaba Scales Monitoring: From CMDB to AI‑Driven Full‑Link Observability

Alibaba’s monitoring evolution—from fragmented early tools to the standardized Sunfire platform and now AI‑powered full‑link observability—addresses scaling challenges, introduces business‑centric metrics, automated traceability, and intelligent anomaly detection, illustrating how massive, multi‑tenant infrastructures achieve unified, proactive operations at scale.

Alibaba · Observability · Operations
0 likes · 19 min read
JD Tech Talk
Aug 9, 2018 · Operations

Ensuring Stability and Scalability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article explains why operating massive Kubernetes clusters is as challenging as building large systems, outlines three critical stability questions, shares real‑world data collection, visualization, and tooling practices, and provides concrete recommendations for high‑availability, monitoring, and performance optimization.

Kubernetes · Observability · automation
0 likes · 12 min read
JD Retail Technology
Jul 24, 2018 · Operations

Stability and Operational Practices for Large‑Scale Kubernetes Clusters

This article shares practical experience and best‑practice guidelines for operating large‑scale Kubernetes clusters, covering stability checks, component failure impact, recovery strategies, alerting mechanisms, data collection, visualization, and the suite of operational tools that help ensure reliable, high‑performance cloud‑native infrastructure.

Kubernetes · Observability · cluster operations
0 likes · 10 min read
UCloud Tech
Jul 18, 2018 · Operations

How to Build a Unified Monitoring System for Microservices: Key Dimensions & Scenarios

This article explains how microservice architectures require a comprehensive monitoring system, covering data, resource, and code dimensions, and describes eight atomic monitoring scenarios such as URL, host, product, component, custom, resource, APM, and event monitoring to help engineers design effective observability solutions.

APM · Observability · Operations
0 likes · 7 min read
Architecture Digest
May 8, 2018 · Backend Development

Design and Comparison of Distributed Tracing Systems

The article explains the concept, functions, design goals, data models, log collection, and deployment considerations of distributed tracing systems, and compares several open‑source and proprietary solutions such as Dapper, Zipkin, Pinpoint, Alibaba Eagle Eye, and JD Hydra to guide the selection of an appropriate tracing platform.

Backend · Distributed Tracing · Microservices
0 likes · 16 min read
dbaplus Community
Apr 24, 2018 · Cloud Native

How Istio Simplifies Service Mesh Management for Kubernetes Microservices

This article explains why microservice architectures need reliable communication, load balancing, fault tolerance, monitoring, tracing and circuit breaking, and shows how Istio—a cloud‑native service mesh built on Envoy—provides these capabilities, enabling blue‑green and canary deployments, traffic routing, retries, and observability within Kubernetes.

Blue‑Green deployment · Istio · Kubernetes
0 likes · 8 min read
Efficient Ops
Apr 2, 2018 · Operations

How Bilibili Revamped Its Monitoring Architecture: From Zabbix to Dapper

An in‑depth look at Bilibili’s multi‑layer monitoring overhaul, detailing the shift from a monolithic Zabbix setup to micro‑service‑based ELK, Dapper, Misaka, Traceon and Lancer systems, and how layered observability improves fault detection across business, application, and infrastructure levels.

Distributed Tracing · Microservices · Observability
0 likes · 10 min read
Programmer DD
Feb 23, 2018 · Operations

How Zipkin Collects and Processes Sleuth Tracing Data – Deep Dive into Spans

This article explains Zipkin’s data model, how Spring Cloud Sleuth generates and sends Span and Annotation information, the message‑channel listener that converts Sleuth spans to Zipkin spans, debugging techniques to observe the collected data, and why the number of spans shown in Zipkin’s UI can differ from the raw count.

Distributed Tracing · Microservices · Observability
0 likes · 17 min read
Hujiang Technology
Jan 29, 2018 · Operations

Design and Implementation of a Low‑Impact Distributed Tracing System for Service Calls

This article describes the background, design goals, architecture, implementation details, and lessons learned from building a low‑overhead, low‑intrusion distributed tracing system using Kafka, Elasticsearch, and OpenTracing to monitor microservice interactions and support performance analysis and DevOps decision‑making.

Distributed Tracing · Elasticsearch · Kafka
0 likes · 9 min read
Node Underground
Jan 12, 2018 · Backend Development

Unlocking Pandora.js: Manageable, Measurable, Traceable Node.js Applications

Pandora.js, the Alibaba Midway team's first open-source project, consolidates years of production-grade Node.js operations by offering three core capabilities: manageable application and process control, comprehensive metrics, and OpenTracing-based request tracing, all exposed via RESTful APIs and logs for easy integration.

Backend · Metrics · Node.js
0 likes · 2 min read
Taobao Frontend Technology
Jan 5, 2018 · Operations

Why Metrics Matter: A Deep Dive into Pandora.js’s Measurement System

Metrics act as health checks for applications, enabling developers to monitor performance, track changes, and assess stability; this article explains Pandora.js’s metric naming conventions, types like Gauge, Counter, Histogram, and Meter, and provides practical Node.js code examples for implementing these measurements.

Metrics · Observability · Performance
0 likes · 13 min read
Efficient Ops
Dec 18, 2017 · Operations

How WiFi Key Built a Million‑User Monitoring Platform: Architecture and Best Practices

This article describes how WiFi Master Key designed and implemented the Roma monitoring platform to handle billions of daily requests, covering background challenges, architectural principles, component design, data collection, transmission, storage, alerting, and future directions for large‑scale observability.

Architecture · Microservices · Observability
0 likes · 16 min read
Tencent Database Technology
Nov 21, 2017 · Operations

Introduction to ELKB: Architecture, Components, and Typical Use Cases of Elasticsearch, Logstash, Kibana, and Beats

The article introduces the ELKB stack—a combination of Elasticsearch, Logstash, Kibana, and Beats—explaining its background, user needs, architecture, component functions, typical scenarios, and the team’s practical implementations for real‑time log and time‑series data processing.

Beats · ELK · Elasticsearch
0 likes · 10 min read
Dada Group Technology
Oct 27, 2017 · Operations

Pinpoint Overview and Plugin Development Guide

Pinpoint is a full‑stack, non‑intrusive tracing platform that visualizes service topology, active threads, request latency, and application health, and this article explains its architecture, data model, and step‑by‑step process for creating custom plugins—including ServiceLoader configuration, TraceMetadataProvider, and ProfilerPlugin implementations with code examples.

APM · Observability · Pinpoint
0 likes · 22 min read
Qunar Tech Salon
Oct 26, 2017 · Operations

Evolution of Pinterest's Monitoring System: From Time-Series Metrics to Distributed Tracing

Over seven years, Pinterest’s monitoring team built and refined a three‑pronged observability platform—time‑series metrics, log search, and distributed tracing—scaling from a single‑machine system to handling millions of data points per second across tens of thousands of AWS VMs, while addressing reliability, cost, and usability challenges.

Distributed Tracing · Observability · SRE
0 likes · 19 min read
Meitu Technology
Sep 28, 2017 · Operations

Inside Meipai’s 3‑D Monitoring System: Scaling 150M Users with Unified Observability

This article examines how Meipai, a popular live‑streaming and short‑video platform with over 150 million monthly active users, engineered a comprehensive, three‑dimensional monitoring architecture that spans client to server, integrates unified dashboards, and leverages both private and public cloud resources to ensure reliable, scalable operations.

DevOps · Infrastructure · Meipai
0 likes · 3 min read
Qunar Tech Salon
Aug 14, 2017 · Backend Development

Introduction to QTracer: An Internal Distributed Tracing System at Qunar

QTracer is Qunar’s internal distributed tracing system that generates a global TraceID for each request, records operations across services, and provides features such as execution chain visualization, log correlation, conditional search, service dependency analysis, database statistics, transparent data propagation, and low‑overhead instrumentation for debugging and performance monitoring.

Backend · Distributed Tracing · Observability
0 likes · 20 min read
Ctrip Technology
Aug 10, 2017 · Operations

QTracer: An In‑Depth Overview of Qunar’s Distributed Tracing System

This article provides a comprehensive technical overview of QTracer, Qunar’s internal distributed tracing platform, covering its architecture, core concepts, key features such as execution‑chain queries, log association, conditional searches, data storage, non‑intrusive instrumentation, bytecode injection, and the QTracer Debug tool for online breakpoint debugging.

Backend · Distributed Tracing · Observability
19 min read
Beike Product & Technology
Jul 16, 2017 · Industry Insights

How Lianjia Built LTrace: A Low‑Overhead, Scalable Distributed Tracing Platform

This article explains how Lianjia designed and implemented LTrace, a zero‑intrusion, high‑performance distributed tracing system that captures full request chains across heterogeneous services, supports multi‑language environments, offers flexible sampling, and enables rapid fault isolation and performance optimization.

Architecture · Distributed Tracing · Observability
12 min read
ITFLY8 Architecture Home
Jun 20, 2017 · Cloud Native

How Uber Built Jaeger: From In‑House Tracing to a Cloud‑Native Open‑Source Platform

Uber’s engineering team chronicles the evolution of its distributed tracing system—from the early Merckx pull‑based solution and TChannel integration to the open‑source Jaeger platform—detailing architectural shifts, sampling strategies, multi‑language client libraries, and the move toward a fully cloud‑native, end‑to‑end observability stack.

Cloud Native · Distributed Tracing · Microservices
17 min read
Efficient Ops
Mar 1, 2017 · Operations

How Metrics-Driven Development Transforms Software Iteration and Ops

Metrics‑Driven Development (MDD) extends test‑driven principles by embedding real‑time monitoring into design, enabling rapid, precise, and granular software iterations, improving early problem detection, decision support, and aligning development with DevOps culture.

Metrics · Observability · monitoring
13 min read
Node Underground
Jan 24, 2017 · Operations

11 Essential Practices to Master Node.js Application Monitoring

Effective Node.js monitoring improves competitiveness, user experience, and cost efficiency. This guide outlines eleven recommendations, from tracking downtime and response thresholds to linking performance with business metrics and using third‑party APM tools, with the aim of robust, low‑noise alerting and secure, scalable applications.

APM · DevOps · Node.js
3 min read
dbaplus Community
May 11, 2016 · Operations

Inside Twitter’s Massive Monitoring Stack: Architecture, Metrics, and Lessons Learned

Twitter’s internal monitoring team built a full‑stack observability platform that handles billions of metric writes per minute, supports distributed tracing, log aggregation, visual dashboards, and alerting across data centers and public clouds, and shares the architecture, components, and key lessons learned.

Alerting · Distributed Tracing · Metrics
18 min read
dbaplus Community
Apr 11, 2016 · Operations

Can External Quality Acceptance Drive DevOps Monitoring and Eliminate Technical Debt?

This article explains how focusing on non‑functional quality during external acceptance testing can drive DevOps teams to improve system monitorability, reduce technical debt, and establish concrete change‑control, acceptance, and performance verification processes for both operational and business‑level observability.

DevOps · Observability · change management
15 min read
21CTO
Mar 20, 2016 · Operations

How CAT Powers Real‑Time Distributed Monitoring at Scale

This article introduces CAT, a Java‑based open‑source distributed real‑time monitoring system, covering its origins, design goals, architecture, message processing pipeline, instrumentation model, and how it achieves high availability, scalability, and low‑latency analytics for large‑scale internet services.

Architecture · CAT · Distributed Monitoring
17 min read