Topic: monitoring
Collection size: 1674 articles
macrozheng
May 6, 2021 · Operations

How I Built an Automated Redis Sentinel System to Handle Failover

An operations engineer narrates how he monitors a four‑node Redis cluster, detects master failure with continuous PINGs, promotes a slave to master, reconfigures replicas, and automates the entire process with a sentinel program and a sentinel cluster for high availability.

Automation, Monitoring, Operations
11 min read
macrozheng
Jan 6, 2021 · Backend Development

Essential Spring Boot Practices for Building Robust Microservices

This article outlines the golden rules for constructing Spring Boot microservices, covering monitoring with Spring Boot Admin and Grafana, exposing metrics via Actuator, centralized logging with ELK, clear API documentation using Swagger, YApi or smart‑doc, transparent build info, and keeping dependencies up‑to‑date.

API Documentation, Logging, Monitoring
8 min read
ByteDance Data Platform
Dec 29, 2021 · Big Data

How ByteDance’s DataLeap Solves Complex Data Quality Challenges at Scale

This article explains how ByteDance’s DataLeap platform tackles diverse data quality challenges across batch and streaming pipelines by defining quality dimensions, outlining a modular architecture, and sharing best‑practice optimizations for Spark, Flink and Presto‑based monitoring.

ETL, Flink, Monitoring
17 min read
Weidian Tech Team
Dec 15, 2016 · Databases

How to Build a Scalable Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL operations—including multi‑instance deployment, metadata collection, monitoring, backup, and high‑availability using Zookeeper—so that large‑scale database services can be provisioned, managed, and scaled with minimal human intervention.

Automation, Backup, Database Operations
11 min read
160 Technical Team
Dec 14, 2023 · Backend Development

How Health160 Scaled to Millions: Real-World Backend Performance Optimization Strategies

This article shares Health160's systematic approach to building a high‑performance, high‑availability medical service platform, covering monitoring, metric design, flow‑control, idempotency, unknown data handling, optimization case studies, architectural choices, NIO networking, middleware tuning, and caching techniques.

Caching, Monitoring, Performance Optimization
15 min read
37 Mobile Game Tech Team
Jul 2, 2021 · Big Data

Inside Flink Metrics: Adding, Retrieving, and Exposing Metrics in TaskManager

This article walks through Flink's metric system by explaining the core interfaces such as MetricReporter and MetricRegistry, showing how metrics are added, registered, and queried during TaskManager startup, and detailing both REST and Prometheus approaches for retrieving metric values.

Flink, Java, Metrics
16 min read
Qudian (formerly Qufenqi) Technology Team
Jan 18, 2017 · Operations

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Architecture, Grafana, InfluxDB
12 min read
Qunhe Technology Quality Tech
Jun 8, 2021 · Cloud Native

How We Stabilized International Services with a Multi‑Phase Cloud‑Native Migration

This article details a four‑stage migration project that rebuilt international services on a cloud‑native stack, introducing temporary Istio monitoring, standardized change processes, Helm‑based deployments, and full microservice integration while sharing practical quality‑assurance lessons and pitfalls.

Helm, Kubernetes, Monitoring
14 min read
Qunhe Technology Quality Tech
Jul 17, 2020 · Operations

How We Built a Robust Monitoring System for Construction Drawing Production

This article describes how our team designed and implemented a comprehensive online monitoring system for construction drawing generation, covering business background, technical architecture analysis, metric definition, monitoring methods, and the resulting dashboards that improve quality, stability, and rapid issue resolution.

Metrics, Monitoring, Operations
10 min read
Ops Development Stories
Mar 19, 2025 · Cloud Native

Unified Multi‑Cluster Monitoring with KubeDoor 1.0: Alerts, Metrics & Best Practices

KubeDoor 1.0 introduces a new architecture for unified monitoring across multiple Kubernetes clusters, with master and agent components, flexible deployment options, Helm‑based installation, configurable storage and alerting settings, detailed guidance on integrating with existing Prometheus/VictoriaMetrics setups, and automatic peak‑usage data collection.

Alerting, ClickHouse, Helm
14 min read
Ops Development Stories
Apr 12, 2024 · Cloud Native

Mastering etcd: Architecture, Monitoring & Performance Tuning

This article provides a comprehensive overview of etcd—including its origins, role in Kubernetes, version evolution, layered architecture, key terminology, operational commands, monitoring metrics, benchmarking procedures, disk‑performance testing, and tuning recommendations—for building reliable cloud‑native clusters.

Monitoring, benchmark, cloud native
17 min read
Ops Development Stories
Nov 16, 2023 · Fundamentals

Unlocking G1 GC: Why Your Java Service Hangs and How to Fix It

This article explains the G1 garbage collector’s heap layout, collection cycles, pause prediction, log analysis, and monitoring tools, helping Java developers diagnose and resolve performance issues such as frequent restarts, OOM, CPU spikes, and periodic latency spikes.

G1GC, JVM, Java
19 min read
Ops Development Stories
Oct 12, 2023 · Cloud Native

How to Monitor Kubernetes with OpenTelemetry Collector: Step‑by‑Step Helm Deployment

This guide walks through installing OpenTelemetry Collector on a Kubernetes cluster using Helm, configuring DaemonSet and Deployment collectors, integrating Prometheus for metrics, and customizing receivers, processors, and exporters to achieve comprehensive observability of nodes, pods, containers, and cluster resources.

Helm, Kubernetes, Monitoring
26 min read
Ops Development Stories
Aug 6, 2022 · Cloud Native

8 Proven Strategies to Beat Alert Fatigue in Kubernetes

This article explains why alert fatigue harms on‑call teams in Kubernetes environments and offers eight practical techniques—ranging from metric definition to alert suppression—to reduce noise, improve response efficiency, and protect team well‑being.

Kubernetes, Monitoring, Operations
8 min read
Ops Development Stories
Jul 6, 2022 · Operations

Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

Drawing on three and a half years of operations experience, this guide outlines practical standards for operating production systems, data protection strategies, security measures, daily monitoring, performance tuning tips, and the right mindset to avoid costly incidents and ensure stable, secure systems.

Backup, Monitoring, Operations
12 min read
Ops Development Stories
Mar 4, 2022 · Cloud Native

Why Observability Is the ‘Force’ Empowering Modern IT Systems

This talk explains why observability is essential for cloud‑native IT systems, covering its core value of empowerment, various definitions, evaluation criteria such as zero‑intrusion, multidimensionality and real‑time response, and practical building approaches using SaaS, open‑source and integration, illustrated with numerous industry case studies.

Monitoring, OLAP, Observability
24 min read
Ops Development Stories
Jan 24, 2022 · Cloud Native

Deploy and Configure vmagent on Kubernetes for Efficient Metrics

This guide explains what vmagent is, its key features, and provides step‑by‑step instructions to install, configure, and verify vmagent on a Kubernetes cluster, including namespace and RBAC setup, custom scrape configs, monitoring endpoints, and troubleshooting tips.

Kubernetes, Metrics, Monitoring
15 min read
Ops Development Stories
Oct 15, 2021 · Operations

Integrate Real‑Time Prometheus Pod Metrics into Probius Using ECharts

After integrating Kubernetes into Probius, this guide shows how to pull pod metrics from Prometheus using the query_range API, process them with a Python client, and visualize CPU, memory, bandwidth, and IOPS data in Probius via ECharts, completing a seamless container‑monitoring feature.

Echarts, Kubernetes, Monitoring
8 min read
Ops Development Stories
Sep 28, 2021 · Operations

Mastering Prometheus Relabeling: Rules, Actions, and Real-World Use Cases

This article explains how Prometheus relabeling works, covering rule actions, hidden metadata labels, label mapping, sharding, and practical examples for filtering targets, modifying labels, and optimizing metric storage in monitoring pipelines.

Monitoring, Prometheus, YAML
17 min read