Tagged articles

2179 articles

Feb 28, 2022 · Operations

Render Real‑Time Alert Charts in DingTalk with Promoter – A Go Solution

This article explains how to programmatically render Prometheus alert charts, upload them to object storage, and embed the images in DingTalk notifications using the Go‑based Promoter tool, including template customization, deployment steps, and core rendering logic.

Alertmanager · DingTalk · Go

0 likes · 10 min read

Dada Group Technology

Feb 25, 2022 · Databases

Practical Deployment and Operation Guide for StarRocks OLAP Database

This article presents a comprehensive overview of StarRocks, covering its key features, deployment challenges, backup and synchronization methods, cluster configuration and upgrade procedures, as well as monitoring and alerting solutions, followed by practical lessons learned from real‑world usage.

Backup · StarRocks · monitoring

0 likes · 13 min read

IT Services Circle

Feb 24, 2022 · Databases

Diagnosing and Solving Redis Performance Issues

This article explains how to detect Redis latency problems, measure baseline performance, monitor slow commands, and address common causes such as network round‑trip delays, fork‑generated RDB snapshots, transparent huge pages, swap usage, AOF settings, key expiration, and big‑key handling, providing practical troubleshooting steps and solutions.

Latency · database · monitoring

0 likes · 20 min read

ITFLY8 Architecture Home

Feb 24, 2022 · Backend Development

How to Build a 100‑Billion Red‑Envelope System that Handles 60 k QPS

This article details the design, implementation, and performance testing of a scalable red‑envelope service capable of handling up to 100 billion requests, supporting 1 million concurrent users per server, achieving peak QPS of 60 k, and outlines hardware, software, and monitoring strategies.

Performance Testing · high concurrency · monitoring

0 likes · 17 min read

IT Services Circle

Feb 23, 2022 · Operations

Setting Up Spring Boot Admin to Monitor Spring Boot Applications

This guide explains how to create a Spring Boot Admin server, configure a Spring Boot client to register with it, enable Actuator for extended metrics, and view real‑time logs, providing a comprehensive monitoring solution for Java backend services.

Actuator · Operations · Spring Boot

0 likes · 9 min read

DaTaobao Tech

Feb 21, 2022 · Frontend Development

Focused Gray Release Monitoring and Alert Configuration for Frontend Quality

To raise front‑end quality, the team implements gray‑release monitoring that triggers log analysis at a 5 % rollout, automatically generates reports within ten minutes, and uses dynamic thresholds and noise‑reduction tactics to detect errors early, enabling rapid rollback or expansion and markedly improving stability and release efficiency.

Alerting · Metrics · frontend

0 likes · 9 min read

IT Architects Alliance

Feb 15, 2022 · Operations

What Real-World Performance Tuning Taught Us About Legacy Web Apps

After a traffic surge exposed severe latency in a 15-year-old multi-service web platform, we used monitoring to discover a DB-connection leak caused by a liveness probe, corrected it, and distilled four practical lessons on latency metrics, tooling, legacy maintenance, and code vigilance.

APM · Load Testing · Operations

0 likes · 9 min read

Top Architect

Feb 13, 2022 · Operations

Performance Optimization Lessons from a Legacy Web Application: Monitoring, Load Testing, and Maintaining Old Systems

The article shares a real‑world case study of a legacy multi‑service web platform where traffic spikes exposed DB connection leaks, leading to a 90% response‑time bottleneck, and outlines four key takeaways about tail‑latency metrics, investing in tools and people, actively maintaining legacy systems, and treating every line of code as critical for performance.

Backend · Load Testing · SRE

0 likes · 9 min read

MaGe Linux Operations

Feb 12, 2022 · Operations

Boost Go Service Reliability with the Lightweight go-monitor Tool

The article presents go-monitor, an open‑source Go library that provides lightweight, lock‑free service quality monitoring, automatic analysis, configurable alerts, and flexible reporting for backend applications, complete with installation steps and code examples.

Alerting · Golang · monitoring

0 likes · 9 min read

Su San Talks Tech

Feb 10, 2022 · Operations

Master SkyWalking: End‑to‑End Guide to Distributed Tracing, Setup & Monitoring

This tutorial walks through SkyWalking, an open‑source APM framework, explaining its features, architecture, how to install and configure the server and agents, persist data with MySQL, enable log collection, perform performance profiling, and set up alerting rules for robust distributed tracing.

APM · Alerting · Distributed Tracing

0 likes · 12 min read

MaGe Linux Operations

Feb 9, 2022 · Backend Development

Mastering Tars: Deploy, Manage, and Monitor a High‑Performance Microservice Framework

This guide provides a comprehensive overview of the Tars microservice framework, covering its core concepts, deployment methods across various environments, configuration management, service discovery, logging, monitoring, and operational features such as gray releases and circuit‑breaker strategies.

Configuration · Deployment · Microservices

0 likes · 18 min read

Practical DevOps Architecture

Feb 8, 2022 · Operations

Extending Zabbix Monitoring with Custom Scripts and Handling Stale NFS Handles

This article explains how Zabbix monitoring can be extended with custom shell or Python scripts to gather business-specific metrics, demonstrates a sample script that checks disk usage, and provides three methods to resolve a stale NFS file handle error, including using fuser, process inspection, and forced unmount.

Custom Script · NFS · Operations

0 likes · 3 min read

Efficient Ops

Feb 7, 2022 · Operations

Mastering Application Monitoring with Prometheus: Practical Metrics and Grafana Tips

This article explains how to design effective Prometheus metrics for various application types, choose appropriate vectors, labels, and buckets, and offers Grafana tricks for visualizing dimensions and linking tooltips, providing a comprehensive guide for robust observability.

Grafana · Metrics · Prometheus

0 likes · 10 min read

Architect

Feb 5, 2022 · Backend Development

Best Practices for Designing Consistent RESTful APIs

This article presents a concise, step‑by‑step guide to designing clean, consistent RESTful APIs, covering resource naming, URL conventions, HTTP methods, versioning, pagination, field selection, security, monitoring, error handling, and documentation tools, with concrete code examples for each rule.

HTTP methods · URL conventions · Versioning

0 likes · 10 min read

MaGe Linux Operations

Feb 5, 2022 · Operations

Essential Linux Bash Scripts for Server Operations and Automation

This article presents a collection of practical Bash scripts for Linux servers, covering DoS attack IP blocking, alert emailing, MySQL backup (single and multi‑loop), Nginx log rotation and analysis, real‑time network traffic monitoring, system initialization, and disk usage checks across multiple hosts.

Bash · Linux · Server

0 likes · 10 min read

Top Architect

Feb 3, 2022 · Backend Development

System Performance Issue Analysis, Diagnosis, and Optimization for Business Applications

This article explains how to analyze, diagnose, and optimize performance problems in production business systems, covering the typical causes such as high concurrency, data growth, hardware limits, and environment changes, and detailing practical steps for hardware, OS, database, middleware, JVM tuning, code review, and APM monitoring.

Backend · JVM · diagnosis

0 likes · 15 min read

DataFunTalk

Feb 1, 2022 · Big Data

Kafka at Meituan: Practices, Challenges, and Optimizations for Large‑Scale Data Platforms

This article presents Meituan's large‑scale Kafka deployment, describing the current state and challenges of massive data ingestion, detailing latency‑reduction techniques, cluster‑level optimizations, SSD‑based caching, isolation strategies, full‑link monitoring, lifecycle management, and future directions for high availability.

Cluster Management · Kafka · Meituan

0 likes · 22 min read

dbaplus Community

Jan 27, 2022 · Databases

Why ClickHouse Beats Elasticsearch for Microservice Governance Data at Scale

The article examines the data‑storage problems caused by rapid microservice growth, explains why traditional Hadoop/Spark stacks were rejected, presents benchmark comparisons that show ClickHouse’s superior performance and compression, and details practical ClickHouse deployment, schema design, sharding, TTL, indexing, and monitoring integrations for real‑time analytics.

DataAnalytics · DatabaseDesign · OLAP

0 likes · 27 min read

Architecture Digest

Jan 27, 2022 · Backend Development

Designing API Error Codes and Result Codes: Best Practices

This article explains why a well‑designed API error‑code system—using consistent numeric or string codes, clear messages, HTTP‑status‑like segmentation, personalized user messages, and unified handling for monitoring—reduces communication overhead, simplifies maintenance, and improves overall backend reliability.

Error Codes · HTTP status · api-design

0 likes · 6 min read

ITFLY8 Architecture Home

Jan 26, 2022 · Operations

Mastering Microservice Monitoring, Fault Tolerance, and Security: A Complete Guide

This article explains how to monitor microservice architectures through logs, tracing, and metrics, and compares open‑source tracing tools. It outlines fault‑tolerance strategies such as timeouts, rate limiting, degradation, async buffering, and circuit breaking, details access‑security mechanisms including gateway authentication, service‑side auth, and OAuth 2.0 token flows, and introduces container technology and its role in microservice deployment.

Containers · Microservices · fault-tolerance

0 likes · 43 min read

Youzan Coder

Jan 26, 2022 · Big Data

How to Build a Robust Data Quality Assurance Strategy for Large-Scale Data Platforms

This article outlines a comprehensive data quality assurance framework for a massive reporting platform, covering the data pipeline architecture, detailed testing methods for timeliness, completeness, and accuracy, as well as application‑level checks, downgrade and backup strategies, and future automation plans.

Data Quality · automation · big data testing

0 likes · 14 min read

IT Architects Alliance

Jan 23, 2022 · Operations

Microservice Monitoring, Fault Tolerance, Access Security, and Container Technology Overview

This article provides a comprehensive guide to microservice monitoring (log, tracing, and metrics approaches), fault‑tolerance isolation techniques, access‑security mechanisms such as API gateways and OAuth 2.0, and the role of container technologies like Docker in cloud‑native deployments.

Cloud Native · Containers · Microservices

0 likes · 30 min read

Xueersi Online School Tech Team

Jan 21, 2022 · Frontend Development

White‑Screen Detection and Performance Optimization for Front‑End Applications

The article explains the concept of white‑screen time, its impact on user experience, and presents multiple detection methods—including Navigation Timing API, MutationObserver, element‑point analysis, and headless‑browser simulation—along with implementation code and a monitoring‑alert architecture for front‑end performance optimization.

MutationObserver · Puppeteer · frontend

0 likes · 17 min read

Efficient Ops

Jan 20, 2022 · Operations

Mastering Prometheus Metrics: Best Practices for Effective Monitoring

This article outlines practical guidelines for designing Prometheus metrics, covering how to define monitoring targets, choose appropriate vectors and labels, name metrics and labels correctly, select histogram buckets, and leverage Grafana features to visualize and troubleshoot data effectively.

Grafana · Metrics · Prometheus

0 likes · 11 min read

IT Architects Alliance

Jan 20, 2022 · Cloud Native

How to Build a High‑Availability Microservices System on Kubernetes – A Complete Guide

This guide walks you through designing a simple microservice architecture with a separate front end and back end, implementing it with Java Spring Boot, deploying multiple instances with Eureka, adding Prometheus‑Grafana monitoring, logging, tracing, and flow control, and finally installing Kubernetes using K8seasy and verifying high availability across the cluster.

Cloud Native · Kubernetes · Microservices

0 likes · 19 min read

Baidu Geek Talk

Jan 19, 2022 · Big Data

Quantile Computation in Baidu Advertising System: Architecture and Implementation

Baidu’s advertising platform computes high‑precision response‑time quantiles at massive scale by intercepting each API call, locally summarizing data with mergeable T‑Digest histograms, periodically uploading compressed, Base64‑encoded summaries to a warehouse where they are merged on demand, enabling low‑latency, cost‑effective percentile analysis with sub‑0.1% error.

Quantile · T-Digest · data aggregation

0 likes · 11 min read

MaGe Linux Operations

Jan 14, 2022 · Operations

Choosing the Right Open‑Source Monitoring Tool: History, Pros, Cons & Use Cases

This comprehensive guide traces the evolution of open‑source monitoring solutions from the early 2000s to modern cloud‑native tools, comparing their strengths, weaknesses, and ideal deployment scenarios to help IT professionals select the most suitable monitoring product for their infrastructure.

Operations · Tool comparison · cloud-native

0 likes · 14 min read

IT Xianyu

Jan 14, 2022 · Operations

Redis Monitoring, Data Migration, and Cluster Management Tools Overview

This article introduces essential Redis operational tools, covering the INFO command for monitoring, Prometheus‑based redis‑exporter visualization, the Redis‑shake data migration utility, Redis‑full‑check consistency verification, and the CacheCloud platform for comprehensive cluster management.

CacheCloud · Data Migration · Operations

0 likes · 10 min read

Top Architect

Jan 13, 2022 · Backend Development

Microservice Architecture Roadmap: Core Components and Recommended Tools

This article presents a comprehensive roadmap for adopting microservice architecture, explaining why it is chosen, outlining essential concerns such as Docker, container orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, tracing, data persistence, caching, and cloud providers, and recommending popular tools for each component.

Docker · Kubernetes · Microservices

0 likes · 16 min read

Efficient Ops

Jan 12, 2022 · Cloud Native

Why Kubernetes Monitoring Differs from VM Metrics: CPU, Memory, Disk, Network

This article compares Kubernetes pod monitoring metrics with traditional KVM/VM metrics across CPU, memory, disk, and network, explaining the underlying reasons for the differences and offering guidance on interpreting the data for effective performance troubleshooting.

CPU · Cloud Native · Metrics

0 likes · 11 min read

Programmer DD

Jan 12, 2022 · Backend Development

How to Build a Complete Backend Stack for Your Startup from Scratch

This guide walks startup leaders through designing and assembling a full backend technology stack—from language and component choices to processes, systems, and deployment tools—providing practical recommendations, diagrams, and best‑practice tips for building scalable, maintainable services.

Backend Architecture · Cloud Services · DevOps

0 likes · 30 min read

HaoDF Tech Team

Jan 11, 2022 · Big Data

Using ClickHouse for Real‑Time Log Analytics and Data Storage in Microservice Governance at Haodf

The article describes how Haodf's SRE team replaced Elasticsearch with ClickHouse to handle massive microservice logs, achieve low‑latency queries, reduce storage costs, and support real‑time monitoring, tracing, and metric analysis through columnar OLAP features, sharding, TTL, and materialized views.

Analytics · Big Data · Microservices

0 likes · 25 min read

DataFunTalk

Jan 7, 2022 · Artificial Intelligence

Building an Intelligent Risk Control Tool System: Architecture and Key Components

This article presents a comprehensive overview of constructing an intelligent risk control tool system, detailing its evolution from manual processes to automated platforms, describing the core "three‑piece" suite (model, decision, and feature platforms) along with supporting data and monitoring platforms, and explaining the functions and interactions of each module such as data ingestion, feature engineering, automated modeling, decision flow, and real‑time monitoring.

Data Platform · Modeling · decision engine

0 likes · 13 min read

NetEase LeiHuo Testing Center

Jan 6, 2022 · Game Development

Implementing Test Right-Shift in Game Development: Stable Release, Monitoring, and Risk Control

As agile development matures, QA moves beyond functional testing; this article explains test right‑shift in a game project, covering stable release strategies, log‑based monitoring, opinion monitoring, performance monitoring, and risk control mechanisms.

Game Development · QA · Test Right-Shift

0 likes · 17 min read

HomeTech

Jan 6, 2022 · Operations

Design and Implementation of a Centralized Database Log Collection and Analysis Platform

This article describes the background, architecture, and implementation of a centralized database log collection and analysis platform built in 2021, detailing how logs from hosts, containers, and databases are normalized, streamed through Kafka, processed with Flink, stored in Elasticsearch, visualized with Kibana, and extended with alerting and configuration management to improve fault diagnosis and lay the groundwork for future AI‑driven operations.

Big Data · Kibana · log collection

0 likes · 5 min read

Practical DevOps Architecture

Jan 5, 2022 · Operations

Deploying Prometheus and Node Exporter on a Linux Host

This guide walks through installing Prometheus and Node Exporter on a Linux server, copying binaries to system paths, configuring Prometheus with scrape jobs for the local node and remote hosts, and running the exporters with specific collector options for system metrics.

Operations · Prometheus · monitoring

0 likes · 4 min read

Architects' Tech Alliance

Jan 5, 2022 · Backend Development

Essential Microservice Architecture Roadmap: Tools, Patterns, and Best Practices

This guide outlines why microservice architecture is preferred for large applications, presents a clear learning roadmap, and details each critical concern—such as Docker, orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, tracing, persistence, caching, and cloud providers—along with recommended tools.

Backend Architecture · Cloud Native · Docker

0 likes · 14 min read

Zhuanzhuan Tech

Jan 5, 2022 · Operations

Design and Implementation of a Multi‑Dimensional Monitoring Platform Based on Prometheus and M3DB

This article details the background, research, architecture, performance testing, and deployment of a comprehensive monitoring system that leverages Prometheus, Grafana, and M3DB to provide flexible metric collection, automatic dashboard generation, and a custom alerting service for large‑scale business services.

Alerting · Metrics · Time Series

0 likes · 16 min read

Architecture Digest

Dec 31, 2021 · Backend Development

Why I Chose Microservice Architecture and a Roadmap of Its Core Components

This article explains why microservice architecture is preferred over monolithic applications, outlines a learning roadmap, and details essential components such as Docker, container orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, distributed tracing, data persistence, caching, and cloud providers.

Backend Architecture · Docker · Kubernetes

0 likes · 13 min read

21CTO

Dec 30, 2021 · Backend Development

Why Choose Microservices? A Practical Roadmap and Tool Guide

This article outlines why microservice architecture is preferred over monolithic designs, presents a clear learning roadmap, and details essential concerns such as Docker, orchestration, API gateways, load balancing, service discovery, logging, monitoring, tracing, persistence, caching, and cloud providers, with recommended tools for each.

Docker · Microservices · api-gateway

0 likes · 15 min read

Xiaolei Talks DB

Dec 30, 2021 · Databases

Exploring the TiDB Distributed Database Ecosystem: Tools, Automation, and Innovations

This article explains what a distributed database ecosystem is, using TiDB as a case study to detail upstream/downstream tools, backup and monitoring solutions, automation platforms, and emerging projects that together form a comprehensive TiDB ecosystem.

Backup · Ecosystem · TiDB

0 likes · 9 min read

Liulishuo Tech Team

Dec 30, 2021 · Operations

Design and Implementation of an Alert Scheduling System (GoAlert) and Notification Center

This article explains why alerts and on‑call scheduling are needed, outlines the core principles of an alert scheduling system, describes the architecture evolution from PagerDuty to GoAlert and Notice‑Center, and details the implementation, code snippets, and future outlook for a comprehensive operations monitoring solution.

Alerting · Notification System · goalert

0 likes · 14 min read

HomeTech

Dec 30, 2021 · Operations

Open-falcon in Automotive Home: Application, Architecture, and Customizations

This article describes how the open‑falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

Open-Falcon · Operations · monitoring

0 likes · 11 min read

DataFunSummit

Dec 29, 2021 · Operations

How to Build an Operations Monitoring Platform with Spring Boot Admin

This article explains what Spring Boot Admin is, walks through creating a server and client to monitor Spring Boot applications, shows how to configure ports, enable the admin UI, and set up email and custom alert notifications for operational health monitoring.

Operations · Spring Boot · java

0 likes · 12 min read

DeWu Technology

Dec 24, 2021 · Operations

How to Quickly Attribute Live‑Streaming Alert Issues in a Kubernetes Environment

This article walks through a real‑world live‑streaming service alert where response time and goroutine spikes were traced through Grafana metrics, MySQL/Redis performance, routing logic, and Istio sidecar load, ultimately revealing a mis‑reported Istio metric and a resource‑allocation fix to prevent future jitter.

Istio · Kubernetes · Operations

0 likes · 11 min read

Architecture Digest

Dec 23, 2021 · Operations

Using Filebeat and Graylog for Centralized Log Collection and Monitoring

This article explains how to deploy and configure Filebeat and Graylog for centralized log collection, covering installation methods, configuration files, Docker deployment, input modules, pipelines, and practical examples for efficiently gathering and analyzing logs across multiple environments.

Docker · Filebeat · Graylog

0 likes · 15 min read

ITPUB

Dec 20, 2021 · Databases

From Database Developer to New DBA: Boosting MySQL Efficiency and Automation

The article shares a senior DBA's journey from early database engine development to modern MySQL operations, outlining practical methods for improving efficiency, automating monitoring, building data‑driven processes, and redefining the DBA role for proactive, high‑impact service delivery.

DBA · Database operations · automation

0 likes · 33 min read

Java High-Performance Architecture

Dec 20, 2021 · Operations

How to Diagnose and Optimize Business System Performance After Launch

This article outlines a comprehensive process for analyzing, diagnosing, and optimizing performance issues in production business systems, covering hardware, OS, database, middleware, JVM settings, code inefficiencies, and the role of monitoring tools like APM to pinpoint bottlenecks.

JVM · System · diagnostics

0 likes · 15 min read

Beike Product & Technology

Dec 17, 2021 · Operations

Practices for Monitoring, Resource Optimization, and Containerization of Large-Scale Flink Jobs at Beike

This article describes Beike's real‑time computing team's end‑to‑end practices for collecting and storing Flink metrics, building visual monitoring dashboards, implementing multi‑level alerting, analyzing logs, estimating CPU and memory resources, and deploying Flink on Kubernetes with containerization and storage separation to improve stability, resource utilization, and operational efficiency.

Flink · Kubernetes · Metrics

0 likes · 25 min read

Alibaba Cloud Native

Dec 16, 2021 · Cloud Native

From Legacy Monitoring to Modern Observability: A Cloud‑Native Journey

This article traces the 30‑year evolution of system monitoring, explains the differences between monitoring, APM and observability, outlines key practices for building an observability platform, and provides a step‑by‑step guide to implementing Prometheus + Grafana in a cloud‑native environment.

APM · ARMS · Grafana

0 likes · 18 min read

Architecture Digest

Dec 16, 2021 · Operations

System Performance Issue Analysis, Diagnosis, and Optimization Process

This article outlines a comprehensive approach to diagnosing and optimizing performance problems in production business systems, covering common causes such as concurrency spikes, data growth, and environment changes, and detailing hardware, middleware, database, JVM, code-level analyses, monitoring tools, and APM strategies.

JVM · database · diagnostics

0 likes · 15 min read

NetEase Smart Enterprise Tech+

Dec 14, 2021 · Backend Development

How NetEase Cloud’s Distributed Recording Cluster Ensures High‑Availability and Scalability

This article explains the architecture and key features of NetEase Cloud's local server‑side recording cluster, detailing how dynamic scaling, multi‑backup high availability, load‑balancing strategies, monitoring, and an embedded registration center enable secure, reliable, and scalable recording for data‑sensitive applications.

Distributed Systems · Java SDK · REST API

0 likes · 11 min read

Java Interview Crash Guide

Dec 13, 2021 · Operations

How to Diagnose and Optimize Business System Performance After Launch

This article outlines a comprehensive process for analyzing, diagnosing, and optimizing performance issues in production business systems, covering hardware, database, middleware, JVM tuning, code-level problems, testing limitations, scaling considerations, and the role of APM monitoring.

APM · Database Tuning · System optimization

0 likes · 14 min read

Programmer DD

Dec 12, 2021 · Operations

How Netflix’s Telltale Transforms Monitoring for 100+ Services

This article explains Netflix’s home‑grown monitoring system Telltale, detailing its design, multi‑dimensional health‑assessment model, intelligent alerting, integration with Slack, deployment monitoring, and continuous optimization that together keep over a hundred production applications running smoothly.

Alerting · Microservices · Netflix

0 likes · 13 min read

IT Architects Alliance

Dec 12, 2021 · Operations

System Performance Issue Analysis and Optimization Process for Business Applications

The article outlines a comprehensive process for diagnosing and optimizing performance problems in production business systems, covering causes such as high concurrency, data growth, hardware constraints, and detailing analysis of hardware, OS, database, middleware, JVM settings, code inefficiencies, and the role of monitoring and APM tools.

Backend · Database Tuning · JVM

0 likes · 13 min read

Selected Java Interview Questions

Dec 10, 2021 · Backend Development

A Comprehensive Guide to Spring Boot Actuator: Quick Start, Endpoints, and Monitoring

This article provides a step‑by‑step tutorial on using Spring Boot Actuator to monitor microservice applications, covering quick setup, essential endpoints such as health, metrics, loggers, info, beans, heapdump, threaddump and shutdown, endpoint exposure configuration, and securing them with Spring Security.

Actuator · Backend · Endpoints

0 likes · 14 min read

Ctrip Technology

Dec 9, 2021 · Databases

TiDB Operational Practices at Ctrip: Architecture, Use Cases, Performance Tuning, Monitoring, and Tooling

This article details Ctrip's migration from MySQL to TiDB, describing the multi‑data‑center architecture, real‑world use cases such as the international CDP platform and hotel settlement, performance tuning measures, comprehensive monitoring and alerting, auxiliary tools, and future roadmap for the distributed NewSQL database.

HTAP · TiDB · monitoring

0 likes · 16 min read

IT Architects Alliance

Dec 9, 2021 · Backend Development

How to Build a Billion‑User Scalable User Center: Architecture, APIs, Token Fallback, and Security

This article presents a comprehensive, practical design for an ultra‑large‑scale user center, covering microservice architecture, API separation, token generation with graceful degradation, data‑sharding strategies, password encryption, asynchronous processing, and detailed monitoring to ensure high availability, performance, and security.

Microservices · Scalability · Token

0 likes · 16 min read

Alibaba Cloud Native

Dec 7, 2021 · Operations

How Information Entropy Powers AI‑Driven Alert Noise Reduction in Cloud‑Native Operations

This article explains how Shannon's information entropy and NLP are combined in Alibaba Cloud's ARMS intelligent noise reduction to quantify alert uncertainty, filter redundant notifications, and automatically prioritize critical incidents, offering a practical, self‑learning solution for modern monitoring environments.

Alert Noise Reduction · NLP · information entropy

0 likes · 11 min read

Efficient Ops

Dec 6, 2021 · Operations

How Scenario‑Based AIOps Transforms IT Operations: Insights from GOPS 2023

The article summarizes a GOPS conference presentation by Dingmao Technology on AIOps scenario‑driven construction, detailing challenges, definition of scenarios, technical methods, roadmap planning, and future prospects, while showcasing practical examples and supporting technologies for intelligent IT operations.

Artificial Intelligence · Data Integration · IT Operations

0 likes · 8 min read

Aikesheng Open Source Community

Dec 3, 2021 · Operations

Monitoring DBLE with Zabbix: Environment Setup, Scripts, and Template Configuration

This guide explains how to set up a monitoring environment for the DBLE distributed middleware using Zabbix, covering host and software configuration, MySQL master‑slave deployment, DBLE installation, Zabbix script creation, and template configuration with detailed code examples.

DBLE · Operations · Zabbix

0 likes · 8 min read

Top Architect

Dec 3, 2021 · Operations

Centralized Log Collection with Filebeat and Graylog

This article explains how to use Filebeat together with Graylog to collect, process, and visualize logs from multiple services and environments, covering tool introductions, configuration files, component details, deployment methods, and practical code examples.

Docker · ELK · Filebeat

0 likes · 19 min read

Architecture Digest

Dec 3, 2021 · Backend Development

Design Practices for a Billion‑Scale User Center

This article presents a comprehensive set of design practices for building a highly available, high‑performance, and secure user‑center system that can handle hundreds of millions of users, covering service architecture, API design, token degradation, data sharding, security, asynchronous processing, and monitoring.

Scalability · Token · database sharding

0 likes · 15 min read

MaGe Linux Operations

Dec 2, 2021 · Operations

Master Log Collection: Deploy Filebeat & Graylog for Centralized Logging

This guide explains how to use Filebeat to ship logs to Graylog, covering Filebeat's architecture, configuration files, deployment options with Docker or native packages, Graylog's components and pipelines, and step‑by‑step Docker‑compose setup for a scalable centralized logging solution.

Docker · Elasticsearch · Filebeat

0 likes · 15 min read

Alibaba Cloud Native

Nov 30, 2021 · Cloud Native

How to Detect and Resolve Slow Calls in Kubernetes: Best Practices & Real‑World Cases

This article explains why slow calls in Kubernetes can jeopardize user experience, project timelines, and system stability, outlines five common causes, introduces the golden‑signal and USE analysis framework, and walks through three practical case studies with step‑by‑step troubleshooting and remediation techniques.

ARMS · Kubernetes · monitoring

0 likes · 15 min read

Top Architect

Nov 28, 2021 · Backend Development

Design and Implementation of an RxNetty‑Based API Gateway for Microservice Architectures

This article describes the architecture, core features, and implementation details of a high‑performance API gateway built with RxNetty, covering request dispatch, conditional routing, API management, rate‑limiting, security policies, monitoring, and future improvement directions.

Microservices · RxNetty · api-gateway

0 likes · 11 min read

Cloud Native Technology Community

Nov 25, 2021 · Databases

Why Is My Redis Slowing Down? A Complete Troubleshooting Guide

This article provides a systematic, step‑by‑step methodology for diagnosing Redis latency spikes, covering baseline performance testing, slow‑log analysis, high‑complexity commands, big‑key handling, expiration patterns, memory limits, fork overhead, huge‑page settings, AOF configurations, CPU binding, swap usage, memory fragmentation, network saturation, and practical monitoring tips.

Latency · database · monitoring

0 likes · 42 min read

Efficient Ops

Nov 24, 2021 · Operations

Why Switch to Loki? Step‑by‑Step Installation and Grafana Visualization

This guide explains why Loki is a lightweight alternative to EFK/ELK, walks through installing Loki and Promtail binaries, configuring them with YAML files, and visualizing logs in Grafana using LogQL, providing a complete end‑to‑end log management solution.

Grafana · Log Management · Loki

0 likes · 6 min read

Baidu Geek Talk

Nov 24, 2021 · Operations

How Baidu’s Fengjing Uses Holographic Logs to Debug Massive Microservices

Baidu’s Fengjing monitoring platform tackles the daunting challenge of pinpointing failures in its massive Java‑based microservice ecosystem by employing a non‑intrusive probe that captures log metadata, stores it in a database, and reconstructs full request‑level logs with minimal storage overhead.

Distributed Tracing · Microservices · holographic logging

0 likes · 9 min read

Efficient Ops

Nov 24, 2021 · Operations

Practical Prometheus in Kubernetes: Tips, Limits, and Scaling

This article shares practical experiences and best‑practice guidelines for deploying and operating Prometheus in Kubernetes, covering version selection, inherent limitations, exporter choices, metric design, multi‑cluster scraping, memory and storage planning, GPU monitoring, timezone handling, and alerting considerations.

Exporters · Grafana · Prometheus

0 likes · 21 min read

dbaplus Community

Nov 22, 2021 · Databases

Transforming MySQL Monitoring: From Nagios to Kafka‑Powered Alerts

Qunar’s DBA team overhauled their MySQL monitoring and alert system—originally built on Nagios and NRPE—by integrating a Kafka‑based pipeline, a custom alarm service, and MySQL‑stored alert templates, achieving flexible thresholds, granular silencing, high‑availability processing, and early‑stage intelligent management of alerts, slow queries, and disk space.

Alerting · DBA · Kafka

0 likes · 14 min read

JD Retail Technology

Nov 22, 2021 · Backend Development

Designing a High‑Performance Log Collection System with UDP, Compression, and ClickHouse

The article analyzes the high cost and scalability challenges of traditional log collection pipelines and proposes a streamlined architecture that uses in‑memory buffering, UDP transport, aggressive compression, and ClickHouse storage to achieve massive throughput while drastically reducing hardware and operational expenses.

High Throughput · UDP · clickhouse

0 likes · 15 min read

Ops Development Stories

Nov 22, 2021 · Cloud Native

Mastering Kubernetes Pod Resource Requests, Limits, and QoS

Learn how to properly configure CPU and Memory requests and limits for Kubernetes Pods, understand QoS classes, manage namespace quotas with LimitRange and ResourceQuota, and monitor resource usage using Prometheus queries and Grafana dashboards to ensure stable, efficient cluster operations.

Kubernetes · QoS · ResourceQuota

0 likes · 11 min read

IT Architects Alliance

Nov 20, 2021 · Operations

Analysis and Optimization of Business System Performance

This article outlines a comprehensive approach to diagnosing and optimizing performance problems in production business systems, covering analysis processes, hardware, OS, database, middleware, JVM tuning, code inefficiencies, and monitoring techniques to identify root causes and improve system reliability.

Database Tuning · Operations · System optimization

0 likes · 16 min read

Aikesheng Open Source Community

Nov 19, 2021 · Operations

Monitoring TiDB with Zabbix: Using HTTP Agent, Preprocessing, and Triggers

This guide explains how to collect TiDB metrics via its HTTP monitoring API, preprocess the data into JSON, create master and regular items in Zabbix, and configure triggers using Prometheus‑style expressions to achieve effective TiDB monitoring.

Alerting · JsonPath · Metrics

0 likes · 7 min read

DevOps Cloud Academy

Nov 19, 2021 · Operations

Guide to Using Grafana Stat Panel for Monitoring: Text and Background Modes, Configuration Steps

This tutorial explains how to create and configure Grafana Stat panels—including text and background modes, threshold‑based coloring, unit settings, and Markdown/HTML text panels—to visualize metrics such as node uptime, CPU cores, and total memory on a dashboard.

Dashboard · Grafana · Operations

0 likes · 8 min read

vivo Internet Technology

Nov 17, 2021 · Operations

Design and Architecture of a Unified Alert Convergence System for Monitoring

The paper presents a unified alert convergence system that centralizes metric calculation, detection, and alarm handling across monitoring subsystems, employing mechanisms such as convergence, claiming, silencing, escalation, and a Redis‑based delayed queue integrated via Kafka or REST to reduce alarm fatigue, improve MTTA/MTTR, and enable future AI‑driven AIOps.

MTTA · MTTR · Operations

0 likes · 18 min read

Efficient Ops

Nov 16, 2021 · Operations

How to Build a Scalable Prometheus Monitoring System with Thanos on Kubernetes

This article explains why monitoring is essential for production stability, compares white‑box and black‑box approaches, and provides a step‑by‑step guide to deploying Prometheus, configuring scrape targets, using Pushgateway and Alertmanager, and scaling the solution with Thanos in a Kubernetes environment.

Alertmanager · Prometheus · Pushgateway

0 likes · 21 min read

DevOps Cloud Academy

Nov 15, 2021 · Operations

Creating and Transforming Grafana Table Panels for Server Resource Monitoring

This guide demonstrates how to create a Grafana Table panel to monitor server resources, add multiple queries, merge them using the Transform feature, customize fields and units, and organize rows for a comprehensive dashboard view.

Grafana · Operations · Table Panel

0 likes · 7 min read

Big Data Technology & Architecture

Nov 15, 2021 · Operations

A Comprehensive Overview of Kafka Monitoring Tools

This article provides a comprehensive overview of popular Kafka monitoring solutions—including JMX, Kafka Manager (CMAK), Kafka Eagle, and Logi‑KafkaManager—detailing their features, installation steps, configuration examples, and comparative advantages, while also mentioning custom setups using JMXTrans, InfluxDB, and Grafana.

CMAK · Kafka · Kafka Eagle

0 likes · 8 min read

Open Source Linux

Nov 14, 2021 · Databases

Essential Redis Monitoring Metrics Every Engineer Should Know

This guide outlines the key Redis monitoring metrics—including performance, memory, basic activity, persistence, and error indicators—explains their meanings, shows how to retrieve them with Redis commands, and provides practical tips for effective performance and health tracking.

Error · Metrics · monitoring

0 likes · 6 min read

dbaplus Community

Nov 14, 2021 · Operations

How to Boost Service Reliability: SRE Basics and Tackling Technical Debt

This article explains the fundamentals of Site Reliability Engineering, outlines a complete SRE workflow from prevention to post‑mortem, details key availability metrics and golden indicators, examines how technical debt arises and can be mitigated, and describes the tooling and practices needed to keep large‑scale services healthy.

SRE · Technical Debt · monitoring

0 likes · 18 min read

Aikesheng Open Source Community

Nov 12, 2021 · Operations

Monitoring TiDB with Zabbix Server 5.4 – Step‑by‑Step Guide

This article explains how to use Zabbix Server 5.4 to monitor TiDB clusters by configuring HTTP agents, converting Prometheus metrics to JSON, creating custom macros, linking TiDB templates, and verifying data collection, while noting version and OS requirements.

Operations · Prometheus · TiDB

0 likes · 5 min read

Alibaba Terminal Technology

Nov 10, 2021 · Operations

How QianNiu Cut Plugin Issues by 50%: Open Architecture & Monitoring Secrets

The article explains how Alibaba’s QianNiu multi‑platform workbench reduced open‑plugin related user complaints by half through defining open nodes, optimizing long plugin startup chains, building permission‑request loops, establishing comprehensive data‑driven metrics, and creating an open‑experience dashboard that monitors performance, reliability, and user‑perceived issues across mobile and desktop.

Open Platform · User experience · data metrics

0 likes · 14 min read

Ops Development Stories

Nov 8, 2021 · Cloud Native

How to Manually Deploy Prometheus Federation on Kubernetes – Step‑by‑Step Guide

This guide walks through manually deploying a Prometheus federation on Kubernetes, covering environment setup with sealos, creating storage classes, persistent volumes, ConfigMaps, StatefulSets, services, applying manifests, and verifying the federation to aggregate metrics across multiple clusters.

Cloud Native · Federation · Kubernetes

0 likes · 10 min read

IT Architects Alliance

Nov 7, 2021 · Cloud Native

Why Microservices Matter: Core Architecture, Benefits, and Real-World Practices

This article provides a comprehensive overview of microservices, covering its origin, core architectural principles, key characteristics, advantages and drawbacks, suitable organizational contexts, and essential technical components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration.

Microservices · architecture · container orchestration

0 likes · 17 min read

YunZhu Net Technology Team

Nov 5, 2021 · Backend Development

Practical Java Performance Optimization: Metrics, Bottleneck Identification, and Governance Strategies

This article shares practical Java performance‑optimization techniques, covering UI and non‑UI latency metrics, baseline data collection, bottleneck discovery with tools like Arthas, chronic issue handling, and a comprehensive set of governance measures ranging from network‑level caching to code‑level refactoring, asynchronous processing, and service splitting to achieve stable sub‑200 ms response times.

Arthas · Backend · caching

0 likes · 19 min read

Open Source Linux

Oct 31, 2021 · Operations

Designing Effective Metrics: From Requirements to Labels and Buckets

This guide explains how to define, name, and organize monitoring metrics—covering Google’s four golden indicators, system‑specific measurement objects, vector selection, label conventions, bucket design, and practical Grafana tips—for reliable observability of diverse services.

Metrics · labeling · monitoring

0 likes · 10 min read

Baidu Geek Talk

Oct 29, 2021 · Industry Insights

Baidu’s QCon 2021 Highlights: Elastic Scaling, Search Architecture, AI Chips

This article compiles Baidu engineers' QCon 2021 talks, covering micro‑service evolution, large‑scale container elastic scaling, search system elasticity, AI‑chip deployment at massive scale, and cost‑focused monitoring, each with abstracts, outlines and key takeaways for practitioners.

AI chips · Cloud Native · Microservices

0 likes · 11 min read

Huolala Tech

Oct 29, 2021 · Operations

How Huolala Guarantees Cloud‑Native Stability at Scale

In this detailed account of Huolala's 2021 Cloud Operations Best Practices talk, the company shares its multi‑cloud architecture, service‑oriented governance, capacity‑testing, monitoring, and risk‑prediction techniques that together ensure high‑availability and efficient scaling for its diverse logistics services.

Operations · capacity testing · monitoring

0 likes · 17 min read

Zhengcaiyun Technology

Oct 28, 2021 · Backend Development

HikariCP Overview (Part 1): Initialization, Core Components, Monitoring and Configuration

This article provides a detailed analysis of HikariCP’s initialization, core components, startup flow, connection acquisition logic, monitoring metrics, and key configuration parameters, illustrating how Spring Boot 2.x leverages this high‑performance JDBC connection pool and offering guidance for tuning and extending it.

Configuration · Connection Pool · HikariCP

0 likes · 14 min read

dbaplus Community

Oct 26, 2021 · Databases

Scaling JD.com Customer Service with Doris OLAP: Architecture & Caching

JD.com’s customer service team leverages the open‑source MPP database Doris to power real‑time and offline OLAP dashboards, detailing data ingestion pipelines, full‑link monitoring, dual‑stream high‑availability design, dynamic partition management, multi‑layer caching strategies, and performance optimizations applied during the 2020 11.11 shopping festival.

Big Data · OLAP · Real-time analytics

0 likes · 15 min read

Tencent IMWeb Frontend Team

Oct 25, 2021 · Backend Development

Mastering Node.js Backend Logging: Design, Tools, and Full‑Trace Strategies

This article shares a comprehensive guide to building robust logging systems for Node.js backend services, covering log types, storage options, performance considerations, full‑trace design, custom field schemas, integration with cloud log platforms, and practical troubleshooting examples.

Node.js · Operations · backend-development

0 likes · 15 min read

Baidu Geek Talk

Oct 20, 2021 · Operations

Practical Strategies for Building High‑Availability Systems

This article presents a comprehensive, step‑by‑step guide on improving system reliability through early fault detection, scope reduction, frequency reduction, and rapid incident handling, using real‑world practices from Baidu's commercial hosting platform.

Log Standardization · Operations · capacity planning

0 likes · 20 min read

Open Source Linux

Oct 19, 2021 · Operations

Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

This guide shares practical Linux operations lessons—ranging from cautious command use, rigorous backup habits, and secure SSH configurations to comprehensive monitoring and performance tuning—to help teams avoid costly mistakes and maintain stable, reliable services.

Backup · Operations · monitoring

0 likes · 12 min read

Ops Development Stories

Oct 19, 2021 · Operations

How to Build a Highly Available Alertmanager Cluster with Gossip

Learn to set up a highly available Alertmanager cluster using the Gossip protocol, covering deduplication, routing, HA architecture, required cluster parameters, systemd service files, and Prometheus integration, with step‑by‑step commands and configuration examples.

Alertmanager · Gossip · HA

0 likes · 8 min read

dbaplus Community

Oct 18, 2021 · Operations

Master Prometheus: From Setup to Advanced Monitoring in Cloud‑Native Environments

This guide walks through the history, core features, installation methods, configuration, PromQL queries, exporter setup, Grafana integration, and alerting with Alertmanager for Prometheus, providing practical commands and examples for building a complete monitoring solution in cloud‑native environments.

Alerting · Exporters · Grafana

0 likes · 34 min read

Full-Stack Internet Architecture

Oct 16, 2021 · Backend Development

Handling MQ Failures: Encapsulation, Degradation, and Message Resend Strategies

The article explains how to properly deal with message‑queue (MQ) outages by first encapsulating MQ operations, then applying degradation tactics such as persisting failed messages to a database, disk, or log, and finally implementing scheduled or manual message‑replay mechanisms while emphasizing monitoring and fallback logic.

Backend · Failure Handling · MQ

0 likes · 5 min read

360 Tech Engineering

Oct 15, 2021 · Operations

Log Collection Architecture Using Filebeat, Logstash, and Kafka

This article describes a lightweight, resource‑efficient log collection solution that combines Filebeat agents, optional Logstash aggregation, and Kafka transport, detailing configuration choices, meta‑persistence, back‑pressure mechanisms, monitoring setup, and deployment architecture for reliable at‑least‑once delivery.

Filebeat · Logstash · Operations

0 likes · 14 min read

Ops Development Stories

Oct 15, 2021 · Operations

Integrate Real‑Time Prometheus Pod Metrics into Probius Using ECharts

After integrating Kubernetes into Probius, this guide shows how to pull pod metrics from Prometheus using the query_range API, process them with a Python client, and visualize CPU, memory, bandwidth, and IOPS data in Probius via ECharts, completing a seamless container‑monitoring feature.

ECharts · Kubernetes · Prometheus

0 likes · 8 min read
