Topic: monitoring

Collection size: 1674 articles
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Monitoring · distributed systems · fault tolerance
0 likes · 18 min read
Art of Distributed System Architecture Design
Apr 8, 2015 · Cloud Computing

Practices in Building Distributed Technologies for Large‑Scale Cloud Computing Platforms

The article summarizes Dr. Zhang Wensong’s 2014 ArchSummit keynote on the challenges, architectural design, storage strategies, performance optimizations, monitoring, and future directions of Alibaba Cloud’s large‑scale distributed cloud computing platform, covering ECS, SLB, RDS, OCS and full‑link analytics.

Cloud Computing · ECS · Monitoring
0 likes · 17 min read
Java Architect Essentials
Aug 23, 2022 · Cloud Native

Implementing Multi‑Cluster Monitoring with Prometheus and Thanos on Kubernetes

This article explains the limitations of a standard Prometheus monitoring stack on Kubernetes and demonstrates how to migrate to a Thanos‑based solution for long‑term metric retention, reduced infrastructure cost, and scalable multi‑cluster observability using Terraform and cloud‑native components.

Kubernetes · Monitoring · Observability
0 likes · 15 min read
Architecture Digest
Feb 18, 2024 · Operations

Setting Up Nginx Access Log Visualization with Loki and Grafana

This guide walks through installing Loki, Promtail, and Grafana (via Docker), configuring Nginx to emit JSON‑formatted access logs, collecting them with Promtail, storing them in Loki, and visualizing the data in Grafana dashboards, including geo‑IP enrichment and world‑map panels.

Docker · Grafana · Logging
0 likes · 7 min read
Code Ape Tech Column
Jul 26, 2024 · Operations

Bash Scripts for File Consistency Checks, Log Monitoring, and System Automation

This article presents a comprehensive collection of Bash scripts that perform tasks such as verifying file consistency across servers, scheduled log cleaning, network traffic monitoring, numeric analysis in files, automated FTP downloads, interactive number games, Nginx 502 detection, variable assignments, bulk file renaming, IP address validation, and various system administration operations.

Automation · Monitoring · bash
0 likes · 24 min read
Code Ape Tech Column
Dec 12, 2023 · Operations

Centralized Log Collection with Filebeat and Graylog

This article explains how to use Filebeat together with Graylog to collect, ship, store, and analyze logs from multiple environments, covering tool introductions, configuration files, Docker deployment, Spring Boot integration, and practical search syntax for effective log monitoring.

Docker · Elasticsearch · Filebeat
0 likes · 20 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Chaos Engineering · Circuit Breaker · Monitoring
0 likes · 11 min read
FunTester
Aug 28, 2024 · Operations

Shadow Testing: Reducing Risk and Ensuring Seamless System Changes

Shadow testing is a parallel deployment strategy that minimizes the risk of system changes, safeguards user experience, validates performance and data integrity, and provides a controlled environment for comprehensive testing, supported by a suite of modern tools and real‑world case studies.

CI/CD · Monitoring · Risk Mitigation
0 likes · 17 min read
Architect
Dec 31, 2024 · Operations

Integrating Prometheus with Spring Boot and Visualizing Metrics Using Grafana

This guide explains how to monitor a Spring Boot application using Prometheus, configure Spring Boot Actuator, run Prometheus (including Docker deployment), set up Grafana for visualizing metrics, and create custom metrics with Micrometer, providing step‑by‑step instructions and code examples.

Actuator · Docker · Grafana
0 likes · 10 min read
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Automation · Monitoring · Operations
0 likes · 16 min read
DevOps Operations Practice
Apr 11, 2025 · Operations

Promtool: A Complete Guide to Configuration Validation, Rule Checking, TSDB Management, and Debugging for Prometheus

This article introduces Promtool, the multifunctional command‑line utility bundled with Prometheus, and explains how to validate configurations, check and test rules, query metrics, manage the TSDB, run unit tests, use debugging helpers, install the tool, and apply best‑practice recommendations.

Monitoring · Prometheus · Promtool
0 likes · 5 min read
DevOps Operations Practice
Aug 11, 2024 · Operations

Monitoring Multi-Region HTTP Requests with Prometheus and Blackbox Exporter

This article explains how to deploy Blackbox Exporter in multiple data centers, configure Prometheus to scrape region‑specific HTTP metrics for a target website, validate the setup via queries, and add alerting rules to detect latency or downtime, providing a self‑hosted monitoring solution.

Alerting · Blackbox Exporter · Docker
0 likes · 5 min read
DevOps Operations Practice
May 12, 2024 · Operations

Key Practices for Agile Project Management and DevOps Implementation

The article outlines essential DevOps and agile practices—including Scrum, Kanban, continuous integration, continuous delivery, monitoring, micro‑services, automation, and security—to improve collaboration, increase release frequency, and deliver higher‑quality software faster.

CI/CD · DevOps · Monitoring
0 likes · 6 min read
DevOps Operations Practice
Apr 6, 2024 · Operations

Overview of Common DevOps Tools Used in Large Internet Companies

This article introduces the key DevOps tools—including CI/CD platforms, configuration‑management solutions, containerization technologies, monitoring and logging stacks, and infrastructure‑as‑code utilities—explaining their roles, features, and how they help streamline software delivery in modern enterprises.

CI/CD · DevOps · Monitoring
0 likes · 9 min read
DevOps Operations Practice
Mar 25, 2024 · Operations

How to Monitor MySQL with Prometheus and Grafana

This tutorial explains how to install the MySQL Exporter, configure Prometheus to scrape MySQL metrics, set up Grafana dashboards for visualization, and define alerting rules for common MySQL performance indicators, providing a complete end‑to‑end monitoring solution.

Alerting · Exporter · Grafana
0 likes · 5 min read
DevOps Operations Practice
Mar 14, 2024 · Operations

Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions

This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.

Federation · Monitoring · Performance
0 likes · 6 min read
DevOps Operations Practice
Feb 2, 2024 · Operations

Zabbix vs Prometheus: A Detailed Comparison of Features, Architecture, and Use Cases

This article provides a comprehensive comparison between Zabbix and Prometheus, covering their functional architecture, metric collection methods, data storage, query capabilities, visualization options, and alerting mechanisms, helping readers decide which monitoring system best fits their enterprise needs.

Monitoring · Observability · Prometheus
0 likes · 8 min read
DevOps Operations Practice
Nov 13, 2022 · Operations

Deploying Zabbix Monitoring Platform with Docker Containers

This article provides a step‑by‑step guide to quickly set up the latest Zabbix monitoring platform using Docker, covering Docker installation, MySQL volume creation, deployment of Zabbix server, web UI, Java gateway, agents, and host configuration for comprehensive system monitoring.

Container Deployment · Docker · Monitoring
0 likes · 8 min read
DevOps
Aug 28, 2024 · Operations

Observability: From Traditional Monitoring to Full‑Stack Observability in Modern SRE Practices

This article explains the concept of observability, contrasts it with traditional monitoring, outlines its benefits for system stability, reliability and performance, and provides practical guidance on building a full‑stack observability platform using logs, metrics, tracing and modern cloud‑native tools.

Metrics · Monitoring · Observability
0 likes · 15 min read