Tagged articles
660 articles
MaGe Linux Operations
May 14, 2026 · Operations

Ops Veteran's Secret: Master These 10 Tools to Cut Overtime by 80%

The article lists ten essential Linux operations tools—Shell scripting, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical scenarios, advantages, and concrete usage examples, helping engineers streamline daily tasks and reduce overtime.

Ansible · Docker · ELK Stack
0 likes · 9 min read
Java Architect Essentials
Apr 26, 2026 · Backend Development

15 SpringBoot Performance Tweaks to Handle Million-Scale Concurrency

This guide walks through exposing metrics, integrating Prometheus and Grafana, using async‑profiler flame graphs, tuning Tomcat/Undertow, optimizing JVM flags, adding SkyWalking tracing, and improving code, caches, and thread pools layer by layer so that a SpringBoot service can reliably serve millions of concurrent requests.

Grafana · NGINX · Prometheus
0 likes · 20 min read
Raymond Ops
Apr 22, 2026 · Operations

How Prometheus Recording Rules Can Reduce Alert Noise by 70%

This guide explains how to use Prometheus Recording Rules to pre‑compute, aggregate, and smooth metrics in large‑scale microservice environments, cutting daily alert noise by up to 70% through hierarchical alert design, practical examples, and best‑practice recommendations.

Alert Noise Reduction · DevOps · Kubernetes
0 likes · 22 min read
Ops Community
Apr 18, 2026 · Operations

Master Linux Host Monitoring: Prometheus, Node Exporter, Thresholds & Scripts

This comprehensive guide walks you through building a robust Linux host monitoring system with Prometheus and node_exporter, covering CPU, memory, disk, and network metrics, practical threshold formulas, ready‑to‑run Bash scripts, Alertmanager rules, Grafana dashboards, and best‑practice recommendations for reliable operations.

Alertmanager · Grafana · Linux monitoring
0 likes · 49 min read
Ops Community
Apr 10, 2026 · Databases

How to Diagnose and Fix MySQL Too Many Connections Errors in Production

When MySQL reports 'Too many connections', this guide walks you through emergency assessment, step‑by‑step diagnostics, quick mitigation scripts, root‑cause analysis of slow queries, connection leaks, short‑connection spikes, and long‑term solutions including parameter tuning, connection‑pool configuration, and Prometheus‑based monitoring to prevent future outages.

Alertmanager · Connection Pool · Connection leak
0 likes · 40 min read
AI Step-by-Step
Apr 8, 2026 · Operations

How to Light Up the Black Box of LLM Agents with Full‑Stack Observability

The article explains why traditional logs are insufficient for LLM agents, outlines five observability dimensions—tracing, metrics, behavioral governance, state & memory, and evaluation—and provides concrete, open‑source‑based steps to instrument, monitor, and act on agent workloads in production.

Behavioral Governance · LLM agents · Observability
0 likes · 11 min read
Linux Tech Enthusiast
Apr 7, 2026 · Operations

Top 10 Essential Tools Every Ops Engineer Uses Daily

This article enumerates ten widely used operations tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's function, suitable scenarios, advantages, and concrete usage examples for daily sysadmin tasks.

Ansible · Docker · ELK
0 likes · 8 min read
DeepHub IMBA
Apr 4, 2026 · Artificial Intelligence

Building Mini-vLLM from Scratch: KV‑Cache, Dynamic Batching, and Distributed Inference

This article walks through constructing Mini-vLLM, a from‑scratch LLM inference engine that tackles the O(N²) attention cost with KV‑cache, boosts throughput via dynamic batching, adds observability with Prometheus/Grafana, supports gRPC, and scales across multiple workers, with benchmark numbers demonstrating its CPU‑only performance.

Docker · Dynamic Batching · Inference Engine
0 likes · 12 min read
MaGe Linux Operations
Mar 30, 2026 · Cloud Native

How to Scale Prometheus to Thousands of Nodes with Thanos: A Deep Dive

This article examines the storage, query performance, high‑availability, and high‑cardinality challenges of running Prometheus on a thousand‑node Kubernetes cluster and presents a complete, step‑by‑step Thanos‑based architecture, capacity‑planning models, configuration examples, and operational best practices for reliable horizontal scaling.

Kubernetes · Observability · Prometheus
0 likes · 34 min read
Raymond Ops
Mar 12, 2026 · Operations

How to Supercharge Prometheus: Proven Techniques to Slash Memory and Query Latency

This article shares real‑world experiences and step‑by‑step practices for optimizing Prometheus performance, covering metric pruning, scrape interval tuning, storage engine tweaks, query acceleration, federation architecture, and future observability trends to keep monitoring systems reliable at scale.

Cloud Native · Observability · Operations
0 likes · 11 min read
Raymond Ops
Mar 2, 2026 · Operations

Why Most Alerts Fail and How to Build a Night‑Quiet, High‑Signal Monitoring System

This article examines the root causes of alert fatigue—mis‑configured thresholds, noisy alerts, lack of context, and poor routing—then presents a step‑by‑step guide using golden signals, dynamic baselines, enriched alert payloads, severity‑based routing, and suppression techniques to create an effective, low‑noise monitoring system.

Alerting · Alertmanager · Prometheus
0 likes · 24 min read
Raymond Ops
Feb 25, 2026 · Operations

How to Stop 3 AM Alert Wake‑Ups: 5 Smart Monitoring Techniques

Every night engineers are jolted awake by noisy alerts, but by applying five practical techniques—including alert severity tiers, aggregation, dynamic thresholds, intelligent routing, and data‑driven effectiveness analysis—teams can cut daily alerts from over a hundred to fewer than ten and dramatically improve response times.

Alerting · Alertmanager · Prometheus
0 likes · 44 min read
Raymond Ops
Feb 24, 2026 · Cloud Native

Master Enterprise Monitoring: Build a Prometheus + Grafana Observability Platform

This guide details how to design and implement an enterprise‑grade cloud‑native observability platform using Prometheus for metrics collection and Grafana for visualization, covering architecture, high‑availability deployment, alerting, dashboard automation, case studies, best‑practice recommendations, and future trends.

Cloud Native · Grafana · Observability
0 likes · 24 min read
MaGe Linux Operations
Feb 19, 2026 · Operations

Master Prometheus Alerting: Write Rules and Configure Alertmanager for Reliable Notifications

This comprehensive guide walks you through the fundamentals of Prometheus alerting, from crafting PromQL‑driven alert rules and setting up Alertmanager with routing, grouping, inhibition and silencing, to configuring DingTalk and WeChat webhooks, implementing tiered alert strategies, best‑practice performance tuning, security hardening, high‑availability deployment, troubleshooting, and backup‑restore procedures.

Alert Rules · Alerting · Alertmanager
0 likes · 36 min read
MaGe Linux Operations
Feb 18, 2026 · Databases

How to Replace Prometheus Local Storage with VictoriaMetrics for High‑Performance Long‑Term Monitoring

This guide explains why Prometheus’s local TSDB struggles at scale, compares alternative remote‑storage solutions, and provides a step‑by‑step walkthrough for deploying VictoriaMetrics (single‑node or clustered), configuring remote_write, tuning performance, handling multi‑tenant use cases, and troubleshooting common issues.

Prometheus · TSDB · VictoriaMetrics
0 likes · 42 min read
Raymond Ops
Feb 3, 2026 · Operations

Zabbix vs Prometheus: Which Monitoring System Wins in 2024?

This guide compares Zabbix and Prometheus across architecture, performance, features, operational costs, and real‑world scenarios, providing a detailed selection roadmap for traditional IT, cloud‑native microservices, and hybrid environments while offering optimization tips and future trends.

Prometheus · Zabbix · cloud-native
0 likes · 16 min read
Raymond Ops
Feb 2, 2026 · Operations

10 Essential PromQL Queries Every Ops Engineer Should Master

This article presents ten practical PromQL query examples covering CPU, memory, disk, network, database, Kubernetes, and business metrics, explains the underlying concepts, provides alert thresholds and best‑practice tips, and includes advanced optimization and alert‑rule design guidance for reliable monitoring.

Alerting · Observability · PromQL
0 likes · 22 min read
Ops Community
Jan 27, 2026 · Operations

Master Linux System Monitoring: Deep Dive into CPU, Memory, and I/O Metrics

This comprehensive guide explains how to collect and analyze Linux system metrics—including CPU usage, memory consumption, disk I/O, and load average—using native /proc and /sys interfaces, popular command‑line tools, and Prometheus Node Exporter, with practical scripts, configuration examples, and troubleshooting case studies for reliable performance monitoring and capacity planning.

Linux · Prometheus · Sysadmin
0 likes · 39 min read
MaGe Linux Operations
Jan 18, 2026 · Artificial Intelligence

How to Deploy Scalable LLM Inference on Kubernetes with GPU Autoscaling

This guide walks through building a production‑grade Kubernetes GPU cluster for large language model inference, covering hardware sizing, GPU resource scheduling, model storage options, automated scaling with HPA, health checks, monitoring, troubleshooting, and multi‑model deployment strategies.

Docker · GPU · Inference
0 likes · 49 min read
Java Architect Handbook
Jan 14, 2026 · Operations

How to Build a Scalable Prometheus Monitoring System for Big Data on Kubernetes

This guide explains how to design, configure, and implement a Prometheus‑based monitoring solution for big‑data components running in Kubernetes, covering metric exposure methods, scrape configurations, alerting architecture, dynamic rule management, exporter deployment, and practical examples with full YAML snippets.

Alerting · Big Data Monitoring · Cloud Native
0 likes · 19 min read
Raymond Ops
Jan 12, 2026 · Operations

Build a Real-Time Linux Performance Alert System with Prometheus & Grafana

This guide walks you through designing a layered Linux monitoring architecture, selecting a Prometheus‑Grafana stack, defining key CPU, memory and disk metrics, crafting smart alert rules, visualizing dashboards, and adding automation and AI‑driven predictive techniques for reliable, business‑focused operations.

Grafana · Linux · Ops
0 likes · 13 min read
MaGe Linux Operations
Jan 7, 2026 · Operations

How to Eliminate Alert Fatigue: 10 Proven Prometheus Alerting Techniques

This comprehensive guide walks you through the architecture of Prometheus and Alertmanager, shows how to design, write, and test robust alert rules, and shares ten practical techniques—including proper for‑durations, rate() usage, recording rules, multi‑level alerts, and inhibition—to dramatically reduce alert noise and improve SRE reliability.

Alerting · Alertmanager · DevOps
0 likes · 40 min read
Woodpecker Software Testing
Jan 6, 2026 · User Experience Design

Optimizing the Distribution Platform with User Experience Testing

This article explains how systematic user‑experience testing—covering environment setup, core function benchmarks, and performance monitoring—reveals Distribution’s strengths in multi‑platform compatibility and stability while identifying documentation, configuration, and error‑handling gaps, and recommends tools and continuous improvement practices to enhance the open‑source software distribution platform.

Automated Testing · Continuous Improvement · Docker
0 likes · 4 min read
Woodpecker Software Testing
Jan 5, 2026 · Operations

Three Core Dimensions of Performance Testing: Time Behavior, Resource Utilization, and Capacity

This article breaks down performance testing into three essential dimensions—time behavior, resource utilization, and capacity—explains their key metrics, demonstrates a detailed e‑commerce flash‑sale case study, and shows how systematic testing and optimization can dramatically improve response times, throughput, and scalability.

JMeter · Load Testing · Performance Testing
0 likes · 12 min read
Java Web Project
Jan 4, 2026 · Backend Development

Unlock Spring 6 & Boot 3: Virtual Threads, Declarative HTTP, and GraalVM Native Images

This article walks through the core upgrades in Spring 6 and Spring Boot 3—raising the JDK baseline, adopting Project Loom virtual threads, using the new @HttpExchange declarative client, standardizing error responses with ProblemDetail, compiling to GraalVM native images, and adding Prometheus monitoring—while providing concrete code examples, performance numbers, and a step‑by‑step migration roadmap.

Cloud Native · Microservices · Prometheus
0 likes · 8 min read
dbaplus Community
Dec 22, 2025 · Cloud Computing

How We Cut Kubernetes Costs by 40% Without Switching Platforms

By rethinking resource requests, eliminating unused workloads, downsizing node types, fine‑tuning autoscaling, and trimming log storage, a team reduced their Kubernetes bill by 40% while keeping the same cloud provider, demonstrating that most cost overruns stem from misconfiguration rather than the platform itself.

Cost Optimization · Kubernetes · Prometheus
0 likes · 6 min read
Raymond Ops
Dec 22, 2025 · Operations

Build a High‑Availability Prometheus Monitoring System from Scratch: Pitfalls & Performance Tuning

This guide walks you through constructing a production‑grade, highly available Prometheus monitoring stack, covering architecture choices, sharding strategies, common pitfalls such as memory bloat, query latency and storage growth, and provides concrete tuning steps, Kubernetes deployment examples, and advanced optimisation techniques.

Alerting · Kubernetes · Prometheus
0 likes · 11 min read
Ray's Galactic Tech
Dec 13, 2025 · Cloud Native

Mastering Kubernetes Observability: From Basic Metrics to Production‑Ready Practices

This guide explains how to build a robust Kubernetes observability system, covering core concepts, why traditional monitoring fails, paradigm shifts, best‑practice recommendations, and real‑world case studies that illustrate troubleshooting, alert design, cost and security monitoring, and a step‑by‑step adoption checklist.

Cloud Native · Observability · Prometheus
0 likes · 10 min read
MaGe Linux Operations
Nov 28, 2025 · Operations

10 Essential Linux Ops Tools Every Engineer Should Master

This article presents a curated list of ten widely used Linux operations tools, detailing each tool's core functions, typical use cases, key advantages, and real‑world examples, while also providing practical shell and Ansible code snippets to help engineers apply them immediately.

Ansible · Docker · Grafana
0 likes · 9 min read
Old Meng AI Explorer
Nov 26, 2025 · Operations

How Alertmanager Turns Chaos into Calm: Mastering Alert Management for DevOps

Alertmanager, the official Prometheus alert manager, consolidates redundant alerts, supports silencing, inhibition, multi‑channel routing, and high‑availability clustering, enabling DevOps teams to quickly pinpoint critical issues, reduce noise, and streamline incident response across large server fleets with simple YAML configuration and command‑line tools.

Alert Management · Alertmanager · DevOps
0 likes · 10 min read
MaGe Linux Operations
Nov 17, 2025 · Operations

Production-Ready Prometheus Alerting: 50+ Core Metrics & Best Practices

This guide details production‑grade Prometheus alerting configurations, covering applicable scenarios, prerequisites, anti‑patterns, environment matrices, step‑by‑step deployment of Node Exporter, Prometheus and Alertmanager, comprehensive rule files, performance testing, troubleshooting, best practices, and ready‑to‑use scripts for backup and health checks.

Alerting · Infrastructure · Ops
0 likes · 51 min read
Code Wrench
Nov 16, 2025 · Backend Development

Build a High‑Performance Go + Playwright Browser Automation Framework

Learn how to create a production‑grade, high‑throughput browser automation service in Go using Playwright, featuring browser‑context pooling, proxy rotation, task scheduling with watchdogs, Prometheus metrics, and a WebUI, enabling thousands of concurrent tasks, robust monitoring, and easy scalability.

Go · Playwright · Prometheus
0 likes · 14 min read
Liangxu Linux
Nov 6, 2025 · Operations

Top 6 Free Open‑Source Network Monitoring Tools You Should Know

This article introduces six free open‑source network monitoring solutions—Zabbix, Prometheus, Cacti, Grafana, OpenNMS, and Nagios—explaining their key features, how they collect and visualize metrics, and why they are valuable for maintaining system stability and security.

Grafana · Nagios · Network Monitoring
0 likes · 5 min read
MaGe Linux Operations
Nov 6, 2025 · Cloud Native

Master Kubernetes Node Autoscaling with Custom Prometheus Metrics in 30 Minutes

This guide walks you through a complete, 30‑minute implementation of Kubernetes node autoscaling using Horizontal Pod Autoscaler (HPA) with custom Prometheus metrics, covering prerequisites, anti‑pattern warnings, environment matrix, step‑by‑step deployment, core principles, observability, troubleshooting, best practices, and FAQ.

HPA · Kubernetes · Prometheus
0 likes · 50 min read
Linux Ops Smart Journey
Nov 5, 2025 · Cloud Native

Why Switch from Prometheus? Deploy a High‑Performance vmagent Cluster with VictoriaMetrics

This article explains the scalability limits of Prometheus, introduces vmagent as a lightweight, high‑performance collector compatible with Prometheus, and provides a step‑by‑step guide—including configuration, systemd service setup, and verification—to deploy a resilient vmagent cluster in production.

Deployment · Prometheus · VictoriaMetrics
0 likes · 5 min read
Architect
Nov 4, 2025 · Operations

How to Accurately Track API Calls per Minute: 5 Proven Monitoring Strategies

This article explores why precise per‑minute API call statistics are essential for performance bottleneck detection, capacity planning, security alerts, billing, and troubleshooting, and presents five practical implementations—including fixed‑window counters, sliding windows, AOP‑based interception, Redis time‑series storage, and Micrometer‑Prometheus integration—along with their trade‑offs and capacity‑planning guidelines.

API monitoring · Java · Performance Optimization
0 likes · 25 min read
JakartaEE China Community
Nov 4, 2025 · Operations

How Logs, Traces, and Metrics Differ—and Why It Matters

Logs, tracing, and metrics each serve distinct monitoring goals—logs capture discrete events for debugging and audit, traces map request flows to pinpoint performance bottlenecks, and metrics provide time‑series health data; understanding their differences and integrating tools like ELK, OpenTelemetry, Prometheus, and Grafana enables robust observability.

ELK · Grafana · Observability
0 likes · 7 min read
Ops Community
Nov 1, 2025 · Operations

Deploy a Three‑Tier Chrony Time Sync Architecture with µs‑Level Monitoring

Learn how to set up Chrony for precise time synchronization across distributed systems by installing Chrony, configuring a three‑layer Stratum architecture, enabling hardware clock sync, protecting against clock jumps, and monitoring offsets with Prometheus and Node Exporter to achieve microsecond‑level accuracy.

Prometheus · chrony · monitoring
0 likes · 30 min read
MaGe Linux Operations
Nov 1, 2025 · Operations

How to Build Production‑Grade Prometheus Alert Rules and Silence Policies in 10 Minutes

This guide walks SRE and operations teams through setting up Prometheus alert rule templates, defining severity/team/service labels, configuring Alertmanager routing and receivers, testing alerts, creating scheduled silences, automating silence management via API, implementing inhibition rules, establishing Git‑based review pipelines, persisting alert history to MySQL, and applying security, performance, and compliance best practices.

Alerting · Alertmanager · Prometheus
0 likes · 31 min read
Advanced AI Application Practice
Oct 31, 2025 · Operations

How Non‑Coding Test Engineers Can Master Performance Testing Without a Technical Barrier

This guide shows non‑coding software test engineers how to conduct effective performance testing by selecting visual tools, following a clear three‑step process, interpreting business‑focused metrics, and avoiding code‑intensive scenarios, enabling them to deliver reliable results without writing code.

Lighthouse · No-code · Performance Testing
0 likes · 11 min read
Code Wrench
Oct 26, 2025 · Backend Development

Build a Scalable Go Actor Framework with Auto‑Scaling and Graceful Shutdown

Explore the Go Actor model’s core concepts, compare popular Actor libraries, and follow a step‑by‑step implementation that introduces a mailbox, supervisor restart strategy, dynamic ActorPool with auto‑scaler, graceful shutdown via context, and Prometheus metrics, culminating in a complete, production‑ready concurrent framework.

Auto Scaling · Go · Prometheus
0 likes · 15 min read
MaGe Linux Operations
Oct 21, 2025 · Operations

Mastering Prometheus: Proven Strategies to Optimize Monitoring Performance

This article shares real‑world experiences and step‑by‑step techniques—including metric pruning, sampling interval tuning, TSDB configuration, query rewriting, and federation—to dramatically improve Prometheus memory usage, query latency, and overall scalability for large‑scale cloud‑native environments.

Operations · Prometheus · cloud-native
0 likes · 11 min read
MaGe Linux Operations
Oct 18, 2025 · Operations

10 Proven Causes of Linux CPU Spikes and How to Diagnose Them Fast

Learn a step‑by‑step Linux CPU high‑usage diagnostic guide covering ten root causes, quick monitoring commands, deep analysis with top, ps, strace, perf, and flamegraphs, plus practical remediation and long‑term monitoring setup using sar and Prometheus to prevent future spikes.

CPU · Linux · Prometheus
0 likes · 22 min read
Raymond Ops
Oct 12, 2025 · Operations

Master PromQL: From Basics to Advanced Query Techniques

This comprehensive guide walks you through PromQL fundamentals, covering data types, gauge and counter metrics, time‑series concepts, query selectors, offsets, arithmetic and logical operators, vector matching, aggregation functions, and key Prometheus functions such as increase, rate, and histogram_quantile, with practical examples and visual illustrations.

Alerting · PromQL · Prometheus
0 likes · 29 min read
Java Tech Enthusiast
Oct 11, 2025 · Backend Development

How MyBatis Interceptors Can Safeguard Your Java Service from Out‑of‑Memory Crashes

This article explains how oversized database query results can cause JVM memory spikes and OOM errors, and shows how to use MyBatis interceptors to monitor, limit, and protect memory consumption with non‑intrusive code, Prometheus metrics, and configurable thresholds, ultimately improving system stability and performance.

Backend · Interceptor · Java
0 likes · 20 min read
Java One
Oct 10, 2025 · Operations

Step‑by‑Step Guide to Install, Configure, and Use Grafana Mimir for Scalable Prometheus Monitoring

This tutorial walks through both command‑line and Docker‑Compose installations of Grafana Mimir, shows how to configure Prometheus remote‑write, set up Grafana data sources, create recording and alerting rules, and explains key Mimir features such as multi‑tenant support, hash rings, object storage, HA tracking and retention policies.

Alerting · Docker · Grafana Mimir
0 likes · 20 min read
IT Architects Alliance
Oct 6, 2025 · Cloud Native

Mastering Cloud‑Native Observability: From Metrics to Tracing

The article explains why enterprises struggle with cloud‑native observability, outlines the exponential complexity and dynamic nature of modern microservice environments, and presents a comprehensive three‑pillar approach—metrics, logging, tracing—along with practical Prometheus, OpenTelemetry, and sidecar configurations, storage choices, sampling, alerting, cost‑control, team upskilling, and future trends such as AIOps and eBPF.

Cloud Native · Observability · OpenTelemetry
0 likes · 12 min read
MaGe Linux Operations
Oct 6, 2025 · Cloud Native

Prometheus vs Cloud Provider Monitoring: Which Is the Most Cost‑Effective Choice for 2025?

This article compares open‑source Prometheus + Grafana with managed cloud monitoring services, evaluating deployment complexity, functionality, scalability, security, and total cost of ownership across small, medium, and large workloads, and provides practical decision‑making guidance for teams of different sizes and requirements.

Observability · Prometheus · cloud-native
0 likes · 56 min read
Java One
Sep 21, 2025 · Operations

Mastering Prometheus rate, irate, and increase: When and How to Use Each

This article explains how Prometheus’s rate, irate, and increase functions calculate counter growth rates, handle counter resets, and differ in smoothing and responsiveness, guiding you to choose the appropriate function for monitoring request rates, CPU usage, and other metrics.

Prometheus · increase · irate
0 likes · 7 min read
21CTO
Sep 19, 2025 · Operations

Samba 4.23 Unveiled: QUIC Support, Unix Extensions, and Prometheus Integration

Samba 4.23 introduces QUIC transport for SMB3, enables Unix extensions by default, adds Prometheus‑compatible monitoring, improves file timestamp handling, and provides new backup options, while the article also offers step‑by‑step Ubuntu installation commands.

Installation · Linux · Prometheus
0 likes · 6 min read
Java Tech Enthusiast
Sep 14, 2025 · Operations

How to Use Java Agent for Non‑Intrusive SpringBoot Monitoring

Learn how to implement a Java Agent that enables non‑intrusive monitoring of SpringBoot applications, covering agent basics, bytecode manipulation with Byte Buddy, metric collection via Micrometer, Prometheus/Grafana integration, and advanced extensions such as JVM metrics, HTTP client tracing, and distributed tracing.

Prometheus · SpringBoot · bytecode
0 likes · 16 min read
Code Ape Tech Column
Sep 12, 2025 · Operations

Master Grafana & Prometheus: Step‑by‑Step Guide to Build a Full‑Featured Monitoring System

This comprehensive tutorial walks you through installing and configuring Grafana, Prometheus, and related exporters, setting up dashboards, enabling email alerts, and extending monitoring to MySQL, RabbitMQ, Redis, and TiDB, all while providing clear code snippets and practical tips for a robust observability stack.

Alerting · DevOps · Grafana
0 likes · 24 min read
dbaplus Community
Sep 11, 2025 · Cloud Native

Building a Scalable Kubernetes Monitoring Architecture and Alert Management

This guide presents a comprehensive, layered Kubernetes monitoring architecture—including control plane, node, resource, and extension layers—detailing high‑availability Prometheus deployment, alert grouping strategies, custom CRD metrics, visualization dashboards, and practical best‑practice recommendations for reliable observability in cloud‑native environments.

Alerting · Cloud Native · Kubernetes
0 likes · 11 min read
Java One
Sep 8, 2025 · Operations

Understanding Prometheus Metric Types: Gauge, Counter, Summary, and Histogram Explained

Prometheus supports four core metric types—gauge, counter, summary, and histogram—each with distinct semantics and usage patterns; this guide explains their definitions, how to update them via client libraries, and how they appear in the Prometheus text exposition format, including example code and query tips.

Counter · Gauge · Histogram
0 likes · 10 min read
Java One
Sep 3, 2025 · Operations

How to Install, Configure, and Run Prometheus: A Step‑by‑Step Guide

This guide walks you through installing Prometheus via binary download, configuring global scrape settings and job definitions, running the server with command‑line options, and using the web UI and PromQL to verify target health and query metrics, illustrated with screenshots and example queries.

Installation · Observability · PromQL
0 likes · 6 min read
Code Ape Tech Column
Sep 2, 2025 · Operations

Avoid QPS Miscalculations: 5 Proven Methods to Accurately Measure Traffic

This article explains five practical ways to count QPS—from gateway and application instrumentation to monitoring tools, log analysis, and database metrics—while highlighting common pitfalls such as health‑check filtering, thread‑safety, and multi‑node aggregation, helping engineers make informed scaling decisions.

ELK · Java · Performance Monitoring
0 likes · 16 min read
Qunar Tech Salon
Sep 1, 2025 · Databases

Redesigning Database Monitoring: From Push to Pull for Smarter Alerts

This article analyzes the shortcomings of the legacy database monitoring system, explains the transition from a push‑based to a pull‑based architecture, outlines comprehensive metric collection, intelligent alert strategies, and self‑healing mechanisms, and showcases the performance improvements achieved with the new solution.

Alerting · Database Monitoring · Prometheus
0 likes · 25 min read
Raymond Ops
Aug 28, 2025 · Operations

Step-by-Step Guide to Install, Configure, and Use Prometheus for Monitoring

This tutorial walks you through downloading Prometheus, setting up self‑monitoring, starting the server, opening firewall ports, exploring the built‑in UI, adding Node Exporter targets, configuring scrape jobs, creating recording rules, and visualizing metrics with queries and graphs.

Configuration · Prometheus · Recording Rules
0 likes · 10 min read
Sanyou's Java Diary
Jul 31, 2025 · Databases

How MyBatis Interceptors Can Safeguard Your Java Service from Memory Overruns

This article explains how oversized database query results can cause JVM heap spikes, frequent Full GC, or OOM crashes in Java services, and demonstrates a non‑intrusive MyBatis interceptor solution that monitors, grades, and blocks risky queries while exposing Prometheus metrics for proactive alerting and capacity planning.

Interceptor · Java · MyBatis
0 likes · 18 min read
Efficient Ops
Jul 14, 2025 · Operations

Rescuing a Critical CPU Outage: My Step-by-Step Troubleshooting Guide

After a midnight CPU alarm threatened service stability, I walked through rapid diagnosis with top and htop, identified JVM bottlenecks using jstat and async‑profiler, refactored a Java sorting algorithm, added caching, optimized database queries, containerized the service, and set up Prometheus‑Grafana alerts to prevent future incidents.

CPU troubleshooting · Docker · Java performance
0 likes · 7 min read
Architect
Jul 13, 2025 · Backend Development

Master Spring 6 & Spring Boot 3: Core Features, Virtual Threads, GraalVM & More

This article surveys the Spring ecosystem upgrade. It covers Spring 6 core features (JDK 17 baseline, Project Loom virtual threads, declarative HTTP clients, RFC 7807 ProblemDetail handling, and GraalVM native images) and Spring Boot 3 changes (Jakarta EE migration, OAuth2 server, and Prometheus monitoring), and closes with practical migration roadmaps for cloud‑native applications.

Microservices · Prometheus · Spring 6
0 likes · 8 min read
Linux Ops Smart Journey
Jul 6, 2025 · Cloud Native

Automate Prometheus Service Discovery with Nacos: A Step‑by‑Step Guide

Learn how to replace static Prometheus target files with dynamic service discovery by integrating Alibaba’s open‑source Nacos registry, configuring a Go‑based adapter, adding HTTP‑SD configs to the Prometheus Operator, and validating the automated monitoring of large‑scale microservice deployments.

Nacos · Prometheus · service discovery
0 likes · 5 min read
Linux Ops Smart Journey
Jul 3, 2025 · Cloud Native

How to Visualize Kubernetes Namespace Resource Usage with Prometheus

This guide walks you through deploying kube-state-metrics, configuring Prometheus to collect CPU, memory and other resource metrics per Kubernetes namespace, setting up ResourceQuota and LimitRange visualizations, and verifying data collection with Helm, Docker, and curl commands, enabling comprehensive cluster health monitoring.

Kubernetes · Prometheus · ResourceQuota
0 likes · 7 min read