Tagged articles
660 articles
Ops Development Stories
Jun 19, 2025 · Operations

How to Build an Automated Prometheus Inspection System with Go

This article explains how to design and implement an automated inspection platform that leverages Prometheus and Grafana for metric collection, splits inspection tasks, schedules them with cron, generates reports, sends WeChat notifications, and exports results to PDF, all using Go and the gin‑vue‑admin framework.

Automated Inspection · Cloud Native · Go
0 likes · 17 min read
Linux Ops Smart Journey
Jun 16, 2025 · Cloud Native

Mastering PrometheusRule: Streamline Kubernetes Alerting & Recording

This article explains how PrometheusRule, a Kubernetes custom resource, simplifies the management of alerting and recording rules by centralizing configurations, reducing restarts, avoiding conflicts, and enabling version‑controlled, modular monitoring for cloud‑native environments.

Cloud Native · Kubernetes · Prometheus
0 likes · 6 min read
Liangxu Linux
Jun 10, 2025 · Cloud Native

Why Loki Is the Ideal Cloud‑Native Log Aggregator for Prometheus & Grafana

Loki, an open‑source log aggregation system from Grafana Labs, integrates tightly with Prometheus and Grafana, stores logs efficiently in object storage, uses a simple label‑based model, and provides cost‑effective, high‑performance logging for cloud‑native environments. The article outlines its architecture, usage, configuration, advantages, limitations, and retention policies.

Cloud Native · Grafana · Loki
0 likes · 10 min read
Selected Java Interview Questions
Jun 2, 2025 · Backend Development

Implementing Precise Per‑Minute API Call Statistics in Java: Multiple Solutions and Best Practices

This article explains why per‑minute API call counting is essential for performance bottleneck detection, capacity planning, security alerts and billing, and presents five concrete Java‑based implementations—including a fixed‑window counter, a sliding‑window counter, AOP‑based transparent monitoring, a Redis time‑series solution, and Micrometer‑Prometheus integration—along with a hybrid architecture, performance benchmarks, and practical capacity‑planning advice.

API monitoring · Prometheus · Sliding Window
0 likes · 25 min read
Selected Java Interview Questions
May 30, 2025 · Operations

Batch Installation of Node Exporter on Linux Hosts Using Ansible, JumpServer, and a Static File Server

This guide explains three practical methods for deploying the Prometheus node_exporter collector across large numbers of Linux servers—using a JumpServer with Ansible, a standalone Ansible playbook, or a custom Bash script combined with an internal static file server—complete with configuration, service setup, and integration into Consul and vmagent monitoring.

Ansible · Consul · Linux monitoring
0 likes · 10 min read
DevOps Operations Practice
May 21, 2025 · Operations

Prometheus vs Zabbix: Architecture, Data Collection, Storage, and Alerting Comparison for Enterprise IT Operations

This article compares Prometheus and Zabbix across architecture design, data collection methods, storage engines, scalability, deployment complexity, alerting mechanisms, and suitable scenarios, helping operations teams choose the most appropriate monitoring solution for cloud‑native or traditional enterprise environments.

Comparison · IT Operations · Prometheus
0 likes · 7 min read
Raymond Ops
May 11, 2025 · Cloud Native

How to Expose Ingress Metrics for Prometheus Monitoring in Kubernetes

This guide details how to expose the nginx‑ingress metrics port, configure static and ServiceMonitor‑based scraping in Prometheus Operator, create necessary secrets, and integrate the metrics into Grafana dashboards, providing a complete Kubernetes‑native solution for monitoring ingress traffic.

Cloud Native · Ingress · Prometheus
0 likes · 6 min read
dbaplus Community
May 11, 2025 · Operations

Mastering SRE’s Four Golden Signals with Prometheus: A Practical Guide

This guide explains the four SRE golden signals—Latency, Traffic, Errors, and Saturation—covers their definitions, how to measure them with Prometheus in Node.js, compares them to RED and USE frameworks, and provides concrete alerting rules for reliable service monitoring.

Golden Signals · Observability · Prometheus
0 likes · 12 min read
Raymond Ops
May 9, 2025 · Operations

Build a Complete Prometheus Monitoring Stack with Docker

This tutorial explains Prometheus' core components, shows how to deploy Prometheus Server, Node Exporter, cAdvisor, and Grafana as Docker containers on two hosts, configures scraping and alerting, and demonstrates visualizing metrics with ready‑made Grafana dashboards.

Alertmanager · Docker · Exporter
0 likes · 8 min read
MaGe Linux Operations
May 7, 2025 · Operations

Master PromQL: From Basics to Advanced Query Techniques for Monitoring

This comprehensive guide walks you through PromQL fundamentals, data types, query expressions, selectors, operators, aggregation, and essential functions, illustrating each concept with real‑world monitoring scenarios and code examples to help you effectively query and analyze time‑series data in Prometheus.

PromQL · Prometheus · Time Series
0 likes · 32 min read
Code Ape Tech Column
May 7, 2025 · Backend Development

Detailed Overview of Spring 6.0 Core Features and Spring Boot 3.0 Enhancements

This article provides a comprehensive guide to Spring 6.0’s new baseline JDK 17 requirement, virtual threads, declarative HTTP clients, RFC‑7807 ProblemDetail handling, GraalVM native image support, and Spring Boot 3.0 improvements such as Jakarta EE migration, OAuth2 authorization server, Prometheus monitoring, and practical migration steps for enterprise applications.

Backend · Java · Prometheus
0 likes · 8 min read
DevOps Operations Practice
Apr 11, 2025 · Operations

Promtool: A Complete Guide to Configuration Validation, Rule Checking, TSDB Management, and Debugging for Prometheus

This article introduces Promtool, the multifunctional command‑line utility bundled with Prometheus, and explains how to validate configurations, check and test rules, query metrics, manage the TSDB, run unit tests, use debugging helpers, install the tool, and apply best‑practice recommendations.

Configuration Validation · Debugging · Prometheus
0 likes · 5 min read
Raymond Ops
Apr 7, 2025 · Operations

How to Deploy Prometheus on Kubernetes and Resolve Alertmanager Port Issues

This guide explains what Prometheus monitoring is, walks through downloading the correct version for a Kubernetes cluster, customizing alert rules, deploying and cleaning up Prometheus, and troubleshooting common Alertmanager connection problems by checking DNS and network configurations.

Alertmanager · Prometheus · monitoring
0 likes · 9 min read
Volcano Engine Developer Services
Apr 1, 2025 · Artificial Intelligence

Taming High Cardinality in AI Model & Autonomous Driving Monitoring with Prometheus

This article explores how high cardinality in Prometheus metrics impacts AI large‑model and autonomous‑driving observability, explains the underlying concepts, outlines the performance and cost challenges, and presents practical design, collection, and query‑side solutions—including metric modeling, pre‑aggregation, and remote‑read pushdown—to keep monitoring efficient and scalable.

AI Monitoring · Cardinality · Observability
0 likes · 12 min read
ByteDance Cloud Native
Mar 27, 2025 · Operations

Taming High Cardinality in AI & Autonomous Driving with Prometheus

This article shares practical experience from Volcengine's managed Prometheus service and its deep integration with large‑model and autonomous‑driving platforms, explaining what high cardinality is, its impact on monitoring systems, root causes, and a range of design, collection, and analysis techniques to mitigate it.

AI · Observability · Prometheus
0 likes · 12 min read
Alibaba Cloud Observability
Mar 24, 2025 · Artificial Intelligence

Achieving Full Observability for AI Inference Apps with Prometheus

This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.

AI inference · Observability · Prometheus
0 likes · 19 min read
Tencent Cloud Developer
Mar 19, 2025 · Cloud Native

Kubernetes Monitoring: Why It’s Needed, Core Components, and Metric Exposure

Monitoring Kubernetes is essential to detect resource contention, component failures, and network issues; it involves tracking core component metrics such as API server latency, etcd write times, scheduler delays, as well as node‑level CPU, memory, disk, and network statistics, pod health, and custom application metrics exposed via Prometheus exporters for comprehensive observability.

Cloud Native · Exporters · Kubernetes
0 likes · 23 min read
Alibaba Cloud Developer
Mar 18, 2025 · Artificial Intelligence

How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

AI inference · Prometheus · Ray Serve
0 likes · 21 min read
Ops Development Stories
Mar 4, 2025 · Operations

Master Process Exporter: Deploy, Integrate with Prometheus & Grafana in Kubernetes

This guide walks Kubernetes administrators through the full lifecycle of Process Exporter—from lightweight deployment and RBAC setup, through Prometheus Operator integration and Grafana dashboard creation, to detailed configuration and alerting—enabling precise process‑level monitoring and rapid root‑cause analysis.

DaemonSet · Grafana · Kubernetes
0 likes · 15 min read
Architecture Development Notes
Feb 19, 2025 · Operations

Avoid Prometheus Label Pitfalls: Best Practices for Scalable Monitoring

This article examines common label misuse in Prometheus, explains why adding global labels to every metric can cause data bloat, configuration rigidity, and dimensional pollution, and provides concrete best‑practice patterns, dynamic injection techniques, and governance rules to keep monitoring systems efficient and maintainable.

Cloud Native · Labels · Prometheus
0 likes · 7 min read
Infra Learning Club
Feb 16, 2025 · Operations

GPUprobe: Using eBPF to Monitor CUDA Memory Leaks

The article introduces GPUprobe, an eBPF‑based tool that provides lightweight, continuous, application‑level monitoring of CUDA memory allocation, leaks, and kernel launches, compares it with NSight Systems and DCGM, and demonstrates near‑zero overhead integration with Prometheus and Grafana through detailed code examples and real‑world output analysis.

GPU monitoring · Grafana · Observability
0 likes · 13 min read
ITPUB
Jan 18, 2025 · Cloud Native

Prometheus 3.0 Unveiled: New UI, Remote‑Write 2.0, and Native Histograms

Prometheus 3.0, the first major release in seven years, introduces a rebuilt UI, Remote‑Write 2.0 with richer metadata, full UTF‑8 support, native OpenTelemetry ingestion, experimental native histograms, performance gains, and a set of breaking changes that require careful migration.

Cloud Native · Native Histograms · Prometheus
0 likes · 8 min read
Alibaba Cloud Observability
Jan 13, 2025 · Cloud Native

Alibaba Cloud’s Guide to Stable Large‑Scale Kubernetes After OpenAI Crash

After the OpenAI outage caused by a massive Kubernetes API overload, Alibaba Cloud’s Container Service and Observability teams detail how they reinforce large‑scale K8s clusters with high‑availability control‑plane design, optimized Prometheus probing, out‑of‑band monitoring, and best‑practice guidelines for capacity planning, safe releases, and rapid incident response.

Alibaba Cloud · Kubernetes · Large-Scale Clusters
0 likes · 21 min read
Alibaba Cloud Infrastructure
Jan 8, 2025 · Cloud Native

Designing AZ‑Level Disaster Recovery with Alibaba Cloud ACK and Service Mesh ASM

This guide explains how to achieve zone‑level disaster recovery on Alibaba Cloud by deploying multi‑AZ ACK clusters, configuring Service Mesh ASM for observability and traffic shifting, and using Prometheus‑based metrics and alerts to detect and isolate failures, including step‑by‑step instructions and sample YAML manifests.

Kubernetes · Multi‑AZ · Prometheus
0 likes · 24 min read
Alibaba Cloud Developer
Jan 8, 2025 · Cloud Native

Ensuring Massive Kubernetes Cluster Stability: Lessons from the OpenAI Outage

Using the recent OpenAI service disruption as a case study, this article examines the stability challenges of large‑scale Kubernetes deployments and details how Alibaba Cloud Container Service and its Prometheus‑based observability solutions enhance reliability through high‑availability architecture, optimized exporters, out‑of‑band data links, and best‑practice guidelines.

Alibaba Cloud · Large-Scale Clusters · Observability
0 likes · 22 min read
Full-Stack DevOps & Kubernetes
Jan 7, 2025 · Cloud Native

Build a Full Kubernetes DevOps Pipeline: From Containerization to Monitoring

This guide walks through a complete Kubernetes DevOps case study, detailing how to containerize micro‑services, create Docker images, write deployment and service manifests, set up a CI/CD pipeline with Jenkins or GitLab CI, integrate monitoring with Prometheus‑Grafana, manage logs via ELK/EFK, optionally add a service mesh, and perform fault‑injection testing for continuous optimization.

Istio · Kubernetes · Prometheus
0 likes · 6 min read
Alibaba Cloud Infrastructure
Jan 3, 2025 · Cloud Native

How to Enable LLM Traffic Observability with Alibaba Cloud Service Mesh (ASM)

This guide explains how to use Alibaba Cloud Service Mesh (ASM) to add infrastructure‑level observability for large language model (LLM) traffic, covering custom access‑log fields, new Prometheus metrics for token usage, and adding model dimensions to native Istio metrics, with step‑by‑step commands and configuration examples.

ASM · Kubernetes · LLM
0 likes · 14 min read
Architect
Dec 31, 2024 · Operations

Integrating Prometheus with Spring Boot and Visualizing Metrics Using Grafana

This guide explains how to monitor a Spring Boot application using Prometheus, configure Spring Boot Actuator, run Prometheus (including Docker deployment), set up Grafana for visualizing metrics, and create custom metrics with Micrometer, providing step‑by‑step instructions and code examples.

Actuator · Docker · Grafana
0 likes · 10 min read
Linux Ops Smart Journey
Dec 27, 2024 · Cloud Native

How to Enable Ceph Enterprise Monitoring with Prometheus & Grafana

Learn step‑by‑step how to activate Ceph’s monitoring modules, configure Prometheus to collect Ceph metrics, verify data collection, and integrate Grafana dashboards, including tips on required dependencies and troubleshooting, to ensure reliable, secure storage management in enterprise cloud‑native environments.

Ceph · Grafana · Prometheus
0 likes · 4 min read
Alibaba Cloud Infrastructure
Dec 25, 2024 · Cloud Native

Ensuring Stability of Large‑Scale Kubernetes Clusters: Lessons from the OpenAI Incident and Alibaba Cloud Practices

This article analyses the OpenAI large‑scale Kubernetes outage, explains the inherent risks of massive K8s clusters, and presents Alibaba Cloud's architectural enhancements, observability improvements, and best‑practice guidelines for achieving high availability and reliable operation in Kubernetes environments with thousands of nodes.

Cloud Native · Kubernetes · Large-Scale Clusters
0 likes · 21 min read
Linux Ops Smart Journey
Dec 20, 2024 · Cloud Native

How to Set Up MinIO Enterprise Monitoring with Prometheus & Grafana

This guide walks you through configuring MinIO's enterprise monitoring panel, generating Prometheus metrics for clusters, nodes, buckets, and resources, integrating them into Grafana dashboards, and verifying successful data collection to enhance data management and operational efficiency.

Grafana · Prometheus · monitoring
0 likes · 7 min read
Raymond Ops
Dec 19, 2024 · Operations

How to Auto‑Scale Non‑CPU Apps with cAdvisor Network Metrics in Kubernetes

This guide explains how to use cAdvisor‑provided container network traffic counters as custom metrics for Kubernetes HPA, covering metric collection, Prometheus‑adapter configuration, verification, and a complete HPA testing workflow for elastic scaling of non‑CPU‑intensive workloads.

HPA · Kubernetes · Prometheus
0 likes · 7 min read
Linux Ops Smart Journey
Dec 3, 2024 · Cloud Native

How to Set Up Harbor Monitoring with Prometheus and Grafana

Learn step‑by‑step how to deploy the harbor‑exporter, configure Prometheus to scrape Harbor metrics, verify data collection, and add official Grafana dashboards, enabling real‑time monitoring of your Harbor registry for improved stability, security, and performance in cloud‑native environments.

Grafana · Harbor · Kubernetes
0 likes · 6 min read
Zhuanzhuan Tech
Nov 29, 2024 · Operations

Why Use Prometheus and How It Guarantees Business System Stability

This article explains the motivations for adopting Prometheus, introduces its core components and metric types, and demonstrates how comprehensive monitoring of business‑critical data, failure events, QPS, latency, and underlying resources can improve system stability and accelerate fault response.

Java · Prometheus · system stability
0 likes · 13 min read
ITPUB
Nov 23, 2024 · Operations

Zabbix vs Prometheus: Which Monitoring Tool Wins for Modern Cloud Environments?

This article compares Zabbix and Prometheus across performance, data collection, visualization, and alerting, highlighting their architectural differences, ecosystem strengths, and suitability for traditional data‑center monitoring versus dynamic cloud‑native workloads.

Alerting · Observability · Prometheus
0 likes · 11 min read
Rare Earth Juejin Tech Community
Nov 18, 2024 · Cloud Native

Developing a Custom Kubernetes Controller for Flink Task Scheduling

This article provides a step‑by‑step guide to building a custom Kubernetes controller in Go that uses Prometheus metrics to intelligently schedule Flink TaskManager Pods, covering the underlying scheduler concepts, code implementation, Docker image creation, RBAC setup, deployment, testing, and advanced considerations.

Cloud Native · Custom Scheduler · Flink
0 likes · 38 min read
Linux Ops Smart Journey
Nov 12, 2024 · Databases

Master PostgreSQL Monitoring with Grafana: Step-by-Step Guide

Learn how to deploy postgres_exporter, configure PostgreSQL extensions, set up Prometheus scraping, and create Grafana dashboards for comprehensive PostgreSQL performance monitoring, complete with command-line instructions and tips for verifying data collection and visualizing metrics.

Grafana · PostgreSQL · Prometheus
0 likes · 6 min read
Linux Ops Smart Journey
Nov 3, 2024 · Cloud Native

Build a Robust Kubernetes Monitoring System with Prometheus and HAProxy

This guide walks you through setting up a comprehensive Kubernetes monitoring solution—covering component metrics collection, configuring HAProxy for network access, exposing metrics from kube-proxy, Calico, and kube-state-metrics, and integrating everything into Prometheus for reliable cluster health visibility.

Calico · HAProxy · Kubernetes
0 likes · 12 min read
Java Architect Essentials
Oct 27, 2024 · Operations

Integrating Prometheus with Spring Boot for Real‑time Monitoring and Grafana Visualization

This article explains how to use Prometheus together with Spring Boot Actuator and Micrometer to collect, expose, and visualize application metrics, including step‑by‑step dependency configuration, YAML settings, Docker deployment of Prometheus and Grafana, and adding custom metrics for comprehensive monitoring.

Actuator · Grafana · Prometheus
0 likes · 10 min read
ITPUB
Oct 6, 2024 · Operations

Mastering Prometheus Metrics: Practical Best‑Practice Guide for Effective Monitoring

This guide explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, the four golden metrics, system‑specific metric groups, vector and label choices, naming conventions, histogram bucket design, and useful Grafana visualization tips.

Grafana · Operations · Prometheus
0 likes · 9 min read
DevOps Operations Practice
Sep 25, 2024 · Operations

Prometheus 3.0‑beta Released: New UI, Remote Write 2.0, OpenTelemetry Support, and Other Major Changes

Prometheus 3.0‑beta introduces a completely redesigned UI, Remote Write 2.0 with native support for metadata and histograms, built‑in OpenTelemetry metrics handling, UTF‑8 label support, native histograms, and several feature‑flag removals, while encouraging community testing before production use.

BetaRelease · Observability · OpenTelemetry
0 likes · 6 min read
dbaplus Community
Sep 23, 2024 · Operations

How Bilibili Scaled Monitoring: From Prometheus to a 2.0 VM‑Flink Architecture

Bilibili rebuilt its monitoring platform to handle explosive metric growth by separating collection, storage, and compute, adopting VictoriaMetrics, zone‑based scheduling, and Flink‑driven pre‑aggregation, which together improved stability, query performance, cloud data quality, and overall observability.

Flink · Observability · Prometheus
0 likes · 31 min read
Architect
Sep 12, 2024 · Operations

How Bilibili Scaled Its Monitoring: From Prometheus OOMs to VictoriaMetrics & Flink Pre‑Aggregation

The article details Bilibili's evolution of its monitoring platform, describing the stability and performance challenges of a Prometheus‑Thanos stack, the redesign using VictoriaMetrics, collection‑storage separation, unit‑level disaster recovery, query‑tree auto‑replacement, Flink‑based pre‑aggregation, Grafana upgrades, and future roadmap for observability.

Cloud Native · Flink · Observability
0 likes · 30 min read
Alibaba Cloud Infrastructure
Sep 5, 2024 · Artificial Intelligence

Deploying NVIDIA NIM on Alibaba Cloud ACK with Cloud‑Native AI Suite: A Step‑by‑Step Guide

This guide explains how to quickly build a high‑performance, observable, and elastically scalable LLM inference service by deploying NVIDIA NIM on an Alibaba Cloud ACK cluster using the Cloud‑Native AI Suite, KServe, Prometheus, Grafana, and custom autoscaling based on request‑queue metrics.

Alibaba Cloud ACK · Grafana · KServe
0 likes · 15 min read
Sohu Tech Products
Aug 21, 2024 · Operations

Building Dynamic Grafana Dashboards for Push System Monitoring

By instrumenting each node of Zhuanzhuan’s push system with a Prometheus counter labeled by node name and traceId, and visualizing these metrics in a Grafana Flowcharting dashboard that dynamically highlights the trace path, developers can instantly pinpoint failures, cutting troubleshooting time from minutes to near‑zero.

Dynamic Dashboard · Grafana · Java
0 likes · 11 min read
Ops Development Stories
Aug 15, 2024 · Backend Development

How to Build a Flexible API Monitoring Exporter with Gin-Vue-Admin and Prometheus

This article walks through extending a simple Prometheus Exporter into a full-featured API monitoring solution using Gin-Vue-Admin, detailing backend task scheduling, database schema, multi-protocol checks (HTTP, TCP, DNS, ICMP), dynamic cron management, and frontend integration for managing and visualizing health metrics.

API monitoring · Backend · Gin
0 likes · 18 min read
Bilibili Tech
Aug 9, 2024 · Operations

Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink

The new Monitoring 2.0 architecture separates collection, compute and storage, adopts VictoriaMetrics for compact time‑series storage and a zone‑based scheduler, introduces push‑based ingestion, uses Flink for real‑time pre‑aggregation and automatic PromQL rewrite, delivering ten‑fold query speedups, sub‑300 ms p90 latency, and dramatically higher write and query throughput.

Flink · Observability · Prometheus
0 likes · 29 min read
Aikesheng Open Source Community
Aug 5, 2024 · Databases

Evaluating the Use of mmap in Prometheus TSDB: Advantages, Disadvantages, and Performance Implications

This article examines mmap's historical origins, its performance benefits and drawbacks, and analyzes how Prometheus' time‑series database employs memory‑mapped files, revealing why mmap does not degrade Prometheus performance despite known kernel‑level issues such as TLB misses and lock contention.

Linux · Prometheus · TSDB
0 likes · 26 min read
Sohu Tech Products
Jul 24, 2024 · Cloud Native

Understanding Helm and Kubernetes Operators

The article explains how Helm simplifies deploying complex Kubernetes applications with a single YAML chart but cannot manage runtime operations, while Kubernetes Operators—built on custom resource definitions and webhook logic—automate tasks such as scaling, upgrades, and side‑car injection, offering higher‑level lifecycle management.

Application Deployment · CRD · Kubernetes
0 likes · 9 min read
JD Cloud Developers
Jul 17, 2024 · Databases

Choosing the Right Database: MySQL, Redis, HBase, ClickHouse, MongoDB, Elasticsearch, Neo4j, Prometheus & Milvus Explained

Explore nine major database technologies—from traditional relational MySQL to NoSQL Redis, columnar HBase and ClickHouse, document-oriented MongoDB, search engine Elasticsearch, graph Neo4j, time‑series Prometheus, and vector Milvus—plus practical best‑practice guides, real‑world polyglot persistence scenarios, and recommended resources for mastering modern data storage.

ClickHouse · Elasticsearch · HBase
0 likes · 50 min read
MaGe Linux Operations
Jul 16, 2024 · Cloud Native

How Prometheus Sends Alerts: Rules, Templates, and Frequency Explained

This article explains how Prometheus generates and sends alerts, covering the definition of alert rules with PromQL, grouping, templating, configuring evaluation intervals, deploying a custom alert receiver in Kubernetes, and analyzing alert payloads and delivery frequency, while also detailing alert silencing and resolution behavior.

Alerting · Alertmanager · Go
0 likes · 26 min read
Alibaba Cloud Observability
Jul 16, 2024 · Cloud Native

How to Seamlessly Migrate Your Self‑Hosted Prometheus + Thanos to Alibaba Cloud Managed Prometheus

This guide explains why many users still run self‑built Prometheus + Thanos, outlines the common deployment scenarios and pain points, and provides detailed step‑by‑step migration procedures—including metric collection, visualization, and alerting—for moving to Alibaba Cloud's fully managed Prometheus service across Kubernetes, ECS, and IDC environments.

Alibaba Cloud · Cloud Native · Prometheus
0 likes · 14 min read
JD Tech
Jul 15, 2024 · Databases

A Comprehensive Overview of Nine Database Types and Polyglot Persistence Practices

This article provides an in‑depth survey of nine database categories—including relational, key‑value, columnar, document, graph, time‑series, and vector databases—detailing their architectures, advantages, disadvantages, best‑practice recommendations, typical use cases, and how they can be combined in polyglot persistence solutions.

ClickHouse · Database Types · HBase
0 likes · 41 min read
Spring Full-Stack Practical Cases
Jul 14, 2024 · Backend Development

Master Spring Boot Observability with @Timed, @Counted, and @MeterTag

Learn how to enable comprehensive observability in Spring Boot 3.2.5 by leveraging Micrometer’s @Timed, @Counted, and @MeterTag annotations, configuring Actuator endpoints, and customizing aspects to monitor method execution time, request counts, and parameters, complete with practical code examples and Prometheus integration.

Observability · Prometheus · Spring Boot
0 likes · 7 min read