Tagged articles
2179 articles
Page 4 of 22
Efficient Ops
Jun 11, 2025 · Operations

Master cURL: Essential Commands for DevOps, Monitoring, and Automation

This guide presents essential cURL commands for service health checks, API testing, file transfer, debugging, Kubernetes interactions, monitoring, load balancing, and webhook triggering, demonstrating how the versatile tool can streamline automation, CI/CD pipelines, and daily DevOps tasks.

API testing · Automation · DevOps
0 likes · 5 min read
Java Captain
Jun 10, 2025 · Backend Development

Why Spring Batch? Real‑World Scenarios, Core Architecture and Hands‑On Guide

This article explains the necessity of batch processing, presents typical use cases such as daily interest calculation, e‑commerce order archiving, log analysis and medical data migration, then dives deep into Spring Batch's core components, provides step‑by‑step code examples, performance‑tuning tips, production‑grade fault‑tolerance, monitoring solutions and a comprehensive FAQ.

Batch Processing · Data Integration · Java
0 likes · 20 min read
FunTester
Jun 5, 2025 · Cloud Native

Automating Thread Dump Generation and Retrieval in Kubernetes for Efficient Fault Diagnosis

The article explains how automating thread dump creation and download in Kubernetes using tools like Fabric8, Prometheus, and CI/CD pipelines dramatically improves fault‑diagnosis speed, data centralization, real‑time capture, and integration with testing frameworks, transforming manual, error‑prone processes into streamlined, intelligent operations.

Automation · Kubernetes · Thread Dump
0 likes · 6 min read
Raymond Ops
Jun 4, 2025 · Operations

Mastering SFTP: Complete Planning, Configuration, and High‑Availability Guide

This guide walks you through SFTP server planning, user naming conventions, directory structures, SSH configuration, account creation, permission setup, client usage, log auditing, rotation, connection limits, monitoring, and high‑availability deployment across multiple servers, providing ready‑to‑run commands and scripts.

ACL · SFTP · SSH
0 likes · 14 min read
Alibaba Cloud Observability
Jun 3, 2025 · Cloud Native

How PromQL Copilot Turns Natural Language into Precise Monitoring Queries

PromQL Copilot leverages Alibaba Cloud's observability platform and AI techniques to convert ambiguous natural‑language monitoring requests into accurate PromQL statements, addressing challenges of ambiguity, domain knowledge, and metric coverage while providing generation, explanation, diagnosis, and recommendation features for cloud‑native environments.

AI · Cloud Native · Metrics
0 likes · 12 min read
Liangxu Linux
Jun 2, 2025 · Operations

10 Must‑Know Ops Tools to Transform Reactive Firefighting into Proactive Management

This guide presents ten essential operations tools—including Zabbix, Prometheus, MySQL, Redis, Ansible, Jenkins, Docker, Kubernetes, LVS, and Kafka—covering monitoring, databases, automation, containerization, and load balancing, to help engineers shift from reactive firefighting to proactive, efficient system management.

Automation · Containers · Messaging
0 likes · 4 min read
Alibaba Cloud Developer
May 27, 2025 · Artificial Intelligence

How to Build AI-Powered Java Apps with Spring AI and DeepSeek

This guide walks Java developers through integrating Spring AI with large‑model services such as DeepSeek, covering setup, API key configuration, code examples for synchronous and streaming calls, reactive implementation, monitoring with Actuator, and compatibility with OpenAI‑style APIs.

AI integration · DeepSeek · Java
0 likes · 9 min read
Bilibili Tech
May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Infrastructure · Operations
0 likes · 17 min read
Java Architecture Diary
May 26, 2025 · Artificial Intelligence

How to Build Enterprise‑Ready AI Monitoring with Spring AI and Micrometer

This article explains why observability is essential for Spring AI applications, outlines common cost‑control and performance challenges, and provides a step‑by‑step guide—including Maven setup, client configuration, service implementation, metric exposure, Zipkin tracing, and architecture insights—to create a fully observable, enterprise‑grade AI translation service.

Micrometer · Observability · monitoring
0 likes · 12 min read
MaGe Linux Operations
May 25, 2025 · Cloud Native

Master Docker Volume Management: From Basics to Advanced Ops

This comprehensive guide walks you through Docker volume creation, inspection, mounting, backup, restoration, cross‑host migration, labeling, driver configuration, security permissions, encryption, monitoring, troubleshooting, capacity planning, and automation scripts, providing practical commands and best‑practice recommendations for reliable container storage management.

Automation · Container · monitoring
0 likes · 8 min read
Su San Talks Tech
May 24, 2025 · Backend Development

12 Proven SpringBoot Performance Hacks to Boost Your API Speed

Discover twelve practical SpringBoot performance optimization techniques—from connection pool tuning and JVM memory settings to caching, async processing, and full‑stack monitoring—each illustrated with code snippets and actionable guidance to prevent full‑table scans, OOM errors, and latency spikes in high‑traffic applications.

JVM · Java · Performance Optimization
0 likes · 13 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps
0 likes · 12 min read
Cloud Native Technology Community
May 22, 2025 · Information Security

How to Prevent Common Kubernetes Security Mistakes and Harden Your Cluster

This article analyzes typical Kubernetes security pitfalls—from weak authentication and overly permissive network policies to missing real‑time monitoring, exposed services, outdated versions, and default component settings—and provides concrete, layered mitigation steps and tool recommendations.

Cloud Native · Kubernetes · Network Policy
0 likes · 13 min read
Big Data Technology & Architecture
May 21, 2025 · Big Data

Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience discussing core Flink interview questions, including typical resource allocation for large online tasks, common problems such as data, performance, stability, and resource issues, and the monitoring practices for clusters and tasks, while also containing a brief self‑promotion.

Big Data · Flink · Performance Issues
0 likes · 7 min read
Architect's Tech Stack
May 20, 2025 · Operations

Visualizing Nginx Access Logs with Loki and Grafana

This guide explains how to collect Nginx access logs, convert them to JSON, store them in Loki using Promtail, and visualize the data with Grafana dashboards, including installation of required modules, Docker deployment, and world‑map panel configuration.

Grafana · JSON · Loki
0 likes · 8 min read
Java Tech Enthusiast
May 18, 2025 · Operations

Ten Rules for Writing High‑Quality Logs in Production Systems

This article presents ten practical rules for producing high‑quality, searchable logs—including unified formatting, stack‑trace inclusion, proper log levels, complete parameters, data masking, asynchronous writing, trace‑ID linking, dynamic level control, structured storage, and intelligent monitoring—to help developers quickly diagnose issues in high‑traffic applications.

best practices · logback · logging
0 likes · 11 min read
Liangxu Linux
May 15, 2025 · Operations

10 Critical Server Ops Mistakes to Avoid and Real-World Lessons

This article outlines ten common server operation pitfalls—such as forced power‑offs, reckless experiments in production, neglecting firewall rules, running unknown scripts as root, unbacked‑up database changes, weak SSH settings, poor log management, exposed ports, unmonitored changes, and delayed patching—each illustrated with real‑world cases and practical remediation advice.

Backup · Security · System Administration
0 likes · 7 min read
Raymond Ops
May 11, 2025 · Cloud Native

How to Expose Ingress Metrics for Prometheus Monitoring in Kubernetes

This guide details how to expose the nginx‑ingress metrics port, configure static and ServiceMonitor‑based scraping in Prometheus Operator, create necessary secrets, and integrate the metrics into Grafana dashboards, providing a complete Kubernetes‑native solution for monitoring ingress traffic.

Cloud Native · Ingress · Prometheus
0 likes · 6 min read
MaGe Linux Operations
May 11, 2025 · Cloud Native

How to Build a High‑Performance, Highly‑Available Cloud‑Native Ingress Gateway

When an Ingress gateway faces traffic exceeding 100,000 QPS, this guide outlines systematic performance optimizations, configuration tweaks, distributed architecture designs, traffic management, monitoring, and disaster‑recovery strategies—including hardware scaling, kernel tuning, DPDK, rate limiting, horizontal scaling, service mesh integration, and CDN offloading—to achieve high concurrency and high availability.

Scalability · cloud-native · high-availability
0 likes · 8 min read
Raymond Ops
May 9, 2025 · Operations

Build a Complete Prometheus Monitoring Stack with Docker

This tutorial explains Prometheus' core components, shows how to deploy Prometheus Server, Node Exporter, cAdvisor, and Grafana as Docker containers on two hosts, configures scraping and alerting, and demonstrates visualizing metrics with ready‑made Grafana dashboards.

Alertmanager · Docker · Exporter
0 likes · 8 min read
Java Captain
Apr 22, 2025 · Operations

Improving Cron Job Stability and Monitoring with Best Practices and Healthchecks

The article analyzes common cron job failures such as accidental deletions, OOM crashes, and lack of monitoring, then proposes standardized Jenkins deployment, automatic server selection, lock mechanisms, queue-based processing, status awareness, and the use of the open‑source Healthchecks system to achieve proactive detection and alerting.

Automation · Operations · cron
0 likes · 8 min read
DeWu Technology
Apr 21, 2025 · Backend Development

Design and Evolution of a Unified Exchange Mall Middleware Platform

The unified exchange mall middleware platform consolidates disparate points‑redemption and lottery flows into a four‑layer architecture—business, gameplay templates, domain models, and downstream services—offering standardized APIs, dynamic RPC routing, Redis‑based inventory control, anti‑fraud safeguards, and built‑in monitoring, thereby cutting development costs, enhancing maintainability, and ensuring system stability.

Backend · Golang · Microservices
0 likes · 18 min read
Efficient Ops
Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · DevOps · Infrastructure
0 likes · 9 min read
Architecture and Beyond
Apr 12, 2025 · Backend Development

How to Keep Your AIGC Service Stable: Queueing and Rate‑Limiting Strategies

This article explains why AIGC services need queueing systems and rate‑limiting, describes the user‑facing behaviors of both mechanisms, outlines design goals, compares queue and limiter implementations, and provides practical guidance on selecting middleware, monitoring, and integrating them into a production workflow.

AIGC · Backend · Message Queue
0 likes · 28 min read
FunTester
Apr 12, 2025 · Operations

How to Design Effective Fault‑Testing Cases for Resilient Distributed Systems

This article explains why fault testing is essential for modern distributed and cloud environments, outlines core goals, design principles, common fault categories, practical implementation strategies such as chaos engineering and gray releases, and shows how to analyze results to continuously improve system reliability.

Distributed Systems · chaos engineering · fault testing
0 likes · 18 min read
Raymond Ops
Apr 7, 2025 · Operations

How to Deploy Prometheus on Kubernetes and Resolve Alertmanager Port Issues

This guide explains what Prometheus monitoring is, walks through downloading the correct version for a Kubernetes cluster, customizing alert rules, deploying and cleaning up Prometheus, and troubleshooting common Alertmanager connection problems by checking DNS and network configurations.

Alertmanager · Prometheus · monitoring
0 likes · 9 min read
Deepin Linux
Apr 2, 2025 · Operations

Comprehensive Guide to bpftrace: Features, Architecture, Installation, and Practical Use Cases

This article introduces bpftrace, an eBPF‑based dynamic tracing tool for Linux, explains its core concepts, technical architecture, installation methods, basic syntax, and demonstrates real‑world performance analysis, fault diagnosis, and security monitoring scenarios while comparing it with DTrace, SystemTap, and BCC.

Debugging · Linux performance · System Tracing
0 likes · 24 min read
The Dominant Programmer
Mar 22, 2025 · Databases

Common Redis Performance Issues and How to Make Your Cache Fly

This article examines the most frequent Redis performance bottlenecks—including high memory usage, network latency, misconfiguration, poor data‑structure choices, and suboptimal persistence—explains why they occur, and provides concrete optimization techniques, monitoring commands, real‑world case studies, and emerging trends to keep your cache fast and stable.

Data Structures · Memory Management · Network Latency
0 likes · 8 min read
Tencent Cloud Developer
Mar 19, 2025 · Cloud Native

Kubernetes Monitoring: Why It’s Needed, Core Components, and Metric Exposure

Monitoring Kubernetes is essential to detect resource contention, component failures, and network issues; it involves tracking core component metrics such as API server latency, etcd write times, scheduler delays, as well as node‑level CPU, memory, disk, and network statistics, pod health, and custom application metrics exposed via Prometheus exporters for comprehensive observability.

Cloud Native · Exporters · Kubernetes
0 likes · 23 min read
JD Tech
Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data · Dashboard · Supply Chain
0 likes · 10 min read
Alibaba Cloud Native
Mar 13, 2025 · Cloud Native

How to Extend SAE with Sidecar Containers for Custom Logging and Monitoring

This article explains how Alibaba Cloud's Serverless Application Engine (SAE) uses sidecar containers to let users add custom log collection, metric monitoring, and resource isolation without modifying their main application code, detailing configuration modes, operational tools, and a step‑by‑step implementation example.

SAE · Serverless · Sidecar
0 likes · 12 min read
php Courses
Mar 13, 2025 · Backend Development

Effective Strategies for Optimizing PHP Application Performance

Optimizing PHP applications involves a combination of code-level improvements—such as caching, efficient algorithms, and query optimization—and server-side configurations like upgrading PHP, enabling opcode caches, tuning web servers, and leveraging CDNs, along with monitoring tools and asynchronous processing to achieve faster, more scalable performance.

Backend · PHP · Performance Optimization
0 likes · 5 min read
JD Tech Talk
Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Dashboard · Supply Chain
0 likes · 11 min read
Efficient Ops
Mar 9, 2025 · Artificial Intelligence

Essential LLMOps Tools: Build, Deploy, Monitor, and Manage Large Language Models

LLMOps, the end-to-end methodology for managing large language models, encompasses a curated set of development, deployment, monitoring, and local management tools—such as LangChain, vLLM, LangSmith, and Ollama—enabling practitioners to efficiently build, scale, and maintain AI applications.

AI Development · LLMOps · Model Deployment
0 likes · 6 min read
dbaplus Community
Mar 5, 2025 · Operations

How to Build a Resilient Content‑Risk System: From Diagnosis to Continuous Improvement

This article shares a step‑by‑step account of taking over a content‑risk stability role in early 2024, defining system stability, diagnosing recurring issues, and implementing a three‑phase framework—pre‑emptive reduction, impact mitigation, and post‑incident improvement—to boost success rates, cut incident response time, and achieve a modular architecture.

JVM Optimization · SRE · incident response
0 likes · 20 min read
Practical DevOps Architecture
Mar 5, 2025 · Operations

Zabbix Agent Active Mode Workflow and Configuration Guide

This article explains the Zabbix‑Agent active mode workflow, detailing how the agent initiates TCP connections to the Zabbix‑Server to request monitoring items, receives the item list, sends collected data back, and provides step‑by‑step configuration of the agent and server, including template cloning and essential parameters.

Active Mode · Zabbix · agent configuration
0 likes · 6 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

chaos engineering · circuit breaker · fault tolerance
0 likes · 11 min read
Cognitive Technology Team
Mar 1, 2025 · Databases

Understanding and Mitigating Redis Large‑Key Issues

The article explains what constitutes a Redis large key, outlines its performance and stability risks, describes common scenarios and root causes, and provides practical detection commands, mitigation techniques such as splitting, compression, proper data modeling, and monitoring strategies to prevent future issues.

Memory Optimization · database · large key
0 likes · 6 min read
macrozheng
Feb 21, 2025 · Backend Development

Boost SpringBoot Performance: Monitoring, Profiling, and Optimization Techniques

This guide walks through practical SpringBoot performance improvements, covering metric exposure with Prometheus, flame‑graph profiling via async‑profiler, distributed tracing with SkyWalking, HTTP and Tomcat tuning, and layer‑specific optimizations for controllers, services, and data access.

monitoring
0 likes · 17 min read
Architecture Development Notes
Feb 19, 2025 · Operations

Avoid Prometheus Label Pitfalls: Best Practices for Scalable Monitoring

This article examines common label misuse in Prometheus, explains why adding global labels to every metric can cause data bloat, configuration rigidity, and dimensional pollution, and provides concrete best‑practice patterns, dynamic injection techniques, and governance rules to keep monitoring systems efficient and maintainable.

Cloud Native · Labels · Prometheus
0 likes · 7 min read
DevOps Cloud Academy
Feb 17, 2025 · Operations

Top 10 AI Tools Transforming DevOps Engineering

This article reviews ten AI‑powered tools—including Jenkins, Ansible, Puppet, Dynatrace, Splunk, GitHub Copilot, New Relic, Azure DevOps, Prometheus, and Chef—that enhance DevOps workflows through predictive analytics, automated rollback, intelligent monitoring, and code assistance, helping teams achieve faster, more reliable software delivery.

AI · Automation · DevOps
0 likes · 14 min read
Liangxu Linux
Feb 16, 2025 · Operations

How to Quickly Visualize Shell Commands with Sampler – Install, Configure, and Use

Sampler is a lightweight tool that runs shell commands, visualizes their output, and triggers alerts, using simple YAML configuration; the guide explains why it’s useful, how to install it on macOS, Linux, and Windows, and provides detailed examples of components, triggers, interactive shells, and real‑world database monitoring scenarios.

Shell · YAML · alerts
0 likes · 14 min read
Deepin Linux
Feb 12, 2025 · Operations

Comprehensive Guide to Linux Server Fault Diagnosis and Troubleshooting

This article provides a detailed overview of common Linux server failures, a step‑by‑step methodology for fault isolation, practical monitoring tools and commands, and a real‑world case study illustrating diagnosis and remediation techniques for production environments.

Sysadmin · linux · monitoring
0 likes · 26 min read
ITPUB
Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Observability · SRE · data engineering
0 likes · 12 min read
Liangxu Linux
Feb 9, 2025 · Fundamentals

Mastering Linux Processes: From Basics to Advanced Monitoring and Management

This guide explains what a process is, how it differs from a program, its lifecycle, how to monitor and interpret process states with ps and top, manage processes using kill, killall, pkill, run jobs in the background with screen or nohup, adjust priorities with nice/renice, and understand load‑average metrics for performance troubleshooting.

Load Average · linux · monitoring
0 likes · 32 min read
dbaplus Community
Feb 6, 2025 · Databases

How a MySQL Online Schema Change Platform Evolved from a Single‑Lane Bridge to a Robust 2.0 System

This article recounts the development of ZzoOnlineDDL, a MySQL schema‑change platform, detailing its 1.0 limitations, the 2.0 architectural upgrades, feature set—including intelligent tool selection, timed execution, sharding support, monitoring, and retry mechanisms—and lessons learned from real‑world incidents such as MDL locks, disk pressure, and unique‑index pitfalls.

Online DDL · Schema Change · gh-ost
0 likes · 34 min read
Efficient Ops
Feb 6, 2025 · Operations

Inside Alipay’s Full‑Ecosystem Availability Monitoring: Architecture and Practices

At the 2024 GOPS Global Operations Conference in Shanghai, Alipay’s monitoring lead Tang Liang presented the challenges, architecture, risk‑prevention practices, and implementation details of the company’s full‑ecosystem availability monitoring system, highlighting its role in DevOps, SRE, and AIOps initiatives.

Availability · Cloud Native · DevOps
0 likes · 4 min read
IT Architects Alliance
Feb 5, 2025 · Cloud Native

Performance Optimization Strategies for Cloud‑Native Applications

This article examines the rapid adoption of cloud‑native architectures and presents a comprehensive guide to identifying performance bottlenecks and applying architectural, resource‑management, caching, networking, and tooling techniques—such as Kubernetes, Prometheus, Grafana, and JMeter—to achieve high‑performance, scalable cloud‑native systems.

caching · cloud-native · monitoring
0 likes · 22 min read
Rare Earth Juejin Tech Community
Feb 5, 2025 · Frontend Development

Front‑End Tracking (Event Instrumentation) Overview, Monitoring Types, Performance Metrics, and Implementation Guide

This article explains front‑end tracking concepts, outlines data, performance, and error monitoring, details common performance metrics, compares code‑based, visual, and automatic tracking solutions, and provides practical JavaScript snippets for event collection, error handling, page‑view reporting, and data transmission methods such as XHR, image GIF, and sendBeacon.

Web Analytics · frontend · monitoring
0 likes · 16 min read
JavaEdge
Feb 2, 2025 · Artificial Intelligence

Mastering LLMOps: From Model Deployment to Scalable AI Operations

This article explains LLMOps—its goals, core activities, benefits, best practices, and how using an LLMOps platform like Dify can dramatically cut development time, simplify prompt engineering, data preparation, monitoring, and deployment of large language models.

AI Operations · Data Management · LLMOps
0 likes · 13 min read
MaGe Linux Operations
Jan 27, 2025 · Operations

Redis Sentinel Deep Dive: High‑Availability Architecture & Automatic Failover

This article explains Redis Sentinel’s role as the official high‑availability solution, detailing its monitoring, notification, automatic failover mechanisms, discovery processes, connection types, down‑state classifications, failover steps, leader election, master selection rules, and data consistency guarantees.

Operations · failover · high availability
0 likes · 18 min read
Soul Technical Team
Jan 24, 2025 · Operations

Migration from Thanos to VictoriaMetrics: Architecture, Plan, Issues, and Benefits

This article details the end‑to‑end migration from Thanos to VictoriaMetrics, covering background analysis, architectural comparison, a phased migration plan, encountered configuration and performance issues, resolution strategies, and the resulting performance, cost, and scalability improvements for the monitoring system.

Thanos · Time Series · VictoriaMetrics
0 likes · 16 min read
Top Architect
Jan 21, 2025 · Backend Development

DynamicTp: A SpringBoot‑Based Dynamic Thread‑Pool Framework for Java Applications

The article introduces DynamicTp, a SpringBoot-based dynamic thread‑pool framework that enables real‑time adjustment, monitoring, and alerting of ThreadPoolExecutor parameters via various configuration centers, outlines its architecture, modules, features, and integration with third‑party components, and provides usage guidance for Java backend developers.

Configuration · DynamicTp · Java
0 likes · 12 min read
Efficient Ops
Jan 19, 2025 · Operations

How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Ops Playbook

After a midnight CPU alarm, I walked through rapid diagnosis, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring a high‑load Java service back to stability, illustrating a comprehensive incident‑response workflow for modern operations teams.

CPU troubleshooting · Docker deployment · JVM profiling
0 likes · 7 min read
Linux Cloud Computing Practice
Jan 17, 2025 · Operations

10 Essential Linux Sysadmin Tools Every Engineer Should Master

This guide outlines the ten fundamental Linux operations tools and skills—ranging from basic system knowledge and networking services to shell scripting, text processing, databases, firewalls, monitoring, clustering, and backup—that every aspiring sysadmin should learn and practice thoroughly.

Networking · Operations · Sysadmin
0 likes · 6 min read
Sohu Tech Products
Jan 15, 2025 · Backend Development

Deep Dive into Druid Connection Pool: Initialization, Retrieval, and Recycling Explained

This technical guide breaks down Alibaba's Druid JDBC connection pool, detailing its initialization process, how connections are fetched and returned, the internal threads and condition‑signal coordination, execution handling, recommended configurations, and monitoring integration, all illustrated with code snippets and diagrams.

Configuration · Connection Pool · Druid
0 likes · 23 min read
Mingyi World Elasticsearch
Jan 13, 2025 · Operations

Top Logstash Interview Questions 11‑20: Answers and Practical Configurations

This article provides concise answers and example configurations for ten common Logstash interview questions, covering HTTP input/poller plugins, the Split filter, pipeline debugging, performance monitoring with Metricbeat, Grok failure handling, secure communication, multi‑source collection, multiple outputs, differences from Elasticsearch ingest pipelines, and Kibana pipeline management.

Elasticsearch · Logstash · Pipeline
0 likes · 7 min read
Open Source Linux
Jan 13, 2025 · Operations

Key Lessons from 2024 Major Service Outages and How to Prevent Future Downtime

The article reviews major 2024 service outages—from Alibaba Cloud to OpenAI—highlights their root causes, and offers practical operations strategies such as disaster recovery, regular backups, load balancing, monitoring, performance tuning, and capacity planning to reduce future downtime.

Operations · capacity planning · disaster recovery
0 likes · 5 min read
Java Backend Technology
Jan 9, 2025 · Backend Development

How DynamicTp Enables Real‑Time ThreadPool Tuning and Monitoring in Java

DynamicTp is a SpringBoot‑compatible framework that extends ThreadPoolExecutor to allow live adjustment of pool parameters, real‑time monitoring, multi‑platform alerts, and seamless integration with popular configuration centers, helping Java services achieve higher performance and reliability.

Dynamic Configuration · Microservices · SpringBoot
0 likes · 11 min read
Su San Talks Tech
Jan 8, 2025 · Backend Development

How DynamicTp Enables Real‑Time Thread Pool Monitoring and Auto‑Tuning in Java

DynamicTp extends Java's ThreadPoolExecutor with zero‑intrusion configuration, real‑time parameter adjustment, comprehensive monitoring, and multi‑channel alerting, allowing developers to dynamically tune thread pools across microservices using popular configuration centers and integrate with tools like Micrometer and Grafana.

DynamicTp · Java · Micrometer
0 likes · 11 min read
macrozheng
Jan 7, 2025 · Backend Development

DynamicTp: Real‑time Monitoring and Dynamic Scaling for SpringBoot Thread Pools

This article introduces DynamicTp, a zero‑intrusion SpringBoot starter that provides real‑time monitoring, dynamic adjustment, and alerting of ThreadPoolExecutor parameters via popular configuration centers, supporting multiple middleware thread pools, various metrics exporters, and extensible SPI interfaces for enterprise‑grade thread‑pool management.

ConfigurationCenter · DynamicThreadPool · ThreadPoolExecutor
0 likes · 11 min read
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Reliability · fault tolerance · monitoring
0 likes · 18 min read
IT Architects Alliance
Dec 29, 2024 · Operations

Design Principles and Key Technologies for High‑Availability Systems

The article explains why 24/7 high‑availability systems are essential for modern enterprises and details core design principles, layered architecture, and critical technologies such as redundancy, load balancing, caching, elastic scaling, monitoring, and fault‑tolerance to ensure continuous, reliable service.

System Design · cloud computing · high availability
0 likes · 23 min read
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data · Cluster Management · fault self-healing
0 likes · 16 min read
Linux Ops Smart Journey
Dec 27, 2024 · Cloud Native

How to Enable Ceph Enterprise Monitoring with Prometheus & Grafana

Learn step‑by‑step how to activate Ceph’s monitoring modules, configure Prometheus to collect Ceph metrics, verify data collection, and integrate Grafana dashboards, including tips on required dependencies and troubleshooting, to ensure reliable, secure storage management in enterprise cloud‑native environments.

Ceph · Grafana · Prometheus
0 likes · 4 min read
Yang Money Pot Technology Team
Dec 26, 2024 · Frontend Development

Design and Implementation of a Multi‑CDN Disaster Recovery Mechanism for Frontend Resource Loading

This article presents a comprehensive multi‑CDN disaster‑recovery solution for frontend static resources, detailing the background, current issues, goals, SDK‑based architecture, monitoring and retry strategies, data‑reporting mechanisms, evaluation results, and future dynamic scheduling improvements.

CDN · Retry · SDK
0 likes · 12 min read
Linux Ops Smart Journey
Dec 20, 2024 · Cloud Native

How to Set Up MinIO Enterprise Monitoring with Prometheus & Grafana

This guide walks you through configuring MinIO's enterprise monitoring panel, generating Prometheus metrics for clusters, nodes, buckets, and resources, integrating them into Grafana dashboards, and verifying successful data collection to enhance data management and operational efficiency.

Grafana · Prometheus · monitoring
0 likes · 7 min read
Full-Stack DevOps & Kubernetes
Dec 20, 2024 · Operations

20 Must‑Know Production Ops Issues and Quick Fixes

This guide presents twenty common production‑environment problems—from log analysis and database recovery to Kubernetes scheduling—detailing real‑world scenarios, step‑by‑step command solutions, and preventive measures that help engineers quickly diagnose, resolve, and avoid outages.

DevOps · Operations · monitoring
0 likes · 17 min read
DevOps Operations Practice
Dec 17, 2024 · Backend Development

From CPU Alert to Resolution: A Step‑by‑Step Backend Performance Debugging Guide

This article recounts a midnight CPU alert incident and walks through systematic backend troubleshooting—from initial system checks and JVM profiling to algorithm refactoring, database indexing, Docker‑based isolation, and proactive monitoring—demonstrating how to restore service performance and prevent future outages.

Docker · JVM · Java
0 likes · 7 min read