Tagged articles
2179 articles
转转QA
Dec 13, 2024 · Operations

Data Point Governance and Quality Management in ZhiZhu QA Process

This article describes how ZhiZhu's quality inspection team introduced a two‑stage data‑point governance framework—initial manual enforcement followed by automated system monitoring, real‑time validation, user‑behavior trees, and dashboards—to dramatically improve data quality, testing efficiency, and issue resolution.

Dashboard · QA · monitoring
0 likes · 9 min read
Efficient Ops
Dec 11, 2024 · Operations

Thanos vs VictoriaMetrics: Which Prometheus Storage Solution Wins for Scale and Cost?

This article compares Thanos and VictoriaMetrics as long‑term storage solutions for Prometheus, evaluating their architecture, write and read paths, reliability, consistency, performance, scalability, high‑availability, and hosting costs to help you choose the most suitable option for your monitoring stack.

Long‑term Storage · Thanos · VictoriaMetrics
0 likes · 18 min read
JD Cloud Developers
Dec 10, 2024 · Operations

How We Boosted Inventory Platform Stability 24× with Smart Traffic Splitting and Redis Caching

This article examines the stability challenges of an e‑commerce inventory platform—including workflow complexity, database hotspots, and high‑frequency calculations—and details comprehensive solutions such as traffic splitting, gray releases, Redis caching, data consistency mechanisms, rate limiting, and monitoring enhancements that together improved throughput by 24× and reduced latency dramatically.

Operations · inventory · monitoring
0 likes · 14 min read
Top Architect
Dec 9, 2024 · Databases

Database Monitoring and Slow Query Log Management Guide

This article provides a practical guide on monitoring database system resources using Linux commands, configuring MySQL slow query logging, analyzing performance issues, and outlines best practices, while also promoting a ChatGPT community and related services.

DevOps · databases · logging
0 likes · 7 min read
Efficient Ops
Dec 8, 2024 · Operations

Diagnosing High Load with Low CPU on Linux: Commands and Tips

This guide explains how to analyze and troubleshoot situations where a Linux system shows high load averages despite low CPU usage, covering common load analysis methods, key commands like top, vmstat, iostat, and practical solutions for I/O bottlenecks and stuck processes.

CPU · Load · Operations
0 likes · 11 min read
Top Architect
Dec 5, 2024 · Databases

Database Monitoring and Slow Query Log Management Guide

This article explains how database administrators can monitor system resource usage with commands like top, iostat, and vmstat, and configure MySQL slow query logging, including enabling the log, setting thresholds, viewing logs, and applying best‑practice recommendations for analysis and issue resolution.

Database Administration · Slow Query Log · linux-commands
0 likes · 8 min read
Linux Ops Smart Journey
Dec 3, 2024 · Cloud Native

How to Set Up Harbor Monitoring with Prometheus and Grafana

Learn step‑by‑step how to deploy the harbor‑exporter, configure Prometheus to scrape Harbor metrics, verify data collection, and add official Grafana dashboards, enabling real‑time monitoring of your Harbor registry for improved stability, security, and performance in cloud‑native environments.

Grafana · Harbor · Kubernetes
0 likes · 6 min read
58 Tech
Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.

Distributed Tracing · Error Budget · Observability
0 likes · 16 min read
JD Cloud Developers
Nov 27, 2024 · Operations

Mastering SLA, SLO, and SLI: Practical Strategies for Reliable Services

This article explains the core concepts of SLA, SLO, and SLI, demonstrates how to set realistic service level objectives, manage alert noise, and apply practical examples—including API, MQ, and scheduled task monitoring—to improve system reliability and performance during high‑traffic events like 11.11 promotions.

SLA · SLI · SLO
0 likes · 23 min read
Liangxu Linux
Nov 23, 2024 · Cloud Native

How a Solo Engineer Runs a Full‑Stack SaaS on Kubernetes

This article details how a single‑person startup leverages Kubernetes on AWS EKS to handle load balancing, automatic DNS, TLS, autoscaling, monitoring, alerting, secret management, and CI/CD for a Django‑based SaaS, illustrating practical configurations, code snippets, and infrastructure‑as‑code patterns.

AWS EKS · Django · GitOps
0 likes · 16 min read
ITPUB
Nov 23, 2024 · Operations

Zabbix vs Prometheus: Which Monitoring Tool Wins for Modern Cloud Environments?

This article compares Zabbix and Prometheus across performance, data collection, visualization, and alerting, highlighting their architectural differences, ecosystem strengths, and suitability for traditional data‑center monitoring versus dynamic cloud‑native workloads.

Alerting · Observability · Prometheus
0 likes · 11 min read
Alibaba Cloud Observability
Nov 22, 2024 · Cloud Native

Mastering Alibaba Cloud Observability: Tagging Strategies for Efficient Resource Management

This article explains how Alibaba Cloud’s observability suite uses tag metadata to organize, monitor, and secure resources across business, endpoints, applications, middleware, and containers, offering best‑practice design principles and real‑world case studies for building scalable, tag‑driven monitoring dashboards.

Alibaba Cloud · Cloud Native · Tag Management
0 likes · 25 min read
Cognitive Technology Team
Nov 19, 2024 · Operations

Compile-Time Automatic Instrumentation for Go Applications: Principles, Modular Extensions, and Practical Usage

This article introduces a zero‑intrusive compile‑time automatic instrumentation framework for Go, explains its preprocessing and code‑injection mechanisms, and provides modular extension principles with concrete examples such as HTTP header logging, sort algorithm replacement, SQL injection protection, and gRPC traffic control.

Automatic Instrumentation · Go · Modular Extension
0 likes · 18 min read
Ops Development Stories
Nov 19, 2024 · Operations

How to Install and Explore Nightingale v7.7: New Features, Upgrade Guide, and Hands‑On Demo

This article introduces Nightingale monitoring's final v7.7 release, outlines its new features and major v7 changes, provides step‑by‑step upgrade instructions, and walks through a Docker‑based installation, data‑source integration, dashboard import, and alert‑rule configuration with DingTalk notifications.

Alert Rules · Docker · Operations
0 likes · 10 min read
Spring Full-Stack Practical Cases
Nov 18, 2024 · Backend Development

Master Spring Boot 3 Actuator: Custom Endpoints, Health Checks, and Monitoring

Explore comprehensive Spring Boot 3 Actuator capabilities—including enabling CORS, creating custom endpoints, configuring health indicators, HTTP tracing, security auditing, and process monitoring—through detailed explanations, YAML configurations, and full Java code examples, empowering developers to effectively monitor and manage production-ready applications.

Actuator · Custom Endpoint · Spring Boot
0 likes · 8 min read
JD Retail Technology
Nov 13, 2024 · R&D Management

Guidelines for New Project Managers: Initiation, Planning, Execution, and Monitoring

This article shares practical advice for novice project managers, covering the four process groups—initiation, planning, execution, and monitoring—through real‑world examples, stakeholder identification, risk handling, change control, and communication techniques to help them deliver value and grow their teams.

Planning · Project Management · execution
0 likes · 25 min read
Linux Ops Smart Journey
Nov 12, 2024 · Databases

Master PostgreSQL Monitoring with Grafana: Step-by-Step Guide

Learn how to deploy postgres_exporter, configure PostgreSQL extensions, set up Prometheus scraping, and create Grafana dashboards for comprehensive PostgreSQL performance monitoring, complete with command-line instructions and tips for verifying data collection and visualizing metrics.

Grafana · PostgreSQL · Prometheus
0 likes · 6 min read
Java Tech Enthusiast
Nov 10, 2024 · Databases

Database Monitoring and Logging Practices

Effective database administration relies on continuous monitoring of system resources—CPU, memory, disk I/O, and network—using tools like top, iostat, and vmstat, alongside logging slow MySQL queries, analyzing performance bottlenecks, and following best practices such as automated monitoring, centralized log management, regular audits, and log backups.

database · linux · logging
0 likes · 5 min read
Linux Cloud Computing Practice
Nov 5, 2024 · Operations

10 Essential Linux Ops Tools Every Engineer Should Master

This article introduces ten indispensable Linux operations tools—Shell scripting, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical use cases, advantages, and practical examples to help engineers automate and monitor infrastructure efficiently.

Configuration Management · DevOps · Operations
0 likes · 9 min read
Linux Ops Smart Journey
Nov 3, 2024 · Cloud Native

Build a Robust Kubernetes Monitoring System with Prometheus and HAProxy

This guide walks you through setting up a comprehensive Kubernetes monitoring solution—covering component metrics collection, configuring HAProxy for network access, exposing metrics from kube-proxy, Calico, and kube-state-metrics, and integrating everything into Prometheus for reliable cluster health visibility.

Calico · HAProxy · Kubernetes
0 likes · 12 min read
BirdNest Tech Talk
Nov 3, 2024 · Databases

Master ClickHouse Write Performance: Proven Optimization Strategies

This comprehensive guide walks through ClickHouse write‑performance optimization, covering hardware choices, system and application‑level tuning, async insert settings, Buffer engine configuration, storage compression, real‑world case studies, monitoring queries, and actionable best‑practice recommendations.

Async Insert · Buffer Engine · ClickHouse
0 likes · 12 min read
Java Tech Enthusiast
Nov 1, 2024 · Databases

Quick MySQL Configuration and Monitoring Queries

This guide presents essential MySQL configuration and monitoring queries—covering connection limits, Binlog/GTID status, InnoDB settings—plus a one‑click script that consolidates these checks, enabling quick health assessments and more efficient routine inspections of MySQL servers.

SQL · database · monitoring
0 likes · 2 min read
Java Architect Essentials
Oct 27, 2024 · Operations

Integrating Prometheus with Spring Boot for Real‑time Monitoring and Grafana Visualization

This article explains how to use Prometheus together with Spring Boot Actuator and Micrometer to collect, expose, and visualize application metrics, including step‑by‑step dependency configuration, YAML settings, Docker deployment of Prometheus and Grafana, and adding custom metrics for comprehensive monitoring.

Actuator · Grafana · Micrometer
0 likes · 10 min read
dbaplus Community
Oct 23, 2024 · Backend Development

Mastering Java Thread Pools: Common Pitfalls and Best Practices

This article outlines how to correctly create, monitor, and configure Java ThreadPoolExecutor instances, explains why using the Executors factory can cause OOM, recommends separate named pools per business, provides formulas for sizing CPU‑bound and I/O‑bound workloads, and highlights real‑world pitfalls and dynamic‑configuration solutions.

concurrency · monitoring · spring
0 likes · 16 min read
DeWu Technology
Oct 23, 2024 · Backend Development

Automated Traffic Rule Inspection with Flow Replay Platform

The Flow Replay Platform automates traffic‑rule inspection: it records traffic from all environments and lets engineers define jsonPath‑based interface rules that continuously validate pre‑release and production traffic. It alerts on anomalies immediately, reduces false positives, speeds up release verification, and cuts manual testing effort; the article illustrates this with coupon‑related bugs the platform uncovered.

Automated Testing · Backend · monitoring
0 likes · 9 min read
Efficient Ops
Oct 21, 2024 · Operations

Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability

This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.

Alerting · Cloud Native · Observability
0 likes · 10 min read
JD Tech Talk
Oct 21, 2024 · Operations

Observability and Quality Assurance: Strategies for Test Teams

This article examines how test teams can enhance application observability and quality assurance by distinguishing observability from traditional monitoring, defining goals, outlining a monitoring foundation, and proposing module‑level and system‑level strategies for proactive fault detection, data analysis, and alerting.

Observability · monitoring · quality assurance
0 likes · 12 min read
JD Cloud Developers
Oct 21, 2024 · Operations

How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

Observability · Operations · monitoring
0 likes · 17 min read
Test Development Learning Exchange
Oct 11, 2024 · Fundamentals

Fundamentals of Performance Testing: Concepts, Metrics, Tools, and Best Practices

This article provides a comprehensive overview of performance testing fundamentals, covering core concepts, key metrics, common testing tools, test design, load generation, result analysis, bottleneck identification, optimization techniques, cloud and micro‑service testing, monitoring, reporting, challenges, and cost‑benefit considerations.

Benchmarking · Load Testing · Performance Testing
0 likes · 12 min read
Java Architecture Stack
Oct 11, 2024 · Operations

25 Proven Linux Performance Tuning Tricks to Boost System Speed

Learn 25 practical Linux performance tuning techniques—from adjusting kernel parameters like swappiness and ulimit to optimizing I/O schedulers, network buffers, and enabling HugePages—each with clear commands and step‑by‑step instructions to help you maximize system responsiveness and throughput.

I/O scheduler · Kernel Parameters · Network Tuning
0 likes · 10 min read
DevOps Operations Practice
Oct 10, 2024 · Operations

Seven Key Truths About Operations: Downtime, Automation, Prevention, Technology as a Tool, DevOps, Communication, and Security

Effective operations management acknowledges inevitable downtime, emphasizes automation, prioritizes proactive prevention, treats technology as a means rather than an end, integrates closely with development through DevOps, relies on strong communication, and continuously addresses pervasive security challenges to minimize business impact.

Automation · Operations · Security
0 likes · 5 min read
Efficient Ops
Oct 9, 2024 · Cloud Computing

How One Engineer Runs a Full SaaS on Kubernetes with Minimal Effort

This article details how a solo engineer built and operated a SaaS platform on AWS using Kubernetes, covering infrastructure overview, automatic DNS, TLS, load balancing, CI/CD rollouts, autoscaling, caching, secret management, monitoring, logging, error tracking, and cost‑effective operations.

AWS · Infrastructure as Code · Kubernetes
0 likes · 21 min read
Selected Java Interview Questions
Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Automation · Infrastructure · Operations
0 likes · 8 min read
Efficient Ops
Sep 29, 2024 · Operations

Essential Linux Ops Tools Every Sysadmin Must Master

This guide outlines the ten core tool categories—from Linux basics and networking services to scripting, firewalls, monitoring, clustering, and backup—that a Linux operations engineer should master to become an effective sysadmin.

Networking · Operations · database
0 likes · 6 min read
ITPUB
Sep 29, 2024 · Databases

Quick Oracle SQL Monitoring Script – Copy‑Paste Ready

This article shares a ready‑to‑run Oracle SQL*Plus script that lists active sessions with details such as instance ID, username, execution time, SQL text snippet, current event, and wait seconds, plus an example output for immediate performance troubleshooting.

Oracle · SQL · monitoring
0 likes · 4 min read
IT Architects Alliance
Sep 28, 2024 · Operations

How DevOps Transforms IT: Core Principles, Practices, and Real-World Success

This article explores the DevOps mindset, its core principles such as collaboration, automation, continuous improvement, and customer focus, outlines essential practices like CI/CD, IaC, monitoring, microservices, and provides a step‑by‑step adoption roadmap illustrated with a detailed case study and future trends.

Automation · Cloud Native · DevOps
0 likes · 11 min read
Python Programming Learning Circle
Sep 28, 2024 · Operations

Essential Skills for Becoming a Successful DevOps Engineer

The article outlines the key competencies a DevOps engineer must master—including programming, Linux system knowledge, configuration management, infrastructure-as-code, CI/CD tools, networking and security, monitoring, and cloud services—to guide readers on building a comprehensive skill set for effective DevOps practice.

DevOps · Infrastructure as Code · Operations
0 likes · 5 min read
Alibaba Cloud Native
Sep 27, 2024 · Cloud Native

How SAE’s Cloud‑Native Event Center Tackles Data Explosion and Real‑Time Alerts

The article explains the design and implementation of the Serverless Application Engine (SAE) Event Center, highlighting its cloud‑native architecture, the distinction from traditional monitoring, challenges like data explosion and full GC, and the distributed‑cache solution that enables efficient real‑time event aggregation, notification, and future AI‑driven diagnostics.

Data Explosion · SAE · distributed cache
0 likes · 10 min read
php Courses
Sep 27, 2024 · Backend Development

Developing Real-Time Monitoring Applications with PHP and WebSocket

This article explains how to build real-time monitoring applications using PHP and the WebSocket protocol, covering the fundamentals of WebSocket, setting up a Ratchet server, creating client-side JavaScript connections, and providing complete code examples such as a stock price monitor.

Backend · Real-Time · monitoring
0 likes · 7 min read
DevOps Engineer
Sep 25, 2024 · Operations

Understanding What DevOps Truly Is: Principles Over Tools

The article clarifies that DevOps is not defined by specific tools like Kubernetes or Jenkins, but by the ability to design robust systems that ensure smooth deployments, effortless scaling, reliable operation, early issue detection, and effective team collaboration, emphasizing enduring principles over changing technologies.

Automation · Collaboration · ContinuousDelivery
0 likes · 3 min read
Efficient Ops
Sep 24, 2024 · Operations

Master Linux Performance in 60 Seconds: 10 Essential Commands

When a Linux server shows performance issues, the first minute is critical; this guide walks you through ten standard command‑line checks built on uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, and top, explaining what each metric means and how to interpret the output for quick troubleshooting.

Operations · linux · monitoring
0 likes · 19 min read
dbaplus Community
Sep 23, 2024 · Operations

How Bilibili Scaled Monitoring: From Prometheus to a 2.0 VM‑Flink Architecture

Bilibili rebuilt its monitoring platform to handle explosive metric growth by separating collection, storage, and compute, adopting VictoriaMetrics, zone‑based scheduling, and Flink‑driven pre‑aggregation, which together improved stability, query performance, cloud data quality, and overall observability.

Flink · Observability · Prometheus
0 likes · 31 min read
Ctrip Technology
Sep 23, 2024 · Frontend Development

Intelligent Alert Attribution System for Ctrip Hotel Frontend: Design, Implementation, and Outcomes

This article details the design and deployment of an intelligent alert attribution system for Ctrip Hotel's front‑end, describing the background challenges, the unified data pool, weighted alert rules, three attribution algorithms, achieved improvements in accuracy and troubleshooting speed, and future enhancement plans.

Alert · attribution · data pipeline
0 likes · 18 min read
Open Source Linux
Sep 19, 2024 · Operations

Mastering Linux Performance: From CPU/Memory Profiling to Flame Graphs

This guide explains how to systematically diagnose Linux performance issues using tools such as top, vmstat, perf, and flame graphs, covering CPU, memory, disk I/O, network, and load analysis, and demonstrates a real-world nginx case study with step‑by‑step commands and visualizations.

Nginx · Profiling · flame graphs
0 likes · 21 min read
dbaplus Community
Sep 13, 2024 · Operations

How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle—detection, response, delimitation, localization, mitigation and recovery—using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

Automation · MTTR · SRE
0 likes · 23 min read
Test Development Learning Exchange
Sep 13, 2024 · Fundamentals

Python Standard Library for Linux: File Operations, Process Management, Networking, System Info, Time, Logging, Monitoring, Compression, and Environment Variables

This article provides a comprehensive guide to Python's standard libraries for Linux, covering file and directory manipulation, process control, socket networking, system information retrieval, date and time handling, logging, file monitoring, compression, and environment variable management with clear code examples.

Networking · file-operations · logging
0 likes · 10 min read
Open Source Linux
Sep 13, 2024 · Operations

Essential Bash Scripts for Server Monitoring, Automation, and Security

This article presents a collection of practical Bash scripts that cover file consistency checks, scheduled log management, network traffic monitoring, numeric analysis, FTP downloads, user input handling, Nginx 502 detection, variable assignments, bulk file renaming, text processing, port scanning, word filtering, command menus, SSH automation with Expect, user creation, Apache monitoring, password rotation, iptables rate‑limiting, and IP validation, providing sysadmins with ready‑to‑use solutions for everyday Linux operations.

Bash · Server Automation · Shell scripting
0 likes · 25 min read
Architect
Sep 12, 2024 · Operations

How Bilibili Scaled Its Monitoring: From Prometheus OOMs to VictoriaMetrics & Flink Pre‑Aggregation

The article details Bilibili's evolution of its monitoring platform, describing the stability and performance challenges of a Prometheus‑Thanos stack, the redesign using VictoriaMetrics, collection‑storage separation, unit‑level disaster recovery, query‑tree auto‑replacement, Flink‑based pre‑aggregation, Grafana upgrades, and future roadmap for observability.

Cloud Native · Flink · Metrics
0 likes · 30 min read
FunTester
Sep 11, 2024 · Operations

Pinterest Performance Plan: Real‑User Monitoring, Regression Detection, and Alerting

Pinterest’s performance program details how the team defines custom Pinner Wait Time metrics, uses real‑user monitoring and fine‑grained alerts to detect regressions quickly, and follows structured root‑cause analysis and ownership processes to prevent performance degradation across web surfaces.

Operations · monitoring · real‑user
0 likes · 18 min read
JD Tech
Sep 9, 2024 · Backend Development

JADE Dynamic Thread Pool Integration and Visualization Platform Practice

This article explains how to integrate JD's JADE dynamic thread‑pool component with the Wanxiang visualization platform, covering Maven dependencies, configuration files, Spring bean setup, thread‑pool creation, runtime monitoring, underlying source‑code principles, and common pitfalls for stable backend services.

Configuration · Dynamic Thread Pool · JADE
0 likes · 20 min read
dbaplus Community
Sep 8, 2024 · Operations

10 Essential Ops Practices to Prevent System Failures

This article compiles ten practical operations‑engineer guidelines—ranging from change rollbacks and safe command aliases to backup verification, monitoring, and cautious automated failover—to help maintain high availability and avoid costly production incidents.

Automation · linux · monitoring
0 likes · 18 min read
Software Development Quality
Sep 6, 2024 · R&D Management

How to Boost Release Quality: Proven Practices for R&D Teams

This guide outlines essential strategies to improve release quality in R&D, covering strict testing processes, automated CI/CD pipelines, containerization, real‑time monitoring, alert mechanisms, and feedback loops, while also defining key evaluation metrics and practical steps for effective management of these indicators.

ci/cd · monitoring · release quality
0 likes · 10 min read
Soul Technical Team
Sep 2, 2024 · Databases

Comparative Analysis of VictoriaMetrics and Thanos for Large‑Scale Metric Storage

This article examines the migration from Thanos to VictoriaMetrics for large‑scale metric storage, detailing background challenges, VictoriaMetrics architecture and storage engine, data write and read processes, and a comparative analysis of performance, scalability, and operational costs between the two systems.

Observability · Thanos · Time Series Database
0 likes · 15 min read
OPPO Kernel Craftsman
Aug 30, 2024 · Cloud Native

Middleware Containerization and Cloud‑Native Transformation at OPPO

OPPO transformed its sprawling, manually‑provisioned middleware clusters into a cloud‑native, containerized platform by building custom Kubernetes controllers, IP‑preserving StatefulSets, resource‑isolated containers, automated monitoring and self‑healing workflows, enabling rapid provisioning, efficient utilization, fault‑tolerant scaling and future serverless and service‑mesh integration.

Kubernetes · Operator · cloud-native
0 likes · 20 min read
Top Architect
Aug 29, 2024 · Operations

Setting Up Nginx Log Monitoring with Loki, Promtail, and Grafana

This article walks through a complete, step‑by‑step solution for collecting Nginx access logs, converting them to JSON, shipping them with Promtail to Loki, and visualizing the data in Grafana, including Docker deployment, dashboard import, and world‑map plugin installation.

Grafana · Loki · Operations
0 likes · 10 min read
FunTester
Aug 28, 2024 · Operations

Shadow Testing: Reducing Risk and Ensuring Seamless System Changes

Shadow testing is a parallel deployment strategy that minimizes the risk of system changes, safeguards user experience, validates performance and data integrity, and provides a controlled environment for comprehensive testing, supported by a suite of modern tools and real‑world case studies.

Deployment · Shadow Testing · Software Testing
0 likes · 17 min read
Open Source Linux
Aug 23, 2024 · Operations

10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Backup · Operations · incident response
0 likes · 17 min read
Alibaba Cloud Developer
Aug 19, 2024 · Artificial Intelligence

Ensuring Stable AI Agents: Engineering Practices, RAG, and Monitoring

This article shares engineering insights from Hema’s AI smart customer service deployment, detailing key stability factors for AI agents—including hallucination mitigation, memory integration, RAG enhancement, exception handling, and comprehensive monitoring—to improve reliability and performance in real‑world e‑commerce chatbot scenarios.

AI Agent · LLM · RAG
0 likes · 13 min read
Efficient Ops
Aug 14, 2024 · Operations

Building a Real-Time Log Monitoring System with ELK, Kafka, and Python

This article details how to construct a log‑monitoring platform using the ELK stack, Kafka buffering, and a Python scheduler to collect, process, and alert on error logs, offering practical configuration tips and performance optimizations for production environments.

ELK · Elasticsearch · Kafka
0 likes · 10 min read
Architect
Aug 14, 2024 · Backend Development

How to Build a Scalable Distributed Task Scheduling Platform

This article outlines the essential components and design considerations for creating a distributed task scheduling platform, covering triggers, scheduling strategies, executors, task chains, circuit breakers, exception handling, blocking control, service discovery, monitoring, and a management console.

Backend Architecture · Distributed Scheduling · circuit breaker
0 likes · 9 min read
JD Tech Talk
Aug 13, 2024 · Frontend Development

Monitoring and Inspection Practices for Enterprise Front‑End Applications

This article describes how a large enterprise front‑end team implements real‑time monitoring, scheduled inspections, alert strategies, performance metrics, error handling, custom reporting, and mobile/native monitoring to ensure system stability, improve user experience, and continuously optimize application performance.

Alerting · error-handling · frontend
0 likes · 23 min read
Tencent Cloud Developer
Aug 13, 2024 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Networking, Fault Handling, Monitoring, Service Governance, Testing, and Deployment

This comprehensive guide to backend development explains essential system and architecture design principles, networking strategies, fault and exception handling, monitoring and alerting, service governance, testing methodologies, and deployment practices, offering best‑practice advice and highlighting common pitfalls for building reliable, scalable internet services.

Backend Development · Deployment · Networking
0 likes · 28 min read
ITPUB
Aug 11, 2024 · Operations

Scaling Bilibili’s Metrics Platform with VictoriaMetrics and Flink Pre‑aggregation

This article details how Bilibili redesigned its monitoring system to overcome explosive metric growth by separating collection and storage, adopting VictoriaMetrics, implementing zone‑based scheduling, automating PromQL query replacement, and using Flink for efficient pre‑aggregation, resulting in dramatically lower latency and higher stability.

Flink · Observability · PromQL
0 likes · 31 min read
Bilibili Tech
Aug 9, 2024 · Operations

Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink

The new Monitoring 2.0 architecture separates collection, compute, and storage, adopts VictoriaMetrics for compact time‑series storage with a zone‑based scheduler, and introduces push‑based ingestion. Flink handles real‑time pre‑aggregation and automatic PromQL rewrite. Together these changes deliver ten‑fold query speedups, sub‑300 ms p90 latency, and dramatically higher write and query throughput.

Flink · Metrics · Observability
0 likes · 29 min read
ITPUB
Aug 8, 2024 · Operations

Why Solid Monitoring Must Come Before Observability Projects (And How to Build It)

Before launching costly observability initiatives, ensure your monitoring is comprehensive and efficient, covering business, application, component, resource, network, and endpoint metrics, and that you have the data collection, storage, alerting, and event‑distribution capabilities to turn raw signals into actionable insights.

Alerting · Observability · monitoring
0 likes · 9 min read
JD Retail Technology
Aug 8, 2024 · Frontend Development

Ensuring Frontend System Stability through Monitoring and Automated Inspection

This article explains how modern front‑end teams ensure system stability and high‑quality operation by implementing comprehensive monitoring and automated inspection, covering background, significance, architecture, real‑time and scheduled checks, performance metrics, alert strategies, error handling, custom reporting, and future improvement plans.

Alerting · Automation · DevOps
0 likes · 24 min read
Zhuanzhuan Tech
Aug 7, 2024 · Operations

Building a Dynamic Grafana Dashboard for Push System TraceId Visualization

This article describes how to use Grafana's Flowcharting plugin and Prometheus metrics to create a dynamic, interactive dashboard that visualizes each logical node of a push notification pipeline, enabling rapid trace‑ID based troubleshooting and reducing manual investigation effort.

Grafana · Operations · dynamic-view
0 likes · 11 min read
Open Source Linux
Aug 5, 2024 · Operations

How to Manage Over 10,000 Network Devices with Systematic, Automated Operations

This guide outlines a comprehensive, automated strategy for operating more than ten thousand network devices, covering asset documentation, topology planning, unified monitoring, automation scripts, emergency response, security management, regular maintenance, staff training, and visual management tools.

Automation · Scalability · device management
0 likes · 6 min read
IT Services Circle
Aug 2, 2024 · Operations

Shell Script for Collecting Linux CPU, Memory, and Disk I/O Metrics

This article presents a Bash script that gathers comprehensive Linux system metrics—including CPU core count, utilization percentages, context switches, interrupts, load averages, memory and swap usage, and disk I/O statistics—explaining each command and its purpose for effective server monitoring.

Bash · SystemMetrics · linux
0 likes · 13 min read
Open Source Linux
Aug 1, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, key advantages, and practical examples, while also providing code snippets and visual illustrations to help readers understand and apply them effectively.

Automation · Configuration Management · Infrastructure
0 likes · 8 min read