Tagged articles
2179 articles
FunTester
Jul 30, 2024 · Operations

Mastering True Observability: Models, Practices, and AI‑Driven Automation

This article explains why true observability is essential for modern software, outlines its five core pillars, details a four‑stage maturity model with benefits and drawbacks, and provides practical steps—including data collection, team organization, and AI automation—to advance from basic monitoring to predictive, self‑healing systems.

AI · Maturity Model · automation
0 likes · 13 min read
JD Cloud Developers
Jul 29, 2024 · Backend Development

Designing Robust Backend Services: Path Standards, Security, Monitoring, and Degradation Strategies

This article outlines comprehensive best‑practice guidelines for designing robust, secure, and maintainable backend services—covering API path conventions, request handling, parameter design, error codes, dependency management, monitoring, degradation strategies, legacy service handling, encryption, access control, and tamper‑proof mechanisms, with practical code examples.

Backend · Service Architecture · api-design
0 likes · 18 min read
DaTaobao Tech
Jul 29, 2024 · Operations

Testing Environment Reliability, Routing Isolation, Monitoring, and Efficient Deployment Practices

Alibaba Taotian’s testing platform now lets business owners self‑service reliable environments by binding accounts to isolated routes, monitoring lightweight health metrics with automated self‑healing, accelerating deployments via code caching and JVM tricks, and enabling rapid “time‑travel” scenario testing, while planning tighter observability and production alignment.

Testing Environment · deployment efficiency · monitoring
0 likes · 11 min read
Liangxu Linux
Jul 28, 2024 · Cloud Native

Avoid These 10 Common Kubernetes Mistakes to Boost Reliability

This article outlines the most frequent Kubernetes pitfalls—such as missing resource requests, omitted health checks, using the :latest tag, over‑privileged containers, insufficient monitoring, default namespace misuse, weak security settings, absent PodDisruptionBudgets, lack of pod anti‑affinity, and improper load‑balancing—and provides concrete commands, YAML examples, and best‑practice recommendations to prevent them.

Kubernetes · Resource Management · autoscaling
0 likes · 13 min read
Efficient Ops
Jul 28, 2024 · Operations

Building a Resilient, High‑Performance Website: Domains, CDN, Security & Ops

This guide outlines a comprehensive, step‑by‑step strategy for creating a highly available, secure, and scalable website—from buying and protecting multiple domains, configuring DNS and CDN, setting up image and database servers, to implementing monitoring, redundancy, high‑concurrency testing, and disaster‑recovery plans.

CDN · high availability · monitoring
0 likes · 13 min read
Architecture and Beyond
Jul 28, 2024 · Frontend Development

Comprehensive Guide to Front‑End Stability: Observability, Full‑Chain Monitoring, High‑Availability Architecture, Performance Management, Risk Governance, Process Mechanisms, and Engineering Practices

This extensive article presents a systematic approach to front‑end stability, covering observability systems, full‑chain monitoring, high‑availability design, performance management, risk governance, process mechanisms, and engineering practices to ensure reliable user experiences and business continuity.

frontend · high-availability · monitoring
0 likes · 44 min read
Code Ape Tech Column
Jul 26, 2024 · Operations

Bash Scripts for File Consistency Checks, Log Monitoring, and System Automation

This article presents a comprehensive collection of Bash scripts that perform tasks such as verifying file consistency across servers, scheduled log cleaning, network traffic monitoring, numeric analysis in files, automated FTP downloads, interactive number games, Nginx 502 detection, variable assignments, bulk file renaming, IP address validation, and various system administration operations.

Bash · Shell scripting · System Administration
0 likes · 24 min read
21CTO
Jul 23, 2024 · Information Security

What the Microsoft Blue‑Screen Crisis Teaches About IT Risk Management

The massive Microsoft blue‑screen outage caused by a faulty CrowdStrike update highlights the dangers of single‑system reliance, poor code quality, insufficient QA, and the need for staged rollouts, robust backup, real‑time monitoring, and proactive incident‑response strategies for modern IT organizations.

IT Operations · disaster recovery · incident response
0 likes · 10 min read
Soul Technical Team
Jul 23, 2024 · Big Data

Kafka Stability Challenges and Governance Framework at Soul

This article analyzes the role, application scenarios, stability challenges, and comprehensive governance framework of Apache Kafka at Soul, covering deployment, configuration, monitoring, standard controls, common misuse, and future directions toward cloud‑native solutions.

Kafka · Operations · Streaming
0 likes · 30 min read
ITPUB
Jul 22, 2024 · Operations

How We Upgraded Watcher to Second‑Level Monitoring for Real‑Time Order Alerts

This article details the end‑to‑end redesign of Qunar Travel's Watcher monitoring platform from minute‑level to second‑level precision, covering architectural changes, storage engine migration, client‑side metric collection, server‑side scheduling, dashboard and alarm adaptations, and the resulting operational improvements.

DevOps · Time Series · monitoring
0 likes · 20 min read
JD Tech
Jul 17, 2024 · Information Security

Service Design Tips and Security Practices for Robust API Development

This article presents comprehensive guidelines for designing flexible, secure, and maintainable API services, covering standardized paths, request handling, parameter design, business logic, exception management, dependency classification, monitoring, degradation strategies, handling legacy services, and encryption measures to ensure robust service architecture.

Error Handling · Service Architecture · api-design
0 likes · 19 min read
MaGe Linux Operations
Jul 16, 2024 · Cloud Native

How Prometheus Sends Alerts: Rules, Templates, and Frequency Explained

This article explains how Prometheus generates and sends alerts, covering the definition of alert rules with PromQL, grouping, templating, configuring evaluation intervals, deploying a custom alert receiver in Kubernetes, and analyzing alert payloads and delivery frequency, while also detailing alert silencing and resolution behavior.

Alerting · Alertmanager · Go
0 likes · 26 min read
Alibaba Cloud Observability
Jul 16, 2024 · Cloud Native

How to Seamlessly Migrate Your Self‑Hosted Prometheus + Thanos to Alibaba Cloud Managed Prometheus

This guide explains why many users still run self‑built Prometheus + Thanos, outlines the common deployment scenarios and pain points, and provides detailed step‑by‑step migration procedures—including metric collection, visualization, and alerting—for moving to Alibaba Cloud's fully managed Prometheus service across Kubernetes, ECS, and IDC environments.

Alibaba Cloud · Cloud Native · Prometheus
0 likes · 14 min read
JD Tech
Jul 12, 2024 · Backend Development

Dynamic Thread Pool: Monitoring, Alerting, and Runtime Parameter Adjustment

The article explains the concept of a dynamic thread pool, identifies common pain points such as invisible runtime status, hard‑to‑trace rejections, and slow parameter tuning, and presents a comprehensive solution that includes monitoring, alerting, automatic stack dumping, and live parameter refresh for Java backend services.

Dynamic Configuration · java · monitoring
0 likes · 20 min read
JD Retail Technology
Jul 12, 2024 · Backend Development

Service Design Tips and Best Practices for Robust API Development

This article explores essential service design considerations beyond standard guidelines, covering API path structuring, request handling, parameter design, security measures, monitoring, degradation strategies, and code examples to help build flexible, secure, and maintainable backend services.

Microservices · Service Architecture · api-design
0 likes · 19 min read
Architect
Jul 11, 2024 · Backend Development

Architecture Refactoring of a Consumer Installment System: Background, Goals, Design, Deployment, and Monitoring

This article presents a comprehensive case study of refactoring a consumer installment platform, covering business restructuring, technical debt resolution, design of domain and module layers, code redesign with design patterns, phased deployment, monitoring setup, and the overall benefits achieved.

Design Patterns · Microservices · architecture
0 likes · 11 min read
Software Development Quality
Jul 11, 2024 · Information Security

How to Implement Secure and Compliant Log Management Standards

This guide outlines the purpose, scope, principles, and detailed specifications for log management—including file naming, retention periods, content rules, security handling, and monitoring—to ensure reliable issue tracing, data safety, and regulatory compliance across all system development projects.

Log Management · Operations · compliance
0 likes · 12 min read
Alibaba Cloud Native
Jul 10, 2024 · Cloud Native

Migrate Self‑Hosted Prometheus + Thanos to Alibaba Cloud Managed Service

This guide explains how to move from a self‑built open‑source Prometheus + Thanos monitoring stack to Alibaba Cloud's fully managed Prometheus service, covering typical deployment scenarios, migration requirements, step‑by‑step procedures for metric collection, visualization, and alerting, and key considerations for each environment.

Alibaba Cloud · Prometheus · Thanos
0 likes · 15 min read
Cloud Native Technology Community
Jul 9, 2024 · Cloud Native

Answering the Top 9 Questions About Monitoring in Kubernetes

This article discusses essential Kubernetes monitoring topics, including cost tracking, tool selection, observability frameworks, responsibility allocation, baseline establishment, namespace best practices, the importance of monitoring, backup solutions, and a comparison of Datadog versus Splunk for metrics.

Datadog · Kubernetes · Prometheus
0 likes · 6 min read
Liangxu Linux
Jul 8, 2024 · Operations

7 Practical Linux Performance Optimization Tips Every Engineer Should Know

This article compiles seven hands‑on Linux performance‑optimization practices, covering key factors such as CPU, memory, disk I/O, network, swap usage, and TCP tuning, and provides concrete commands and step‑by‑step troubleshooting methods for system administrators and DevOps engineers.

Linux · Swap · TCP
0 likes · 19 min read
JD Tech
Jul 8, 2024 · Operations

System Stability Practices: From Development to Production

This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.

incident response · monitoring · system stability
0 likes · 26 min read
Efficient Ops
Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

business continuity · disaster recovery · fault management
0 likes · 7 min read
JD Retail Technology
Jul 5, 2024 · Backend Development

Dynamic Thread Pool: Monitoring, Alerting, and Runtime Parameter Adjustment

The article explains the concept of dynamic thread pools, analyzes common pain points such as invisible runtime status, hard‑to‑locate rejections, and slow parameter tuning, and presents a comprehensive solution that includes monitoring, alerting, automatic stack tracing, and on‑the‑fly parameter refresh using Java code.

Dynamic Configuration · java · monitoring
0 likes · 19 min read
转转QA
Jul 5, 2024 · Backend Development

Design and Implementation of a Configuration Checking Tool for an After‑Sales System

The article describes how a configuration‑checking tool was designed and built to automatically compare baseline business configuration data with the after‑sales system's settings, detect mismatches before use, and alert responsible testers, thereby reducing manual verification effort and preventing workflow disruptions.

Backend · Configuration Management · System Design
0 likes · 5 min read
DevOps Operations Practice
Jul 4, 2024 · Operations

Building an Enterprise‑Level Monitoring System: Requirements, Technology Selection, Architecture, Implementation Steps, and Maintenance

This article provides a comprehensive guide to designing and deploying an enterprise‑grade monitoring system, covering requirement analysis, tool selection such as Prometheus and Zabbix, system architecture, step‑by‑step implementation, alerting, visualization, and ongoing maintenance to ensure reliable IT operations.

Alerting · Grafana · Operations
0 likes · 7 min read
macrozheng
Jul 3, 2024 · Operations

How to Visualize SpringBoot Metrics with Grafana and Prometheus Using Docker

This guide walks through installing Grafana and Prometheus with Docker, configuring node_exporter to collect system metrics, adding SpringBoot Actuator and Micrometer for application metrics, setting up Prometheus scrape jobs, and importing ready‑made Grafana dashboards to achieve real‑time monitoring and alerting.

Alerting · Docker · Grafana
0 likes · 10 min read
360 Smart Cloud
Jul 3, 2024 · Operations

Practical Practices for Enhancing Kafka Cluster Stability at 360

This article details 360's comprehensive approach to improving Apache Kafka cluster stability through proactive operations, capacity assessment, parameter tuning, monitoring, version upgrades, and traffic control, offering concrete guidelines and best‑practice recommendations for large‑scale message‑queue deployments.

Cluster · Kafka · Tuning
0 likes · 33 min read
Alibaba Cloud Native
Jul 2, 2024 · Cloud Native

How Go Agent Enables Zero‑Intrusion Monitoring for Golang Microservices on Kubernetes

This guide explains how the Go Agent injects observability code at compile time to provide automatic tracing and metrics for Golang microservices running on Kubernetes, covering its architecture, supported SDKs, compatibility, and step‑by‑step deployment instructions including component installation, binary compilation, and YAML configuration.

ARMS · Go · Instrumentation
0 likes · 17 min read
High Availability Architecture
Jun 28, 2024 · Backend Development

Deep Dive into pfinder: Architecture, Bytecode Enhancement, and Tracing Mechanisms

This article provides a comprehensive technical overview of pfinder, JD's next‑generation APM system, covering its core concepts, feature set, comparison with other tracing tools, bytecode modification techniques using ASM, Javassist, ByteBuddy and ByteKit, Java agent injection via JVMTI and Instrumentation, plugin loading, trace‑ID propagation across threads, and a prototype hot‑deployment capability.

APM · BytecodeInstrumentation · PerformanceTracing
0 likes · 23 min read
Open Source Linux
Jun 27, 2024 · Operations

Comprehensive Guide to Building a Resilient, High‑Performance Web Infrastructure

This guide outlines essential steps for creating a robust, high‑availability website architecture, covering domain acquisition, DNS management, CDN deployment, image caching, data center selection, monitoring, DDoS mitigation, redundancy, server configuration, database replication, testing environments, security practices, and operational tooling.

Cloud Services · DDoS protection · Operations
0 likes · 12 min read
Efficient Ops
Jun 25, 2024 · Operations

Mastering the Four Golden Signals: A Practical Guide to System Monitoring

This guide explains how to use the four golden signals—latency, traffic, errors, and saturation—to design effective monitoring across servers, services, and external dependencies, helping teams detect issues early and maintain reliable, high‑performance systems.

SRE · monitoring · system reliability
0 likes · 20 min read
dbaplus Community
Jun 24, 2024 · Operations

How Qunar Achieved Sub‑Second Monitoring to Slash Fault Detection Time to Under 1 Minute

Qunar’s Watcher monitoring platform was upgraded from minute‑level to second‑level precision, redesigning storage, data collection, and alerting pipelines, adopting VictoriaMetrics, enhancing client SDKs, and adding fine‑grained alarm rules, which reduced fault detection from four minutes to under one minute while improving reliability and scalability.

DevOps · Time Series Database · monitoring
0 likes · 20 min read
Efficient Ops
Jun 23, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, while also providing a practical Shell script and an Ansible playbook to illustrate automation in daily workflows.

Infrastructure · devops tools · monitoring
0 likes · 8 min read
ITPUB
Jun 22, 2024 · Cloud Native

How to Detect and Prevent OOM and CPU Throttling in Kubernetes

This article explains why memory OOM and CPU throttling are critical issues in Kubernetes, shows how limits and requests work, demonstrates monitoring techniques with Prometheus and cAdvisor, and provides practical best‑practice recommendations to avoid pod eviction and performance degradation.

CPU throttling · Kubernetes · monitoring
0 likes · 9 min read
Qunar Tech Salon
Jun 14, 2024 · Operations

Design and Implementation of a Second-Level Monitoring System for Qunar Travel

This article details the background, overall architecture, challenges, and step‑by‑step redesign of Qunar Travel's Watcher monitoring platform to achieve second‑level (per‑second) data collection, storage, and alerting, including storage engine selection, client and server optimizations, deployment strategies, and operational outcomes.

DevOps · monitoring · time-series database
0 likes · 17 min read
Practical DevOps Architecture
Jun 13, 2024 · Operations

Comprehensive Data Center Operations Training Course Overview

This extensive training program covers everything a data center operations engineer needs—from foundational infrastructure management and server hardware maintenance to advanced network configuration, security hardening, monitoring, fault handling, and practical hands‑on skills for real‑world challenges.

Infrastructure · Operations · data center
0 likes · 6 min read
Qunar Tech Salon
Jun 12, 2024 · Artificial Intelligence

Design and Implementation of Qunar Flight Ticket Intelligent Alert (Radar) System

This article presents a comprehensive analysis and engineering of Qunar's flight‑ticket intelligent pre‑warning (Radar) system, covering the business need, value analysis, architectural redesign, feature extraction, indicator classification, accuracy quantification, multi‑algorithm anomaly detection, automatic parameter tuning, observed effects, and future plans to incorporate large‑model techniques.

Operations · anomaly detection · flight ticket
0 likes · 17 min read
Open Source Tech Hub
Jun 10, 2024 · Operations

How to Set Up Zipkin Distributed Tracing in PHP Webman Projects

This guide explains Zipkin's architecture, data collection methods, and step‑by‑step installation and configuration for PHP applications, including creating tracers, recording spans, and integrating a middleware for full‑stack monitoring in Webman microservice environments.

Distributed Tracing · Microservices · PHP
0 likes · 8 min read
DevOps Cloud Academy
Jun 4, 2024 · Operations

Comprehensive DevOps Guide: Collaboration, Automation, CI/CD, IaC, Monitoring, and Logging with Practical Code Examples

This comprehensive DevOps guide explains core concepts such as collaboration, automation, CI/CD pipelines, infrastructure as code, and monitoring/logging, and includes practical code examples for Git, shell scripts, Jenkins, GitHub Actions, AWS CodePipeline, Ansible, Docker Compose, Prometheus, Grafana, Fluentd, and Elasticsearch.

DevOps · Infrastructure as Code · automation
0 likes · 17 min read
Efficient Ops
Jun 2, 2024 · Operations

Why Observability Is the Key to Reliable Distributed Systems

Observability, defined as measuring system state through logs, metrics, and tracing, enhances stability of distributed architectures by enabling rapid fault detection, deeper insight, and proactive issue resolution, distinguishing it from traditional monitoring and supporting DevOps, SRE, and business objectives.

Distributed Systems · monitoring
0 likes · 17 min read
DevOps Cloud Academy
May 31, 2024 · Cloud Native

Optimizing RabbitMQ Performance on Kubernetes

This guide explains how to deploy RabbitMQ on Kubernetes and improve its performance through Helm installation, resource tuning, monitoring, scaling, security hardening, and advanced configuration techniques, providing practical code examples for each step.

Kubernetes · RabbitMQ · helm
0 likes · 9 min read
Efficient Ops
May 29, 2024 · Operations

Essential Operations Metrics Every IT Team Should Track

In today’s competitive business landscape, tracking key operations metrics—such as availability, failure rate, MTTR, MTBF, response time, throughput, error rate, and various utilization and data integrity measures—helps organizations monitor performance, reduce costs, ensure reliability, and maintain regulatory compliance.

Availability · IT performance · monitoring
0 likes · 7 min read
dbaplus Community
May 28, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This guide reviews ten indispensable tools for operations engineers, detailing each tool's functions, ideal scenarios, advantages, and real‑world examples, and includes practical code snippets for automation, monitoring, container management, and log analysis.

DevOps · Infrastructure · automation
0 likes · 8 min read
Efficient Ops
May 28, 2024 · Operations

How to Build a Resilient High‑Traffic Website: Domains, CDN, Monitoring, and Security

This guide outlines practical steps for creating a highly available, secure, and scalable website—including domain strategy, CDN deployment, image caching, data‑center selection, monitoring, attack mitigation, redundancy, server configuration, database replication, testing environments, disaster‑recovery planning, and high‑concurrency testing.

high availability · monitoring · website infrastructure
0 likes · 12 min read
DataFunTalk
May 28, 2024 · Big Data

Building and Managing a Metric System in Data Warehouse: Practices from Dongchedi

This article details how the Dongchedi business team designs, implements, and monitors a comprehensive metric system within its data warehouse, covering metric standards, model construction, metadata management, quality monitoring, application scenarios, and future directions using the DataLeap platform.

Big Data · Data Governance · data modeling
0 likes · 18 min read
iQIYI Technical Product Team
May 24, 2024 · Operations

High Availability and Disaster Recovery Practices of iQIYI's Video Relay Service (VRS)

iQIYI’s Video Relay Service ensures uninterrupted video playback by employing a two‑region, three‑center hybrid cloud architecture, multi‑layer storage, cross‑AZ retry mechanisms, protective rate‑limiting and degradation paths, layered monitoring, and rigorous stress‑testing and chaos engineering to achieve high availability and disaster recovery.

Backend Architecture · Cloud Native · Video Streaming
0 likes · 18 min read
Practical DevOps Architecture
May 22, 2024 · Operations

SRE & Linux Operations Course Outline

This article presents a detailed curriculum covering fundamental infrastructure, cluster architecture, automation, log collection, Linux system administration, containerization, monitoring, security, and related DevOps tools across multiple phases and daily modules for comprehensive SRE training.

SRE · automation · cloud
0 likes · 8 min read
Qunar Tech Salon
May 20, 2024 · Big Data

Optimizing Kafka Production at Qunar Travel: Reducing CPU Usage by 2000 Cores

This article presents a comprehensive case study of how Qunar Travel identified and resolved Kafka production bottlenecks—through metric monitoring, thread and flush parameter tuning, and Filebeat batch adjustments—resulting in a 2000‑core CPU reduction, higher network idle rates, and lower resource consumption across three clusters.

Kafka · monitoring
0 likes · 12 min read
Qunar Tech Salon
May 13, 2024 · Operations

Root Cause Analysis of Intermittent Timeout Issues in the Sirius Service Caused by RAID Card Consistency Checks

This article details the investigation of sporadic interface timeouts in the Sirius real‑time pricing service, revealing a weekly pattern linked to RAID controller consistency checks that cause IO spikes, logback queue blockage, and ultimately Dubbo client‑side timeouts, and proposes mitigation steps and general performance‑troubleshooting guidelines.

Operations · RAID · logback
0 likes · 22 min read
Liangxu Linux
May 12, 2024 · Operations

7 Practical Linux Performance Optimization Tips Every Engineer Should Know

This guide explains the key factors that affect Linux system performance, provides step‑by‑step troubleshooting methods for CPU, memory, disk I/O and network issues, shows how to identify top resource‑hungry processes, clarifies memory reporting differences, discusses swap usage scenarios, and offers concrete TCP tuning parameters for production environments.

Linux · TCP · monitoring
0 likes · 20 min read
Java Tech Enthusiast
May 5, 2024 · Information Security

Preventing Malicious API Abuse: Security Measures and Best Practices

To prevent malicious API abuse, implement layered defenses such as firewalls to block unwanted traffic, robust captchas and SMS verification, mandatory authentication with permission controls, IP whitelisting for critical endpoints, HTTPS encryption, strict rate‑limiting via Redis, continuous monitoring with alerts, and an API gateway that centralizes filtering, authentication and throttling.

API Security · Captcha · IP whitelist
0 likes · 9 min read
DevOps Operations Practice
May 2, 2024 · Operations

Quick Deployment of a Zabbix Monitoring Platform Using Docker

This article explains how to set up a Zabbix monitoring system by installing Docker, pulling necessary images, creating storage volumes, and running containers for MySQL, Zabbix server, Java gateway, web interface, and agents, providing a fast, container‑based deployment solution.

Container Deployment · Linux · Zabbix
0 likes · 8 min read
Architect
Apr 27, 2024 · Information Security

How to Stop Malicious API Calls: 8 Practical Defense Strategies

This article walks through eight concrete techniques—firewall rules, captchas, authentication checks, IP whitelists, HTTPS encryption, rate limiting, monitoring, and an API gateway—to prevent abusive requests from draining resources or compromising critical services.

API Security · Authentication · Captcha
0 likes · 11 min read
ITPUB
Apr 22, 2024 · Backend Development

How Meta Achieves Near‑Perfect Cache Consistency: Lessons from Polaris

This article explains Meta's approach to cache invalidation and consistency, detailing why ultra‑high consistency matters, how their Polaris monitoring system detects and resolves inconsistencies, and provides a simplified Python example that illustrates the underlying mechanisms and challenges.

Backend · Consistency · Distributed Systems
0 likes · 12 min read
21CTO
Apr 22, 2024 · Operations

Discover Guider: A Python‑Powered Linux Observability Suite with 150+ Commands

Guider, a Python‑based Linux observability suite created by Hyundai engineer Peace Lee, offers over 150 command‑line tools for real‑time performance monitoring, resource tracing, automated reporting, and visualizations, enabling developers to diagnose slow startups, crashes, GPU stalls, and system resets with microsecond precision.

CLI · Linux · Python
0 likes · 7 min read
Liangxu Linux
Apr 15, 2024 · Operations

12 Essential Linux Commands to Monitor Memory Usage

This guide presents twelve practical Linux techniques—from basic commands like free and top to advanced tools such as Grafana with Prometheus—enabling administrators to comprehensively track memory consumption, identify bottlenecks, and maintain system stability and performance.

Memory · System Administration · commands
0 likes · 8 min read
dbaplus Community
Apr 14, 2024 · Backend Development

How Meta Reached 99.99999999% Cache Consistency and What You Can Learn

This article explains Meta's approach to cache invalidation and consistency, why ultra‑high consistency matters for user experience, the monitoring infrastructure they built, the Polaris system that detects and repairs inconsistencies, and provides a concrete Python‑style code example illustrating the problem and solution.

Backend · Cache · Consistency
0 likes · 13 min read
Efficient Ops
Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article explains the definitions of stability and high availability, distinguishes their relationship, outlines key performance indicators, and provides a comprehensive framework—including fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices—to help teams build reliable, highly available systems.

SRE · capacity planning · high availability
0 likes · 10 min read
JavaEdge
Apr 13, 2024 · Backend Development

Mastering System Performance: Metrics, Strategies, and Real‑World Implementation

This article explains why performance optimization is essential for growing systems, introduces key metrics such as response time and concurrency, outlines systematic thinking and concrete techniques—including caching, parallelism, and async processing—and demonstrates a live‑streaming case study with actionable solutions.

Scalability · caching · concurrency
0 likes · 15 min read
Ops Development Stories
Apr 12, 2024 · Cloud Native

Mastering etcd: Architecture, Monitoring & Performance Tuning

This article provides a comprehensive overview of etcd—including its origins, role in Kubernetes, version evolution, layered architecture, key terminology, operational commands, monitoring metrics, benchmarking procedures, disk‑performance testing, and tuning recommendations—for building reliable cloud‑native clusters.

benchmark · cloud-native · distributed storage
0 likes · 17 min read
Mike Chen's Internet Architecture
Apr 11, 2024 · Databases

Mastering Redis Sentinel: Ensuring Automatic High Availability

This article explains Redis Sentinel’s role in providing monitoring, notifications, automatic failover, and configuration updates to achieve high availability, detailing its heartbeat mechanism, master‑down detection, leader election, failover selection criteria, and the trade‑offs of using this solution.

database · failover · high availability
0 likes · 6 min read
Architecture & Thinking
Apr 10, 2024 · Operations

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides automatic monitoring, fault detection, and failover for Redis master‑slave clusters, enabling high availability by electing a new master when the original fails, using sdown/odown states, quorum voting, and pub/sub communication to keep services running with minimal downtime.

failover · high availability · monitoring
0 likes · 11 min read
Architect
Apr 8, 2024 · Backend Development

Mastering Batch Processing: Boost API Performance and Cut Overhead

This guide explains why batch processing is essential for API tuning and provides step‑by‑step techniques—including bulk database operations, request merging, pagination, parallel execution, caching, and monitoring—backed by concrete Java code samples and SQL queries to help engineers dramatically improve throughput and latency.

API optimization · Batch Processing · Database Performance
0 likes · 33 min read
Alibaba Cloud Native
Apr 8, 2024 · Cloud Native

How to Build a Global View for Multiple Prometheus Instances – Community and Alibaba Cloud Solutions

This article explains why a global view is needed when Prometheus metrics are scattered across many instances, compares community approaches such as Federation, Thanos, and Remote Write, and details Alibaba Cloud's Global Aggregation Instance and Remote Write solutions with configuration examples and a real‑world case study.

Federation · Global View · Prometheus
0 likes · 25 min read
DevOps Operations Practice
Apr 6, 2024 · Operations

Overview of Common DevOps Tools Used in Large Internet Companies

This article introduces the key DevOps tools—including CI/CD platforms, configuration‑management solutions, containerization technologies, monitoring and logging stacks, and infrastructure‑as‑code utilities—explaining their roles, features, and how they help streamline software delivery in modern enterprises.

Configuration Management · DevOps · Infrastructure as Code
0 likes · 9 min read
Wukong Talks Architecture
Apr 4, 2024 · Operations

Cloud Stability Governance: Frontend and Backend Strategies, Deployment, and Monitoring Practices

This article presents a comprehensive view of cloud stability governance from both front‑end and back‑end perspectives, detailing system architecture, micro‑frontend integration, CI/CD deployment pipelines, SLB forwarding and health‑check configurations, monitoring dashboards, UI automation testing, and the resulting operational improvements.

SLB · ci/cd · cloud
0 likes · 13 min read
NetEase Cloud Music Tech Team
Apr 1, 2024 · Industry Insights

Why Shifting Testing Left Boosts Quality: Lessons from Cloud Music

The article analyzes the concept of test left‑shift, outlining its theoretical benefits and drawbacks, sharing practical pain points from NetEase Cloud Music, and presenting a comprehensive pre‑, during‑, and post‑shift automation and monitoring strategy to improve software quality and delivery speed.

DevOps · Software Testing · automation
0 likes · 13 min read
Alibaba Cloud Developer
Apr 1, 2024 · Operations

How We Achieved End-to-End Cloud Stability with Micro Frontends and Automated Deployments

This article details a comprehensive, front‑and‑back‑end approach to cloud stability, covering system architecture across private and public clouds, micro‑frontend integration, CI/CD pipelines, SLB routing, health‑check configurations, monitoring dashboards, data reconciliation, UI automation testing, and the resulting improvements in observability, gray‑release, rollback, and incident reduction.

Micro Frontends · SLB · automation
0 likes · 14 min read
Efficient Ops
Mar 31, 2024 · Operations

Why Most Alerts Fail and How to Design Actionable Monitoring

Most system alerts are poorly designed, flooding engineers with noise; this article explains the essence of alerts, distinguishes business rule vs reliability monitoring, outlines effective metrics and strategies, and presents simple anomaly-detection algorithms to create actionable, high-quality alerts.

alert design · anomaly detection · monitoring
0 likes · 21 min read
Architecture Digest
Mar 28, 2024 · Operations

A Comprehensive Overview of Monitoring Systems: Fundamentals, Popular Open‑Source Solutions, and Selection Guidance

This article systematically introduces monitoring fundamentals, core concepts, and architecture, then reviews three widely used open‑source monitoring tools—Zabbix, Open‑Falcon, and Prometheus—detailing their components, advantages, disadvantages, and provides practical advice for selecting the most suitable solution.

Open-Falcon · Operations · Zabbix
0 likes · 17 min read
Efficient Ops
Mar 27, 2024 · Operations

Master System Monitoring with the USE Method and Prometheus

This article explains how to design a comprehensive monitoring system using the concise USE (Utilization, Saturation, Errors) method, outlines essential system and application metrics, and demonstrates practical implementation with Prometheus, Grafana, and related open‑source tools.

Prometheus · USE method · monitoring
0 likes · 14 min read
ITPUB
Mar 27, 2024 · Backend Development

How Instagram Scaled to 14 Million Users with Just Three Engineers

This article details how Instagram grew from zero to 14 million users in just over a year using three engineers by applying three core principles and a reliable AWS‑based tech stack covering frontend, load balancing, backend, PostgreSQL sharding, S3 storage, Redis caching, asynchronous task queues, and comprehensive monitoring.

AWS · Backend · instagram
0 likes · 9 min read
DevOps Operations Practice
Mar 25, 2024 · Operations

How to Monitor MySQL with Prometheus and Grafana

This tutorial explains how to install the MySQL Exporter, configure Prometheus to scrape MySQL metrics, set up Grafana dashboards for visualization, and define alerting rules for common MySQL performance indicators, providing a complete end‑to‑end monitoring solution.

Alerting · Exporter · Grafana
0 likes · 5 min read
Selected Java Interview Questions
Mar 25, 2024 · Databases

Redis Best Practices: Memory Management, Performance Tuning, Reliability, Operations, and Security

This comprehensive guide outlines practical Redis best practices covering memory optimization, key design, data type selection, performance enhancements, high‑availability deployment, operational safeguards, security hardening, and monitoring to help engineers build stable, efficient caching solutions.

Reliability · best practices · caching
0 likes · 15 min read
DataFunSummit
Mar 22, 2024 · Artificial Intelligence

Risk Control Model Construction for Online Small Loans: Pre‑loan, In‑loan, Post‑loan and Monitoring

This article presents a comprehensive overview of risk control model building for online small‑loan scenarios, covering pre‑loan, in‑loan and post‑loan stages, the associated data pipelines, model deployment strategies, optimization attempts, and monitoring frameworks to ensure accuracy, stability and effectiveness.

Credit Scoring · Risk Modeling · data pipeline
0 likes · 16 min read
dbaplus Community
Mar 18, 2024 · Operations

How to Build a Resilient, High‑Traffic Web Infrastructure: A Step‑by‑Step Ops Guide

This guide outlines a complete, practical workflow for acquiring multiple domains, configuring DNS, deploying CDN and image caches, selecting data‑center locations, setting up redundant servers, implementing monitoring, handling DDoS attacks, planning capacity, securing systems, and organizing an operations team to ensure high availability for large‑scale web services.

CDN · Server Configuration · Web infrastructure
0 likes · 12 min read
Efficient Ops
Mar 18, 2024 · Operations

How to Implement Fault Self‑Healing for Scalable Operations

This article explains why low‑disk alerts demand automation, outlines the concept of fault self‑healing versus manual response, and provides practical guidelines—including standards, monitoring dimensions, CMDB integration, script execution tools, and notification channels—to build a reliable self‑healing system for large‑scale environments.

CMDB · fault self-healing · monitoring
0 likes · 10 min read
Efficient Ops
Mar 17, 2024 · Operations

How to Build a Scalable Prometheus Monitoring System for Big Data on Kubernetes

This article explains how to design and implement a comprehensive Prometheus‑based monitoring and alerting solution for big‑data components running on Kubernetes, covering metric exposure methods, scrape configurations, exporter deployment, alert rule design, and practical examples with code snippets.

Alerting · monitoring
0 likes · 18 min read
Architect
Mar 16, 2024 · Operations

How Unified Alert Convergence Can Slash Monitoring Noise and Boost MTTA/MTTR

This article analyzes the shortcomings of fragmented monitoring systems, defines key metrics such as MTTA and MTTR, proposes a unified alert convergence architecture using Redis delayed queues, and details design, implementation, and future AI‑enhanced improvements to reduce alert fatigue and accelerate incident response.

MTTA · MTTR · Operations
0 likes · 22 min read