Tagged articles
3281 articles
Test Development Learning Exchange
Feb 12, 2025 · Operations

20 Python Automation Scripts for Common Tasks

This article presents twenty practical Python scripts that automate everyday tasks such as batch file renaming, email sending, scheduling reminders, generating reports, backing up files, updating Excel sheets, downloading web pages, filling forms, extracting PDF text, converting file formats, visualizing data, and more, providing ready-to-use code examples for each.

Operations, Python, Scripting
0 likes · 10 min read
Practical DevOps Architecture
Feb 11, 2025 · Operations

Kubernetes Operations and Cloud Native Architecture Training Course

This comprehensive training program for intermediate to advanced users covers Kubernetes high‑availability deployment, elastic scaling, Helm package management, Ceph distributed storage integration, microservice container migration, Jenkins‑based CI/CD pipelines, and Istio service‑mesh governance, providing hands‑on labs, detailed chapters, and practical resources for mastering modern cloud‑native operations.

Ceph, Cloud Native, DevOps
0 likes · 7 min read
Chen Tian Universe
Feb 10, 2025 · Operations

How Payment Clearing and Settlement Systems Really Work: A Deep Dive

This article provides a comprehensive overview of payment clearing and settlement, detailing the architecture of clearing subsystems, object relationship models, billing rule engines, settlement processes, various settlement modes, and the full accounting flow that ensures accurate fund distribution across platforms and merchants.

Operations, accounting, clearing
0 likes · 31 min read
Data Thinking Notes
Feb 6, 2025 · Operations

How Data Metric Systems Drive Smarter Business Decisions

In today's digital era, enterprises must transform raw data into actionable insights, and a well‑designed data metric system—by defining dimensions, aggregation methods, and measurement units—provides the quantitative backbone that guides strategic, operational, and competitive decision‑making.

Operations, decision making, performance indicators
0 likes · 16 min read
Architecture and Beyond
Feb 6, 2025 · Operations

Analyzing DeepSeek’s Availability Issues and Applying Traditional Internet Reliability Strategies to AIGC

This article examines DeepSeek’s frequent service interruptions, contrasts the inherent reliability challenges of AIGC products with traditional internet applications, and proposes adopting proven isolation, rate‑limiting, and elastic‑scaling techniques to improve AI service availability and user experience.

AIGC, Availability, DeepSeek
0 likes · 12 min read
JD Cloud Developers
Feb 6, 2025 · Operations

How to Build a Robust Stability Framework: Key Mechanisms for SRE Success

This article outlines a comprehensive stability framework for SRE teams, detailing essential mechanisms such as review processes, coding standards, incident management, on‑call responsibilities, and daily operational practices, while also highlighting the cultural shift needed to achieve reliable, high‑availability systems.

Operations, SRE, incident management
0 likes · 11 min read
Chen Tian Universe
Feb 1, 2025 · Operations

Designing a Robust Refund Center: Architecture, Processes, and Product Strategies

This article explains the concept of reverse transactions, outlines the factors influencing refund operations, and details the design of a dedicated refund center, including its product architecture, processing flow, document structure, channel configuration, and the special "refund‑to‑payment" mechanism for out‑of‑time refunds.

Operations, Refund, payment
0 likes · 19 min read
MaGe Linux Operations
Jan 29, 2025 · Operations

Deploy ELK Stack: Complete Guide to Elasticsearch, Logstash & Kibana Setup

This guide walks through the ELK log analysis system—explaining its components, core concepts, log processing workflow, and step‑by‑step deployment of Elasticsearch, Logstash, Kibana, and supporting plugins on a multi‑node environment, including configuration, startup commands, and troubleshooting tips.

ELK, Elasticsearch, Kibana
0 likes · 15 min read
MaGe Linux Operations
Jan 27, 2025 · Operations

Redis Sentinel Deep Dive: High‑Availability Architecture & Automatic Failover

This article explains Redis Sentinel’s role as the official high‑availability solution, detailing its monitoring, notification, automatic failover mechanisms, discovery processes, connection types, down‑state classifications, failover steps, leader election, master selection rules, and data consistency guarantees.

Operations, failover, high availability
0 likes · 18 min read
Architect
Jan 23, 2025 · Operations

Designing High‑Availability Systems: Architecture, Capacity Planning, and Fault‑Tolerance Guide

This article presents a comprehensive guide to building high‑availability systems, covering availability metrics, fault prevention, detection and recovery, capacity evaluation, layered architecture design, service tiering, resilience mechanisms, and operational best practices for reliable service delivery.

Operations, System Architecture, capacity planning
0 likes · 34 min read
Raymond Ops
Jan 23, 2025 · Operations

Master Log Management: Automate Cleanup with crontab & logrotate

This guide explains log management goals, special scenarios that cause uncontrolled log growth, and practical solutions using Linux's crontab for scheduled cleanup and the logrotate tool for automated rotation and retention across common services like MySQL, nginx, and Kafka.

Linux, Operations, crontab
0 likes · 10 min read
Ops Development Stories
Jan 23, 2025 · Operations

How SREs Can Boost Their Influence Within Teams

This article explores why influence matters for Site Reliability Engineers, outlines the challenges they face in gaining recognition, and provides practical strategies—enhancing technical expertise, improving communication, quantifying achievements, and sharing knowledge—to elevate their impact within organizations.

Operations, SRE, communication
0 likes · 19 min read
Raymond Ops
Jan 22, 2025 · Operations

Master Linux Boot Startup: Systemd, chkconfig, and Crontab Strategies

This guide explains how to configure Linux services to start on boot using systemd (systemctl), the legacy chkconfig tool, general startup scripts, and crontab’s special @reboot keyword, and provides production‑grade recommendations and example scripts for reliable automation.

Linux, Operations, Sysadmin
0 likes · 10 min read
JD Tech Talk
Jan 21, 2025 · Operations

Business Monitoring Solutions and Log Practices for KA Merchants

This article details the background, design, implementation, and best‑practice guidelines for business‑level monitoring, unified logging formats, log4j configurations, alert rules, and case studies of common issues faced by KA merchants in logistics operations.

Alerting, Operations, business monitoring
0 likes · 13 min read
JD Cloud Developers
Jan 21, 2025 · Operations

Building Effective Business Monitoring and Alerting for Logistics Platforms

This article explains how system‑level metric anomalies relate to business‑level metrics, describes the three internal business‑monitoring platforms (UMP, PFinder, Taishan), details unified log formats and Log4j configurations, and shares best‑practice case studies for alert rules, data visualization, and incident handling to improve operational reliability.

Alerting, Data visualization, Operations
0 likes · 14 min read
Efficient Ops
Jan 20, 2025 · Operations

Inside Qunar’s Pre‑Release Platform: Design, Practice, and Future Outlook

The article recaps Li Jingkang’s presentation at the 2024 GOPS Global Operations Conference, detailing the background, principles, design, and real‑world implementation of Qunar’s pre‑release platform, and outlines its future direction within DevOps, SRE, AIOps, and cloud‑native practices.

Cloud Native, DevOps, Operations
0 likes · 3 min read
Raymond Ops
Jan 18, 2025 · Operations

Master Ansible Playbooks: From Basics to Advanced YAML Techniques

This article explains the limitations of ad‑hoc Ansible commands, introduces the concepts of playbooks, plays, and tasks, demonstrates YAML syntax with examples, shows how to write and run playbooks, and details host selection patterns and execution strategies for efficient automation.

Operations
0 likes · 17 min read
Open Source Linux
Jan 18, 2025 · Operations

What Caused Alipay’s 5‑Minute P0 Outage and How Much Was Lost?

The article dissects Alipay’s rare P0 incident on January 16, 2025, explaining how a misconfigured marketing template triggered a 20% discount for all transactions, detailing the rapid five‑minute fix, estimating the financial loss at roughly 14 million yuan, and outlining operational lessons and accountability.

Operations, deployment risk, financial loss
0 likes · 11 min read
Linux Cloud Computing Practice
Jan 17, 2025 · Operations

10 Essential Linux Sysadmin Tools Every Engineer Should Master

This guide outlines the ten fundamental Linux operations tools and skills—ranging from basic system knowledge and networking services to shell scripting, text processing, databases, firewalls, monitoring, clustering, and backup—that every aspiring sysadmin should learn and practice thoroughly.

Networking, Operations, Sysadmin
0 likes · 6 min read
Raymond Ops
Jan 15, 2025 · Operations

Master Linux Process Management and Scheduling: From ps to crontab

This guide explains Linux process concepts, how to view and trace processes with commands like ps and pstree, terminate them using kill, and schedule tasks both once with at and repeatedly with crontab, providing syntax, options, and practical examples.

Operations, Scheduling, crontab
0 likes · 6 min read
Bitu Technology
Jan 15, 2025 · Operations

Refactoring Playback Error Reporting, Metrics, and Recovery in Tubi Web/OTT Player

The article details how Tubi's Web/OTT team restructured player error reporting, statistical metrics, and unified handling, introduced precise error‑tracking enums, defined new recovery strategies for device decoding, network, and cache issues, and validated their impact through extensive experiments that improved user experience and key business KPIs.

Metrics, OTT, Operations
0 likes · 14 min read
FunTester
Jan 15, 2025 · Operations

How to Combine Performance Testing and Chaos Engineering for Rock‑Solid Systems

Drawing lessons from the 2021 AWS outage, this article explains how integrating performance testing with fault‑injection (chaos engineering) in microservice and Kubernetes environments can identify bottlenecks, validate resilience, and build a continuous stability strategy that balances speed and reliability.

Kubernetes, Microservices, Operations
0 likes · 13 min read
Open Source Linux
Jan 13, 2025 · Operations

Key Lessons from 2024 Major Service Outages and How to Prevent Future Downtime

The article reviews major 2024 service outages—from Alibaba Cloud to OpenAI—highlights their root causes, and offers practical operations strategies such as disaster recovery, regular backups, load balancing, monitoring, performance tuning, and capacity planning to reduce future downtime.

Operations, capacity planning, disaster recovery
0 likes · 5 min read
Deepin Linux
Jan 11, 2025 · Operations

Comprehensive Guide to Diagnosing and Resolving Linux Network Packet Loss

This article explains common Linux network packet loss scenarios, details the kernel’s packet receive and transmit paths, examines hardware and ARP issues, Conntrack limits, UDP buffer problems, and provides practical troubleshooting tools and commands to accurately detect and fix packet drops.

Linux, Operations, PacketLoss
0 likes · 23 min read
Raymond Ops
Jan 10, 2025 · Operations

Mastering SVN: 10 Essential Practices for Reliable Code Collaboration

This guide outlines ten practical SVN workflow rules—from regular code uploads and conflict resolution to ignoring generated files and writing clear commit messages—helping development teams maintain clean repositories, reduce errors, and improve collaborative efficiency.

Collaboration, Operations, Version Control
0 likes · 10 min read
Efficient Ops
Jan 9, 2025 · Operations

Unlocking BizDevOps: New Chinese Standards, International Model, and 2025 Certification

This article outlines the BizDevOps concept, introduces China’s newly issued BizDevOps standards and the international IG1374 maturity model, explains the assessment framework and process, showcases participating enterprises and certificate examples, and announces the open registration for the 2025 BizDevOps evaluation.

BizDevOps, Maturity Model, Operations
0 likes · 9 min read
IT Architects Alliance
Jan 9, 2025 · Operations

Load Balancing Strategies for High Availability in Distributed Systems

This article explores the challenges and opportunities of distributed architectures and explains how various static and dynamic load‑balancing strategies, hardware and software balancers, redundancy, health checks, and failover mechanisms together ensure high availability, illustrated with real‑world e‑commerce and live‑streaming case studies and future trends.

Operations, System Architecture, high availability
0 likes · 20 min read
Architecture Digest
Jan 9, 2025 · Operations

Nginx UI: A Web‑Based Management Interface for Nginx Servers

Nginx UI is a Go and Vue‑based web interface that simplifies Nginx server management by providing real‑time statistics, ChatGPT assistance, one‑click Let's Encrypt certificates, configuration editing, log viewing, terminal access, dark mode, and deployment options via binary, systemd, or Docker.

Nginx, Operations, Systemd
0 likes · 6 min read
IT Architects Alliance
Jan 7, 2025 · Cloud Computing

Elastic Architecture: Auto Scaling and Failover for Resilient Systems

The article explains how elastic architecture, through auto‑scaling and failover mechanisms, dynamically adjusts resources and ensures continuous service during traffic spikes and component failures, improving cost efficiency, reliability, and operational stability for modern cloud‑based applications.

Auto Scaling, Elastic Architecture, Operations
0 likes · 16 min read
Alibaba Cloud Observability
Jan 6, 2025 · Operations

How Synthetic Monitoring Boosts Network Reliability and User Experience

This article explains the importance of network stability, outlines major real‑world outages, and introduces synthetic monitoring—its functions, advantages, disadvantages, and various types such as protocol, browser, and internal monitoring—while comparing probe point categories and guiding enterprises on selecting the right strategy to improve service reliability and performance.

Network Reliability, Operations, Synthetic Monitoring
0 likes · 12 min read
DevOps Operations Practice
Jan 2, 2025 · Operations

Career Paths for Operations Professionals After Age 35

The article compiles various Zhihu users' perspectives on how operations engineers can navigate career transitions after age 35, emphasizing the importance of aligning with larger companies, developing technical or managerial expertise, leveraging specialized infrastructure knowledge, and considering alternative paths such as product, project, or data roles.

Management, Operations, Skill development
0 likes · 7 min read
Efficient Ops
Jan 1, 2025 · Operations

What 2024’s Biggest Outages Teach Us About Building Resilient Systems

Reviewing the major service disruptions—from Alibaba Cloud to OpenAI—this article extracts key SRE lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning, urging enterprises to adopt resilient practices for a more stable future.

Operations, Outage Management, SRE
0 likes · 6 min read
Tech Architecture Stories
Dec 28, 2024 · Operations

Why Preventing Small Issues Is the Key to System Stability

The article explains how early detection and preventive measures—such as comprehensive monitoring, rate limiting, chaos testing, and proper SLOs—are essential for maintaining system stability and avoiding larger incidents, drawing on SRE principles and the incident triangle theory.

Error Budget, Operations, SRE
0 likes · 4 min read
Architects' Tech Alliance
Dec 28, 2024 · Cloud Computing

Comprehensive Overview of Cloud Disaster Recovery and Backup Technologies

This article provides a detailed explanation of cloud disaster recovery and cloud backup, covering their definitions, primary application scenarios, reference architectures, and essential technologies such as incremental backup, deduplication, multi‑tenant management, and data‑trust assurance, illustrated with diagrams and practical examples.

Cloud Backup, Data Protection, Operations
0 likes · 13 min read
DaTaobao Tech
Dec 25, 2024 · Operations

Fundamentals of Service Level Agreements (SLA) for Messaging Middleware

The article explains SLA fundamentals for messaging middleware, defining contracts, SLI/SLO relationships, key metrics such as availability, latency and error‑rate, dynamic lifecycle processes, template components, error‑budget calculations, industry benchmarks, internal monitoring practices, a sample SLA draft, and best‑practice recommendations for continuous improvement.

Messaging Middleware, Operations, Reliability
0 likes · 41 min read
Raymond Ops
Dec 24, 2024 · Operations

How to Diagnose and Fix High CPU and Memory Usage in Java Applications

This guide walks through identifying Java processes that cause high CPU load, extracting the hottest threads with top and jstack, analyzing JVM memory regions, interpreting GC logs, and applying practical JVM tuning parameters and tools such as jmap, jstat, and MAT to resolve performance bottlenecks.

CPU, JVM, Memory
0 likes · 18 min read
Full-Stack DevOps & Kubernetes
Dec 20, 2024 · Operations

20 Must‑Know Production Ops Issues and Quick Fixes

This guide presents twenty common production‑environment problems—from log analysis and database recovery to Kubernetes scheduling—detailing real‑world scenarios, step‑by‑step command solutions, and preventive measures that help engineers quickly diagnose, resolve, and avoid outages.

DevOps, Operations, monitoring
0 likes · 17 min read
dbaplus Community
Dec 16, 2024 · Operations

How Qunar Built a 5‑Million‑Metric Radar System to Cut Ticket Failures by 87%

This article details the design, implementation, and results of Qunar's intelligent ticket‑monitoring Radar system, covering the business need, architecture, anomaly‑detection algorithms, test‑set construction, parameter tuning, and the achieved 87% detection accuracy with future plans for large‑model integration.

Operations, Reliability, anomaly detection
0 likes · 17 min read
Chen Tian Universe
Dec 13, 2024 · Fundamentals

Why Mastering Accounting Architecture Is the Key to Seamless Payment Systems

This comprehensive guide explains how robust accounting design—covering principles, account subsystems, hot‑account handling, merging strategies, reverse‑deduction models, sub‑account structures, day‑cut mechanisms, marketing‑related accounting, and settlement processes—forms the backbone of modern payment and clearing systems, helping product and operations teams build reliable financial infrastructure.

Operations, accounting, financial architecture
0 likes · 91 min read
JD Cloud Developers
Dec 10, 2024 · Operations

How We Boosted Inventory Platform Stability 24× with Smart Traffic Splitting and Redis Caching

This article examines the stability challenges of an e‑commerce inventory platform—including workflow complexity, database hotspots, and high‑frequency calculations—and details comprehensive solutions such as traffic splitting, gray releases, Redis caching, data consistency mechanisms, rate limiting, and monitoring enhancements that together improved throughput by 24× and reduced latency dramatically.

Operations, inventory, monitoring
0 likes · 14 min read
Efficient Ops
Dec 8, 2024 · Operations

Diagnosing High Load with Low CPU on Linux: Commands and Tips

This guide explains how to analyze and troubleshoot situations where a Linux system shows high load averages despite low CPU usage, covering common load analysis methods, key commands like top, vmstat, iostat, and practical solutions for I/O bottlenecks and stuck processes.

CPU, Linux, Load
0 likes · 11 min read
Test Development Learning Exchange
Dec 6, 2024 · Operations

Common Docker Commands Reference

This article provides a comprehensive reference of essential Docker commands, covering basic container operations, image management, volume handling, network configuration, and data management, with brief Chinese descriptions and example usages for each command.

CLI, Container, DevOps
0 likes · 6 min read
Chen Tian Universe
Dec 5, 2024 · Operations

Mastering the Four-Stage Reconciliation Model for Large Payment Institutions

This article explains how major payment institutions ensure the accuracy of tens of millions of daily transactions and billions of dollars by using a four‑segment data model, three verification groups, error classification, and extensible data coding to achieve reliable settlement and accounting.

Operations, Reconciliation, accounting
0 likes · 6 min read
Efficient Ops
Dec 4, 2024 · Operations

Top 35 Linux Ops Interview Questions and Expert Answers

This article compiles thirty‑five essential Linux operations interview questions covering server management, RAID configurations, load‑balancing choices, middleware concepts, MySQL troubleshooting, networking tools, security practices, scripting examples, and system‑level optimizations, providing concise expert answers for each topic.

Linux, Networking, Operations
0 likes · 34 min read
Efficient Ops
Dec 2, 2024 · Operations

How AI‑Driven Parameter Governance Transforms DevOps Efficiency

This article explains how AI‑powered parameter governance, integrated with DevOps and AIOps practices, tackles the explosion of configuration parameters in large‑scale financial systems, streamlines design, auditing, detection, and deployment, and ultimately boosts operational efficiency and risk control.

Artificial Intelligence, DevOps, Operations
0 likes · 8 min read
Efficient Ops
Dec 1, 2024 · Operations

How to Evaluate and Mature Your Enterprise DevOps Platform in 2024

This article outlines the current state of enterprise DevOps in China, explains regulatory emphasis on integrated R&D‑operations platforms, describes a five‑level maturity model, and provides detailed guidelines for assessing and improving organizational DevOps platforms using a structured tool‑module framework.

DevOps, Digital Transformation, Maturity Model
0 likes · 8 min read
Efficient Ops
Dec 1, 2024 · Operations

How I Rescued a Production MySQL Database After a Fatal rm -rf Accident

After a junior engineer mistakenly ran an unguarded rm -rf command that wiped an entire production server—including MySQL and Tomcat—I documented the step‑by‑step recovery using ext3grep, extundelete, and MySQL binlog, highlighting the lessons learned for future operations.

Backup, Data Recovery, Linux
0 likes · 9 min read
macrozheng
Nov 29, 2024 · Operations

Visual Server Monitoring Made Easy with Sampler: Install & Configure

This article introduces the Sampler visual monitoring tool, explains how to install it on Linux, and provides step‑by‑step YAML configuration examples for tracking CPU, memory, Docker containers, network activity, and system time, enabling quick, intuitive server status checks.

Linux, Operations, Server Monitoring
0 likes · 8 min read
DevOps Cloud Academy
Nov 22, 2024 · Operations

12 Essential Bash Scripts for DevOps Automation

This article presents twelve practical Bash scripts that automate common DevOps tasks such as system updates, disk monitoring, backups, log rotation, SSH key setup, MySQL dumping, Docker cleanup, Kubernetes pod checks, SSL certificate monitoring, Git pulling, user management, and service health verification.

Bash, Linux, Operations
0 likes · 11 min read
FunTester
Nov 22, 2024 · Operations

Why Java Is the Ultimate Backbone for Performance Testing

The author recounts a four‑year journey from UI automation to Java‑based performance testing, illustrating how mastering Java’s concurrency utilities and Groovy scripting can replace traditional tools like JMeter, enabling flexible, high‑throughput test scenarios and deeper control over test case design.

Groovy, JMeter, Operations
0 likes · 8 min read
Ops Development Stories
Nov 19, 2024 · Operations

How to Install and Explore Nightingale v7.7: New Features, Upgrade Guide, and Hands‑On Demo

This article introduces Nightingale monitoring's final v7.7 release, outlines its new features and major v7 changes, provides step‑by‑step upgrade instructions, and walks through a Docker‑based installation, data‑source integration, dashboard import, and alert‑rule configuration with DingTalk notifications.

Alert Rules, Docker, Operations
0 likes · 10 min read
Huolala Tech
Nov 14, 2024 · Operations

How Huolala Scaled Kafka: From Integrated Design to Cloud‑Native Elastic Architecture

This article chronicles the evolution of Huolala’s Kafka infrastructure—from an integrated compute‑storage design to a separated compute‑storage model with multi‑tenant deployment, and finally to a cloud‑native elastic architecture—detailing the challenges of capacity awareness, alarm configuration, and cost‑effective performance optimization.

Kafka, Operations, capacity planning
0 likes · 9 min read
Cognitive Technology Team
Nov 14, 2024 · Operations

Designing Self‑Healing Applications for Fault Tolerance in Distributed Systems

To ensure distributed applications can recover automatically from hardware, network, or service failures, this guide outlines three core capabilities—fault detection, graceful handling, and monitoring—plus practical strategies such as asynchronous component separation, retries, circuit breakers, isolation, load shedding, failover, compensation, checkpointing, graceful degradation, rate limiting, leader election, fault injection, chaos engineering, and use of availability zones.

Cloud Native, Distributed Systems, Operations
0 likes · 7 min read
Efficient Ops
Nov 13, 2024 · Operations

How China’s Auto Giants Are Driving Global DevOps Standardization

The article outlines China’s 2024‑2027 IT standards action plan, CAICT’s synchronized DevOps assessments, and detailed case studies of FAW‑Volkswagen and Changan achieving international and domestic DevOps certifications, highlighting measurable improvements in automation, delivery speed, and platform capabilities across the automotive sector.

Continuous Delivery, DevOps, Operations
0 likes · 12 min read
Liangxu Linux
Nov 12, 2024 · Operations

How to Access Firewalled Servers Using Reverse SSH Tunnels

This article shows how reverse SSH reaches machines behind restrictive firewalls by opening a tunnel from the remote server back to your local host with the ssh -R option, and provides step‑by‑step commands, configuration tips, and a persistent machine setup for reliable access.

Operations, Remote access, SSH tunneling
0 likes · 6 min read
Efficient Ops
Nov 11, 2024 · Operations

How China’s Leading Banks Are Driving Global DevOps Standardization

The article details China’s 2024‑2027 Information Standard Construction Action Plan, the launch of synchronized ITU DevOps and domestic DevOps assessments, and showcases dozens of banking projects—from agile development to continuous delivery, security, and BizDevOps—that have achieved certification, illustrating the nation’s push for international standardization and operational excellence in the financial sector.

Banking · Continuous Delivery · DevOps
0 likes · 29 min read
DevOps
Nov 10, 2024 · Product Management

Product Operations vs. Product Management: Differences, Roles, and Collaboration

This article explains the distinct responsibilities and mindsets of product operations and product management, outlines their daily tasks, career paths, workflow differences, and how the two functions can cooperate to maximize product value and business outcomes.

Operations · Product Development · career path
0 likes · 17 min read
Liangxu Linux
Nov 10, 2024 · Operations

50 Essential Ops Troubleshooting & Fix Techniques Every Sysadmin Should Know

This guide compiles fifty practical troubleshooting and remediation techniques covering system, network, application, database, and security layers, enabling operations engineers to quickly diagnose common failures such as high load, service crashes, permission errors, and performance bottlenecks, and apply concrete fixes to maintain stable, secure services.

Operations · network · security
0 likes · 16 min read
Architect
Nov 7, 2024 · Operations

Full-Link Multi-Version Deployment: Architecture, Techniques, and Future Outlook

This article explains the concept of full-link multi-version deployment in microservice architectures, describes the challenges of traditional test environments, and details the technical solutions—including traffic coloring, isolation, label propagation, environment management, and monitoring—implemented through a flexible CI/CD pipeline.

Microservices · Multi-Version Deployment · Operations
0 likes · 16 min read
FunTester
Nov 7, 2024 · Operations

Mastering Software Risk Management: Proven Strategies to Prevent Project Failures

Effective software risk management—by identifying technical and business risks, integrating quality assurance, using structured processes, and leveraging risk‑management tools—helps avoid financial loss, project delays, and reputational damage while ensuring project success and operational stability.

Operations · Project Management · quality assurance
0 likes · 11 min read
Model Perspective
Nov 6, 2024 · Operations

Unlock Hidden Losses: How the Funnel Model Optimizes Your Process

The Funnel Model breaks down any process into sequential stages, measures entry and exit numbers at each step, calculates stage and overall conversion rates, and reveals where the greatest losses occur, enabling data‑driven optimization for e‑commerce, management, and other applications.
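
The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the article's own code: the function name `funnel_report`, the stage names, and the counts are all hypothetical, and it assumes every stage count is positive and that "greatest loss" means the largest absolute drop between two consecutive stages.

```python
def funnel_report(stages):
    """Compute per-step conversion, overall conversion, and the leakiest step.

    stages: ordered list of (name, count) pairs, top of the funnel first.
    Assumes every count is greater than zero.
    """
    report = []
    # Pair each stage with the one that follows it.
    for (name, count), (next_name, next_count) in zip(stages, stages[1:]):
        report.append({
            "step": f"{name} -> {next_name}",
            "stage_rate": next_count / count,   # share that moved on
            "lost": count - next_count,         # absolute drop at this step
        })
    # Overall conversion: last stage relative to the first.
    overall = stages[-1][1] / stages[0][1]
    # The step with the largest absolute loss is the first place to look.
    worst = max(report, key=lambda r: r["lost"])["step"]
    return report, overall, worst


# Hypothetical e-commerce funnel, numbers invented for illustration.
example = [
    ("visit", 1000),
    ("product view", 400),
    ("add to cart", 100),
    ("purchase", 25),
]

report, overall, worst = funnel_report(example)
for row in report:
    print(f'{row["step"]}: {row["stage_rate"]:.1%} converted, {row["lost"]} lost')
print(f"overall conversion: {overall:.1%}")
print(f"largest loss at: {worst}")
```

Ranking steps by absolute loss is one possible choice; ranking by the lowest stage rate instead would highlight steps that are weak in relative terms even when few users reach them.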

Management · Operations · conversion rate
0 likes · 5 min read
Linux Cloud Computing Practice
Nov 5, 2024 · Operations

10 Essential Linux Ops Tools Every Engineer Should Master

This article introduces ten indispensable Linux operations tools—Shell scripting, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical use cases, advantages, and practical examples to help engineers automate and monitor infrastructure efficiently.

Configuration Management · DevOps · Operations
0 likes · 9 min read
MaGe Linux Operations
Nov 4, 2024 · Cloud Native

Essential kubectl Commands for Viewing, Managing, and Debugging Kubernetes

This guide walks you through essential kubectl commands for checking cluster status, inspecting resources, retrieving detailed object information, monitoring logs, managing configurations, labeling, and performing create, update, and delete operations, empowering you to efficiently view, troubleshoot, and control Kubernetes workloads.

Cloud Native · DevOps · Kubernetes
0 likes · 13 min read
DevOps Engineer
Oct 29, 2024 · Operations

A Day in the Life of a DevOps Engineer

The article walks through a DevOps engineer’s typical workday, from morning Slack checks and task planning, through code repository maintenance, build and release duties, coffee breaks, lunch with teammates, focused afternoon development, and evening family time, highlighting both technical and personal aspects.

DevOps · Infrastructure · Operations
0 likes · 4 min read
Efficient Ops
Oct 28, 2024 · Operations

Master Linux Command Line: Essential Tips and Tricks for System Operations

The article covers Linux commands, shortcuts, file and directory management, permissions, users, searching, software repositories, manual pages, advanced topics like redirection, pipelines, processes, daemons, compression, compilation, networking, backup, and system control, providing practical examples and code snippets.

Linux · Operations · System Administration
0 likes · 50 min read
360 Zhihui Cloud Developer
Oct 28, 2024 · Operations

How Zero‑Intrusion eBPF Transforms TCP Network Monitoring and Troubleshooting

This article explains how zero‑intrusion eBPF technology enables detailed, non‑disruptive TCP network monitoring, covering data collection interfaces, aggregation methods, implementation steps, usage limitations, and practical installation and visualization guidance for improving network performance and fault analysis.

Linux kernel · Network Monitoring · Operations
0 likes · 9 min read
Efficient Ops
Oct 27, 2024 · Operations

How China’s Aviation Leaders Earn International DevOps Certification and Boost Efficiency

The article outlines China’s 2024‑2027 Information Standard Action Plan, CAICT’s synchronized DevOps assessments, and how major aviation firms like Southern Airlines and China Aviation Information Network achieved international and domestic DevOps, AIOps, and BizDevOps certifications, delivering measurable improvements in build success rates, deployment speed, and operational automation.

BizDevOps · Continuous Delivery · DevOps
0 likes · 11 min read
Efficient Ops
Oct 27, 2024 · Operations

How China Aviation’s DevOps Assessment Boosted Delivery Efficiency and Set a New Industry Benchmark

The article details China Aviation Information Network's successful dual certification in ITU DevOps international and domestic standards, highlighting the evaluation process, measurable improvements in pipeline alerts and deployment success, and expert insights on the future of DevOps in the aviation sector.

Aviation IT · China · Continuous Delivery
0 likes · 12 min read
Efficient Ops
Oct 24, 2024 · Operations

How Migu’s AI‑Powered Observability Boosts Cloud Gaming Operations

During the 24th GOPS Global Operations Conference, Migu Interactive Entertainment’s Vice President Su Yi discussed how their AI‑driven AIOps observability framework, validated by ITU standards, enhances cloud gaming platform stability, accelerates issue detection, and supports China Mobile’s 5G‑based digital transformation.

AI · Digital Transformation · Operations
0 likes · 19 min read
macrozheng
Oct 24, 2024 · Backend Development

Simplify Nginx Management: A Hands‑On Guide to Using Nginx UI with Docker

This tutorial introduces Nginx UI, a visual management tool for Nginx, explains how to install it via Docker, and demonstrates its core features—including dashboard monitoring, static and dynamic proxy configuration, and SSL management—through a step‑by‑step deployment of a SpringBoot‑Vue e‑commerce project.

Nginx · Operations · Proxy
0 likes · 9 min read
Software Development Quality
Oct 23, 2024 · R&D Management

Essential R&D Performance Metrics: Measure Business Value, Delivery Speed, Quality and Operations

This article presents a comprehensive set of R&D performance indicators—including business value, delivery speed, engineering quality, and operational reliability—detailing each metric's definition, calculation method, and practical notes to help teams monitor and improve their development efficiency.

Operations · R&D metrics · agile
0 likes · 9 min read
Efficient Ops
Oct 22, 2024 · Operations

How New BizDevOps Standards Are Shaping China’s Digital Transformation

This article reviews the latest progress of DevOps standards in China, introduces the newly released BizDevOps framework, details the content of the standard system, highlights emerging XOps hotspots, and explains how these initiatives support enterprise digital transformation and operational efficiency.

BizDevOps · DevOps · Operations
0 likes · 18 min read