Tagged articles
3281 articles
Linux Cloud Computing Practice
Oct 22, 2024 · Operations

Simplify Multi‑Server Linux Management with a Ready‑Made Batch Script

This article introduces a ready‑to‑use Linux batch‑operation script that lets non‑expert administrators update, configure, and manage multiple Ubuntu 22.04 servers at once. Its functions include updating the script, creating SSL certificates, generating SSH keys, changing passwords in bulk, and deploying or removing ALEO services. The article also offers a free, comprehensive Linux command and shell‑script tutorial.

Operations, batch operations, server management
0 likes · 5 min read
Efficient Ops
Oct 21, 2024 · Operations

Essential Prometheus Best Practices: Avoid Common Pitfalls and Boost Reliability

This article shares practical Prometheus best‑practice tips—from understanding its accuracy‑reliability trade‑offs and self‑monitoring, to avoiding NFS storage, managing high‑cardinality metrics, handling rate() and recording‑rule pitfalls, and fine‑tuning alerting—so you can run a stable, low‑cost monitoring stack.

Alerting, Cloud Native, Operations
0 likes · 10 min read
JD Cloud Developers
Oct 21, 2024 · Operations

How Test Teams Can Build Observability Beyond Traditional Monitoring

This article examines how quality assurance engineers can adopt observability principles—distinct from conventional monitoring—to enhance system health detection, root‑cause analysis, and proactive risk mitigation across resources, services, business functions, data, and logs.

Operations, monitoring, observability
0 likes · 17 min read
Efficient Ops
Oct 19, 2024 · Operations

China Southern Airlines Wins Dual DevOps Certifications: BizDevOps & Continuous Delivery Excellence

China Southern Airlines achieved leading BizDevOps and continuous delivery capabilities by passing both international ITU DevOps and domestic standards across three key projects, highlighting the strategic impact of DevOps standardization on business value, technology integration, and digital transformation within the airline industry.

BizDevOps, DevOps, Digital Transformation
0 likes · 17 min read
Efficient Ops
Oct 18, 2024 · Operations

How Changan Auto Achieved Dual ITU DevOps Certification and Boosted Efficiency

This article details China’s 2024‑2027 ITU‑DevOps standard action plan, CAICT’s dual international and domestic DevOps assessments, Changan Auto’s successful Gaia platform certification, and insights from senior executives on implementation challenges, benefits, and future DevOps trends.

DevOps, Digital Transformation, Operations
0 likes · 21 min read
DevOps Engineer
Oct 18, 2024 · Operations

Comprehensive DevOps Interview Questions from a Swedish Company

This article presents a comprehensive list of 17 in‑depth DevOps interview questions asked by a Swedish company, covering Linux boot processes, Kubernetes internals, Git workflows, Jenkins pipelines, networking, monitoring, databases, Docker, and soft‑skill topics to help candidates prepare effectively.

DevOps, Kubernetes, Linux
0 likes · 3 min read
JD Tech Talk
Oct 17, 2024 · Operations

Comprehensive Guide to Change Management: Compatibility Design, Release Planning, Gray Deployment, Data Migration, Rollback, and Configuration Control

This article presents a detailed overview of change management practices, covering compatibility design across hardware, base software, and applications, release strategies, gray‑deployment techniques, data migration analysis, rollback planning, configuration change control, and verification procedures to ensure system stability and reliability.

Compatibility, Gray Deployment, Operations
0 likes · 26 min read
JD Cloud Developers
Oct 17, 2024 · Operations

Master Change Management: Compatibility, Gray Release & Rollback Strategies

This guide outlines comprehensive change‑management practices—including compatibility design across hardware, base and application software, structured release planning, gray‑release techniques, data‑migration safeguards, rollback mechanisms, and configuration control—to ensure system stability and reliability during updates.

Deployment, Operations, change management
0 likes · 25 min read
Alibaba Cloud Developer
Oct 15, 2024 · Databases

Why Did Redis Crash at 100% Memory? Uncovering Buffer Overflows and Best Practices

A detailed post‑mortem of a Redis outage shows how a traffic surge filled bandwidth, caused massive input and output buffers to consume almost all memory, and led to timeouts, while offering step‑by‑step analysis, memory diagnostics, and practical recommendations to prevent similar buffer‑overflow failures.

Operations, best practices, buffer overflow
0 likes · 22 min read
MaGe Linux Operations
Oct 15, 2024 · Operations

Master Linux Process Management: From Basics to Powerful Commands

This guide explains what a program and a process are, describes process creation, lifecycle, and identifiers, and provides detailed usage of essential Linux commands such as ps, top, pgrep, pstree, lsof, vmstat, iostat, iftop, dstat, as well as foreground/background control and scheduling with at and crontab.

Linux, Operations, System Administration
0 likes · 10 min read
DevOps Operations Practice
Oct 10, 2024 · Operations

Seven Key Truths About Operations: Downtime, Automation, Prevention, Technology as a Tool, DevOps, Communication, and Security

Effective operations management acknowledges inevitable downtime, emphasizes automation, prioritizes proactive prevention, treats technology as a means rather than an end, integrates closely with development through DevOps, relies on strong communication, and continuously addresses pervasive security challenges to minimize business impact.

Operations, automation, downtime
0 likes · 5 min read
Qunar Tech Salon
Oct 10, 2024 · Operations

Design and Architecture of a Distributed Task Scheduling System for Database Automation

This document outlines the terminology, background, requirements, task classifications, state model, and detailed architecture—including TaskScheduler, TaskWorker, and TaskConsole components—of a new distributed task scheduling system designed to replace Celery in a database automation platform, with emphasis on scalability, reliability, and extensibility.

Distributed Systems, Locks, Operations
0 likes · 23 min read
Architecture Digest
Oct 9, 2024 · Operations

Longest‑Running Computer Systems: Real‑World Server Uptime Stories

This article compiles real-world anecdotes from Zhihu users describing computers and servers that have run continuously for years or even decades, highlighting examples such as a 14‑year Red Hat Linux machine, a 20‑year base‑station, long‑standing DOS and Sun systems, and space probes that have operated for nearly half a century.

Operations, hardware longevity, long-running systems
0 likes · 7 min read
Selected Java Interview Questions
Oct 7, 2024 · Operations

Top 10 Tools Frequently Used by Operations Engineers: Features, Use Cases, and Practical Examples

This article introduces ten essential tools for operations engineers—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functionality, typical scenarios, advantages, and real‑world examples with code snippets for practical automation and monitoring.

Infrastructure, Operations, automation
0 likes · 8 min read
ITPUB
Oct 6, 2024 · Operations

Mastering Prometheus Metrics: Practical Best‑Practice Guide for Effective Monitoring

This guide explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, the four golden metrics, system‑specific metric groups, vector and label choices, naming conventions, histogram bucket design, and useful Grafana visualization tips.

Grafana, Metrics, Operations
0 likes · 9 min read
dbaplus Community
Oct 3, 2024 · Operations

How Netflix Uses Chaos Engineering to Build Resilient Distributed Systems

This article explains Netflix's chaos engineering practice, detailing the challenges of microservice reliability, the implementation of the Chaos Monkey tool, the step‑by‑step methodology, guiding principles, and real‑world outcomes that demonstrate improved system availability.

Chaos Monkey, Distributed Systems, Netflix
0 likes · 6 min read
Liangxu Linux
Oct 2, 2024 · Operations

10 Essential Ops Engineer Tools Every Sysadmin Should Master

A comprehensive guide lists ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, advantages, and real‑world examples, plus practical code snippets for automation, monitoring, container orchestration, and log analysis.

DevOps, Operations, automation
0 likes · 7 min read
Liangxu Linux
Oct 1, 2024 · Operations

10 Proven Practices to Prevent System Failures for Ops Teams

This guide outlines ten practical strategies—including rollback testing, safe handling of destructive commands, prompt customization, robust backup and verification, production environment discipline, thorough handover, proactive monitoring, cautious auto‑failover, meticulous execution, and simplicity—to help operations engineers dramatically reduce system outages and improve reliability.

Backup, Operations, best practices
0 likes · 17 min read
Architect
Sep 30, 2024 · Operations

Automated Resource Balancing and Migration for Redis Clusters

The article describes how an automated resource‑balancing system continuously monitors Redis host memory usage, selects optimal nodes, safely migrates them through a multi‑step process (adding slaves, verifying replication, promoting masters, deleting old nodes), and provides task management and notification features to maintain high availability and reduce manual DBA effort.

Cluster Migration, Operations, Reliability
0 likes · 13 min read
Efficient Ops
Sep 29, 2024 · Operations

Essential Linux Ops Tools Every Sysadmin Must Master

This guide outlines the ten core tool categories—from Linux basics and networking services to scripting, firewalls, monitoring, clustering, and backup—that a Linux operations engineer should master to become an effective sysadmin.

Linux, Networking, Operations
0 likes · 6 min read
IT Architects Alliance
Sep 28, 2024 · Operations

How DevOps Transforms IT: Core Principles, Practices, and Real-World Success

This article explores the DevOps mindset, its core principles such as collaboration, automation, continuous improvement, and customer focus, outlines essential practices like CI/CD, IaC, monitoring, microservices, and provides a step‑by‑step adoption roadmap illustrated with a detailed case study and future trends.

Cloud Native, DevOps, Microservices
0 likes · 11 min read
Python Programming Learning Circle
Sep 28, 2024 · Operations

Essential Skills for Becoming a Successful DevOps Engineer

The article outlines the key competencies a DevOps engineer must master—including programming, Linux system knowledge, configuration management, infrastructure-as-code, CI/CD tools, networking and security, monitoring, and cloud services—to guide readers on building a comprehensive skill set for effective DevOps practice.

DevOps, Infrastructure as Code, Linux
0 likes · 5 min read
IT Services Circle
Sep 27, 2024 · Operations

Analysis of the Shanghai Stock Exchange Outage and System Design Lessons

The article recounts the Shanghai Stock Exchange’s sudden P0 outage that halted trading, analyzes the causes such as massive order volume and system bottlenecks, and discusses how distributed architectures and message‑queue based queuing can mitigate similar high‑concurrency failures.

Distributed Systems, Operations, high concurrency
0 likes · 6 min read
Zhuanzhuan Tech
Sep 26, 2024 · Artificial Intelligence

Pricing Strategy and Model Evolution for Second‑Hand Phone Auctions in ZhaiZhai TOB Marketplace

This article examines the characteristics of ZhaiZhai's B2B auction scenario, defines core pricing metrics, presents a step‑by‑step methodology for determining optimal starting prices, reviews early practices and their shortcomings, and details the current modular machine‑learning model architecture that improves transaction rates and reduces price premiums for second‑hand smartphones.

Operations, Price Optimization, algorithm
0 likes · 29 min read
Efficient Ops
Sep 24, 2024 · Operations

Master Linux Performance in 60 Seconds: 10 Essential Commands

When a Linux server shows performance issues, the first minute is critical; this guide walks you through ten standard command‑line tools—uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, and top—explaining what each metric means and how to interpret the output for quick troubleshooting.

Linux, Operations, monitoring
0 likes · 19 min read
Top Architect
Sep 23, 2024 · Backend Development

Understanding Nginx Architecture, Process Model, FastCGI Integration, and Performance Optimization

This article provides a comprehensive overview of Nginx's high‑performance architecture, including its core, basic, and third‑party modules, master‑worker process model, asynchronous non‑blocking I/O mechanisms, FastCGI and PHP‑FPM integration, and practical configuration and tuning tips for optimal server operation.

Backend, Nginx, Operations
0 likes · 46 min read
FunTester
Sep 20, 2024 · Operations

Chaos Engineering vs Fault Testing: Methods, Challenges, and Future Trends

This article compares chaos engineering and fault testing, outlines fault injection techniques, implementation layers, testing strategies, challenges, and future trends such as automation, AI-driven diagnostics, and cloud‑native integration, providing a comprehensive guide for improving system resilience and reliability.

Cloud Native, Operations, chaos engineering
0 likes · 17 min read
Liangxu Linux
Sep 17, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article presents ten indispensable tools for operations engineers—detailing each tool’s functionality, ideal use cases, advantages, and real‑world examples, from shell scripting and Git to Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, the ELK stack, and Zabbix, helping professionals streamline automation, monitoring, and deployment tasks.

Configuration Management, DevOps, Operations
0 likes · 8 min read
Architects' Tech Alliance
Sep 12, 2024 · Industry Insights

Managing and Optimizing Large‑Scale AI Compute Clusters: Practical Insights

This article examines the key pain points of massive AI compute clusters—including heterogeneous hardware compatibility, efficient scheduling, training and inference acceleration, and fault‑tolerant operations—while presenting practical management and performance‑tuning strategies, a cloud‑native AI platform implementation, and future directions for the ecosystem.

AI computing, Cluster Management, Operations
0 likes · 7 min read
Linux Cloud Computing Practice
Sep 11, 2024 · Operations

Essential Linux Commands Every Sysadmin Should Master

This guide compiles the most frequently used Linux commands—covering help utilities, file and directory manipulation, content processing, compression, system information, networking, disk management, permissions, user administration, and process control—to provide a comprehensive reference for effective system operation and troubleshooting.

Operations, Shell, Unix
0 likes · 14 min read
DevOps Engineer
Sep 11, 2024 · Operations

Will DevOps Disappear? How AI Impacts the Role of DevOps Engineers

While AI can automate many routine DevOps tasks such as scripting, CI/CD pipeline creation, and infrastructure design, it cannot replace the contextual understanding, critical thinking, experience, and judgment of senior DevOps engineers, who will evolve into architects and innovators rather than being rendered obsolete.

Artificial Intelligence, Career Development, DevOps
0 likes · 4 min read
FunTester
Sep 11, 2024 · Operations

Pinterest Performance Plan: Real‑User Monitoring, Regression Detection, and Alerting

Pinterest’s performance program details how the team defines custom Pinner Wait Time metrics, uses real‑user monitoring and fine‑grained alerts to detect regressions quickly, and follows structured root‑cause analysis and ownership processes to prevent performance degradation across web surfaces.

Operations, monitoring, real‑user
0 likes · 18 min read
DevOps Operations Practice
Sep 8, 2024 · Operations

Which Types of Companies Pay Well for Operations Engineers

The article explains that technology‑driven firms, financial institutions, large multinational corporations, and innovative startups are the main types of companies that tend to offer high salaries to operations engineers because of their critical reliance on stable and secure IT infrastructure.

IT infrastructure, Operations, career
0 likes · 6 min read
NetEase LeiHuo Testing Center
Sep 6, 2024 · Operations

From Log Beginner to Pro: A QA’s Journey in Game Log Management and Monitoring

This article chronicles the author’s progression from a novice to a proficient log analyst in game development, explaining what logs are, how to collect and classify them, establishing standards and workflows, and detailing the implementation of log monitoring and QA processes for reliable game operations.

Game Development, Log Monitoring, Operations
0 likes · 20 min read
FunTester
Sep 4, 2024 · Operations

Reflections on Technical Growth: Foundations, Output, and Continuous Learning

The article shares a software engineer’s personal journey, emphasizing the importance of solid fundamentals, proactive output, curiosity‑driven problem solving, documentation, and process optimization to build lasting technical competence and reduce tacit knowledge throughout a career.

Career Development, Documentation, Operations
0 likes · 13 min read
Efficient Ops
Sep 2, 2024 · Operations

How China’s Auto Industry Is Leading the Way in DevOps Standardization

The article details China’s 2024‑2027 Information Standardization Action Plan, the CAICT’s DevOps assessment framework, and showcases how automotive firms like FAW‑Volkswagen and Chang'an have achieved top‑tier continuous delivery and system‑tool standards, highlighting key metrics and the role of international ITU standards.

Continuous Delivery, DevOps, IT Governance
0 likes · 9 min read
Volcano Engine Developer Services
Sep 2, 2024 · Operations

How ByteDance Scales Disaster Recovery: From Single Data Center to Multi‑Region Active‑Active

This article details ByteDance’s disaster‑recovery evolution—from a single‑room deployment to same‑city multi‑data‑center setups and finally to active‑active multi‑region architectures—explaining the challenges, specific failure scenarios, and the strategic practices used to ensure continuous service during outages.

Infrastructure, Operations, disaster recovery
0 likes · 15 min read
Top Architect
Aug 29, 2024 · Operations

Setting Up Nginx Log Monitoring with Loki, Promtail, and Grafana

This article walks through a complete, step‑by‑step solution for collecting Nginx access logs, converting them to JSON, shipping them with Promtail to Loki, and visualizing the data in Grafana, including Docker deployment, dashboard import, and world‑map plugin installation.

Grafana, Loki, Operations
0 likes · 10 min read
MaGe Linux Operations
Aug 29, 2024 · Databases

Database Server Ops: Hardware, Tuning, Backup & Security Best Practices

This guide outlines comprehensive best practices for database server operations, covering hardware selection, OS and kernel tuning, storage choices, MySQL configuration, performance monitoring, backup strategies, security measures, high availability, automation, and systematic maintenance procedures to ensure optimal reliability and efficiency.

Backup, Operations, databases
0 likes · 7 min read
JD Cloud Developers
Aug 27, 2024 · Cloud Computing

Deploy Black Myth: Wukong on the Cloud for Smooth Steam Streaming

This guide walks you through creating a cloud‑based service that bundles Black Myth: Wukong, the Steam client, and a remote desktop tool, detailing instance setup, remote configuration, streaming steps, and troubleshooting to achieve a seamless gaming experience.

Deployment, Gaming, Operations
0 likes · 9 min read
Data Thinking Notes
Aug 25, 2024 · Operations

How Digital Transformation Architecture Shapes Modern Enterprises

This article outlines the background, overall framework, and platform construction of enterprise digital transformation, illustrating each component with detailed diagrams that guide organizations in planning and implementing comprehensive digital strategies to achieve competitive advantage.

Digital Transformation, IT Strategy, Operations
0 likes · 2 min read
Open Source Linux
Aug 23, 2024 · Operations

10 Proven Ops Practices to Prevent System Failures

This article shares ten practical operations strategies—including change rollbacks, safe handling of destructive commands, prompt customization, rigorous backup and verification, production environment discipline, careful handovers, robust alerting, cautious automatic failover, meticulous checks, and simplicity—to dramatically improve system reliability and availability.

Backup, Linux, Operations
0 likes · 17 min read
DevOps
Aug 22, 2024 · Operations

Synthetic Monitoring and Fault Drills: Practices for Ensuring Service Stability

This article explains why service stability is critical, outlines the importance and key factors of synthetic monitoring, provides practical guidelines for implementing it, and then describes fault‑drill concepts, benefits, processes, and common cloud‑native tools to proactively discover and mitigate failures in micro‑service environments.

Fault Injection, Operations, Synthetic Monitoring
0 likes · 11 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search, prompting a two‑hour recovery effort, a brief free‑membership compensation, and highlighting the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

Infrastructure, NetEase Cloud Music, Operations
0 likes · 6 min read
Ops Development Stories
Aug 21, 2024 · Operations

How Large Language Models Can Transform Ops Fault Handling: A Practical Guide

This article outlines a typical operations incident workflow, identifies four key stages where large language models can assist, discusses implementation challenges, introduces the Ops framework and Copilot design, and shares practical examples and a real‑world case to help engineers adopt AI‑driven fault management.

AI Ops, Large Language Models, Operations
0 likes · 19 min read
Tencent Cloud Developer
Aug 20, 2024 · Backend Development

Why Caching Is the Secret Weapon for High‑Performance Search Engines

This article analyzes real‑world search query characteristics, breaks down a typical search system architecture, classifies cacheable data, compares result‑level, intermediate‑value and multi‑layer caches, discusses update, prefetch and placement strategies, and highlights common pitfalls such as cache miss, consistency, and resource overhead.

Backend, Cache Strategies, Operations
0 likes · 19 min read
Data Thinking Notes
Aug 19, 2024 · Operations

How to Build an Effective Data Metric System for Business Success

This article explains what a data metric system is, why it’s essential for organizations, the stages of building it, required resources, organizational alignment, and a step‑by‑step path to create a robust, data‑driven indicator framework that supports product development, operations, and strategic decision‑making.

Business Analytics, Data-driven, Indicator System
0 likes · 17 min read
Qunhe Technology Quality Tech
Aug 16, 2024 · Artificial Intelligence

How FastGPT Transforms Ticket Handling and Boosts Efficiency by 90%

This article examines the pain points of a custom ticket system, introduces FastGPT’s knowledge‑base and query capabilities, outlines integration architecture and concrete features, and shows how the combined solution reduces ticket resolution time dramatically while improving overall operational efficiency.

AI, FastGPT, Operations
0 likes · 10 min read
21CTO
Aug 15, 2024 · Operations

Why GitHub’s Massive Outage Happened: Database Infrastructure Rollback Explained

A detailed account of GitHub’s recent worldwide outage reveals that a rollback of database infrastructure changes caused widespread service failures across GitHub.com, Pages, Copilot, and the API, highlighting the challenges of stateful database reliability in large platforms.

GitHub, Operations, Outage
0 likes · 4 min read
Open Source Linux
Aug 13, 2024 · Operations

Complete Guide to Operations Automation Scripts and Directory Structure

This article outlines a comprehensive set of automated operations scripts, including baseline checks, service monitoring, Docker and Kubernetes maintenance, security inspections, and a well‑organized directory layout with roles, system, network, database, application, security, automation, and infrastructure sections.

Ansible, Operations, Script Management
0 likes · 6 min read
Zhuanzhuan Tech
Aug 7, 2024 · Operations

Building a Dynamic Grafana Dashboard for Push System TraceId Visualization

This article describes how to use Grafana's Flowcharting plugin and Prometheus metrics to create a dynamic, interactive dashboard that visualizes each logical node of a push notification pipeline, enabling rapid trace‑ID based troubleshooting and reducing manual investigation effort.

Grafana, Operations, dynamic-view
0 likes · 11 min read
ITPUB
Aug 5, 2024 · Operations

Do You Really Need Kubernetes? Real‑World Opinions and Practical Tips

A collection of Zhihu answers debates whether adopting Kubernetes is necessary, presenting viewpoints from developers and ops leaders, highlighting cost, complexity, operational benefits, deployment commands, and practical considerations for small and large scale projects.

Kubernetes, Microservices, Operations
0 likes · 10 min read
Liangxu Linux
Aug 1, 2024 · Operations

Essential Operations Metrics Every IT Team Should Track

This guide outlines key operational metrics—availability, failure rate, MTTR, MTBF, response time, throughput, error rate, capacity utilization, latency, data integrity, and more—explaining their calculations, typical benchmark values, and practical application areas to help organizations monitor and improve IT performance.

Availability, MTTR, Metrics
0 likes · 6 min read
Open Source Linux
Aug 1, 2024 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, ideal use cases, key advantages, and practical examples, while also providing code snippets and visual illustrations to help readers understand and apply them effectively.

Configuration ManagementInfrastructureOperations
0 likes · 8 min read
FunTester
Jul 31, 2024 · Cloud Native

Improving Test Environment Stability with Containerized One-Box and Soft‑Isolation Solutions

The article analyzes why test environments are inherently less stable than production, identifies frequent changes as the root cause, and proposes two container‑based approaches—One‑Box for small services and soft isolation for large microservice systems—plus automated health and business inspections to achieve reasonable, cost‑effective stability.

Cloud Native · Microservices · Operations
0 likes · 13 min read
58 Tech
Jul 29, 2024 · Databases

HBase Cloud Migration: Architecture, Challenges, and Solutions

This technical report details the background, architecture, construction, core issues, migration plans, and future roadmap of moving 58's HBase clusters to a cloud‑native environment, highlighting cost reduction, operational automation, and performance optimizations.

Big Data · Cloud Native · HBase
0 likes · 22 min read
DataFunSummit
Jul 28, 2024 · Product Management

From 1 to N: Building and Maintaining a Tag System – Common Issues and Solutions

This article outlines the three essential steps for scaling a tag system from initial deployment to full maturity, highlights typical challenges such as incomplete functionality, business system integration, and permission management, and provides practical solutions and best‑practice recommendations for each stage.

Data Governance · Operations · feature planning
0 likes · 6 min read
dbaplus Community
Jul 28, 2024 · Operations

A Day in the Life of a Linux Ops Engineer: Real Stories and Practical Tips

This article compiles several Zhihu users' candid accounts of a typical Linux operations day, highlighting constant interruptions, emergency firefighting, performance tuning, monitoring, tool development, and a balanced time‑allocation strategy to make ops work more efficient and sustainable.

Linux · Operations · PerformanceTuning
0 likes · 11 min read
Efficient Ops
Jul 25, 2024 · Operations

FAW‑Volkswagen’s Dual DevOps Certification: Driving Digital Transformation

FAW‑Volkswagen successfully earned both ITU DevOps international certification and the domestic DevOps standard assessment for its R&D Efficiency Platform and Integrated Operations Platform, showcasing how standardized DevOps practices can accelerate digital transformation, improve delivery quality, and enhance operational efficiency in the automotive industry.

DevOps · Digital Transformation · Operations
0 likes · 16 min read
JD Tech Talk
Jul 25, 2024 · Backend Development

Design and Architecture of JD.com’s Buffalo Distributed DAG Scheduling System

The article details the design, core technical solutions, high‑availability architecture, performance optimizations, and open capabilities of Buffalo, JD.com’s distributed DAG‑based job scheduling platform that supports massive task volumes, complex dependencies, and flexible resource management.

Backend · DAG · Distributed Scheduling
0 likes · 13 min read
Soul Technical Team
Jul 23, 2024 · Big Data

Kafka Stability Challenges and Governance Framework at Soul

This article analyzes the role, application scenarios, stability challenges, and comprehensive governance framework of Apache Kafka at Soul, covering deployment, configuration, monitoring, standard controls, common misuse, and future directions toward cloud‑native solutions.

Kafka · Operations · Streaming
0 likes · 30 min read
Efficient Ops
Jul 22, 2024 · Operations

Mastering Ansible: Core Concepts, Architecture, and Essential Commands

This article introduces Ansible as an open‑source automation tool, explains its declarative, abstract and idempotent characteristics, shows how to install it with pip, outlines its core architecture components, describes its working principles, and provides usage examples for its seven main commands.

Ansible · Configuration Management · DevOps
0 likes · 8 min read
Architecture and Beyond
Jul 21, 2024 · Operations

Mastering Backend Stability: 7 Essential Practices for High Availability

This comprehensive guide outlines the seven key pillars—operations, high‑availability architecture, capacity governance, change management, risk governance, fault management, and chaos engineering—that together form a systematic approach to building and maintaining a reliable, 24‑hour backend system.

Operations · backend stability · capacity planning
0 likes · 40 min read
ITPUB
Jul 19, 2024 · Information Security

Why Did a CrowdStrike Update Trigger a Global Windows Blue Screen Crisis?

A sudden worldwide surge of Windows Blue Screen of Death incidents on July 19, 2024, linked to a CrowdStrike security‑agent update, crippled Microsoft 365 services, disrupted airlines, and highlighted the far‑reaching impact of a single software change on global IT stability.

Blue Screen · CrowdStrike · Microsoft
0 likes · 6 min read
NetEase Cloud Music Tech Team
Jul 17, 2024 · Operations

How NetEase Cloud Music Automated Massive Service Upgrades with a Custom Platform

This article presents a comprehensive case study of NetEase Cloud Music's automatic upgrade platform, detailing the background challenges, technical architecture, sidecar versus component upgrades, workflow orchestration, operational safeguards, performance metrics, and future roadmap for large‑scale microservice migrations.

Cloud Native · Microservices · Operations
0 likes · 17 min read
Continuous Delivery 2.0
Jul 17, 2024 · Operations

Design Principles of Deployment Pipelines

The article explains the core concept of deployment pipelines in Continuous Delivery 2.0, outlines essential quality‑gate mechanisms, and details five design principles—build once, loose coupling, parallelization, fast feedback, and important feedback—plus team collaboration disciplines such as immediate pause and security audit.

Continuous Delivery · Deployment Pipeline · DevOps
0 likes · 8 min read
DevOps
Jul 16, 2024 · Product Management

Comprehensive IT Project Management Process: Product, Requirement, Development, Testing, Release, and Operations

This article provides a detailed overview of the end‑to‑end IT project management lifecycle, including product and requirement management, development and testing steps, version release procedures, and post‑release operations, offering practical guidance for teams to design and control their workflows.

Development · Operations · product-management
0 likes · 6 min read
Top Architect
Jul 16, 2024 · Operations

Jpom – Lightweight Java‑Based Online Build, Deployment, and Operations Tool

Jpom is a simple, low‑intrusion Java‑based platform that provides online project building, automatic deployment, daily operations, and monitoring features, offering node management, SSH terminal, Docker handling, and a one‑click installation process suitable for individuals and small enterprises.

Build Automation · DevOps · Jpom
0 likes · 9 min read
Top Architecture Tech Stack
Jul 16, 2024 · Cloud Native

Designing Fault‑Tolerant Microservices Architecture: Patterns and Practices

The article explains how to build reliable microservices by isolating failures, applying graceful degradation, change‑management, health checks, self‑healing, fallback caching, retry strategies, rate limiting, fast‑fail principles, circuit breakers, and failure‑testing to ensure high availability in distributed cloud‑native systems.

Cloud Native · Microservices · Operations
0 likes · 14 min read
Software Development Quality
Jul 11, 2024 · Information Security

How to Implement Secure and Compliant Log Management Standards

This guide outlines the purpose, scope, principles, and detailed specifications for log management—including file naming, retention periods, content rules, security handling, and monitoring—to ensure reliable issue tracing, data safety, and regulatory compliance across all system development projects.

Log Management · Operations · compliance
0 likes · 12 min read
Efficient Ops
Jul 8, 2024 · Operations

How to Diagnose and Fix High CPU Usage in Java Data Platforms

This article walks through a real‑world incident in which a data‑platform server showed near‑100% CPU usage, explains the step‑by‑step investigation using top, pwdx, and jstack, identifies a time‑conversion utility as the root cause, and presents a streamlined script‑based solution that reduced CPU load thirtyfold.

CPU optimization · Java performance · Operations
0 likes · 11 min read
Efficient Ops
Jul 7, 2024 · Operations

How Suzhou Bank’s Mobile Banking 5.0 Sets a New Standard for DevOps in Banking

Suzhou Bank’s Mobile Banking 5.0 platform, showcased at the 23rd GOPS Global Operations Conference, demonstrates how a unified micro‑service architecture, advanced security technologies, and a comprehensive DevOps platform can elevate development efficiency, meet international standards, and drive innovative financial services.

Continuous Delivery · DevOps · Microservices
0 likes · 4 min read
Test Development Learning Exchange
Jul 6, 2024 · Operations

10 Practical Python Automation Scripts for File Management, Web Scraping, Data Cleaning, and More

This article presents ten useful Python automation scripts covering file renaming, web page downloading, data cleaning, scheduled tasks, email sending, testing, database backup, log analysis, file compression, and document generation, each with clear explanations and ready‑to‑run code examples.

CodeExamples · Operations · Python
0 likes · 7 min read