Tagged articles
MaGe Linux Operations
Mar 18, 2024 · Cloud Native

Is Your Kubernetes Setup Secure? A Complete Best‑Practice Checklist

This article provides a thorough checklist covering application deployment, service governance, and cluster configuration in Kubernetes, including health probes, graceful shutdown, fault tolerance, resource limits, labeling, logging, scaling, RBAC, network policies, and compliance with CIS benchmarks.

Cloud Native · Kubernetes · Operations
0 likes · 27 min read
php Courses
Mar 18, 2024 · Operations

Understanding Load Balancing and Its Implementation with Docker and Nginx

This article explains the concept and importance of load balancing, then demonstrates a practical Docker‑Compose setup with multiple PHP containers and an Nginx reverse proxy, including configuration files and test results that show how traffic is distributed to improve system reliability and performance.

Docker · Nginx · Operations
0 likes · 5 min read
Architect
Mar 16, 2024 · Operations

How Unified Alert Convergence Can Slash Monitoring Noise and Boost MTTA/MTTR

This article analyzes the shortcomings of fragmented monitoring systems, defines key metrics such as MTTA and MTTR, proposes a unified alert convergence architecture using Redis delayed queues, and details design, implementation, and future AI‑enhanced improvements to reduce alert fatigue and accelerate incident response.

MTTA · MTTR · Operations
0 likes · 22 min read
Practical DevOps Architecture
Mar 15, 2024 · Operations

Comprehensive Practical Guide to Prometheus Configuration, Optimization, and Source Code Development

This multi‑chapter guide provides in‑depth, hands‑on instruction for configuring and optimizing all Prometheus components, exploring Kubernetes monitoring, source‑code analysis, custom exporter development, high‑availability setups, service discovery, resource‑efficient scraping, and integrating Thanos for long‑term storage.

Kubernetes · Operations · Prometheus
0 likes · 4 min read
Efficient Ops
Mar 13, 2024 · Operations

What Does an Operations Engineer Do? Skills, Tools, and Career Path

This article explains the role of an operations engineer, covering daily responsibilities, essential knowledge such as Linux and networking, common monitoring tools, and emerging career paths like DevOps, AIOps, and SRE, helping newcomers understand how to start and grow in the field.

DevOps · Linux · Operations
0 likes · 6 min read
Model Perspective
Mar 13, 2024 · Operations

Evaluating City Efficiency with DEA’s CCR and BCC Models

This article introduces Data Envelopment Analysis (DEA) as a non‑parametric method for assessing relative efficiency of decision‑making units, explains the CCR and BCC models, and demonstrates their application in evaluating and comparing the efficiency of various U.S. cities using real‑world data.

BCC · CCR · DEA
0 likes · 9 min read
Linux Cloud Computing Practice
Mar 13, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, helping professionals streamline automation, monitoring, configuration, and deployment tasks and improve overall system reliability.

Infrastructure · Operations · monitoring
0 likes · 6 min read
macrozheng
Mar 12, 2024 · Operations

Why HertzBeat Could Be Your Next Agentless Monitoring Solution

This article introduces HertzBeat, an open‑source real‑time monitoring and alerting system that offers powerful template‑based monitoring without agents, explains its Docker‑quick start, demonstrates how to monitor Redis and SpringBoot services, and walks through email alarm configuration.

Operations · agentless · redis
0 likes · 7 min read
Efficient Ops
Mar 11, 2024 · Operations

Essential Linux Ops: Proven Troubleshooting Steps for Common Failures

This guide outlines a systematic Linux operations troubleshooting framework—emphasizing error messages, log analysis, root‑cause isolation, and step‑by‑step solutions for six real‑world scenarios ranging from filesystem corruption to inode exhaustion and read‑only file‑system errors.

Linux · Operations · Shell Commands
0 likes · 7 min read
21CTO
Mar 11, 2024 · Operations

How Netlify’s AI Debugger Turns Failed Deploys into Quick Fixes

Netlify’s new AI‑assisted deployment feature automatically analyzes build failures, offers diagnostic suggestions, and helps developers resolve issues faster, though its recommendations are best‑effort and may require manual verification.

AI debugging · Deployment · Netlify
0 likes · 5 min read
DevOps Operations Practice
Mar 10, 2024 · Operations

Key Competencies for an Excellent Operations Director

The article outlines the essential technical knowledge, team management, project management, cross‑department coordination, strategic planning, and leadership abilities required for a senior operations director to succeed and advance toward executive roles.

Leadership · Operations · Project Management
0 likes · 5 min read
Open Source Linux
Mar 7, 2024 · Operations

How to Fix Disk‑Full Issues in Legacy Kubernetes Clusters Using Docker

This guide explains why old Kubernetes clusters that use Docker can run out of disk space, describes the symptoms such as pods stuck in ContainerCreating, and provides step‑by‑step commands to clean Docker files, prune images, adjust kubelet settings, and prevent future disk‑full problems.

Disk Cleanup · Garbage Collection · Operations
0 likes · 11 min read
dbaplus Community
Mar 5, 2024 · Operations

How to Recover a Failing Elasticsearch Cluster: Master Loss, Shard Corruption, and More

This guide explains Elasticsearch cluster architecture, node roles, and metadata storage, then details step‑by‑step recovery procedures for master‑node loss, complete master outage, data‑node failures, shard allocation problems, corrupted shards, translog issues, and missing segment files, including relevant API commands and tool usage.

Cluster Recovery · Data Node · Elasticsearch
0 likes · 17 min read
JD Retail Technology
Mar 5, 2024 · Operations

Rethinking DevOps: The Rise of Platform Engineering and Its Impact on Software Delivery

This article examines the growing tension between traditional DevOps practices and the emerging concept of platform engineering, exploring why developers resist operational duties, the core principles of platform engineering, success factors, metrics, and future trends shaping software delivery in modern organizations.

Operations · internal platforms · platform engineering
0 likes · 14 min read
Open Source Tech Hub
Mar 5, 2024 · Operations

How to Expose Intranet Web Services with Custom Domains Using frp

This guide explains what frp is, why it’s a strong reverse‑proxy choice, and provides step‑by‑step instructions—including configuration files, port opening, and domain setup—to expose internal web services through custom domains securely.

Custom Domain · Network Configuration · Operations
0 likes · 7 min read
Open Source Linux
Mar 1, 2024 · Operations

How Two‑Site Three‑Center Disaster Recovery Boosts Business Continuity with Oracle Data Guard

The two‑site three‑center disaster recovery model combines a production site, a same‑city backup, and a remote backup to ensure data integrity and rapid recovery, leveraging Oracle Data Guard for synchronized and asynchronous replication, thereby improving RPO and RTO across various disaster scenarios.

Operations · Oracle Data Guard · business continuity
0 likes · 4 min read
Efficient Ops
Feb 27, 2024 · Operations

Master Docker Logging and Graylog Integration: A Step‑by‑Step Guide

This guide explains how Docker captures container output, stores it as JSON logs, configures various log drivers, and integrates with Graylog for centralized log management, including deployment, input setup, and sending logs from containers via Docker run or docker‑compose.

Container · Docker · Docker Compose
0 likes · 8 min read
Volcano Engine Developer Services
Feb 22, 2024 · Cloud Native

How BMQ’s Cloud‑Native Compute‑Storage Separation Revolutionizes Message Queues

This article explains how ByteDance’s BMQ, a cloud‑native message engine with a compute‑storage separated architecture, overcomes Kafka’s scalability and operational limits by using Proxy, Broker, Coordinator, and Controller modules, a distributed storage model, and advanced caching to achieve rapid scaling, high throughput, and resilient operations.

Cloud Native · Message Queue · Operations
0 likes · 15 min read
Efficient Ops
Feb 21, 2024 · Operations

Why Organizational DevOps Assessments Are Critical for 2024‑2027 Tech Maturity

The article explains how Gartner predicts DevOps will reach production maturity by 2024‑2027, describes CAICT's organization‑level DevOps assessment framework, its standards, classification rules, statistical results across industries, and the tangible benefits reported by participating enterprises.

Capability Maturity · DevOps · Operations
0 likes · 8 min read
Efficient Ops
Feb 19, 2024 · Operations

Mastering Prometheus: Practical Tips for Effective Application Monitoring

This article explains how to design and implement Prometheus metrics for application monitoring, covering the selection of monitoring targets, golden metrics, label conventions, naming rules, histogram bucket choices, and Grafana visualization tricks to help engineers build reliable observability pipelines.

Grafana · Metrics · Operations
0 likes · 10 min read
Alibaba Cloud Developer
Feb 18, 2024 · Operations

Why Software Supply Chain Consistency Is the Hidden Cost of Scaling

Software development involves both value‑creating features and unavoidable maintenance costs; this article explains how the hidden software supply chain—frameworks, libraries, runtime, cloud services, and configurations—creates consistency challenges, and proposes strategies such as explicit declarations, IaC, serverless, and mono‑repo to reduce scaling costs.

Operations · Scalability · Serverless
0 likes · 21 min read
ITPUB
Feb 17, 2024 · Operations

Why Ops Professionals Must Look Up: The 4+1+1+1 Framework Explained

The article reflects on the relentless challenges of IT operations, outlines the never‑ending skill gaps, standards, trends, and blame that ops teams face, and introduces a 4+1+1+1 model that separates developers, testers, and security staff from the four core ops responsibilities to guide systematic construction of an ops system.

4+1+1+1 model · IT ops · Infrastructure Management
0 likes · 6 min read
Architect's Guide
Feb 15, 2024 · Operations

Common ELK Deployment Architectures and Practical Solutions for Log Management

This article introduces the core components of the ELK stack, compares three typical deployment architectures—including Logstash‑only, Filebeat‑assisted, and Kafka‑backed designs—and provides concrete configuration examples and troubleshooting tips for multiline merging, timestamp handling, and module‑level log filtering.

ELK · Elasticsearch · Filebeat
0 likes · 11 min read
21CTO
Feb 7, 2024 · Operations

Master Your Developer Workflow: Proven Time‑Management Techniques

This article explains why effective time management is essential for developers, explores psychological, physiological, and technical dimensions, and presents practical techniques such as weekly planning, the Pomodoro method, goal‑based planning, and the Eisenhower matrix to boost productivity and work‑life balance.

Developer Workflow · Operations · pomodoro
0 likes · 13 min read
JD Cloud Developers
Feb 6, 2024 · Operations

How We Boosted Nginx Performance 50× by Tuning Gzip Settings

This article documents a real‑world Nginx optimization case in which adjusting gzip compression levels and switching to static gzip dramatically reduced CPU usage, allowing a load of 90,000 QPS to be handled at only 7% CPU, a performance gain of more than 50×.

Backend · Gzip · Nginx
0 likes · 8 min read
Rare Earth Juejin Tech Community
Feb 5, 2024 · R&D Management

Comprehensive Guide to the Workflow Management System: Framework, Features, Process Design, and Operations

This document provides a detailed English guide to a workflow management system, covering its underlying frameworks, feature list, process design operations, UI components, form design, deployment steps, and user interactions such as initiating, reviewing, and handling tasks.

Operations · Process Design · R&D management
0 likes · 17 min read
dbaplus Community
Feb 4, 2024 · Operations

How Ant Group Leverages SLO and AIOps for Fine‑Grained Operations

This article details Ant Group's practical implementation of Service Level Objectives (SLO) and AIOps to achieve fine‑grained operations, covering SLO fundamentals, health‑score architecture, GitOps‑based data pipelines, error‑budget alerting, AI‑driven anomaly detection, fault localization techniques, and real‑world case studies on dashboards, Kubernetes SLOs, and emergency response workflows.

Error Budget · Fault Localization · Kubernetes
0 likes · 38 min read
Data Thinking Notes
Jan 30, 2024 · Operations

How Banks Can Build an Effective Data Governance Framework

This article outlines a two‑step approach for banks to design a data governance system—clarifying organizational responsibilities and constructing a layered institutional framework—while detailing cross‑department collaboration, head‑office and branch coordination, and practical policy, procedure, and work‑detail levels to sustain continuous improvement and support digital transformation.

Banking · Data Governance · Data Management
0 likes · 10 min read
dbaplus Community
Jan 29, 2024 · Artificial Intelligence

How Meituan Uses AIOps to Revolutionize Incident Management

This article details Meituan's two‑year exploration of AIOps for incident management, covering the challenges of massive, real‑time operational data, the AI‑driven modules for risk prevention, fault detection, diagnosis, and similar‑incident recommendation, and future directions such as intelligent log detection and change recognition.

Operations · Root Cause Analysis · aiops
0 likes · 22 min read
21CTO
Jan 28, 2024 · Operations

Why IPv4 Is Getting Expensive and How to Overcome IPv6 Migration Challenges

The article explains IPv4 address exhaustion, the emerging fees for public IPv4, and the technical, operational, and tooling hurdles that organizations face when transitioning to IPv6, while outlining three strategic options and real‑world migration experiences.

IPv4 · IPv6 · Network Migration
0 likes · 13 min read
Architect
Jan 28, 2024 · Operations

How We Built Real‑Time SLA Monitoring for Message Push and Doubled Throughput

This article details the end‑to‑end design, node‑level splitting, metric definition, and Spring‑based implementation of SLA monitoring for a high‑volume message‑push system, showing how precise latency and vendor‑stability metrics uncovered bottlenecks, enabled rapid issue detection, and ultimately doubled overall throughput.

Message Push · Microservices · Operations
0 likes · 14 min read
Architect
Jan 27, 2024 · Industry Insights

How We Built a Scalable Smart Customer Service System for an Activity Platform

This article details the end‑to‑end design, implementation, and operational results of a smart customer‑service platform that automates FAQ capture, leverages both Elasticsearch and LLM‑based models, and provides a low‑code, multi‑team backend for rapid issue resolution.

Elasticsearch · Microservices · Operations
0 likes · 13 min read
Python Programming Learning Circle
Jan 27, 2024 · Operations

Automating Log Monitoring, Email Reporting, and DingTalk Alerts with Python

This article presents a Python‑based solution that queries LogEasy data, calculates key metrics such as total requests, 5xx errors, average response time, and unique visitors, formats the results into Excel and HTML reports, and automatically sends them via email and DingTalk alerts for operational monitoring.

DingTalk · Log Monitoring · Operations
0 likes · 30 min read
IT Services Circle
Jan 25, 2024 · Operations

How to Resolve Online Message Queue Backlog Issues

This article explains why message queues can become backlogged, identifies producer and consumer causes, and provides practical strategies—including adding consumers, increasing queue capacity, optimizing consumption logic, implementing failure handling, and rapid remediation steps—to quickly resolve backlog in production environments.

Backlog · Message Queue · Operations
0 likes · 7 min read
Efficient Ops
Jan 24, 2024 · Backend Development

Mastering Nginx: Reverse Proxy, Load Balancing, and High Availability Explained

This comprehensive guide introduces Nginx’s high‑performance architecture, explains forward and reverse proxy concepts, demonstrates load‑balancing and static‑dynamic content separation, provides practical configuration commands, and walks through real‑world setups for reverse proxy, load‑balancing, static‑dynamic separation, and high‑availability using Keepalived.

Nginx · Operations · Server Configuration
0 likes · 16 min read
DevOps
Jan 23, 2024 · Operations

Collection of Bash Scripts for Server Monitoring, Automation, and Deployment

This article provides a curated set of Bash scripts covering MySQL replication monitoring, directory change detection, bulk user creation, website health checks, remote command execution, LNMP stack deployment, server resource reporting, high‑resource process identification, and automated deployment of Java and PHP projects, offering practical automation tools for system administrators.

Bash · Deployment · Operations
0 likes · 12 min read
dbaplus Community
Jan 22, 2024 · Operations

How NetEase Cloud Music Built a Resilient RPC Framework for Microservices

This article details the practical steps and architectural choices NetEase Cloud Music took to improve RPC stability in a micro‑service environment, covering service discovery, connection management, cloud‑native challenges, SLO design, log governance, degradation, rate limiting, outlier detection, thread‑pool isolation, fast‑failure handling, registry optimizations, multi‑registry support, and post‑incident knowledge‑base building.

Cloud Native · Operations · RPC
0 likes · 14 min read
Architecture Digest
Jan 17, 2024 · Operations

Comprehensive Guide to Workflow Process Design, Deployment, and Management

This guide explains how to create, view, edit, and design workflow processes, describes the components of the process designer—including drag‑panel, canvas, property and control panels—covers form design, deployment, process definition, request initiation, task handling, approval actions, delegation, and related source code references.

Operations · Process Design · form design
0 likes · 10 min read
Efficient Ops
Jan 16, 2024 · Operations

How Top Chinese Exchanges Accelerated DevOps Maturity: Insights from CAICT Assessments

Amid a nationwide digital transformation push, four leading Chinese exchanges adopted the CAICT DevOps Capability Maturity Model, achieving multiple level‑3 and level‑2 assessments that boosted IT efficiency, integrated resources, and better supported business systems, offering valuable lessons for the industry.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 8 min read
Efficient Ops
Jan 15, 2024 · Operations

How Chinese City Banks Boost IT Efficiency with the DevOps Maturity Model

Amid a nationwide digital transformation push, twelve Chinese city commercial banks adopted the CAICT‑led DevOps Capability Maturity Model, achieving higher IT efficiency, integrated resources, and faster, higher‑quality service delivery across continuous delivery, technical operations, security, and performance measurement standards.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 18 min read
Efficient Ops
Jan 15, 2024 · Operations

How China’s Top Banks Accelerate IT Efficiency with DevOps Maturity Assessments

Seven leading Chinese joint‑stock banks have evaluated a total of 62 projects against the CAICT DevOps Capability Maturity Model, revealing how continuous delivery, technical operation, security, and performance measurement standards are driving IT efficiency, cultural change, and faster value delivery across the financial sector.

DevOps · IT efficiency · Maturity Model
0 likes · 18 min read
DevOps
Jan 12, 2024 · Operations

Why Building a Never‑Failing System Is Impossible and How to Pursue Continuous High Availability

The article analyses why truly never‑failing systems cannot exist—citing entropy and Murphy’s laws—examines the organizational and technical obstacles to continuous high availability, and offers practical cultural and engineering practices such as testing, code review, monitoring, and regular system health checks to mitigate risk.

Murphy's Law · Operations · SRE
0 likes · 14 min read
Liangxu Linux
Jan 10, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This guide introduces ten widely used operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical scenarios, advantages, and practical examples to help engineers choose the right solution for automation, monitoring, and management tasks.

Configuration Management · Operations · devops tools
0 likes · 8 min read
Efficient Ops
Jan 9, 2024 · Operations

35 Must‑Know Linux Operations Interview Questions & Answers

This comprehensive guide compiles 35 essential Linux operations interview questions covering server management, RAID configurations, load balancing with LVS/Nginx/HAProxy, proxy choices, middleware, MySQL troubleshooting, networking tools, security practices, and practical scripts, providing concise answers to help candidates ace DevOps and sysadmin roles.

Linux · Operations · interview
0 likes · 34 min read
Efficient Ops
Jan 9, 2024 · Operations

What Do 2023 DevOps & AIOps Assessments Reveal About China’s Digital Transformation?

Amid China's sweeping digital, networked, and intelligent transformation, over 100 leading enterprises across banking, finance, communications, manufacturing, and other sectors have participated in DevOps and AIOps maturity model evaluations, providing a comprehensive view of industry adoption, capability levels, and emerging best practices for 2023.

DevOps · Digital Transformation · Operations
0 likes · 15 min read
High Availability Architecture
Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

NLP · Operations · Root Cause Analysis
0 likes · 24 min read
dbaplus Community
Jan 8, 2024 · Operations

How a Simple Time Adjustment Sparked a Massive Outage: Real Ops Incident Stories

Three real-world operations mishaps are recounted—a mistaken system‑time change that logged out thousands of users, an accidental bulk delete of database accounts, and a failed glibc downgrade that stalled a software release—illustrating the cascading impact of small errors and the urgent remediation steps taken.

Linux · Operations · Sysadmin
0 likes · 8 min read
Efficient Ops
Jan 8, 2024 · Operations

What Do 2023 DevOps & AIOps Assessments Reveal About China’s Digital Transformation?

Amid China's sweeping digital transformation, the China Academy of Information and Communications Technology (CAICT) reports that 104 leading enterprises across banking, securities, insurance, telecom, manufacturing and other sectors have completed 336 DevOps maturity assessments and 23 enterprises have finished 45 AIOps assessments in 2023, highlighting industry‑wide adoption of DevOps and AIOps standards and offering detailed breakdowns by sector, evaluation levels, and future guidance.

DevOps · Digital Transformation · Maturity Model
0 likes · 16 min read
Efficient Ops
Jan 8, 2024 · Information Security

How a Securities Firm Built a 100‑Day DevSecOps Prototype

At the 21st GOPS Global Operations Conference in Shanghai, Shenwan Hongyuan Securities' application security lead Wang Biansi detailed a step‑by‑step 100‑day journey to create a DevSecOps sample room, covering goal setting, research, platform design, tool integration, and security training.

Application Security · DevSecOps · Information Security
0 likes · 5 min read
Zhuanzhuan Tech
Jan 5, 2024 · Operations

Building an Integrated Monitoring Platform: Architecture, Implementation, and Lessons from Zhuanzhuan

This article presents a detailed case study of how Zhuanzhuan designed and implemented a unified monitoring platform that combines business services, middleware, and operations resources. The team selected Prometheus and M3DB, automated Grafana dashboards, and created a low‑noise alerting system, achieving large‑scale observability with significant cost and efficiency gains.

Alerting · M3DB · Operations
0 likes · 21 min read
21CTO
Dec 30, 2023 · Operations

How G Bank Turns Application Monitoring into Business‑Driven Visual Operations

This article examines how G Bank builds an application monitoring system based on ITIL and Google SRE principles, identifies its shortcomings, and evolves the platform into a visualized operations solution that aligns technical and business perspectives for faster incident resolution and improved customer experience.

Banking · ITIL · Operations
0 likes · 11 min read
Architect
Dec 29, 2023 · Industry Insights

How Bilibili Built a Scalable Anti‑Crawling System: Architecture, Data Flow, and Real‑World Impact

The article details Bilibili's comprehensive anti‑crawling solution, covering the problem background, a two‑layer detection framework integrated with APIGW and GAIA, risk perception, strategy iteration, verification mechanisms, quantitative results, and future improvement directions, all illustrated with concrete examples and performance numbers.

API Security · Bilibili · Operations
0 likes · 23 min read
JD Retail Technology
Dec 29, 2023 · Operations

Bug Bash Practice Guide for Big Data Real‑Time Platform Teams

This guide details how the Big Data Real‑Time Platform department organized a Bug Bash activity to train new staff, enhance cross‑product knowledge, improve product quality, and strengthen team collaboration through structured preparation, execution, and post‑event analysis.

Big Data · Bug Bash · Operations
0 likes · 8 min read
ITPUB
Dec 27, 2023 · Operations

When a Snapshot Share Became a Data Leak: Lessons from a Cloud Ops Failure

A developer mistakenly set a cloud disk snapshot to public, exposing a major client’s data, and recounts the frantic rollback, the ensuing panic among teammates, and the hard‑won operational lessons about high‑risk manual tasks, proper safeguards, and the need for visualized tooling.

Operations · data security · incident response
0 likes · 10 min read
Zhuanzhuan Tech
Dec 23, 2023 · Operations

Investigation of Zookeeper 3.4.6 Election Port (3888) Failure Caused by Malformed Packets

This article walks through troubleshooting a ZooKeeper 3.4.6 cluster whose election port 3888 became unresponsive after malformed packets triggered a NegativeArraySizeException. It covers the diagnostic steps and root‑cause analysis, and recommends upgrading to a newer version to fix the issue.

ApacheZookeeper · ClusterTroubleshooting · ElectionPort
0 likes · 11 min read
Efficient Ops
Dec 21, 2023 · Operations

How China Galaxy Securities Achieved Level 3 DevOps Continuous Delivery – A Success Story

China Galaxy Securities details how three core projects passed the DevOps Continuous Delivery Level‑3 assessment, highlighting tool upgrades, process improvements, metric gains, cultural shifts, and future plans that illustrate the tangible benefits of standardized DevOps practices in a financial institution.

Continuous Delivery · DevOps · Maturity Model
0 likes · 15 min read
Meituan Technology Team
Dec 21, 2023 · Operations

AIOps for Incident Management: Practices and Insights from Meituan

Meituan’s service‑operations team applies AIOps across prevention, detection, and post‑incident stages—using change‑risk analysis, real‑time graph‑based anomaly detection, similarity‑driven root‑cause diagnosis, and NLP‑powered incident recommendation—to achieve sub‑second detection, high precision, 28% faster fault handling, and plans for intelligent log and change recognition.

Operations · Root Cause Analysis · aiops
0 likes · 24 min read
dbaplus Community
Dec 20, 2023 · Operations

Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.

Cluster Management · Kafka · Operations
0 likes · 21 min read
Efficient Ops
Dec 20, 2023 · Operations

How Bilibili Implements SLO Engineering to Boost Service Reliability

This article details Bilibili's practical SLO engineering approach, covering foundational components, SLI selection, application and business level SLIs, alerting strategies, SLO‑driven quality operations, and the GOC framework for rapid fault discovery, localization, and recovery, illustrating how reliability is systematically improved.

Operations · SLO · reliability engineering
0 likes · 16 min read
Efficient Ops
Dec 20, 2023 · Operations

How China’s Aviation IT System Achieved Leading DevOps Standards

The article details how China Civil Aviation Information Network's flight management system passed the CAICT DevOps Technical Operations Level 2+ assessment, shares interview insights into the project's design and operational improvements, and discusses the broader significance of DevOps standards for digital transformation in the aviation industry.

DevOps · Digital Transformation · IT Governance
0 likes · 13 min read
Efficient Ops
Dec 19, 2023 · Operations

How Zhongtai Securities Achieved Advanced DevOps Standards: A Success Story

Zhongtai Securities’ Centralized Operations Platform passed the CAICT DevOps Technical Operations Level‑2 assessment, showcasing how standardized DevOps practices and tool empowerment can boost quality, efficiency, and digital transformation across banking, securities, and other industries.

DevOps · Digital Transformation · Operations
0 likes · 12 min read
Efficient Ops
Dec 19, 2023 · Operations

How a Chinese Trust Firm Achieved Top‑Tier DevOps Continuous Delivery: A Success Story

Minmetals International Trust’s OGP platform passed the CAICT DevOps Continuous Delivery Level 3 assessment, the first such certification in China's trust industry; the interview reveals how standards‑based DevOps, team restructuring, automation, and cloud‑native architecture boosted efficiency, quality, and security.

Cloud Native · Continuous Delivery · DevOps
0 likes · 14 min read
Efficient Ops
Dec 18, 2023 · Operations

Zhongtai Securities’ Path to Advanced DevOps Standards: Inside Their Assessment Success

Zhongtai Securities’ centralized operations platform recently passed the China Academy of Information and Communications Technology’s DevOps Technical Operations Level‑2 assessment, showcasing how standardized DevOps practices, tool empowerment, and rigorous evaluation can boost quality, efficiency, and digital transformation across financial institutions.

DevOps · Digital Transformation · Operations
0 likes · 14 min read
Efficient Ops
Dec 18, 2023 · Operations

How Jinzhou Bank Reached Domestic Leading Level 3 DevOps Continuous Delivery

Jinzhou Bank’s project to migrate its mobile banking investment service to microservices passed the CAICT DevOps Continuous Delivery Level 3 assessment, showcasing how standardized DevOps practices, tool empowerment, and agile adoption dramatically improved delivery speed, quality, and competitive advantage in the financial sector.

Continuous Delivery · DevOps · Operations
0 likes · 13 min read
DaTaobao Tech
Dec 18, 2023 · Industry Insights

Unlocking E‑Commerce Success: Core Principles and Data‑Driven Strategies Behind Modern Online Retail

This comprehensive guide explains what e‑commerce operation entails, breaks down its six functional areas, compares internet and e‑commerce operations, and presents data‑driven tactics—including conversion funnel analysis, traffic optimization, and average order value improvement—to help businesses boost efficiency and revenue.

Conversion · Data-driven · Marketing
0 likes · 32 min read
Efficient Ops
Dec 17, 2023 · Operations

How a Chinese Trust Firm Achieved Top‑Tier DevOps Continuous Delivery Certification

In a detailed interview, Minmetals International Trust explains how its self‑developed Operations Guarantee Platform passed the CAICT DevOps Continuous Delivery Level 3 assessment, showcasing the benefits of standardized DevOps practices, including improved efficiency, quality, and security, and the broader impact on its digital transformation and industry adoption.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 14 min read
Efficient Ops
Dec 17, 2023 · Operations

How FAW‑Volkswagen Reached Top‑Tier DevOps Continuous Delivery: Practices, Metrics & Lessons

The interview reveals how FAW‑Volkswagen leveraged the CAICT DevOps maturity model to achieve Level 3 continuous delivery for its OTD order‑delivery platform and multi‑functional dealer ecosystem, detailing the standards, implementation steps, performance metrics, challenges faced, and future plans for broader digital transformation.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 14 min read
Efficient Ops
Dec 17, 2023 · Operations

How China Postal Savings Bank Achieved Leading‑Edge DevOps Automation Standards

China Postal Savings Bank’s software R&D center detailed how its "Star Platform" earned top‑level DevOps system and tool assessments, showcasing the bank’s automation capabilities, the evaluation process, key improvements, and future plans for expanding DevOps and XOps practices across the organization.

DevOps · Operations · Standard Assessment
0 likes · 14 min read
dbaplus Community
Dec 17, 2023 · Operations

Why Kubernetes Needs an LTS Release: Balancing Stability and Speed

The article examines the rapid Kubernetes upgrade cycle, the operational challenges it creates for teams, argues for a long‑term support (LTS) version, weighs pros and cons, and proposes compromise solutions to improve cluster stability without sacrificing innovation.

Cluster Upgrade · Kubernetes · LTS
0 likes · 10 min read
Efficient Ops
Dec 16, 2023 · Operations

How a Chinese Trust Firm Earned Top‑Tier DevOps Continuous Delivery (Level 3)

Minmetals International Trust’s OGP platform passed the China Academy of Information and Communications Technology (CAICT) DevOps Continuous Delivery Level 3 assessment, showcasing how standardized DevOps practices, cloud‑native microservices, and automated pipelines can boost efficiency, quality, and security, while offering insights into the evaluation process and future plans.

Cloud Native · Continuous Delivery · DevOps
0 likes · 15 min read
Efficient Ops
Dec 16, 2023 · Operations

How China’s Aviation IT Leader Earned Top‑Tier DevOps Certification

The article details how China Civil Aviation Information Network passed the DevOps Level 2+ assessment, highlighting the flight management system’s cloud‑native architecture and high‑concurrency capabilities, and the broader impact of CAICT’s DevOps standards on digital transformation across industries.

Aviation IT · Cloud Native · DevOps
0 likes · 12 min read
Efficient Ops
Dec 14, 2023 · Cloud Native

Hybrid Cloud Container Stability: Qunar Travel’s Proven Practices from GOPS 2023

At the 21st GOPS Global Operations Conference in Shanghai, Qunar Travel’s tech expert Zou Sheng shared a detailed hybrid‑cloud container stability practice covering IDC‑first deployment, resource utilization over 60%, phased migration, reliability improvements, AZ monitoring, and cost‑saving strategies.

Container Stability · DevOps · Operations
0 likes · 3 min read
Ctrip Technology
Dec 14, 2023 · Operations

Improving Optical Transport Network Reliability at Ctrip: Architecture, Issue Analysis, and Optimization Strategies

This article describes Ctrip's optical transport network (TOTN) architecture, analyzes frequent fiber‑cut incidents and resulting device port flapping, presents technical research on fast optical switching and alarm delay, and details an optimization plan that achieved sub‑100 ms fault‑free switchover and stable Redis performance.

DCI · Link Delay · Network Reliability
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Dec 14, 2023 · Operations

How GitOps Transforms Change Management: Automation, Code, and Transparency

GitOps leverages Git's version‑control strengths to automate, codify, and make transparent infrastructure changes, combining IaC, merge requests, and CI/CD, while exploring its principles, toolchains like FluxCD, ArgoCD, Jenkins X, and practical implementations such as SRE Stack for end‑to‑end change management.

Cloud Native · GitOps · Infrastructure as Code
0 likes · 17 min read
dbaplus Community
Dec 13, 2023 · Databases

Tackling the Top 8 Challenges of Domestic Databases in Banking and Proven Strategies

The article examines the rapid growth of domestic databases in China’s banking sector, identifies eight critical pain points—from product stability and resource consumption to tooling gaps and migration difficulties—and offers detailed countermeasures covering version upgrade planning, resource optimization, functional testing, skill development, monitoring, ecosystem building, data migration, and backup‑recovery improvements.

Operations · databases · domestic
0 likes · 16 min read
Qunhe Technology Quality Tech
Dec 12, 2023 · Operations

How We Built a Stable Offline Testing Environment with Cloud‑Native Practices

This article details the challenges of managing a complex, multi‑layered offline testing environment at KuJiaLe, outlines the standardization of baseline, functional, and integration environments, and explains the comprehensive stability measures—including infrastructure upgrades, automated checks, emergency response, and daily operations—that dramatically improved reliability.

Cloud Native · Operations · environment management
0 likes · 14 min read