Tagged articles
3281 articles
Efficient Ops
Jan 11, 2023 · Operations

How Guangdong Mobile’s CRM Achieved Leading DevOps Operational Maturity

Guangdong Mobile’s CRM system, which supports over 130 million users, passed the DevOps technical‑operation Level 2+ assessment of the China Academy of Information and Communications Technology (CAICT), a landmark result for standardized, tool‑enabled DevOps practices that improve quality, safety, and market competitiveness.

CRM · Continuous Delivery · DevOps
0 likes · 11 min read
Efficient Ops
Jan 11, 2023 · Operations

How a Securities Firm Achieved DevSecOps Maturity to Boost Transformation

The article details how China’s CITIC Securities leveraged the national DevOps and DevSecOps maturity models, passed Level 2 security assessments, and integrated cultural, procedural, and technical practices to enhance its institutional business service platform, improve security, and accelerate its digital transformation.

DevOps · DevSecOps · Digital Transformation
0 likes · 11 min read
Efficient Ops
Jan 10, 2023 · Operations

How a New Distributed Core Trading System Earned Top DevOps Ratings at China Securities

In a recent interview, the head of the System Operations Department at China Merchants Securities explains how their next‑generation core trading system, built on a distributed micro‑service architecture with open‑source components and cloud‑native tools, achieved Level 2 technical‑operation DevOps certification, detailing the challenges, improvements, and future plans for digital transformation.

Cloud Native · DevOps · Operations
0 likes · 15 min read
Alibaba Cloud Developer
Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Operations · capacity planning · incident response
0 likes · 25 min read

Loggie: A High-Performance Log Collection Agent System Design and Implementation

Loggie is a cloud-native, Go-based log-collection agent that replaces Filebeat and Flume by using a micro-kernel producer-consumer architecture with hot-swappable pipelines, achieving 2 GB/s read speeds, 1.6‑2.6× higher throughput while using only a quarter of the CPU, and providing built-in observability, reliability, and latency monitoring for large-scale enterprise deployments.

Go · Operations · log agent
0 likes · 16 min read
Ctrip Technology
Jan 6, 2023 · Operations

iDesk Service Platform: Architecture, Development Stages, Core Features, and Operational Insights

The iDesk service platform is a comprehensive internal tool that evolved through three development phases, adopts a BS+Service architecture with modular local services, offers extensive software management and self‑service utilities, integrates tightly with TripPal and service accounts, and implements robust operational monitoring to achieve high availability and user satisfaction.

Backend Architecture · Operations · Software Management
0 likes · 15 min read
Zhuanzhuan Tech
Jan 4, 2023 · Operations

Evolution of Zhuanzhuan Test Environment Governance: From Physical Isolation to Tag‑Based Traffic Routing

This article describes how Zhuanzhuan’s testing environment evolved through three versions (physical isolation, automatic‑IP‑tag routing, and manual‑tag routing), detailing the architectural background, implementation principles, advantages, drawbacks, and supporting tools that dramatically reduced deployment time and resource consumption while introducing new operational challenges.

Cloud Native · Operations · service governance
0 likes · 23 min read
Efficient Ops
Jan 2, 2023 · Operations

How China’s Bank of Communications Achieved Leading DevOps Maturity

In this interview, Liu Lei, General Manager of the Bank of Communications Software Development Center, explains how three flagship projects passed the DevOps Continuous Delivery Level‑3 assessment, detailing the standards, metrics, tooling improvements and the broader impact on the bank’s digital transformation.

Bank of Communications · Continuous Delivery · DevOps
0 likes · 14 min read
Java High-Performance Architecture
Jan 2, 2023 · Backend Development

How to Build a High‑Availability Payment System with Smart Routing

This article explains how a fintech payment platform achieves high availability and optimal channel selection by using decision‑tree routing, sliding‑window negative‑feedback, pressure‑detection services, and component fallback strategies such as RabbitMQ with Redis, supporting millions of daily transactions.

Backend Architecture · Operations · Routing Algorithm
0 likes · 13 min read
Top Architect
Dec 31, 2022 · Operations

Optimizing System Performance and Workflow: From Technical Metrics to DevOps Process Improvement

The article illustrates how to improve the efficiency of an image‑recognition service by measuring performance, redesigning architecture with parallel processing and message queues, and then extends the analogy to enterprise workflow optimization, emphasizing the need to quantify, visualize, and continuously refine DevOps processes.

DevOps · Operations · System Architecture
0 likes · 11 min read
Architecture Digest
Dec 31, 2022 · Operations

Log Size Reduction Techniques: Methodology and Case Study

This article explains why excessive INFO‑level logs can cause performance problems, presents three practical strategies—printing only necessary logs, merging log entries, and simplifying log content with code examples—and demonstrates their impact through a real‑world Java bean pipeline case that cuts daily log volume from about 5 GB to under 1 GB.

Operations · java · log optimization
0 likes · 7 min read
Open Source Linux
Dec 30, 2022 · Operations

Top 7 Kubernetes Management Tools to Simplify Cluster Operations

This article introduces seven popular Kubernetes management solutions—including K9s, Rancher, the native Dashboard with Kubectl and Kubeadm, Helm, KubeSpray, Kontena Lens, and WKSctl—detailing their key features, usage scenarios, and how they help streamline cluster monitoring, deployment, scaling, and security across cloud‑native environments.

Cluster Management · DevOps · Kubernetes
0 likes · 9 min read
Efficient Ops
Dec 29, 2022 · Operations

How China Agricultural Bank Earned Level‑3 DevOps Application Design Certification

China Agricultural Bank’s distributed core customer information project passed the Level‑3 DevOps Application Design assessment, showcasing a cloud‑native micro‑service architecture, comprehensive DevOps practices, and measurable improvements in scalability, observability, and security that set a new industry benchmark.

Banking · Cloud Native · DevOps
0 likes · 13 min read
MaGe Linux Operations
Dec 28, 2022 · Cloud Native

Master Essential kubectl Commands: A Practical Guide for Kubernetes Ops

This comprehensive guide covers kubectl autocomplete, context configuration, object creation, resource viewing, updating, patching, editing, scaling, deletion, pod and node interaction, as well as the versatile kubectl set commands, formatted output options, and visual references for effective Kubernetes cluster management.

Kubernetes · Operations · cloud-native
0 likes · 15 min read
Tencent Cloud Developer
Dec 28, 2022 · Operations

Technical Architecture, Observability, and Operational Practices of Tencent Health Code System

The article details how Tencent’s health‑code platform leveraged a cloud‑native, serverless architecture, extensive observability (Prometheus, Grafana, RUM), rigorous capacity testing, chaos engineering, and ITIL‑based change management to sustain billions of page views, support massive concurrency, and ensure reliable, scalable epidemic‑control services.

Health Code · Operations · Tencent Cloud
0 likes · 16 min read
Efficient Ops
Dec 28, 2022 · Operations

Mastering Ansible: 16 Visual Guides to Automate Your Operations

Ansible, a rapidly popular open‑source automation tool built on Python, enables batch system configuration, program deployment, and command execution through thousands of built‑in modules, offering a simple yet powerful solution for operations engineers, illustrated here with 16 comprehensive images.

Ansible · Configuration Management · Operations
0 likes · 3 min read
MaGe Linux Operations
Dec 27, 2022 · Operations

Master Essential Linux Commands for Efficient System Operations

This article shares practical Linux command techniques—including xargs, background execution, process monitoring, multitail, continuous ping logging, TCP state inspection, and SSH port forwarding—to help system administrators streamline tasks, improve script efficiency, and troubleshoot performance issues.

Linux · Networking · Operations
0 likes · 10 min read
Efficient Ops
Dec 27, 2022 · Operations

How China’s Bank Achieved Industry‑Leading DevOps Maturity: A Deep Dive

An in‑depth interview with Liu Lei, General Manager of Bank of Communications' Software Development Center, reveals how three flagship projects passed the Level‑3 Continuous Delivery assessment, illustrating the bank's DevOps transformation, metric improvements, and future roadmap within China's digital banking landscape.

Banking · Continuous Delivery · DevOps
0 likes · 17 min read
Efficient Ops
Dec 26, 2022 · Operations

How China’s Bank of Communications Achieved Industry‑Leading DevOps Maturity

An in‑depth interview with Liu Lei, GM of Bank of Communications' Software Development Center, reveals how the bank’s three flagship projects passed the DevOps Continuous Delivery Level‑3 assessment, boosting automation, efficiency, and digital transformation across its financial services.

Banking · Continuous Delivery · DevOps
0 likes · 15 min read
Efficient Ops
Dec 26, 2022 · Operations

What Do China’s Latest DevOps Maturity Assessments Reveal About Enterprise Success?

The China Academy of Information and Communications Technology released the latest results of its DevOps Capability Maturity Model assessments, showing how standardization, tool empowerment and continuous delivery pipelines boost quality, efficiency, security and competitiveness across banks, telecom, finance and internet enterprises.

CAICT · DevOps · Enterprise Standards
0 likes · 6 min read
Efficient Ops
Dec 26, 2022 · Operations

What Is AIOps? Exploring China’s New AI‑Driven Operations Maturity Model

The article introduces the AIOps (Artificial Intelligence for IT Operations) capability maturity model developed by the China Academy of Information and Communications Technology (CAICT), explains its two parts (general capabilities and system/tool technical requirements), lists the evaluated modules, and announces the upcoming certification ceremony and contact details for participation.

Artificial Intelligence · IT Operations · Maturity Model
0 likes · 5 min read
Efficient Ops
Dec 26, 2022 · Operations

China Agricultural Bank’s DevOps & AIOps Success: Key Lessons for Enterprises

China Agricultural Bank’s recent DevOps and AIOps assessments, covering 17 projects across continuous delivery, security, application design, and intelligent operations, showcase how standardized processes, tool empowerment, and rigorous evaluation boosted efficiency, safety, and digital transformation, offering actionable insights for large enterprises seeking similar maturity.

DevOps · Digital Transformation · Enterprise Standards
0 likes · 16 min read
Programmer DD
Dec 26, 2022 · Operations

Inside Alibaba Cloud Hong Kong Region C Outage: Timeline, Impact, and Lessons Learned

On December 18, 2022, Alibaba Cloud's Hong Kong Region Zone C suffered a massive service interruption—the longest in its operational history—prompting a detailed incident response, extensive service impact across compute, storage, and networking, and a thorough analysis that led to concrete infrastructure and communication improvements.

Alibaba Cloud · Incident Report · Infrastructure
0 likes · 13 min read
Architecture Digest
Dec 23, 2022 · Backend Development

Case Study: Microservice Migration Challenges and Lessons Learned

This case study examines a data‑service company's transition to a microservice architecture, detailing the initial benefits such as improved visibility and reduced deployment cost, the subsequent explosion of complexity, head‑of‑line blocking in queues, shared‑library versioning issues, and the trade‑offs that led the team to partially revert to a monolithic design.

Deployment · Microservices · Operations
0 likes · 11 min read
Architecture Digest
Dec 21, 2022 · Operations

Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design from development standards, capacity planning, application services, storage, product strategies, operations deployment, to incident response, presenting key concepts, architectural patterns, and practical guidelines for building resilient services.

Deployment · Operations · System Design
0 likes · 27 min read
Baidu Geek Talk
Dec 20, 2022 · Industry Insights

How AI‑Powered Fault Localization Transforms Automated Testing at Scale

This article explores Baidu's intelligent testing practices, covering spectrum‑based root‑cause localization, error‑code driven build‑system diagnostics, revenue‑change stop‑loss decision workflows, and search UI case‑level tracing, illustrating how data, algorithms, and engineering combine to reduce manual effort and accelerate issue resolution.

Automated Testing · Fault Localization · Operations
0 likes · 10 min read
Zhuanzhuan Tech
Dec 20, 2022 · Operations

Alertmanager Alert System Refactoring: Issues, Solutions, and Implementation Details

This article analyzes common problems in a Prometheus‑Alertmanager monitoring setup—such as alert noise, lack of escalation, suppression and silence management—and presents a comprehensive refactor that introduces per‑cluster Alertmanager instances, custom escalation logic, suppression tables, and Python scripts to handle alert routing, silencing, and recovery.

Alert Suppression · Alertmanager · Operations
0 likes · 18 min read
Cloud Native Technology Community
Dec 20, 2022 · Operations

Platform Engineering: The Evolution from DevOps to Internal Developer Platforms

The article explains how platform engineering, emerging from DevOps fatigue, unifies development and operations by providing internal developer platforms that reduce cognitive load, improve self‑service, and enable teams to focus on core product work, especially as organizations grow beyond twenty developers.

DevOps · Internal Developer Platform · Operations
0 likes · 11 min read
Efficient Ops
Dec 19, 2022 · Operations

How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

CDN · Operations · SRE
0 likes · 21 min read
Alibaba Cloud Native
Dec 15, 2022 · Operations

How to Prevent ZooKeeper Disk Exhaustion: Snapshots, Logs, and Tuning Tips

This article explains why ZooKeeper can run out of disk space due to excessive snapshots and transaction logs, describes the underlying file‑generation mechanism, and provides concrete configuration parameters and best‑practice recommendations to control file growth and keep the cluster stable.

Configuration · Operations · Transaction Log
0 likes · 9 min read
Efficient Ops
Dec 12, 2022 · Operations

How Bilibili Built a 5‑Year SRE Journey: High‑Availability, Multi‑Active, and Capacity Management

This article chronicles Bilibili's five‑year evolution of Site Reliability Engineering, detailing the introduction of SRE culture, the construction of high‑availability and multi‑active architectures, capacity management with Kubernetes, VPA/HPA, incident case studies, and the ongoing transformation of SRE practices across the organization.

Kubernetes · Operations · SRE
0 likes · 24 min read
Alibaba Cloud Big Data AI Platform
Dec 9, 2022 · Operations

How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming

This article details Alibaba Cloud's Flink Cluster Inspector, explaining the business challenges of hotspot machines, the analysis of resource over‑use, and the four‑stage solution—pre‑profiling, in‑process self‑healing, post‑recovery, and observability—that reduces latency, cuts costs, and improves operational efficiency.

Cluster · Flink · HotSpot
0 likes · 19 min read
37 Interactive Technology Team
Dec 8, 2022 · Operations

Log Alarm Optimization and Grafana Chart Integration Guide

This guide details how to configure Alibaba Cloud Log Service alarms—setting one‑day tokens, handling 1024‑byte truncation, removing record limits with analysis statements, adding a 10‑second query offset for timeliness—and shows how to visualize the data in Grafana using SQL queries for multi‑line and pie charts with timestamp conversion and time‑series filling.

Cloud Logging · Grafana · Log Monitoring
0 likes · 6 min read
vivo Internet Technology
Dec 7, 2022 · Databases

vivo's Database Operations Platform: Challenges and Solutions in the Cloud-Native Era

Vivo’s Database‑as‑a‑Service platform tackles cloud‑native challenges by automating massive instance management with self‑service work orders and self‑healing, enabling elastic scaling through mixed‑deployment and multi‑threaded Redis tools, optimizing costs via automatic package shrinkage, and safeguarding personal data with full‑chain encryption, while outlining a roadmap toward AI‑driven fault handling, container‑based resources, and advanced privacy governance.

DaaS · Operations · cloud-native
0 likes · 14 min read
Full-Stack DevOps & Kubernetes
Dec 7, 2022 · Cloud Native

How to Scale Kubernetes to 5,000 Nodes: Master, API Server, and Component Tuning

This guide explains how to push a Kubernetes cluster toward its theoretical limit of 5,000 nodes by detailing official limits, master node sizing for GCE and AWS, kube‑apiserver high‑availability and connection‑count tuning, scheduler and controller‑manager leader election settings, kubelet optimizations, and DNS anti‑affinity configuration.

Cloud Native · Kubernetes · Operations
0 likes · 6 min read
Java High-Performance Architecture
Dec 6, 2022 · Cloud Native

How to Build Resilient Microservices: Patterns for Fault Tolerance and High Availability

Learn essential techniques for designing fault‑tolerant microservices, including graceful degradation, change management, health checks, self‑healing, failover caching, retry strategies, rate limiting, circuit breakers, and testing failures, to ensure high availability and reliability in distributed cloud‑native systems.

Operations · Reliability · cloud-native
0 likes · 15 min read
Bilibili Tech
Dec 2, 2022 · Big Data

Data Quality Management: Expectations, Measurement, Assurance, and Operation

The article outlines a complete data‑quality‑management framework that first captures business expectations, then translates them into basic and personalized measurement rules, defines four assurance approaches for handling violations, and scales operation with indicators, tooling, and metrics to continuously improve data quality across the lifecycle.

Data Governance · Data Quality · Metrics
0 likes · 19 min read
Efficient Ops
Dec 1, 2022 · Operations

Why Choose Loki Over ELK? A Hands‑On Guide to Deploying and Using Grafana Loki

This article explains the motivations for selecting Grafana Loki instead of ELK/EFK, introduces its core concepts and features, provides step‑by‑step deployment instructions for Promtail and Loki, and demonstrates how to configure Grafana, query logs, and handle label indexing, dynamic tags, and high‑cardinality challenges.

Grafana · Kubernetes · Loki
0 likes · 15 min read
DevOps
Dec 1, 2022 · Cloud Native

Why Dapr Is a 10× Better Cloud‑Native Runtime: Benefits for Developers, Operators, and Architects

The article explains the 10×‑better theory, introduces Dapr as a cloud‑native sidecar framework, and details how it improves productivity for developers, enhances security, resilience and observability for operators, and offers multi‑language, multi‑environment flexibility for architects, while also acknowledging its drawbacks.

10x · Dapr · Microservices
0 likes · 22 min read
Data Thinking Notes
Nov 28, 2022 · Big Data

Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Big Data · Data Quality · Operations
0 likes · 35 min read
DataFunTalk
Nov 27, 2022 · Operations

Best Practices for Full‑Stack Operations Monitoring and Cost Reduction Using Alibaba Cloud Elasticsearch

This article presents a comprehensive, three‑part guide on the current state of full‑stack operations monitoring, common challenges and solutions, and a real‑world use case, illustrating how Alibaba Cloud Elasticsearch can improve observability, boost performance, and cut costs for complex distributed systems.

Cost Optimization · Elasticsearch · Operations
0 likes · 13 min read
DataFunTalk
Nov 25, 2022 · Operations

Overview of Volcano Engine A/B Experiment System Platform

This article presents a comprehensive overview of Volcano Engine's A/B testing platform, detailing its four core stages—reliable experiment system, efficient data construction, scientific statistical analysis, and fine-grained governance—while explaining execution components, data pipelines, statistical methods, and operational best practices for large‑scale experimentation.

A/B testing · Big Data · Experiment Platform
0 likes · 16 min read
HelloTech
Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

MTBF · MTTR · Operations
0 likes · 15 min read
High Availability Architecture
Nov 18, 2022 · Backend Development

Microservice Architecture: Benefits, Pitfalls, and Lessons Learned from a Data‑Service Company

An in‑depth case study of a data‑service company's transition to microservices details the initial benefits such as visibility and reduced deployment cost, the subsequent problems of head‑of‑line blocking in queues, shared‑library versioning, and scaling complexity, and the eventual trade‑offs that led to a partial rollback to a monolith.

Backend · Microservices · Operations
0 likes · 10 min read
Efficient Ops
Nov 16, 2022 · Operations

Building a 99.95% Uptime Cloud‑Native Platform: Guoxin Securities’ Ops Journey

Guoxin Securities’ QianKun centralized operation platform showcases a cloud‑native, micro‑service architecture that achieved 99.95% availability through containerization, multi‑region deployment, AI‑driven capacity forecasting, and comprehensive DevOps practices, offering a 24/7 seamless account‑opening experience and setting industry benchmarks.

Cloud Native · DevOps · Operations
0 likes · 14 min read
Architects Research Society
Nov 16, 2022 · Operations

Understanding Business Process Maturity Models and Their Practical Use

This article explains what maturity models are, why they matter for evaluating and improving organizational processes, reviews common business process maturity models (BPMM) and their limitations, introduces the Capability Maturity Model (CMM) and the Agile ISO Maturity Model (AIMM), and offers guidance on selecting and applying a suitable model.

AIMM · CMMI · Operations
0 likes · 18 min read
Xiaohe Frontend Team
Nov 14, 2022 · Operations

How to Classify and Prioritize Online Incidents for Better System Stability

Effective incident management begins with clear classification; this guide explains how technical leaders can categorize online failures by nature, severity, and source—distinguishing usability versus financial loss incidents, ranking P0‑P3 levels, and identifying external, operational, product, and system‑quality fault types—to improve stability and learning.

Operations · fault classification · system stability
0 likes · 4 min read
DevOps Coach
Nov 14, 2022 · Operations

Inside Google’s Retired File Server Backend: Exploring the Main Directory

This case study examines how Google decommissioned its legacy file‑server backend, focusing on the design, management, and migration of the main directory, and highlights the operational lessons and SRE practices that ensured a smooth transition without service disruption.

File Server · Google SRE · Operations
0 likes · 2 min read
Zuoyebang Tech Team
Nov 14, 2022 · Cloud Native

How We Built a Multi‑Cloud, Multi‑Active Architecture at Zuoyebang

This article details Zuoyebang's journey from a single‑cloud setup to a multi‑cloud, multi‑active architecture, covering business drivers, design principles, network planning, compute and storage strategies, traffic scheduling, container migration, operational management, and the measurable cost, stability, and efficiency benefits achieved.

DevOps · Operations · architecture
0 likes · 19 min read
DevOps Cloud Academy
Nov 13, 2022 · Operations

An Introduction to Apache Airflow: Features and Benefits of Digital Workflow Management

This article explains why modern organizations replace manual cron jobs with automated digital workflow management using Apache Airflow, detailing its troubleshooting, flexibility, monitoring, rich web UI, CLI/API, complex dependency handling, scalability, containerization, and extensibility through plugins and integrations.

Apache Airflow · Operations · open-source
0 likes · 9 min read
Efficient Ops
Nov 10, 2022 · Operations

How Liaoning Mobile Won the DevOps Team Award: Inside Their Agile Transformation

The article details Liaoning Mobile's award-winning DevOps transformation, describing the team's background, agile implementation, toolchain construction, challenges faced, system highlights, and measurable results that earned them the prestigious Communication Industry DevOps Team Award at the 2022 GOITI ceremony.

Continuous Delivery · DevOps · Operations
0 likes · 8 min read
Alibaba Cloud Native
Nov 9, 2022 · Cloud Native

13 Common Kubernetes Pod Failures and How to Diagnose Them

This article outlines the Kubernetes pod lifecycle, describes the five pod phases, enumerates 13 typical failure scenarios—including scheduling, image pull, dependency, init container, probe, and OOM issues—provides error states, root causes, and step‑by‑step kubectl commands for diagnosis and remediation.

Cloud Native, Kubernetes, Operations
0 likes · 22 min read
Efficient Ops
Nov 8, 2022 · Operations

Diagnosing High Load with Low CPU on Linux: Tools and Tips

This guide explains how to analyze and troubleshoot situations where Linux systems show high load averages despite low CPU usage, covering common load analysis methods, key commands like top, vmstat, iostat, sar, and ps, and practical solutions for I/O bottlenecks and D‑state processes.

CPU, Linux, Load
0 likes · 11 min read
21CTO
Nov 8, 2022 · Operations

Building a Billion‑User Membership System: ES, Redis & MySQL High‑Availability

This article details how a large‑scale membership platform achieves high performance and near‑zero downtime by employing dual‑center Elasticsearch clusters, traffic‑isolated ES architectures, deep ES optimizations, Redis caching with distributed locks, and a seamless MySQL migration with partitioned, dual‑center databases.

Operations, System Architecture, high availability
0 likes · 20 min read
dbaplus Community
Nov 7, 2022 · Operations

Automating Fault Self‑Healing: A Practical Guide for Operations Teams

This article explains why disk‑space alerts demand automated handling, introduces the concept of fault self‑healing, outlines required process standards, describes monitoring platform dimensions, details a multi‑source self‑healing platform architecture, and offers practical steps for integration, notification, and continuous improvement.

CMDB, Operations, fault self-healing
0 likes · 9 min read
Efficient Ops
Nov 6, 2022 · Operations

Visualizing Business‑Process Monitoring with Grafana, Diagram & FlowCharting

This article examines the evolution of a monitoring platform, identifies key challenges such as alarm overload and fragmented data, and presents a solution that combines Grafana with Diagram and FlowCharting plugins to create business‑process‑oriented, data‑driven visualizations for faster issue resolution.

Diagram, FlowCharting, Grafana
0 likes · 10 min read
MaGe Linux Operations
Nov 6, 2022 · Cloud Native

How to Safely Shut Down and Restart a Kubernetes Cluster

This guide walks you through the essential steps, commands, and precautions for safely draining nodes, backing up applications, CRDs, and etcd, then shutting down and later restarting a Kubernetes cluster while avoiding common pitfalls.

Backup, Cluster Maintenance, Kubernetes
0 likes · 6 min read
Architects Research Society
Nov 4, 2022 · Fundamentals

eBay Scalability Best Practices: Functional Partitioning, Horizontal Sharding, Async Decoupling, and More

This article outlines eBay's key scalability best practices—including functional decomposition, horizontal sharding, avoiding distributed transactions, asynchronous decoupling, virtualization, and intelligent caching—to demonstrate how large‑scale web systems can achieve linear resource growth and high availability.

Asynchronous, Operations, Scalability
0 likes · 14 min read
Architect's Guide
Nov 1, 2022 · Operations

Implementing Load Balancing with Nginx and SpringBoot

This article explains how to achieve load balancing using Nginx, covering the concepts of hardware and software load balancers, various Nginx balancing algorithms with configuration examples, and a step‑by‑step guide to integrate Nginx with a SpringBoot application, test it, and handle common pitfalls.

Configuration, Nginx, Operations
0 likes · 8 min read
Efficient Ops
Oct 31, 2022 · Operations

Key Takeaways from the 2022 GOPS Global Operations Conference Shanghai – DevOps, AIOps & Cloud Insights

The two‑day 2022 GOPS Global Operations Conference in Shanghai featured 16 tracks, over 80 speakers, new DevOps standards, extensive assessment results, and a wealth of sessions on DevOps, AIOps, cloud‑native practices, security, and industry case studies, offering a comprehensive snapshot of modern operations engineering.

DevOps, Operations, aiops
0 likes · 14 min read
DevOps Cloud Academy
Oct 31, 2022 · Operations

Rolling Deployment Strategy: Advantages, Disadvantages, and Considerations

The rolling deployment strategy incrementally replaces old application instances with new ones, allowing users to encounter both versions during rollout, and is praised for ease of implementation, low risk, and default support in platforms like Kubernetes, though it can be slow, costly for large infrastructures, and may affect user experience.

Deployment Strategy, Kubernetes, Operations
0 likes · 2 min read
Efficient Ops
Oct 31, 2022 · Operations

How China Minsheng Bank Achieved Advanced DevOps Maturity – A Deep Dive

China Minsheng Bank’s centralized operation business processing system passed the Level 2 technical operation assessment of the national DevOps maturity model, showcasing how standardized DevOps practices, continuous delivery pipelines, and cross‑team collaboration can boost efficiency, safety, and competitiveness in the banking sector.

Banking, China, Continuous Delivery
0 likes · 10 min read
Ops Development Stories
Oct 31, 2022 · Information Security

Essential Security Checklist for Ops: From Port Hardening to Data Protection

This article shares practical security best practices for operations teams, covering why security is often overlooked, real incident examples, and detailed guidelines on port hardening, system hardening (login management, vulnerability scanning, baseline checks), application, network, and data protection, emphasizing continuous investment and simple safeguards.

Information Security, Operations, System Hardening
0 likes · 8 min read
Open Source Linux
Oct 30, 2022 · Operations

Unlock Kubernetes Insights: Master Event Types, Monitoring, and Alerting

This guide explains what Kubernetes events are, how to list and filter them, categorizes common event types, and shows practical ways to collect, store, and alert on events using native commands and open‑source tools, helping teams reduce alert fatigue and improve cluster observability.

Alerting, Events, Kubernetes
0 likes · 11 min read
Efficient Ops
Oct 28, 2022 · Operations

How Liaoning Mobile Achieved Leading‑Edge DevOps with a Level‑3 Continuous Delivery Assessment

Liaoning Mobile’s Channel Management System project passed the CAICT DevOps Capability Maturity Model Level‑3 continuous delivery assessment, showcasing how standardized DevOps practices, toolchains, and agile transformation boosted delivery speed, team capability, and operational efficiency, positioning the carrier at the forefront of China’s digital transformation.

Continuous Delivery, DevOps, IT transformation
0 likes · 15 min read
Ziru Technology
Oct 28, 2022 · Operations

Why Feature Environments Fail and How to Build a Reliable One

This article analyzes why feature environments fail, focusing on the difficulty of initializing stable environments and on their poor usability and low reliability. It proposes concrete solutions, such as unified test environments, streamlined creation workflows, middleware adjustments, testing, and documentation, and shares practical reflections from real deployments.

Deployment, Operations, feature environment
0 likes · 12 min read
dbaplus Community
Oct 25, 2022 · Operations

How a Government System’s Week‑Long Outage Exposed Critical Backend and Load‑Balancing Flaws

A government information system suffered a week of instability, including service deadlocks, Tomcat memory overflows, and load‑balancing failures, prompting a deep forensic analysis that uncovered database lock‑ups, faulty front‑end loops, inadequate monitoring, and misconfigured logging, leading to concrete remediation steps and lessons for future reliability.

Operations, Tomcat, incident analysis
0 likes · 21 min read
Alibaba Cloud Developer
Oct 20, 2022 · Cloud Native

Why Kubernetes Remains Complex and How Serverless Designs Aim to Simplify It

The article examines the inherent and accidental complexities of Kubernetes as a distributed cluster manager, discusses challenges in resource scheduling, infrastructure diversity, and operational overhead, and explores how cloud‑native solutions such as managed services, nodeless and serverless Kubernetes architectures attempt to reduce these complexities while introducing new trade‑offs.

Cloud Native, Kubernetes, Operations
0 likes · 18 min read
Cloud Native Technology Community
Oct 19, 2022 · Industry Insights

What Sets Platform Engineering Apart from DevOps and SRE?

The article clarifies the distinctions between platform engineering, DevOps, and SRE, explaining their origins, common misconceptions, challenges such as shadow operations and developer cognitive load, and how platform engineering builds on these practices to deliver self‑service internal developer platforms that improve productivity and reliability.

DevOps, Internal Developer Platform, Operations
0 likes · 10 min read
Top Architect
Oct 18, 2022 · Backend Development

Nginx Configuration Guide: HTTP Server, Static Files, Reverse Proxy, Load Balancing and Advanced Directives

This comprehensive guide explains how to configure Nginx as an HTTP server, static file server, reverse proxy, and load balancer, covering directory setup, location matching rules, priority order, upstream strategies, and useful directives such as return, rewrite, error_page, logging and access control.

Backend, Configuration, Nginx
0 likes · 17 min read
DevOps
Oct 17, 2022 · Operations

Platform Engineering: Bridging Developers and Infrastructure Beyond DevOps

The article examines platform engineering as a discipline that unifies developers' desire to avoid infrastructure work with enterprises' need for control, critiques the hype around DevOps, and argues that effective internal developer platforms require solid fundamentals, IaC practices, and cultural change.

DevOps, Internal Developer Platform, Operations
0 likes · 7 min read
Cloud Native Technology Community
Oct 17, 2022 · Cloud Native

A Three‑Step Approach to Understanding, Managing, and Preventing Kubernetes Failures

This article presents a practical three‑step methodology—understanding, managing, and preventing—to troubleshoot Kubernetes deployments, explains how to leverage monitoring, observability, and incident‑response tools, and offers guidance on fostering team collaboration and building resilient, self‑healing cloud‑native systems.

Cloud Native, Kubernetes, Operations
0 likes · 7 min read
Efficient Ops
Oct 16, 2022 · Operations

How Chinese Banks Accelerate IT Efficiency with DevOps Maturity Models

This article reports how 21 Chinese banking institutions evaluated 82 projects using the CAICT-led DevOps Capability Maturity Model, detailing the breakdown across state‑owned, joint‑stock, and city commercial banks, and explains the model’s standards and industry impact.

Banking, DevOps, IT efficiency
0 likes · 6 min read
Top Architect
Oct 15, 2022 · Backend Development

Designing Fault‑Tolerant Microservices: Patterns and Practices

The article explains how microservice architectures can achieve high availability by isolating failures, employing graceful degradation, change‑management strategies, health checks, fallback caching, retry logic, rate limiting, circuit breakers, and chaos testing, while acknowledging the added complexity and cost of such reliability engineering.

Backend, Operations, Reliability
0 likes · 13 min read
Big Data Technology Architecture
Oct 15, 2022 · Operations

The Rise of Platform Engineering: From DevOps Frustrations to Internal Developer Platforms

This article explains how platform engineering emerges from DevOps frustrations, defining internal developer platforms, outlining their principles, benefits, and implementation guidelines, and showing why organizations should adopt them to reduce cognitive load and improve developer productivity.

Internal Developer Platform, Operations, platform engineering
0 likes · 11 min read
Architecture and Beyond
Oct 15, 2022 · Operations

Technical Cost Optimization and Fine‑Grained Operations: Strategies, Processes, and Best Practices

This article provides a comprehensive guide for technical leaders on reducing and managing technology costs through a two‑stage approach of cost optimization and fine‑grained operations, covering team formation, current‑state analysis, discount and storage tactics, project planning, communication, and long‑term process and system support.

Cost Optimization, Operations, Resource Management
0 likes · 27 min read
Efficient Ops
Oct 13, 2022 · Operations

How China’s Telecom Leaders Boost IT Efficiency Using the DevOps Maturity Model

Across China’s telecom sector, leading operators such as China Mobile, China Unicom, and China Telecom have leveraged the CAICT‑led DevOps Capability Maturity Model to assess dozens of projects, achieving faster delivery cycles, higher automation, standardized interfaces, and improved IT efficiency through continuous delivery, technical operation, and system‑tool integration.

Continuous Delivery, DevOps, Maturity Model
0 likes · 14 min read
Efficient Ops
Oct 13, 2022 · Operations

How Leading Chinese Insurers Achieved DevOps Maturity: Real-World Case Studies

This article reviews how three major Chinese insurance companies applied the CAICT DevOps Capability Maturity Model to improve IT efficiency, integrate resources, and support business systems, highlighting project details, architectural innovations, and measurable outcomes across continuous delivery, technology operations, and risk management.

Continuous Delivery, DevOps, Insurance
0 likes · 8 min read